Environment
- YugabyteDB <= 2.8.x
- YugabyteDB 2.12.x
- YugabyteDB 2.14.x
- YugabyteDB 2.16.x
Issue
When a restore operation is performed on an affected version of YugabyteDB, metadata information is absent. As a result, the underlying tablets can be placed incorrectly, which can affect data availability in multi-AZ (multi-Availability Zone) or multi-Region deployment topologies if an AZ or Region fails.
Impact
This issue only impacts databases or tables which have been restored.
Impacted clusters must have more than 1 Availability Zone, and the total number of nodes in the cluster must be greater than the Replication Factor.
The issue impacts both YCQL and YSQL tables.
Whatever was restored from a backup, whether a subset of tables in a database or the entire database, may end up with incorrectly placed tablets, which could result in any of the following scenarios:
- Fewer than the assigned number (or zero) tablet replicas placed in an AZ or Region
- More than the assigned number of tablet replicas placed in an AZ or Region
- The load balancer may place tablet leaders in a region which is not preferred, leading to performance or latency issues.
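The conditions and scenarios above can be illustrated with a minimal Python sketch. It is not part of any Yugabyte tool: the function names, the zone names, and the `expected_per_zone` mapping are hypothetical, and the per-zone replica counts would have to come from your own placement policy.

```python
from collections import Counter


def could_be_affected(num_zones, num_nodes, replication_factor):
    """Topology precondition from this article: more than 1 AZ and more nodes than RF."""
    return num_zones > 1 and num_nodes > replication_factor


def classify_placement(replica_zones, expected_per_zone):
    """Compare one tablet's replica zones with the assigned replica count per zone.

    replica_zones: list of zone names, one entry per replica of the tablet.
    expected_per_zone: dict mapping zone name -> assigned number of replicas.
    Returns a dict mapping zone name -> "missing", "extra" or "ok".
    """
    actual = Counter(replica_zones)
    result = {}
    for zone in sorted(set(expected_per_zone) | set(actual)):
        want = expected_per_zone.get(zone, 0)
        have = actual.get(zone, 0)
        if have < want:
            result[zone] = "missing"   # fewer replicas than assigned (possibly zero)
        elif have > want:
            result[zone] = "extra"     # more replicas than assigned
        else:
            result[zone] = "ok"
    return result


if __name__ == "__main__":
    # Hypothetical RF=3 policy with one replica per zone.
    policy = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
    print(could_be_affected(num_zones=3, num_nodes=6, replication_factor=3))
    print(classify_placement(["zone-a", "zone-a", "zone-b"], policy))
```

In the example call, the tablet has two replicas in one zone and none in another, which corresponds to the first two scenarios in the list above.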
Resolution
Upgrade to one of the following versions, each of which contains a fix for this issue:
Fixed Version | Release |
2.8.12.0-b4 | |
2.12.12.0-b3 | v2.12.12.0 - February 13, 2023 |
2.14.6.1-b1 | |
2.16.1.0-b39 | |
2.17.1.0 | |
After the upgrade, the load balancer will automatically correct tablet placement to ensure availability zone resiliency. No action is needed by the user after this point.
The load balancer may be tuned to speed up tablet movement. Please reach out to Yugabyte Support for further questions about tuning the load balancer on clusters with high tablet counts or significant amounts of data.
Full recovery time depends on the number of tablets which must be moved to a proper Availability Zone or Region.
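As a rough aid, the fixed-version table above can be turned into a small check. The following Python sketch is an assumption-laden illustration, not an official tool: the helper names are hypothetical, the version format `A.B.C.D-bN` is inferred from the builds listed in this article, and release series that the table does not list are reported as unknown rather than guessed.

```python
import re

# Fixed builds copied from the table in this article, keyed by release series.
FIXED_BUILDS = {
    (2, 8): "2.8.12.0-b4",
    (2, 12): "2.12.12.0-b3",
    (2, 14): "2.14.6.1-b1",
    (2, 16): "2.16.1.0-b39",
    (2, 17): "2.17.1.0",
}


def parse_version(text):
    """Turn '2.14.6.1-b1' into (2, 14, 6, 1, 1); a missing build number counts as 0."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)\.(\d+)(?:-b(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def contains_fix(version):
    """Return True or False for series covered by this article, None otherwise."""
    parsed = parse_version(version)
    series = parsed[:2]
    fixed = FIXED_BUILDS.get(series)
    if fixed is None:
        # Series older than 2.8 are listed as affected and have no fixed build.
        if series < (2, 8):
            return False
        # Series not mentioned in the table: do not guess.
        return None
    return parsed >= parse_version(fixed)


if __name__ == "__main__":
    for candidate in ["2.14.5.0-b18", "2.14.6.1-b1", "2.6.20.0-b10", "2.18.0.0-b65"]:
        print(candidate, contains_fix(candidate))
```

A result of `None` means the article does not say whether that series contains the fix, so the release notes or Yugabyte Support should be consulted.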
Additional information
Yugabyte uses load balancing operations to ensure that data placement is spread across nodes, according to replication factor (RF) and failure domain. This is the distributed nature of the database, and is critical to providing resiliency during an outage. Affected versions of Yugabyte may experience unexpected tablet placement during load balancing events for tables that were restored from backup. When in this state, any availability zone failure increases the risk of data unavailability or data loss. Upgrading to the latest release version will fix this issue for both affected tables and any future restores.
Identifying sub-optimal tablet placement
Prerequisites
- Have a working version of yugatool
- Ensure you have SQLite on your system
Steps
- Use the yugatool utility to run a tablet_report, as documented in the related Knowledge Base article. The command is similar to the following:
./yugatool cluster_info \
-m $MASTERS \
$TLS_CONFIG \
--show-tableid \
--tablet-report \
> /tmp/tablet-report-$(hostname)-$(date -I).out
- Process the tablet report with the tablet-report-parser.pl utility, using the following version: https://github.com/yugabyte/yb-tools/blob/main/tablet-report-parser/tablet-report-parser.pl
Use the following command line:
perl tablet-report-parser.pl <report_name>
- To determine whether any tablets are placed sub-optimally, examine the "Summary Report" in the output of tablet-report-parser.pl, looking for a line like:
3 Zones have unbalanced tablets (See "region_zone_tablets")
If the number of zones that have unbalanced tablets is greater than 0, upgrade as soon as possible.
- For a detailed, per-zone distribution of the summary information, run the following command. The output will look similar to the following:
% sqlite3 -header -column tablet-report.sqlite "SELECT * from region_zone_tablets"
region  zone        tservers  missing_replicas  1_replicas  2_replicas  3_replicas  balanced
------  ----------  --------  ----------------  ----------  ----------  ----------  --------
RG1     rg1-f-zone  4         2238 (126.5 GB)   6289/10696  2100        69          NO
RG3     rg3-e-zone  4         3342 (196.1 GB)   4289/10696  2788        277         NO
RG3     rg3-f-zone  4         3498 (193.0 GB)   4358/10696  2182        658         NO
- Where:
Column | Description |
missing_replicas | Number of tablet replicas that are ABSENT in this zone, and the number of bytes they would require. |
1_replicas | The number of tablets that have 1 replica in this zone, followed by "/" and the total number of replicas. |
2_replicas, etc. | Number of tablets that have 'n' replicas in this zone. |
balanced | "YES" or "NO". "YES" means that tablet replicas are equally distributed among regions, i.e. there are no missing or extra replicas. |
- If some regions show "balanced=NO", upgrade immediately.
- If you have an unusual replication factor, or deliberately distribute tablets unevenly, and are able to explain the distribution, you can consider this view informational.
- If further interpretation is required, contact Yugabyte Support.
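The check in the steps above can also be scripted. The following Python sketch uses the standard `sqlite3` module and assumes, as the sample output suggests, that `region_zone_tablets` exposes `region`, `zone`, `missing_replicas` and `balanced` columns and that `balanced` holds the text "YES" or "NO". The function name is hypothetical, and the demo builds a simplified in-memory stand-in for the report so that it runs without a real report file.

```python
import sqlite3


def unbalanced_zones(conn):
    """Return (region, zone, missing_replicas) rows whose balanced value is not 'YES'."""
    cursor = conn.execute(
        "SELECT region, zone, missing_replicas FROM region_zone_tablets "
        "WHERE balanced <> 'YES' ORDER BY region, zone"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Simplified stand-in for the parser's SQLite output, filled with the sample
    # rows shown in this article. For a real report, connect to the file produced
    # by tablet-report-parser.pl instead, e.g. sqlite3.connect("tablet-report.sqlite").
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE region_zone_tablets "
        "(region TEXT, zone TEXT, missing_replicas TEXT, balanced TEXT)"
    )
    conn.executemany(
        "INSERT INTO region_zone_tablets VALUES (?, ?, ?, ?)",
        [
            ("RG1", "rg1-f-zone", "2238 (126.5 GB)", "NO"),
            ("RG3", "rg3-e-zone", "3342 (196.1 GB)", "NO"),
            ("RG3", "rg3-f-zone", "3498 (193.0 GB)", "NO"),
        ],
    )
    rows = unbalanced_zones(conn)
    conn.close()
    for region, zone, missing in rows:
        print(f"{region}/{zone}: missing replicas = {missing}")
    print("upgrade recommended" if rows else "no unbalanced zones found")
```

If the query returns any rows, the guidance above applies: upgrade, unless the uneven distribution is deliberate and can be explained.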