Issue
xCluster replication lag keeps increasing. This article will help you identify and mitigate issues with xCluster replication after it has been successfully set up.
Identifying the error
yb-admin
You can use the yb-admin get_replication_status command against the target universe to view the status of the xCluster replication group:
$ yb-admin -master_addresses <target_master_addresses> get_replication_status RG1
statuses {
table_id: "000034bf000030008000000000004000"
stream_id: "379ee1f6891eed87a6403d805d41eae9"
errors {
error: REPLICATION_SCHEMA_MISMATCH
error_detail: "Producer Tablet IDs: 116ce6f8c30e473d8e0355b7f6bd1f87,04b27745aa07443084a6860e9223b940,aa16ec4c56634f46867bd5ede977d024,a916fa6aa0c04950adaa6a18500c38ac,f914b47cf4594a839234ab7547d596ec,514e06bbd3dd4a08a18542fe568879c4"
}
}
yb-master UI
You can also see the xCluster replication status on the target universe's yb-master /xcluster UI page.
Error types and remedies
The following are the categories of errors you may encounter:
- MISSING_OP_ID: The write ahead log (WAL) segments required to catch up the target universe have been garbage collected from the source universe. The only way to recover from this error is to remove the table from replication and add it back again with bootstrap. For YSQL tables, the entire database has to be bootstrapped.
If you expect the target universe to be down for an extended period, consider setting a higher default for --cdc_wal_retention_time_secs. This results in more disk space being used on the source universe for WALs. In versions before 2024.2, a re-bootstrap is needed for the new value to take effect, so you can instead increase --log_min_seconds_to_retain, but note that this affects even tables that are not part of xCluster.
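The remove/bootstrap/re-add flow can be sketched with yb-admin as below. This is a sketch, not a complete runbook: the universe addresses, the replication group name RG1, and the table and bootstrap IDs are placeholders, and for YSQL tables the entire database (not a single table) must be bootstrapped.

```shell
# Hypothetical addresses and IDs: replace with your own values.
SOURCE_MASTERS="src-m1:7100,src-m2:7100,src-m3:7100"
TARGET_MASTERS="tgt-m1:7100,tgt-m2:7100,tgt-m3:7100"
TABLE_ID="000034bf000030008000000000004000"

# 1. Remove the affected table from the replication group.
yb-admin -master_addresses "$TARGET_MASTERS" \
  alter_universe_replication RG1 remove_table "$TABLE_ID"

# 2. Checkpoint (bootstrap) the table on the source universe.
#    This prints a CDC bootstrap ID; note it for step 3.
yb-admin -master_addresses "$SOURCE_MASTERS" \
  bootstrap_cdc_producer "$TABLE_ID"

# 3. Copy the table data to the target (for example via backup/restore),
#    then re-add the table using the bootstrap ID printed in step 2.
BOOTSTRAP_ID="<bootstrap_id_from_step_2>"
yb-admin -master_addresses "$TARGET_MASTERS" \
  alter_universe_replication RG1 add_table "$TABLE_ID" "$BOOTSTRAP_ID"
```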
- SCHEMA_MISMATCH: The schema of the table on the target universe does not match the schema on the source universe. Run the corresponding DDL to update the target table to match the source.
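For example, if a column was added on the source but not on the target, you would run the same DDL against the target universe. The host, database, table, and column below are purely illustrative placeholders.

```shell
# Hypothetical target node and DDL: substitute your own values.
TARGET_NODE="tgt-node-1"
DDL='ALTER TABLE orders ADD COLUMN discount numeric;'

# Apply the same DDL on the target so the schemas match again.
ysqlsh -h "$TARGET_NODE" -d yugabyte -c "$DDL"
```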
- MISSING_TABLE: A colocated table is missing on the target database. Create the missing table on the target universe.
- AUTO_FLAG_CONFIG_VERSION_MISMATCH: The AutoFlags config has changed and the new version is not compatible. The target universe must always be on the same or a higher version than the source universe. Upgrade the target universe.
- SOURCE_UNREACHABLE: The target universe nodes are unable to connect to the source universe nodes. Check for network errors.
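A quick connectivity check from a target node can be sketched as below. The hostnames are placeholders; the ports are the default RPC ports (7100 for yb-master, 9100 for yb-tserver) and may differ in your deployment.

```shell
# Hypothetical source hostnames: replace with your source universe's nodes.
MASTER_HOSTS="src-m1 src-m2 src-m3"
TSERVER_HOSTS="src-ts1 src-ts2 src-ts3"

check_port() {
  # -z: connect without sending data; -w 3: 3-second timeout.
  if nc -z -w 3 "$1" "$2"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 UNREACHABLE"
  fi
}

# Default RPC ports: yb-master 7100, yb-tserver 9100.
for host in $MASTER_HOSTS;  do check_port "$host" 7100; done
for host in $TSERVER_HOSTS; do check_port "$host" 9100; done
```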
- SYSTEM_ERROR: There was a generic system error. Check the yb-tserver UI to get more detailed information about the failure.
- ERROR_UNINITIALIZED: Replication state has not yet been initialized. If this error does not clear after 20 minutes, check the health of the nodes and tablets on the target universe.
- OK: xCluster replication is healthy.
yb-tserver UI
You can see more detailed xCluster replication status on the target universe's yb-tserver /xcluster UI page. This page only shows information about the tablet leaders hosted by that tserver, so you may have to check several tservers to find the one reporting the error.
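Checking each tserver can be sketched as a simple loop over the web UI. The hostnames are placeholders; 9000 is the default yb-tserver webserver port and may differ in your deployment.

```shell
# Hypothetical tserver hostnames: replace with the target universe's nodes.
TSERVERS="tgt-ts1 tgt-ts2 tgt-ts3"
# Default yb-tserver webserver port.
WEB_PORT=9000

for host in $TSERVERS; do
  echo "=== $host ==="
  # -s: silent; --max-time 5: don't hang on unreachable nodes.
  curl -s --max-time 5 "http://$host:$WEB_PORT/xcluster"
done
```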
Healthy State
A healthy replication group will have the following output for the yb-admin command:
$ yb-admin -master_addresses <target_master_addresses> get_replication_status RG1
statuses {
table_id: "000034bf000030008000000000004000"
stream_id: "379ee1f6891eed87a6403d805d41eae9"
}
The yb-master /xcluster UI page will similarly show the replication group without any errors.