Environment
- YugabyteDB Anywhere - 2.16 and above.
Issue
After upgrading YugabyteDB Anywhere to 2.16.1, backups take much longer to complete.
The application log shows that most of the backup time is spent in Phase 3. Timings for the same universe:
Before Upgrade:
YW 2023-02-21T05:38:11.854Z [DEBUG] 085f7d70-02e5-4c10-8d38-6ecbd455179f from ShellProcessHandler in TaskPool-MultiTableBackup(d704042e-3bf2-4dfe-91d0-12fe350fce8d)-0 - 0:00:16.324224 : PHASE 3 : Upload snapshot directories
After Upgrade:
YW 2023-02-22T14:52:26.173Z [DEBUG] 10d36e82-2bda-45f8-905b-889efb2be5ad from ShellProcessHandler in TaskPool-MultiTableBackup(d704042e-3bf2-4dfe-91d0-12fe350fce8d)-5 - 9:49:02.058427 : PHASE 3 : Upload snapshot directories
The script runs normally until the snapshot upload starts, but after that it prints no log output for about 9 hours.
Another symptom of the same problem is that the logs contain WARNING messages like the following during the 9-hour window:
WARNING: Found a snapshot directory '/mnt/disk0/yb-data/tserver/data/rocksdb/table-<id>/tablet-<id>.snapshots/<id>' on tablet server '<tablet_server>' that is not present in the list of tablets we are interested in that have this tserver hosting it (..., ... ), skipping
Depending on how long each message takes to print and on the number of tablet leaders in the universe, the backup can run for hours.
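To compare backup runs, the Phase 3 timing can be pulled out of the application log programmatically. The following is a minimal sketch, assuming the log lines keep the `<duration> : PHASE <n> : <name>` layout shown in the excerpts above; the function name `parse_phase_line` and the regular expression are illustrative and not part of YugabyteDB Anywhere.

```python
import re
from datetime import timedelta

# Duration field followed by the phase label, as in the log lines quoted above.
PHASE_RE = re.compile(
    r"(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})(?:\.(?P<frac>\d{1,6}))?"
    r" : PHASE (?P<phase>\d+) : (?P<name>.+)$"
)


def parse_phase_line(line):
    """Return (phase number, phase name, duration), or None if no phase entry."""
    match = PHASE_RE.search(line)
    if match is None:
        return None
    # Pad the fractional part to microseconds ("5" means 500000 microseconds).
    micros = int((match.group("frac") or "0").ljust(6, "0"))
    duration = timedelta(
        hours=int(match.group("h")),
        minutes=int(match.group("m")),
        seconds=int(match.group("s")),
        microseconds=micros,
    )
    return int(match.group("phase")), match.group("name").strip(), duration


# The two timings from the log excerpts in this article (prefix shortened).
before = parse_phase_line("... - 0:00:16.324224 : PHASE 3 : Upload snapshot directories")
after = parse_phase_line("... - 9:49:02.058427 : PHASE 3 : Upload snapshot directories")
slowdown = after[2] / before[2]  # timedelta / timedelta gives a float ratio
print("Phase {} slowed down by a factor of about {:.0f}".format(after[0], slowdown))
```

Running the function over every line of the application log and keeping the non-`None` results gives a per-phase timing history that makes a regression like this one easy to spot.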
Resolution
Overview
The problem is caused by the yb_backup.py script spending too much time looking for the tablet leaders as part of the backup and printing a WARNING message each time it finds a follower (which is not needed for the backup). The code responsible is the block shown in step 3 below.
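To get a feel for why these messages can add up, note that each WARNING embeds the sorted, comma-joined list of tablets for the tablet server, so the message grows with the number of tablets and is rebuilt for every skipped directory. The sketch below only estimates the volume of text produced; the tablet IDs, the placeholder arguments, and the helper names are hypothetical, and the actual time spent depends on the environment.

```python
def build_warning(snapshot_dir, tserver_ip, tablet_ids):
    """Build a message with the same shape as the WARNING quoted in this article."""
    return (
        "Found a snapshot directory '{}' on tablet server '{}' that is not "
        "present in the list of tablets we are interested in that have this "
        "tserver hosting it ({}), skipping."
    ).format(snapshot_dir, tserver_ip, ", ".join(sorted(tablet_ids)))


def total_warning_chars(num_tablets, num_skipped_dirs):
    """Estimate characters logged when every skipped directory triggers one warning."""
    # Hypothetical 32-hex-character tablet IDs; real IDs come from the universe.
    tablet_ids = {"{:032x}".format(i) for i in range(num_tablets)}
    message = build_warning("<snapshot_dir>", "<tablet_server>", tablet_ids)
    return len(message) * num_skipped_dirs


# Assumed ratio of two skipped (follower) directories per tablet, for illustration only.
for tablets in (10, 100, 1000):
    print(tablets, total_warning_chars(tablets, num_skipped_dirs=2 * tablets))
```

Because both the message length and the number of messages grow with the tablet count, the total output grows roughly quadratically in this model, which is consistent with large universes being affected the most.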
Steps
The workaround for this issue is to comment out the following lines in the yb_backup.py script and retry the backup:
1. SSH or log in to the yugaware Docker container.
(Follow the appropriate steps to log in to the Yugaware host for the deployment type in use.)
2. Edit the backup script:
vi /opt/yugabyte/devops/bin/yb_backup.py
3. Comment out the group of lines starting at line 2428 so that the WARNING messages are no longer printed. Do not comment out the continue line at the end:
if tablet_id not in tablets_by_tserver_ip[tserver_ip]:
    # logging.warning(
    #     ("Found a snapshot directory '{}' on tablet server '{}' that is not "
    #      "present in the list of tablets we are interested in that have this "
    #      "tserver hosting it ({}), skipping.").format(
    #         snapshot_dir, tserver_ip,
    #         ", ".join(sorted(tablets_by_tserver_ip[tserver_ip]))))
    continue
4. Re-run the backup.
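Before retrying, it may be worth confirming that the edited script is still valid Python. The sketch below illustrates why the continue line must stay: if it is commented out as well, the if statement is left without a body and the file no longer parses. The two snippets are simplified stand-ins for the real block, and `parses_as_python` is an illustrative helper, not part of yb_backup.py.

```python
import ast
import textwrap


def parses_as_python(source):
    """Return True if the source text is syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


# Simplified stand-in for the edited block: warning commented out, continue kept.
EDITED_CORRECTLY = textwrap.dedent("""\
    for tablet_id in found:
        if tablet_id not in wanted:
            # logging.warning("...")
            continue
        kept.append(tablet_id)
""")

# Same block with continue commented out too: the if statement has no body left.
EDITED_TOO_FAR = EDITED_CORRECTLY.replace("        continue", "        # continue")

print(parses_as_python(EDITED_CORRECTLY))  # True
print(parses_as_python(EDITED_TOO_FAR))    # False
```

The same check could be applied to the contents of the edited yb_backup.py, assuming it is run with the same Python version that the backup uses. Note that a syntax check cannot detect a change in behavior: without continue, the directories that should be skipped would no longer be skipped.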
Additional Information
We have also filed an internal Jira ticket to investigate the issue further: https://yugabyte.atlassian.net/browse/PLAT-7481
These steps must be repeated if YugabyteDB Anywhere is upgraded to a version that does not contain the fix, because the upgrade overwrites the yb_backup.py file with a new version.
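After an upgrade, a quick way to see whether the workaround is still in place is to check whether the warning text in the script is commented out. The following is a rough sketch under the assumption that the message text in the script still starts with "Found a snapshot directory"; `workaround_applied` is a hypothetical helper, and its result says nothing about whether the installed version already contains a fix.

```python
MARKER = "Found a snapshot directory"


def workaround_applied(source):
    """Return True/False if the warning text is/is not commented out, None if absent."""
    marked = [line for line in source.splitlines() if MARKER in line]
    if not marked:
        return None  # the text does not appear in this version of the script
    return all(line.lstrip().startswith("#") for line in marked)


# Possible usage inside the container (path taken from step 2 of this article):
#   with open("/opt/yugabyte/devops/bin/yb_backup.py") as f:
#       print(workaround_applied(f.read()))
```

A result of False suggests the file was overwritten and the steps need to be applied again; None means the message text was not found, in which case the script should be inspected by hand.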