Environment
- YugabyteDB Anywhere - 2.16 and above.
- YugabyteDB - 2.16 and above.
Issue
YBController based backups can fail with the following error in the YBAnywhere application logs:
YW 2023-07-31T16:14:21.292Z [INFO] 29d75932-f5d3-4889-86c2-d9d2ac30cba8 from BackupTableYbc in TaskPool-CreateBackup(5021384f-8d72-4711-8de6-2473bcae2fa6)-0 - Backup 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7 - Keyspace yugabyte - ParamsIdentifier 4ea0facc-e357-475e-a290-e33de87e9fce : Successfully submitted backup task to YB-controller server: 10.21.240.14 with taskID: 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7_PGSQL_TABLE_TYPE_yugabyte_4ea0facc-e357-475e-a290-e33de87e9fce
YW 2023-07-31T16:14:21.332Z [INFO] 29d75932-f5d3-4889-86c2-d9d2ac30cba8 from YbcTaskBase in TaskPool-CreateBackup(5021384f-8d72-4711-8de6-2473bcae2fa6)-0 - Task id 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7_PGSQL_TABLE_TYPE_yugabyte_4ea0facc-e357-475e-a290-e33de87e9fce status: Number of retries 0
YW 2023-07-31T16:14:21.332Z [ERROR] 29d75932-f5d3-4889-86c2-d9d2ac30cba8 from BackupTableYbc in TaskPool-CreateBackup(5021384f-8d72-4711-8de6-2473bcae2fa6)-0 - Backup 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7 - Keyspace yugabyte - ParamsIdentifier 4ea0facc-e357-475e-a290-e33de87e9fce : Polling backup task progress on YB-Controller failed with error Task id 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7_PGSQL_TABLE_TYPE_yugabyte_4ea0facc-e357-475e-a290-e33de87e9fce status: Failed with error COMMAND_FAILED
YW 2023-07-31T16:14:21.420Z [ERROR] 29d75932-f5d3-4889-86c2-d9d2ac30cba8 from YbcManager in TaskPool-CreateBackup(5021384f-8d72-4711-8de6-2473bcae2fa6)-0 - Aborting backup 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7 task errored out with Cannot abort: task is already complete.
YW 2023-07-31T16:14:21.421Z [ERROR] 29d75932-f5d3-4889-86c2-d9d2ac30cba8 from YbcClient in TaskPool-CreateBackup(5021384f-8d72-4711-8de6-2473bcae2fa6)-0 - Exception: UNAVAILABLE: Channel shutdown invoked
YW 2023-07-31T16:14:21.421Z [ERROR] 29d75932-f5d3-4889-86c2-d9d2ac30cba8 from YbcManager in TaskPool-CreateBackup(5021384f-8d72-4711-8de6-2473bcae2fa6)-0 - Task 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7_PGSQL_TABLE_TYPE_yugabyte_4ea0facc-e357-475e-a290-e33de87e9fce deletion failed with error: null
YW 2023-07-31T16:14:21.423Z [ERROR] 29d75932-f5d3-4889-86c2-d9d2ac30cba8 from TaskExecutor in TaskPool-CreateBackup(5021384f-8d72-4711-8de6-2473bcae2fa6)-0 - Failed to execute task type subtasks.BackupTableYbc UUID 0986438e-9a07-40ec-939a-9336a8ddfe27 details {"sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","universeUUID":"5021384f-8d72-4711-8de6-2473bcae2fa6","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPort":7100,"tserverHttpPort":9000,"tserverRpcPort":9100,"ybControllerHttpPort":14000,"ybControllerrRpcPort":18018,"redisServerHttpPort":11000,"redisServerRpcPort":6379,"yqlServerHttpPort":12000,"yqlServerRpcPort":9042,"ysqlServerHttpPort":13000,"ysqlServerRpcPort":5433,"nodeExporterPort":9300},"extraDependencies":{"installNodeExporter":true},"firstTry":true,"keyspace":"yu****te","sse":false,"storageConfigUUID":"c118a498-6fb6-4c5f-8ec6-ec8f3b0900c3","storageLocation":"gs://pplus-prod-yb-backups/univ-5021384f-8d72-4711-8de6-2473bcae2fa6/ybc_backup-2023-07-31T16:14:20-24133114/multi-table-yugabyte_4ea0facc-e357-475e-a290-e33de87e9fce","isFullBackup":true,"allTables":false,"disableChecksum":false,"disableMultipart":false,"backupType":"PGSQL_TABLE_TYPE","tableNameList":[],"tableUUIDList":[],"backupParamsIdentifier":"4ea0facc-e357-475e-a290-e33de87e9fce","schedulingFrequency":0,"minNumBackupsToRetain":3,"timeBeforeDelete":604800000,"enableVerboseLogs":false,"alterLoadBalancer":false,"transactionalBackup":false,"tableByTableBackup":false,"parallelism":8,"disableParallelism":false,"customerUuid":"e9a7735c-94c4-4850-b925-cc54af1c8435","baseBackupUUID":"9d5110d2-2dc5-43ac-ba9d-af330b79c1b7","backupUuid":"9d5110d2-2dc5-43ac-ba9d-af330b79c1b7","ignoreErrors":true,"useTablespaces":false,"oldOwner":"yugabyte","backupSizeInBytes":0,"fullChainSizeInBytes":0,"expiryTimeUnit":"DAYS","timeTakenPartial":0,"thisBackupSubTaskStartTime":0,"backupDBStates":{},"fullBackup":true,"targetXClusterConfigs":[],"sourceXClusterConfigs":[]}, hit error.
com.yugabyte.yw.common.PlatformServiceException: Task id 9d5110d2-2dc5-43ac-ba9d-af330b79c1b7_PGSQL_TABLE_TYPE_yugabyte_4ea0facc-e357-475e-a290-e33de87e9fce status: Failed with error COMMAND_FAILED
at com.yugabyte.yw.commissioner.YbcTaskBase.handleTaskCompleteStage(YbcTaskBase.java:97)
at com.yugabyte.yw.commissioner.YbcTaskBase.pollTaskProgress(YbcTaskBase.java:68)
at com.yugabyte.yw.commissioner.tasks.subtasks.BackupTableYbc.run(BackupTableYbc.java:188)
at com.yugabyte.yw.commissioner.TaskExecutor$AbstractRunnableTask.run(TaskExecutor.java:835)
at com.yugabyte.yw.commissioner.TaskExecutor$RunnableSubTask.run(TaskExecutor.java:1213)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at com.yugabyte.yw.common.logging.MDCAwareRunnable.run(MDCAwareRunnable.java:46)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Based on the message Successfully submitted backup task to YB-controller server: 10.21.240.14
(In the error snippet above) we can find the node where the backup task got submitted to and check the YBController logs on the Node: (under /home/yugabyte/controller/logs)
It will have messages like below:
E0731 16:14:21.220299 3247716 yb_tablet_splitting_ctl.h:85] Unable to execute /home/yugabyte/tserver/bin/yb-admin --master_addresses 10.21.240.8:7100,10.24.104.170:7100,10.24.106.48:7100,10.24.120.209:7100,10.24.121.4:7100 --certs_dir_name /home/yugabyte/yugabyte-tls-config disable_tablet_splitting 120000 yb_backup : 1;
E0731 16:14:21.232285 1614085 snapshot.cc:351] Command failed: status: OK; ret: 1
E0731 16:14:21.232450 1614085 snapshot.cc:352] Output:
E0731 16:14:21.232465 1614085 snapshot.cc:353] Error Output:
E0731 16:14:21.233794 1614085 snapshot.cc:351] Command failed: status: OK; ret: 1
E0731 16:14:21.233842 1614085 snapshot.cc:352] Output:
E0731 16:14:21.233868 1614085 snapshot.cc:353] Error Output:
Resolution
Overview
By default, YBController writes some control files into the /tmp/yb-controller-tmp directory
. For example a healthy system will have files like below:
ls -lRth /tmp/yb-controller-tmp/
/tmp/yb-controller-tmp/:
total 0
drwxrwxr-x. 2 yugabyte yugabyte 134 Aug 24 07:05 coprocess
drwx------. 2 yugabyte yugabyte 6 Aug 24 07:05 cleanup
/tmp/yb-controller-tmp/coprocess:
total 16K
-rw-rw-r--. 1 yugabyte yugabyte 0 Aug 24 07:05 snapshot.err.log
-rw-rw-r--. 1 yugabyte yugabyte 8 Aug 24 07:05 snapshot.out.log
-rw-rw-r--. 1 yugabyte yugabyte 275 Aug 24 07:05 snapshot
-rw-rw-r--. 1 yugabyte yugabyte 0 Aug 24 07:05 splitter.err.log
-rw-rw-r--. 1 yugabyte yugabyte 8 Aug 24 07:05 splitter.out.log
-rw-rw-r--. 1 yugabyte yugabyte 275 Aug 24 07:05 splitter
/tmp/yb-controller-tmp/cleanup:
total 0
However a faulty system will be missing some files like below:
[yugabyte@ip-172-16-214-22 ~]$ ls -lrth /tmp/yb-controller-tmp/
total 0
drwx------. 2 yugabyte yugabyte 6 Aug 24 03:20 cleanup
The default configuration for systemd’s tmpfiles.d in some distributions removes any files in /tmp
that are more than 10 days old. This configuration is stored in /usr/lib/tmpfiles.d/tmp.conf
.
If a system is not very active and if we haven’t run any backup/restore for more than 10 days the file timestamp under /tmp/yb-controller-tmp/
won’t be modified. Therefore the above system based cleanup job removes the old directories after 10 days.
Steps:
This issue has been fixed in below versions as part of PLAT-10007:
- 2.16.7 and later
- 2.18.3 and later
Yugabyte recommends upgrading to the versions with fixes.
Workarounds:
a) The missing files are regenerated during YB controller startup, so restarting the controller will address the missing files problem.
For cron enabled systems:
/home/yugabyte/bin/yb-server-ctl.sh controller stop
/home/yugabyte/bin/yb-server-ctl.sh controller start
For systemd enabled systems:
systemctl --user stop yb-controller
systemctl --user start yb-controller
b) We can add the line x /tmp/yb-controller-tmp
to /usr/lib/tmpfiles.d/tmp.conf
on the affected nodes which will also prevent the files under /tmp
from deletion. (Restart YBController after making the changes to regenerate the missing files)
$ cat /usr/lib/tmpfiles.d/tmp.conf (truncated output showing just the change made)
#Excluding the Yugabyte YB-Controller tmp process dir from deletion due to a known issue PLAT-10007
x /tmp/yb-controller-tmp
c) Permanently resolve the issue by changing the temporary process directory used by YBController from /tmp/
to /{path_to_ybc-data}/ybc-data/tmp
by adding the flag below in YBController server.conf file and restarting YBController to make this change take effect. The server.conf is located in the yugabyte user’s home directory at controller/conf/server.conf
#cat controller/conf/server.conf (truncated output showing just the change made)
--tmp_dir=/{path_to_ybc-data}/ybc-data/tmp/
where:
- {path_to_ybc-data} is the path to the ybc-data directory. This can be found by
find / -type d -name "ybc-data" 2>/dev/null
so if {path_to_ybc-data} is /mnt/d0/ybc-data, then
--tmp_dir=/mnt/d0/ybc-data/tmp/
Additional Information
If we want more information logged in the YBController logs, Log into the problematic node and edit the YBController’s server.conf file to add the following line to enable verbose logging, The server.conf is located in the yugabyte user’s home directory at controller/conf/server.conf
#cat controller/conf/server.conf (truncated output showing just the change made)
--v=1
Verbose logging is the default setting in newer versions of the controller, so it is safe to leave this change in place indefinitely.
Comments
0 comments
Please sign in to leave a comment.