Environment
- YugabyteDB - 2.20.0 - 2.20.7.0
- YugabyteDB - 2024.1.0 - 2024.1.3.0
Symptoms
Various YugabyteDB universe operations (not limited to the examples below) can fail. A few examples:
- Under-replicated tablet alerts in YBA (specifically complaining about only 1 tablet), even though Master UI > Utilities > Replica Info may not show any under-replicated tablets.
- The cluster is not load balanced.
- Any universe task triggered from the YugabyteDB Anywhere UI (e.g. Node Action, Universe Upgrade, Rolling Restart) fails with the following error or a similar one:
Caused by: java.lang.RuntimeException: CheckUnderReplicatedTablets, timing out after retrying 11 times for a duration of 652891ms, greater than max time out of 600000ms. Under-replicated tablet size: 4. Failing...
Cause
Tablet bootstrap can run into a deadlock while processing bootstrap operations. The deadlock condition can preclude the load balancer from performing operations which maintain tablet distribution across a YugabyteDB universe.
Additional Information
The issue can occur whenever a tserver restarts and a tablet peer has an outstanding truncate operation to replay as part of opening the tablet.
Steps To Identify Issue
Master Leader UI
The following steps are to be executed on the Master Leader UI and directly on the Master Leader via SSH connection.
1. In Master Leader UI >> Utilities >> Logs, messages similar to the following repeat in large quantities:
I0203 20:08:42.007362 6642 cluster_balance_util.cc:503] tablet server db1306ff367c4b8182ed229fef773059 has a pending delete. Not allowing it to take more tablets
- The following messages can also be observed in the log:
I0203 20:31:42.374270 6642 cluster_balance.cc:704] tablet server db1306ff367c4b8182ed229fef773059 has a pending delete for tablets [adf6e7ddd2df4425a8b90b8312114cc9]
2. As an example, the following command, run on the Master Leader OS command line against the most recent yb-master.INFO file, counts the occurrences of the "pending delete" messages:
grep "pending delete" yb-master.INFO | cut -f5- -d" " | sort -n | uniq -c
- And the output can look similar to the following:
cluster_balance.cc:704] tablet server db1306ff367c4b8182ed229fef773059 has a pending delete for tablets [adf6e7ddd2df4425a8b90b8312114cc9]
cluster_balance_util.cc:503] tablet server db1306ff367c4b8182ed229fef773059 has a pending delete. Not allowing it to take more tablets
- In Master Leader UI >> Tablet Servers, identify the tserver by ID. In this example, the tserver ID is db1306ff367c4b8182ed229fef773059. Then browse to the web UI of that tserver.
- On the tserver web UI, on the left-hand side, select "Tablets" and look up the tablet ID, which in this example is adf6e7ddd2df4425a8b90b8312114cc9. The tablet will be listed in the BOOTSTRAPPING state.
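The log checks above can also be scripted. The following is a minimal Python sketch, not an official tool, that counts the cluster_balance.cc "pending delete" messages per tserver ID and tablet ID. The regular expression is modeled on the sample log line in this article and is an assumption; the message wording may differ between releases. The function name summarize_pending_deletes is hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the cluster_balance.cc sample line in this article.
# Assumption: the wording is the same in the release being inspected.
PENDING_DELETE_RE = re.compile(
    r"tablet server (?P<tserver>[0-9a-f]{32}) has a pending delete "
    r"for tablets \[(?P<tablets>[^\]]*)\]"
)


def summarize_pending_deletes(lines):
    """Count pending-delete messages per (tserver ID, tablet ID) pair."""
    counts = Counter()
    for line in lines:
        match = PENDING_DELETE_RE.search(line)
        if match is None:
            continue
        # Assumption: multiple tablet IDs, if present, are comma-separated.
        for tablet in match.group("tablets").split(","):
            tablet = tablet.strip()
            if tablet:
                counts[(match.group("tserver"), tablet)] += 1
    return counts


# Sample lines copied from the log excerpts in this article.
sample = [
    "I0203 20:08:42.007362 6642 cluster_balance_util.cc:503] tablet server "
    "db1306ff367c4b8182ed229fef773059 has a pending delete. "
    "Not allowing it to take more tablets",
    "I0203 20:31:42.374270 6642 cluster_balance.cc:704] tablet server "
    "db1306ff367c4b8182ed229fef773059 has a pending delete for tablets "
    "[adf6e7ddd2df4425a8b90b8312114cc9]",
]

summary = summarize_pending_deletes(sample)
for (tserver, tablet), count in summary.items():
    print(f"{count} tserver={tserver} tablet={tablet}")
```

To run it against a real log, pass an open file handle for yb-master.INFO instead of the sample list. A tablet that keeps appearing for the same tserver is the one to look up on the tserver web UI.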
T-Server UI
The following steps are to be executed on the Tserver UI and directly on the Tserver via SSH connection.
1. On the left-hand side, browse to Utilities > Threads. The output will include a stack trace similar to the following:
@ 0x7f926b3be3b7 __pthread_cond_timedwait
@ 0x562ec520cd81 std::__1::condition_variable::wait_until<>()
@ 0x562ec53dc557 std::__1::this_thread::sleep_until<>()
@ 0x562ec687791a yb::RWOperationCounter::DisableAndWaitForOps()
@ 0x562ec6878eef yb::ScopedRWOperationPause::ScopedRWOperationPause()
@ 0x562ec6200579 yb::tablet::Tablet::PauseReadWriteOperations()
@ 0x562ec61ff98b yb::tablet::Tablet::StartShutdownRocksDBs()
@ 0x562ec6226adc yb::tablet::Tablet::Truncate()
@ 0x562ec624c9f7 yb::tablet::TabletBootstrap::PlayAnyRequest()
@ 0x562ec624a93d yb::tablet::TabletBootstrap::ApplyCommittedPendingReplicates()
@ 0x562ec62448e5 yb::tablet::TabletBootstrap::PlaySegments()
@ 0x562ec6238087 yb::tablet::TabletBootstrap::Bootstrap()
@ 0x562ec62500ac yb::tablet::BootstrapTablet()
@ 0x562ec64fa5b6 yb::tserver::TSTabletManager::OpenTablet()
@ 0x562ec68b3598 yb::ThreadPool::DispatchThread()
@ 0x562ec68af753 yb::thread::SuperviseThread()
2. When browsing to Utilities > Logs, tserver messages similar to the following will appear in significant quantities:
W0203 21:20:10.263926 2994113 operation_counter.cc:176] Waiting for 1 pending operations to complete now for 1454923.708s
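Both tserver checks above can be approximated with a short script. The Python sketch below is illustrative only: it tests whether a thread dump contains the frames from the stack trace shown in this article, and it extracts the longest reported wait from the operation_counter.cc warnings. Treating the three chosen frames as the signature of this issue is an assumption, and the helper names stack_matches_signature and longest_wait_seconds are hypothetical.

```python
import re

# Frames taken from the stack trace in this article. Assumption: seeing all
# three in one thread's stack indicates a truncate replay stuck in bootstrap.
SIGNATURE_FRAMES = (
    "yb::RWOperationCounter::DisableAndWaitForOps",
    "yb::tablet::Tablet::Truncate",
    "yb::tablet::TabletBootstrap::PlayAnyRequest",
)

# Pattern modeled on the operation_counter.cc warning shown in this article.
WAIT_RE = re.compile(
    r"Waiting for (\d+) pending operations to complete now for (\d+(?:\.\d+)?)s"
)


def stack_matches_signature(stack_text):
    """Return True if every signature frame appears in the stack text."""
    return all(frame in stack_text for frame in SIGNATURE_FRAMES)


def longest_wait_seconds(log_lines):
    """Return the longest wait (in seconds) reported, or None if no match."""
    longest = None
    for line in log_lines:
        match = WAIT_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group(2))
        if longest is None or seconds > longest:
            longest = seconds
    return longest


# Excerpts copied from this article.
sample_stack = """\
@ 0x562ec687791a yb::RWOperationCounter::DisableAndWaitForOps()
@ 0x562ec6226adc yb::tablet::Tablet::Truncate()
@ 0x562ec624c9f7 yb::tablet::TabletBootstrap::PlayAnyRequest()
"""
sample_log = [
    "W0203 21:20:10.263926 2994113 operation_counter.cc:176] Waiting for 1 "
    "pending operations to complete now for 1454923.708s",
]

print(stack_matches_signature(sample_stack))
print(longest_wait_seconds(sample_log))
```

A wait time that keeps growing across successive warnings, combined with a matching stack, is consistent with the symptoms described in this article. It does not replace confirmation by Yugabyte Support.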
Resolution
Permanent Fix
- The issue described in this article is addressed by GHI-23243, which is present in the following versions:
- YugabyteDB - 2.20.8 or later
- YugabyteDB - 2024.1.4.0 or later
- YugabyteDB - 2024.2.0.0 or later
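As a convenience, the version ranges in this article can be encoded in a small check. The Python sketch below is a hedged illustration: it assumes version strings are dot-separated integers with an optional build suffix after a hyphen (for example "-b1"), and it treats every version from the start of an affected series up to, but not including, the first fixed release as affected. The names parse_version and is_listed_as_affected are hypothetical. A result of False only means the version is outside the ranges listed here.

```python
def parse_version(version):
    """Convert a version string to a 4-part integer tuple.

    Assumption: an optional build suffix follows a hyphen and is ignored.
    """
    core = version.split("-", 1)[0]
    parts = [int(part) for part in core.split(".")]
    parts += [0] * (4 - len(parts))  # pad short versions such as "2.20.8"
    return tuple(parts[:4])


# (first affected version, first fixed version) pairs from this article.
AFFECTED_RANGES = (
    (parse_version("2.20.0"), parse_version("2.20.8")),
    (parse_version("2024.1.0"), parse_version("2024.1.4.0")),
)


def is_listed_as_affected(version):
    """Return True if the version falls inside a range listed in this article."""
    parsed = parse_version(version)
    return any(low <= parsed < high for low, high in AFFECTED_RANGES)


for candidate in ("2.20.7.0", "2.20.8.0", "2024.1.3.0", "2024.2.0.0"):
    print(candidate, is_listed_as_affected(candidate))
```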
Workaround
- Contact Yugabyte Support by raising a case. The resolution consists of a complex procedure that must be attempted only under the supervision of Yugabyte Support.
- Yugabyte Support can follow the Knowledge Article for this procedure, which is internal to Yugabyte.