YBDB version: 2.14.x and older
The TServer experiences recurring crashes, and a manual restart does not fix the issue. The TServer enters a crash loop, with the following error message consistently appearing in the /home/yugabyte/tserver/tserver.err log.
(NOTE: If the tserver.err file is missing, check whether the universe is using systemd. On systemd systems, use journalctl to gather the equivalent output. For example: journalctl -u yb-tserver.service --no-pager > tserver-journalctl.out)
*** Aborted at 1689644542 (unix time) try "date -d @1689644542" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGTERM (@0x2c402e0001cfdd) received by PID 18437 (TID 0x7efe76dcf100) from PID 118749; stack trace: ***
    @ 0x7efe75b8f9f3 __GI_epoll_wait
    @ 0x29d41ad boost::asio::detail::epoll_reactor::run()
    @ 0x29cf16c boost::asio::detail::scheduler::run()
    @ 0x29cbdb1 yb::tserver::(anonymous namespace)::TabletServerMain()
    @ 0x29c4ee9 main
    @ 0x7efe75ac9825 __libc_start_main
    @ 0x26b202e _start
Then, every 60 seconds, the TServer attempts to restart automatically and fails with the following error:
*** Aborted at 1689649383 (unix time) try "date -d @1689649383" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x4e0) received by PID 53087 (TID 0x7f4bdb9c3700) from PID 1248; stack trace: ***
    @ 0x3596d33 yb::tablet::TabletPeer::SetCDCSDKRetainOpIdAndTime()
    @ 0x2b11c4a yb::cdc::CDCServiceImpl::UpdateTabletPeerWithCheckpoint()
    @ 0x2b0982d yb::cdc::CDCServiceImpl::UpdatePeersAndMetrics()
    @ 0x2b0b2e6 std::__1::__thread_proxy<>()
    @ 0x7f4beeabb694 start_thread
    @ 0x7f4beefbd41d __clone
This issue is caused by a rare race condition: the TServer can crash if a background thread attempts to update xCluster metrics for a tablet that is not yet fully initialized (for example, during TServer startup or tablet load balancing).
Note: This issue is fixed on the YBDB 2.14 branch starting in 2.14.12 and newer.
* The following are the steps to resolve the issue:
Step 1: Pause replication.
Step 2: Run a rolling restart of the consumer to clear any pending
Step 3: Run a rolling restart of the producer so the tablets bootstrap cleanly and all nodes in the system start up.
Step 4: Re-enable replication.
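As a rough sketch, the pause/re-enable steps can be performed with yb-admin on the consumer universe, and the rolling restarts with systemd on each node in turn. The master addresses and replication group name below are placeholders; substitute your own values, and prefer the YugabyteDB Anywhere UI for rolling restarts if the universe is managed there.

```shell
# Placeholder values - replace with your consumer universe's master
# addresses and the actual replication group/universe ID.
CONSUMER_MASTERS="10.0.0.1:7100,10.0.0.2:7100,10.0.0.3:7100"
REPLICATION_GROUP="my_replication_group"

# Step 1: Pause replication on the consumer (0 = disabled).
yb-admin -master_addresses "$CONSUMER_MASTERS" \
  set_universe_replication_enabled "$REPLICATION_GROUP" 0

# Steps 2-3: On each consumer node, then each producer node,
# restart the TServer one node at a time, waiting for the node
# to rejoin before moving on.
sudo systemctl restart yb-tserver

# Step 4: Re-enable replication on the consumer (1 = enabled).
yb-admin -master_addresses "$CONSUMER_MASTERS" \
  set_universe_replication_enabled "$REPLICATION_GROUP" 1
```

Restart strictly one node at a time so the cluster keeps quorum while each TServer bootstraps its tablets.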
GitHub Issues filed to address the race condition: