YBDB version: 2.14.x and older
The TServer experiences recurring crashes, and a manual restart does not fix the issue. The TServer enters a crash loop, with the following error message consistently appearing in the /home/yugabyte/tserver/tserver.err log.
(NOTE: If the tserver.err file is missing, check whether the universe is using systemd. On systemd systems, use journalctl to gather the equivalent output. For example: journalctl -u yb-tserver.service --no-pager > tserver-journalctl.out)
*** Aborted at 1689644542 (unix time) try "date -d @1689644542" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGTERM (@0x2c402e0001cfdd) received by PID 18437 (TID 0x7efe76dcf100) from PID 118749; stack trace: ***
    @ 0x7efe75b8f9f3 __GI_epoll_wait
    @ 0x29d41ad boost::asio::detail::epoll_reactor::run()
    @ 0x29cf16c boost::asio::detail::scheduler::run()
    @ 0x29cbdb1 yb::tserver::(anonymous namespace)::TabletServerMain()
    @ 0x29c4ee9 main
    @ 0x7efe75ac9825 __libc_start_main
    @ 0x26b202e _start
Then, every 60 seconds, the TServer attempts to restart automatically and fails with the following error:
*** Aborted at 1689649383 (unix time) try "date -d @1689649383" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x4e0) received by PID 53087 (TID 0x7f4bdb9c3700) from PID 1248; stack trace: ***
    @ 0x3596d33 yb::tablet::TabletPeer::SetCDCSDKRetainOpIdAndTime()
    @ 0x2b11c4a yb::cdc::CDCServiceImpl::UpdateTabletPeerWithCheckpoint()
    @ 0x2b0982d yb::cdc::CDCServiceImpl::UpdatePeersAndMetrics()
    @ 0x2b0b2e6 std::__1::__thread_proxy<>()
    @ 0x7f4beeabb694 start_thread
    @ 0x7f4beefbd41d __clone
This issue is caused by a rare race condition: the TServer can crash if a background thread attempts to update xCluster metrics for a tablet that is not yet fully initialized (for example, during TServer startup or tablet load balancing).
Note: This issue is fixed on the YBDB 2.14 branch starting in 2.14.12 and newer.
* The following are the steps to resolve the issue:
Step 1: Pause replication.
Step 2: Run a rolling restart of the consumer to clear any pending
Step 3: Run a rolling restart of the producer so the tablets bootstrap cleanly and all nodes in the system start up.
Step 4: Re-enable replication.
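As a rough sketch, the pause/re-enable steps can be performed with yb-admin on the consumer universe, and the rolling restarts with systemd on each node in turn. The master addresses and replication group name below are placeholders; substitute your own values, and prefer the YugabyteDB Anywhere UI for rolling restarts if the universe is managed there.

```shell
# Placeholder values - replace with your consumer universe's master
# addresses and the actual replication group/universe ID.
CONSUMER_MASTERS="10.0.0.1:7100,10.0.0.2:7100,10.0.0.3:7100"
REPLICATION_GROUP="my_replication_group"

# Step 1: Pause replication on the consumer (0 = disabled).
yb-admin -master_addresses "$CONSUMER_MASTERS" \
  set_universe_replication_enabled "$REPLICATION_GROUP" 0

# Steps 2-3: On each consumer node, then each producer node,
# restart the TServer one node at a time, waiting for the node
# to rejoin before moving on.
sudo systemctl restart yb-tserver

# Step 4: Re-enable replication on the consumer (1 = enabled).
yb-admin -master_addresses "$CONSUMER_MASTERS" \
  set_universe_replication_enabled "$REPLICATION_GROUP" 1
```

Restart strictly one node at a time so the cluster keeps quorum while each TServer bootstraps its tablets.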
GitHub Issues filed to address the race condition: