Environment
- YugabyteDB - 2.4 or lower
Issue
- YCQL write operations fail with the following error:
Write failed: Operation failed. Try again. (yb/docdb/lock_batch.cc:31): Timeout: 1.309s: Failed to obtain locks until deadline: 21460499.690s
- High latencies for read/write operations due to RPC threads stuck waiting on a lock.
- High rpc_tp_TabletServer thread queue build-up, causing contention and eventually a deadlock condition, which is visible at https://<tablet-server-ip>:9000/threadz
Resolution
Yugabyte's CoreDB engineering team is tracking the issue. Please refer to the following GitHub issues:
https://github.com/yugabyte/yugabyte-db/issues/11258
https://github.com/yugabyte/yugabyte-db/issues/4375
Workaround
1. A rolling restart may resolve this issue. This can be done from Yugabyte's admin console.
2. Set the client and server YCQL timeouts to the same value, based on your application requirements. The server flag defaults to 60000 ms (60s), as shown in its definition below.
Server: client_read_write_timeout_ms
Client: request-timeout
DEFINE_int32(client_read_write_timeout_ms, 60000, "Timeout for client read and write operations.");
3. If the issue is not resolved after steps 1 and 2, open a P1 ticket with Yugabyte Support.
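The timeout alignment in step 2 can be sanity-checked with a small script before rolling out a configuration change. The sketch below is only an illustration: the function names are hypothetical, and it assumes the client timeout is expressed in seconds while the server flag client_read_write_timeout_ms is in milliseconds. Verify the units your client or driver actually uses.

```python
def timeouts_match(client_timeout_s: float, server_timeout_ms: int) -> bool:
    """Return True when both timeouts describe the same duration.

    Assumption: the client value is in seconds and the server value
    (client_read_write_timeout_ms) is in milliseconds.
    """
    return round(client_timeout_s * 1000) == server_timeout_ms


def describe(client_timeout_s: float, server_timeout_ms: int) -> str:
    """Return a short human-readable verdict for a pair of timeouts."""
    client_ms = round(client_timeout_s * 1000)
    if client_ms == server_timeout_ms:
        return "ok: client and server timeouts match"
    if client_ms < server_timeout_ms:
        # This is the risky case described in the Root Cause section.
        return "mismatch: client timeout is lower than server timeout"
    return "mismatch: client timeout is higher than server timeout"


if __name__ == "__main__":
    # Example values only: a 10s client timeout against the 60000 ms default.
    print(describe(10, 60000))
    print(describe(60, 60000))
```

A check like this could run as part of a deployment script, so that a client timeout lower than the server timeout is caught before it reaches production.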
Root Cause
When executing queries on a transaction-enabled YCQL table, YugabyteDB internally retries operations on transactions that have expired or that conflict with other transactions. When the client timeout is set lower than the server timeout, the YCQL service keeps retrying the operation without aborting it, which causes a large RPC queue to build up. This eventually leads to a deadlock condition in which threads trying to acquire a lock are blocked and waiting, resulting in overall latency and slowness.
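To get a rough feel for why a lower client timeout can inflate the RPC queue, the toy model below estimates how many attempts of one logical operation the server may still be holding at once. It is not taken from the YugabyteDB code. It assumes that the client re-sends the operation as soon as its own timeout fires and that the server keeps working on each attempt until the server timeout; both are simplifications for illustration, and the function name is hypothetical.

```python
import math


def max_overlapping_attempts(client_timeout_s: float, server_timeout_s: float) -> int:
    """Upper bound on attempts of one operation still alive on the server.

    Toy model (assumptions, not YugabyteDB behavior guarantees):
    - the client re-sends immediately when its timeout fires;
    - the server keeps each attempt alive until the server timeout.
    """
    if client_timeout_s <= 0 or server_timeout_s <= 0:
        raise ValueError("timeouts must be positive")
    # A new attempt arrives every client_timeout_s seconds and each one
    # lives for server_timeout_s seconds, so at most this many overlap.
    return max(1, math.ceil(server_timeout_s / client_timeout_s))


if __name__ == "__main__":
    print(max_overlapping_attempts(10, 60))  # 6 attempts can overlap
    print(max_overlapping_attempts(60, 60))  # 1 attempt when timeouts match
```

Under these assumptions, matching the two timeouts keeps a single attempt in flight per operation, which is consistent with the workaround of setting both values equal.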