Environment:

Yugabyte CoreDB

Issue:

Read queries time out after a large number of deletes on the table.

delete from frames where prefix='FDRPEdge3';
OperationTimedOut: errors={'10.64.10.55': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.64.10.55

Root Cause:

By default, compactions are triggered automatically as new data arrives and memstores are flushed to create SSTable files. There are no scheduled compactions, although that feature is being considered in the following GitHub issue:
https://github.com/yugabyte/yugabyte-db/issues/7614

This behavior can cause problems when a partition holds many tombstones, i.e. dead data that has been deleted but not yet compacted. Queries against a table with many tombstones can time out, because a SELECT has to skip past possibly millions of dead rows before reaching valid rows.

Resolution

Overview

To resolve the above issue, manual compactions can be run on the table.

Steps to run manual compaction:

Step 1. Use set_flag -force to set unresponsive_ts_rpc_retry_limit to 0 on the master nodes only, because Flush/Compact requests to the tserver are RPCs that are expected to take a long time.
For example:

~/tserver/bin/yb-ts-cli --server_address=<master 1> set_flag -force unresponsive_ts_rpc_retry_limit 0
~/tserver/bin/yb-ts-cli --server_address=<master 2> set_flag -force unresponsive_ts_rpc_retry_limit 0
~/tserver/bin/yb-ts-cli --server_address=<master 3> set_flag -force unresponsive_ts_rpc_retry_limit 0
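
The three commands above differ only in the master address, so they can be wrapped in a small loop. The following is a minimal sketch, not an official tool: the names set_retry_limit, YB_BIN and DRY_RUN are illustrative assumptions, and the host:port values in the usage line are placeholders. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: apply one value of unresponsive_ts_rpc_retry_limit to each master.
# YB_BIN, DRY_RUN and set_retry_limit are illustrative names, not part of YugabyteDB.
YB_BIN="${YB_BIN:-$HOME/tserver/bin}"

# Usage: set_retry_limit VALUE MASTER_ADDRESS...
set_retry_limit() {
  value="$1"
  shift
  for master in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: print the command without running it.
      echo "$YB_BIN/yb-ts-cli --server_address=$master set_flag -force unresponsive_ts_rpc_retry_limit $value"
    else
      # Same arguments as the article's command; stop at the first failure.
      "$YB_BIN/yb-ts-cli" "--server_address=$master" set_flag -force \
        unresponsive_ts_rpc_retry_limit "$value" || return 1
    fi
  done
}
```

For example, `DRY_RUN=0 set_retry_limit 0 master1:7100 master2:7100 master3:7100` would apply Step 1 (the addresses are placeholders), and the same call with 20 instead of 0 would apply the revert described in Step 4.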

Step 2. Run the manual compaction with the yb-admin command on one node only. The command below sets timeout_in_seconds to 86400 (1 day) so that the command does not time out and returns when the compaction completes; set it higher if needed. If the command does time out after timeout_in_seconds, that is acceptable: the compaction continues to run in the background. The only drawback is that it becomes harder to monitor the compaction or to know when it completes.

Note: This only needs to be executed on one node in the cluster. It will run compaction on all the nodes in the cluster.

~/tserver/bin/yb-admin -master_addresses <master addresses> compact_table <keyspace> <table name> 86400
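
Because timeout_in_seconds is given in seconds, it can be convenient to derive it from a number of hours. The following is a rough sketch under stated assumptions: compact_table_for_hours, YB_BIN, MASTER_ADDRESSES and DRY_RUN are hypothetical names introduced here, and the function prints the command by default rather than running it.

```shell
#!/bin/sh
# Sketch only: build the compact_table command with a timeout given in hours.
# compact_table_for_hours, YB_BIN, MASTER_ADDRESSES and DRY_RUN are illustrative names.
YB_BIN="${YB_BIN:-$HOME/tserver/bin}"

# Usage: compact_table_for_hours KEYSPACE TABLE [HOURS]   (HOURS defaults to 24)
compact_table_for_hours() {
  keyspace="$1"
  table="$2"
  hours="${3:-24}"
  timeout_s=$((hours * 3600))   # 24 hours -> 86400 seconds, as in the article
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: print the command without running it.
    echo "$YB_BIN/yb-admin -master_addresses $MASTER_ADDRESSES compact_table $keyspace $table $timeout_s"
  else
    "$YB_BIN/yb-admin" -master_addresses "$MASTER_ADDRESSES" \
      compact_table "$keyspace" "$table" "$timeout_s"
  fi
}
```

With MASTER_ADDRESSES set to the cluster's comma-separated master addresses, `compact_table_for_hours <keyspace> frames 48` would print the command with a 172800-second timeout.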

Step 3. In version 2.20 and above, the new yb-admin command compaction_status reports compaction details for a table: when a compaction was last requested, whether a full compaction is currently in progress, and when the last full compaction completed. With the show_tablets flag, it also shows per-tablet details for each tserver.
Usage:

yb-admin -master_addresses <master addresses> compaction_status <ysql|ycql>.<keyspace> <table name> [show_tablets] (default false)

Examples

yb-admin -master_addresses $MST compaction_status ysql.comp comptest
No full compaction taking place
Last full compaction completion time: 2024-02-02 13:19:43.755769
Last admin compaction request time: 2024-02-02 13:19:43.560482

yb-admin -master_addresses $MST compaction_status ysql.comp comptest show_tablets
No full compaction taking place

tserver uuid: 13f4cdbff47149bfbdf83c873395a8f7
tablet id | full compaction state | last full compaction completion time

b5c53398c6dd459189e385c1a3e80ab8 IDLE 2024-02-02 13:19:43.755769
28e304837bcb43acac2d2a3628bec2fe IDLE 2024-02-02 13:19:43.949334
6ef0fd084ab24828a146ac0b3c38e9e6 IDLE 2024-02-02 13:19:44.282316
6be0ade66f6b4f66b3fa858a76458b5a IDLE 2024-02-02 13:19:44.622908
1d3fe7f97dc34347a227de4b9aa5aa6a IDLE 2024-02-02 13:19:44.966193
6bf50a4caabd4ce79dda0f6b2e611827 IDLE 2024-02-02 13:19:45.241848
cc94d9eaec564f0c8d56a22ac1508239 IDLE 2024-02-02 13:19:45.440681
6e730548b2824167a972b1af8a021da1 IDLE 2024-02-02 13:19:45.638369
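
When a table has many tablets, the show_tablets output can be summarized rather than read line by line. The sketch below assumes the output format is exactly as in the example above (a 32-character hexadecimal tablet id followed by the state) and treats any state other than IDLE as still busy; count_busy_tablets is a hypothetical helper name, and the real output may differ between versions.

```shell
#!/bin/sh
# Sketch only: count tablets whose full compaction state is not IDLE.
# Reads compaction_status ... show_tablets output on stdin.
# Assumes tablet lines look like: <32 hex chars> <STATE> <date> <time>
count_busy_tablets() {
  awk '
    # A tablet line starts with a 32-character lowercase hex id.
    length($1) == 32 && $1 ~ /^[0-9a-f]+$/ && $2 != "IDLE" { busy++ }
    END { print busy + 0 }   # print 0 when no busy tablet was seen
  '
}
```

For example, `yb-admin -master_addresses $MST compaction_status ysql.comp comptest show_tablets | count_busy_tablets` would print 0 for the output shown above, since every tablet is IDLE.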

Step 4. Once the compaction is done, revert the unresponsive_ts_rpc_retry_limit flag to its default value (20) on the same master nodes.

~/tserver/bin/yb-ts-cli --server_address=$IP1 set_flag -force unresponsive_ts_rpc_retry_limit 20
~/tserver/bin/yb-ts-cli --server_address=$IP2 set_flag -force unresponsive_ts_rpc_retry_limit 20
~/tserver/bin/yb-ts-cli --server_address=$IP3 set_flag -force unresponsive_ts_rpc_retry_limit 20
