Environment

YugabyteDB - 2.18 and older
The fix has been pushed to YugabyteDB versions 2.18.8.0, 2.20.4.1, 2.20.5.0, >=2024.1.1.0 and >=2.23.0.0.

Issue

During the snapshot deletion phase in backups, users may observe latency spikes for YCQL inserts.

Resolution

Overview

The issue arises during the snapshot deletion phase of YugabyteDB backups, whether incremental or full. This phase can introduce significant latency spikes for YCQL inserts, with p99 latency sometimes exceeding 8–10 seconds. The root cause is suboptimal behavior in the snapshot deletion code path: each deletion issues redundant flushes that temporarily block writes to the tablet.

At the conclusion of a full backup, snapshots for all user tablet replicas involved in the backup are deleted. Each snapshot deletion initiates two asynchronous flushes for the corresponding tablet. These flushes are intended to record the metadata changes that mark the snapshot as deleted, but they also block writes to the affected tablet until the first flush completes. This is due to the RocksDB configuration (max_write_buffer_number is set to 2), which stops writes when more than one flush operation is outstanding for a replica.
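The stall condition described above can be sketched with a toy model. This is an illustration only, not YugabyteDB or RocksDB code: the class and method names are hypothetical, and it assumes that each flush request freezes one memtable and that writes stop once the number of memtables waiting for flush reaches max_write_buffer_number.

```python
from dataclasses import dataclass


@dataclass
class TabletWriteModel:
    """Toy model of one tablet replica's memtable state (hypothetical, not real DB code)."""

    max_write_buffer_number: int = 2  # value reported in the stall log message
    immutable_memtables: int = 0      # memtables frozen and waiting for flush

    def request_flush(self) -> None:
        # Assumption: an asynchronous flush request freezes the active memtable,
        # which then waits until a background thread writes it out.
        self.immutable_memtables += 1

    def complete_flush(self) -> None:
        # A finished flush removes one memtable from the waiting set.
        if self.immutable_memtables > 0:
            self.immutable_memtables -= 1

    def writes_stopped(self) -> bool:
        # Writes stall once the waiting memtables reach the configured limit.
        return self.immutable_memtables >= self.max_write_buffer_number


tablet = TabletWriteModel()
tablet.request_flush()
after_first_request = tablet.writes_stopped()
tablet.request_flush()
after_second_request = tablet.writes_stopped()
tablet.complete_flush()
after_first_completion = tablet.writes_stopped()
print(after_first_request, after_second_request, after_first_completion)
```

In this model, writes stay open after the first flush request, stop after the second, and resume as soon as the first flush completes, which matches the sequence described for snapshot deletion.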

Key contributing factors include:

Parallel Flush Operations:
- All tablet replicas begin their two asynchronous flushes simultaneously.
- The flush operations are constrained by a limit of two background flushes and a small number of available threads, creating bottlenecks.

Triggering Compactions:
- Each flush operation generates two extra SST files per tablet replica.
- The increased number of SST files often triggers compactions, which further strain the system.

High Concurrency:
- Deleting snapshots for all involved tablets at the same time results in numerous parallel flushes and compactions.
- These processes compete for the same throttling resources, exacerbating delays.

Core Utilization Bottleneck:
- While the process is disk-intensive, observations suggest that the limited utilization of only two CPU cores for flush and compaction threads may also contribute to the delays.

The combined effect of these factors is that writes to certain tablet replicas can remain blocked for extended periods, leading to latency spikes for YCQL inserts. A small subset of tablets may experience particularly prolonged delays due to their dependencies on the completion of other flush operations.
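To see why a subset of tablets can stay blocked far longer than others, the queueing effect can be approximated with a small simulation. This is a rough sketch under simplifying assumptions that are not taken from the product: flushes are served in FIFO order, every flush takes the same fixed time, compactions are ignored, and the tablet count and flush duration are made-up inputs. The function name is hypothetical.

```python
import heapq


def first_flush_completion_times(num_tablets, flush_seconds, flush_threads=2):
    """Return, per tablet, the time at which the first of its two queued flushes finishes."""
    # Each background flush thread is represented by the time it next becomes free.
    workers = [0.0] * flush_threads
    heapq.heapify(workers)
    unblocked_at = []
    for _tablet in range(num_tablets):
        for flush_index in range(2):  # two flushes per snapshot deletion
            start = heapq.heappop(workers)   # earliest available thread
            finish = start + flush_seconds
            heapq.heappush(workers, finish)
            if flush_index == 0:
                # In this model, writes resume when the first flush completes.
                unblocked_at.append(finish)
    return unblocked_at


# Illustrative inputs only: 100 tablet replicas, 0.1 s per flush.
times = first_flush_completion_times(num_tablets=100, flush_seconds=0.1)
print(f"first tablet unblocked after {times[0]:.1f}s")
print(f"last tablet unblocked after {times[-1]:.1f}s")
```

With two flush threads, each tablet's pair of flushes occupies both threads for one flush interval, so the wait grows linearly with a tablet's position in the queue. With these made-up inputs the last tablet waits about 10 seconds, while the first waits only one flush interval.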

The redundant dual flushes during snapshot deletion, compounded by high concurrency and system throttling, result in significant latency spikes. The system's inability to efficiently handle simultaneous flushes and compactions amplifies the problem, leading to p99 latency exceeding acceptable thresholds.

The log message to look for while the issue is occurring is:

Stopping writes because we have 2 immutable memtables (waiting for flush) max_write_buffer_number is set to 2
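A quick way to check whether a node hit this condition is to scan its logs for the message above. The following is a minimal sketch: the function name is hypothetical, and the caller must supply the log lines, since log locations vary by deployment.

```python
import re
from collections import Counter

# Matches the stall message and captures the reported immutable memtable count.
STALL_PATTERN = re.compile(
    r"Stopping writes because we have (\d+) immutable memtables"
)


def count_write_stalls(log_lines):
    """Count stall messages, keyed by the number of immutable memtables reported."""
    counts = Counter()
    for line in log_lines:
        match = STALL_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = [
    "Stopping writes because we have 2 immutable memtables (waiting for flush) "
    "max_write_buffer_number is set to 2",
    "an unrelated log line",
]
print(count_write_stalls(sample))
```

Because the function accepts any iterable of strings, an open file handle for a log file can be passed directly in place of the sample list.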

This bug is fixed in https://github.com/yugabyte/yugabyte-db/issues/22369

Latency spikes observed post incremental backups
