Environment
- YugabyteDB Anywhere - 2.12.x or newer
Issue
- Customer is seeing many “Stopping writes because we have 2 immutable memtables (waiting for flush)” messages in the tserver logs.
Resolution
Overview
Verify the value of the rocksdb_compact_flush_rate_limit_bytes_per_sec flag. The default compaction/flush rate limit is 1024 MB/s (256 MB/s in previous versions).
In one case, a customer had modified this value and set a very low limit of 50 MB/s to throttle compaction speed.
As a result, the memtables filled up faster than they could be flushed to disk. Restoring the value to 256 MB/s solved the problem.
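If the flag needs to be corrected on a running tserver, it can be changed at runtime with yb-ts-cli, along the lines of the sketch below (the server address is a placeholder and 268435456 is 256 MB expressed in bytes; setting the flag through the YugabyteDB Anywhere UI is still preferred so the change persists across restarts):
Example command:
# Set the compaction/flush rate limit back to 256 MB/s (268435456 bytes) on one tserver
# --force allows setting the flag even if it is not marked as runtime-safe
yb-ts-cli --server_address=<tserver-ip>:9100 set_flag --force rocksdb_compact_flush_rate_limit_bytes_per_sec 268435456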
Steps
1. Verify flags have been set to default values. List any flags that have been modified. Key flags to review are:
rocksdb_compact_flush_rate_limit_bytes_per_sec=268435456 (256 MB/s; note that this flag is specified in bytes per second)
memstore_size_mb=128 (do not increase past 256 MB)
global_memstore_size_mb_max=2048 or global_memstore_size_percentage=10
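To confirm the values currently in effect, the flags can be read from the tserver web UI, for example (the host is a placeholder and 9000 is the default tserver web UI port):
Example command:
# List the memtable/compaction flags currently in effect on this tserver
curl -s http://<tserver-ip>:9000/varz | grep -E 'rocksdb_compact_flush_rate_limit_bytes_per_sec|memstore_size_mb|global_memstore_size'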
2. Do not change max_write_buffer_number. Engineering does not recommend increasing the default number of write buffers because it can lead to unpredictable results.
3. Verify current disk I/O performance on an idle system
Example command:
sudo yum install fio
mkdir /mnt/d0/fio_test
fio --directory=/mnt/d0/fio_test --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=512MB --numjobs=4 --time_based --runtime=30 --group_reporting --norandommap --iodepth=128
The issue is often with IOPS or throughput (MB/s) limits. Cloud systems have both host-level and disk-level limits, and if bursting is used, performance can also be unpredictable.
Check with iostat, sar, or other monitoring tools and verify the disk performance in detail, as shown below.
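For example, extended per-device statistics can be sampled every few seconds with iostat (the 5-second interval is just a suggestion):
Example command:
# Extended per-device statistics in MB, refreshed every 5 seconds
iostat -dxm 5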
4. Verify fsync latencies
Open two windows: one to tail the tserver log, the other to trace the fsync process:
sudo perf trace -e fsync --duration 39 (--duration reports only fsync calls that take longer than the given number of milliseconds)
If the tserver log file contains many messages such as “Time spent Fsync log took a long time” and they correlate with the fsync latencies seen in perf trace, then you have journal contention and the only option is to add more disks to the YugabyteDB universe.
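A simple way to watch for those messages while perf trace runs in the other window (the log path is an assumption based on a typical YugabyteDB Anywhere node layout):
Example command:
# Follow the tserver INFO log and show only slow-fsync messages
tail -F /home/yugabyte/tserver/logs/yb-tserver.INFO | grep -i fsync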
5. Check system limits, e.g. ulimits. Issues can arise when parameters such as LimitNPROC (the maximum number of processes) are set too low.
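One way to check the limits actually applied to the running tserver process rather than the shell defaults (the pgrep pattern assumes the process is named yb-tserver):
Example command:
# Show the resource limits in effect for the running yb-tserver process
cat /proc/$(pgrep -f yb-tserver | head -1)/limits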
6. Check whether cgroups are in use. If customers attempt to limit process resource usage with cgroups, it can negatively impact YugabyteDB performance.
sudo dmesg -T |grep cgroup
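To see which cgroups the tserver process has been placed in (again assuming the process is named yb-tserver):
Example command:
# Show the cgroup membership of the running yb-tserver process
cat /proc/$(pgrep -f yb-tserver | head -1)/cgroup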
7. Check internal sources: Slack, the YB Support knowledge base, JIRA, and GitHub for additional hints as to what could be wrong.