Environment
- YugabyteDB
Issue
In YugabyteDB, each tablet has its own soft and hard limits for SST (Sorted String Table) files to keep things running smoothly. These limits are configured at the tserver level.
If a tablet’s SST file count goes over the soft limit (24 by default), YugabyteDB starts gradually throttling writes to that tablet. As the count approaches the hard limit (48 by default), a growing share of writes is rejected.
If the hard limit is exceeded, all writes to that tablet are temporarily blocked until the compaction process reduces the number of SST files. Once the SST file count drops below the hard limit, writes are allowed to resume.
This per-tablet system helps prevent performance issues and ensures your database stays reliable, even under heavy workloads.
Additional Information:
- The tserver process logs messages like the following when the SST file limit is exceeded:
W0830 11:47:39.171617 371 tablet_service.cc:344] T ebe6d0386e1a4dbdb53a3de7e8b29321 P f00977277fca4887b11e618e6462c55a: Rejecting Write request, Service unavailable (yb/tserver/tablet_service.cc:343): SST files limit exceeded 34 against (24, 48), score: 0.82560586190452967: 1.211s (tablet server delay 1.211s) [suppressed 5 similar messages]
- The rejections dashboard, located under the DocDB section in YugabyteDB Anywhere or YugabyteDB Managed, displays the count of write rejections.
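The log format above can be read mechanically. The following is a minimal sketch, not an official tool: it assumes the message layout shown in the sample line ("SST files limit exceeded <count> against (<soft>, <hard>)"), and the function name classify_sst_line is hypothetical. It extracts the tablet ID, the current SST file count, and both limits, then labels the tablet as throttled (count above the soft limit) or blocked (count above the hard limit), following the behavior described in the Issue section. How a count exactly equal to a limit is treated is an assumption.

```shell
# Sketch: classify one "SST files limit exceeded" log line.
# Prints "<tablet-id> <sst-count> <soft> <hard> <state>"; returns 1 if the
# line does not match the assumed message layout.
classify_sst_line() {
  # Pull the four values out of the line with a single sed substitution.
  fields=$(printf '%s\n' "$1" | sed -n 's/.*] T \([0-9a-f]\{32\}\) P .*SST files limit exceeded \([0-9][0-9]*\) against (\([0-9][0-9]*\), \([0-9][0-9]*\)).*/\1 \2 \3 \4/p')
  [ -n "$fields" ] || return 1
  # Split the extracted values into positional parameters.
  set -- $fields
  tablet=$1 count=$2 soft=$3 hard=$4
  # Assumption: "exceeded" means strictly greater than the limit.
  if [ "$count" -gt "$hard" ]; then
    state=blocked
  elif [ "$count" -gt "$soft" ]; then
    state=throttled
  else
    state=ok
  fi
  printf '%s %s %s %s %s\n' "$tablet" "$count" "$soft" "$hard" "$state"
}
```

For the sample message above (34 files against limits of 24 and 48), the function reports the tablet as throttled: the count is past the soft limit but has not yet reached the hard limit.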
Resolution
Remediation Steps
In cases where writes are rejected due to exceeding SST file limits and you need to quickly restore write operations, you can follow these steps:
- Increase the values of the GFlags below to raise the soft limit and hard limit for SST files per tablet.
sst_files_soft_limit
sst_files_hard_limit
Please note that raising these limits is only a temporary solution. In most cases, it is essential to identify and address the root causes of SST file buildup to prevent recurrence.
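As an illustration of raising the two GFlags on a self-managed cluster, the sketch below prints (but does not run) one yb-ts-cli set_flag command per flag and per tserver so the commands can be reviewed first. The helper name print_sst_limit_cmds, the values 36 and 72, and the address 127.0.0.1:9100 are assumptions for illustration only. Whether these flags can be changed at runtime on your version, and whether set_flag needs --force to do so, should be verified before use; values set this way are in memory only and are lost on restart unless the tserver configuration is also updated. On YugabyteDB Anywhere or YugabyteDB Managed, GFlags are normally changed through the platform instead.

```shell
# Sketch: print the yb-ts-cli commands that would raise both SST limits
# on each listed tserver. Nothing is executed against the cluster.
# Usage: print_sst_limit_cmds SOFT HARD HOST:PORT [HOST:PORT...]
print_sst_limit_cmds() {
  soft=$1 hard=$2
  shift 2
  # Sanity check: throttling only makes sense if soft < hard.
  if [ "$soft" -ge "$hard" ]; then
    echo "soft limit ($soft) must be lower than hard limit ($hard)" >&2
    return 1
  fi
  for addr in "$@"; do
    # Raise the hard limit first so soft stays below hard at every step.
    echo "yb-ts-cli --server_address=$addr set_flag sst_files_hard_limit $hard"
    echo "yb-ts-cli --server_address=$addr set_flag sst_files_soft_limit $soft"
  done
}

# Example with illustrative values (not a recommendation):
print_sst_limit_cmds 36 72 127.0.0.1:9100
```

Printing the commands rather than executing them keeps the change reviewable, which matters because raising the limits is only a stopgap.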
Why can't we set very high limits or unlimited SST files?
Setting very high limits or unlimited SST files can lead to several issues:
- Increased Latency: More SST files mean more files to search through during read operations, which can significantly increase read latency.
- Higher Resource Consumption: Managing a large number of SST files can consume more CPU and memory resources, potentially leading to overall performance degradation.
In summary, sst_files_soft_limit and sst_files_hard_limit are critical safeguards for both write and read performance, as well as overall system health. Setting them very high or to unlimited would remove these protections and could lead to severe performance and stability issues.
Why do SST file limits get exceeded?
- High Write Throughput: A sudden spike in write operations can lead to rapid accumulation of SST files before the compaction process can keep up.
- Disk I/O Bottlenecks: Slow disk performance can delay compaction, causing SST files to pile up.
- CPU Bottlenecks: Limited CPU resources can slow down the compaction process, leading to an increase in SST files.
- Inadequate Compaction Configuration: Default compaction settings may not be optimal for all workloads, leading to inefficient SST file management.
- Inadequate TTL File Expiry Configuration: If TTL file expiry is not configured properly, it can lead to the accumulation of obsolete data, increasing the number of SST files.
- Hot Tablets: Certain tablets may receive a disproportionate amount of write traffic, leading to localized SST file accumulation.
- Memory Pressure: YugabyteDB relies on buffered I/O to ensure fast writes. When the node comes under memory pressure, the OS may be forced to start performing synchronous writes, which can slow down flushes and compactions and allow SST files to accumulate.
Mitigation Steps
High Write Throughput
- Implement rate limiting or reduce the batch size of write operations to allow the compaction process to keep up.
- Scale out your cluster by adding more nodes to distribute the write load.
Disk I/O Bottlenecks
- Increase the IOPS capacity of your storage disks.
- Use faster storage solutions such as SSDs or NVMe drives.
CPU Bottlenecks
- Upgrade to more powerful CPUs or increase the number of CPU cores in your nodes.
- Scale out your cluster to distribute the CPU load.
Inadequate Compaction Configuration
- Consider optimizing background compaction. See the YugabyteDB documentation on background compactions.
- Consider scheduling full compactions. See the YugabyteDB documentation on scheduled full compactions.
Inadequate TTL File Expiry Configuration
- Review and adjust the TTL settings to ensure that obsolete data is being purged effectively. See the YugabyteDB documentation on TTL file expiry.
- Avoid running full compactions on tables with TTL enabled.
Hot Tablets
- Identify hot tablets and redistribute the load by splitting them.
Important commands
Find tablets exceeding SST file limits
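Run the following from the tserver log directory. Each output line shows the number of logged rejection messages followed by the tablet ID:
grep -r "tablet_service.cc.*SST files limit exceeded" . | sed 's/.*tablet_service.cc.*] T //' | sed 's/ P [0-9a-f].*//' | sort | uniq -c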
9112 04eb37a1f65c4cd5802ed846d371e207
251 07e925c9a1f4478bbfa1e56bc7025405
242 097709db61b04b77ac26fdc9ebbb1892
1396 0ea3cd8325c24507bed3246c5244cddd
80 117337c299934f83af232495fe55c4c9
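The count alone does not show how close each tablet came to the hard limit. The sketch below is one possible extension, assuming the log message layout shown in the Issue section; the function name sst_rejections_summary is hypothetical. For each tablet it reports the number of logged rejection messages and the highest SST file count seen, sorted with the most-rejected tablet first. Because the tserver suppresses repeated messages (note the "[suppressed 5 similar messages]" suffix in the sample log line), the counts are a lower bound on the actual number of rejections.

```shell
# Sketch: summarize rejections per tablet from tserver logs.
# Prints "<tablet-id> <logged-rejections> <max-sst-count>", busiest first.
# Usage: sst_rejections_summary LOG_FILE_OR_DIR...
sst_rejections_summary() {
  grep -rh "SST files limit exceeded" "$@" | awk '
    {
      tablet = ""; count = ""
      for (i = 1; i < NF; i++) {
        # The tablet ID is the field right after the first standalone "T".
        if ($i == "T" && tablet == "") tablet = $(i + 1)
        # The SST count sits between "exceeded" and "against".
        if ($i == "exceeded" && $(i + 2) == "against") count = $(i + 1) + 0
      }
      if (tablet == "" || count == "") next
      n[tablet]++
      if (count > max[tablet]) max[tablet] = count
    }
    END { for (t in n) print t, n[t], max[t] }
  ' | sort -k2,2nr
}
```

A tablet whose maximum count sits near the hard limit is a stronger candidate for the hot-tablet and compaction checks in the Mitigation Steps than one that only briefly crossed the soft limit.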