Environment
- YugabyteDB Anywhere - 2.20.x, 2024.x
- Linux Host Operating System Version:
  - RHEL 8+ (kernel 4.18)
  - AlmaLinux 8+ (kernel 4.18)
Issue
Unreleased Resident Set Size (RSS) memory can cause tserver nodes to crash when the operating system runs out of memory (OOM), or to become unresponsive and require a reboot of the host.
Resolution
Overview
This article provides guidance on modifying Transparent Huge Pages (THP) settings using a systemd-based approach to mitigate excessive RSS memory usage and reduce the risk of tserver instability due to memory pressure.
Systemd-based Modification of THP Settings
Follow the steps outlined in the documentation to deploy a systemd-based method of applying the appropriate Transparent Huge Page settings.
Reference Information
How to Review THP Settings
1. To view the current Transparent Huge Pages (THP) settings, run the following command from the OS command line:
find /sys/kernel/mm/transparent_hugepage/ -type f -exec sh -c 'printf "%-70s %s\n" "$1" "$(cat "$1")"' _ {} \;
The output will look similar to the following (with potential differences in reported values):
/sys/kernel/mm/transparent_hugepage/defrag always defer defer+madvise [madvise] never
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag 1
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs 10000
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none 511
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan 4096
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap 64
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs 60000
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed 167670
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans 778
/sys/kernel/mm/transparent_hugepage/enabled [always] madvise never
/sys/kernel/mm/transparent_hugepage/use_zero_page 1
/sys/kernel/mm/transparent_hugepage/shmem_enabled always within_size advise [never] deny force
/sys/kernel/mm/transparent_hugepage/hpage_pmd_size 2097152
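In the multi-option files, the value currently in effect is the one shown in square brackets. As a minimal sketch, the following helper prints only the active value for the three settings discussed in this article. The function name `thp_active` and the `THP_DIR` override are illustrative and not part of any YugabyteDB tooling.

```shell
#!/bin/sh
# Print the active value of a THP sysfs file.
# Multi-option files mark the active choice with [brackets];
# numeric files (e.g. max_ptes_none) are printed unchanged.
thp_active() {
    # $1: path to the sysfs file
    if grep -q '\[' "$1"; then
        sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
    else
        cat "$1"
    fi
}

# THP_DIR can be overridden, e.g. to try the helper against a copy of the files.
THP_DIR="${THP_DIR:-/sys/kernel/mm/transparent_hugepage}"
for f in enabled defrag khugepaged/max_ptes_none; do
    if [ -r "$THP_DIR/$f" ]; then
        printf '%s=%s\n' "$f" "$(thp_active "$THP_DIR/$f")"
    fi
done
```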
To avoid excessive RSS growth on newer kernels, we recommend setting max_ptes_none=0 and defrag=defer+madvise when THP is enabled, which minimizes unnecessary THP allocations. This also aligns with TCMalloc's recommendations for systems with THP enabled. See https://google.github.io/tcmalloc/tuning.html#system-level-optimizations for details. With these settings applied, the relevant files read as follows:
/sys/kernel/mm/transparent_hugepage/enabled:
[always] madvise never
/sys/kernel/mm/transparent_hugepage/defrag:
always defer [defer+madvise] madvise never
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:
0
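For a quick test on a single node, the two recommended values can be written to sysfs directly. The sketch below is an illustration only, not the documented deployment method. The function name `apply_thp_settings` is hypothetical, writes to sysfs require root, and the change does not persist across a reboot, which is why the systemd-based method remains the approach to follow for a permanent change.

```shell
#!/bin/sh
# Write the recommended THP values under a given directory.
# $1: base directory (defaults to the live sysfs location).
# The "enabled" file is deliberately left untouched.
apply_thp_settings() {
    dir="${1:-/sys/kernel/mm/transparent_hugepage}"
    echo 'defer+madvise' > "$dir/defrag" || return 1
    echo 0 > "$dir/khugepaged/max_ptes_none" || return 1
}

# Usage (as root), after sourcing this file:
#   apply_thp_settings
```

Passing a scratch directory as the first argument lets the function be tried without touching the live kernel settings.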
How to Identify Approximate Excess RSS Memory (PromQL)
When analyzing a live issue, run the following PromQL query in the Prometheus UI (see the documentation for Prometheus access) over a period of 2 weeks (the default metrics retention window):
- Issue the following query, which gives an approximate percentage of memory which may be reclaimable.
- The NODEPREFIX value can be found on the Nodes tab for the Universe. For example, if there is a node called yb-prod-appname-n1, the correct prefix value for this Universe is yb-prod-appname.
100 -
(
  sum by (exported_instance) (
    avg_over_time(
      {node_prefix="NODEPREFIX",
       saved_name=~"node_memory_(Cached|Buffers|MemFree)_bytes"}[1m]
    )
    or
    avg_over_time(
      {node_prefix="NODEPREFIX", export_type=~"tserver_export",
       saved_name=~"(generic_current_allocated_bytes|tcmalloc_pageheap_free_bytes)"}[1m]
    )
    or
    avg_over_time(
      {node_prefix="NODEPREFIX", export_type=~"master_export",
       saved_name=~"(generic_current_allocated_bytes|tcmalloc_pageheap_free_bytes)"}[1m]
    )
  )
  /
  sum by (exported_instance) (
    avg_over_time(
      node_memory_MemTotal_bytes{node_prefix="NODEPREFIX"}[1m]
    )
  )
  * 100
)
IMPORTANT: The results of the query are only an approximation of excessive RSS.
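If the Prometheus UI is inconvenient, the same query can in principle be sent to the Prometheus HTTP API (`/api/v1/query_range`). The sketch below rests on several assumptions: `PROM_URL` is a placeholder host and port, `query.promql` is a hypothetical local file containing the query above with NODEPREFIX already substituted, and a one-hour step is an arbitrary choice for a two-week window.

```shell
#!/bin/sh
# Placeholder endpoint; replace with the Prometheus address for your installation.
PROM_URL="${PROM_URL:-http://yba-host.example:9090}"

# Print the start of a window, in epoch seconds.
# $1: window length in days; $2: end of the window in epoch seconds.
window_start() {
    echo "$(( $2 - $1 * 86400 ))"
}

# Run the query over the last 14 days at a 1-hour step.
# QUERY_FILE (hypothetical) holds the PromQL expression.
query_excess_rss() {
    end=$(date +%s)
    start=$(window_start 14 "$end")
    curl -sG "$PROM_URL/api/v1/query_range" \
        --data-urlencode "query@${QUERY_FILE:-query.promql}" \
        --data-urlencode "start=$start" \
        --data-urlencode "end=$end" \
        --data-urlencode "step=3600"
}

# Usage, after sourcing this file:
#   query_excess_rss > excess_rss.json
```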
How to Identify Excess RSS Memory (Tserver memz endpoint)
A more accurate way to identify excess RSS used by the tserver is to consult the T-Server admin UI of a node that may be affected. This method assumes the node is currently affected and has not yet been restarted or rebooted.
Browse to Utilities > Total Memory and compare two values:
Actual memory used (physical + swap)
vs.
TOTAL: ( MiB) Bytes resident (physical memory used)
The example below shows a comparison of the two values and gaps which can emerge between the two.
In the above example, the excess RSS is ~8 GB and can be reclaimed after applying the THP remediation discussed earlier.
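The comparison is a simple subtraction. As a sketch, assuming the excess is taken as bytes resident minus actual memory used, the helper below converts the gap to GiB. The function name `excess_rss_gib` and the two byte counts are hypothetical values chosen for illustration, not figures from a real node.

```shell
#!/bin/sh
# Print the gap between resident bytes and actual memory used, in GiB.
# $1: "Bytes resident (physical memory used)", in bytes
# $2: "Actual memory used (physical + swap)", in bytes
excess_rss_gib() {
    awk -v rss="$1" -v used="$2" \
        'BEGIN { printf "%.1f\n", (rss - used) / (1024 ^ 3) }'
}

# Hypothetical example: 20 GiB resident vs. 12 GiB actually used.
excess_rss_gib 21474836480 12884901888   # prints 8.0
```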