Environment
- YugabyteDB Anywhere - All versions
Issue
YugabyteDB Anywhere reports health check failures, but the alerts appear to be false positives.
For example, YugabyteDB Anywhere reports a health check failure with the following message:
Fatal log files for yb-master process
yb-node-cluster-dev-sql-n6 -
Fatal log files (yb-master) - 'Error executing command timeout 20 bash -c 'set -o pipefail; find /home/yugabyte/master/logs/ -mmin -12 -name "*FATAL*" -type f -printf "%T@ %p\n"':'
However, the master log directory contains no new FATAL files, so the alert appears to be a false positive.
Resolution
Overview
In this scenario, no new FATAL files were created. The health check fails because the find command that looks for FATAL files took longer than its 20-second limit and timed out. To determine the root cause, perform the steps below:
Steps
1. Check whether any new FATAL files are present in the log directory.
2. If no FATAL files were created recently, one possible cause of the health check failure is that listing the directory takes a long time. This can happen when the log directory contains a large number of files. To confirm, list the logs directory and check how long the listing takes and how many entries it returns.
time ls -ltrh /home/yugabyte/master/logs/ | wc -l
3. If the listing is slow and the directory contains a large number of files, manually clean up the old log files. This situation can arise when the master or tserver process was stuck in a crash loop and generated a large number of log files. Those files are not removed automatically because they are not yet old enough to meet the cleanup criteria. After the manual cleanup, check whether subsequent health checks still fail.
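The steps above can be sketched as a small shell script. This is a minimal illustration and not part of YugabyteDB Anywhere: the function names (check_fatal_files, count_log_files, list_old_logs) are hypothetical, the log path is taken from the example message above, and the 7-day age threshold is an assumed value to adjust as needed. It relies on GNU find and on the coreutils timeout command, which exits with status 124 when the time limit is reached. The cleanup helper only lists candidate files and does not delete anything.

```shell
#!/usr/bin/env bash
# Assumption: log directory taken from the health check message above.
LOG_DIR="${LOG_DIR:-/home/yugabyte/master/logs/}"

# Step 1: re-run the health check's find under the same time limit.
# coreutils `timeout` exits with status 124 when the limit is hit.
check_fatal_files() {
  local dir="$1" limit="${2:-20}" rc
  timeout "$limit" find "$dir" -mmin -12 -name "*FATAL*" -type f -printf "%T@ %p\n"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "find timed out after ${limit}s" >&2
  fi
  return "$rc"
}

# Step 2: count the regular files directly inside the directory.
count_log_files() {
  find "$1" -maxdepth 1 -type f | wc -l
}

# Step 3: list (do not delete) files older than the given number of days.
list_old_logs() {
  find "$1" -maxdepth 1 -type f -mtime +"$2" -print
}

# Run the checks only when the directory exists on this node.
if [ -d "$LOG_DIR" ]; then
  start=$(date +%s)
  check_fatal_files "$LOG_DIR"
  end=$(date +%s)
  echo "find took $((end - start))s"
  echo "file count: $(count_log_files "$LOG_DIR")"
  echo "cleanup candidates (older than 7 days, assumed threshold):"
  list_old_logs "$LOG_DIR" 7
fi
```

If the find step reports a timeout and the file count is very large, review the listed candidates before removing anything by hand.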