Issue:
Unable to connect to Yugabyte DB. Below are some possible scenarios:
- Unable to connect to YCQL or YSQL
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.26.131.249:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.26.131.249:9042] Cannot connect), /10.26.132.230:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.26.132.230:9042] Cannot connect), /10.26.133.113:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.26.133.113:9042] Cannot connect))
- Yugaware shows nodes as unreachable.
- Application requests to the DB nodes are failing.
Connection error: ('Unable to connect to any servers', {'10.88.16.74': ConnectionException('Failed to initialize new connection to 10.88.16.74: Error from server: code=1001 [Coordinator node overloaded] message=""',)})
Solution:
1. Check the OS for general health
As the root user, check the output of dmesg -T or /var/log/messages to see whether the OOM killer is killing the process or whether any disk issues are being reported.
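The OOM-killer check can be scripted. The sketch below is one possible filter, not an official Yugabyte tool: the function name find_oom_events is hypothetical, and the patterns are an assumption based on messages commonly emitted by the Linux kernel (for example "Out of memory: Killed process ..." and "invoked oom-killer"). The exact wording can vary by kernel version.

```shell
# Print kernel log lines that suggest the OOM killer ran.
# Reads log text on stdin, so it works with dmesg or /var/log/messages.
# "|| true" keeps the exit status at 0 when nothing matches.
find_oom_events() {
  grep -i -E 'out of memory|oom-killer|oom-kill|oom_kill' || true
}

# Example usage (run as root):
#   dmesg -T | find_oom_events
#   find_oom_events < /var/log/messages
```

If the output names yb-tserver or yb-master as the killed process, memory pressure on the node is a likely cause of the connection failures.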
Check basic system stats (CPU count, total memory, root disk usage) as a first step in judging whether the servers are CPU/IO bound:
echo -n "CPUs: "; grep -c processor /proc/cpuinfo; echo -n "Mem: "; free -h | grep Mem | tr -s " " | cut -d" " -f 2; echo -n "Disk: "; df -h / | grep -v Filesystem
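The one-liner above reports capacity rather than current load. As a rough complement, the sketch below compares the 1-minute load average with the CPU count. The function name load_exceeds_cpus is hypothetical, and "load above CPU count" is only a rule of thumb assumed here; on Linux the load average also counts tasks waiting on I/O, so a high value can point to either CPU or disk pressure.

```shell
# Return 0 (true) when the load average is higher than the CPU count.
load_exceeds_cpus() {
  load="$1"
  cpus="$2"
  # "+ 0" forces a numeric comparison inside awk.
  awk -v l="$load" -v c="$cpus" 'BEGIN { exit !(l + 0 > c + 0) }'
}

# Example on a Linux host: compare the 1-minute load average with the CPU count.
if [ -r /proc/loadavg ] && [ -r /proc/cpuinfo ]; then
  if load_exceeds_cpus "$(cut -d' ' -f1 /proc/loadavg)" "$(grep -c '^processor' /proc/cpuinfo)"; then
    echo "Load average is above the CPU count; inspect the node with top, vmstat, or iostat"
  fi
fi
```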
2. Check disk space
Run df -h to confirm that the data directory mount for the DB is not full, i.e., at 100%.
In the df -h output, /mnt/data0 holds the database data. If /mnt/data0 is 100% full, there is a critical issue. As a general rule, Yugabyte disk usage should stay under about 75% to ensure uptime in the event of a node failure.
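The 75% guideline can be checked mechanically. The following is a minimal sketch, assuming POSIX-format output from df -P; the function name mounts_over_threshold is hypothetical, and mount points containing spaces are not handled.

```shell
# Print "mount usage%" for every filesystem above a usage threshold (default 75).
# Expects `df -P` output on stdin; -P keeps each filesystem on a single line.
mounts_over_threshold() {
  threshold="${1:-75}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign from the Capacity column
    if (use + 0 > t + 0) print $6, $5
  }'
}

# Example usage:
#   df -P | mounts_over_threshold 75
```

An empty result means no mount is above the threshold. Any line naming the data mount (/mnt/data0 in this article) calls for freeing space.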
In case of high disk utilization, you can free up disk space in some of the following locations:
- Logs directory - symlinked at /home/yugabyte/{tserver,master}/logs
- Cores - symlinked at /home/yugabyte/cores
- Drop unnecessary tables or data from the DB
Please contact Yugabyte support for further assistance regarding disk utilization issues.
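To see what is taking up space in the logs and cores directories listed above, a listing of the largest files can help. This is a sketch under stated assumptions: the function name largest_files is hypothetical, and it sorts on the byte-size column of ls -ln rather than on allocated blocks.

```shell
# List the N largest files under a directory, biggest first.
# -L makes find follow symlinks, since the logs and cores directories are symlinked.
# Output lines are in `ls -ln` format; column 5 is the file size in bytes.
largest_files() {
  dir="$1"
  count="${2:-10}"
  find -L "$dir" -type f -exec ls -ln {} + 2>/dev/null | sort -k5,5nr | head -n "$count"
}

# Example usage:
#   largest_files /home/yugabyte/tserver/logs 10
#   largest_files /home/yugabyte/cores 5
```

Review the listing before deleting anything; recent logs and core files may be needed by Yugabyte support.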
3. Check ulimits
A ulimit error might be reported in the /home/yugabyte/tserver/logs/yb-tserver.FATAL log:
F20200622 20:32:03 ../../src/yb/consensus/log_util.cc:164] Check failed: _s.ok() Bad status: IO error (yb/util/env_posix.cc:1482): /app/yb-data/tserver/wals/table-b8b980327b404800bac6521d1d202568/tablet-ce1ecbaa1b29470da2e04cd3f3f6c20d/wal-000000008: Too many open files (system error 24)
    @ 0x7f2acf36ec0c yb::LogFatalHandlerSink::send()
    @ 0x7f2ace556346 google::LogMessage::SendToLog()
    @ 0x7f2ace5537aa google::LogMessage::Flush()
    @ 0x7f2ace556879 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f2ad733f314 yb::log::ReadableLogSegment::ReadableLogSegment()
    @ 0x7f2ad733fa43 yb::log::ReadableLogSegment::Open()
    @ 0x7f2ad7360974 yb::log::LogReader::Init()
    @ 0x7f2ad7361660 yb::log::LogReader::Open()
    @ 0x7f2ad734d8fa yb::log::Log::Init()
    @ 0x7f2ad734e2b4 yb::log::Log::Open()
    @ 0x7f2ad797b5aa yb::tablet::TabletBootstrap::OpenNewLog()
    @ 0x7f2ad798180d yb::tablet::TabletBootstrap::PlaySegments()
    @ 0x7f2ad79838c8 yb::tablet::TabletBootstrap::Bootstrap()
    @ 0x7f2ad798b23f yb::tablet::BootstrapTablet()
    @ 0x7f2ad829681d yb::tserver::TSTabletManager::OpenTablet()
    @ 0x7f2acf4014a4 yb::ThreadPool::DispatchThread()
    @ 0x7f2acf3fde2f yb::Thread::SuperviseThread()
    @ 0x7f2ac9c34694 start_thread
    @ 0x7f2ac937141d __clone
    @ (nil) (unknown)
The yb-tserver.INFO log in the /home/yugabyte/tserver/logs directory records the user limit information at startup. Below is a snippet from the log:
I0326 18:17:39.587894 6300 tablet_server_main.cc:196] ulimit cur(max)...
ulimit: core file size unlimited(unlimited) blks
ulimit: data seg size unlimited(unlimited) kb
ulimit: open files 1048576(1048576)
ulimit: file size unlimited(unlimited) blks
ulimit: pending signals 63251(63251)
ulimit: file locks unlimited(unlimited)
ulimit: max locked memory 64(64) kb
ulimit: max memory size unlimited(unlimited) kb
ulimit: stack size 8192(unlimited) kb
ulimit: cpu time unlimited(unlimited) secs
ulimit: max user processes 12000(12000)
The recommended settings for the user limits are documented here. Verify that the values reported in the logs match the recommended values.
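The open-files value can be pulled out of the log and compared with a required minimum. The sketch below assumes the "open files cur(max)" log format shown in the snippet above. Both function names are hypothetical, and the 1048576 used in the example is simply the value from that snippet, so confirm it against the recommended settings before relying on it.

```shell
# Extract the current (soft) "open files" limit from yb-tserver.INFO text on stdin.
# If the log contains several startups, the most recent entry wins.
open_files_limit_from_log() {
  grep -o 'open files [0-9a-z]*([0-9a-z]*)' | tail -n 1 |
    sed -E 's/open files ([0-9a-z]+)\(.*/\1/'
}

# Return 0 when a limit is "unlimited" or at least the required value.
limit_is_at_least() {
  current="$1"
  required="$2"
  [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

# Example usage:
#   cur=$(open_files_limit_from_log < /home/yugabyte/tserver/logs/yb-tserver.INFO)
#   limit_is_at_least "$cur" 1048576 || echo "open files limit too low: $cur"
```

On Linux, the limits of an already running process can also be read from /proc/<pid>/limits, which avoids depending on the log format.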
4. Check the health of the tserver
Check if the tserver process is up and running:
$ ps -ef | grep tserver | grep -v grep
yugabyte 6300 1 22 Mar26 ? 1-07:39:02 /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf
Check if the master process is up and running, if this is a node which should run a master:
$ ps -ef | grep master | grep -v grep
yugabyte 6102 1 0 Mar26 ? 00:26:24 /home/yugabyte/master/bin/yb-master --flagfile /home/yugabyte/master/conf/server.conf
If the tserver/master process is not running, check crontab -l for the appropriate path, run the /home/yugabyte/bin/yb-server-ctl.sh tserver start or /home/yugabyte/bin/yb-server-ctl.sh master start command, and check whether it reports an error.
[yugabyte@yb-demo]$ crontab -l
#Ansible: cleanup core files hourly
0 * * * * /home/yugabyte/bin/clean_cores.sh
#Ansible: cleanup yb log files hourly
5 * * * * /home/yugabyte/bin/zip_purge_yb_logs.sh
#Ansible: Check liveness of master
*/1 * * * * /home/yugabyte/bin/yb-server-ctl.sh master cron-check || /home/yugabyte/bin/yb-server-ctl.sh master start
#Ansible: Check liveness of tserver
*/1 * * * * /home/yugabyte/bin/yb-server-ctl.sh tserver cron-check || /home/yugabyte/bin/yb-server-ctl.sh tserver start
If the command reports the following, i.e., the process appears to be running but the DB is not accessible:
$ /home/yugabyte/bin/yb-server-ctl.sh tserver start
yb-tserver already running
Check the PID in /home/yugabyte/tserver/logs/yb-tserver.pid and verify that it matches the PID reported in the ps output.
$ cat /home/yugabyte/tserver/logs/yb-tserver.pid
17027
If the two values differ, stop the running process, remove the yb-tserver.pid file, and run the start command again.
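The PID file check can also be scripted. The sketch below only tests whether the PID recorded in the file belongs to a live process; it does not confirm that the process is yb-tserver, so compare against the ps output as well. The function name pid_file_is_stale is hypothetical. It relies on kill -0, which sends no signal but can fail for processes owned by another user, so run it as the yugabyte user or as root.

```shell
# Return 0 when the PID stored in a pid file does NOT belong to a live process.
# Return 2 when the pid file is missing or unreadable (nothing to decide).
pid_file_is_stale() {
  pid_file="$1"
  [ -r "$pid_file" ] || return 2
  pid=$(tr -d '[:space:]' < "$pid_file")
  # An empty or non-numeric value cannot refer to a process: treat it as stale.
  case "$pid" in
    ''|*[!0-9]*) return 0 ;;
  esac
  # kill -0 succeeds only if the process exists and can be signalled.
  ! kill -0 "$pid" 2>/dev/null
}

# Example usage:
#   if pid_file_is_stale /home/yugabyte/tserver/logs/yb-tserver.pid; then
#     echo "pid file is stale"
#   fi
```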
5. Check tserver heartbeat status to master leader in UI:
- Log in to the Yugaware platform portal.
- Navigate to <Universe name> -> Nodes.
- Click on (Leader) link. It will launch the Master leader UI.
- On the Master leader UI, go to Tablet Servers tab and check the below stats:
  - Time since heartbeat: Indicates whether the tservers are able to report to the master leader. The interval should stay low; a value that keeps growing means the tserver is not reaching the master leader.
  - Status & Uptime: The tservers should be reported as ALIVE.
6. Check the log files for relevant messages and provide them to support
Logs are stored at the following locations on the servers:
- yb-tserver logs: /home/yugabyte/tserver/logs/ : Used for YCQL connection issues, including SSL/TLS; storage or DocDB layer issues; trouble with Raft consensus at the tablet layer.
- Postgres logs: under /home/yugabyte/tserver/logs/postgresql-[date].log : Used for YSQL connection issues; long-running SQL queries; authentication and SSL issues with Postgres.
- yb-master logs: /home/yugabyte/master/logs/ : Used for consensus issues and schema mismatch issues; the master stores all catalog information for Yugabyte DB.
- stdout and stderr logs: /home/yugabyte/{tserver,master}/{tserver,master}.{err,out} : Used for tserver and master startup issues, such as configuration problems. If there is a panic, the stack trace is written to the {tserver,master}.err file.
- Core files: /home/yugabyte/cores : Used when you need more data than the stack trace in {tserver,master}.err provides; the full core file is available here.
- Fatal logs: /home/yugabyte/{master,tserver}/logs/yb*FATAL.log : Used for issues that cause a FATAL error and a master or tserver restart.
- Configuration file: /home/yugabyte/{master,tserver}/conf/server.conf
- SOS reports if the OS is RHEL: https://access.redhat.com/solutions/3592
Please contact Yugabyte support for further troubleshooting with the logs mentioned above. It is critical that the collected logs cover the time of the issue.
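To gather recent logs for support in one step, something like the following can be used. This is a sketch, not an official collection tool: the function name collect_recent_logs and the output path are hypothetical, and it assumes a find that supports -mmin and a tar that supports -T - (both GNU and BSD versions do). File names containing newlines are not handled.

```shell
# Archive files under a directory that were modified within the last N minutes.
# -L follows symlinks, since the logs directories are symlinked.
collect_recent_logs() {
  src_dir="$1"
  minutes="$2"
  out_tar="$3"
  find -L "$src_dir" -type f -mmin "-$minutes" | tar -czf "$out_tar" -T -
}

# Example usage: tserver logs written in the last two hours.
#   collect_recent_logs /home/yugabyte/tserver/logs 120 /tmp/tserver-logs.tar.gz
```

Choose the window so that it reaches back past the start of the incident. If the issue happened earlier, widen the window or copy the specific log files for that period instead.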