A YugaByteDB cluster consists of multiple nodes. Each node may host one or more services, and since it's a distributed system, these services all need to talk to each other. However, the cluster may be deployed in an environment that requires network configuration. If that configuration is wrong, or for some other reason, a service on one node may be unable to reach a service on another node, which causes issues in the YB cluster.
This article describes how to identify where a network issue lies using various Linux tools.
The ports on which the various services listen are listed in the official docs here.
Some basic commands that come preinstalled on a Linux node, or can be installed easily if they are missing, are telnet, netcat (nc), traceroute, and netstat.
Let's look at them one by one:
The telnet command is used to check whether a particular port on a particular IP address accepts connections. The basic syntax is telnet <IP address> <Port>. For example, to check whether SSH port 22 on a machine with IP address 126.96.36.199 can be reached, the command would be telnet 126.96.36.199 22
This command attempts to open a TCP connection to port 22 of the machine with IP address 126.96.36.199. If it works, you will see messages related to SSH. If not, you will see a message like the following:
$ telnet 126.96.36.199 22
telnet: Unable to connect to remote host: Connection timed out
So if a Master is not able to connect to a TServer, which listens on port 9100, the command to check connectivity from the Master to port 9100 of the TServer is
telnet <IP of TServer> 9100
If telnet is not installed, you can install it with sudo apt install telnet on an Ubuntu machine, or yum install telnet on a CentOS/RHEL box. If you cannot install it, you can try the netcat command, which is described next in this article.
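Where neither tool can be installed but Python is available, the same TCP check can be approximated with the standard socket module. The following is a minimal sketch, not part of any YugaByteDB tooling: the function name probe and the returned labels are assumptions chosen for illustration.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection, as telnet would, and classify the outcome.

    The name `probe` and the returned strings are illustrative choices.
    """
    try:
        # create_connection performs the same TCP handshake telnet attempts
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on that port
        return "refused"
    except socket.timeout:
        # No answer within the timeout, often packets being silently dropped
        return "timed out"
    except OSError as exc:
        # Anything else, for example no route to host or a DNS failure
        return f"error: {exc}"
```

Calling probe("<IP of TServer>", 9100) from the Master node mirrors the telnet check above. A "refused" result usually means the host is reachable but nothing is listening on the port, while "timed out" matches the telnet message shown earlier.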
The netcat command is similar to telnet. It is also used to check connectivity between two machines, more specifically from one machine to a port on another machine. The basic syntax of this command is netcat <IP address of remote machine> <Port>.
For example, to check whether port 80 on a remote machine with IP address 18.104.22.168 is open, the command would be nc -zv 18.104.22.168 80. Here -z only tests whether the port accepts a connection without sending any data, and -v prints the result.
Note that this command can also be invoked as netcat, so the above command can also be issued as netcat -zv 18.104.22.168 80.
One advantage of netcat over telnet is that it is more likely to be present already on a newly deployed system. If netcat (or nc) is not installed, you can install it with sudo apt install netcat on Ubuntu, or yum install nc on a CentOS/RHEL box.
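When several ports have to be checked against the same node, repeating nc -zv by hand gets tedious. The sketch below loops over a list of ports in Python and reports which ones accept a TCP connection. Only port 9100 comes from this article; the other entries in PORTS are assumed defaults and should be verified against the official docs. The names PORTS and check_ports are hypothetical.

```python
import socket

# 9100 is taken from the article; the remaining ports are assumptions
# and must be checked against the official documentation.
PORTS = {
    7100: "Master RPC (assumed)",
    9100: "TServer RPC",
    5433: "YSQL (assumed)",
    9042: "YCQL (assumed)",
}


def check_ports(host, ports, timeout=2.0):
    """Return {port: True/False} for TCP reachability over IPv4."""
    results = {}
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            # connect_ex returns 0 on success and an errno value otherwise
            results[port] = sock.connect_ex((host, port)) == 0
        except OSError:
            # For example a timeout or a name-resolution failure
            results[port] = False
        finally:
            sock.close()
    return results
```

For instance, check_ports("<IP of TServer>", PORTS) returns a dictionary keyed by port number, which can be printed or compared across nodes to spot the one port that is blocked.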
The traceroute command traces the network path (hops) between two hosts. It can tell you up to which hop connectivity works and from which hop onwards there is a problem. It comes in handy when troubleshooting network issues caused by a firewall sitting between two hosts.
The basic syntax of this command is traceroute <hostname/IP address>.
For example, to check connectivity between your local machine and the machine with IP 188.8.131.52, the command would be traceroute 188.8.131.52. If there is a connectivity issue, the output would look like the following:
traceroute to 188.8.131.52 (188.8.131.52), 30 hops max, 60 byte packets
1 ec2-4-23-11-5.us-west-6.computer.aws.com (22.214.171.124) 1.406 ms 1.875 ms 7.134 ms
2 * * *
3 * * *
4 * * *
5 * * *
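The asterisks denote that no response was received from hop 2 onwards, so the issue may lie at or beyond that hop. This information is handy when escalating the issue, since it lets you pinpoint where the problem probably is. Keep in mind that some routers simply do not answer traceroute probes, so asterisks alone do not prove that traffic is blocked.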
Another possible output would be the following:
$ traceroute 126.96.36.199
traceroute to 126.96.36.199 (126.96.36.199), 64 hops max, 52 byte packets
1 192.168.0.1 (192.168.0.1) 11.636 ms 1.349 ms 1.316 ms
2 220.127.116.11.actcorp.in (18.104.22.168) 2.622 ms 2.604 ms 3.652 ms
3 broadband.actcorp.in (22.214.171.124) 3.051 ms 2.571 ms 2.397 ms
4 126.96.36.199.static-bangalore.vsnl.net.in (188.8.131.52) 2.845 ms 2.947 ms 2.767 ms
5 172.17.169.202 (172.17.169.202) 10.562 ms 9.976 ms 10.229 ms
6 * * *
7 * * *
8 * * *
Here we can see that connectivity definitely exists up to hop 5 and that nothing responds after it. This also lets you pinpoint where the probable issue is.
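Reading the last responding hop off a long trace by eye is error-prone, so it can help to extract it programmatically. The following is a minimal sketch that assumes the plain-text layout shown in the examples above (hop number first, then either host details or asterisks). The function name last_responding_hop is hypothetical, and the sample reuses a few lines from the second trace.

```python
def last_responding_hop(traceroute_output):
    """Return the number of the last hop that answered, or None if none did."""
    last = None
    for line in traceroute_output.splitlines():
        fields = line.split()
        # Hop lines start with the hop number; skip headers and prompt lines
        if not fields or not fields[0].isdigit():
            continue
        # A hop answered if any field after its number is not an asterisk
        if any(field != "*" for field in fields[1:]):
            last = int(fields[0])
    return last


# A few lines copied from the second example trace above
SAMPLE = """\
 1 192.168.0.1 (192.168.0.1) 11.636 ms 1.349 ms 1.316 ms
 5 172.17.169.202 (172.17.169.202) 10.562 ms 9.976 ms 10.229 ms
 6 * * *
 7 * * *
"""

print(last_responding_hop(SAMPLE))  # 5
```

A hop with a mix of replies and asterisks is counted as responding, since at least one probe got through.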
To learn more about the traceroute command and how it works, please read here.
If traceroute is not installed, you can install it with sudo apt install traceroute on an Ubuntu machine, or yum install traceroute on a CentOS/RHEL box.
The netstat command is mostly used to list the network connections that exist on the box on which it is run. It also shows the list of ports open for listening.
If netstat is run with root privileges (for example via sudo), the output also shows the PID and name of the process that owns each socket (IP address:port), which is handy for checking whether a process is listening on a particular port (and/or IP).
To see the list of ports open for listening over TCP, use the command netstat -tnple. For UDP, the command would be netstat -unple.
To see the list of all TCP connections on the box, use the command netstat -planet. Similarly, for UDP, the command would be netstat -planeu.
You can pipe the output through grep to filter by connection state or by IP address.
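As an alternative to grep, the listening sockets can be pulled out of saved netstat -tnple output with a few lines of Python. The sketch below assumes the usual net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, User, Inode, PID/Program name). The function name listeners is hypothetical, and SAMPLE is invented illustrative data, not output captured from a real cluster.

```python
def listeners(netstat_output):
    """Map each listening TCP port to its "PID/Program" column."""
    found = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Keep only tcp/tcp6 rows whose State column is LISTEN
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "LISTEN":
            continue
        # Local address is "IP:port"; rpartition also copes with ":::9100"
        port = int(fields[3].rpartition(":")[2])
        # The PID/Program column is "-" or missing without root privileges
        found[port] = " ".join(fields[8:]) or "-"
    return found


# Invented sample rows in the assumed `netstat -tnple` layout
SAMPLE = """\
Proto Recv-Q Send-Q Local Address    Foreign Address  State   User  Inode  PID/Program name
tcp        0      0 10.0.0.12:9100   0.0.0.0:*        LISTEN  1001  23456  4321/yb-tserver
tcp        0      0 0.0.0.0:22       0.0.0.0:*        LISTEN  0     12345  987/sshd
"""
```

With this, 9100 in listeners(text) answers directly whether anything on the node is listening on the TServer port, and the mapped value shows which process owns it.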
If netstat is not installed, you can install it with sudo apt install net-tools on an Ubuntu machine, or yum install net-tools on a CentOS/RHEL box.