Issue:
CQLSH connectivity failing:
Error executing command ['ssh', 'yugabyte@NodeA', '-p', '54422', '-o', 'StrictHostKeyChecking no', '-o', 'ConnectTimeout=10', '-o', 'UserKnownHostsFile /dev/null', '-o', 'LogLevel ERROR', '-i', u'/opt/yugabyte/yugaware/data/keys/a85d21a6-cb6f-404e-b05c-4396dd5654f9/yb-dev-nice-aws-us-east-2-key.pem', 'set -o pipefail; /home/yugabyte/tserver/bin/cqlsh xxx.xx.xx.xxx 9042 -e "SHOW HOST"']: timeout occurred
Also, node ssh failing due to timeout:
Error executing command ['ssh', 'yugabyte@NodeA', '-p', '54422', '-o', 'StrictHostKeyChecking no', '-o', 'ConnectTimeout=10', '-o', 'UserKnownHostsFile /dev/null', '-o', 'LogLevel ERROR', '-i', u'/opt/yugabyte/yugaware/data/keys/a85d21a6-cb6f-404e-b05c-4396dd5654f9/yb-dev-nice-aws-us-east-2-key.pem', 'set -o pipefail; /home/yugabyte/tserver/bin/cqlsh xxx.xx.xx.xxx 9042 -e "SHOW HOST"']: timeout occurred
Symptoms:
On checking node metrics, you would notice huge spike in CPU usage on all the nodes in the cluster and high latency with operations on the nodes.
From tserver logs i.e tserver-INFO.log during the time duration, it would shows read queries taking a long time i.e a poorly formed select query causing CPU to spike
I0814 05:58:47.719163 22836 yb_rpc.cc:440] W0814 05:58:47.740188 20819 long_operation_tracker.cc:141] Read took a long time: 18.206s I0814 05:58:47.740233 20819 yb_rpc.cc:439] Call yb.tserver.TabletServerService.Read NodeA:40031 => NodeB:9100 (request call id 2070858368) took 18206ms. Trace: I0814 05:58:47.740239 20819 yb_rpc.cc:440] W0814 05:58:47.814543 6029 long_operation_tracker.cc:141] Read took a long time: 18.067s
Also, Update Replica operations i.e a write operation is taking long time.
W0814 05:59:11.852062 5850 long_operation_tracker.cc:141] UpdateReplica took a long time: 1.852s W0814 05:59:11.855759 4888 long_operation_tracker.cc:141] UpdateReplica took a long time: 1.855s
Root Cause:
When running below command on the tserver log where SSH was timing out shows series of complex select commands ran multiple times for the log timeframe we are interested in.
As an example:
zgrep 'SELECT *' yb-tserver.<Hostname>.log.INFO.*.gz | sort | uniq -c | sort -rnk1 | head 3
6464 sql_string: "SELECT * FROM trade WHERE identifier=? and ts <=? and ts >= ? IF (fields->>\'ABC\' = \'0\' or fields->>\'XYZ\'=null) and (not fields->>\'AAA\' = \'2\') limit ?" 669 SELECT * FROM trade WHERE XYZ=? and ts <=? and ts >= ? IF (fields->>'ABC' = '0' or fields->>'XYZ'=null) and (not fields->>'AAA' = '2') limit ? 146 sql_string: "SELECT * FROM XYZ WHERE XYZ=? limit 1"
As seen above on series of repeated selects can exhaust CPU on the node and cause high load triggering a leader re-election. This would cause the node to go unresponsive and any operations like SSH to eventually timeout.
Solution:
Please review the application code to see the read queries being executed. The read queries might have to be revisited or the table may need remodeling.
The example above uses this schema:
CREATE TABLE trade (
identifier text,
ts bigint,
sequencenumber bigint,
fields jsonb,
PRIMARY KEY (identifier, timestamp, sequencenumber)
) WITH CLUSTERING ORDER BY (timestamp DESC, sequencenumber DESC)
AND default_time_to_live = 604800
AND transactions = {'enabled': 'false'};
In the above query that runs most poorly, we can see that this filter pushes down to an un-indexed column, so all rows in the DB must be touched - this means it is a very expensive query to execute.
IF (fields->>\'ABC\' = \'0\' or fields->>\'XYZ\'=null)
The solution in this case would perhaps be to add an index to "fields" or to filter on an indexed column if possible. If you have specific questions around query tuning for Yugabyte, reach out to the support team.
Comments
0 comments
Article is closed for comments.