Environment:
Yugabyte core DB - all versions
Issue:
CQL clients encounter the following error when making a new YCQL connection to YugabyteDB:
'Unable to connect to any servers', {'10.20.30.40': ConnectionException('Failed to initialize new connection to 10.20.30.40: Error from server: code=1001 [Coordinator node overloaded, rejecting connection]
Users may see the following entries in the tserver logs:
Remote error (yb/rpc/outbound_call.cc:440): Service unavailable (yb/rpc/service_pool.cc:225): UpdateTransaction request on yb.tserver.TabletServerService from 10.238.220.15:38209 dropped due to backpressure. The service queue is full, it has 5000 items.
Also, application logs may show the following error:
com.datastax.driver.core.exceptions.OperationTimedOutException
com.datastax.driver.core.exceptions.TransportException
Another error seen in application logs can be:
Connection error: ('Unable to connect to any servers', {'10.88.16.74': ConnectionException('Failed to initialize new connection to 10.88.16.74: Error from server: code=1001 [Coordinator node overloaded] message=""',)})
Resolution:
This error occurs when:
- The CPU is busy and the tserver cannot schedule work for a new client request
- The RPC queue is full, usually indicating a sudden influx of new connections or aggressive retry logic
Solutions include:
- Reduce processing load on the system by reducing the number of queries per second
- Reduce the number of connections to the system on the application side
- Size up the cluster if the load is expected to remain at this volume - frequently encountering this error is usually an indication that the cluster is undersized
- If your application uses executeAsync from the Yugabyte CQL driver for queries, consider changing these to synchronous queries, as shown in the sketch below. Asynchronous queries do not wait for other queries to complete, so many hundreds or thousands may execute simultaneously, which can trigger this issue.
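For illustration, here is a minimal sketch of the synchronous approach, assuming the DataStax Java driver 3.x (com.datastax.driver.core) that produces the exceptions shown above. The contact point, keyspace, and table names are placeholders, not part of the original article.

// Sketch only: synchronous queries keep at most one request per thread in flight.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class SyncQueryExample {
    public static void main(String[] args) {
        // Hypothetical contact point and keyspace for illustration.
        try (Cluster cluster = Cluster.builder().addContactPoint("10.20.30.40").build();
             Session session = cluster.connect("my_keyspace")) {
            // Synchronous execution: each query waits for the previous one to
            // complete, so the tserver service queue is far less likely to fill up.
            for (int i = 0; i < 1000; i++) {
                ResultSet rs = session.execute("SELECT * FROM my_table WHERE id = ?", i);
                rs.one(); // consume the row
            }
            // By contrast, session.executeAsync(...) in the same loop would put
            // up to 1000 requests in flight at once.
        }
    }
}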
Root Cause
- The YCQL RPC queue is full. This is the most common cause.
- If memory usage M is between 85% and 100% of the total, the request is rejected with probability X, where X = (M - 85) / 100. This is less common, as it requires memory overload.
- If the request touches a RocksDB instance with a large number of files F, between the soft and hard limits (the default flags being 24 and 48 respectively), the request is again rejected with probability X, where X = (F - soft) / (hard - soft). This is very uncommon.
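As a worked illustration of these formulas (a sketch only, not YugabyteDB code; the sample inputs below are arbitrary):

// Computes the rejection probabilities described above, using the default
// soft/hard file limits of 24 and 48.
public class RejectionProbability {

    // Memory-based rejection: if memory usage M (percent of total) is between
    // 85% and 100%, reject with probability X = (M - 85) / 100.
    static double memoryRejectProbability(double memoryUsedPercent) {
        if (memoryUsedPercent < 85.0) {
            return 0.0;
        }
        return (memoryUsedPercent - 85.0) / 100.0;
    }

    // File-based rejection: if the number of files F in the touched RocksDB is
    // between the soft and hard limits, reject with probability
    // X = (F - soft) / (hard - soft).
    static double fileRejectProbability(int numFiles, int softLimit, int hardLimit) {
        if (numFiles <= softLimit) {
            return 0.0;
        }
        if (numFiles >= hardLimit) {
            return 1.0;
        }
        return (double) (numFiles - softLimit) / (hardLimit - softLimit);
    }

    public static void main(String[] args) {
        System.out.println(memoryRejectProbability(95.0));    // (95 - 85) / 100 = 0.1
        System.out.println(fileRejectProbability(36, 24, 48)); // (36 - 24) / 24 = 0.5
    }
}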
Each running instance of YugabyteDB is composed of multiple services, each with its own objective. Each of these services has a FIFO queue with a set limit, meaning it can hold only a fixed number of items/requests, say 5,000 or 10,000. The limit exists to prevent out-of-memory situations when processing becomes a bottleneck and the queue keeps growing.
Also, each service has a priority, since some tasks are more important than others. For example, the task of appointing a leader through consensus is more important than any other task; if it were deprioritized and not given enough resources to do its job, it could cause unnecessary leader movements. So if a particular low-priority service misbehaves, its queue/requests may be dropped.
Thus, having separate queues for different services/tasks prevents a well-behaved service's queue from suffering due to a misbehaving service.
Once a request arrives at the TServer from, say, the CQL layer, it is entered into the TServer's service queue. Note that the queue only contains requests that have not yet been picked up for processing.
Each service is assigned a number of RPC threads, and each RPC thread works on one item at a time. These RPC threads read requests from the service queue and process them. A request may involve compression/decompression, which is CPU intensive and may take time. Assume all RPC threads become busy with similarly CPU-intensive requests. In the meantime, new requests keep getting added to the service queue, building it up. Once the queue reaches its limit, it stops accepting new incoming requests and sends a 'queue full' error downstream to the layer sending it the requests; that layer has its own logic for interpreting the error and returning its own error message to the clients.
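The following sketch models this behavior in miniature. It is an illustration only, not the actual service pool implementation; the queue limit of 5000 and the pool of 8 worker threads are placeholder values.

// A bounded FIFO "service queue" served by a fixed pool of worker (RPC) threads.
// When the queue is full, new requests are rejected immediately, mirroring the
// "dropped due to backpressure" behavior described above.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ServicePoolSketch {
    private final BlockingQueue<Runnable> queue;

    ServicePoolSketch(int queueLimit, int numWorkerThreads) {
        this.queue = new ArrayBlockingQueue<>(queueLimit);
        for (int i = 0; i < numWorkerThreads; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        // Each worker handles one request at a time; a slow,
                        // CPU-intensive request keeps the worker busy while the
                        // queue grows behind it.
                        queue.take().run();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Returns false ("service queue is full") instead of blocking; this is the
    // analogue of the too-busy error sent downstream.
    boolean submit(Runnable request) {
        return queue.offer(request);
    }

    public static void main(String[] args) throws InterruptedException {
        ServicePoolSketch pool = new ServicePoolSketch(5000, 8);
        boolean accepted = pool.submit(() -> System.out.println("handled"));
        System.out.println(accepted ? "queued" : "rejected: queue full");
        Thread.sleep(100); // give the worker a moment to run before exit
    }
}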
So when the TServer's service queue, whose size is controlled by the tablet_server_svc_queue_length flag and which holds all incoming requests from, say, CQL, gets filled, it sends an ERROR_SERVER_TOO_BUSY error downstream. The CQL layer picks this up and in turn returns the "Coordinator node overloaded, rejecting connection" error to the clients.