Environment:
Yugabyte Core DB
Issue:
This count() query is considered as an anti-pattern since it involves full table scan. In case of large tables we have seen the count() query times out. The article describes workarounds to run the query in Yugabyte DB
Resolution:
Getting count of records in YCQL:
In YCQL, the count() query can be executed using the YCRC tool. The tool is available on GitHub.
The tool provided uses the exposed hash_partition function in order to execute smaller, more manageable queries which have are individually less resource intensive, so they don't time out.
Below are the steps to set up and run ycrc tool:
Step 1: Download the ycrc tool - you can compile from source by going to the github repo listed above, or download the appropriate version attached to this KB
Step 2: Run ./ycrc --help
to confirm if the ycrc tool is working:
[test-yugadb-node2 ~]$ ./ycrc --help YCql Row Count (ycrc) parallelizes counting the number of rows in a table for YugabyteDB CQL, allowing count(*) on tables that otherwise would fail with query timeouts Usage: ycrc <keyspace> [flags] Flags: -d, --debug Verbose logging -h, --help help for ycrc -c, --hosts strings Cluster to connect to (default [127.0.0.1]) -p, --parallel int Number of concurrent tasks (default 16) --password string user password -s, --scale int Scaling factor of tasks per table, an int between 1 and 10 (default 6) --sslca string SSL root ca path --sslcert string SSL cert path --sslkey string SSL key path --tables strings List of tables inside of the keyspace - default to all -t, --timeout int Timeout of a single query, in ms (default 1500) -u, --user string database user (default "cassandra") --verify Strictly verify SSL host (off by default) -v, --version version for ycrc
Step 3: Once it is setup, run the YCRC tool to count the rows in a given keyspace. The below example shows YCRC command to count rows in all tables in example keyspace.
[test-yugadb-node2 ~]$ ./ycrc -c 10.128.0.26 example Checking table row counts for keyspace: example Checking row counts for: example.sensordata Partitioning columns for example.sensordata:(customer_name,device_id) Performing 4096 checks for example.sensordata with 16 parallel tasks Total time: 261 ms ========== Total Row Count example.sensordata = 60 Checking row counts for: example.emp Partitioning columns for example.emp:(emp_id) Performing 4096 checks for example.emp with 16 parallel tasks Total time: 250 ms ========== Total Row Count example.emp = 3
YCRC tool example:
./ycrc example -c 10.88.16.71 -u test --verify --password xxx --tables sensordata -s 7 -p 32 -t 3000 --sslca /opt/yugabyte/certs/tests.crt
The above example is used if you want to run YCRC tool using test
user on sensordata
table in example
keyspace. -sslca
flag needs to specified if client to node authentication is enabled.
In the above example, -s
and -t
are important. we can adjust the -t, --timeout int Timeout of a single query, in ms (default 1500)
- or - we can also experiment with the -p, --parallel int Number of concurrent tasks (default 16)
in order to allow for time or resources needed to allow the script to complete for large data set.
Basically, -s
i.e scale is “is the tool successful” and -p
i.e parallelism is “how fast”. If scale is high enough, it will return, but maybe you only burn 10% CPU and it takes like 10 min to return. If you then double parallelism, it might be 20% CPU but only 5min to return. Also increasing the -t
flag will help with large dataset.
The tool can be executed from any server i.e personal laptop also. The tool is available currently for linux and Windows. You can specify one or more than one IP addresses in -c
flag .
Getting count of records in YSQL:
Please refer to the the article linked here.
Comments
2 comments
there was a good article about the Count Problem in PostgreSQL.
https://www.citusdata.com/blog/2016/10/12/count-performance/
Can someone elaborate how this options apply to YugaByteDB? Are i.e. such stats available as well to be (ab)used for a row count or estimate, especially after the Operations the row count usually have no meaning any more, as it has usually changed anyway a second later
See also https://docs.yugabyte.com/v2.12/api/ysql/exprs/func_yb_hash_code/#distributed-parallel-queries
Please sign in to leave a comment.