Yugabyte Core DB
This count() query is considered as an anti-pattern since it involves full table scan. In case of large tables we have seen the count() query times out. The article describes workarounds to run the query in Yugabyte DB
Running count(*) query in YCQL:
In YCQL, the count() query can be executed using the YCRC tool. The tool is available on GitHub.
The tool provided uses the exposed hash_partition function in order to execute smaller, more manageable queries which have are individually less resource intensive, so they don't time out.
Below are the steps to set up and run ycrc tool:
Step 1: Download the ycrc tool - you can compile from source by going to the github repo listed above, or download the appropriate version attached to this KB
Step 2: Run
./ycrc --help to confirm if the ycrc tool is working:
[test-yugadb-node2 ~]$ ./ycrc --help YCql Row Count (ycrc) parallelizes counting the number of rows in a table for YugabyteDB CQL, allowing count(*) on tables that otherwise would fail with query timeouts Usage: ycrc <keyspace> [flags] Flags: -d, --debug Verbose logging -h, --help help for ycrc -c, --hosts strings Cluster to connect to (default [127.0.0.1]) -p, --parallel int Number of concurrent tasks (default 16) --password string user password -s, --scale int Scaling factor of tasks per table, an int between 1 and 10 (default 6) --sslca string SSL root ca path --sslcert string SSL cert path --sslkey string SSL key path --tables strings List of tables inside of the keyspace - default to all -t, --timeout int Timeout of a single query, in ms (default 1500) -u, --user string database user (default "cassandra") --verify Strictly verify SSL host (off by default) -v, --version version for ycrc
Step 3: Once it is setup, run the YCRC tool to count the rows in a given keyspace. The below example shows YCRC command to count rows in all tables in example keyspace.
[test-yugadb-node2 ~]$ ./ycrc -c 10.128.0.26 example Checking table row counts for keyspace: example Checking row counts for: example.sensordata Partitioning columns for example.sensordata:(customer_name,device_id) Performing 4096 checks for example.sensordata with 16 parallel tasks Total time: 261 ms ========== Total Row Count example.sensordata = 60 Checking row counts for: example.emp Partitioning columns for example.emp:(emp_id) Performing 4096 checks for example.emp with 16 parallel tasks Total time: 250 ms ========== Total Row Count example.emp = 3
YCRC tool example:
./ycrc example -c 10.88.16.71 -u test --verify --password xxx --tables sensordata -s 7 -p 32 -t 3000 --sslca /opt/yugabyte/certs/tests.crt
The above example is used if you want to run YCRC tool using
test user on
sensordata table in
-sslca flag needs to specified if client to node authentication is enabled.
In the above example,
-t are important. we can adjust the
-t, --timeout int Timeout of a single query, in ms (default 1500) - or - we can also experiment with the
-p, --parallel int Number of concurrent tasks (default 16) in order to allow for time or resources needed to allow the script to complete for large data set.
-s i.e scale is “is the tool successful” and
-p i.e parallelism is “how fast”. If scale is high enough, it will return, but maybe you only burn 10% CPU and it takes like 10 min to return. If you then double parallelism, it might be 20% CPU but only 5min to return. Also increasing the
-t flag will help with large dataset.
The tool can be executed from any server i.e personal laptop also. The tool is available currently for linux and Windows. You can specify one or more than one IP addresses in
-c flag .
Running count(*) query in YSQL:
Create a function and execute the query using the function which uses an implicit cursor.
CREATE OR REPLACE FUNCTION row_count(tbl regclass) RETURNS setof int AS $func$ DECLARE _id int; BEGIN FOR _id IN EXECUTE 'SELECT 1 FROM ' || tbl LOOP RETURN NEXT _id; END LOOP; END $func$ LANGUAGE plpgsql;
In this case, the query would be:
select count(*) from row_count('tablename')
Be aware this query may take some minutes to complete.
You may also increase the client side timeout to something higher, maybe 10 minutes using the below tserver gflag
The above works for v2.4 and above since it requires CURSOR feature.
Below is one more workaround for running count(*) in ysqlsh:
create table foo (id int primary key, fname text); insert into foo select i, 'jon' || i from generate_series(1, 1000000) as i; create table dual (foo int); insert into dual values (1); explain select count(*) from foo cross join dual; QUERY PLAN --------------------------------------------------------------------------- Aggregate (cost=15202.50..15202.51 rows=1 width=8) -> Nested Loop (cost=0.00..12702.50 rows=1000000 width=0) -> Seq Scan on foo (cost=0.00..100.00 rows=1000 width=0) -> Materialize (cost=0.00..105.00 rows=1000 width=0) -> Seq Scan on dual (cost=0.00..100.00 rows=1000 width=0)
Another option is to use yb_hash_code() to run different queries that work on different parts of the table, and control the parallelism at the application level.
Additional information on yb_hash_code() can be found here. In particular, this sub-section is relevant for running count(*) on tables.
Again, be warned - running this count(*) query on a large table while other activity, like a data load, is going on the cluster , is a bad pattern because this is an expensive operation.
We want to limit the count(*) as much as possible on Yugabyte for large tables. Please feel free to reach out to support for further queries.
there was a good article about the Count Problem in PostgreSQL.
Can someone elaborate how this options apply to YugaByteDB? Are i.e. such stats available as well to be (ab)used for a row count or estimate, especially after the Operations the row count usually have no meaning any more, as it has usually changed anyway a second later
See also https://docs.yugabyte.com/v2.12/api/ysql/exprs/func_yb_hash_code/#distributed-parallel-queries
Please sign in to leave a comment.