Environment
- Yugabyte Platform - 2.6 onwards
Issue
Usage of both cassandra-loader/unloader and dsbulk to load CSV data into Yugabyte and also extract data into CSV files. Currently cassandra-loader and cassandra unloader are the recommended tools for loading large data sets (GB's), for small data sets (MB's) the COPY command can be used.
Going forward, dsbulk will be the recommended tool for loading larger data sets, but is still currently being developed.
Limitations
Currently dsbulk does not support jsonb columns, or TLS configured systems.
Recommendations
For large data sets it is recommended to use dsbulk unless you have a requirement for jsonb columns or TLS support, as in tests it has proved to be more performant than cassandra-loader/unloader, but the level of performance will depend on volumes and structure of data. There is the added feature of row count with dsbulk. The level of logging is far superior for dsbulk in terms of the details given about each load, positions of failed records etc..
Resolution
Overview
Cassandra-loader/unloader
Installation
To install cassandra-loader/unloader for yugabyte use the distributions below. You will need to install wget first.
wget https://github.com/yugabyte/cassandra-loader/releases/download/v0.0.27-yb-2/cassandra-loader
wget https://github.com/YugaByte/cassandra-loader/releases/download/v0.0.27-yb-2/cassandra-unloader
A good place for installation is the platform nodes, as from here you can access any of the data nodes via private IP addresses.
Cassandra-loader/unloader has many options to configure the loading or unloading process, a full list of the options can be found in the link below.
https://github.com/brianmhess/cassandra-loader
But please note, do not use the distribution in the above link, as it is not the official Yugabyte supported distribution.
Examples
Cassandra-loader, in the example below, data will be loaded into the database from a CSV file with delimiter tab, but this can be changed to any relevant type of delimiter.
Please note the cassandra-loader will expect the table to exist already, it will not create the table if it does not exist.
./cassandra-loader \
-schema "<keyspace>.<table>(<col1>, <col2>, col3> .....)" \
-boolStyle 1_0 \
-numThreads 50 \
-numFutures 500 \
-rateFile rate_statistics.csv \
-batchSize 10 \
-rate 500 \
-queryTimeout 600000 \
-numRetries 10 \
-progressRate 10000 \
-host <host> \
-port <port> \
-user <user> \
-pw <password> \
-delim $'\t' \
-f <path to csv file> \
-charsPerColumn 90000000
Important options to note for throttling the process and also allowable sizes.
-charsPerColumn biggest column size allowed
-rate Maximum rate to insert data - in rows/sec.
-maxErrors Maximum number of rows that do not parse to allow before exiting.
-maxInsertErrors Maximum number of rows that do not insert to allow before exiting.
-batchSize Size of unlogged batches. If set to 1 then no batching.
The following files will be created after the load.
These log files will created in the working directory from where the command is run.
<csv file>.LOG a log of number of records loaded and number of failed records.
<csv file>.BADPARSE to list rows that did not parse due to syntax errors.
<csv file>.BADINSERT to list all records that did not get inserted due to data issues.
Also the link below also has a good generic example with a sample csv file.
https://github.com/brianmhess/cassandra-loader/tree/master/sample
Cassandra-unloder
./cassandra-unloader \
-boolStyle 1_0 \
-numThreads 1 \
-schema "<keyspace>.<table>(,col1>, <col2>, <col3> ...)" \
-host <host> \
-port <port> \
-user <user> \
-pw <passwd> \
-delim $'\t' \
-f <path to output csv file>
This will produce a csv file with data unloaded from the table.
TLS
For TLS enable environments, you will need to create a truststore and import the root cert, and use the options below.
keytool -keystore ybtruststore -alias ybtruststore -import -file <path to root cert>
You will be prompted to enter the truststore password
Also you will be asked if you trust the certificate. type yes.
Then in the cassandra-loader, cassandra-unloader, add the options
-ssl-truststore-path <path to truststore> \
-ssl-truststore-pw <truststore password> \
DSBULK
Installation
To Install dsbulk you will need to clone the following git repo.
https://github.com/yugabyte/dsbulk
Clone branch yb-dsbulk-1.6.
To build the binaries you will need to install a version of maven, then build via the command
mvn package -P release -DskipTests
This will generate a dsbulk-1.6.0.tar.gz file. It will have the binary that can be used.
Usage
The github repo will have a lot of examples around usage, but the 3 main basic uses are
# Load data
dsbulk load <options>
# Unload data
dsbulk unload <options>
# Count rows
dsbulk count <options>
All the available options can been seen via help
dsbulk --help
Examples
dsbulk to load data
dsbulk load -k <keyspsace> -t <table> -url <csv file. -delim '\t' -h <hostname>
will produce output similar to below
Operation directory: /home/centos/yugabyte-client-2.6/logs/LOAD_20220215-121114-257698
total | failed | rows/s | p50ms | p99ms | p999ms | batches
505 | 3 | 855 | 6.41 | 25.03 | 32.24 | 1.00
Operation LOAD_20220215-121114-257698 completed with 3 errors in .
Rejected records can be found in the following file(s): connector.bad
Errors are detailed in the following file(s): connector-errors.log
Last processed positions can be found in positions.txt
In the relevant log directory the following logs will be present
operation.log - summary, number of records loaded, number of failures, throughput, mem usage.
positions.txt - position of first failure record
connect.bad - all failure records
connector-errors.log - details of exceptions for all failures.
dsbulk to unload data
dsbulk unload -k <keyspace> -t <table> -url <dir for csv files> -delim '\t' -h <hostname>
Operation directory: /home/centos/yugabyte-client-2.6/logs/UNLOAD_20220215-122215-406294
total | failed | rows/s | p50ms | p99ms | p999ms
502 | 0 | 1,241 | 2.75 | 11.53 | 11.53
Operation UNLOAD_20220215-122215-406294 completed successfully in .
Log directory will contain a operation.log
dsbulk to count rows
dsbulk count -k <keyspace> -t newtest -h <hostname>
Operation directory: /home/centos/yugabyte-client-2.6/logs/COUNT_20220215-122035-715034
total | failed | rows/s | p50ms | p99ms | p999ms
502 | 0 | 1,572 | 4.90 | 16.78 | 16.78
Operation COUNT_20220215-122035-715034 completed successfully in .
502
Comments
0 comments
Please sign in to leave a comment.