How to Monitor Remote Bootstrap of a Tablet using logs and metrics
Environment
- YugabyteDB - all supported versions
- YugabyteDB Anywhere - any supported version when reviewing universe logs, support bundles, or metrics
Issue
Customers may need to monitor the progress of remote bootstrap for a specific tablet. This is commonly needed when a tablet replica is being added, replaced, or moved after node addition, node replacement, node decommissioning, failed peer recovery, or load balancing.
Remote bootstrap creates or replaces a tablet peer by copying tablet metadata, RocksDB files, and WAL data from an existing peer to the target YB-TServer. To monitor an active remote bootstrap, use the yb-tserver.INFO log on the target or source node together with YugabyteDB Anywhere metrics.
Resolution
Overview
For a tablet remote bootstrap, the sequence of events to look for in yb-tserver.INFO is:
- The target YB-TServer logs that it is initiating remote bootstrap from a source peer.
- The target YB-TServer begins the remote bootstrap session.
- The target YB-TServer downloads RocksDB/SST files and WAL segments.
- The target YB-TServer replaces the tablet superblock and opens the tablet.
- The source peer logs remote bootstrap session completion and ChangeRole success.
- The target YB-TServer logs "Remote bootstrap for tablet ended successfully".
The fastest way to monitor one tablet is to follow the relevant yb-tserver.INFO file while filtering by tablet ID.
Setup
Set the target or source YB-TServer log file and tablet ID before running the helper function.
export TABLET_ID="<tablet_id>"
export TSERVER_LOG_FILE="/path/to/yb-tserver.INFO"
Use the target YB-TServer log to monitor download progress and target-side completion. Use the source YB-TServer log to monitor source-side session completion and ChangeRole messages.
Bash Helpers
Follow one tablet live
If logs are still being written on the nodes, this function tails the relevant remote bootstrap messages, including RocksDB/SST file and WAL download progress.
# Follow remote bootstrap messages for one tablet as they are written.
# Usage: rbs_follow_tablet [tablet_id] [tserver_log]
rbs_follow_tablet() {
  local tablet_id="${1:-${TABLET_ID:?TABLET_ID required}}"
  local tserver_log="${2:-${TSERVER_LOG_FILE:?TSERVER_LOG_FILE required}}"
  # tail -F keeps following the file across log rotation.
  tail -F "${tserver_log}" 2>/dev/null |
    grep --line-buffered -E \
      "${tablet_id}.*(Initiating RemoteBootstrap|Beginning remote bootstrap session|Began remote bootstrap session|Downloaded file|Downloaded WAL segment|Remote bootstrap complete|Remote bootstrap: Opening tablet|Remote bootstrap for tablet ended successfully|Remote bootstrap: Failed|Unable to fetch data)|Remote bootstrap session with id .*${tablet_id}|ChangeRole succeeded for bootstrap session .*${tablet_id}"
}
Example:
rbs_follow_tablet "$TABLET_ID"
Use this when the operation is still in progress and you want to watch new log lines as they arrive.
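For a one-shot check instead of a live tail, the same milestone messages can be reduced to a single status word. The following is a minimal sketch, not an official tool: the function name rbs_status and the status words are hypothetical, and the success and failure strings are simply the ones used in the rbs_follow_tablet filter. It reports the state implied by the most recent milestone line for the tablet, so treat the result as a heuristic.

```shell
# Hypothetical helper: print a coarse remote bootstrap status for one tablet.
# Usage: rbs_status [tablet_id] [tserver_log]
rbs_status() {
  local tablet_id="${1:-${TABLET_ID:?TABLET_ID required}}"
  local tserver_log="${2:-${TSERVER_LOG_FILE:?TSERVER_LOG_FILE required}}"
  awk -v t="T ${tablet_id} " '
    # Skip lines that do not belong to this tablet.
    index($0, t) == 0 { next }
    # Later milestones overwrite earlier ones, so the last one wins.
    /Initiating RemoteBootstrap/                      { state = "in_progress" }
    /Remote bootstrap: Failed|Unable to fetch data/   { state = "failed" }
    /Remote bootstrap for tablet ended successfully/  { state = "completed" }
    END { print (state == "" ? "not_started" : state) }
  ' "${tserver_log}"
}
```

Run it against the target YB-TServer log, for example: rbs_status "$TABLET_ID" "$TSERVER_LOG_FILE". A result of not_started only means that no initiation message for the tablet was found in that particular file.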
Example: Trimmed rbs_follow_tablet output
The following trimmed example shows the important milestones to watch during an active remote bootstrap.
I0512 07:22:56.607738 997 ts_tablet_manager.cc:3524] T <tablet_id> P <target_peer_uuid>: Initiating RemoteBootstrap from Peer <source_peer_uuid> (<source_host>:9100)
I0512 07:22:56.608563 997 remote_bootstrap_client.cc:215] T <tablet_id> P <target_peer_uuid>: Remote client base: Beginning remote bootstrap session from peer <source_peer_uuid> [LEADER] at <source_host>:9100
I0512 07:22:56.625867 997 remote_bootstrap_client.cc:428] T <tablet_id> P <target_peer_uuid>: Remote client base: Began remote bootstrap session <session_id> [Bootstrapping from LEADER]
I0512 07:23:06.985605 997 remote_bootstrap_file_downloader.cc:259] T <tablet_id> P <target_peer_uuid>: Remote client base: Downloaded file: 000012.sst.sblock.0; Stats: Total time: 10358.753 ms, iterations: 102, Transmission rate: 98277, RateLimiter total time slept: 10005 ms, Total bytes: 1017809, ...
I0512 07:23:06.986115 997 tablet_bootstrap_if.cc:95] T <tablet_id> P <target_peer_uuid>: RemoteBootstrap: Downloaded file 000012.sst.sblock.0 of size 1015874 in 10.35954113 seconds (skip_compression: 1)
I0512 07:23:07.679858 997 remote_bootstrap_file_downloader.cc:259] T <tablet_id> P <target_peer_uuid>: Remote client base: Downloaded file: 000013.sst; Stats: Total time: 692.259 ms, iterations: 7, Transmission rate: 98324, RateLimiter total time slept: 667 ms, Total bytes: 67926, ...
I0512 07:23:19.473927 997 tablet_bootstrap_if.cc:95] T <tablet_id> P <target_peer_uuid>: RemoteBootstrap: Downloaded file MANIFEST-000011 of size 1334 in 0.015693625 seconds (skip_compression: 1)
I0512 07:23:41.163797 997 remote_bootstrap_client.cc:695] T <tablet_id> P <target_peer_uuid>: Remote client base: Downloaded WAL segment with seq. number 1 of size 1094700 in 11.1647 seconds
I0512 07:34:50.765514 997 remote_bootstrap_client.cc:695] T <tablet_id> P <target_peer_uuid>: Remote client base: Downloaded WAL segment with seq. number 6 of size 33608006 in 342.746 seconds
I0512 07:57:43.957592 997 remote_bootstrap_client.cc:695] T <tablet_id> P <target_peer_uuid>: Remote client base: Downloaded WAL segment with seq. number 8 of size 67255404 in 686.77 seconds
I0512 07:59:24.881805 997 remote_bootstrap_client.cc:482] T <tablet_id> P <target_peer_uuid>: Remote client base: Remote bootstrap complete. Replacing tablet superblock.
I0512 07:59:24.886072 997 ts_tablet_manager.cc:1726] T <tablet_id> P <target_peer_uuid>: Remote bootstrap: Opening tablet
I0512 07:59:24.990051 997 ts_tablet_manager.cc:1747] T <tablet_id> P <target_peer_uuid>: Remote bootstrap for tablet ended successfully
How to read this example:
- "Initiating RemoteBootstrap" and "Beginning remote bootstrap session" confirm that the target peer connected to a source peer.
- "Downloaded file: *.sst" and "Downloaded file *.sst.sblock.0" show that RocksDB/SST data and SST metadata are being copied.
- "Downloaded file CURRENT" and "Downloaded file MANIFEST-*" lines, when present, show that RocksDB metadata files are being copied.
- "Downloaded WAL segment with seq. number <n>" shows WAL segment transfer progress. Increasing sequence numbers and byte counts indicate that the operation is still moving forward.
- "RateLimiter total time slept" shows time spent throttled by the remote bootstrap rate limiter.
- "Remote bootstrap complete. Replacing tablet superblock", "Remote bootstrap: Opening tablet", and "Remote bootstrap for tablet ended successfully" are the target-side completion markers.
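The download lines described above can also be totalled to get a quick picture of how much has been copied so far. The sketch below is an illustration only: the function name rbs_download_summary and its output format are hypothetical, and it assumes the "of size <bytes> in <seconds> seconds" wording shown in the trimmed example. It counts only the lines that carry "of size", so the paired "Stats:" lines for the same file are not double counted.

```shell
# Hypothetical helper: total the RocksDB files and WAL segments downloaded so far.
# Usage: rbs_download_summary [tablet_id] [tserver_log]
rbs_download_summary() {
  local tablet_id="${1:-${TABLET_ID:?TABLET_ID required}}"
  local tserver_log="${2:-${TSERVER_LOG_FILE:?TSERVER_LOG_FILE required}}"
  awk -v t="T ${tablet_id} " '
    # Skip lines that do not belong to this tablet.
    index($0, t) == 0 { next }
    / of size [0-9]+ in / {
      # Locate the "size" token; the byte count is the field right after it.
      for (i = 1; i < NF; i++) if ($i == "size") { bytes = $(i + 1); break }
      if ($0 ~ /Downloaded WAL segment/) {
        # "... number <seq> of size <bytes> ...": the sequence number sits two fields back.
        wal_n++; wal_bytes += bytes; last_seq = $(i - 2)
      } else if ($0 ~ /Downloaded file/) {
        file_n++; file_bytes += bytes
      }
    }
    END {
      printf "rocksdb_files=%d bytes=%.0f\n", file_n, file_bytes
      printf "wal_segments=%d bytes=%.0f last_seq=%s\n", wal_n, wal_bytes, (last_seq == "" ? "none" : last_seq)
    }
  ' "${tserver_log}"
}
```

Running the function twice a few minutes apart and comparing the byte totals gives a rough indication of whether the transfer is still progressing.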
Monitoring With YugabyteDB Anywhere Metrics
YugabyteDB Anywhere metrics show whether remote bootstrap is currently active and whether data is moving. Use logs when you need to monitor one specific tablet ID. Use metrics when you need cluster-level or node-level visibility.
In the YugabyteDB Anywhere UI, open the universe, go to Metrics, and select the TServer tab. Then search for or select the remote bootstrap charts.
Remote Bootstrap Sessions
The Remote Bootstrap Sessions chart shows active server-side and client-side remote bootstrap sessions.
Interpretation:
- RBS Server Sessions indicates source-side sessions serving remote bootstrap data.
- RBS Client Sessions indicates target-side tablet peers currently undergoing remote bootstrap.
- A non-zero value means remote bootstrap is active on at least one YB-TServer.
- Per-node charts help identify which YB-TServers are serving data and which are receiving data.
Remote Bootstrap Bytes Transferred
The Remote Bootstrap Bytes Transferred chart shows transfer throughput for FetchData responses.
Interpretation:
- Non-zero throughput means remote bootstrap data is actively being transferred.
- A flat or near-zero throughput while sessions are active can indicate a stalled or very slow remote bootstrap.
- Per-node charts help identify which YB-TServers are sending or receiving most remote bootstrap traffic.
Screenshots
[Screenshot: Overall view]
[Screenshot: Outlier nodes view]
PromQL Examples
Replace $node_prefix with the universe node prefix or the equivalent dashboard variable in your Prometheus UI.
Remote Bootstrap Sessions
Cluster-level active sessions:
sum(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"num_remote_bootstrap_sessions_serving_data|num_tablet_peers_undergoing_rbs"
}
) by (saved_name)
Top nodes plus selected-node average:
(
avg(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"num_remote_bootstrap_sessions_serving_data|num_tablet_peers_undergoing_rbs"
}
) by (exported_instance, saved_name)
and
topk(
3,
avg(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"num_remote_bootstrap_sessions_serving_data|num_tablet_peers_undergoing_rbs"
}
) by (exported_instance, saved_name)
) by (saved_name)
)
or
avg(
avg(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"num_remote_bootstrap_sessions_serving_data|num_tablet_peers_undergoing_rbs"
}
) by (exported_instance, saved_name)
) by (saved_name)
Remote Bootstrap Bytes Transferred
Cluster-level throughput in MB/s:
sum(
rate(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"proxy_response_bytes_yb_tserver_RemoteBootstrapService_FetchData|service_response_bytes_yb_tserver_RemoteBootstrapService_FetchData"
}[30s]
)
) by (saved_name) / (1024 * 1024)
Top nodes plus selected-node average in MB/s:
(
avg(
rate(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"proxy_response_bytes_yb_tserver_RemoteBootstrapService_FetchData|service_response_bytes_yb_tserver_RemoteBootstrapService_FetchData"
}[30s]
)
) by (exported_instance, saved_name) / (1024 * 1024)
and
topk(
3,
avg(
rate(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"proxy_response_bytes_yb_tserver_RemoteBootstrapService_FetchData|service_response_bytes_yb_tserver_RemoteBootstrapService_FetchData"
}[1h]
)
) by (exported_instance, saved_name) / (1024 * 1024)
) by (saved_name)
)
or
avg(
avg(
rate(
{
node_prefix=~"$node_prefix",
export_type="tserver_export",
saved_name=~"proxy_response_bytes_yb_tserver_RemoteBootstrapService_FetchData|service_response_bytes_yb_tserver_RemoteBootstrapService_FetchData"
}[30s]
)
) by (exported_instance, saved_name) / (1024 * 1024)
) by (saved_name)
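Running a query from the command line
The cluster-level sessions query above can also be sent to the Prometheus HTTP API as an instant query. This is a sketch under stated assumptions: the function names are hypothetical, PROM_URL is a hypothetical variable that must point at the Prometheus server backing your YugabyteDB Anywhere installation (for example http://<yba_host>:9090 if it listens on the default Prometheus port), and access to that endpoint may be restricted in your environment.

```shell
# Hypothetical helper: print the cluster-level sessions query for a node prefix.
rbs_sessions_promql() {
  local node_prefix="${1:?node prefix required}"
  printf 'sum({node_prefix=~"%s", export_type="tserver_export", saved_name=~"num_remote_bootstrap_sessions_serving_data|num_tablet_peers_undergoing_rbs"}) by (saved_name)' "${node_prefix}"
}

# Hypothetical helper: run the query through the Prometheus /api/v1/query endpoint.
# Usage: rbs_sessions_query <node_prefix> [prometheus_base_url]
rbs_sessions_query() {
  local node_prefix="${1:?node prefix required}"
  local prom_url="${2:-${PROM_URL:?PROM_URL required}}"
  # -G sends the URL-encoded query as GET parameters.
  curl -sS -G "${prom_url}/api/v1/query" \
    --data-urlencode "query=$(rbs_sessions_promql "${node_prefix}")"
}
```

The response is JSON with one series per saved_name. A non-zero value for either series means at least one remote bootstrap session is active.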