Environment
- YugabyteDB - 2.8 and older
- YugabyteDB - 2.12
- YugabyteDB - 2.14
- YugabyteDB - 2.16
Issue
During a rolling restart of a YugabyteDB database initiated through YugabyteDB Anywhere (formerly Yugabyte Platform), an increase in query latency is observed. Once the rolling restart has completed, latency returns to normal.
Resolution
Overview
In versions of YugabyteDB Anywhere prior to 2.8.1, rolling restarts force the database software to shut down on each node in sequence. This happens without any action to drain incoming requests from the remote procedure call (RPC) queue on these nodes. As a result, when nodes hosting tablet leaders are restarted, pending RPCs are discarded and must be retried, leading to an increase in query latency.
- Starting in version 2.8.1, YugabyteDB Anywhere supports the use of a database feature called "tablet leader blacklisting" during rolling restarts. Once this feature is enabled, tablet leadership will be moved away from the node scheduled for restart, giving the RPC queues a chance to drain before the node is restarted.
- Starting in version 2.14.0, tablet leader blacklisting is enabled for all Universes by default. However, for Universes with many tablets, some tuning may be required to ensure leader movement completes within the configured timeout.
IMPORTANT: Systems using custom (CA-signed) TLS certificates and running YugabyteDB Anywhere versions 2.12.2 through 2.12.9 should be upgraded to version 2.12.10 or newer to address bug PLAT‑4658 prior to enabling this feature.
Steps
1. Upgrade YugabyteDB Anywhere to version 2.8.1 or newer. This feature is supported with database versions 2.4 and newer.
2. For systems running versions of YugabyteDB Anywhere 2.14.0 or newer, skip this step.
Use the YugabyteDB Anywhere REST API to set the runtime configuration setting yb.upgrade.blacklist_leaders to true. This enables tablet leader blacklisting during rolling restarts.
For example, the command below will enable tablet leader blacklisting for all Universes associated with the corresponding YugabyteDB Anywhere instance (replace <platform_address> with the hostname or IP address of the YugabyteDB Anywhere instance, <cuuid> with the Customer ID value from the User Profile section of the YugabyteDB Anywhere user interface, and <auth_token> with a suitable REST API auth token):
curl --request PUT \
--url https://<platform_address>/api/v1/customers/<cuuid>/runtime_config/00000000-0000-0000-0000-000000000000/key/yb.upgrade.blacklist_leaders \
--header 'Content-Type: text/plain' \
--header 'X-AUTH-YW-API-TOKEN: <auth_token>' \
--data true
More information about getting and setting YugabyteDB Anywhere runtime configuration variables is available in the Runtime Configuration section of the REST API documentation.
NOTE: Only the SuperAdmin user can modify runtime configuration variables at the global scope (00000000-0000-0000-0000-000000000000).
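After changing the setting, it can be useful to read the value back. The sketch below assumes the runtime configuration path used in the PUT command above also accepts a GET request; confirm this against the REST API documentation for your version. The function names runtime_config_url and get_runtime_config are hypothetical helpers introduced here for illustration only.

```shell
# Global scope UUID, as used in the PUT command above.
GLOBAL_SCOPE="00000000-0000-0000-0000-000000000000"

# Hypothetical helper: build the runtime configuration URL for a key
# at the global scope.
# Arguments: <platform_address> <cuuid> <key>
runtime_config_url() {
  printf 'https://%s/api/v1/customers/%s/runtime_config/%s/key/%s\n' \
    "$1" "$2" "$GLOBAL_SCOPE" "$3"
}

# Hypothetical helper: read a key back from YugabyteDB Anywhere.
# Assumption: GET is accepted on the same path that the PUT above uses.
# Arguments: <platform_address> <cuuid> <auth_token> <key>
get_runtime_config() {
  curl --silent --show-error --fail --request GET \
    --url "$(runtime_config_url "$1" "$2" "$4")" \
    --header "X-AUTH-YW-API-TOKEN: $3"
}
```

Calling get_runtime_config with your platform address, Customer ID, auth token, and the key yb.upgrade.blacklist_leaders should return the stored value if the earlier PUT succeeded.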
3. If necessary, adjust the yb.upgrade.blacklist_leader_wait_time_ms runtime configuration setting. By default, YugabyteDB Anywhere will wait a maximum of 60000 ms (1 minute) for tablet leader migration to complete before restarting each node.
The amount of time required for all tablet leaders to migrate varies depending on the number of tablets on each node. By default, the database software will perform 2 tablet leader moves per second.
YugabyteDB Anywhere periodically checks the status of tablet leader migration and will restart a node immediately if all leaders have been migrated, so this setting can be safely increased to several minutes. This setting acts as a "backstop" that prevents rolling restart from hanging in the event that tablet leader migration does not complete in a timely manner.
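As a rough sizing aid, the wait time can be estimated from the number of tablet leaders on the busiest node and the leader move rate. The sketch below assumes the default rate of 2 moves per second stated above and a steady migration rate; the function name estimate_wait_ms is hypothetical, and the result is only a starting point.

```shell
# Hypothetical helper: estimate how long leader migration off one node
# may take, in milliseconds.
# Arguments: <tablet_leaders_on_node> [moves_per_second]
estimate_wait_ms() {
  leaders="$1"
  moves_per_sec="${2:-2}"   # default rate stated in this article
  # Round up to whole seconds, then convert to milliseconds.
  echo $(( (leaders + moves_per_sec - 1) / moves_per_sec * 1000 ))
}

# Example: a node hosting 300 tablet leaders at the default rate.
estimate_wait_ms 300
```

For 300 leaders this gives 150000 ms, which exceeds the 60000 ms default, so the setting would need to be raised. Under the assumption that the key is set the same way as in step 2, the new value can be applied with the same PUT request, using yb.upgrade.blacklist_leader_wait_time_ms as the key and the number of milliseconds as the data.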
4. If necessary, adjust the master GFlag load_balancer_max_concurrent_moves. This flag controls the number of tablet leader moves that the database software will perform concurrently and therefore how fast tablet leader blacklisting will complete. If rolling restarts of a Universe are taking too long after enabling tablet leader blacklisting, adjust this flag as shown in the table below:
Node vCPUs | Node Memory | load_balancer_max_concurrent_moves |
4 or more | 8 GiB or more | 10 |
This GFlag is runtime settable and does not require a rolling restart.
For more information on how to set database GFlags, see the Edit configuration flags section of the documentation.