Environment
- Yugabyte Platform 2.6 or older
Issue
- Changing the instance size from Platform (for example, from 8 cores to 16 cores on a 9-node universe) fails with the following error:
Caused by: org.yb.client.TabletServerErrorException: Server[YB Master - <IP>:7100] ILLEGAL_STATE[code 9]: Leader is not ready for Config Change, can try again. Num peers in transit: 1. Type: REMOVE_SERVER.
- The YB-Master UI shows both old and new nodes: in this case, 18 nodes instead of 9. It also shows 4 master nodes instead of 3.
- The database remains functional and accepts traffic, but the cluster configuration is in an incorrect state.
- This issue is tracked internally as PLAT-1717.
Resolution
This issue is fixed in version 2.8 by the following commit:
https://github.com/yugabyte/yugabyte-db/commit/aef32082b90cad815af60f39e6848da5d043a208
Workaround
The following manual steps restore the cluster configuration to a healthy state.
1. Confirm that the old nodes are blacklisted by checking the cluster config in the UI. If they are not, blacklist the old tablet servers with the following command:
~/master/bin/yb-admin -master_addresses $MASTERS change_blacklist ADD node1:9100 node2:9100 node3:9100 node4:9100 node5:9100 node6:9100
2. Force the election of a new master leader with a leader stepdown:
~/master/bin/yb-admin -master_addresses $MASTERS master_leader_stepdown <UUID of old master>
3. Remove the old nodes from the UI (as shown in the screenshot).
4. Remove the old masters from the Raft group:
~/master/bin/yb-admin -master_addresses $MASTERS change_master_config REMOVE_SERVER <old_master_ip> 7100
5. Update tserver_master_addrs on all tablet servers to point to the new masters, so that the tablet servers heartbeat to the new masters after the old masters are removed from the quorum:
~/tserver/bin/yb-ts-cli --server_address=<tserver_ip>:9100 set_flag --force tserver_master_addrs <hostname:port>
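Because step 5 has to be repeated on every tablet server, a small loop can help. The following is only a sketch: the function name update_tserver_master_addrs and the YB_TS_CLI and DRY_RUN variables are hypothetical helpers introduced here, and port 9100 is taken from the command above. With DRY_RUN set, the function only prints the commands it would run, so they can be reviewed first.

```shell
# Sketch: apply the step 5 command to each tablet server in turn.
# Usage: update_tserver_master_addrs <new_master_list> <tserver_host>...
# YB_TS_CLI and DRY_RUN are hypothetical helper variables for this sketch.
update_tserver_master_addrs() (
  new_masters="$1"
  shift
  ts_cli="${YB_TS_CLI:-$HOME/tserver/bin/yb-ts-cli}"
  for ts in "$@"; do
    # With DRY_RUN set, "echo" is prepended and the command is only printed.
    ${DRY_RUN:+echo} "$ts_cli" --server_address="$ts:9100" \
      set_flag --force tserver_master_addrs "$new_masters"
  done
)

# Example (dry run; host names are placeholders):
#   DRY_RUN=1 update_tserver_master_addrs "newm1:7100,newm2:7100,newm3:7100" ts1 ts2 ts3
```

Running with DRY_RUN first makes it easier to check that the master list and tablet server addresses are correct before any flag is changed.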
Root Cause
A ChangeConfig call that adds a master returns to the caller immediately and then bootstraps the added node asynchronously. Until the bootstrap completes, the added node is in the PRE_VOTER state and no new ChangeConfig will succeed. Platform, however, immediately issues a ChangeConfig to remove an old master, which fails with "Leader is not ready for Config Change, can try again". Platform does not appear to retry and instead aborts the entire full move. Small sys catalog tablets bootstrap quickly, often before Platform makes the second call, so the issue may not appear there. With a larger sys catalog, such as the 44 MB one in this case, the chance of hitting it is higher.
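The error text itself suggests retrying. When running the REMOVE_SERVER command manually, a retry wrapper along the following lines may help. This is a minimal sketch and not the fix shipped in 2.8: the function name retry_config_change and the MAX_ATTEMPTS and SLEEP_SECS variables are hypothetical, and it assumes the command prints the "Leader is not ready for Config Change" message to its output when the added master is still bootstrapping.

```shell
# Sketch: retry a command while it reports the transient "not ready" error.
# MAX_ATTEMPTS and SLEEP_SECS are illustrative knobs, not Platform settings.
retry_config_change() (
  max_attempts="${MAX_ATTEMPTS:-10}"
  sleep_secs="${SLEEP_SECS:-5}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Capture stdout and stderr so the error text can be inspected.
    if output="$("$@" 2>&1)"; then status=0; else status=$?; fi
    printf '%s\n' "$output"
    case "$output" in
      *"Leader is not ready for Config Change"*) ;;  # transient: retry
      *) exit "$status" ;;                           # success or another error
    esac
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_secs"
    fi
    attempt=$((attempt + 1))
  done
  exit 1  # still not ready after max_attempts
)

# Example:
#   retry_config_change ~/master/bin/yb-admin -master_addresses $MASTERS \
#     change_master_config REMOVE_SERVER <old_master_ip> 7100
```

The wrapper stops as soon as the command succeeds or fails with any other error, so only the transient PRE_VOTER condition is retried.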