In older versions of YugabyteDB, we have seen cases where large WAL entries were persisted on the tablet leader but could not be replicated to the followers, leaving tablets in an unhealthy or unusable state. Typical symptoms:
- Tablets cannot acquire a leader lease
- Snapshot creation fails
- Backup fails
This issue is fixed in the versions listed below. Use the steps in this article only on YugabyteDB versions earlier than these releases.
- YugabyteDB 22.214.171.124
- YugabyteDB 126.96.36.199
- YugabyteDB 188.8.131.52
- YugabyteDB 184.108.40.206
- YugabyteDB 220.127.116.11
Note:
- Use these steps only when the leader has accepted the message but the followers fail to accept it, which only happens in YugabyteDB versions earlier than those listed above.
- If the leader itself is not accepting the large messages, the user should try a different approach, such as reducing the prefetch.
- Increasing rpc_max_message_size far beyond what is needed (e.g., to 512MB or more) is not recommended.
To fix this issue, increase the rpc_max_message_size value to more than the size of the largest message.
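- To find the largest message size, run the following command against all yb-tserver logs. The sort is numeric (-n) so that the largest value is listed last even when the sizes differ in digit count.
grep 'The frame had a length of' yb-tserver.INFO | grep 'tcp_stream.cc' | sed 's/.* length of //g' | sed 's/, but we .*)//g' | sort -n | uniq | tail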
Example: In the output below, 300298043 is the largest message size.
grep -r 'The frame had a length of' | grep 'tcp_stream.cc' | sed 's/.* length of //g' | sed 's/, but we .*)//g' | sort -n | uniq | tail
299128732
300298041
300298042
300298043
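The lookup above can also be wrapped in a small helper that picks the maximum and adds some headroom. This is a minimal sketch, not an official tool: the function names, the 8 MiB default headroom, and the log path in the usage comment are assumptions made for illustration, and the log line layout (the size immediately following "length of") is inferred from the sed expressions in the command above.

```shell
#!/usr/bin/env bash

# Print the largest rejected frame length (in bytes) found in the given log files.
# Assumes each matching line contains "... length of <bytes>, but we ...".
largest_frame() {
  grep -h 'The frame had a length of' "$@" \
    | grep 'tcp_stream.cc' \
    | sed -n 's/.* length of \([0-9][0-9]*\).*/\1/p' \
    | sort -n \
    | tail -n 1
}

# Print a candidate rpc_max_message_size: the largest frame plus some headroom.
# The 8 MiB default headroom is an arbitrary illustrative choice.
suggest_max_message_size() {
  local largest=$1
  local headroom_mb=${2:-8}
  local suggested=$(( largest + headroom_mb * 1024 * 1024 ))
  # Warn when the result reaches the range this article advises against.
  if [ "$suggested" -ge $(( 512 * 1024 * 1024 )) ]; then
    echo "warning: $suggested bytes is 512MB or more, which is not recommended" >&2
  fi
  echo "$suggested"
}

# Usage (hypothetical log location):
#   largest=$(largest_frame /path/to/logs/yb-tserver.INFO*)
#   suggest_max_message_size "$largest"
```

The numeric sort matters here: a plain lexicographic sort would rank 99999999 above 300298043 and report the wrong maximum.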
- Once we have the largest message size, set rpc_max_message_size to a value a few megabytes larger than that message. For example:
yb-ts-cli --server_address hostname:9100 set_flag rpc_max_message_size 333256758 --force
- Once the tablets become healthy, revert the GFlag to its default value of 256MB, and advise the user to avoid sending such large messages.
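If the flag has to be changed on more than one yb-tserver (for example, on each follower that rejects the message), the set_flag command shown above can be repeated per server. The sketch below is a hedged illustration only: the function name, the DRY_RUN switch, and the host names are hypothetical, and the yb-ts-cli invocation is copied unchanged from the example in this article. By default it only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash

# Apply rpc_max_message_size=<value> on each listed yb-tserver address.
# Usage: apply_rpc_max_message_size <value-in-bytes> <host:port> [<host:port> ...]
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually run yb-ts-cli.
apply_rpc_max_message_size() {
  local value=$1
  shift
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo yb-ts-cli --server_address "$host" set_flag rpc_max_message_size "$value" --force
    else
      yb-ts-cli --server_address "$host" set_flag rpc_max_message_size "$value" --force
    fi
  done
}

# Example with hypothetical hosts (prints the commands without running them):
#   apply_rpc_max_message_size 333256758 tserver1:9100 tserver2:9100
```

The same function can be reused for the final step: once the tablets are healthy, call it again with the default value to revert the flag on every server.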