Environment
- YugabyteDB Anywhere - versions earlier than 2.25.1 (preview) and earlier than 2025.1
Issue
When a tserver node hosts a large number of tablets, Prometheus takes a long time to scrape its metrics, and the scrape times out because the default scrape timeout is 10 seconds.
This can be confirmed by running the following query in Prometheus:
scrape_duration_seconds{export_type="tserver_export", universe_uuid="<universe_uuid>"}
where <universe_uuid> is replaced with the UUID of the universe that has this issue.
In the query results, nodes whose scrape duration is at or near 10 seconds are hitting the scrape timeout.
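The check above can also be scripted. The following is a minimal sketch, assuming the query is sent to the standard Prometheus HTTP API (/api/v1/query) and the JSON response has already been parsed into a dict. The function names, the 90% margin, and the use of the "instance" label to identify a node are illustrative assumptions, not part of YBA.

```python
def build_query(universe_uuid: str) -> str:
    """Return the PromQL query from this article for one universe."""
    return ('scrape_duration_seconds{export_type="tserver_export", '
            f'universe_uuid="{universe_uuid}"}}')


def slow_targets(response: dict, timeout_s: float = 10.0,
                 margin: float = 0.9, label: str = "instance") -> list:
    """List (target, duration) pairs at or near the scrape timeout.

    `response` is the parsed JSON body of a Prometheus instant query.
    `margin` and `label` are assumptions; adjust them to your setup.
    """
    threshold = timeout_s * margin
    slow = []
    for series in response["data"]["result"]:
        # Instant-vector samples look like [<unix_ts>, "<value as string>"].
        duration = float(series["value"][1])
        if duration >= threshold:
            slow.append((series["metric"].get(label, "<unknown>"), duration))
    # Slowest targets first.
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

Any target returned by slow_targets is a candidate for the workaround described under Resolution.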
Resolution
Overview
This issue occurs because Prometheus takes too long to scrape the metrics from each tserver node and times out.
The scrape settings are tuned in releases 2025.1 and later, and the change is also available in preview releases starting with 2.25.1, as part of #24565.
Workaround
1) Drop unused databases or tables to reduce the number of tablets per tserver.
2) Increase the scrape interval and scrape timeout from 10 seconds to 20 seconds, or as needed, on the YBA node.
Replicated YBA - /opt/yugabyte/prometheus_configs
YBA installer - /opt/yba-ctl/yba-ctl.yml
Change scrape_interval to 20s and scrape_timeout to 20s.
Replicated YBA needs to be restarted after updating this YAML file.
For YBA installer, run yba-ctl reconfigure after updating this YAML file.
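Before restarting or reconfiguring, it can help to sanity-check the new values, since Prometheus rejects a configuration in which the scrape timeout is greater than the scrape interval. The sketch below is a hypothetical helper, not a YBA or Prometheus tool. It assumes single-unit durations such as 20s or 1m and does not handle compound values such as 1m30s.

```python
import re

# Seconds per unit for the simple duration forms handled here.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration_seconds(value: str) -> float:
    """Convert a single-unit duration such as '20s' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_scrape_settings(scrape_interval: str, scrape_timeout: str) -> None:
    """Raise ValueError if the timeout exceeds the interval."""
    interval_s = parse_duration_seconds(scrape_interval)
    timeout_s = parse_duration_seconds(scrape_timeout)
    if timeout_s > interval_s:
        raise ValueError(
            f"scrape_timeout ({scrape_timeout}) must not exceed "
            f"scrape_interval ({scrape_interval})"
        )
```

For example, check_scrape_settings("20s", "20s") passes silently, while check_scrape_settings("10s", "20s") raises ValueError. This is why the workaround raises both values together rather than only the timeout.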
Note: The workaround is not preserved through upgrades and must be reapplied if the system is upgraded from one affected release to another affected release.