Table of Contents
Environment
- YugabyteDB YSQL
Issue
When creating an index in YugabyteDB (or PostgreSQL), you may encounter the error:
ERROR: timed out waiting for postgres backends to catch up
DETAIL: 1 backends on database 13515 are still behind catalog version 2.
HINT: Run the following query on all tservers to find the lagging backends: SELECT * FROM pg_stat_activity WHERE backend_type != 'walsender' AND backend_type != 'yb-conn-mgr walsender' AND catalog_version < 2 AND datid = 13515;
This error indicates that some backend Postgres processes are still using an outdated catalog version, blocking the index creation.
Additional Information:
- The yb-master UI shows the task as waiting on catalog version.
Resolution
Overview
The most common cause is a session in an idle in transaction state or long-running queries or transactions. This happens when a user or process connects to the database, runs BEGIN, and then leaves the session open without committing or rolling back. These sessions prevent the catalog from advancing, blocking DDL operations like CREATE INDEX.
Steps
Identify the catalog version the build is waiting on
On the same node where the CREATE INDEX session is running, open a second YSQL session and run the following query. It joins pg_stat_progress_create_index with pg_stat_activity to expose the current phase, wait event, and the catalog_version the build is waiting on:
SELECT
i.relname AS indexname,
t.relname AS tablename,
p.phase,
p.tuples_total,
p.tuples_done,
CASE
WHEN p.tuples_total > 0
THEN ROUND((p.tuples_done::numeric / p.tuples_total::numeric) * 100, 2)
ELSE 0
END AS percent_complete,
now() - a.backend_start AS time_elapsed,
a.wait_event_type,
a.wait_event,
a.catalog_version,
a.datid,
a.datname
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON p.pid = a.pid
JOIN pg_class i ON p.index_relid = i.oid
JOIN pg_class t ON p.relid = t.oid;
Example output:
indexname | tablename | phase | tuples_total | tuples_done | percent_complete | time_elapsed | wait_event_type | wait_event | catalog_version | datid | datname
-----------------------+-----------+--------------+--------------+-------------+------------------+-----------------+-----------------+------------------+-----------------+-------+-----------
idx_employees_email_1 | employees | initializing | 3000000 | 0 | 0.00 | 01:48:36.856407 | RPCWait | WaitingOnTServer | 9 | 13515 | yugabyte
Interpret the row as follows:
-
phase = initializingwithwait_event = WaitingOnTServerconfirms the build is stuck in the catalog-propagation phase, not in the actual backfill. -
catalog_version(9in the example) is the version the backend is waiting for all other backends to reach. Use this value as the threshold in the next step. -
datid(13515in the example) is the OID of the database in which the index is being created. Catalog versions are tracked per database, so this value must also be carried forward as a filter in the next step to avoid false positives from sessions in unrelated databases. -
time_elapsedtells you how long the build has been blocked.
pg_stat_progress_create_index is local to the node that received the CREATE INDEX statement. Run this query from the same node as the CREATE INDEX session, otherwise it returns no rows.Find the lagging backends on all tservers
Connect to each tserver in the universe and list the backends whose catalog_version is below the value from the previous step, scoped to the database where the index is being built. Replace 9 with the catalog_version and 13515 with the datid you observed:
SELECT *
FROM pg_stat_activity
WHERE backend_type != 'walsender' AND backend_type != 'yb-conn-mgr walsender'
AND catalog_version < 9
AND datid = 13515; -- Replace with actual datid
Each row returned is a session that has not yet refreshed its catalog cache and is therefore blocking the index build. Note the pid, datname, usename, application_name, client_addr, state, backend_start, xact_start, and query columns for each row — you will need these to identify the owning application before taking action.
Apply a solution
Once the lagging backends are identified, choose one of the following solutions based on whether the blocking sessions are expected to complete on their own.
Solution 1: Increase Timeout
- If you know that the blocking sessions will complete in a particular time, you can increase the timeout for waiting for backends to catch up by setting the
yb_wait_for_backends_catalog_version_timeoutflag to a higher value (e.g., 10 minutes):
set yb_wait_for_backends_catalog_version_timeout to '10min';
Solution 2: Verify, Confirm and Terminate Blocking Backends
application_name, usename, client_addr, and query columns from pg_stat_activity to identify the owner and obtain explicit approval before proceeding.Once approved, terminate each blocking session by its pid:
SELECT pg_terminate_backend(<pid>);
Replace <pid> with the process ID from the pg_stat_activity output of the Find the lagging backends on all tservers step.
Retry Index Creation
After terminating the blocking sessions, retry your CREATE INDEX statement.
-
If the Issue Persists
- Re-run the diagnostic to check for new blocking sessions.
- Investigate which user or process is creating these sessions and address the root cause (e.g., application code leaving transactions open or long-running queries).
Additional Information:
- This issue is common in environments where applications or users leave transactions open unintentionally.
- For persistent issues, review the application connection and transaction management logic.
- If customer is okay with terminating backends without verification. We can run following query in all tservers.
SELECT
pg_terminate_backend(pid),
pid,
usename,
datname,
backend_type,
round(extract(epoch FROM (clock_timestamp() - query_start))::numeric, 2) AS duration_seconds,
substring(query FROM 1 FOR 60) AS short_query
FROM pg_stat_activity
WHERE backend_type != 'walsender'
AND catalog_version < 1134
AND datid = 16384
Comments
0 comments
Please sign in to leave a comment.