Environment:
YugabyteDB Anywhere Version: All
Summary
This article explains how to troubleshoot and resolve a failed instance type upgrade in YugabyteDB Anywhere (YBA) when the underlying cause is a capacity shortage at the cloud provider (e.g., AWS, GCP, Azure). The primary symptom is a failed task in the YBA UI with an error message containing InsufficientInstanceCapacity, the error code AWS EC2 returns for this condition.
This is not a YugabyteDB error. It is reported by the cloud provider and means that the provider does not currently have enough capacity for the requested instance type in the specified region or availability zone.
Symptoms
When you try to change the instance type for a universe using the "Edit Universe" workflow in YBA, the task fails.
- UI State: The universe status may show as Resizing, and individual nodes that were part of the failed batch will show an inconsistent or failed state.
- Error Log: The task error log will contain a message similar to the following:
ybops.common.exceptions.YBOpsRuntimeError: Runtime error: Failed to start instance i-0df8b2d48a0d2c7fe: An error occurred (InsufficientInstanceCapacity) when calling the StartInstances operation (reached max retries: 4): Insufficient capacity..
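When triaging several failed tasks, it can help to pull the instance ID, error code, and API operation out of the log line. The following is a minimal sketch, assuming the message keeps the shape of the sample above; the pattern and the function name `parse_capacity_error` are illustrative and not part of YBA.

```python
import re

# Pattern derived from the sample message in this article; other YBA
# versions or providers may word the error differently (assumption).
CAPACITY_ERROR = re.compile(
    r"Failed to start instance (?P<instance_id>i-[0-9a-f]+): "
    r"An error occurred \((?P<code>\w+)\) "
    r"when calling the (?P<operation>\w+) operation"
)


def parse_capacity_error(log_line):
    """Return instance ID, error code, and operation, or None if no match."""
    match = CAPACITY_ERROR.search(log_line)
    return match.groupdict() if match else None


sample = (
    "ybops.common.exceptions.YBOpsRuntimeError: Runtime error: "
    "Failed to start instance i-0df8b2d48a0d2c7fe: An error occurred "
    "(InsufficientInstanceCapacity) when calling the StartInstances "
    "operation (reached max retries: 4): Insufficient capacity.."
)

info = parse_capacity_error(sample)
if info and info["code"] == "InsufficientInstanceCapacity":
    print(f"{info['instance_id']}: capacity shortage reported by {info['operation']}")
```

A match on the error code confirms that the failure came from the provider's API rather than from YBA itself.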
Cause
The InsufficientInstanceCapacity error is returned directly from the cloud provider's API. It signifies that the provider cannot currently provision the requested virtual machine instance type in the selected availability zone. This can happen due to high demand for that specific instance type in that geographical location.
Because the YBA task failed mid-process, some nodes may have been successfully upgraded while others were not, leaving the universe in an inconsistent state that requires manual intervention to resolve.
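To see which nodes were actually resized before the task failed, the instance type and state of each node can be read from the provider. The sketch below assumes AWS and an EC2 client created with boto3; the function name `group_by_instance_type` and the way instance IDs are supplied are hypothetical, and the IDs must be collected from your own universe.

```python
from collections import defaultdict


def group_by_instance_type(ec2_client, instance_ids):
    """Map each instance type to a list of (instance ID, state) pairs.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    groups = defaultdict(list)
    response = ec2_client.describe_instances(InstanceIds=list(instance_ids))
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            groups[instance["InstanceType"]].append(
                (instance["InstanceId"], instance["State"]["Name"])
            )
    return dict(groups)


# Example usage (requires boto3 and AWS credentials; region is an assumption):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-west-2")
# print(group_by_instance_type(ec2, ["i-0df8b2d48a0d2c7fe"]))
```

More than one key in the result indicates a mixed universe. This output is for diagnosis only and is useful to share with YugabyteDB Support.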
Solution
The following steps recover the universe from this partially modified state and complete the upgrade, if necessary with an alternative instance type.
Note: Steps involving metadata modification should be performed with guidance from YugabyteDB Support to avoid potential issues.
Step 1: Consult with Your Cloud Provider
- Contact your cloud provider's support team (e.g., AWS Support).
- Confirm the capacity issue for the requested instance type (e.g., r6a.4xlarge).
- Ask for a recommendation for an alternative instance type that has better availability in your target region (e.g., r5a.4xlarge).
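Before settling on an alternative, you can check that the candidate type is offered in every availability zone the universe uses. The sketch below assumes AWS and a boto3 EC2 client, and uses the EC2 DescribeInstanceTypeOfferings API; the function name `zones_offering` is illustrative. An offering only shows that the type is sold in a zone. It does not guarantee that capacity is available at the moment, so it does not replace the conversation with the provider.

```python
def zones_offering(ec2_client, instance_type):
    """Return the sorted availability zones (in the client's region)
    that list instance_type as an offering.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    zones = []
    kwargs = {
        "LocationType": "availability-zone",
        "Filters": [{"Name": "instance-type", "Values": [instance_type]}],
    }
    while True:
        page = ec2_client.describe_instance_type_offerings(**kwargs)
        zones.extend(o["Location"] for o in page["InstanceTypeOfferings"])
        token = page.get("NextToken")
        if not token:
            return sorted(zones)
        kwargs["NextToken"] = token  # fetch the next page of results


# Example usage (requires boto3 and AWS credentials; region is an assumption):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-west-2")
# print(zones_offering(ec2, "r5a.4xlarge"))
```

If a zone used by the universe is missing from the result, the candidate type cannot be used there regardless of capacity.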
Step 2: Reset the Failed Task in YBA (Requires Support)
Since the YBA task is in a failed state, a new operation cannot be started. The failed task must be cleared from the YBA metadata.
- Work with YugabyteDB Support to manually edit the YBA metadata and reset the failed task and inconsistent node states. This will bring the universe back to a stable state in the UI, even though the underlying instances are mixed.
Step 3: Add the New Instance Type to the YBA Provider Configuration
- In the YBA UI, navigate to the cloud provider configuration where your instances are managed.
- Add the new, recommended instance type (e.g., r5a.4xlarge) to the list of available instances for that provider and region.
Step 4: Re-trigger the Instance Type Upgrade
- Navigate back to your universe.
- Initiate a new Edit Universe task.
- In the instance type selection, choose the new, available instance type (r5a.4xlarge).
- Run the task and monitor its progress. It should now be able to acquire the necessary instances from the cloud provider and complete successfully.
Preventative Measures
To avoid encountering this issue in the future, especially in production environments:
- Proactive Planning: Before performing an instance type upgrade, consult with your cloud provider about the availability of the target instance types in your chosen regions.
- Avoid High-Demand Instances: If possible, choose instance types that are not subject to frequent capacity constraints. Your cloud provider's technical account manager can often provide guidance on this.
- Staggered Upgrades: For large clusters, consider upgrading nodes in smaller batches if the YBA version and cloud provider setup support it. This can reduce the immediate demand for a large number of instances.
Reference: SUPPORT-636