Root Cause Analysis
The incident was triggered by a sudden spike in CPU utilization on the master instances, which caused them to become unhealthy and made the service unavailable in the AP-SOUTH-1 region.
Impact
Users in the AP-SOUTH-1 region were unable to access either the web or the mobile application.
Resolution
- The server configuration was upgraded to handle higher CPU loads.
- The number of replicas was increased to distribute traffic more effectively and improve overall system resilience.
- As a result, CPU utilization returned to normal levels and service was fully restored.
Remediation Items
- Instance Scaling: Increase baseline configuration for master instances to better handle traffic spikes.
- Auto-scaling Review: Fine-tune auto-scaling thresholds and triggers to respond faster to resource bottlenecks.
- Monitoring Enhancements: Implement more granular CPU monitoring and alerting to detect early signs of saturation.
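
As an illustration of the auto-scaling and monitoring items above, the following is a minimal sketch of a sustained-threshold check on CPU samples. It is not the configuration used in this incident: the class name CpuSaturationMonitor, the window size, and the 70% / 85% thresholds are all hypothetical values chosen for the example, and a real deployment would more likely express this logic in the cloud provider's alarm and scaling policies.

```python
from collections import deque
from statistics import mean


class CpuSaturationMonitor:
    """Classifies recent CPU utilization as ok, warn, or scale_out.

    Hypothetical helper for illustration only. The default window and
    thresholds are assumptions, not values taken from the incident.
    """

    def __init__(self, window=5, warn_pct=70.0, scale_pct=85.0):
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        self.warn_pct = warn_pct             # early-warning alert threshold
        self.scale_pct = scale_pct           # threshold that requests more capacity

    def observe(self, cpu_pct):
        """Record one CPU sample (0-100) and return the current state."""
        self.samples.append(cpu_pct)
        if len(self.samples) < self.samples.maxlen:
            # Not enough data yet to judge a sustained trend.
            return "ok"
        avg = mean(self.samples)
        if avg >= self.scale_pct:
            return "scale_out"
        if avg >= self.warn_pct:
            return "warn"
        return "ok"


if __name__ == "__main__":
    monitor = CpuSaturationMonitor(window=3)
    for sample in [50, 60, 75, 80, 95, 99]:
        print(sample, monitor.observe(sample))
```

Averaging over a short window rather than reacting to a single sample is one way to separate sustained saturation from brief bursts. A smaller window reacts faster but produces more false alarms, which is the trade-off the threshold review would need to settle.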