Issue accessing web and mobile apps in the ap-south-1 region

Incident Report for Zuper

Postmortem

Root Cause Analysis

The incident was triggered by a sudden spike in CPU utilization on the master instances, which led to them becoming unhealthy. This resulted in service unavailability in the AP-SOUTH-1 region.

Impact

Users in the AP-SOUTH-1 region were unable to access both the web and mobile applications.

Resolution

The server configuration was upgraded to handle higher CPU loads.
The number of replicas was increased to distribute traffic more effectively and improve overall system resilience.
As a result, CPU utilization returned to normal levels and service functionality was fully restored.

Remediation Items

Instance Scaling: Increase baseline configuration for master instances to better handle traffic spikes.
Auto-scaling Review: Fine-tune auto-scaling thresholds and triggers to respond faster to resource bottlenecks.
Monitoring Enhancements: Implement more granular CPU monitoring and alerting to detect early signs of saturation.
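
As an illustration of the monitoring item above, the following is a minimal sketch of a sustained-saturation check: it flags the point at which CPU utilization has stayed at or above a threshold for several consecutive samples, rather than alerting on a single spike. The function name, the 80% threshold, and the three-sample window are assumptions chosen for the example and do not describe the actual alerting configuration.

```python
from typing import Iterable, List


def saturation_alerts(
    samples: Iterable[float],
    threshold: float = 80.0,  # hypothetical CPU % threshold
    sustain: int = 3,         # hypothetical number of consecutive samples
) -> List[int]:
    """Return the sample indices at which CPU utilization has been at or
    above `threshold` for exactly `sustain` consecutive samples.

    Each sustained run produces one alert, at the moment it first
    qualifies, so a long saturation period does not generate repeats.
    """
    alerts: List[int] = []
    streak = 0
    for index, cpu_percent in enumerate(samples):
        # Extend the streak while saturated; reset on any healthy sample.
        streak = streak + 1 if cpu_percent >= threshold else 0
        if streak == sustain:
            alerts.append(index)
    return alerts


if __name__ == "__main__":
    readings = [50, 85, 90, 95, 97, 60, 88]
    print(saturation_alerts(readings))
```

Requiring a short run of high readings before alerting is one way to catch early saturation while avoiding noise from momentary spikes; a lower threshold or shorter window trades more false alarms for earlier detection.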

Posted Apr 15, 2025 - 06:03 UTC

Resolved

April 14, 2025 6:55 PM IST | 01:25 PM UTC

We are pleased to inform you that we have implemented a fix for existing logged-in users as well. No re-login is required, and all users should now experience normal functionality.

-------------------------------

April 14, 2025 6:05 PM IST | 12:35 PM UTC

We have resolved the issue affecting new user logins. Existing users can restore functionality by logging out and logging back in. We are actively working on a fix to restore full functionality for existing users without requiring a re-login.

-------------------------------

April 14, 2025 5:59 PM IST | 12:29 PM UTC

We have identified the potential cause of the issue impacting the AP-SOUTH-1 region and are deploying a fix.

-------------------------------

April 14, 2025 5:44 PM IST | 12:14 PM UTC

We are currently experiencing an issue with access to the web and mobile apps in the AP-SOUTH-1 region. Our team is actively working to resolve it as quickly as possible. Thank you for your patience.

Posted Apr 14, 2025 - 12:14 UTC