Issue accessing web and mobile apps in ap-south-1 region

Incident Report for Zuper

Postmortem

Root Cause Analysis

The incident was triggered by a sudden spike in CPU utilization on the master instances, which led to them becoming unhealthy. This resulted in service unavailability in the AP-SOUTH-1 region.

Impact

Users in the AP-SOUTH-1 region were unable to access either the web or the mobile application.

Resolution

  • The server configuration was upgraded to handle higher CPU loads.
  • The number of replicas was increased to distribute traffic more effectively and improve overall system resilience.
  • As a result, the CPU utilization returned to normal levels, and service functionality was fully restored.

Remediation Items

  • Instance Scaling: Increase baseline configuration for master instances to better handle traffic spikes.
  • Auto-scaling Review: Fine-tune auto-scaling thresholds and triggers to respond faster to resource bottlenecks.
  • Monitoring Enhancements: Implement more granular CPU monitoring and alerting to detect early signs of saturation.
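
The monitoring item above can be illustrated with a minimal sketch of a sustained-CPU alert rule. It is not the monitoring setup used in production: the class name, the 80% threshold, and the three-sample window are all hypothetical values chosen for illustration. The idea is to alert only when every recent sample is high, so that a single short spike does not page anyone while sustained saturation does.

```python
from collections import deque


class SustainedCpuAlert:
    """Hypothetical rule: fire when every sample in the window is at or above the threshold."""

    def __init__(self, threshold_pct=80.0, window=3):
        # threshold_pct and window are assumed values, not taken from the incident.
        self.threshold_pct = threshold_pct
        self.samples = deque(maxlen=window)

    def observe(self, cpu_pct):
        """Record one CPU sample (percent) and return True if the alert should fire."""
        self.samples.append(cpu_pct)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s >= self.threshold_pct for s in self.samples)


if __name__ == "__main__":
    alert = SustainedCpuAlert(threshold_pct=80.0, window=3)
    for sample in [50, 85, 90, 95, 60, 85]:
        print(sample, alert.observe(sample))
```

A shorter window or a lower threshold detects saturation earlier at the cost of more false alarms; the right trade-off depends on how quickly additional capacity can be brought online.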
Posted Apr 15, 2025 - 06:03 UTC

Resolved

April 14, 2025 6:55 PM IST | 01:25 PM UTC

We are pleased to inform you that we have implemented a fix for existing logged-in users as well. No re-login is required, and all users should now experience normal functionality.

-------------------------------

April 14, 2025 6:05 PM IST | 12:35 PM UTC

We have resolved the issue affecting new user logins. Existing users can restore functionality by logging out and logging back in. We are actively working on a fix that restores full functionality for existing users without requiring a re-login.

-------------------------------

April 14, 2025 5:59 PM IST | 12:29 PM UTC

We have identified the potential cause of the issue impacting the AP-SOUTH-1 region and are deploying a fix.

-------------------------------

April 14, 2025 5:44 PM IST | 12:14 PM UTC

We are currently experiencing an issue with access to the web and mobile apps in the AP-SOUTH-1 region. Our team is actively working to resolve it as quickly as possible. Thank you for your patience.
Posted Apr 14, 2025 - 12:14 UTC