CareApp's API uses an AWS Auto Scaling Group to automatically scale the number of servers (instances) running in line with current traffic. When CareApp is busy, more servers are automatically started, and when things quieten down, servers are automatically switched off.
Our infrastructure is configured to automatically spread instances across the two availability zones, ap-southeast-2a and ap-southeast-2b, in order to improve availability. However, our scaling group was configured with a minimum server count of 1, meaning it was possible for the Auto Scaling Group (ASG) to have only 1 instance running in a single availability zone.
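To illustrate why a minimum of 1 undermines the two-zone setup, here is a minimal sketch. It assumes a simple round-robin placement, which is a simplification of how an ASG balances instances across zones. The function name `place_instances` is hypothetical, and which zone receives the sole instance is arbitrary in this model.

```python
from typing import Dict, List


def place_instances(instance_count: int, zones: List[str]) -> Dict[str, int]:
    """Spread instances across zones round-robin (a simplified model,
    not the actual ASG placement algorithm)."""
    placement = {zone: 0 for zone in zones}
    for i in range(instance_count):
        placement[zones[i % len(zones)]] += 1
    return placement


def zones_covered(placement: Dict[str, int]) -> int:
    """Count the zones that hold at least one instance."""
    return sum(1 for count in placement.values() if count > 0)


zones = ["ap-southeast-2a", "ap-southeast-2b"]

# With a single instance, only one zone is covered, so an outage
# in that zone takes out all capacity.
single = place_instances(1, zones)

# With two instances, both zones are covered.
pair = place_instances(2, zones)
```

In this model, any instance count below the number of zones leaves at least one zone empty, so a problem in the occupied zone affects every running instance.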
AWS experienced an incident that affected the ability of application instances to start and run. We observed this outage as an inability to launch app instances in ap-southeast-2b, while ap-southeast-2a did not seem to be affected.
At the time of the incident, traffic to CareApp was light, meaning the scaling group had scaled down to a single instance. Unfortunately, that instance was hosted in ap-southeast-2b, and it became unresponsive. Although the instance was failing Load Balancer health checks, it was still passing EC2 health checks, and therefore was not removed from the scaling group.
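The gap between the two kinds of health check can be sketched as follows. This is an illustrative model only, not AWS code. It assumes the group was using the `EC2` health check type (the default for an Auto Scaling Group), under which load balancer results are ignored when deciding whether to replace an instance. The names `InstanceHealth` and `asg_considers_unhealthy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    ec2_status_ok: bool  # EC2 status checks on the instance itself
    elb_healthy: bool    # load balancer health check against the application


def asg_considers_unhealthy(health: InstanceHealth, health_check_type: str) -> bool:
    """Model of whether an ASG would mark an instance for replacement."""
    if health_check_type == "EC2":
        # Only EC2 status checks count; load balancer results are ignored.
        return not health.ec2_status_ok
    if health_check_type == "ELB":
        # Failing either the EC2 checks or the load balancer checks counts.
        return not (health.ec2_status_ok and health.elb_healthy)
    raise ValueError(f"unknown health check type: {health_check_type}")


# The situation described above: unresponsive to the load balancer,
# but still passing EC2 checks.
stuck_instance = InstanceHealth(ec2_status_ok=True, elb_healthy=False)

replaced_with_ec2_checks = asg_considers_unhealthy(stuck_instance, "EC2")
replaced_with_elb_checks = asg_considers_unhealthy(stuck_instance, "ELB")
```

Under the assumed `EC2` setting, the unresponsive instance is never flagged, which matches what happened here. Under `ELB`, the same instance would be marked unhealthy and replaced.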
Terminating the failing instance triggered the ASG to create a new instance, which started in the unaffected AZ and operated as normal.
To reduce the likelihood of this problem reoccurring, we have: