API Outage
Incident Report for CareApp
Postmortem

Background

CareApp's API uses an AWS Auto Scaling Group to automatically scale the number of servers (instances) in line with current traffic. When CareApp is busy, more servers are automatically started, and when things quieten down, servers are automatically switched off.

Our infrastructure is configured to automatically spread instances across two availability zones, ap-southeast-2a and ap-southeast-2b, in order to improve availability. However, our scaling group was configured with a minimum server count of 1, meaning it was possible for the Auto Scaling Group (ASG) to have only one instance running, in a single availability zone.

AWS Outage

AWS experienced an incident that affected the ability of application instances to start and run. We saw this outage as an inability to launch app instances in ap-southeast-2b; ap-southeast-2a did not appear to be affected.

CareApp Outage

At the time of the incident, traffic to CareApp was light, so the scaling group had scaled down to a single instance. Unfortunately, that instance was hosted in ap-southeast-2b, and it became unresponsive. Although the instance was failing Load Balancer health checks, it was still passing EC2 health checks, so the scaling group did not remove and replace it.
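The gap between the two kinds of health check can be illustrated with a small model. This is a simplified sketch, not CareApp's actual configuration: the class and function names are hypothetical, and the rule it encodes is the general AWS behaviour that a group with health check type "EC2" looks only at EC2 status checks, while type "ELB" also takes the load balancer's health checks into account.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    """Hypothetical snapshot of one instance's health signals."""
    ec2_status_ok: bool  # EC2 status checks: the instance itself is up
    elb_check_ok: bool   # load balancer check: the application responds


def asg_considers_healthy(health: InstanceHealth, health_check_type: str) -> bool:
    """Simplified model of how a scaling group judges an instance.

    "EC2": only EC2 status checks count.
    "ELB": EC2 status checks and load balancer checks must both pass.
    """
    if health_check_type == "EC2":
        return health.ec2_status_ok
    if health_check_type == "ELB":
        return health.ec2_status_ok and health.elb_check_ok
    raise ValueError(f"unknown health check type: {health_check_type!r}")


# The situation described above: the instance is up, but the app is unresponsive.
stuck = InstanceHealth(ec2_status_ok=True, elb_check_ok=False)
kept_in_service = asg_considers_healthy(stuck, "EC2")      # instance is left alone
replaced_under_elb = not asg_considers_healthy(stuck, "ELB")  # instance would be replaced
```

Under this model, the unresponsive instance stays in service when only EC2 checks are considered, and is marked for replacement once load balancer checks are included.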

Immediate Resolution

Terminating the failing instance triggered the ASG to create a new instance, which started in the unaffected AZ and operated as normal.

Remediation

To reduce the likelihood of this problem recurring, we have:

  • Increased the minimum instance count for our ASG, so there will always be instances running in multiple availability zones.
  • Changed the configuration of our Auto Scaling Group so that an instance is terminated (and a new one started) as soon as it fails Load Balancer health checks, not only when it fails EC2 health checks.
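
For readers who want to apply similar changes, the two remediation items could be expressed roughly as follows using boto3. This is a hedged sketch rather than the configuration we actually deployed: the group name `careapp-api-asg`, the minimum size of 2, and the 300-second grace period are illustrative assumptions, since the report does not state the real values. `MinSize`, `HealthCheckType` and `HealthCheckGracePeriod` are parameters of boto3's `update_auto_scaling_group` call.

```python
def build_asg_update(group_name: str, min_size: int, grace_period_s: int) -> dict:
    """Build parameters for update_auto_scaling_group (values are illustrative)."""
    if min_size < 2:
        # With a minimum of 1, the group can shrink to a single availability zone.
        raise ValueError("min_size must be at least 2 to span two availability zones")
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": min_size,
        # "ELB" makes the group act on load balancer health checks
        # in addition to EC2 status checks.
        "HealthCheckType": "ELB",
        # Seconds to wait after launch before health checks count against an instance.
        "HealthCheckGracePeriod": grace_period_s,
    }


def apply_asg_update(params: dict, region: str = "ap-southeast-2") -> None:
    """Send the update to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported here so the parameters can be built without boto3 installed

    client = boto3.client("autoscaling", region_name=region)
    client.update_auto_scaling_group(**params)


# Hypothetical group name and values, for illustration only.
params = build_asg_update("careapp-api-asg", min_size=2, grace_period_s=300)
```

Calling `apply_asg_update(params)` would push the change. Note that the group's maximum size must be at least as large as the new minimum.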
Posted Dec 10, 2020 - 10:18 ACDT

Resolved
CareApp API unresponsive due to partial outage in AWS Sydney data centre
Posted Oct 22, 2020 - 09:44 ACDT