Asana experienced just under three hours of partial downtime on Monday, January 9, starting around 7:30am PST. The most immediate cause was too much load on too few web servers, which resulted from a confluence of three factors:
- Our autoscaling process failed to increase our web servers from their weekend level to their Monday level.
- Our usage on Monday was substantially higher than normal, presumably due to people returning to work after New Year's.
- When our web servers took on too much load, they fell over ungracefully and started operating much less efficiently than they normally do.
Once we became aware of the issue, we triggered a manual scale up of our web server fleet. We also started throttling traffic, which eventually resulted in Asana being available for all paying customers and 90% of free users. The incident ended once we had scaled up our web servers sufficiently and we stopped throttling incoming traffic.
The lead up
The prelude to the incident started on Friday, January 6, when our web server provisioner hung while trying to add a bad EC2 node to our fleet. Much of our auto-provisioning is handled by an auto scaling group, but the final portions of our process involve a cron job that finds newly launched web servers, ensures that they are configured correctly, and tells our load balancers to start sending traffic to them. This cron job takes a lock when it runs to ensure that only one copy is running at a time. This meant that when the job hung while attempting to talk to a new web server, no new copies of the job ran either. We didn’t have appropriate timeouts set for the job, so it was prepared to hang indefinitely. Later copies of the job timed out while waiting for the lock and raised alerts, but none of these alerts were paging, and they went unnoticed over the weekend.
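To illustrate the missing timeout described above, here is a minimal Python sketch. It is not our production provisioner: the `configure_host` and `provision` names, the per-host command, and the timeout value are all hypothetical, and the sketch assumes each host is configured by running an external command.

```python
import subprocess


def configure_host(host, command, timeout_seconds):
    """Run the (hypothetical) per-host configuration command.

    Returns True on success, False if the command fails or exceeds the timeout.
    """
    try:
        # subprocess.run kills the child process when the timeout expires,
        # so a hung host cannot block the job (or the lock it holds) forever.
        result = subprocess.run(
            command + [host],
            timeout=timeout_seconds,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def provision(hosts, command, timeout_seconds):
    """Configure each host in turn, skipping any that fail or hang."""
    ready, skipped = [], []
    for host in hosts:
        if configure_host(host, command, timeout_seconds):
            ready.append(host)
        else:
            skipped.append(host)
    return ready, skipped
```

With a bound like this, one bad node costs at most `timeout_seconds` per run, and the `skipped` list gives the job something concrete to alert on.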
We normally scale down web servers over the weekend and scale them up on Sunday night in preparation for Monday morning, when our traffic increases again. Our intention is to scale our web servers down to a conservative level, so that even if the scale-up fails we are able to handle Monday’s load, albeit with degraded performance. Unfortunately, the second Monday in January is an especially high-traffic day for us. We had, in fact, over-provisioned our fleet for the first week after New Year’s Day, but this being the second week, we were relying on our automated systems again. The combination of an extra-high-traffic day and our failure to scale up web servers meant that we were unable to handle all of the traffic we received.
When our web servers started receiving too much traffic, they responded by continuing to fork processes in order to handle all of the incoming requests. Eventually they ran into memory limits, and found themselves unable to fork processes anymore. They responded by trying to spin processes up from scratch, but this caused them to run low enough on memory that the OOM killer started killing processes.
This is bad to begin with, but it got worse when one of the OOM killer's victims was the master process. When the master process is dead, the web server will try to spin up a new one, but the web servers were in a state where this would basically never succeed. And with the master process dead, the cost of bringing up new processes was high enough to max out the CPU. So the web servers were briefly memory bound, but once the OOM killer became involved, each web server quickly became CPU bound.
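The failure mode above starts with a server accepting more work than its memory allows. One common mitigation is to cap the number of requests in flight and reject the excess quickly. This is a general technique, not a description of how our web servers work. The sketch below shows it under stated assumptions: `AdmissionLimiter`, `handle`, and the returned status codes are hypothetical names chosen for illustration.

```python
import threading


class AdmissionLimiter:
    """Caps how many requests may be processed at the same time."""

    def __init__(self, max_in_flight):
        # One semaphore slot per request we are willing to run concurrently.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self):
        # Non-blocking: returns False immediately when all slots are taken.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle(limiter, work):
    """Run work() if a slot is free; otherwise shed the request."""
    if not limiter.try_acquire():
        return 503  # overloaded: fail fast instead of spawning more work
    try:
        work()
        return 200
    finally:
        limiter.release()
```

A sensible limit would be derived from available memory divided by the memory each worker needs, so that the server sheds load before the OOM killer gets involved.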
Correcting the problem
Once we understood the problem, we took two corrective actions. The first was to un-stick the autoscaling job, and to monitor it to make sure it made progress as quickly as possible. The second was to throttle traffic. Bringing up new web servers has historically been a very slow procedure for us. However, over the past two months we’ve been working on improving this, and Monday’s event was a premature trial by fire. During the incident we scaled up web servers two separate times. The first time, we manually intervened to reduce the amount of code that needed to be copied to the new web servers. The intervention was a net positive, but we wasted time both before and during the intervention, and as a result, bringing up the new servers took one hour and forty minutes. The second time, bringing up new servers took 22 minutes.
Our traffic throttling worked at the load balancer level by blocking a fraction of traffic from free users; Premium users were not throttled. Once we started throttling, our web servers became healthy almost immediately. This allowed our paying customers and most of our free users to get back to work before the incident was completely over.
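The throttling rule described above can be sketched in a few lines of Python. This is an illustration only, not our load balancer configuration: the `should_block` function is hypothetical, and bucketing by a hash of the user ID (so that a given user is consistently allowed or blocked) is an assumption; blocking a random fraction of requests would be another way to do it.

```python
import hashlib


def should_block(user_id, is_premium, blocked_fraction):
    """Decide whether to throttle a request.

    Premium users are never blocked. Free users are blocked when the hash of
    their ID falls into the lowest `blocked_fraction` of the hash space.
    """
    if is_premium:
        return False
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(blocked_fraction * 2**64)
```

For example, `should_block(user_id, False, 0.1)` would turn away roughly one free user in ten, which matches the 90% availability figure for free users mentioned earlier, while every call with `is_premium=True` passes through.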
As we do for all production incidents, we ran a process to determine why it occurred and what we can do to mitigate or eliminate future incidents. We apologize sincerely for the disruption and appreciate your patience as we worked through this incident.