Asana was unavailable for multiple hours on Monday, April 9, 2018. The web app was partially unavailable between 6:08am and 6:53am PDT, and fully unavailable between 6:53am and 7:53am. The API was partially unavailable between 6:08am and 7:42am, and fully unavailable between 7:42am and 9:15am. Both of these outages were reported on trust.asana.com, although our reporting of the API outage was incomplete.
Below is a summary of what happened and some of the measures we’re taking to prevent and mitigate incidents like this in the future.
A timeline of what happened
Requests on our webservers are handled by a large number of processes, each of which can serve multiple requests sequentially. Spinning up these processes from scratch is expensive, so by default we fork processes rather than spinning them up from scratch. In particular, we aim to create a single “zygote” process, and to create other processes by forking this zygote.
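Our actual webserver stack isn't shown here, but the zygote pattern itself is easy to illustrate. The sketch below (in Python, with hypothetical names like warm_up and fork_worker; it is not our production code) shows the idea: expensive initialization happens once in the zygote, and each worker is created by forking it, inheriting the warmed-up state essentially for free.

```python
import os
import time

def warm_up():
    """Stand-in for expensive startup work (loading code, warming caches).
    A zygote pays this cost once; forked children inherit the result."""
    time.sleep(0.05)
    return {"routes": ["/tasks", "/projects"]}

def fork_worker(state, conn_w):
    """Create a worker by forking the zygote. The child inherits `state`
    via copy-on-write, so it is ready to serve almost immediately."""
    pid = os.fork()
    if pid == 0:
        # Child process: prove it has the inherited state, then exit.
        os.write(conn_w, ",".join(state["routes"]).encode())
        os._exit(0)
    return pid  # parent (the zygote) gets the child's pid back

if __name__ == "__main__":
    state = warm_up()            # paid once, in the zygote
    r, w = os.pipe()
    pid = fork_worker(state, w)  # each worker costs only a fork
    os.close(w)
    os.waitpid(pid, 0)
    print(os.read(r, 1024).decode())
```

The fallback path mentioned above corresponds to calling warm_up again for every new worker, which is exactly the extra CPU cost that built up when forking broke.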
On the afternoon of Friday, April 6th, we deployed code to production that broke our ability to bring up new zygotes on webservers. This didn’t trigger alarms, and we have fallback behavior where webservers will simply bring up new processes from scratch whenever needed. The effect of this was that our webservers had higher CPU load than normal. The increased load went unnoticed over the weekend.
On Monday morning, around 6am, traffic grew to the point where the extra CPU load on webservers mattered. Webservers became overloaded, and over the course of 45 minutes Asana went from being impaired to being almost entirely unavailable.
An oncall engineer was first paged at 6:08am. They quickly escalated the issue to get a number of engineers involved. At 6:32am an engineer identified the increased forking failure rate as a likely cause of the problem, and at 6:54am the team rolled back the code to a previous release.
Once the code was reverted, webservers started recovering, and load on the main database began rising rapidly. At 7:09am, the database ran too low on memory due to a large number of mostly-inactive connections and started swapping. Its throughput dropped, and it was unable to keep up with requests. To unstick the database, engineers triggered a manual failover to a backup at 7:30am. Once the failover completed, the database had plenty of memory and became CPU bound instead. Clients of the database responded inconsistently to the overload; in particular, webservers backed off more aggressively than API servers, which froze webservers out entirely. In response, we shut off the API servers. This took enough load off the database for it to recover, and by 7:53am the web app had recovered.
We then began restoring the API. This took substantially longer than expected, and we have not yet completed our investigation into why.
Measures we’re taking
We’ve already improved our alerting for problems like this, which would have allowed us to fix the problem long before it impacted Asana users. We’ve also increased the size of our main database, which gives us more headroom.
In the short term, we’re also planning on updating our internal documentation to provide clearer instructions for how to respond to overload—since certain actions took us longer than they should have—and fixing some tools that behaved poorly when the database was overloaded. Finally, we’re actively looking into why the reporting for API downtime was incomplete on trust.asana.com.
In the medium term, we plan to improve the back-off behavior of all of our database clients. This should make it easier for the system to recover on its own from an overloaded database, and it should give us more precise tools for throttling problematic clients. For example, we should be able to stop the API from overloading the database without shutting it off entirely.
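One common way to make client back-off both consistent and fair is exponential backoff with full jitter. This is a sketch of that general technique, not our actual client code; the function name and parameters are illustrative.

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff.

    Each client picks a random delay in [0, min(cap, base * 2**attempt)].
    The randomness spreads retries out over time, and giving every client
    the same policy keeps one class of client (e.g. API servers) from
    starving another (e.g. webservers) when the database is overloaded.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With a shared policy like this, all clients slow down at a comparable rate under load, and the cap bounds the worst-case wait once the database recovers.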
Finally, in the long term, we plan to replace most of the functionality of the main database. The main database currently serves as a monolith that performs multiple unrelated functions. We plan to split this up into multiple services to reduce the blast radius when one of these services is overloaded. This will also allow strategies such as read replicas when appropriate.