Details on the January 22 outage

Asana had an outage on January 22, 2019. This blog post is a short postmortem of that outage.

On Monday January 21, we had multiple brief events where a database that stores metadata shared across our application became overloaded and unresponsive. We are in the process of moving to a microservices architecture, and some traffic was being sent through the new system. We built in retries for this system, and could see the retry code was running repeatedly. We initially believed that the problem was exacerbated by these retries, as we saw high concurrency for code in a critical section, so we disabled the new architecture. Around this time, traffic was coming down off of its peak, and the site appeared stable. We did additional follow-up work to investigate the events, looking at both data and code. We couldn’t reproduce the behavior when testing the code in question with what we believed was reasonable concurrency.

On Tuesday January 22, we had a repeat of these issues, with greater total downtime. When the database became unresponsive, we attempted to fail it over to backup. This process left both unavailable for an extended period. We took steps to shed load, which allowed partial recovery[1], but as soon as we allowed full traffic to return, we saw the failure repeat. Our web app was fully unavailable from 11:46-13:30, partially unavailable from 13:30-15:16, fully unavailable from 15:16-16:22, and partially unavailable until 17:16 (times in UTC).

We determined that the cause was an expensive synchronized operation within a critical section in the database (a transaction within a stored procedure). We made application changes which moved some of the shared state onto a separate database instance, sharing the load across two instances.The site recovered after taking this action, and we fully undrained.

We spent the rest of Tuesday doing follow-up and this time we were able to reproduce the problematic behavior. The critical section became slightly more expensive (especially under high concurrency) when IDs went from 15 digits to 16 digits, because the stored procedure was doing arithmetic using text. To illustrate:

mysql> select "999999999999000" + 999;

| "999999999999000" + 999 |
|         999999999999999 |
1 row in set (0.00 sec)

mysql> select "999999999999000" + 1000;
| "999999999999000" + 1000 |
|                     1e15 |
1 row in set (0.00 sec)

This code was implemented many years ago, re-using an existing field (which was text) as a counter. We fixed the stored procedure by changing it to treat the value as a bigint. Without this bug, the outage would have been brief.

We also found that application code intended to prevent contention via a distributed lock was not working, and fixed the use of the lock. This problem was caused by an identity check in a duck-typed language (js): the code passed 1 into a library, and tested that the result that returned was identical (===) to 1. The problem is that the library converted the value 1 to the string “1”, and returned the latter; the identity check failed, and the code determined that the lock was not already owned. Without this bug, the concurrency would not have been high enough to cause downtime.

We’ve also added monitoring and additional configuration for emergency responses, and are working on additional follow-up.

Looking back, we see that there were similar transient issues with this database. We looked into these and had action items for further investigation, but noted that the events resolved quickly. And looking at the data, there often appear to be multiple (potential) causes. There are two lessons to take from this:

First, investigating transient issues is important, as they may indicate underlying problems. When there are transient problems in a system with a negative feedback loop in terms of load[2], this likely indicates that the system is near the point of failure, and random fluctuations in load caused it to exceed the threshold. In that situation, it is important to understand what about the load caused the transient problem. If we had not experienced an inflection point in performance due to the floating point conversion, we might have gotten lucky; but hoping for luck is not a strategy.

Second, when doing testing, it’s important to test to failure. If we had done this on Monday (rather than only testing what we believed to be reasonable load), we would have identified the problem earlier. It’s important to determine where failure occurs, and what it looks like, in order to understand the behavior of a system. Similarly, while code inspection might not have detected the string conversion given how the stored procedure was written, if we had tested extreme inputs we would have observed the change in behavior.

In the long-term, we’re working on systemic changes to ensure that Asana can maintain high availability as we scale. For example, we’re working to eliminate single points of failure like this database, and we’re introducing more clearly separated failure domains so that outages in general will affect only a small portion of users. Together, these will reduce overall downtime as well as the downtime experienced by any team.

Enabling teams to achieve effortless collaboration is our top priority and we deeply appreciate your patience and support as we continue to refine and improve Asana for our customers around the world.

[1]  When shedding load under such circumstances, it’s our policy to prioritize web traffic from paying customers. As a result, the web app became usable for paying customers about 2 hours before we had fully recovered.
[2] A system with a negative feedback loop responds to overload in a way that doesn’t make the problem worse, e.g. by reducing work done or delaying it until the overload condition passes. In a positive feedback loop, overload causes additional work and the system fails quickly once it passes the threshold.

Would you recommend this article? Yes / No