Asana had downtime twice in the past week: 2 hours on July 29 and 50 minutes on August 1. We are still investigating, but want to let you know what we understand so far, and what changes we have made and are making to prevent a recurrence in the short term.
Both of these incidents were caused by high database contention. This contention occurred on a database that holds data shared across all teams and workspaces (e.g. user authentication methods, like OAuth applications). We’ve been working on breaking up our monolithic application into services, and one of our goals with that project is to replace this database with a more scalable distributed data store. That work is ongoing, so for now we still write to this database for objects that are shared across teams and workspaces.
In the short term, we’ve reduced the write frequency for a property that appears to have triggered the start of both events (the OAuth application last accessed time, which was by far the most frequently updated property). We’ve also fixed some bugs and modified database configuration to reduce deadlocks and contention, and updated our emergency response procedures to improve how we recover from such outages in the future.
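To illustrate the general idea of reducing write frequency for a "last accessed" property, here is a minimal sketch in Python. It is not our actual implementation: the class name `ThrottledLastAccessed`, the `write` callback, and the `min_interval_seconds` parameter are all hypothetical, and the in-memory cache stands in for whatever state a real service would keep. The sketch skips a write when the same application's timestamp was already written within the interval.

```python
import time
from typing import Callable, Dict


class ThrottledLastAccessed:
    """Hypothetical helper: write a 'last accessed' timestamp at most
    once per interval for each application, instead of on every access."""

    def __init__(
        self,
        write: Callable[[str, float], None],  # assumed callback that performs the database write
        min_interval_seconds: float = 3600.0,  # illustrative value, not a real setting
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._write = write
        self._min_interval = min_interval_seconds
        self._clock = clock
        # Time of the most recent write per application, kept in memory.
        self._last_written: Dict[str, float] = {}

    def touch(self, app_id: str) -> bool:
        """Record an access. Returns True if a write was issued."""
        now = self._clock()
        last = self._last_written.get(app_id)
        if last is not None and now - last < self._min_interval:
            # A recent enough timestamp is already stored; skip the write.
            return False
        self._write(app_id, now)
        self._last_written[app_id] = now
        return True
```

The trade-off in a scheme like this is that the stored timestamp can lag the true last access by up to the chosen interval, in exchange for far fewer writes to a contended table.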
We are committed to Asana being a service that you can consistently rely on, and this week did not meet our standards. We’ll provide a follow-up post with more information about what happened as soon as the investigation is complete, along with what we plan on doing to address this class of problem in the future (both near- and long-term).