Last week, Asana went down three times over the course of three days. Here is the story of what happened:
Around two months ago, we noticed that our database (which is hosted on Amazon’s RDS) was growing quickly. To pre-empt any issues, we scheduled time on Saturday, September 8th, to resize it. We expected that this operation would improve Asana’s overall performance and result in little-to-no downtime.
All indications suggested that the resizing had gone off without a hitch, but on Monday, September 10th, as the rush of morning traffic from Europe and the East Coast reached its peak, our Amazon RDS MySQL database started struggling. After slowing the app to an intolerable pace, the database issues eventually brought Asana down.
We began our investigation within minutes of the first sign of slowness. None of our key database metrics seemed abnormal, but we did see an increase in average read and write latency from the database to the underlying EBS storage volumes.
In case some of our recent code updates had increased the load on the database, the first thing we did was revert every change we’d made since the previous Thursday. When the issues persisted, we began to suspect that Saturday’s resize operation had affected the performance of the EBS volumes. Next, we triggered a manual failover to the secondary database instance. In the past, we’d had performance issues caused by contention from other RDS customers, and failing over to new hardware had helped address them. This time, the impact was minimal.
By 10:30am, Asana was back up after we successfully throttled traffic. But just as we started to let everyone back in, GoDaddy’s DNS went down, a totally unrelated incident that affected us (and millions of other sites on the internet), cutting off anyone who tried to access Asana by navigating to our homepage. Changes we made as we tried to address our own underlying problem triggered another very brief outage around noon.
We spent all of Monday attempting to diagnose the root cause of the morning’s outages and to decrease Asana’s dependence on the database. Despite our success with the latter, Asana slowed to a crawl again the next morning. We decided to roll back some of the infrastructure changes we’d made, just in case they were contributing to the problem.
When we initiated the rollback, the increased load from the deployment was just enough to cause connections to the database to pile up, and at 8:16am, Asana went down again. Thanks to the improvements we had made the previous day, this downtime was less severe. We again throttled traffic, and about 30 minutes later, Asana was back up.
By Wednesday morning, Asana was close to normal, although we were still seeing sporadic performance problems that slowed the app down. The performance issues were severe enough that we decided to deploy additional changes that we expected would reduce database load. Unfortunately, the act of deploying these changes again pushed our database load over its threshold, leading to a 10-minute outage.
By Thursday, our all-hands-on-deck efforts over the previous three days were showing results. We were beginning to see massive improvements in database performance, but we were still seeing periodic spikes in database request times as a result of low-level lock contention inside MySQL.
So what the heck happened?
Though we cannot be absolutely certain, we believe that the combination of our growing number of users and the loss of the database cache that resulted when we resized the database caused a sudden, sharp increase in lock contention within MySQL. The issues were compounded by the fact that Amazon uses proprietary technology to power the file system that RDS runs on, but doesn’t offer documentation about how this technology works during a resize operation. Further, the lack of root access to our database made the cause of the problems more difficult for us to understand.
But every crisis presents an opportunity, and last week’s outages spurred us to accelerate big improvements to both our infrastructure and our production monitoring capabilities.
Here’s what we’ve done so far:
- Worked aggressively with AWS and an outside database consultant to understand the problems.
- Added real-time analytics to our reporting infrastructure, enabling us to diagnose performance incidents and identify abnormal activity much more rapidly.
- Improved our ability to throttle requests, reducing the probability that future performance problems will escalate into full-blown downtime.
- Fixed a bug in our web servers that was making it more likely that requests would fail when we deployed new code under extreme load.
- Decreased the number of database writes and improved query performance to reduce internal lock contention within our database.
- Added hundreds of new counters to our production metrics to make it easier to identify the cause of performance issues in the future.
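As an illustration of the request throttling mentioned in the list above, a minimal Python sketch might look like the following. It is not the production implementation: the class name `RequestThrottle` and its methods are hypothetical, and it assumes the simplest possible policy, which is to cap the number of requests in flight and turn away the rest so the database is never asked to do more than it can handle.

```python
import threading


class RequestThrottle:
    """Hypothetical admission control: cap concurrent requests, reject the overflow."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True and claim a slot if capacity remains, otherwise return False."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Free a slot once a request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    def set_limit(self, max_in_flight):
        """Let an operator tighten or loosen the cap while the service is running."""
        with self._lock:
            self._max = max_in_flight
```

A web server would call `try_acquire()` before doing any database work, return a "try again shortly" response when it gets `False`, and call `release()` when the request completes. Lowering the cap with `set_limit()` during an incident trades some rejected requests for a database that stays responsive.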
And here’s what we are still doing:
- Migrating to a new version of MySQL that we expect to decrease internal lock contention and improve our ability to troubleshoot problems.
- Exploring additional ways to decrease the frequency of requests between the database and the underlying storage layer.
- Finalizing plans to partition user data across multiple database instances so that we can add capacity as needed in the future.
- Accelerating plans to move off of GoDaddy to a new DNS provider.
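To make the partitioning item above more concrete, here is a minimal Python sketch of one common way to spread users across database instances. It is an assumption for illustration, not the finalized plan: the function `shard_for_user` and the instance names are hypothetical, and hashing the user ID is only one of several possible mapping schemes.

```python
import hashlib


def shard_for_user(user_id, shard_names):
    """Map a user ID to one database instance, deterministically.

    A stable hash (not Python's built-in hash(), which can vary between
    processes) ensures every web server picks the same instance for a user.
    """
    if not shard_names:
        raise ValueError("at least one shard is required")
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shard_names)
    return shard_names[index]


# Hypothetical instance names, for illustration only.
shards = ["db-a", "db-b", "db-c"]
print(shard_for_user(12345, shards))
```

One caveat with this simple modulo scheme is that changing the number of instances remaps most users. A lookup table from user to instance, or consistent hashing, avoids that at the cost of extra machinery.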
We are intensely aware that consistent stability is critical to everyone who relies on Asana and that delivering great performance will play a key role in the fate of our company. Last week’s outages were some of the most challenging times our team has faced, but the setback, agonizing as it was, has made us stronger. Every part of the team, from engineering and product to marketing, sales and user operations, pitched in to bring Asana back to normal and keep our customers in the loop. We’ve made our infrastructure more resilient and our monitoring capabilities more robust. We’ve deepened our understanding of Amazon’s RDS and built a stronger relationship with the AWS team.
We know last week’s downtime hurt your team, and we thank you for bearing with us. Please let us know if you have any questions, feedback or concerns.
We are always listening.