On last week’s downtime

Kris Rasmussen

Last week, Asana went down three times over the course of three days. Here is the story of what happened:

Around two months ago, we noticed that our database (which is hosted on Amazon’s RDS) was growing quickly. To pre-empt any issues, we scheduled time on Saturday, September 8th, to resize it. We expected that this operation would improve Asana’s overall performance and result in little-to-no downtime.

All indications suggested that the resizing had gone off without a hitch, but on Monday, September 10th, as the rush of morning traffic from Europe and the East Coast reached its peak, our Amazon RDS MySQL database started struggling. After slowing the app to an intolerable pace, the database issues eventually brought Asana down.

We began our investigation within minutes of the first sign of slowness. None of our key database metrics seemed abnormal, but we did see an increase in average read and write latency from the database to the underlying EBS storage volumes.

In case some of our recent code updates had increased the load on the database, the first thing we did was revert every change we’d made since the previous Thursday. When the issues persisted, we began to suspect that Saturday’s resize operation had affected the performance of the EBS volumes. Next, we triggered a manual failover to the secondary database instance. In the past, we’d had performance issues caused by contention from other RDS customers, and failing over to new hardware had helped address them. This time, the impact was minimal.
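
For readers unfamiliar with RDS Multi-AZ deployments, a failover like this can be forced by rebooting the primary instance with the failover flag set. The sketch below uses the AWS SDK for Python (boto3, which postdates this incident); the instance identifier is a placeholder, not our actual configuration.

```python
import boto3

# Rough sketch of forcing a Multi-AZ failover: reboot the primary with
# ForceFailover=True so the standby in the other availability zone is
# promoted. The instance identifier below is a placeholder.
rds = boto3.client("rds", region_name="us-east-1")

rds.reboot_db_instance(
    DBInstanceIdentifier="example-primary-db",
    ForceFailover=True,
)

# Block until the instance reports "available" again before letting traffic back in.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="example-primary-db",
)
```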

By 10:30am, Asana was back up after we successfully throttled traffic. But just as we started to let everyone back in, GoDaddy’s DNS went down, a totally unrelated incident that affected us (and millions of other sites on the internet), cutting off anyone who tried to access Asana by navigating to our homepage. Changes we made as we tried to address our own underlying problem triggered another very brief outage around noon.

We spent all of Monday attempting to diagnose the root cause of the morning’s outages and to decrease Asana’s dependence on the database. Despite our success with the latter, Asana slowed to a crawl again the next morning. We decided to roll back some of the infrastructure changes we’d made, just in case they were contributing to the problem.

When we initiated the rollback, the increased load from the deployment was just enough to cause connections to the database to pile up, and at 8:16am Asana went down again. Thanks to the improvements we had made the previous day, this downtime was less severe. We again throttled traffic, and about 30 minutes later Asana was back up.

By Wednesday morning, Asana was close to normal, although we were still seeing intermittent performance problems that slowed the app down. The performance issues were severe enough that we decided to deploy additional changes that we expected would reduce database load. Unfortunately, the act of deploying these changes again pushed our database load over its threshold, leading to a 10-minute outage.

By Thursday, our all-hands-on-deck efforts over the previous three days were showing results. We were beginning to see massive improvements in database performance, but we were still seeing periodic spikes in database request times as a result of low-level lock contention inside MySQL.

So what the heck happened?

Though we cannot be absolutely certain, we believe that the combination of our growing number of users and the loss of the database cache that resulted when we resized the database caused a sudden, sharp increase in lock contention within MySQL. The issues were compounded by the fact that Amazon uses proprietary technology to power the file system that RDS runs on but doesn’t document how that technology behaves during a resize operation. Further, our lack of root access to the database made the cause of the problems more difficult to understand.
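
Without root access, visibility into contention like this is limited to what MySQL itself exposes. As an illustration (not a transcript of our actual debugging), the InnoDB transaction and lock-wait tables in information_schema, available in MySQL of this era, can show which queries are blocking which; the connection details below are placeholders.

```python
import pymysql  # assumes a MySQL driver is available; pymysql is one option

# Placeholder connection details; RDS only exposes the SQL interface,
# so diagnostics like this are about the limit of what is visible.
conn = pymysql.connect(host="example-db-host", user="monitor",
                       password="placeholder", database="information_schema")

LOCK_WAITS = """
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM   innodb_lock_waits w
JOIN   innodb_trx r ON r.trx_id = w.requesting_trx_id
JOIN   innodb_trx b ON b.trx_id = w.blocking_trx_id
"""

with conn.cursor() as cur:
    cur.execute(LOCK_WAITS)
    for waiting_thread, waiting_query, blocking_thread, blocking_query in cur.fetchall():
        print(f"thread {waiting_thread} blocked by {blocking_thread}: {blocking_query}")
conn.close()
```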

But every crisis presents an opportunity, and last week’s outages spurred us to accelerate big improvements to both our infrastructure and our production monitoring capabilities.

Here’s what we’ve done so far:

  • Worked aggressively with AWS and an outside database consultant to understand the problems.
  • Added real-time analytics to our reporting infrastructure, enabling us to diagnose performance incidents and identify abnormal activity much more rapidly.
  • Improved our ability to throttle requests, reducing the probability that future performance problems will escalate into full-blown downtime (a simplified sketch of the idea follows this list).
  • Fixed a bug in our web servers that was making it more likely that requests would fail when we deployed new code under extreme load.
  • Decreased the number of database writes and improved query performance to reduce internal lock contention within our database.
  • Added hundreds of new counters to our production metrics to make it easier to identify the cause of performance issues in the future.
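
The throttling item above is conceptually admission control at the web tier: shed or queue requests before they can pile up on the database. Here is a minimal, illustrative token-bucket sketch; it is not our production code, and the names and limits are made up.

```python
import time
import threading

class TokenBucket:
    """Minimal token-bucket throttle: admit a request only if a token is
    available, so load on the database cannot spike past a fixed rate."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            # Refill tokens for the time elapsed since the last check.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # shed or queue the request instead of hitting the DB

# Illustrative use at the web tier: reject (or queue) excess requests.
throttle = TokenBucket(rate_per_sec=200, burst=50)
if not throttle.allow():
    pass  # return HTTP 503 / "try again shortly" instead of overloading MySQL
```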

And here’s what we are still doing:

  • Migrating to a new version of MySQL that we expect to decrease internal lock contention and improve our ability to troubleshoot problems.
  • Exploring additional ways to decrease the frequency of requests between the database and the underlying storage layer.
  • Finalizing plans to partition user data across multiple database instances so that we can add additional capacity as needed in the future (a rough sketch of the routing idea follows this list).
  • Accelerating plans to move off of GoDaddy to a new DNS provider.
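
To illustrate the partitioning direction mentioned above: the core idea is to route each workspace’s data to one of several database instances by a stable key. The mapping below is purely hypothetical; the instance names, shard count, and hashing scheme are illustrative, not our final design.

```python
import hashlib

# Hypothetical shard map: in practice this would live in configuration or a
# directory service, and rebalancing would use a lookup table or consistent
# hashing rather than a fixed modulo.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(workspace_id):
    """Route a workspace to a database instance using a stable hash, so the
    same workspace always maps to the same instance."""
    digest = hashlib.md5(str(workspace_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for a given workspace goes to the shard this function returns.
print(shard_for(42))  # deterministically prints one of the shard names
```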

We are intensely aware that consistent stability is critical to everyone who relies on Asana, and that delivering great performance will play a key role in the fate of our company. Last week’s outages were some of the most challenging times our team has faced, but the setback, agonizing as it was, has made us stronger. Every part of the team, from engineering and product to marketing, sales, and user operations, pitched in to bring Asana back to normal and keep our customers in the loop. We’ve made our infrastructure more resilient and our monitoring capabilities more robust. We’ve deepened our understanding of Amazon’s RDS and built a stronger relationship with the AWS team.

We know last week’s downtime hurt your team, and we thank you for bearing with us. Please let us know if you have any questions, feedback or concerns.

We are always listening.

  1. Melissa Pak

    Thanks for this update. I really appreciate the transparency and explanation.

    I love Asana and have become so dependent on it that those 3 days were hell! But glad to see everything up and running again.

    Thank you!

  2. Scott Phoenix

    Another suggestion: schedule db resize operations so that you have time to spend a few hours repopulating the cache afterwards. It seems like asking for trouble to start Monday morning with an empty cache.

    1. Dustin Moskovitz (Asana Team Member)

      Scott, we did the resize on Saturday during the day, so that was just about the maximum time possible to put between the event and traffic coming back on Monday.

  3. Jon

    Great write-up, thanks for taking the time to put it together and share it with the world.

    Would you be willing to go into the tech-details a little bit more in depth?

    You mention being much more comfortable with RDS after this, what specifically do you know now that you didn’t know before that makes you so much more comfortable?

    Also, what manner of metrics and instrumentation did you implement?

    If you can’t really divulge any of the information, that’s totally understandable.

    Thanks again,
    jon

  4. Brian

    Team: I am a huge Asana advocate and will continue to be. The transparency explaining what happened is admirable, yet I still find it curious that the word “sorry” is not included anywhere in this post.

    1. Kenny Van Zant (Asana Team Member)

      We are, indeed, sorry for the disruption. We’ve made that statement many times over the past two weeks, so it was an oversight not to include it here, as this was meant to be a technical post-mortem/discussion.

  5. Rob

    Thanks so much for the transparency guys – really helps give us faith that “everything’s gonna be alright”.

    Well handled and good luck with continued learning and recovery.

    Asana is an amazing product and I am really looking forward to seeing where you take it.

    Agree with @Rob #1 – @Brian – get over yourself. This is tech. Tech fails at times. The Asana team apologised numerous times during the outages and were quick to respond on Twitter to keep people posted.

    What’s more, I bet you are not even on a paid plan – you get something for free and somehow think people immediately owe you something..

  6. Dave Mackey

    Appreciate the detailed explanation. Must admit I am a little shocked to hear that Asana uses a single central database…
    Also, one area that could significantly reduce load is if clients are made (e.g. Windows, Linux, Android) that use cached versions of tasks and synchronize occasionally. This could reduce hits to the database by what – maybe 50-80%?

  7. Nick Stromwall

    Thanks for the update.

    Does this downtime accelerate or add to any plans for having offline capabilities for Asana?

    Thanks!

  8. Alan Shurafa

    Really appreciate the transparency. Asana is such a mission critical app that any downtime is very serious.

    Multiple databases is imperative. Redundancy needs to be built into every level of the operation.

    I would like to once again suggest adding the ability for users to export and backup all Asana data so that they can be referenced during unplanned outages.

    Keep up the good work!
