by Kris Rasmussen
January 29th, 2013
Update 2/1/2013 10:40am
We’ve been hard at work investigating the cause of our recent stability issues, and wanted to share what we’ve found. We determined that the root cause of the downtime was a problem with our memcached servers, which caused sudden and extreme spikes in latency. These spikes were at their worst during peak traffic, and caused performance issues with many other services that Asana depends on.
By Wednesday of this week, we had traced the underlying problem with the memcached servers and made the needed changes, along with more than a dozen additional improvements to our infrastructure to mitigate the impact of similar issues in the future. We’re confident in the fix, but we won’t know for certain that we’re in the clear until our next record traffic period (almost certainly this coming Monday morning).
We’re disappointed that this issue was so hard for us to identify. To improve on that, we’ve also taken time to update our processes and tools, starting with changes that let us analyze our historical performance metrics at higher resolution and finer granularity. These improvements have already paid dividends during this event. We also launched a new status page, so users can see real-time updates on service availability. Being big fans of transparency, we’ll continue to add richer data to this page as we go.
Thanks again for your patience during this phase. While we are enthusiastic about our surging user growth, we have an axiom that governs our priorities at all times: stability before speed before features. So we’ll keep giving this our full attention, and work to do better as we continue this journey.
—— Update 1/29/2013 2:30pm ——
Over the past week, and especially today, we’ve had a few periods where Asana was unavailable. Outages are not something we tolerate, and we are working hard to identify the root causes and get them fixed. The longest of these outages lasted a little more than 20 minutes, while most lasted less than 5 minutes. Changes we’ve made to our infrastructure and code over the last several months have reduced the time we need to get back to 100% after an outage.
We apologize for the interruption of your work, and promise to be completely transparent about what we ultimately identify as the cause, how we fix it, and what we’re doing to make our infrastructure and processes more robust against issues of this kind. In the meantime, thanks for your patience.
- The Asana Team