by Kris Rasmussen
June 5th, 2012
Many of you noticed that the Asana service was occasionally unavailable on Thursday and Friday last week, for brief periods that each lasted less than one minute. We apologize for the inconvenience these connectivity issues caused, and want to let you know what we are doing to prevent similar issues in the future.
Asana’s infrastructure runs almost entirely on top of Amazon Web Services (AWS). AWS provides us with the ability to launch managed production infrastructure in minutes with simple API calls. We use AWS for servers, databases, monitoring, and more. In general, we’ve been very happy with AWS.
A month ago, we decided to use Amazon’s Elastic Load Balancer service to balance traffic between our own software load balancers. We did this for two reasons:
Scale: To evenly distribute requests between our own load balancers.
Reliability: To ensure that a single server failure does not result in a percentage of requests failing until we fix the server.
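To make these two goals concrete, here is a minimal Python sketch of a round-robin balancer that rotates across backends and skips any that are marked unhealthy. It is an illustration only, not how the Elastic Load Balancer or our own load balancers are implemented; the class name, the method names, and the backend names "lb-1" through "lb-3" are all hypothetical.

```python
from itertools import cycle
from typing import Iterable, Optional


class RoundRobinBalancer:
    """Toy balancer: rotates over backends and skips unhealthy ones."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        # Every backend starts out healthy (an assumption of this sketch).
        self._healthy = {backend: True for backend in self._backends}
        self._rotation = cycle(self._backends)

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the result of a (hypothetical) health check."""
        if backend not in self._healthy:
            raise KeyError(backend)
        self._healthy[backend] = healthy

    def pick(self) -> Optional[str]:
        """Return the next healthy backend, or None if all are down."""
        # Look at each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._rotation)
            if self._healthy[candidate]:
                return candidate
        return None


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["lb-1", "lb-2", "lb-3"])
    print([balancer.pick() for _ in range(3)])  # scale: one request each
    balancer.mark("lb-2", healthy=False)
    print([balancer.pick() for _ in range(3)])  # reliability: lb-2 is skipped
```

With all backends healthy, requests are spread evenly across them. Once a backend is marked unhealthy, it stops receiving requests without anyone having to fix it first.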
The Elastic Load Balancer had been accomplishing both of these goals for nearly a month when it suddenly stopped forwarding any HTTP requests to our servers. The issue lasted less than a minute, but that was long enough to trip our monitoring system and briefly disrupt the workflow of users who have come to rely on Asana to get work done.
The first time this occurred, we assumed it was a random hiccup that was unlikely to happen again. When it occurred a second time, we got in touch with Amazon for assistance. Amazon thought they had resolved the underlying problem, but it recurred twice more, later that night and early the following morning. At that point, we decided to replace the Elastic Load Balancer with DNS Round Robin. Since doing so, the problems have gone away completely.
DNS Round Robin isn’t without its own set of issues, but none of them should impact your ability to access our service. We hope to use the Elastic Load Balancer again in the future, but not until Amazon provides us with enough information to properly diagnose the problem and address it.