By now, it’s a given that virtually every company that delivers a service over the web should have a robust automated testing system. We have a recipe that’s served us well over the last few years and we thought we’d share it.
First, we have these goals for how our automated testing system operates:
- Prevent developers from being impacted by other developers’ broken code.
- Give developers as fast a response as possible on whether a commit is good or bad.
- If a commit is bad, provide enough information for the developer to efficiently find and fix the problem.
And here is our strategy for achieving those goals:
- Defend the master branch to keep it in a good state.
- Scale the test cluster to keep it running fast.
- Provide a good UI to help diagnose test failures.
- Identify spurious failures and deal with them intelligently.
- Set up a process to monitor the infrastructure and act quickly against issues.
If you’d like more background, you can read up on Quora about how our build pipeline works.
Defend The Master
At many companies, teams work on a shared branch and submit code to it. A continuous process runs tests over the branch and reports its status. Thus, a bad commit will “break” the branch so it no longer passes tests. This generally prevents release—or further downstream merging—of the branch. It also impacts developers who have synced the bad commit, as their local version is now broken as well.
Our commit workflow is designed to ensure that the shared development branch is never broken. Rather than merging commits and testing later, each developer has their own branch that is individually tested. We test their branch and only merge it into the master if the tests pass.
With this system, developers are rarely impacted by breakage from other developers, and the master branch is always in a “good” state that can be merged downstream or released at any time. If a developer commits bad code, it’s their own branch that’s broken and they are the most incentivized to fix it—by default, no one else will even notice.
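The test-then-merge gate described above can be sketched in a few lines. This is only an illustration, not our actual tooling: the `Repo` class, the `try_merge` function, and the `run_tests` callback are hypothetical names, and testing the branch's commits on top of the current master is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Repo:
    """Toy stand-in for a repository: master is just a list of commit ids."""
    master: List[str] = field(default_factory=list)


def try_merge(repo: Repo, branch_commits: List[str],
              run_tests: Callable[[List[str]], bool]) -> bool:
    """Merge branch_commits into master only if the tests pass."""
    # Assumption: the candidate is the branch's commits on top of master.
    candidate = repo.master + branch_commits
    if not run_tests(candidate):
        return False  # master is left untouched; only the branch is broken
    repo.master = candidate
    return True


# Usage: a fake test runner that fails whenever a commit named "bad" is present.
repo = Repo(master=["a1"])
passes_unless_bad = lambda commits: "bad" not in commits
rejected = try_merge(repo, ["bad"], passes_unless_bad)
accepted = try_merge(repo, ["b2"], passes_unless_bad)
```

The key property is that a failing run never changes `repo.master`, so other developers never sync the bad commit.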
Handling The Load
Our test suite boasts thousands of tests, including 1,500 tests of the UI itself. Each one must initialize the world state, set up the portion of the app to be tested, and make multiple requests from client to server to execute the test. This takes real time—usually several seconds per test. Without substantial investment in computing resources, test runs would be quite slow.
This means our test cluster is large. In fact, during the workday we operate 6 clusters of 60 test slaves each to run tests in parallel—not chump change.
Is that investment worth it? Definitely! The productivity benefits of having a speedy automated test system are tangible:
- We make smaller changes before putting them through tests, facilitating smaller batches (which are good for a lot of reasons)
- Less latency in getting a fix or improvement out to production
- Less latency in getting a working change into the master branch for other engineers to use or build on top of
- Less effort to optimize around and compensate for a slow test system
- We can run automated tests of local changes to a branch even if they’re not ready to commit yet
So, we have an incentive to find a way to run our tests quickly.
Speeding It Up
We use Jenkins to queue builds and distribute them to the test clusters, and custom scripts within each cluster to configure its slaves and launch the tests. We call this collection of clusters “Testville.”
If a testing job fails, we immediately retry it on a different test slave. In any distributed job system, when jobs hang or fail it is good practice to retry them on a different host to rule out machine-specific factors in the failure (an idea inspired by MapReduce).
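As a rough sketch of that retry policy (not our production scheduler), the function below tries a job on distinct hosts until one passes. The names `run_with_retry` and `run_on`, and the default of two attempts, are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def run_with_retry(job: str, hosts: Sequence[str],
                   run_on: Callable[[str, str], bool],
                   max_attempts: int = 2) -> Optional[str]:
    """Run job on up to max_attempts distinct hosts.

    Returns the host on which the job passed, or None if every attempt failed.
    Assumes `hosts` contains no duplicates, so each retry lands on a new machine.
    """
    for host in hosts[:max_attempts]:
        if run_on(job, host):
            return host
    return None


# Usage: a fake runner in which one machine is broken.
broken_hosts = {"slave-1"}
fake_run = lambda job, host: host not in broken_hosts
winner = run_with_retry("ui-tests", ["slave-1", "slave-2", "slave-3"], fake_run)
```

If the job fails on every host it is tried on, the failure is much more likely to be caused by the code than by a machine.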
A full test run on our system hits 60 test slaves and takes about 30 minutes. That’s been an acceptable turnaround, assuming the time the tests wait in the queue is short. But we’re not satisfied with this and are working to make it much faster.
One way to make things faster is by not running all the tests on every commit. We recently introduced a “fast-tracking” system that examines the commits being tested and, if possible, only runs the subset of tests that should be affected.
For example, changes submitted by designers to CSS or image files have virtually zero chance of impacting our current set of tests (though this may change with the impending introduction of “screenshot tests” to catch visual regressions!) So we don’t waste cycles running them, and those commits go through very quickly.
Similarly, changes to Python code typically don’t affect the application, so we don’t need to run tests of the UI. If fast-tracking ends up being too lenient, a later merged test phase always runs all tests.
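A minimal sketch of this kind of fast-tracking might map changed file extensions to the test suites worth running. The suite names (`unit`, `ui`), the extension sets, and the `suites_for_commit` function are hypothetical. The real rules are more involved, but the conservative default of running everything when the impact is unknown reflects the intent.

```python
import os
from typing import Iterable, Set

ALL_SUITES = frozenset({"unit", "ui"})        # hypothetical suite names

# Assumed rules: asset changes need no tests; Python changes skip UI tests.
NO_TEST_EXTENSIONS = {".css", ".png", ".jpg", ".gif"}
SKIP_UI_EXTENSIONS = {".py"}


def suites_for_commit(changed_files: Iterable[str]) -> Set[str]:
    """Return the set of test suites to run for a commit's changed files."""
    suites: Set[str] = set()
    for path in changed_files:
        ext = os.path.splitext(path)[1].lower()
        if ext in NO_TEST_EXTENSIONS:
            continue                          # cannot affect current tests
        if ext in SKIP_UI_EXTENSIONS:
            suites.add("unit")                # non-UI tests only
        else:
            return set(ALL_SUITES)            # unknown impact: run everything
    return suites


# Usage
design_only = suites_for_commit(["static/site.css", "img/logo.png"])
python_only = suites_for_commit(["tools/deploy.py"])
mixed = suites_for_commit(["static/site.css", "app/main.js"])
```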
Dealing With Failure
Tests fail all the time. It’s important that engineers are empowered to quickly identify the cause of a test failure and debug it. To facilitate that workflow, we’ve built a lightweight test results app using Sinatra, DataMapper, HAML, and Bootstrap. It stores metadata about all of our test runs and allows us to view them in a nice, clean UI—we call it “Testopia.”
Testopia offers the following key features:
- Failure information for as many failures as we were able to detect, so that engineers can see the forest and not play whack-a-mole with one problem at a time.
- The command line an engineer can run to locally reproduce each test run that failed.
- Logs for each test run as well as any processes it depended on (for example, the browser console log, Selenium/WebDriver, or a search backend).
- A screenshot taken at the time of each failure.
- A link to report each failure as “spurious,” i.e., caused by something exogenous to the test (for example, poor test isolation). This temporarily disables the test so it does not impact other engineers, and creates a task in Asana for someone to investigate it promptly. Spurious failures can mean big problems in any testing system, and we’ll say more about our novel ways of mitigating them in a later post.
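To make the “report as spurious” flow concrete, here is a small in-memory sketch. It is not Testopia’s code: `SpuriousTracker` and its methods are hypothetical, the task list stands in for creating a real Asana task, and skipping duplicate reports is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SpuriousTracker:
    disabled: Set[str] = field(default_factory=set)   # temporarily disabled tests
    tasks: List[str] = field(default_factory=list)    # stand-in for Asana tasks

    def report_spurious(self, test_name: str, reason: str) -> None:
        """Disable the test and file a task so someone investigates it."""
        if test_name in self.disabled:
            return  # already reported; avoid filing a duplicate task
        self.disabled.add(test_name)
        self.tasks.append(f"Investigate spurious failure in {test_name}: {reason}")

    def should_run(self, test_name: str) -> bool:
        """Disabled tests are skipped so they do not block other engineers."""
        return test_name not in self.disabled


# Usage
tracker = SpuriousTracker()
tracker.report_spurious("test_login_flow", "poor test isolation")
tracker.report_spurious("test_login_flow", "reported again")
```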
Monitoring and Alerting
Testville and Testopia are critical pieces of our build infrastructure, but they don’t work perfectly. When something goes wrong, we have monitoring that alerts us so we can diagnose and attend to the problem quickly. Here are some things we have found useful to alert on:
Test queue length growing too large
This is a leading indicator that developers are about to experience long latencies in test verdicts. Sometimes something goes wrong in the cluster and builds back up, so it’s good to have a threshold that won’t fire just because traffic is high but will fire when the queue gets egregiously long. For us, a too-long queue is one that puts the latest builds at about 3 hours of latency. That would mean turnaround times are significantly degraded and code is not moving quickly enough through the system.
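One way to turn that threshold into a check is to estimate the latency of the newest build from the queue length. The sketch below is a back-of-the-envelope model, not our actual alert: it assumes each cluster handles one full 30-minute run at a time, and the function names and the `builds_ahead` parameter are hypothetical.

```python
from datetime import timedelta

MAX_LATENCY = timedelta(hours=3)       # the "too long" threshold from the text
RUN_TIME = timedelta(minutes=30)       # approximate duration of a full test run


def estimated_latency(builds_ahead: int, clusters: int = 6,
                      run_time: timedelta = RUN_TIME) -> timedelta:
    """Rough latency for the newest build: full waves ahead of it, plus its own run."""
    waves_ahead = builds_ahead // clusters
    return run_time * (waves_ahead + 1)


def queue_too_long(builds_ahead: int, clusters: int = 6) -> bool:
    """True when the newest build is expected to wait about 3 hours or more."""
    return estimated_latency(builds_ahead, clusters) >= MAX_LATENCY


# Usage
busy_but_fine = queue_too_long(12)     # 2 waves ahead: 90 minutes
backed_up = queue_too_long(30)         # 5 waves ahead: 180 minutes
```

Alerting on estimated latency rather than raw queue length keeps the alert quiet during ordinary busy periods.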
Test cluster not operating at capacity
Sometimes a cluster will stop running. For a while, a common cause was that the head node would run out of disk space, so Jenkins would silently auto-disable it! Without an alert, we would only eventually notice that the queue was getting long and then have to catch up. This alert lets us know right away when, for any reason, the cluster is not operating at capacity.
Testville or Testopia web server inaccessible
Obviously, we want the online testing tools to be available at all times. We have enough engineers in the organization that if the tools go down, we’ll hear a timely complaint. However, we’d rather be alerted in a consistent manner and preferably in advance of a teammate experiencing this inconvenience.
We have an Asana project to track both alerts and other testing issues. If an engineer has to respond to an issue, they add it to the project. Alerts automatically get added to the project, so nothing slips through the cracks. The team in charge of the test infrastructure follows the project so they get immediate inbox notifications of new issues, and they also periodically review the project to ensure things are getting addressed.
Loving the Earth
Last, but certainly not least, we run all of our test machines in a zero-carbon datacenter. We believe our company’s impact on the world extends beyond just the service we offer via the web. For one thing, our computing activities produce carbon.
We’d like to limit how much, and Amazon makes the tradeoff a no-brainer since they offer carbon-free computing at the same cost as the alternatives! For anyone running a compute cluster where the specific geolocation isn’t critical for latency, this option is well worth considering.
The Future of Testville
Since we built Testville, the state of the art in automated testing has advanced considerably. We actually think it’s possible for our tests to run in an order of magnitude less time, which would be a massive improvement for our developer productivity.
First, our code base is large enough that we can break it into separate components and services (our web app, the iOS app, our data processing pipeline, etc.), each with its own build rules and automated tests. This would be a leap forward from the “fast tracking” described above, and would allow each build to truly run just the tests that concern it.
Interoperating technologies like GitHub and Travis CI are compelling alternatives that offer scalable operations, and we’re figuring out how to migrate towards them without losing some of the great benefits we enjoy from our own system. A key part of that migration is being able to test our UI without bringing in as many run-time dependencies, which will drastically cut execution times. Our UI testing is a topic we’ll have to cover in a future post—so stay tuned!
Do you have a sizable web application served by automated tests? What kind of setup do you use, and how is that working for you? Let us know in the comments!