From 20 to 2 million releases a year, part 1
Posted by Alistair Hann

As Skyscanner scaled from an engineering team of 30 with one website and three services to a team of 100 engineers, release frequency halved. This is the story of the turnaround as the company went on to grow to 400 engineers, with over 100 services, releasing at thousands of times the previous rate. This series of three blog posts shares that story, and the implications of how our goal of 10,000 releases per day informs our tooling, our processes, and how we think about writing software.
Part I – From 20 to 10 (aka ‘what went horribly wrong’)
I joined Skyscanner at the end of 2010 after it acquired the start-up I founded, Zoombu. Zoombu had a team of five and we were deploying our product at the end of every week-long sprint. Skyscanner had an engineering team of thirty and the product was going out every two weeks, like clockwork. That seemed reasonable to me – there was a lot more surface area, and it was far better than a household name Online Travel Agent that I knew was shipping every six months.
These blog posts are the story of what happened in the subsequent six years. To help put some context around that, the following graph shows the number of UMVs Skyscanner saw from December 2004 to February 2016. One interesting thing to note is the jump in traffic after each December – in the travel industry things get quiet towards the end of the year, but as soon as Christmas and other festivities are over, people start planning and booking their travel, and that is when we experience the biggest jump in load on our services (this isn’t the case in all countries, but it does characterise much of Europe and the Americas). Something else to note is that the team was growing at a similar rate, and that included the engineering team.
The second graph I want to share shows how many releases we were doing per month in the two years after October 2010 – you can see there was a notable decline, with releases dropping to monthly and then even less often than that. In practice, the graph doesn’t tell the true horror of the situation – we entered a phase of alternating between low risk releases and high risk releases. The idea was that lots of low risk changes could be bundled into one release, and riskier major feature releases would be scheduled into the other. Thus, in reality, a minor change could take eight to twelve weeks to go out if it emerged at the wrong time in the cycle.
It’s also worth calling out the big dips around December 2010, December 2011, and December 2012. Those were the ‘feature freezes’ we would start weeks before Christmas – the anticipated traffic surge meant we were fearful to release changes until we knew what the product could handle, and we also wanted a quiet period for supporting the site while people were enjoying their Christmas break.
So, why was our release frequency tanking? The clue is in the third graph – the number of days each release took. It sounds obvious to say the release frequency dropped because it took longer to release – the key thing to note is that as soon as we finished one release we would start the next one. We couldn’t get any faster, because it was taking so long to get from the release train being ‘closed’ to further changes, to the new release being deployed and validated in production. At which point ‘int’ would reopen and the whole process would start again. To quantify the situation, at peak pain:
- 59 Changes Per Release
- 9 Bugs found in Pre-Production Regression
- 21 Days end-to-end
- 125 Days of human effort
There was a lot of manual regression testing, and if a bug was found during the regression, a patch had to be made and the regression test repeated. That regression was painful: people didn’t want to do it, only a small number of people knew how to fully test each component, and it was doubly painful if it had to be repeated.
We were operating canary releases – a new release would be deployed to a subset of all production traffic called ‘Staging’, and it would only be rolled out to serve all production traffic if the KPIs were healthy (in the email snippet below, there were concerns about the metrics of a release that had been on Staging for five days). Validating that the new version was performing correctly took so long because of latency in metric collection, the volume of change, a behaviour where new deployments needed to ‘warm up’, and variability in the KPIs between production nodes, which meant there wasn’t a clear benchmark of what counted as a healthy value.
An invitation for an emergency meeting in February 2012. A KPI (‘G1’) was down against staging, and people across the company needed to decide what to do.
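To make the validation problem concrete, here is a minimal sketch (not our actual tooling) of the kind of check a canary rollout like this implies: compare each KPI on Staging against a baseline derived from the production nodes, and only promote the release if everything sits inside a tolerance band. The KPI names, values, and tolerance below are invented for illustration; the point is that when per-node KPIs are noisy and slow to collect, the band is hard to define and the decision ends up with humans, as in the meeting above.

```python
from statistics import mean, stdev

def canary_is_healthy(staging_kpis, production_kpis, tolerance_stdevs=2.0):
    """Return True if every Staging KPI falls within a tolerance band
    derived from the spread of the same KPI across production nodes.

    staging_kpis:    {"g1": 0.026, ...}            # single value per KPI
    production_kpis: {"g1": [0.031, 0.029, ...]}   # per-node values per KPI
    """
    for name, staging_value in staging_kpis.items():
        node_values = production_kpis[name]
        baseline = mean(node_values)
        spread = stdev(node_values) if len(node_values) > 1 else 0.0
        # When per-node KPIs vary widely, the band is either too wide to
        # catch real regressions or too narrow to pass legitimate change,
        # so the automated check stalls and humans end up debating it.
        if abs(staging_value - baseline) > tolerance_stdevs * spread:
            return False
    return True

# Hypothetical 'G1' KPI on Staging vs. three production nodes.
staging = {"g1": 0.026}
production = {"g1": [0.031, 0.029, 0.034]}
print(canary_is_healthy(staging, production))  # False -> escalate to people
```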
Despite the bugs found in regression, nearly every second release required a re-release because we found an issue on production that had been missed in regression and was only identified after the successful rollout across all nodes. If the issue couldn’t wait until the next release, the answer was to deploy a ‘hot fix’ – a risky and painful procedure that meant carrying out an accelerated version of the release process, only doing regression on a subset of the product. It wasn’t a decision taken lightly because of the risk, the pain, and the delay to the next scheduled release (I hid the hot fixes from the release frequency graph above, as they confuse the trend).
An early warning sign may have been that we had names for each of our releases. If a website release is significant enough to have a name, there’s probably something wrong. A bigger warning still may have been that at one stage we were naming our releases after Muppets. Acknowledgement: “Cupcake Brilliance” by Bret Jorhan is licensed under CC BY 2.0.
A model I use to explain how we ended up in this painful situation is that there was a negative cycle going on – problems were feeding each other, so a slow degradation rapidly got a lot worse as the cycle gained momentum. I have drawn a simple model of this below:
With a negative spiral feeding itself, we were releasing code more and more slowly, even though the size of our engineering team was rapidly increasing. We reached a point where we had to stop and completely change how we organized ourselves and how we deployed software. In Part II, I discuss how we changed our organization, services, and release processes in order to get a 100x improvement in release frequency.
Continue reading Part II of the series, “From 10 to 1000 releases per year – Microservices and Continuous Delivery”.
Related articles:
- Continuous Integration: where we are now by Stuart Davidson
- The present and future of app release by Tamas Chrenoczy-Nagy