From 20 to 2 million releases a year, part 3
Posted by Alistair Hann
As Skyscanner scaled from an engineering team of 30, with one website and three services, to a team of 100 engineers, release frequency halved. This is the story of the turnaround as the company went on to grow to 400 engineers and over 100 services, releasing at thousands of times the previous rate. This series of three blog posts shares that story, and how our goal of 10,000 releases per day informs our tooling, our processes, and how we think about writing software.
Part III – From 1000 to 30,000 releases per year, and beyond
(aka Reversing the death spiral and turning it into a flywheel)
At the end of Part II I explained that we had got to the point of doing hundreds of deployments every month, thanks to moving to continuous delivery. Services were deployed at their own heartbeat, and we had moved to a DevOps model.
The question then became: how fast should we be going? Puppet Labs publishes an annual report called ‘The State of DevOps’ – a summary of the learnings from a survey of more than 4,600 technical professionals from organisations of all sizes around the world. It identifies a link between ‘high performing’ software organisations and higher market capitalisation growth in publicly listed companies, and improved profitability and market share in private companies. Compared to other software organisations, the high performing organisations report:
- 200x more frequent deployments
- 24x faster recovery from failures
- 3x lower change failure rate
- 2,555x shorter lead times
The 2015 report offers a further insight: in the high performing group, as the number of developers in the organisation increases, the number of deployments per developer per day increases exponentially, versus flat levels for medium performers and declining deployments per day for the low performers. We have seen the same shift in our own company as we made the journey to continuous delivery – initially, adding more engineers reduced our frequency; then we managed to keep it flat; but to make it go exponential we need to work out what is actually happening.
So why does this happen? In Part I of this series, I explained how our release frequency started plummeting as a negative cycle had started – every release made the next release slower. Adding more people to the organisation just compounded those effects. In the case of organisations that are releasing more often, I can see the reverse happening: a flywheel where changes keep reinforcing each other positively. I have drawn this out here:
As smaller changes are made, there is less risk and reverting is easier – hence the faster recovery rate and reduced incidents due to change. The smaller changes are also better for our users, as changes happen more gradually and they get new features earlier. From a process perspective, higher release frequency forces greater automation and better instrumentation. This leads to fewer errors and greater confidence in the software. This all means happier developers, and that means more code gets shipped. The 2016 ‘State of DevOps’ report even measured the impact on the team – developers in the high performing organisations were more than twice as likely to recommend their organisation to a friend as a great place to work.
How fast should we go?
Given there is this positive flywheel, how fast should we be looking to make it turn? The number we came up with was 10,000 releases / day. The rationale is that we are moving to many microservices. Today there are around 100; in the future, there will be 500-1,000 between primitives and orchestration services. In that world, a typical engineer’s code change might touch two to four services (let’s call it an average of three). So if an engineer makes one change per hour over an eight-hour day, touching three services, that is 24 deployments per day per developer, and with a team of 400, that’s 9,600 deployments per day.
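That back-of-the-envelope arithmetic can be written out as a quick sketch – the eight-hour working day is an assumption; the other figures are the ones given above:

```python
# Back-of-the-envelope for the ~10,000 releases/day target.
ENGINEERS = 400
CHANGES_PER_HOUR = 1
WORKING_HOURS = 8          # assumed length of a working day
SERVICES_PER_CHANGE = 3    # average of the two-to-four range

deploys_per_engineer_per_day = CHANGES_PER_HOUR * WORKING_HOURS * SERVICES_PER_CHANGE
total_deploys_per_day = deploys_per_engineer_per_day * ENGINEERS

print(deploys_per_engineer_per_day)  # 24
print(total_deploys_per_day)         # 9600
```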
Now, I don’t actually mind whether Skyscanner hits 10,000 or 5,000 or 20,000 releases per day. The great thing about a target like this is that it forces us to think differently and brings various decisions into focus. For example, we are moving from colo to AWS, and there was some discussion about what our integration environments would look like after the move (they are a total pain to maintain, but some teams depend heavily on them as part of the continuous delivery pattern we originally advocated). If 10,000 releases are going out a day, the idea of a ‘stable’ integration environment to test in ceases to make sense. So there will be no integration environment in AWS, and teams will need to integrate against the production versions of the APIs they use, or against mocks.
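To illustrate integrating against mocks rather than a shared environment, here is a hypothetical Python sketch – the `get_quotes` client method and the quote format are invented for this example, not a real API:

```python
# Hypothetical sketch: test a service against a mock of the production
# API it depends on, rather than a shared integration environment.
from unittest import mock

def cheapest_price(client, route: str) -> float:
    """Toy service code: asks a pricing API client for quotes."""
    quotes = client.get_quotes(route)
    return min(q["price"] for q in quotes)

# Stand in for the production pricing API with a mock that returns
# a canned response shaped like the real one.
client = mock.Mock()
client.get_quotes.return_value = [
    {"price": 120.0}, {"price": 95.5}, {"price": 101.0},
]

print(cheapest_price(client, "EDI-BCN"))  # 95.5
client.get_quotes.assert_called_once_with("EDI-BCN")
```

The same test shape works whether the client is a mock or points at the production API, which is what makes the integration-environment-free model workable.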
Another example: if an engineer is deploying on every change, there cannot be any manual steps between that commit and production – it just isn’t efficient – so all the manual checks in the continuous delivery pipeline shown in Part II have to disappear.
One pioneering team on this journey was working in our joint venture with Yahoo! in Japan, and they built two services to help them ship more quickly: one called Stevedore, which loaded and unloaded containers from production (for those who haven’t seen Season Two of HBO’s The Wire, stevedores load and unload ships, and ships carry containers…), and one called Cyan, for blue/green testing of new deployments (cyan being the result of adding blue and green light). The new system dictated a particular workflow:
The impact of this system was very positive and it has now been generalised into a system that is used across Skyscanner – we internally call that system ‘Slingshot’. The following diagram shows the original continuous deployment system at the top, and the ‘Slingshot’ continuous delivery system below:
The first thing to notice is that there are no manual steps between a commit and the software running across production. Every pull request automatically triggers a Drone CI build that runs tests within a Docker container. After code review, the tests are run again, and that Docker image becomes the artefact that will be deployed into production. A hook in GitLab means the successful pull request triggers the Slingshot deployment:
- The image is retrieved and deployed to a new cluster
- Automatically, a pre-defined fraction of production traffic goes to the new cluster
- Rules applied to production metrics determine success or failure
- There is automatic roll-out to all of production or automatic roll-back
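The roll-out/roll-back decision in the last two steps could be sketched roughly like this – the names, thresholds, and metrics here are hypothetical illustrations, not Slingshot’s actual rules:

```python
# Minimal sketch of a canary decision in the Slingshot style:
# compare the new cluster's production metrics against the baseline
# and decide whether to roll out or roll back. All thresholds are
# invented for illustration.
from dataclasses import dataclass

@dataclass
class ClusterMetrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def canary_decision(baseline: ClusterMetrics,
                    candidate: ClusterMetrics,
                    max_regression: float = 0.01,
                    min_requests: int = 1000) -> str:
    """Return 'roll-out', 'roll-back', or 'wait' from production metrics."""
    if candidate.requests < min_requests:
        return "wait"  # not enough traffic on the new cluster yet
    if candidate.error_rate > baseline.error_rate + max_regression:
        return "roll-back"  # error rate regressed beyond tolerance
    return "roll-out"

# New cluster's error rate is within tolerance of the baseline:
print(canary_decision(ClusterMetrics(100_000, 500),
                      ClusterMetrics(5_000, 30)))  # roll-out
```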
This way of working forces more good behaviour: there is no opportunity for any sneaky manual testing as part of deployment – it simply cannot happen – and teams have to better understand and instrument their operational metrics, which improves overall availability. We can only provide this tooling for free if people use the default toolsets and patterns, so we get greater convergence, and that also makes us more efficient (see http://codevoyagers.com/2016/08/16/not-all-the-technologies/). You can see those forces on the flywheel beginning to turn and make us accelerate.
Where have we got to?
It turns out to be surprisingly hard to find out how many releases we are doing – with many systems and pipelines, there isn’t a single place to go for that number. I recently polled teams and pulled some data from Slingshot: we did 2,400 releases in the last month. While that is only around 120 releases per working day, it is an order of magnitude more than when we first implemented continuous delivery, and three orders of magnitude better than at the peak of our challenges. To show this progress on a graph, I’ve had to use a logarithmic scale:
The dotted line extrapolates the trend and implies we may reach 10,000 releases per day at the end of next year. While we have gone from 1,400 to 2,500 releases per month over the last six months, there are some impediments to reaching that rate. In the teams that have adopted continuous delivery, where every commit is released to production, a team of 15 engineers is releasing 15-20 times per day – a little over one release per engineer per day. The only way to ship more often is to commit in smaller chunks, so we need to change that part of how we work. Other teams are still working with the ‘legacy’ continuous delivery model, and until they are able to migrate to the new standard toolset they cannot benefit from Slingshot. In the meantime, the release frequency of these teams is typically daily, or even weekly in a few cases.
There is also a wider consideration for the whole business – as we ship changes more frequently, we need to be able to make faster decisions. In a one week ‘sprint’ for a Scrum team, there will be multiple experiments and hundreds of deployments, each generating lots of data on which to base product and business decisions. Thus the product owner and the rest of the organisation have to be ready to make faster decisions to maximise the benefits of faster feedback.
Whether Skyscanner reaches 10,000 releases per day at the end of 2017 or not, I am convinced the trend will continue to be more frequent releases, happier customers, and happier developers.