From 20 to 2 million releases a year, part 2
Posted by Alistair Hann
As Skyscanner scaled from an engineering team of 30 with one website and three services to a team of 100 engineers, release frequency halved. This is the story of the turnaround as the company went on to grow to 400 engineers, with over 100 services, releasing at thousands of times the previous rate. This series of three blog posts shares that story of releasing thousands of times more frequently, and how our goal of 10,000 releases per day informs our tooling, our processes, and the way we think about writing software.
Part II – From 10 to 1000 releases per year – Microservices and Continuous Delivery
In Part I of this series, I explained that our release frequency was plummeting because there was a spiralling negative cycle, where every release was making the next release take longer. Our team was growing but the value we were delivering to our customers was grinding to a halt, and we had to change things.
We made three major changes, which were inextricably linked: moving to microservices, teams owning services, and adopting continuous delivery.
Moving to microservices
At the peak of the crisis, the Skyscanner product (excluding native apps) consisted of two deployable units that were released in lock-step. One was ‘The Website’ and the other was the data services that powered it and the mobile apps. The reason I refer to the data services as one unit is that, while there were strictly speaking three services, two were tightly coupled and had to run on the same server. A huge effort went into splitting the code bases into much smaller services – effectively forking the code and progressively cutting out the dead wood that wasn’t needed for each sub-service. These microservices could be deployed independently. In the first move there were around fifty such services; now there are around one hundred.
Teams Owning Services
We entirely changed the way we organised product engineering. Previously we had ‘development’ and ‘operations’ teams. Instead we adopted a structure partly inspired by Spotify’s 2012 paper describing Squads and Tribes. This is a ‘you write it you run it’ environment, with a single team taking on full responsibility for the development, deployment and operation of the services it owns. The only remaining ‘ops’ functions moved to teams that built tooling to enable other teams to deliver faster, a very small infrastructure team responsible for networks, CDN, load balancers, virtualization and so on, and a very small team providing ‘follow the sun’ front-line support (e.g. for data centre outages) and incident management.
We only considered a team to have made the transition when it could show it had completed an extensive checklist. The list included that the team:
- Can independently release services into production
- Understands and monitors its metrics and KPIs
- Can run an A/B test in production
- Sets its own objectives and key results
Continuous Delivery
Independent deployment of services was driven by a move to continuous delivery. The diagram below shows the ‘best practice’ pattern that all teams were encouraged to adopt.
There was a continuous integration environment with very frequent change, validating the software after every commit. If that succeeded, and after a human ‘go ahead’, the code would move to integration review – an environment with less frequent change, where the service integration was validated by running a full integration test suite (including all service dependencies) and a full acceptance test suite. After another manual gate, code could move to pre-production – a relatively stable environment containing probable release candidates, where the integration and acceptance tests were run again.
A manual gate between pre-production and production allowed for exploratory, performance and load testing in a production-like environment. The release then went to production as a canary, so that production ran a stable mixture of the existing version and the validated release candidate. Smoke testing was combined with monitoring of service KPIs and business metrics to ensure the release was good. After a final manual check, the package could be released to all nodes in production, with ongoing monitoring and alerting on the service KPIs and business metrics.
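To make the flow concrete, here is a minimal sketch of that staged pipeline as a script, with automated checks in each environment and a human ‘go ahead’ between them. The environment names and placeholder checks are illustrative assumptions, not the tooling we actually used.

```python
# Illustrative sketch of the staged delivery pipeline described above.
# Environments, checks and prompts are placeholders, not real tooling.

def run_checks(stage, checks):
    """Run each named check for a stage; any exception stops the pipeline."""
    for description, check in checks:
        print(f"[{stage}] {description}")
        check()

def manual_gate(prompt):
    """A human 'go ahead' between environments."""
    if input(f"{prompt} [y/N]: ").strip().lower() != "y":
        raise SystemExit("Stopped at manual gate.")

def deploy(environment):
    print(f"Deploying build to {environment} (placeholder).")

# Continuous integration: validate the software after every commit.
run_checks("ci", [("unit tests", lambda: None), ("static analysis", lambda: None)])

manual_gate("Promote to integration review?")
deploy("integration-review")
run_checks("integration-review", [
    ("integration tests against all service dependencies", lambda: None),
    ("full acceptance test suite", lambda: None),
])

manual_gate("Promote to pre-production?")
deploy("pre-production")
run_checks("pre-production", [
    ("integration and acceptance tests on the release candidate", lambda: None),
    ("exploratory, performance and load testing", lambda: None),
])

manual_gate("Release to production canary?")
deploy("production-canary")
run_checks("canary", [
    ("smoke tests", lambda: None),
    ("service KPIs and business metrics healthy", lambda: None),
])

manual_gate("Roll out to all production nodes?")
deploy("production")
```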
Making the changes to microservices, teams owning services, and continuous delivery took several months, with different teams completing the transition at different times – depending on how easy it was to decouple their service, how much additional automated testing needed to be added, and so on. One question is how this change was sold to the business. There were pioneers who deserve a lot of credit for driving it, but there was also a recognition that things couldn’t carry on as they were: the team had doubled in size and we were going slower. The entire executive team and board were behind the change, and it helped to be able to point to examples like Kevin Scott halting deployment at LinkedIn for two months in order to move to continuous deployment.
Many challenges emerged during this switch. Moving to a ‘you write it you run it’ devops model meant that teams had to learn the discipline of deploying and operating production software, having previously been insulated from it. There was a steep learning curve: which metrics were critical enough to gate a deployment or to trigger alerting for an individual component, what was an appropriate time to release a change, and so on.
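As a concrete (and hypothetical) illustration of what that learning curve involved, the sketch below shows the kind of per-component alert rules a team had to settle on; the metric names and thresholds are invented for the example, not Skyscanner’s actual values.

```python
# Illustrative per-component alert rules of the kind teams had to define.
# Metric names and thresholds are hypothetical examples, not real values.

ALERT_RULES = [
    # (metric name, threshold, comparison)
    ("http_5xx_rate", 0.01, "above"),               # more than 1% of requests failing
    ("p95_latency_ms", 500, "above"),               # 95th percentile latency too high
    ("successful_searches_per_min", 100, "below"),  # business metric collapsing
]

def evaluate_alerts(current_metrics):
    """Return the list of rules that are currently breached."""
    breached = []
    for name, threshold, comparison in ALERT_RULES:
        value = current_metrics.get(name)
        if value is None:
            continue  # metric not reported; a real system might alert on this too
        if comparison == "above" and value > threshold:
            breached.append((name, value, threshold))
        elif comparison == "below" and value < threshold:
            breached.append((name, value, threshold))
    return breached

# Example: a metrics snapshot that breaches only the latency rule.
print(evaluate_alerts({"http_5xx_rate": 0.002, "p95_latency_ms": 640,
                       "successful_searches_per_min": 450}))
```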
Another challenge was that services were subject to far more change going on around them. In one case a small data change in one service took the entire website down, because of a cascade of errors triggered by the invalid data and the abnormal responses that followed. The left side of the figure below shows the old model at Skyscanner: Component A is deployed into production as part of a single release with Component B, so it can be fully tested against the version of Component B it will run alongside in production. If Component B is unavailable, A will usually be unavailable too, as they were often deployed on the same hardware or otherwise tightly coupled. In that model we didn’t need to think too much about what happened to A if B changed or stopped working – any issues would (hopefully) be picked up in the integration environment and regression testing.
After moving to independently deployed microservices, the world was much closer to the right-hand side of the figure: Component A has been deployed and isn’t changing, while Component B is being deployed independently. Now a single release of Component A has to handle new releases of Component B (including unproven canaries), Component B being slow or unavailable, and many other kinds of change.
As a result, teams had to start coding far more defensively: hardening how their services responded when other services suffered from latency, invalid responses or unavailability, and testing for those scenarios. The end result was more resilient software.
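As a sketch of that defensive style – with a hypothetical downstream endpoint, timeout and fallback chosen purely for illustration – a call to a dependency might be hardened like this:

```python
# Minimal sketch of a defensive call to a downstream dependency.
# The endpoint, timeout values and fallback behaviour are hypothetical,
# chosen only to illustrate the pattern described above.
import requests

QUOTES_URL = "http://quotes.internal/api/v1/quotes"  # hypothetical dependency
TIMEOUT_SECONDS = 2.0
MAX_ATTEMPTS = 3

def fetch_quotes(route):
    """Fetch quotes for a route, degrading gracefully if the dependency misbehaves."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(
                QUOTES_URL,
                params={"route": route},
                timeout=TIMEOUT_SECONDS,  # never hang on a slow dependency
            )
            response.raise_for_status()   # treat HTTP error statuses as failures
            payload = response.json()     # raises ValueError on invalid JSON
            if "quotes" not in payload:   # guard against abnormal responses
                raise ValueError("unexpected response shape")
            return payload["quotes"]
        except (requests.RequestException, ValueError):
            continue  # retry a bounded number of times
    # Fall back to an empty result so the caller can still render a page.
    return []
```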
Slowly we learned the operational discipline, improved our instrumentation, and increased the resilience of individual services. Overall availability dropped at first but then recovered. Teams started delivering software again, and the very last release train, in February 2014, was Hammer Bro, the Super Mario character (we had run out of Muppets and moved on to video game characters).
So where had all this got us? At the end of this change we were releasing hundreds of times per month. The confidence of teams was growing, as was confidence in the teams. In Part I, I talked about the annual feature freeze in December (we were uncomfortable with change while people were away, and with our ability to handle the new traffic surge in January); we no longer force a change freeze and haven’t done so for the past two years. What we do say is that squads must have support available for the two days following any deploy (we are in the process of adapting our support model – something for a later post).
As teams’ confidence grew, some changes were made to the default continuous delivery pattern – some squads would deploy to the integration review and pre-production environments simultaneously, and a minority moved on to automatically rolling out canary releases across production if the metrics were healthy. These were signposts of the direction we would ultimately move in. In Part III I will describe how continuous delivery increased our release frequency by another order of magnitude.
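A minimal sketch of that kind of automated canary check follows; the metric names, baseline source and thresholds are assumptions for illustration – the real decision was based on each service’s own KPIs and business metrics.

```python
# Sketch of an automated canary check: compare the canary against the
# current production baseline before rolling out. All values are invented.

def canary_is_healthy(canary, baseline,
                      max_error_ratio=1.2, max_latency_ratio=1.3):
    """Return True if the canary's metrics are close enough to the baseline."""
    if canary["error_rate"] > baseline["error_rate"] * max_error_ratio:
        return False
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_ratio:
        return False
    return True

baseline = {"error_rate": 0.002, "p95_latency_ms": 180}  # existing production nodes
canary = {"error_rate": 0.0021, "p95_latency_ms": 190}   # the canary node

if canary_is_healthy(canary, baseline):
    print("Canary healthy: roll the release out to all production nodes.")
else:
    print("Canary unhealthy: roll back and alert the owning squad.")
```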
Now continue to Part III of the series, “From 1,000 to 30,000 releases per year, and beyond”.