Continuous Integration – Where we were, where we are now.
Posted by Stuart Davidson
Skyscanner is growing rapidly – which is awesome, don’t get me wrong. However, we’ve found (sometimes the hard way) that the processes we rely on and the tools that we set up for one level of scale don’t work when that scale increases.
That’s why my old squad, Release Engineering, has gone through a metamorphosis – emerging at the start of the year as the newly formed Deployment Orchestration squad.
It’s not just a new name though. We had a really hard look at the problems that were facing us and came up with a new direction that we hope will be better equipped to handle the sort of growth that an internet economy company has to deal with. There’s a great deal of pressure for a company like ours to get this right – release engineering is now considered a strategic asset in an engineering company and the formation of Deployment Orchestration was a company-level OKR.
I’ve been asked to put a series of blog posts together to explain the problems we were having as Release Engineering and what changes the new squad has made since it was created. I hope that they’ll strike a chord with teams who are struggling like we were and give an insight into how we decided to move forward.
So let’s start with an area that most engineers should be reasonably familiar with – Continuous Integration.
Where we were…
If you’re responsible for the Continuous Integration (CI) tool in your company – something like Jenkins or TeamCity – you’ll probably recognise this story.
As the Release Engineering squad, we were responsible for the CI and CD tools that Skyscanner uses. As more engineers came on board and more squads were formed, we bought and added more agents to our TeamCity agent pools.
However, we started to realise that we were spending all of our time on tiny little changes – bespoke for each squad’s requirements. It didn’t happen overnight, but the realisation soon struck that we were perpetually fire-fighting problems rather than dealing with them at the root. There was never any breathing room to improve things – we were on the defence all the time and it felt like we had to do quite a bit of apologising to Squads who relied on our tools.
This boiled down to a fundamental clash of priorities when it came to running and managing a Continuous Integration system – something I’ll cheekily call ‘The Jenkins Paradox’.
The Jenkins Paradox
On one hand, we wanted to help Development Squads use the newest and most up-to-date versions of the tools on a day-to-day basis. Without these, we were not enabling our squads to innovate and leverage the new features or security improvements that were being made.
On the other hand, we had to provide a platform that was stable and could reproduce builds when required. The more changes we made and the more versions of a tool we installed on the CI agent boxes, the more chance there was of something fundamentally breaking.
We tried breaking the agent pools into groups and defining build environments for groups of teams – but that felt backward. Why was the Release Engineering Squad defining what tools the developers should use? Why should our requirements come before those of the developers we were trying to enable?
These two priorities were fundamental to our job, but we couldn’t do one properly without ignoring the other.
Shifting the friction
Our solution to this problem as Deployment Orchestration was to look at container-based CI tools. By allowing developers to define their own build environment as a container, all we need to do is focus on the stability and scale of the platform.
Although we were initially nervous about how well this model might be accepted in engineering squads, it turns out they love it! It does mean a little more work on their side but they see the benefit as it removes a huge roadblock (us) to experimentation and progress.
Ultimately this was a responsibility that Development Squads wanted to have.
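To make the idea of a developer-defined build environment a little more concrete, here is a minimal sketch in Python of what "run the build inside the container the squad chose" can boil down to. It is not how Drone works internally. The function name, the `/workspace` mount point, the `python:3.11` image and the build commands are all assumptions picked for illustration, and the function only builds a standard `docker run` command line rather than executing it.

```python
import shlex


def docker_build_command(image, commands, workspace, mount_point="/workspace"):
    """Return the argv for running build commands inside a container image.

    Hypothetical helper: the squad picks `image` and `commands`, while the
    platform only needs to know how to start a container.
    """
    if not commands:
        raise ValueError("at least one build command is required")
    # Chain commands with && so the build stops at the first failure.
    script = " && ".join(commands)
    return [
        "docker", "run", "--rm",             # remove the container afterwards
        "-v", f"{workspace}:{mount_point}",  # mount the checked-out code
        "-w", mount_point,                   # run commands from the checkout
        image,
        "sh", "-c", script,
    ]


# Illustrative values only: a squad asking for a Python 3.11 environment.
argv = docker_build_command(
    "python:3.11",
    ["pip install -r requirements.txt", "pytest"],
    "/home/ci/checkout",
)
print(shlex.join(argv))
```

The point of the sketch is the split in responsibility: everything squad-specific arrives as data (the image and the commands), so changing a tool version never requires touching the build agents themselves.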
Where we are…
So what did we pick in the end? After much investigation, we went with a new CI tool called Drone. There are many reasons why we picked that tool, but here are our top three.
- Although other CI tools were starting to offer container-based builds as an option, Drone was one of the few tools that had been built with container-based build environments at its very core. We reckoned this was a better path than going with a tool that seemed to be forcing containers into an existing model.
- We liked the fact that all of the configuration is stored in the repository alongside the code. Although this isn’t a new feature by any means, few of the more common CI tools offer it as standard.
- Drone is open source and it means that we can give back to the community. Many of the new squad come from a development background and we feel we’re empowered to fix broken things rather than sit back and wait on a support request.
We also found once we’d started using it that the plugin system was excellent and we’ve leveraged that for our zero-click-to-production deployment system we call Slingshot, but there’ll be more about that in another post!
We deployed Drone in AWS, with the ‘master’ running in a container on a t2.medium controlled by ECS and the ‘nodes’ (think build agents) running on t2.small boxes in an auto-scaling group. We’ve added CPU alarms on the build boxes that terminate an instance if it doesn’t use CPU for a certain amount of time. We’ve also made a small code improvement to our internal fork of the Drone codebase that removes nodes that cannot be communicated with (we’ll submit that back just as soon as we’ve proven it works as intended…).
This is what it looks like! The blue line shows the number of EC2 instances we have running as nodes, the yellow line is the number of running jobs and the green line is a trace of any jobs that are waiting to be actioned. We’re still tweaking, but it means we’re saving money overnight rather than paying for dedicated build boxes.
- There is a lag between a drop in load and instances being terminated, because we let each instance run for at least an hour, as we’ve paid for it anyway.
- We don’t need to worry about teams working out of timezone – if Squads need build capacity, the system will spin up the required nodes. However, we’ll probably look to spin up some nodes in preparation for the mid-week 9am GMT rush…
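The scale-down rule described above (terminate a node once it has stopped using CPU, but not before it has run for at least an hour) can be sketched in Python as follows. This is a simplified illustration, not our actual alarm configuration: the ten-minute idle window, the 5% CPU threshold, the `Node` class and the function name are all assumptions, and no AWS APIs are called.

```python
from dataclasses import dataclass, field

MIN_RUNTIME_SECONDS = 3600  # "at least an hour", as described in the text
IDLE_WINDOW_SECONDS = 600   # assumed: how long a node must stay idle
IDLE_CPU_PERCENT = 5.0      # assumed: below this a node counts as idle


@dataclass
class Node:
    name: str
    launched_at: float  # seconds since some epoch
    # (timestamp, cpu_percent) pairs, oldest first
    cpu_samples: list = field(default_factory=list)


def should_terminate(node, now):
    """Return True if the node is both old enough and idle."""
    # Keep every node for the minimum runtime, since it is already paid for.
    if now - node.launched_at < MIN_RUNTIME_SECONDS:
        return False
    window_start = now - IDLE_WINDOW_SECONDS
    recent = [cpu for ts, cpu in node.cpu_samples if ts >= window_start]
    # With no recent data we cannot tell, so err on the side of keeping it.
    if not recent:
        return False
    return all(cpu < IDLE_CPU_PERCENT for cpu in recent)


# Illustrative nodes: one idle and old enough, one idle but too young,
# one old enough but still busy.
idle_old = Node("node-a", 0, [(3500, 1.0), (3800, 0.5), (4000, 2.0)])
idle_young = Node("node-b", 3000, [(3500, 1.0), (4000, 0.5)])
busy_old = Node("node-c", 0, [(3500, 1.0), (3900, 80.0)])

to_terminate = [n.name for n in (idle_old, idle_young, busy_old)
                if should_terminate(n, now=4000)]
```

Keeping the minimum-runtime check separate from the idle check is what produces the lag between load dropping and instances disappearing.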
By implementing a container-based CI solution and moving the responsibility of maintaining build environments onto the developers, we’ve seen a bunch of benefits.
Our new Squad now has the breathing room to implement improvements – leading to decreased cost, increased visibility and new features in tools that we use (as well as less stressed Squad members!).
More fundamentally for Skyscanner though, the system allows developers rather than administrators to make technology decisions, to experiment without having to ask our permission and to react quickly to the marketplace without us getting in the way.