From 20 to 2 million releases a year, part 1

Posted on by Alistair Hann

As Skyscanner scaled from an engineering team of 30 with one website and three services to a team of 100 engineers, release frequency halved. This is the story of the turnaround as the company went on to grow to 400 engineers, with over 100 services, releasing at thousands of times the previous rate. This series of three blog posts shares that story, and how our goal of 10,000 releases per day informs our tooling, our processes, and how we think about writing software.

Part I – From 20 to 10 (aka ‘what went horribly wrong’)

I joined Skyscanner at the end of 2010 after it acquired the start-up I founded, Zoombu. Zoombu had a team of five and we were deploying our product at the end of every week-long sprint. Skyscanner had an engineering team of thirty and the product was going out every two weeks, like clockwork. That seemed reasonable to me – there was a lot more surface area, and it was far better than a household name Online Travel Agent that I knew was shipping every six months.

These blog posts are the story of what happened in the subsequent six years. To help put some context around that, the following graph shows the number of UMVs Skyscanner saw from December 2004 to February 2016. One interesting thing to note is the jump in traffic after each December – in the travel industry things get quiet towards the end of the year, but as soon as Christmas and other festivities are over, people start planning and booking their travel, and that is when we experience the biggest jump in load on our services (this isn’t the case in all countries, but it does characterise much of Europe and the Americas). Something else to note is that as a team we were growing at a similar rate, and that included the engineering team.


The second graph I want to share shows how many releases we were doing per month in the two years after October 2010 – you can see that there was a notable decline, with releases dropping to monthly and then even less frequent. In practice, the graph doesn’t tell the true horror of the situation – we entered a phase of alternating between low-risk releases and high-risk releases. The idea was that lots of low-risk changes could be bundled into one release, and riskier major feature releases would be scheduled into the other. Thus, in reality, a minor change could take eight to twelve weeks to go out if it emerged at the wrong time in the cycle.

Release frequency at Skyscanner over time, 2010-2013

It’s also worth calling out the big dips around December 2010, December 2011, and December 2012. Those were the ‘feature freezes’ we would start weeks before Christmas – the anticipated traffic surge meant we were fearful to release changes until we knew what the product could handle, and we also wanted a quiet period for supporting the site while people were enjoying their Christmas break.

Release frequency at Skyscanner over time, 2011-2013

So, why was our release frequency tanking? The clue is in the third graph – the number of days each release took. It sounds obvious to say the release frequency dropped because it took longer to release – the key thing to note is that as soon as we finished one release we would start the next one. We couldn’t get any faster, because it was taking so long to get from the release train being ‘closed’ to further changes, to the new release being deployed and validated in production. At which point ‘int’ would reopen and the whole process would start again. To quantify the situation, at peak pain:

  • 59 Changes Per Release
  • 9 Bugs found in Pre-Production Regression
  • 21 Days end-to-end
  • 125 Days of human effort

There was a lot of manual regression testing; if a bug was found during the regression, a patch had to be made and the regression test repeated. That regression was painful: people didn’t want to do it, only a small number of people knew how to fully test each component, and it was doubly painful when it had to be repeated.

We were operating canary releases – a new release would be deployed to a subset of all production traffic called ‘Staging’, and it would only be rolled out to serve all production traffic if the KPIs were healthy (in the email snippet below, there were concerns about the metrics of a release that had been on Staging for five days). Validating that a new version was performing correctly took so long because of latency in metric collection, the volume of change, a behaviour where new deployments needed to ‘warm up’, and variability in the KPIs between production nodes, which meant there was no clear benchmark for a healthy KPI value.

An emergency meeting invitation at release time

An invitation for an emergency meeting in February 2012. A KPI (‘G1’) was down against staging, and people across the company needed to decide what to do.

Despite the bugs found in regression, nearly every second release required a re-release because we found an issue on production which had been missed in regression and was only identified after the successful rollout across all nodes. If the issue couldn’t wait until the next release, the answer was to deploy a ‘hot fix’ – this risky and painful procedure meant carrying out an accelerated version of the release process, only doing regression on a subset of the product. It wasn’t a decision taken lightly because of the risks, pain, and delaying the next scheduled release (I hid the hot fixes from the release frequency graph above, as they confuse the trend).

Muppet cupcakes - photo "Cupcake Brilliance" by Bret Jorhan, CC BY 2.0 Licence

An early warning sign may have been that we had names for each of our releases. If a website release is significant enough to have a name, there’s probably something wrong. A bigger warning still, may have been that at one stage we were naming our releases after Muppets. Acknowledgement: “Cupcake Brilliance” by Bret Jorhan is licensed under CC BY 2.0.

A model I use to explain how we ended up in the painful situation is that there was a negative cycle going on – problems were feeding each other, thus a slow degradation rapidly got a lot worse as the cycle gained momentum. I have drawn a simple model of this below:

A diagram showing how some problems actually fed each other

With a negative spiral feeding itself, we were releasing code more and more slowly, even though the size of our engineering team was rapidly increasing. We reached a point where we had to stop and completely change how we organized ourselves and how we deployed software. In Part II, I discuss how we changed our organization, services, and release processes in order to get a 100x improvement in release frequency.

Continue reading Part II of the series, “From 10 to 1000 releases per year – Microservices and Continuous Delivery”.


Sign up for email updates from the CodeVoyagers team

Using ElasticSearch as a data service for hotel offers

Posted on by Pau Freixes

ElasticSearch is a scalable and highly available distributed search engine built upon Lucene. The Hotels Backend squad has bet on it as one of the key pieces of the new backend architecture, named Bellboy. In this architecture, ElasticSearch is used as a data service to store hotel offers temporarily, making them searchable for the user.

Each time a user looks for hotels and their offers in a specific city, those offers are indexed into an ElasticSearch cluster. From then on, all further user requests to search, filter, and so on end up as queries to this cluster.

Having ElasticSearch as a data service allows Bellboy to:

  • Scale horizontally, decoupled from the other pieces of the architecture.
  • Index any field to make it searchable.
  • Support almost any type of field, from strings, integers, and lists to geo points.
  • Count hotels and offers in a powerful way through its faceting system.

The following sections explain how ElasticSearch is used by Bellboy to index hotel offers and make them searchable for the user.

Indexing offers to ElasticSearch

Each time a new user searches for hotel offers in a specific city, a group of services retrieves the set of hotels in that city and the offers provided by each partner. As a result of this process, Bellboy indexes these hotels and offers in a denormalized form, embedding the hotel fields in each offer so that the result is a single document type. This denormalization is how the joins between hotels and offers are materialized in a non-relational database such as ElasticSearch.

An offer is a JSON document composed of a set of regular fields coming from the hotel and the offer, plus a set of internal fields, among them the search_id field. This field is a unique ID that identifies the group of documents related to a certain search, and it acts as a logical partition key. Therefore, an ElasticSearch index is composed of many logical partitions, each belonging to a specific search.

A new user search triggers many concurrent tasks, each one related to a provider. The offers they return come packed in batches of N documents, which are indexed into ElasticSearch. Each batch operation is routed with the value of the search_id field. This routing makes ElasticSearch store all documents belonging to the same search in the same shard, so the query the user later makes involves only one of the shards.
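To make the routing concrete, here is a minimal sketch of how a batch of offers could be turned into a bulk-API payload routed by search_id. This is not Bellboy’s actual code – the index name and field names are assumptions for illustration:

```python
import json

def build_bulk_payload(search_id, offers, index="hotel-offers"):
    """Build an ElasticSearch bulk-API payload (NDJSON) for one batch.

    Routing every action with the search_id value makes ElasticSearch
    place all documents of the same search on the same shard, so a
    later query for that search touches only one shard.
    """
    lines = []
    for offer in offers:
        # action line: index into `index`, routed by search_id
        lines.append(json.dumps({"index": {"_index": index, "routing": search_id}}))
        doc = dict(offer)
        doc["search_id"] = search_id  # the logical partition key
        lines.append(json.dumps(doc))
    # bulk bodies are newline-delimited and must end with a newline
    return "\n".join(lines) + "\n"
```

The resulting payload would be POSTed to the cluster’s `_bulk` endpoint; older ElasticSearch versions spell the routing key `_routing` inside the action line.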

A search_id partition contains from several hundred documents for small entities to several thousand for large ones, and different partitions never intersect. Instead of using multiple shards for a query that seeks a handful of documents among millions, Bellboy – and more specifically ElasticSearch – handles queries that fit in a single shard and that shard’s node resources, such as CPU and memory.

Having a single shard to query also makes the term aggregation results exact for Bellboy. This term aggregation is used several times within the query to ElasticSearch and is especially crucial for performing the hotel normalization.

However, the use of document routing can produce hotspots: an unbalanced document distribution that can leave some shards much heavier than others.

The following picture shows the distribution of documents across the shards of a Bellboy ElasticSearch cluster using as many shards as nodes.


As can be seen in the previous picture, the distribution is not perfect: one of the shards has roughly 30% fewer documents than the others. To mitigate this issue, Bellboy doubles the number of shards per node. The following picture shows the distribution using that configuration:


In this case, the shards have an almost equal number of documents. Therefore, the function used to calculate the search_id behaves close to a uniform function, avoiding hotspots.

Adding more nodes, increasing the offers per second

The number of hotel offers indexed per second may vary for different reasons: the traffic expected, the number of offers handled by a search, and so on. Bellboy takes this into account when configuring the number of nodes in the ElasticSearch cluster to meet the throughput requirements.

In this scenario ElasticSearch allows Bellboy to increase the throughput by adding more nodes to the cluster without modifying the general architecture.

The following picture shows the maximum throughput, in offers per second, reached by an ElasticSearch cluster composed of two, three, and four nodes.


As can be seen in the graphic, the number of offers per second grows almost linearly as the number of nodes increases. All performance tests were executed using c3.2xlarge instance types with ephemeral SSD disks. The instances were distributed across different availability zones. Each hotel offer was a JSON document weighing around 1 kilobyte, with roughly 30 fields.

Evicting offers from ElasticSearch

Hotel offers have a limited lifespan of 15 minutes. Once this time has passed, they have to be refreshed by the pricing service. This minimizes the chance of publishing outdated prices, giving the user the latest offers gathered from the providers once the 15 minutes have expired.

Delete and update operations are very inefficient in ElasticSearch; removing the outdated hotel offers held in an index through individual delete operations would directly impact the machine’s resources.

Bellboy uses a technique called ‘index per time frame’: instead of removing documents one by one, it erases a whole set of documents at once by removing the entire index. The cost of this operation in terms of resource usage is almost negligible. Every time there is a new user request, it is routed to an index that will last 15 minutes.
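A sketch of the idea (the naming scheme is illustrative, not Bellboy’s actual implementation): a new search is routed to the index of its 15-minute frame, and eviction is simply dropping a whole old index rather than deleting its documents:

```python
from datetime import datetime, timezone

FRAME_SECONDS = 15 * 60  # offers live for 15 minutes

def index_for(moment):
    """Name of the time-frame index a new search is routed to."""
    frame = int(moment.timestamp()) // FRAME_SECONDS
    return f"offers-{frame}"

def index_to_drop(moment, keep_frames=2):
    """The frame that just fell out of the retention window.

    Deleting this entire index evicts all of its searches at once,
    which is far cheaper than per-document delete operations.
    """
    current = int(moment.timestamp()) // FRAME_SECONDS
    return f"offers-{current - keep_frames}"
```

Two searches created within the same 15-minute window land in the same index; a search created after the window rolls over lands in the next one.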

The following picture shows the indices placed in time order, with three different searches. Each search was routed to the proper index when it was created. Index n-2 no longer exists, and neither do its hotel offers.


Because Bellboy has to make sure that the set of hotel offers belonging to a search stays alive for 15 minutes, the indices are kept for 30 minutes. Until an index is removed automatically, the outdated searches it contains are still alive but are no longer used.

From near real time to real time

When a search request is created, it has no hotel offers indexed yet. Before fetching the prices, Bellboy stores the minimum number of offers expected across all partners and hotels. This information is later used to check whether the offers expected for this request have been indexed.

Due to ElasticSearch’s near-real-time behavior, the minimum number of offers stored in Bellboy at the beginning has to be compared against the result of a query to ElasticSearch. When these values match, the indexing process has finished and the user retrieves a consistent output.

Instead of running a query continuously, Bellboy adds a specific aggregation to the user query. This ad-hoc aggregation stage counts the offers, carries out the matching between the values, and completes the request.
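As a sketch of that check (the field and aggregation names are assumptions, not Bellboy’s real schema), the ad-hoc stage could be a value_count aggregation whose result is compared with the expected number stored at search creation:

```python
def completion_aggregation():
    """Extra aggregation piggybacked on the user query: counts the
    offers already indexed for the search being queried."""
    return {"indexed_offers": {"value_count": {"field": "search_id"}}}

def search_is_complete(expected_offers, es_response):
    """True once indexing has caught up with the number of offers the
    pricing service said to expect for this search."""
    indexed = es_response["aggregations"]["indexed_offers"]["value"]
    return indexed >= expected_offers
```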

However, the number of offers per partner for each hotel can turn out lower than initially expected, due to accommodation restrictions or other issues. When this happens, the pricing service in charge of fetching the offers decreases the expected hotel offers value.

Retrieving the offers and their aggregations

ElasticSearch implements a query interface designed for filtering and aggregation over the inverted index exposed by Lucene.

A user query becomes an ElasticSearch query composed of a filter stage and an aggregation stage. The filter stage prunes the set of hotel offers using the search_id field, so the aggregation stage only receives the offers belonging to a certain search. The aggregation stage is composed of many independent sub-aggregation stages in charge of retrieving the list of hotel offers and counting hotels by their characteristics.
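A simplified shape of such a query (field names like ‘stars’ and ‘price’ are illustrative, and the real Bellboy query is far larger):

```python
def build_search_query(search_id, page_size=20):
    """Filter to one logical partition, then aggregate over it.

    The filter prunes everything outside the search's partition, so the
    aggregation stages only ever see that search's offers.
    """
    return {
        "size": 0,  # we only want aggregation results, not raw hits
        "query": {"bool": {"filter": {"term": {"search_id": search_id}}}},
        "aggs": {
            # count hotels per characteristic (a term aggregation)
            "hotels_by_stars": {"terms": {"field": "stars"}},
            # retrieve the cheapest offers themselves
            "top_offers": {
                "top_hits": {"size": page_size, "sort": [{"price": "asc"}]}
            },
        },
    }
```

The query would be sent with the same `routing=search_id` parameter used at index time, so only the one shard holding the partition executes it.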

ElasticSearch allows Bellboy to build a complex and deep aggregation tree. Each aggregation stage can be seen as a map function and its nested aggregation stages as reduce functions, and this nesting can continue indefinitely.

Query performance, keeping the latency predictable

The full query used by Bellboy for each user request is a long JSON document that can use almost 30 different aggregation stages. Despite the query size, ElasticSearch is able to execute it in less than 50 ms on any modern hardware. It is worth mentioning that the logical partition identified by the search_id field can contain from a few hundred documents up to several thousand for large entities.

The write throughput required and the number of reads faced at one time are dependent variables that grow proportionally: for a write throughput of 10,000 offers per second, the expected number of requests per second might be 10. With twice this traffic, the expected values become 20,000 writes per second and 20 requests per second.

The next table shows the read latency behavior under different write loads, from 0% to 75%, with the reads per second increasing proportionally.


With 25% of the resources used by the indexing process, the read latency at all percentiles stays below 100 ms. When the resource usage is doubled, the average latencies increase proportionally. Even though the 50th, 75th, and 90th percentiles behave acceptably during the tests, the 99th percentile spikes to 1.5 seconds.

With the aim of keeping the read latency below 0.5 seconds, Bellboy sets the bulk threads variable to half of the CPUs available. This throttles the CPU resources each ElasticSearch node dedicates to indexing offers. At the same time, the bulk queue size is configured with a number big enough to hold the index operations waiting for a CPU time slice.

Placing the ElasticSearch Cluster at AWS

Bellboy places the ElasticSearch cluster in an Auto Scaling group with a constant number of machines, with a Load Balancer in front of it as the entry point. The following image shows this configuration:


The number of nodes in an ElasticSearch cluster deployed by Bellboy is always odd, with the nodes placed in different availability zones. This configuration increases the service availability and helps ElasticSearch with consensus resolution in case one of the nodes goes down.
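The odd node count matters because pre-7.x ElasticSearch clusters avoid split brain through a quorum of master-eligible nodes (the discovery.zen.minimum_master_nodes setting). The arithmetic is simple:

```python
def minimum_master_nodes(cluster_size):
    """Quorum of master-eligible nodes needed to elect a master
    (the discovery.zen.minimum_master_nodes setting in pre-7.x ES)."""
    return cluster_size // 2 + 1
```

With 3 nodes the quorum is 2, so the cluster survives one failure; a 4-node cluster needs a quorum of 3 and so tolerates no more failures than 3 nodes do, which is why odd sizes are preferred.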

The Auto Scaling group keeps the number of nodes constant. If one node goes down, it is replaced by a new one, launched with the proper ElasticSearch configuration to become a member of the cluster.

While the cluster is one node short, its state becomes Yellow – ElasticSearch has allocated all of the primary shards, but some or all of the replicas have not been allocated – but ElasticSearch can still operate, and Bellboy continues relying on the cluster. The replicas of the primary shards that were placed on the downed node are promoted to primaries. When the Auto Scaling group replaces the node with a new one, the shards are rebalanced automatically, placing some of the replicas and primaries on the new node.

Even though this operation can be IO- and CPU-expensive, the shards belonging to Bellboy’s indices are only marginally impacted. The following graphic shows the spike produced by a rebalancing process after a new node was added to the Auto Scaling group to replace a broken one.


The ElasticSearch cluster remains reliable until its state becomes Red, meaning that a primary shard cannot be found. When this happens, the ElasticSearch cluster is no longer used and the traffic is redirected to another Bellboy stack.


ElasticSearch is a product that unlocks the full potential of Lucene by distributing it across several nodes. This allows horizontal scaling to a higher document throughput through a JSON API. However, it is important to know the ElasticSearch fundamentals and how they can fit in your product.


Video: Leading Distributed Teams at Scale

Posted on by David Low

Earlier this summer, Skyscanner’s SVP Engineering Bryan Dove gave a talk at the Rocket Tech Summit in Berlin about the experiences of his current role, lessons from his career to date, and how to meet the challenges faced by technical leaders.

The presentation (27 minutes) is a must-watch for anyone interested in leading the technology and people at a high-growth company like Skyscanner, with its many competing priorities, an ever-larger workforce, and technology advancing as it always does.

As always, we welcome your comments, please get in touch at @CodeVoyagers on Twitter and let us know what you think.


Does research = usability research? I don’t think so.

Posted on by Laszlo Priskin


I often find myself in discussions in which people ask “what is the role of ‘design / user research’?” or “how can ‘research’ support the product development process?”. It also often happens that ‘design / user research’ is mentioned as a synonym of ‘usability research’. You can find amazingly well-crafted ‘101 guides on how to conduct usability studies’, and more and more organisations use those techniques as a matter of course.

The phenomenon of ‘design / user research’ being equated with ‘usability research’ made me think. In the past few years, I was lucky to take on various ‘research challenges’ within Skyscanner’s Apps Tribe and its Design Team. As our product grows, we face more and more complex problems, and it strikes me that we need to understand the nature of human beings ever more profoundly.

In this journey, Steve ‘Buzz’ Pearce and Bálint Orosz, two of my professional mentors at Skyscanner, inspired me to try out or develop new methods in order to answer the fundamental questions our travellers are faced with. This journey helped me realise how diverse the world of ‘design / user research’ is, and also that besides ‘usability research’ multiple other fields of research exist which can add significant value to product development processes.

Let me share with you a framework which I call the ‘Four Layers of Research’. It is actually more like a research mindset and it would be great to hear whether you can relate to it and also to hear what methods you use in the case of the below-mentioned ‘layers’.

The Four Layers of Research


1. Usability research

When building products, a highly important factor is whether or not people can actually use what we build. To illustrate: if they would like to move forward in our app, will they find how to take the ‘next step’? If they would like to ‘go back’ one step, will they figure out how to do it? In this sense, usability research is all about making sure that the way we realised our solution is in line with what people expect and what feels natural to them.

Simply put, in usability research studies we are not focusing on whether people need the ‘Back button’ – we assume that they need it. The question we focus on is whether, in the moment they would like to go back, they know immediately, intuitively, and without further thinking how to do that.

2. Valuability research

This area of research is all about understanding whether people actually need a ‘Back button’ or not. Valuability research could help a lot in validating or falsifying a solution we plan to build for our travellers.

‘Validating or falsifying’ and ‘plan to build’ are the key terms here. We all have many nice ideas about what to build for our users, but one of the most important questions is whether people really need a given solution or not. In valuability research studies, we consciously ignore whether our solution is usable; instead, we focus on whether it adds value and meaning to people’s everyday lives.

In other words: does our solution really resonate with our travellers’ needs, and does it really solve something valuable for them? Valuability research can be a powerful tool in the ‘product discovery phase’ – specifically at the stage before we start building anything, when we are just planning to build ‘something’.

Honestly speaking, for me, separating ‘usability’ and ‘valuability’ questions in research studies is extremely hard. With prototypes, there are so many things that create ‘noise’ and make it hard to identify why our solution fails in user discussions. The ultimate question is always there: did our solution fail because users don’t need it, or because we created something absolutely unusable for them?

To overcome this, emotional intelligence best practices and a deep analysis of people’s emotions and mental state help me a lot. Can you recall moments when a user realised the value of a feature you were working on and started talking about it honestly and passionately? Shining eyes can be a good sign that you might have built something lovable (on the other hand, I try to keep in mind Peter Schwartz’s thought: “It is always worth asking yourself: how could I be wrong?”).

3. Contextual research

There are two different types of situations that regularly come up in our product development processes:

  • What are the needs our users have not yet realised themselves, but would really love us to figure out for them?
  • Or: we come up with new directions, new products, or sets of features for a group of people and start believing they would add lots of value for them – but how can we validate or falsify our assumptions, and how can we learn more about their context and environment in order to fit into those naturally?

In such situations, it could be best to ‘live and breathe’ with those people for whom we would like to build ‘that next big thing’. It’s often mentioned as ‘ethnographic research’ or we can call it ‘contextual research’ as well.

At this early stage of a new product or feature seed, it could easily derail us if we don’t experience directly, but instead assume we understand, how the people for whom we plan to build feel, live and behave. In the short run, it’s of course faster, cheaper, easier and more comfortable to ‘imagine’ how that group of users could feel and behave. But in the mid-run, it adds lots of value (and helps decrease risks) if we jump into the context of those users and try to understand every aspect of their life and their emotions. In product development, we always refer to the importance of ‘the users’ context’ and to the importance of their emotional and mental state. The most meaningful way for us is if we just spend time amongst our users, talking with them, living with them and, in this way, obtaining a deeper understanding of them.

Being with them also enables us to understand them consciously. This is one form of what we call bringing ‘people’s context into the house’, and it opens up opportunities for product design to come up with solutions that really resonate with people. To be very pragmatic: contextual research can help you understand how people live and what emotions they have, and you may spot a need that leads you to design a ‘Back button’ (and then test its valuability and usability).

4. Conceptual research

Have you ever had a very fundamental question that everybody around you refers to, but you never had the chance to spend enough time with it, go deep enough with it, and understand how it’s embedded in human nature? We love these fundamental questions, such as ‘what is trust’, ‘what is personalisation’ or ‘who are our travellers’. They help us question the status quo by going deeper and deeper, day by day.

To illustrate this with a tangible example: in the case of our trust research, we turned to respected professors and subject matter experts in the fields of social sciences and behavioural psychology. We examined various concepts, tried to embrace as many thoughts as possible about the abstract notion of ‘trust’, and thought about how we could apply our learnings to the world of digital products. Then we distilled our learnings into a practical tool we called the ‘Trust Map’. The Trust Map enabled us to analyse our iOS application through the lens of trust (based on feedback we captured from our travellers). In the framework of a workshop, we came up with various ideas on how to move forward. Of course, we had tons of ideas, but once those many ideas were on a sheet of paper, we started to realise how they were connected with each other, and we could synthesise them into topics. Now we had a ‘set of topics’ on the table, and we think that if we explore them further, they can help us build more meaningful and trustworthy relationships with our travellers in a more ‘human’ way.

So how did conceptual research help us? We translated this abstract substance called ‘trust’ into opportunities in our product. And as a squad or a design working group picks up one of these topics, they can start a focused ‘product discovery process’: do some contextual research to gather some real-life experiences, then craft and prototype solutions, test whether those solutions are valuable for users, iterate on them and if they are confident about the value of their solution, then test its usability. At the end of the journey, release and learn. And iterate and learn and iterate.


László Priskin, Design / User researcher at Skyscanner. László is based in Budapest, Hungary, working as a team member on Skyscanner’s renewed mobile app available on Android & on iOS. He started sharing his thoughts, because he passionately believes in the power of discussion. He thinks whatever is written above will be outdated in a few weeks’ time, because building products means that we inspire each other, criticize each other and continuously expand our ways of thinking. László is happy to get in touch with you in the comments below, on Linkedin or Twitter as well. Views are his own.


This blogpost is part of our apps series on Codevoyagers. Check out our previous posts:

The present and future of app release at Skyscanner

Cooking up an alternative: making a ‘knocking gesture’ controlled app

How we improved the quality of app analytics data by 10x

Designing Mobile First Service APIs

How We Migrated Our Objective C Projects to Swift – Step By Step

Analytics and Data Driven Development in Apps

Transitioning From Objective C to Swift in 4 Steps – Without Rewriting The Existing Code


From Flask to aiohttp

Posted on by Manuel Miranda



This post is about how to have a global context shared during the flow of a request in aiohttp.

Why, the context

In Skyscanner hotels we are developing a new service with Python 3 (h*ll yeah!), asyncio, and aiohttp, among other tools. As you can imagine, the company architecture is full of different microservices, and tracking a user’s journey through them can be really painful. That’s why there is a guideline saying that all services should use something that allows us to track this journey across services. This something is the X-CorrelationID header. So, to ensure proper traceability, our service should do the following:

  1. Send the X-CorrelationID header in all calls to Skyscanner services.
  2. Include the X-CorrelationID in all log traces related to a request/response cycle.
  3. Return the X-CorrelationID used in the response.
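Those three steps can be sketched framework-agnostically. In this hypothetical snippet, a ‘request’ and a ‘response’ are just dicts with a ‘headers’ key, standing in for whatever objects the web framework provides:

```python
import uuid

HEADER = "X-CorrelationID"

def with_correlation_id(handler):
    """Reuse the incoming X-CorrelationID (or mint a new one), expose it
    to the handler, and echo it back on the response."""
    def wrapped(request):
        cid = request["headers"].get(HEADER) or uuid.uuid4().hex
        request["correlation_id"] = cid    # available for outgoing calls and logs
        response = handler(request)
        response["headers"][HEADER] = cid  # step 3: return it to the caller
        return response
    return wrapped
```

The hard part, which the rest of this post is about, is making that `correlation_id` reachable from anywhere in the code without threading it through every function call.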

Request/Response cycle with X-CorrelationID header

From the diagram above, you can see that we will be reusing the header in many places (calls to services, log calls, etc.). Knowing that, you may realize this header should be stored somewhere accessible from everywhere in our code.

If you’ve ever worked with aiohttp, you may have seen that there is no way of sharing state or storing global variables within the request/response cycle. The way of sharing the request’s information is by propagating the Request object throughout all the function calls. If you’ve ever worked with Django, it’s the same pattern.

Obviously, we could do that and finish the post here, but it goes totally against clean code, maintainability, DRY, and other best-practice principles. So the question is: how can we share this variable during the whole request/response cycle without passing it explicitly?

At that point I decided to do some research: asking other engineers about similar patterns, checking code from other tools and frameworks, etc. After a while, my brain looked more or less like this:


So yes, I came across an interesting pattern: how Flask uses threads to store local information like the request object. I won’t go deep into how Flask works, but to give you an idea, let’s read a paragraph from the “how the context works” section of its docs (take your time):

The method request_context() returns a new RequestContext object and uses it in combination with the with statement to bind the context. Everything that is called from the same thread from this point onwards until the end of the with statement will have access to the request globals (flask.request and others).

So that paragraph means that everything executed inside the context manager has access to the request object. Awesome, isn’t it? By just executing from flask import request in any section of our code, if we are inside the context manager call, we get back the request object belonging to the current request/response cycle!
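I won’t reproduce Flask’s internals here, but a minimal stdlib-only sketch of the underlying thread-local trick (the names are mine, not Flask’s) looks like this:

```python
import threading

# Each thread gets its own slot in this object; this is the mechanism
# behind Flask's `flask.request` proxy.
_local = threading.local()

class RequestContext:
    """Bind a request object to the current thread for the duration
    of a `with` block, mimicking Flask's request_context()."""
    def __init__(self, request):
        self.request = request

    def __enter__(self):
        _local.request = self.request
        return self

    def __exit__(self, *exc):
        del _local.request

def current_request():
    # Any code running in the same thread, however deep in the call
    # stack, can reach the bound request through this accessor.
    return _local.request
```

Inside the `with` block, any function called from the same thread can call `current_request()` without the request being passed down explicitly.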

Clean and simple, right? After digging into that, my thought was: can we do the same with aiohttp? The answer is yes; the next section describes how we have implemented a similar behaviour with aiohttp (only for the header).

How, the implementation

To recap the previous section: “We want a variable to be easily accessible from any part of our code during the request/response cycle without the need to pass it explicitly to all function calls”.

Python coroutines are executed within the asyncio loop. This loop is in charge of picking up Futures, Tasks, etc. and executing them. Every time you use asyncio.ensure_future, await and other asynchronous calls, the code runs within a Task instance that is scheduled on the loop. You can think of Tasks as small units of work processed by the loop.

This gives us an object where we can store shared data throughout the cycle. Here are some things to keep in mind:

  • aiohttp request/response cycle is executed within a single Task.
  • Every time a coroutine is called with the await or yield from syntax, the code is executed in the same Task.
  • Other calls like asyncio.ensure_future, asyncio.call_soon, etc… create a new Task instance. If we want to share the context, we will have to do something there.

It seems we are onto something, right? The object we want to work with is Task. After checking its API reference you can see there isn't a structure, function call or anything else that lets us store context information, but since we are in Python, we can just do task.context = {"X-CorrelationID": "1234"}.
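In other words (a minimal sketch; note that in the Python 3.5-era API this post was written against, the spelling was asyncio.Task.current_task() rather than today’s asyncio.current_task()):

```python
import asyncio

async def do_work():
    # Deeper in the call chain: await keeps us in the same Task, so the
    # attribute set by the caller is reachable without being passed in.
    task = asyncio.current_task()  # Task.current_task() in the 3.5-era API
    return task.context["X-CorrelationID"]

async def handler():
    # Hang a plain dict off the Task object driving this request.
    task = asyncio.current_task()
    task.context = {"X-CorrelationID": "1234"}
    return await do_work()
```

Running `handler()` on the loop, `do_work()` reads back the id without it ever appearing in a function signature.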

Integrating task context with aiohttp

If you’ve read the previous section, you know that we want the “X-CorrelationID” header to be easily accessible during the whole request/response cycle, so we can use it in log calls and calls to external services, and return it in the response. To do that, we’ve coded a simple middleware:
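A sketch of such a middleware, written in aiohttp’s pre-3.0 middleware-factory style (treat it as illustrative rather than our exact code; the context storage is inlined here as a direct Task attribute):

```python
import asyncio
import uuid

async def correlation_id_middleware(app, handler):
    async def middleware(request):
        # Reuse the incoming header, or mint a fresh id on the first hop.
        corr_id = request.headers.get("X-CorrelationID") or str(uuid.uuid4())
        asyncio.current_task().context = {"X-CorrelationID": corr_id}
        response = await handler(request)
        # Guideline rule 3: return the id used in the response.
        response.headers["X-CorrelationID"] = corr_id
        return response
    return middleware
```

Every handler wrapped by this middleware runs with the id already stored on its Task.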

Note the import context line. The module is just a proxy to the Task.context attribute of the current Task being executed:
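A minimal version of that proxy module could look like this (the get/set names match how it is used below; the real module may differ):

```python
import asyncio

def get(key, default=None):
    """Return the value stored under `key` in the current Task's context."""
    task = asyncio.current_task()
    return getattr(task, "context", {}).get(key, default)

def set(key, value):
    """Store `value` under `key` on the current Task, creating the
    context dict on first use."""
    task = asyncio.current_task()
    if not hasattr(task, "context"):
        task.context = {}
    task.context[key] = value
```

Because both functions resolve the Task at call time, any code awaited within the same request/response cycle sees the same dict.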

Easy peasy, right? By just calling context.get(key) we get the value stored in Task.context[key], where Task is the current one being executed; by calling context.set we set the value for a given key.

Note that from now on you can call context.get("X-CorrelationID") from ANY part of your code and it will return the needed value if it exists. This, for example, allows us to inject the X-CorrelationID into our logs automatically using a custom logging.Filter:
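A sketch of such a filter (reading the Task attribute directly, so it stands alone; the fallback value is an assumption of mine):

```python
import asyncio
import logging

class CorrelationIdFilter(logging.Filter):
    """Copy the current Task's X-CorrelationID onto every log record,
    so formatters can reference %(correlation_id)s."""
    def filter(self, record):
        try:
            task = asyncio.current_task()
        except RuntimeError:  # logging outside a running loop
            task = None
        ctx = getattr(task, "context", {}) if task else {}
        record.correlation_id = ctx.get("X-CorrelationID", "-")
        return True  # never drop the record, only annotate it
```

Attach it with `logger.addFilter(CorrelationIdFilter())` and every record carries the id of the request that produced it.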

The same pattern is used for injecting the header when we need to call an internal Skyscanner service:
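A sketch of the outgoing-call side (the helper name is mine; with aiohttp’s client you would pass the result as the headers argument):

```python
import asyncio

def with_correlation_id(headers=None):
    """Return a copy of `headers` with the current Task's
    X-CorrelationID merged in, if one has been set."""
    merged = dict(headers or {})
    task = asyncio.current_task()
    ctx = getattr(task, "context", {})
    if "X-CorrelationID" in ctx:
        merged["X-CorrelationID"] = ctx["X-CorrelationID"]
    return merged

# For example, with aiohttp's ClientSession:
#   async with session.get(url, headers=with_correlation_id()) as resp:
#       ...
```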

For the simple flows that cover most of our use cases, this works well so far!

The ensure_future & co

As previously mentioned, the ensure_future call returns a new Task. This means the custom context attribute we were using is lost across the call. For our code, we solve this by creating a new context.ensure_future call that wraps the original one:
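Something along these lines (a sketch; the copy-versus-share choice is my assumption):

```python
import asyncio

def ensure_future(coro):
    """Drop-in replacement for asyncio.ensure_future that copies the
    caller's context onto the newly created Task."""
    parent = asyncio.current_task()
    task = asyncio.ensure_future(coro)
    # Copy rather than share, so the child can mutate its context
    # without racing the parent's.
    task.context = dict(getattr(parent, "context", {}))
    return task
```

The new Task is only scheduled, not yet running, when we attach the context, so the child coroutine sees it from its first line.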

This is the part I’m least happy about, because it’s not transparent to the user; it will be improved in future versions.

What(,) now?

I’ve moved this simple code to a GitHub repository called aiotask_context. Right now it only includes the functions for getting and setting variables in the current Task. Some future work I’m planning:

  • Implementing a mechanism to propagate the context when using asyncio.ensure_future, asyncio.call_soon and similar calls. Candidates are wrapping (meh) or monkey patching (uhm…).
  • Add more control to know whether the current code is being executed inside a Task and act accordingly.
  • Include examples in the repository like the aiohttp middleware, request passing, log fields injection, etc…

Just to finish, if you have any questions or feedback, don’t hesitate to comment and ask!



Sign up for email updates from the CodeVoyagers team

The present and future of app release at Skyscanner

Posted on by Tamas Chrenoczy-Nagy

Since we’ve just released our 3-in-1 update for the Skyscanner app, it’s a great time to share how we release iOS and Android apps at Skyscanner. In this post we’d like to give you a brief overview of our app release process and how we adopted the de facto industry standards and extended them to our needs. Besides this, we’d also like to give you a couple of ideas about where we are planning to improve in the future.

We’ll cover:

  • Separating code drops from feature releases
  • Release trains
  • Feature flags
  • Faster release cycles
  • Future ideas and improvements

Decoupling releasing features from releasing the binary

Imagine a company where multiple teams are working on the same app. The teams own different functions, or different screens, of the app. They would like to work and deliver new features independently. So if team A has some kind of delay in releasing a new feature, it shouldn’t block team B from releasing their own feature on time. The company should be able to release a new version of the app even if not all new features are ready for it.

It is also very important to be confident that a new feature delivers value to our users. So firstly we release it only to a small number of users, and depending on their reaction we continue the rollout to everyone.

To achieve these goals we had to decouple releasing a feature from releasing the app binary. We put new features behind feature flags and only turn them on when the feature is ready to be released in production. Meanwhile we release new binaries on a fixed schedule, so teams don’t need to synchronize their feature delivery; the schedule is known in advance and they can plan ahead easily.

Release train – shipping the new binary on schedule

Releasing the binary follows a fixed schedule – let’s call it a release train. If you finish your new feature on time, you can release it with the upcoming release train. If you are not ready on time and miss the train, you can release the feature with the next one.

On both iOS and Android we have a 2-week release cycle, which means we ship a new binary bi-weekly. The releases on the two platforms are held on alternate weeks, so each week either a new iOS or a new Android release train starts.

Feature flags – the tool for actually releasing features

By using feature flags, we are able to specify which features the users should see when they run the application. The feature flags can be modified remotely, so we are able to turn on and off features without releasing a new binary.

Feature flags provide a lot of value at four different stages of development:

  1. Feature is under development: If the feature is still under development and it is not ready to be released to any users, the flag is only turned on for developers – so they can iterate on the feature and commit changes without causing any user disruption.
  2. Rolling out a new feature: If the feature is ready to be released, we usually turn it on first for only a small percentage of users (for instance 1%), so we can measure users’ reaction and the quality of the feature (by checking analytics data – key business metrics, errors, crashes). If everything looks OK, we increase the rollout, sometimes in multiple steps, for instance 10%, then 100%.
  3. A/B testing: This scenario is similar to point 2 but with variations of what we release. We would like to experiment with a new feature before releasing it to all of our users – so sometimes we can ship multiple variants of the feature and measure which one performs the best.
  4. Kill switch: If we have a feature which has already gone live and something goes wrong (i.e. the app starts crashing because of it), we can turn it off remotely.
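The deterministic bucketing behind a percentage rollout can be sketched in a few lines (the flag store and names here are illustrative, not Skyscanner’s actual system):

```python
import hashlib

# Remotely fetched config: rollout percentage per flag.
# 0 covers the kill-switch case; 100 means fully live.
FLAGS = {"three_in_one_update": 20}

def is_enabled(flag, user_id):
    """Hash the (flag, user) pair into a stable 0-99 bucket, so each
    user keeps seeing the same decision as the rollout widens."""
    rollout = FLAGS.get(flag, 0)
    digest = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout
```

Because the bucket is derived from the user and flag rather than a random draw, raising the percentage only ever adds users to the enabled set; nobody flaps between variants.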

To release the 3-in-1 update we also used feature flags. Early versions of the feature had been present in the app binary since last December. In its early stages it was turned on only in our development builds, then in our test builds. After we finished development we released it to 1%, 5%, 20%, 50% and finally 100% of our users using the feature flags, continuously monitoring the feature’s performance as we went. After the 100% rollout we also kept the feature flag as a kill switch for a while, so if something really bad had happened we could still have rolled it back.

Releasing a new binary frequently

Increasing the frequency of releases is a key part of keeping up the pace of delivery. Frequent releases help us experiment and iterate on our features faster. We get valuable data about user behaviour sooner, so we can react to it sooner. As of now we have a 2-week release cycle on both Android and iOS, and we are planning to further increase the frequency.

Release flow

Each Wednesday morning we have a code freeze on either iOS or Android. After the code freeze we start a ~1 week long stability period. During this period our main goal is to get confident in the stability of our new release, so we are continuously testing the app and working on fixing critical/major bugs.

After we have tested and fixed all critical bugs in the new release, we still cannot be confident that the app will work well for all of our users; some issues can only be detected in production. That’s why, on Android at least, we use the staged rollout feature of Google Play. During a staged rollout the new release is rolled out to users gradually (to only 1% first, then 10%, then 100%). We continuously check key business metrics (like search rate and conversion rate), as well as error rate, crash rate, etc. If the metrics look good we increase the rollout, but if we detect any major issue, we try to disable the feature using feature flags and ship an update with the next train. If that isn’t possible, we release a hotfix to fix the problem. The staged rollout usually takes about one week.

After we have validated the new version in production as well (using staged rollout), we release the app globally. But the story doesn’t end there. We keep monitoring the app’s metrics, user feedback and reviews, and crashes, and react if necessary.

What’s next? – There is a lot we can improve

Shipping a new app version to our users every two weeks is a good thing. It is good because we can quickly iterate and release new features. But is it good enough? Let’s do some maths on how long it takes for one new feature to get to users.

Let’s say development of the feature takes two weeks. If you sum up the length of the release process, you can see it takes an extra two weeks to get that feature shipped to our users. Besides this, if the feature isn’t finished on time and misses the release train, that adds an extra two weeks again.

So shipping a new feature can take from 4 to 6 weeks to get to 100% rollout. If we need to iterate multiple times on the feature and make several experiments or adjustments, it is even longer. This is way too much compared to a web environment with a continuous delivery flow, where you can release the feature almost immediately. Can we achieve the same for apps? We believe so, in time.

Besides the areas we can improve in our own processes, we are facing several constraints that come from the nature of apps. One of these was the App Store review process, which in the past took at least one full week. Fortunately Apple has worked hard to bring this down – to as little as 1-2 days now – so it should no longer act as a blocker.

Another constraint is that apps are shipped to the users in big packages and app updates are still a big thing – and even more so if we ever had to perform a rollback of something that couldn’t quickly be hidden with feature flags.

However, we can see some promising improvements in the industry which might make app releases less of a heavy thing. Google’s Instant Apps feature is the best example and is worth checking out.

So full continuous delivery is not something that is possible for apps at this moment in time. But we can still apply best practices from the web and use them to increase our release frequency to get that bit closer.

Hopefully the industry will also move in a direction which supports our goals, so if our processes and tooling are sound, we should be able to capitalise on any changes without major lifting.

Tamas Chrenoczy-Nagy is a product engineer working on improving release and development processes and tooling for apps.

This blogpost is part of our apps series on Codevoyagers. Check out our previous posts:

The present and future of app release at Skyscanner

Cooking up an alternative: making a ‘knocking gesture’ controlled app

How we improved the quality of app analytics data by 10x

Designing Mobile First Service APIs

How We Migrated Our Objective C Projects to Swift – Step By Step

Analytics and Data Driven Development in Apps

Transitioning From Objective C to Swift in 4 Steps – Without Rewriting The Existing Code


Not “all the technologies”

Posted on by Alistair Hann

A question I am often asked about Skyscanner is what technologies we use. It’s the kind of question that I have asked of other businesses in the past – to validate my own choices, as much as anything else. My, slightly flippant, answer is “All the Technologies”, but we are moving away from that position.

Back in 2010, Skyscanner was largely a .net shop – a bunch of .net services, lots of SQL Server and one Python service. In the following years, three changes led to a Cambrian explosion in the range of technologies we use:

  • We acquired businesses with different tech stacks to us – PHP, Postgres, MySQL, etc.
  • We made a strategic choice that we would no longer build services on .net and ultimately move to entirely using Free and Open Source Software (FOSS)
  • We switched to a model with a high level of autonomy for teams
Some of the technologies we use at Skyscanner


The case against diversity

The last point in that list is a tricky one – at Skyscanner we have a model that borrows ideas from Spotify’s Squads and Tribes model. The idea is a collection of autonomous, start-up-like teams, each with complete ownership of one or more services, able to independently deploy those services, and setting its own roadmap and goals. This model is powerful because the teams can execute unencumbered, independently shipping code and delivering value to customers.

A challenge occurs when there is a feature that cannot be shipped without changes to services owned by a different team. Clever shaping of teams and feature teams can help reduce this, but there will always be some feature that requires changes outside the originating team’s services. One way of handling that situation is for the first team to take a dependency on the second team building what they need. Unfortunately that breaks the idea of autonomous teams delivering value to customers at their own heartbeat: the first team is now delayed by the second, and the second now needs to implement a feature that may not have been on their roadmap, so they also lose their independence. Another way of handling it is for the first team to make the change to the second team’s codebase themselves – they make a series of pull requests and deliver the feature independently. That works well if there is an efficient internal open source model, and if teams are all using the same technologies and tooling, that model is a lot more efficient.

When you move to a micro-service architecture with lots of independent services, there is a risk of solving the same problems many times. At Skyscanner we invest heavily in producing tooling to avoid these situations – so engineers can focus on writing new, valuable software rather than solving the same problems that everyone else has solved. Building and maintaining that tooling is difficult when there are dozens of platforms to support. Similarly, our event logging platform team may want to build SDKs to speed up adoption, and ideally they wouldn’t have to write six.

Finally, at Skyscanner we want people to have a variety of challenges. We encourage engineers to rotate between teams and take opportunities to work on different services, and as our products evolve we need to mould our organization to the oncoming work. It is a lot more efficient to move between teams if they are using a familiar tech stack and tooling.

Thus there are many savings to be made if we narrow the number of technologies in use. That doesn’t mean having only one technology stack – there are cases where it is advantageous to have a dynamic language for rapid scripting, or a compiled language for high performance. For reference, outside of native mobile app development, our default platforms are now Java, Python, Node and React. The reason for Node is the advantage of more rapid development when there is language consistency between client side and server side.

How do we get there?

In terms of how we get to that position, the stance we have always taken is not to rewrite systems for the sake of it – there is no customer value in a change like that. We are setting a direction, though: all new services should use the ‘default’ technology set. Then whenever we change things or break services into smaller components, we err towards the default technology set where it means little incremental work.

One way to encourage the shift is through the free tooling teams get for embracing the standard tools. There is a very compelling reason to use what is standard. We are also part way through migrating from co-location to AWS and again we default to using the AWS native services wherever possible, which increases convergence as well as speeding up delivery.

We are not alone in this approach. At Google there are a limited number of languages supported for use in production (C++, Go, Java and Python), and something like Ruby is not supported. The practical implementation of that is a list of everything that needs to be available for a language in production (HTTP server, bindings to talk to production infrastructure, etc.).

What about that autonomy thing?

The key thing about the model of distributed agile teams is that it is aligned autonomy. The teams are independent to execute, but they share the same purpose and goal – all our teams are working in travel, none are working in selling pet food (for example). That alignment has to happen for technology as well.

Getting the Benefits

We can already see the benefits of narrowing our technology set. We are building much richer tooling for our engineers – I was speaking to an engineer earlier today who told me how he and two other engineers had created a new micro-service from scratch and got it up and running in multi-region AWS, serving production traffic, in 45 minutes. One enabler of that was ‘Slingshot’, our zero-click-to-production deployment system – every commit is shipped to production, with automated blue/green deployment and rollback. Another was our micro-service shell for Java, which provides the basic event logging, operational monitoring, etc., so that engineers only need to write the code that is unique to their service. There is a lot more we want to do with the shell, Slingshot, and other tools, and we can develop that tooling more quickly if we are only doing so for a limited number of platforms.

Getting the complete benefit will take more time – it will be years before we only run on the supported technology stack. That means there will continue to be pain when making changes to some other teams’ codebases that are not in the supported stack, but that pain will be constantly reducing as we converge on a more consistent platform.


Podcast: How to build a billion-dollar software company

Posted on by David Low

From our friends at the Skyscanner Travel Podcast, hosts Sam, James and Hugh talk in fascinating detail with our CEO Gareth Williams about how solving the problem of booking future journeys created a whole new one.

Everything is on the table – from the original days dreaming of Daft Punk, to how personalities and goals evolved over time, and how the important thing is to enjoy that journey…


Video: how code is changing our lives

Posted on by David Low

Many people quote the well-worn phrase that ‘software is eating the world’.

But have you ever stopped to think about the impact of product engineering, or put more simply ‘code’, and how it affects our everyday lives?

With modern smartphones carrying over 600 times the computing power of a good desktop PC from barely 20 years ago – combined with the fact almost everyone will be carrying one – the power of code and computing to change our world has never been greater.

In this video, taken at Tech Talent Week in London, our SVP Engineering Bryan Dove talks about the impact of code on society, and particularly on our own world of travel – and how we as product engineers can really make an impact on how we live.


Backing up an Amazon Web Services DynamoDB

Posted on by Annette Wilson

At Skyscanner we make use of many of Amazon’s Web Services (AWS). I work in the Travel Content Platform Squad, who are responsible for making sure that re-usable content, like photographs of travel destinations, descriptive snippets of text and whole articles can be used throughout Skyscanner. That might be on the Skyscanner website, in the mobile app, in our newsletter or in whatever cool ideas our colleagues might come up with. Recently we’ve been evaluating Amazon’s DynamoDB to see if it might be appropriate as the primary means of storing data for a new service. If we use DynamoDB as a primary store, and not just a cache for something else, we’ll need to keep backups. But it wasn’t clear how best to take these.

After investigating the options and trying them out I wrote a summary for my colleagues in an internal blog. This is a lightly adapted version of that summary. I’ll warn you now, it’s quite long!

