PootleConf 2016

Posted on by Robbie Cole

Pootle is an open source web application that Skyscanner uses to enable our third party translation suppliers to translate our content. Most people within the business have never seen it, nor even heard its name, but all the application strings in the company have been through our instance of it. Pootle is the power behind the throne of our translation process.

This year the developers of Pootle, Translate House, led by Dwayne Bailey, invited us to their developer conference in London, to share our use cases, our needs and wants, and discuss the future of this vital piece of software. Skyscanner sent engineers Sarah Hale, Eamonn Lawlor and myself (Robbie Cole) to represent.

Day 1

It began with introductions and backgrounds. We shared what we were broadly interested in talking about and learning over the conference, then launched into deep-dives about how we actually use Pootle.

First, Eoin Nugent from Yelp discussed their translations systems, then Skyscanner internationalisation lead Sarah Hale presented ours.

The Yelp translation process turned out to be pretty similar to Skyscanner’s. Strings are developed by teams as English source text and submitted to Pootle to be translated, and the translations are then pushed back to the consuming codebases. Yelp still rely on developers synchronising the latest strings manually, while services at Skyscanner absorb them almost immediately from our RESTful string service, but on the whole we’re both doing pretty much the same thing. We’re not crazy, and we’re not alone, and that’s tremendously reassuring!

In life, all things begin with post-it notes.

In the afternoon our thoughts turned to open source contributions, to how we could give our tweaks back to the community. The Translate House developers presented their unit and integration testing suites and showed us how to debug Pootle, along with their strategy for coding style and standards enforcement.

Their approach is once again reassuringly aligned with ours. Much as we will not allow code into our own deployments that does not have unit tests, nor will they; when they expressed that they were worried the demand for testing would put people off contributing, we simply shrugged because comprehensive testing is second nature to us.

Getting the test suite up and running in our Windows environments wasn’t quite so smooth, however. After a few reworded command line calls we got a 100% failure rate in 2539 tests, but luckily Eamonn was able to track down a cross-platform file path bug that was at the heart of them all – and contribute the fix back to master!

Day 2

The second day brought us demonstrations of radical new features and architectural shifts in Pootle. We were first introduced to ‘Pootle File System’, or ‘Pootle FS’, a system that will streamline how Pootle synchronises data in and out of our repositories. This is exactly what we want – built-in functionality that will allow us to remove most of our intermediate bash scripts that connect Pootle to our internal services.

Any integration concerns that Pootle FS wouldn’t cover could then be solved by the next new feature, a plug-in architecture. Rather than hacking onto our own fork of Pootle, thus making up-versioning to get the latest improvements a painful process, developers will be able to build discrete plug-ins that can be installed only in their personal configuration – safely away from danger.

Caution: developers at work!

In the afternoon we descended into hackathonning. I had a stab at writing a custom format parser plug-in – the idea being that, instead of having to convert our resource files to an intermediate format for consumption, Pootle would be able to natively understand them and all the wonderful meta-data they contain.

I didn’t quite get all the way in the time available, but did get Pootle at least absorbing the information from our custom file format. We’ll definitely be picking this up again once we’ve migrated to the latest version and can take advantage of all this new good stuff. The theory, however, is already beautiful; this is a future of which we want to be a part.

We also had a discussion about scaling and Translate House’s current work to put Pootle into Docker images. Nothing is quite ready yet, but this area is under active development, so we’re hoping that once these become available our local development and testing capabilities will be much improved – and we’ll be able to launch Pootle instances straight into the cloud.

Day 3

On the final day we discussed the future of quality checks in Pootle. This is an area of prime interest for us, as we currently have a painful feedback loop – checks are applied after translation is completed, so any failures need to be fed back to the translators for a second pass. What we really want is to use Pootle’s built-in checks that are applied on the spot; when a translator clicks ‘submit’ they will be told immediately about any problems, so they can fix them while their head is still in the zone.

The quality checking framework is up for big changes, as it is currently quite difficult to release and configure checks (they’re housed in the ‘translate toolkit’, a separate project that is still managed by Translate House but isn’t so easy to release). We had a discussion about potential architectures for how checks could be created, configured and managed, trying to find the balance between ease of development, ease of use and future-proofing.

I proposed a super-modular system to separate constraints for running checks from the code behind the checks from their configuration, but my complexity had to be reined in because, yes, Pootle is generally used by actual human beings – often in community situations where people are not as comfortable with technology as we are. There is such a thing as too much flexibility!

With that, we wound down with some blue-sky thinking about what we’d like to see appear next in Pootle. Branded swag was exchanged (with some delightful personalised messages on the swag bags from Translate House!) and proceedings were called to a close to the satisfaction of all.

Super-swish Pootle swag.

Conclusion

All in all, it was a fascinating and productive two-and-a-half days. We got to hear about how Pootle works in commercial environments similar to our own, and how it is used by non-profit organisations to manage crowd-sourced, community-driven translation.

We got to discuss the future of Pootle, the new features coming soon, and even shape the roadmap a bit by setting out our own use-cases. Best of all, we got to try some of it out, to experience first-hand the new world of improvements that we’ll soon be able to unlock.

It was great to meet some of the Pootle team and other engineers who use Pootle and learn about their processes. Now we’re even more excited to be upgrading our own instance of Pootle, getting to contribute our bits and pieces back to master – and we’re already lining up for PootleConf 2017!


Journey to the centre of Memcached

Posted on by David Oliveira


A few months ago our team noticed that Skippy (Skyscanner’s deep linking engine and one of the components our squad looks after) had started to log an increasing number of NotInCache exceptions. Those exceptions occur when we try to look up a trip/hotel/car hire in the cache and it can’t be found, causing a new pricing request to be issued. This results in the redirect taking far longer than usual, so not a desirable situation at all.

After a few hours of investigation, we found that some of the data we were sending to Elasticache/memcached was actually being rejected with a "SERVER_ERROR Out of memory storing object" error message.

We’ve always relied on memcached as our caching platform, and even though we knew there was a practical certainty we would exceed the cluster capacity, we thought memcached would deal with it, dropping the oldest items with little or no impact. Instead we found that we couldn’t store new data there, causing many recent items to not be found in cache and effectively degrading the customer experience on skyscanner.net.

Initial investigations bore little fruit and we quickly concluded that our answers were hidden somewhere deep in the inner workings of memcached itself, thus our low-level investigations began.

How does memcached actually work?

Slabs

Memcached organises its memory in slabs, predetermined lists for objects of the same size range. This reduces the pain of memory fragmentation, since all the objects in a given slab have a similar size.

By default the memory is split into 42 slabs, with the first slab for items up to 96 bytes (on 64-bit architectures), the second one for items from 97 to 120 bytes, etc… and the last one for items from 753.1K up to 1MB (the maximum item size). The size range of each successive slab is increased by a factor of 1.25 (the default) and rounded up to the next multiple of 8.
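As an illustration, the slab boundaries can be approximated with a few lines of Python. This is a minimal sketch of the rule just described, not memcached’s actual allocator code, and it assumes the default 96-byte starting size of a 64-bit build (the real allocator also caps the final class at the 1MB item limit):

```python
def slab_sizes(smallest=96, factor=1.25, max_item=1024 * 1024):
    """Approximate the default slab chunk sizes: grow by `factor`,
    rounding up to the next multiple of 8, until the 1MB item limit."""
    sizes = []
    size = smallest
    while size < max_item:
        sizes.append(size)
        size = int(size * factor)
        if size % 8:
            size += 8 - size % 8   # round up to the next multiple of 8
    sizes.append(max_item)         # the last slab holds items up to 1MB
    return sizes

print(slab_sizes()[:5])   # [96, 120, 152, 192, 240] - about 42 size classes in total
```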

Pages

A page is a memory area of 1MB which contains as many chunks as will fit. Pages are allocated to slabs to store chunks, each one containing a single item. A slab can allocate as many pages as are available according to the -m parameter (the maximum memory to use for items).

Chunks

A chunk is the minimum allocated space for a single item, i.e. an item with the value “super small” will be assigned to the first slab, which holds items up to 96 bytes. However, every single item on that slab will use 96 bytes, even though its actual size might be smaller. This mechanism obviously wastes some memory, but it also reduces the performance impact of value updates.

Figure 1 – A visual representation of how memcached organises its data.

How do we run out of memory?

Once we understood exactly how memcached organises information we could say that we would run out of memory when all the available pages are allocated to slabs.
However, memcached is designed to evict old/unused items in order to store new ones, so how does that work?

For every single slab, memcached keeps a list of the items in that slab sorted by use (get/set/update) time – the LRU (least recently used) list. So when memory is needed to store an item in a given slab, memcached goes straight to the start of the LRU list of the corresponding slab and tries to remove some items to make space for the new one.

In theory it should be enough to remove a single item to make space for another one, however the item that we want to delete might actually be locked and therefore not able to be deleted – so we try the next one, and so on.

In order to keep a limited response time, memcached only tries to remove the first 5 items of the LRU – after that it simply gives up and answers with “SERVER_ERROR Out of memory storing object”.

Figure 2 – Regardless of how the items are distributed amongst pages, they are correctly sorted by usage time on the LRU list.

So why would items get locked?

Every item operation (get, set, update or remove) requires the item in question to be locked. Yes, even get operations require an item lock. That’s because if you’re getting an item from memcached, you want to prevent it from being modified/removed while it’s being sent through the network, otherwise the item’s data would get corrupted. The same applies to updates, increments, etc… The lock ensures data sanity and operation atomicity. Also some internal housekeeping processes might cause items to get locked but let’s not focus on that for now.

Testing it

In order to prove that we could run out of memory just by locking 5 items, we ran the following test:

  1. Launch a memcached instance
  2. Store 5 items of 1-96 bytes (that’s important because they will map to the first slab)
  3. Request the 5 items we just stored but don’t read() their values (get X\r\n, where X is each of the keys we used in the previous step) – that should keep the 5 items locked
  4. Store thousands of 1-96 byte items in memcached in order to fill up its available memory and force it to try to evict data on slab #1

On the last step of the test we started to see "SERVER_ERROR Out of memory storing object" once the 5 oldest items (the ones at the top of the LRU) were all locked, making it impossible for memcached to free memory for the new items.
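For reference, the test can be scripted with nothing more than raw sockets and the memcached text protocol. The sketch below is illustrative rather than exact – the host, port and key names are placeholders, and whether the five items stay referenced long enough depends on the memcached version and socket buffer sizes:

```python
import socket

HOST, PORT = "127.0.0.1", 11211   # placeholder: a local test instance

def set_item(sock, key, value):
    sock.sendall(("set %s 0 0 %d\r\n%s\r\n" % (key, len(value), value)).encode())
    return sock.recv(1024)        # expect b"STORED\r\n"

main = socket.create_connection((HOST, PORT))

# Steps 1-2: store five small items; they land in the first slab
for i in range(5):
    set_item(main, "locked%d" % i, "x" * 10)

# Step 3: request them on separate connections but never read the replies,
# so the items stay referenced while their responses are pending
holders = []
for i in range(5):
    s = socket.create_connection((HOST, PORT))
    s.sendall(("get locked%d\r\n" % i).encode())   # note: no recv()
    holders.append(s)

# Step 4: flood the first slab with small items until memcached must evict
for i in range(2000000):
    reply = set_item(main, "filler%d" % i, "y" * 10)
    if b"SERVER_ERROR" in reply:
        print("hit OOM after %d sets: %r" % (i, reply))
        break
```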

Measuring the pressure

To better understand the patterns of the data we’re storing in memcached, and to visualise which slabs are under the most pressure to evict data, we can use memcached-tool, a script that ships with the memcached source code. It gives an overview of the distribution of your data amongst slabs, along with a few other quite important metrics. We’ve improved memcached-tool by making it print slab and global memory efficiency. You can check it out here: memcached-tool-ng

The stats for one of the nodes of our cluster

From the screenshot above we can see two groups of slabs under a lot of pressure: slab #2, for items between 97 and 120 bytes, and slabs #13, #14 and #15, for items between 1.2KB and 2.3KB; the slabs around #13-#15 are also under considerable pressure. We can see this from the number of items (~53 million across these four slabs) and the evictions – if we have to evict items, it means we’re already running out of memory. Those slabs also use a considerable amount of space (~46GB altogether).

Other important values are OOM (out of memory), which tells us the number of times we weren’t able to evict data, and Max_age, which is the age of the oldest item in the given slab.

So is it possible to avoid the locks?

Our platform uses memcached quite intensively – we set roughly 2 items on memcached for nearly every single commercial flight combination in the world, multiplied by the number of airlines and agencies selling that flight, plus a few hundred thousand items of other business data. That can go over 500K memcached sets per second per cluster at peak times.

We also run our platform on about a thousand AWS instances, which means almost 100k concurrent connections to memcached retrieving and storing data simultaneously. This combination of circumstances makes it really easy to run into this kind of concurrency problem.

So in order to avoid being unable to store data on memcached due to locked items, we had to tweak a few settings:

  • lru_crawler: Enables a background thread that checks for expired items and removes them – a good thing if you want to keep your slabs clean and avoid evictions;
  • lru_maintainer: Splits the LRUs into 3 sub-LRUs (HOT, WARM and COLD); new items are put in the HOT LRU; items flow from HOT/WARM into COLD; a background thread shuffles items within the sub-LRUs as limits are reached – this avoids always having the same items at the top of the LRU;

Other settings you might want to check:

  • -f (chunk size growth factor): Defines the growth factor between slabs (as mentioned in the Slabs section); it defaults to 1.25. By changing it to a lower value you might end up with more slabs (spreading the eviction pressure), but be aware that you can’t have more than 63 slabs – so, for instance, if you change it to 1.10 you will get 63 small slabs, but the last one will contain all the items between 42KB and 1MB (1MB being the maximum item size), which means every one of those items will take 1MB, causing really bad memory efficiency;
  • -I: Defines the maximum item size. For instance, if you don’t store items larger than 500KB, you might want to tweak this setting along with -f and slab_chunk_max so you can spread your data amongst more slabs;
  • expirezero_does_not_evict: Defines whether an item with expire time=0 is evictable or not. If it is 1 (on), you will get OOM errors as the limits are reached; we didn’t have to tweak this as the default is off;
  • slab_reassign + slab_automove: If you have a long-running memcached instance and your usage pattern (key/item size) has changed over time, you might be incurring many evictions. This happens because once memcached reaches its memory limit it can no longer allocate new pages to the slabs that need them, so the pattern of allocated pages per slab is set forever. These two parameters make memcached take pages from slabs with no evictions when another slab is seen to have the highest eviction count 3 times, 10 seconds apart.
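If you want to confirm which of these tunables are actually active on a running instance, memcached’s 'stats settings' command reports them. A small helper along these lines can dump the relevant values (hedged: the exact stat names vary between memcached versions):

```python
import socket

def stats_settings(host="127.0.0.1", port=11211):
    """Return the 'stats settings' output of a memcached instance as a dict."""
    s = socket.create_connection((host, port))
    s.sendall(b"stats settings\r\n")
    data = b""
    while not data.endswith(b"END\r\n"):
        data += s.recv(4096)
    s.close()
    settings = {}
    for line in data.decode().splitlines():
        parts = line.split(" ", 2)          # lines look like "STAT <name> <value>"
        if len(parts) == 3 and parts[0] == "STAT":
            settings[parts[1]] = parts[2]
    return settings

settings = stats_settings()
for name in ("growth_factor", "item_size_max", "lru_crawler",
             "lru_maintainer_thread", "slab_reassign", "slab_automove"):
    print(name, "=", settings.get(name, "<not reported>"))
```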

Conclusion

Memcached is a widely used distributed caching platform with a relatively small learning curve. It barely requires any configuration before you’re able to use it, and its API is extremely simple and straightforward.

Despite its simplicity, the way it stores data and manages its memory limits has some caveats that can eventually lead to unexpected behaviour. Understanding what your data looks like, how memcached organises it, and how your cluster is performing will bring you one step closer to controlling your whole platform and knowing its limits, avoiding nasty surprises in the future.


From 20 to 2 million releases a year, part 3

Posted on by Alistair Hann

As Skyscanner scaled from an engineering team of 30 with one website and three services to a team of 100 engineers, release frequency halved. This is the story of the turnaround as the company went on to grow to 400 engineers, with over 100 services, releasing at thousands of times the previous rate. This series of three blog posts will share that story of releasing thousands of times more frequently, and the implications of how our goal of 10,000 releases per day informs our tooling, processes, and how we think about writing software.

Part III – From 1000 to 30,000 releases per year, and beyond

(aka Reversing the death spiral and turning it into a flywheel)

At the end of Part II I explained that we had got to the point of doing hundreds of deployments every month, thanks to moving to continuous delivery. Services were deployed at their own heartbeat and we had moved to a devops model.

The question then became: how fast should we be going? There is an annual publication from Puppet Labs called ‘The State of Devops’ – it is a summary of the learnings from a survey of more than 4,600 technical professionals from organisations of all sizes around the world. They have identified a link between ‘high performing’ software organisations and higher market capitalisation growth in publicly listed companies, and improved profitability and market share in private companies. Compared to other software organisations, the high performing software organisations are reporting:

  • 200x more frequent deployments
  • 24x faster recovery from failures
  • 3x lower change failure rate
  • 2,555x shorter lead times

The 2015 report offers a further insight: in the high performing group of companies, as the number of developers in the organisation increases, the number of deployments per developer per day increases exponentially, versus staying flat for medium performers and decreasing for low performers. We have seen the same shift in our own company as we made the journey to continuous delivery – initially adding more engineers reduced our frequency, then we managed to keep it flat, but to make it grow exponentially we needed to work out what is actually happening.

So why does this happen? In Part I of this series, I explained how our release frequency started plummeting as a negative cycle had started – every release made the next release slower. Adding more people to the organisation just compounded those effects. In the case of organisations that are releasing more often, I can see the reverse happening: a flywheel where changes keep reinforcing each other positively. I have drawn this out here:

As smaller changes are made, there is less risk and reverting is easier – hence the faster recovery rate and reduced incidents due to change. The smaller changes are also better for our users, as changes happen more gradually and they get new features earlier. From a process perspective – higher release frequency forces greater automation, and better instrumentation. This leads to fewer errors and greater confidence in the software. This all means happier developers and that means more code gets shipped. The 2016 ‘State of Devops’ report even measured the impact on the team – developers in the high performing organisations were more than twice as likely to recommend their organisation to a friend as a great place to work.

How fast should we go?

Given there is this positive flywheel, how fast should we be looking to make it turn? The number we came up with was 10,000 releases per day. The rationale is that we are moving to many micro services. Today there are around 100; in the future there will be 500-1000 between primitives and orchestration services. In that world, a typical engineer’s code change might touch two to four services (let’s call it an average of three). So if an engineer makes one change per hour, each touching three services, that is 24 service releases per developer over an eight-hour day, and with a team of 400, that’s 9,600 releases per day.

Now, I don’t actually mind whether Skyscanner hits 10,000 or 5,000 or 20,000 releases per day. The great thing about a target like this is it forces us to think differently, and brings various decisions into focus. For example – we are moving from colo to AWS, and there was some discussion about what our integration environments were going to look like when we move to AWS (they are a total pain to maintain, but some teams are highly dependent on them as part of the Continuous Delivery pattern we originally advocated). If there are 10,000 releases going out a day, the idea of a ‘stable’ integration environment to test in ceases to make sense. So there will be no integration environment in AWS, and teams will need to integrate against the production versions of the APIs they use, or mocks.

Another example is that if an engineer is deploying on every change, there cannot be any manual steps between that commit and production, it just isn’t efficient – so all the manual checks in the continuous delivery pipeline shown in Part II have to disappear.

One pioneering team on this journey was working in our Joint Venture with Yahoo! in Japan, and they built two services to help them ship more quickly: Stevedore, which loaded and unloaded containers from production (for those who haven’t seen Season Two of HBO’s The Wire, stevedores load and unload ships, and ships have containers…), and Cyan, for Blue/Green testing of new deployments (cyan being the result of adding blue and green light). The new system dictated a particular workflow:

The new deployment philosophy of Skyscanner

The impact of this system was very positive and it has now been generalised into a system that is used across Skyscanner – we internally call that system ‘Slingshot’. The following diagram shows the original continuous deployment system at the top, and the ‘Slingshot’ continuous delivery system below:

Skyscanner "Slingshot" release process

The first thing to notice is that there are no manual steps between a commit and the software running across production. Every pull request automatically triggers a Drone CI build that runs tests within a Docker container; after code review, the tests are run again and that Docker image is the artefact that will be deployed into production. A hook in GitLab means the successful pull request triggers the Slingshot deployment:

  • The image is retrieved and deployed to a new cluster
  • Automatically, a pre-defined fraction of production traffic goes to the new cluster
  • Rules applied to production metrics determine success or failure
  • There is automatic roll-out to all of production or automatic roll-back
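To make the flow above concrete, here is a hypothetical sketch of the rollout loop. None of these names come from Slingshot itself – the metric names, thresholds and the infra/get_metric helpers are invented purely to illustrate the shape of an automated canary decision:

```python
import time

def canary_healthy(get_metric, baseline, canary,
                   max_error_ratio=1.2, max_latency_ratio=1.3):
    """Compare the canary cluster against the current production baseline."""
    return (get_metric(canary, "error_rate")
            <= get_metric(baseline, "error_rate") * max_error_ratio
            and get_metric(canary, "p95_latency_ms")
            <= get_metric(baseline, "p95_latency_ms") * max_latency_ratio)

def slingshot_style_deploy(image, fraction, infra, get_metric, soak_seconds=600):
    canary = infra.launch_cluster(image)       # new cluster from the Docker image
    infra.route_traffic(canary, fraction)      # pre-defined slice of production traffic
    time.sleep(soak_seconds)                   # let production metrics accumulate
    if canary_healthy(get_metric, infra.baseline_cluster(), canary):
        infra.route_traffic(canary, 1.0)       # automatic roll-out to all of production
        infra.retire(infra.baseline_cluster())
    else:
        infra.route_traffic(canary, 0.0)       # automatic roll-back
        infra.retire(canary)
```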

This way of working forces more good behaviour: there is no opportunity for any sneaky manual testing as part of deployment – it simply cannot happen – and teams have to better understand and instrument operational metrics, which improves overall availability. We can only provide this free tooling if people use the default toolsets and patterns, so we get greater convergence and that also makes us more efficient (see http://codevoyagers.com/2016/08/16/not-all-the-technologies/). You can see those forces on the flywheel beginning to turn and make us accelerate.

Where have we got to?

It turns out to be surprisingly hard to find out how many releases we are doing – with many systems and pipelines, there isn’t a single place to go and find that number. I recently took a poll of teams, pulled some data from Slingshot, and found we did 2,400 releases in the last month. While that is only around 120 releases per working day, it is an order of magnitude more than when we first implemented continuous delivery, and three orders of magnitude better than at the peak of our challenges. In order to show this progress on a graph, I’ve had to use a logarithmic scale:

 

Skyscanner releases over time

The dotted line extrapolates the trend and implies we may get to 10,000 releases per day at the end of next year. While we have gone from 1,400 to 2,500 releases per month over the last six months, there are some impediments to getting to that rate of releasing software. In the teams that have adopted continuous delivery, where every commit is released to production, a team of 15 engineers is releasing 15-20 times per day, so one or two releases per engineer per day. The only way to ship more often is to commit in smaller chunks – so we need to change that part of how we work. Other teams are still working with the ‘legacy’ continuous delivery model, and until they are able to migrate to the new standard toolset they cannot benefit from Slingshot. In the meantime, the release frequency of these teams is typically daily, or even weekly in a few cases.

There is also a wider consideration for the whole business – as we ship changes more frequently, we need to be able to make faster decisions. In a one week ‘sprint’ for a Scrum team, there will be multiple experiments and hundreds of deployments, each generating lots of data on which to base product and business decisions. Thus the product owner and the rest of the organisation have to be ready to make faster decisions to maximise the benefits of faster feedback.

Whether Skyscanner reaches 10,000 releases per day at the end of 2017 or not, I am convinced the trend will continue to be more frequent releases, happier customers, and happier developers.


Building products for large scale experimentation

Posted on by Dave Pier

At Skyscanner we have been running hundreds of AB tests to learn how to improve the site for our users. In order to build our experiments faster we have developed an in-house system to separate code changes from experiment variants and, in so doing, provide a massive increase in flexibility. In essence we have turned our whole site into a set of Lego blocks that can be combined in an almost infinite number of combinations that anyone in the company can control from anywhere in the world.

If we step back a few months, we would build our AB tests in the standard fashion. We would use our experiment platform, Dr Jekyll, to assign users to a particular variant of an experiment. Each variant of the experiment would then be directly linked to a section of code. If a given user was in the control group they experienced the standard site; if they were in one of the variants they received an altered site, and we could track the difference in behaviour. While this works well for areas of investigation that are well bounded, it is quite inflexible for new areas where we will have multiple rounds of iteration, with each round building on the learning of the previous round.

In order to allow AB experimentation to scale, as well as maintain our lean/agile culture, we have built an extra layer of flexibility into Dr Jekyll. We can now tie our code segments to configurables. A config can be thought of as a link between the main body of the code and the parcel of data that it contains. This parcel might be a whole module of code needed in an experiment, or it might simply be a boolean value or a string of text. We initially built these configs to allow us to change strings and values throughout the product for different markets and different situations. However, tying code segments to configs and tying multiple configs to a single experiment variant allows for an order of magnitude more flexibility in how we build for experimentation.

multiple small independent code segments to be combined into a single experiment variant

In this diagram we can see that configs allow multiple small independent code segments to be combined into a single experiment variant.
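As a rough illustration (the names below are invented for this post, not Dr Jekyll’s actual API), a variant is then nothing more than a named set of config overrides, and the product code only ever reads configs:

```python
# Default configs: each one is an independent switch or parcel of data
# consumed by a single code segment.
DEFAULT_CONFIGS = {
    "collapse_itinerary": False,
    "show_star_ratings": False,
    "itinerary_on_top": False,
    "expand_provider_list": False,
}

# A variant is just a set of config overrides; new variants can be composed
# from existing code segments without any new development.
VARIANTS = {
    "control": {},
    "info_first": {"collapse_itinerary": True, "itinerary_on_top": True},
    "full_redesign": {"collapse_itinerary": True, "itinerary_on_top": True,
                      "show_star_ratings": True, "expand_provider_list": True},
}

def configs_for(variant_name):
    """Resolve the config values a user assigned to a given variant should see."""
    resolved = dict(DEFAULT_CONFIGS)
    resolved.update(VARIANTS[variant_name])
    return resolved

# The rendering code never checks the experiment, only the configs, e.g.:
#   if configs_for(assigned_variant)["show_star_ratings"]: render_star_ratings()
```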

If we now modularise our code such that each change we might want to make in an experiment is independent from any other, then we create the Lego blocks we need to build experiments. Let’s look at an example of where this becomes useful. We wanted to look at redesigning our booking panel from a price-centric layout to one that prioritised information and alternative booking options. There were a number of changes that we felt we needed to make in order to achieve this:

  1. Collapse the itinerary information
  2. Allow the provider list to show our new star ratings
  3. Move the itinerary information to the top of the panel
  4. Expand the previously closed provider list

In the traditional approach to building AB experiments it is tempting to build the single preferred option and compare only one variant with control. If it improves metrics then great: give yourself a pat on the back and ship it. If metrics go down, then what happens? There is no way to know which of the changes had the effect. Do you start stripping back the changes to one controlled change at a time, or make more changes until something works? In the new system we can build each one of these changes as a separate config and combine them in a single experiment, with a control for each of the changes (taking the appropriate statistical considerations for multiple tests). In this particular example we had wanted to check 4 variations, but we could have tested 12 given the combinations possible. As it turned out, when we saw the final version in the browser we decided to test a variant that we had not intended to build but which was possible to create with no additional development effort thanks to the available combinations – and this was the one that was eventually shipped to production.

Variants from available combinations.

Since implementing the config layer we have found numerous use cases. MVP experiments that are inherently risky can be de-risked by starting broad but shallow, with additional functionality then built in further layers of configs as the data from each round of experimentation allows us to refine our ideas. We can also use configs as feature flags by turning features on but disconnecting them from the underlying experiment. This allows market-by-market flexibility that can be controlled independently from the core site.

An additional benefit of using a modular config approach has been that this abstracts the complexity of experiment design from the development of features. Developers can now build and test modules independently without needing to worry about which 5 changes need to hang together for a given variant. If we want to extend the experiment in the future then we simply add another config until we have the feature creating the user benefit we had hoped for in the first place.

Similarities with multivariate testing

This approach is similar to multivariate testing, but deliberately limited to specific combinations of code segments/changes. Multivariate testing runs ALL combinations of changes together in order to determine which combination produces the optimal effect. An example would be changing a button’s placement, string and colour: if there are 3 versions each of placement, string and colour, that is 3 x 3 x 3 = 27 combinations to test. The system we are describing here allows us to run a multivariate test if we wish, BUT it also allows the more modular AB testing described above. The primary purpose is not to throw every possible combination at the wall and see what sticks, but rather to reduce the time and cost between learning from one experiment and implementing the next iteration with a directed hypothesis.


 


From 20 to 2 million releases a year, part 2

Posted on by Alistair Hann

As Skyscanner scaled from an engineering team of 30 with one website and three services to a team of 100 engineers, release frequency halved. This is the story of the turnaround as the company went on to grow to 400 engineers, with over 100 services, releasing at thousands of times the previous rate. This series of three blog posts will share that story of releasing thousands of times more frequently, and the implications of how our goal of 10,000 releases per day informs our tooling, processes, and how we think about writing software.

Part II – From 10 to 1000 releases per year – Microservices and Continuous Delivery

In Part I of this series, I explained that our release frequency was plummeting because there was a spiralling negative cycle, where every release was making the next release take longer. Our team was growing but the value we were delivering to our customers was grinding to a halt, and we had to change things.

We made three major changes, which were inextricably linked:

Moving to micro services

At the peak of the crisis, the Skyscanner product (excluding native apps) was two deployable units that were released in lock-step. One was ‘The Website’ and the other was the data services that powered it and the mobile apps. The reason I refer to the data services as one unit is that while there were strictly speaking three services, two were tightly coupled and had to run on the same server. A huge effort went into splitting the code bases into much smaller services – effectively forking the code, and progressively cutting out the dead wood that wasn’t needed for each of the sub-services. These micro services could be deployed independently. In the first move there were around fifty such services, and now there are around one hundred.

Teams Owning Services

We entirely changed the way we organised product engineering. Previously we had ‘development’ and ‘operations’ teams. Instead we adopted a structure partly inspired by Spotify’s 2012 paper describing Squads and Tribes. This is a ‘you write it you run it’ environment with a single team taking on full responsibility for development, deployment and operation of the services it owns. The only remaining ‘ops’ functions moved to teams who built tooling to enable other teams to deliver faster, a very small infrastructure team responsible for networks, CDN, load balancers, virtualization etc., and a very small team providing ‘follow the sun’ front line support (e.g. for data centre outages) and incident management.

We only considered a team to have completed the transition when they could show they had completed an extensive checklist. The list included that the team:

  • Can independently release services into production
  • Understand and monitor metrics and KPIs
  • Can run an A/B test in production
  • Set their own objectives and key results

Continuous Delivery

Independent deployment of services was driven by a move to continuous delivery. The diagram below shows the ‘best practice’ pattern that all teams were encouraged to adopt.

There was a continuous integration environment with very frequent change, validating the software after every commit. If that was successful, after a human ‘go ahead’ the code would make it to integration review – an environment with less frequent change, where the service integration would be validated by running a full integration test suite including all service dependencies and a full acceptance test suite. Again, after a manual gate, code could move to pre-production. This was relatively stable and contained probable release candidates where integration and acceptance tests would be run.

A manual gate between pre-production and production allowed for exploratory, performance and load testing in a production-like environment. The production release would then be to a canary which should be a stable mixture of production and the validated release candidates. Smoke testing was combined with monitoring service KPIs and business metrics to ensure the release was good. After a final manual check, the package could be released to all nodes in production where there is ongoing monitoring and alerting of the service KPIs and business metrics.

Continuous delivery in action, with metrics around business KPIs

Making the changes to microservices, teams owning services, and continuous delivery took several months, with different teams completing it at different times – depending on how easy it was to decouple their service, the level of additional automated testing that needed to be added, etc. One question is how this change was sold to the business – there were pioneers who deserve a lot of credit for driving this change, but there was also a recognition that things couldn’t just carry on as they were. The team had doubled in size and we were going slower. Thus the entire executive team and board were behind the change – it also helped to be able to refer to examples like Kevin Scott halting deployments at LinkedIn for two months in order to move to continuous deployment.

Many challenges emerged during this switch – moving to a ‘you write it you run it’ devops model meant that teams had to learn about the discipline of deploying and operating production software, when they had previously been insulated from it. There was a steep learning curve in terms of which metrics were critical to gate deployment and trigger alerting of the individual components, what is an appropriate time to release a change, etc.

Another challenge was that services were subject to a lot more change going on around them. There was a case of a small data change in one service triggering the entire website to go down, because of a cascade of errors triggered by the invalid data and subsequent abnormal responses. The left side of the figure below shows the old model at Skyscanner – Component A is deployed into production as part of a single release with component B – it can be fully tested against the version of Component B that it will be running alongside in production. If component B is unavailable, A will usually also be unavailable as they were often deployed on the same hardware or were otherwise tightly coupled. In this example we didn’t need to think too much about what happens to A if B changed or stops working – any issues would be picked up in the integration environment and regression (hopefully).

After moving to independently deployed micro services, the world was much closer to the right hand side of the figure. We have a situation where Component A has been deployed and isn’t changing, and Component B is being deployed independently. Now, a single release of Component A has to handle new releases of Component B, including unproven canaries, Component B being slow, and many other possible types of change.

Types of A/B testing allowed from components

As a result, teams had to start coding a lot more defensively: hardening how they responded in a world where other services suffered from latency, invalid responses, unavailability and so on, and testing for those scenarios. The end result was more resilient software.
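A minimal sketch of what that defensive style looks like in practice – the endpoint, field names and thresholds here are hypothetical, not code from either component:

```python
import requests

def get_quotes(session, trip_id, fallback=None, timeout=0.5):
    """Call a downstream component without letting its failures cascade."""
    try:
        resp = session.get(
            "https://component-b.internal/quotes/%s" % trip_id,
            timeout=timeout,                    # never block on a slow dependency
        )
        resp.raise_for_status()
        payload = resp.json()
        if not isinstance(payload.get("quotes"), list):
            raise ValueError("unexpected response shape")
        return payload["quotes"]
    except (requests.RequestException, ValueError):
        # Component B is slow, down, or returning abnormal data:
        # degrade gracefully instead of taking Component A down with it.
        return fallback if fallback is not None else []
```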

Slowly we learned the operational discipline, improved instrumentation, and increased the resilience of individual services. Overall availability dropped but then recovered. Teams started delivering software again and the very last release train in February 2014 was Hammer Bro, the Super Mario character (we had run out of Muppets and moved on to video game characters).

So where had this all got us? At the end of this change we were releasing hundreds of times per month. The confidence of teams was growing, as was confidence in the teams. In Part I, I talked about the annual feature freeze in December (because we were uncomfortable about change while people were away, and about the ability to handle the new traffic surge in January); we no longer force a change freeze and haven’t done so for the past two years. What we do say is that squads must have support available for two days following any deploy (we are in the process of adapting our support model – something for a later post).

As teams’ confidence grew, some changes were made to the default continuous delivery pattern – some squads would deploy to the integration review and pre-production environments simultaneously, and a minority moved on to automatically rolling out canary releases across production if the metrics were healthy. These were signposts of the direction we would ultimately end up moving in. In Part III I will describe how continuous delivery increased our release frequency by another order of magnitude.


From 20 to 2 million releases a year, part 1

Posted on by Alistair Hann

As Skyscanner scaled from an engineering team of 30 with one website and three services to a team of 100 engineers, release frequency halved. This is the story of the turnaround as the company went on to grow to 400 engineers, with over 100 services, releasing at thousands of times the previous rate. This series of three blog posts will share that story of releasing thousands of times more frequently, and the implications of how our goal of 10,000 releases per day informs our tooling, processes, and how we think about writing software.

Part I – From 20 to 10 (aka ‘what went horribly wrong’)

I joined Skyscanner at the end of 2010 after it acquired the start-up I founded, Zoombu. Zoombu had a team of five and we were deploying our product at the end of every week-long sprint. Skyscanner had an engineering team of thirty and the product was going out every two weeks, like clockwork. That seemed reasonable to me – there was a lot more surface area, and it was far better than a household name Online Travel Agent that I knew was shipping every six months.

These blog posts are the story of what happened in the subsequent six years. To help put some context around that, the following graph shows the number of UMVs Skyscanner has seen from December 2004 to February 2016. One interesting thing to note is the jumps in traffic after each December – in the travel industry things get quiet towards the end of the year, but as soon as Christmas and other festivities are over, people start planning and booking their travel, and that is when we experience the biggest jump in load on our services (this isn’t the case in all countries, but it does characterise much of Europe and the Americas). Something else to note is that as a team we were growing at a similar rate, and that included the engineering team.

Skyscanner unique monthly visitors (UMVs), December 2004 to February 2016

The second graph I want to share shows how many releases we were doing per month for the two years after October 2010 – you can see that there was a notable decline, with releases dropping to monthly and rather less than that. In practice, the graph doesn’t tell the true horror of the situation – we entered a phase of alternating releases between low risk releases and high risk releases. The idea was that lots of low risk changes could be bundled into one release, and riskier major feature releases would be scheduled into the other. Thus in reality, a minor change could take eight to twelve weeks to go out, if it emerged at the wrong time in the cycle.

Release frequency at Skyscanner over time, 2010-2013

It’s also worth calling out the big dips around December 2010, December 2011, and December 2012. Those were the ‘feature freezes’ we would start weeks before Christmas – the anticipated traffic surge meant we were fearful to release changes until we knew what the product could handle, and we also wanted a quiet period for supporting the site while people were enjoying their Christmas break.

Release frequency at Skyscanner over time, 2011-2013

So, why was our release frequency tanking? The clue is in the third graph – the number of days each release took. It sounds obvious to say the release frequency dropped because it took longer to release – the key thing to note is that as soon as we finished one release we would start the next one. We couldn’t get any faster, because it was taking so long to get from the release train being ‘closed’ to further changes, to the new release being deployed and validated in production. At which point ‘int’ would reopen and the whole process would start again. To quantify the situation, at peak pain:

  • 59 Changes Per Release
  • 9 Bugs found in Pre-Production Regression
  • 21 Days end-to-end
  • 125 Days of human effort

There was a lot of manual regression testing; if a bug was found during regression, a patch had to be made and the regression test repeated. That regression was painful: people didn’t want to do it, only a small number of people knew how to fully test each component, and it was doubly painful if it had to be repeated.

We were operating canary releases – a new release would be deployed to a subset of all production traffic called ‘Staging’, and it would only be rolled out to serve all production traffic if the KPIs were healthy (in the email snippet below, there were concerns about the metrics of a release that had been on Staging for five days). The reason it took so long to validate that the new version was performing correctly was latency in metric collection, the volume of change, a behaviour where new deployments needed to ‘warm up’, and variability in the KPIs between production nodes, meaning there wasn’t a clear benchmark for a healthy value of the KPIs.

An emergency meeting invitation at release time

An invitation for an emergency meeting in February 2012. A KPI (‘G1’) was down against staging, and people across the company needed to decide what to do.

Despite the bugs found in regression, nearly every second release required a re-release because we found an issue on production which had been missed in regression and was only identified after the successful rollout across all nodes. If the issue couldn’t wait until the next release, the answer was to deploy a ‘hot fix’ – this risky and painful procedure meant carrying out an accelerated version of the release process, only doing regression on a subset of the product. It wasn’t a decision taken lightly because of the risks, pain, and delaying the next scheduled release (I hid the hot fixes from the release frequency graph above, as they confuse the trend).

Muppet cupcakes - photo "Cupcake Brilliance" by Bret Jorhan, CC BY 2.0 Licence

An early warning sign may have been that we had names for each of our releases. If a website release is significant enough to have a name, there’s probably something wrong. A bigger warning still, may have been that at one stage we were naming our releases after Muppets. Acknowledgement: “Cupcake Brilliance” by Bret Jorhan is licensed under CC BY 2.0.

A model I use to explain how we ended up in the painful situation is that there was a negative cycle going on – problems were feeding each other, thus a slow degradation rapidly got a lot worse as the cycle gained momentum. I have drawn a simple model of this below:

A diagram showing how some problems actually fed each other

With a negative spiral feeding itself, we were releasing code more and more slowly, even though the size of our engineering team was rapidly increasing. We reached a point where we had to stop and completely change how we organised ourselves and how we deployed software. In Part II, I discuss how we changed our organisation, services, and release processes in order to get a 100x improvement in release frequency.

Continue reading Part II of the series, “From 10 to 1000 releases per year – Microservices and Continuous Delivery”.



Using ElasticSearch as a data service for hotel offers

Posted on by Pau Freixes

ElasticSearch is a scalable and highly available distributed search engine built upon Lucene. The Hotels Backend squad has bet on it as one of the key pieces of its new backend architecture, named Bellboy. In this architecture ElasticSearch is used as a data service to store hotel offers temporarily, making them searchable for the user.

Each time a user looks for hotels and their offers in a specific city, those offers are indexed into an ElasticSearch cluster. From then on, all further user requests to search, filter and so on end up as a query to this cluster.

Having ElasticSearch as a data service allows Bellboy to:

  • Scale horizontally, decoupled from the other pieces of the architecture.
  • Index any field to make it searchable.
  • Support almost any type of field, from strings, integers and lists to geo points.
  • Use its faceting system to count hotels and offers in a powerful way.

The following sections explain how ElasticSearch is used by Bellboy to index hotel offers and make them searchable for the user.

Indexing offers to ElasticSearch

Each time a new user searches for hotel offers in a specific city, a group of services retrieves the set of hotels in that city and the offers provided by each partner. As a result of this process, Bellboy indexes these hotels and offers in a denormalised way, aggregating the hotel fields into each offer so that the result is a single type of document. This denormalisation is how the joins between hotels and offers are materialised in a non-relational database such as ElasticSearch.

An offer is a JSON document composed of a set of regular fields coming from the hotel and the offer, plus a set of internal fields, among them the search_id field. This field is a unique ID that identifies the group of documents related to a certain search, and it acts as a logical partitioner. An ElasticSearch index is therefore composed of many logical partitions, each one belonging to a specific search.

A new user search triggers many concurrent tasks, each one related to a provider. The offers come packed in batches of N documents that are indexed into ElasticSearch. Each batch operation is routed with the value of the search_id field. This routing makes ElasticSearch store all documents belonging to the same search in the same shard, so that the later query made by the user involves only one of the shards.
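A sketch of what such a batch operation might look like with the official Python client (the field names, index naming and pre-6.x document type are assumptions made for illustration, not Bellboy’s actual code):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])   # placeholder endpoint

def index_offer_batch(index_name, search_id, offers):
    """Bulk-index one provider batch, routed by search_id so that every
    document of the same search lands on the same shard."""
    actions = (
        {
            "_index": index_name,
            "_type": "offer",                 # document type (pre-6.x clusters)
            "_routing": search_id,            # route the whole search to one shard
            "_source": dict(offer, search_id=search_id),
        }
        for offer in offers
    )
    helpers.bulk(es, actions)

# offers would be the denormalised documents (hotel fields + offer fields):
# index_offer_batch("offers-12345", "a1b2c3", batch_from_provider)
```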

A search_id partition is composed of several hundred documents for small entities up to several thousand for large entities, and different partitions never intersect. Instead of using multiple shards for a query that seeks a handful of documents among millions, Bellboy – and more specifically ElasticSearch – handles queries that fit in a single shard and its related node resources such as CPU, memory, and so on.

Having a single shard queried gives Bellboy exact results for the term aggregations. These term aggregations are used several times within the query to ElasticSearch and are especially crucial to perform the hotel normalisation.

However, the use of document routing can produce hotspots: an unbalanced document distribution that can leave some shards much heavier than others.

The following picture shows the distribution of documents of each shard placed in a Bellboy ElasticSearch cluster using as many shards as nodes.

Document distribution per shard, using as many shards as nodes

As can be seen in the previous picture the distribution is not perfect: one of the shards has roughly 30% fewer documents than the other shards. To mitigate this issue Bellboy doubles the number of shards per node. The following picture shows the distribution using that configuration.

Document distribution per shard, using twice as many shards as nodes

In this case, the shards have an almost equal number of documents. Therefore, the function used to route by search_id behaves close to a uniform function, avoiding hotspots.

Adding more nodes, incrementing the offers per second

The number of hotel offers indexed per second may vary for different reasons: the traffic expected, the number of offers handled by a search, etc. Bellboy takes this into account when configuring the number of nodes in the ElasticSearch cluster to meet the throughput requirements.

In this scenario ElasticSearch allows Bellboy to increase the throughput by adding more nodes to the cluster without modifying the general architecture.

The following picture shows the maximum throughput, in offers per second, reached by an ElasticSearch cluster composed of two, three and four nodes.

Maximum write throughput (offers per second) for two-, three- and four-node clusters

As can be seen in the graphic, the number of offers per second grows almost linearly as the number of nodes increases. All performance tests were executed using c3.2xlarge instance types with ephemeral SSD disks. The instances were distributed across different availability zones. Each hotel offer was a JSON document weighing around 1 kilobyte, with roughly 30 fields.

Evicting offers from ElasticSearch

Hotel offers have a limited lifespan of 15 minutes. Once this time has been reached they have to be refreshed by the pricing service. This minimises the chances of publishing outdated prices, giving the user the latest offers gathered from the providers once the 15 minutes have expired.

Delete and update operations are very inefficient in ElasticSearch; removing the outdated hotel offers allocated in an index through delete operations would directly impact the resources of the machine.

Bellboy uses a technique called index per time frame where, instead of removing documents one by one, it erases a set of documents at the same time by removing the whole index. The cost of this operation in terms of resource usage is almost negligible. Every time there is a new user request, it is routed to an index that will last 15 minutes.

The following picture shows the indices placed in time order with three different searches. The searches were routed to the proper index when they were created. Index n-2 no longer exists, and neither do its hotel offers.

Time-based indices, with searches routed to the index that was current when they were created

Because Bellboy has to make sure that the set of hotel offers belonging to a search stays alive for 15 minutes, the indices are kept for 30 minutes. Until an index is removed automatically, the outdated searches in that index are still alive but will not be used.
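A rough sketch of the idea (the index naming and the helpers below are illustrative, not Bellboy’s actual code): new searches are routed to the index for the current 15-minute window, and whole indices older than two windows are dropped in one cheap operation:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])   # placeholder endpoint
FRAME_SECONDS = 15 * 60

def current_index(now=None):
    """Name of the index for the current 15-minute time frame."""
    frame = int((now or time.time()) // FRAME_SECONDS)
    return "offers-%d" % frame

def drop_expired_indices(now=None):
    """Keep the current frame and the previous one (30 minutes in total),
    so any search started in the last 15 minutes stays fully available."""
    frame = int((now or time.time()) // FRAME_SECONDS)
    keep = {"offers-%d" % frame, "offers-%d" % (frame - 1)}
    for name in es.indices.get(index="offers-*"):
        if name not in keep:
            es.indices.delete(index=name)   # one delete evicts every offer in it
```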

From near real time to real time

When a request is created it has no hotel offers indexed yet. Before fetching the prices, Bellboy stores the minimum number of offers expected across all partners and hotels. This information is used to check whether the offers expected for this request have been indexed.

Due to the near real-time nature of ElasticSearch, the minimum number of offers stored by Bellboy at the beginning has to be compared against the result of a query to ElasticSearch. When these values match, the indexing process has finished and the user retrieves a consistent output.

Instead of running a separate query continuously, Bellboy adds a specific aggregation to the user query. This ad-hoc aggregation stage counts the offers, carries out the matching between the values and completes the request.

However, the number of offers per partner for each hotel can turn out to be less than expected, due to accommodation restrictions or other issues. When this happens, the pricing service in charge of fetching the offers decreases the expected hotel offers value.

Retrieving the offers and their aggregations

ElasticSearch implements a query interface designed for filtering and aggregation over the inverted index exposed by Lucene.

A user query becomes an ElasticSearch query composed of a filter stage and an aggregation stage. The filter stage prunes the set of hotel offers using the search_id field, so the aggregation stage only receives the offers belonging to a certain search. The aggregation stage is composed of many independent sub-aggregation stages in charge of retrieving the list of hotel offers and counting the hotels by their characteristics.

ElasticSearch allows Bellboy to build a complex and deep aggregation tree. Each aggregation stage can be seen as a map function and its nested aggregation stages as reduce functions, and this can continue indefinitely.
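An illustrative, much simplified version of that query shape (the field names and bucket sizes are assumptions, and the real Bellboy query uses far more sub-aggregations):

```python
query = {
    "size": 0,
    "query": {
        "bool": {
            # prune everything down to one logical partition (one search)
            "filter": [{"term": {"search_id": "a1b2c3"}}]
        }
    },
    "aggs": {
        "by_hotel": {
            "terms": {"field": "hotel_id", "size": 500},     # one bucket per hotel
            "aggs": {
                "cheapest": {"min": {"field": "price"}},     # reduce each hotel's offers
                "by_stars": {"terms": {"field": "stars"}},   # facet counts for the UI
            },
        },
        # completeness check: compare against the expected number of offers
        "offers_indexed": {"value_count": {"field": "offer_id"}},
    },
}
# es.search(index="offers-12345", routing="a1b2c3", body=query)
```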

Query performance, keeping the latency predictable

The full query used by Bellboy for each user request is a long JSON document that can use almost 30 different aggregation stages. Despite the query size, ElasticSearch is able to execute it in less than 50 ms on any modern hardware. It is worth mentioning that the logical partition identified by the search_id field can be composed of anywhere from a few hundred documents to several thousand for large entities.

The write throughput requirements and the number of reads handled at any one time are dependent variables that grow proportionally: for a write throughput of 10,000 writes per second, the expected number of requests per second might be 10. With twice that traffic, the expected values become 20,000 writes per second and 20 requests per second.

The next table shows the read latency behavior under different write loads, from 0% to 75%, with the reads per second increasing proportionally.

write_vs_read_throughput

With 25% of the resources used by the indexing process, the read latency at all percentiles stays below 100 ms. When the resource usage is doubled, the average latencies increase proportionally. Even though the 50th, 75th and 90th percentiles behave acceptably during the tests, the 99th percentile spikes to 1.5 seconds.

To keep the read latency below 0.5 seconds, Bellboy sets the bulk threads setting to half of the available CPUs. This throttles the CPU resources that each ElasticSearch node dedicates to indexing offers. At the same time, the bulk queue size is configured with a number large enough to hold the index operations that are waiting for a CPU time slice.
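As a sketch of the idea (the setting names follow the ElasticSearch 2.x threadpool.bulk.* convention, and the queue size of 1000 is an illustrative value rather than Bellboy's actual figure):

    import multiprocessing

    cpus = multiprocessing.cpu_count()

    settings = {
        "threadpool.bulk.size": max(1, cpus // 2),  # indexing limited to half the CPUs
        "threadpool.bulk.queue_size": 1000,         # room for bulk operations awaiting a CPU slice
    }

    # Rendered as lines for elasticsearch.yml:
    for key, value in settings.items():
        print("{}: {}".format(key, value))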

Placing the ElasticSearch Cluster at AWS

Bellboy places the ElasticSearch Cluster into an Auto Scaling group with a constant number of machines, and with a Load Balancer in front of it as the entry point. The following image shows this configuration:

aws

The number of nodes in an ElasticSearch cluster deployed by Bellboy is always odd, with the nodes spread across different availability zones. This configuration increases the service availability and helps ElasticSearch reach consensus if one of the nodes goes down.

The Auto Scaling group will keep the number of nodes constant. If one node is down, it will be replaced by a new one that will be launched with the proper ElasticSearch configuration to become a member of the cluster.

While the cluster is one node short, its state becomes Yellow – ElasticSearch has allocated all of the primary shards, but some or all of the replicas have not been allocated – yet ElasticSearch can still operate, and Bellboy keeps relying on the cluster. The replicas of the primary shards that were placed on the missing node are promoted. When the Auto Scaling group replaces the node with a new one, the shards are rebalanced automatically, placing some of the replicas and primaries on the new node.

Even though this operation can be IO- and CPU-expensive, the shards belonging to Bellboy's indices are only marginally affected. The following graph shows the spike produced by a rebalancing process triggered when a new node was added to the Auto Scaling group to replace a broken one.

cost_rebalancing_shards_node_down

The ElasticSearch cluster remains reliable until its state becomes Red, meaning that a primary shard cannot be found. When this happens, the cluster is no longer used and the traffic is redirected to another Bellboy stack.

Conclusion

ElasticSearch is a product that unlocks the full potential of Lucene by spreading it across several nodes. This allows for horizontal scaling and higher document throughput behind a JSON API. However, it is important to understand the ElasticSearch fundamentals and how they fit your product.


Video: Leading Distributed Teams at Scale

Posted on by David Low

Earlier this summer, Skyscanner's SVP Engineering Bryan Dove gave a talk at Rocket Tech Summit in Berlin about the experiences of his current role, lessons from his career to date and how to meet the challenges faced by technical leaders.

The presentation (27 minutes) is a must-watch for anyone interested in leading technology and people at a high-growth company like Skyscanner, with its many competing priorities, an ever-larger workforce and technology advancing as it always does.

As always, we welcome your comments; please get in touch at @CodeVoyagers on Twitter and let us know what you think.


Does research = usability research? I don’t think so.

Posted on by Laszlo Priskin

 

I often find myself in discussions in which people ask "what is the role of 'design / user research'?" or "how can 'research' support the product development process?". It also often happens that 'design / user research' is mentioned as a synonym for 'usability research'. You can find amazingly well-crafted '101 guides on how to conduct usability studies', and more and more organisations use those techniques as a matter of course.

The suggestion that 'design / user research' equals 'usability research' made me think. In the past few years, I was lucky to take on various 'research challenges' within Skyscanner's Apps Tribe and its Design Team. As our product grows, we face more and more complex problems, and it strikes me that we need to understand the nature of human beings ever more profoundly.

On this journey, Steve 'Buzz' Pearce and Bálint Orosz, two of my professional mentors at Skyscanner, inspired me to try out or develop new methods to answer the fundamental questions our travellers face. It helped me realise how diverse the world of 'design / user research' is, and that besides 'usability research' multiple other fields of research exist that can also add significant value to product development.

Let me share with you a framework I call the 'Four Layers of Research'. It is really more of a research mindset, and it would be great to hear whether you can relate to it, and what methods you use for the layers described below.

The Four Layers of Research

four_layers_of_research-001

1. Usability research

When building products, a highly important factor is whether or not people can actually use what we build. To illustrate: if they want to move forward in our app, will they find out how to take the 'next step'? If they want to 'go back' one step, will they figure out how to do it? In this sense, usability research is all about making sure that the way we implemented our solution is in line with what people expect and what feels natural to them.

Simply put, in usability research studies we are not asking whether people need the 'Back button' or not; we assume that they need it. The question we focus on is whether, at the moment they want to go back, they know immediately, intuitively and without further thought how to do so.

2. Valuability research

This area of research is all about understanding whether people actually need a ‘Back button’ or not. Valuability research could help a lot in validating or falsifying a solution we plan to build for our travellers.

'Validating or falsifying' and 'plan to build' are the key terms here. We all have many nice ideas about what to build for our users, but one of the most important questions is whether people really need that solution. In valuability research studies, we consciously ignore whether our solution is usable; we focus instead on whether it adds value and meaning to people's everyday lives.

In other words: does our solution really resonate with our travellers' needs, and does it really solve something valuable for them? Valuability research can be a powerful tool in the 'product discovery phase', specifically the stage before we start building anything, when we're still only planning to build 'something'.

Honestly speaking, separating 'usability' and 'valuability' questions in research studies is extremely hard for me. With prototypes, there are so many things that create 'noise' and make it hard to identify why our solution fails in user discussions. The ultimate question is always there – did our solution fail because users don't need it, or because we created something absolutely unusable for them?

To overcome this, emotional intelligence best practices and a careful reading of people's emotions and mental state help me a lot. Can you recall moments when a user realised the value of a feature you worked on and started talking about it honestly and passionately? Shining eyes can be a good sign that you might have built something lovable (on the other hand, I try to keep in mind Peter Schwartz's thought: it is always worth asking yourself, "how could I be wrong?").

3. Contextual research

There are two different types of situations that regularly come up in our product development processes:

  • what are the needs our users have not yet recognised themselves, but would really love us to figure out for them?
  • or: we come up with new directions, new products or sets of features for a group of people and start believing they would add lots of value, but how can we validate or falsify our assumptions, and how can we learn more about their context and environment so that we fit into them naturally?

In such situations, it can be best to 'live and breathe' with the people for whom we would like to build 'that next big thing'. This is often referred to as 'ethnographic research', or we can call it 'contextual research' as well.

At this early stage of a new product or feature seed, it could easily derail us if we don’t experience directly, but instead assume we understand, how the people for whom we plan to build feel, live and behave. In the short run, it’s of course faster, cheaper, easier and more comfortable to ‘imagine’ how that group of users could feel and behave. But in the mid-run, it adds lots of value (and helps decrease risks) if we jump into the context of those users and try to understand every aspect of their life and their emotions. In product development, we always refer to the importance of ‘the users’ context’ and to the importance of their emotional and mental state. The most meaningful way for us is if we just spend time amongst our users, talking with them, living with them and, in this way, obtaining a deeper understanding of them.

Being with them also enables us to understand them consciously. This is one form of what we call research bringing 'people's context into the house', and it opens up opportunities for product design to come up with solutions that really resonate with people. To be very pragmatic, contextual research can help you understand how people live and what emotions they have, and you may spot a need that leads you to design a 'Back button' (and then test its valuability and its usability).

4. Conceptual research

Have you ever had a very fundamental question that everybody around you refers to, but you never had the chance to spend enough time with it, go deep enough and understand how it's embedded in human nature? We love these fundamental questions, such as 'what is trust?', 'what is personalisation?' or 'who are our travellers?'. They help us question the status quo by going deeper and deeper, day by day.

To illustrate this with a tangible example: for our trust research, we turned to respected professors and subject matter experts in the fields of social sciences and behavioural psychology. We examined various concepts, tried to embrace as many thoughts as possible about the abstract notion of 'trust' and considered how we could apply our learnings to the world of digital products. Then we distilled our learnings into a practical tool we call the 'Trust Map'. The Trust Map enabled us to analyse our iOS application through the lens of trust (based on feedback we captured from our travellers). In a workshop we came up with various ideas on how to move forward. Of course we had tons of ideas, but once they were all on a sheet of paper we started to realise how they were connected to each other and could synthesise them into topics. Now we have a set of topics on the table, and we think that if we explore them further, they can help us build more meaningful and trustworthy relationships with our travellers in a more 'human' way.

So how did conceptual research help us? We translated this abstract substance called ‘trust’ into opportunities in our product. And as a squad or a design working group picks up one of these topics, they can start a focused ‘product discovery process’: do some contextual research to gather some real-life experiences, then craft and prototype solutions, test whether those solutions are valuable for users, iterate on them and if they are confident about the value of their solution, then test its usability. At the end of the journey, release and learn. And iterate and learn and iterate.

***

László Priskin, Design / User Researcher at Skyscanner. László is based in Budapest, Hungary, working as a team member on Skyscanner's renewed mobile app, available on Android and iOS. He started sharing his thoughts because he passionately believes in the power of discussion. He thinks whatever is written above will be outdated in a few weeks' time, because building products means that we inspire each other, criticise each other and continuously expand our ways of thinking. László is happy to hear from you in the comments below, or on LinkedIn or Twitter. Views are his own.

***

This blogpost is part of our apps series on Codevoyagers. Check out our previous posts:

The present and future of app release at Skyscanner

Cooking up an alternative: making a ‘knocking gesture’ controlled app

How we improved the quality of app analytics data by 10x

Designing Mobile First Service APIs

How We Migrated Our Objective C Projects to Swift – Step By Step

Analytics and Data Driven Development in Apps

Transitioning From Objective C to Swift in 4 Steps – Without Rewriting The Existing Code


From Flask to aiohttp

Posted on by Manuel Miranda


Contents

This post is about how to have a global context shared during the flow of a request in aiohttp. It is structured as follows:

  • Why, the context
  • How, the implementation
  • The ensure_future & co
  • What(,) now?
  • References

Why, the context

In Skyscanner hotels we are developing a new service with Python 3 (h*ll yeah!), asyncio and aiohttp, among other tools. As you can imagine, the company architecture is full of different microservices, and tracking a user's journey through them can be really painful. That's why there is a guideline saying that all services should use something that allows us to track this journey across services. This something is the X-CorrelationID header. So, to ensure proper traceability, our service should:

  1. Send the X-CorrelationID header in all calls to Skyscanner services.
  2. Include the X-CorrelationID in all log traces related to a request/response cycle.
  3. Return the X-CorrelationID used in the response.

Request/Response cycle with X-CorrelationID header

From the diagram above, you can see that we will be reusing the header in many places (calls to services, log calls, etc.). Knowing that, you may realise that this header should be stored somewhere accessible from everywhere in our code.

If you've ever worked with aiohttp, you may have seen that there is no built-in way of sharing state or storing global variables within the request/response cycle. The way to share the request information is to propagate the request object through all the function calls. If you've ever worked with Django, it's the same pattern.

Obviously, we could do that and finish the post here, but it goes against clean code, maintainability, DRY and other best-practice principles. So the question is: how can we share this variable throughout the request/response cycle without passing it explicitly?

At that point, I decided to do some research: ask other engineers about similar patterns, check code from other tools/frameworks, etc. After a while, my brain looked more or less like this:

Wordcloud

So yes, I came across an interesting pattern: how Flask uses thread-local storage to keep per-request information such as the request object. I won't go deep into how Flask works, but just to give you an idea, let's read a paragraph and look at some code from the "How the Context Works" section of the Flask docs (take your time):

The method request_context() returns a new RequestContext object and uses it in combination with the with statement to bind the context. Everything that is called from the same thread from this point onwards until the end of the with statement will have access to the request globals (flask.request and others).
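A condensed sketch of that idea (using test_request_context(), a convenience wrapper around request_context() that builds the WSGI environ for you; the route and assertions are illustrative):

    from flask import Flask, request

    app = Flask(__name__)

    # test_request_context() builds a WSGI environ and binds a RequestContext for
    # the duration of the with block; inside it, the thread-local proxy
    # flask.request points at this very request.
    with app.test_request_context("/hello", method="GET"):
        assert request.path == "/hello"
        assert request.method == "GET"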

That piece of code means that everything executed inside the context manager has access to the request object. Awesome, isn't it? By just executing from flask import request in any part of our code, as long as we are inside the context manager, it will give us the request object belonging to the current request/response cycle!

Clean and simple, right? After digging into that, my thought was: can we do the same with aiohttp? The answer is yes, and the next section describes how we implemented a similar behaviour with aiohttp (only for the header).

How, the implementation

To recap the previous section: “We want a variable to be easily accessible from any part of our code during the request/response cycle without the need to pass it explicitly to all function calls”.

Python coroutines are executed within the asyncio loop. This loop is in charge of picking up Futures, Tasks, etc. and executing them. Every time you use asyncio.ensure_future, await or other asynchronous calls, the code is executed within a Task instance which is scheduled in the loop. You can think of Tasks as small units of work that the loop processes sequentially.

This gives us an object where we can store the shared data for the whole cycle. Here are some things to keep in mind:

  • The aiohttp request/response cycle is executed within a single Task.
  • Every time a coroutine is called with the await or yield from syntax, the code is executed in the same Task.
  • Other calls like asyncio.ensure_future, asyncio.call_soon, etc. schedule the work outside the current Task (ensure_future wraps the coroutine in a new Task instance). If we want to share the context there, we will have to do something extra – see the sketch below.
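To make those Task boundaries concrete, here is a tiny self-contained example using the Python 3.5-era asyncio API this post is based on (asyncio.Task.current_task(); newer Python versions expose asyncio.current_task() instead):

    import asyncio

    async def who_runs_me():
        # Returns the Task this coroutine is executed in.
        return asyncio.Task.current_task()

    async def main():
        me = asyncio.Task.current_task()
        assert await who_runs_me() is me              # await stays in the same Task
        other = asyncio.ensure_future(who_runs_me())  # ensure_future creates a new Task
        assert await other is not me

    asyncio.get_event_loop().run_until_complete(main())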

Seems we are onto something, right? The object we want to work with is the Task. After checking its API reference, you can see there isn't a structure, function call or anything else that allows us to store context information; but since we are in Python, we can simply do task.context = {"X-CorrelationID": "1234"}.

Integrating task context with aiohttp

If you've read the previous section, you know that we want to store the "X-CorrelationID" header so that it is easily accessible throughout the request/response cycle: we use it in log calls and in calls to external services, and return it in the response. To do that, we've coded a simple middleware along the lines of the sketch below.
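A minimal sketch of such a middleware, written in the middleware-factory style aiohttp used at the time (the generated-UUID fallback and the module name context are illustrative rather than the exact production code):

    import uuid

    import context  # the small proxy module described below

    async def correlation_id_middleware(app, handler):
        """aiohttp middleware factory: store the X-CorrelationID of each request in the current Task."""
        async def middleware(request):
            correlation_id = request.headers.get("X-CorrelationID", str(uuid.uuid4()))
            context.set("X-CorrelationID", correlation_id)
            response = await handler(request)
            response.headers["X-CorrelationID"] = correlation_id  # return it to the caller
            return response
        return middleware

The factory is then registered when the application is created, e.g. web.Application(middlewares=[correlation_id_middleware]).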

Note the import context line. The module is just a proxy to the Task.context attribute of the current Task being executed:
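A minimal sketch of that proxy module (the published aiotask_context package is the reference implementation; this only captures the gist of it):

    # context.py
    import asyncio

    def get(key, default=None):
        """Read a value from the context dict attached to the current Task."""
        task = asyncio.Task.current_task()
        return getattr(task, "context", {}).get(key, default) if task else default

    def set(key, value):
        """Store a value in the context dict attached to the current Task."""
        task = asyncio.Task.current_task()
        if task is None:
            raise RuntimeError("context.set() must be called from within a Task")
        if not hasattr(task, "context"):
            task.context = {}
        task.context[key] = value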

Easy peasy, right? By just calling context.get(key) we get the value stored in Task.context[key], where Task is the one currently being executed. By calling context.set we set the value for the given key.

Note that from now on you can do context.get("X-CorrelationID") from ANY part of your code, and it will return the value if it exists. This, for example, allows us to inject the X-CorrelationID into our logs automatically using a custom logging.Filter:
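A sketch of such a filter (the logger name and format string are illustrative):

    import logging

    import context

    class CorrelationIdFilter(logging.Filter):
        """Attach the current X-CorrelationID to every log record."""
        def filter(self, record):
            record.correlation_id = context.get("X-CorrelationID", "-")
            return True

    handler = logging.StreamHandler()
    handler.addFilter(CorrelationIdFilter())
    handler.setFormatter(logging.Formatter("%(correlation_id)s %(levelname)s %(message)s"))
    logging.getLogger("myservice").addHandler(handler)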

The same pattern is used to inject the header when calling an internal Skyscanner service:
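Roughly like this (the URL handling and the throwaway ClientSession are for illustration only):

    import aiohttp

    import context

    async def call_internal_service(url):
        """Forward the current X-CorrelationID when calling another internal service."""
        headers = {"X-CorrelationID": context.get("X-CorrelationID", "unknown")}
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers=headers) as response:
                return await response.json()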

For simple flows, which cover most of our use cases, this works just fine!

The ensure_future & co

As previously mentioned, the ensure_future call returns a new Task. This means that the custom context attribute we were using is lost across that call. In our code, we solve this by creating a context.ensure_future helper that wraps the original one:
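A sketch of such a wrapper, as an addition to the context.py module sketched above (whether the context dict is copied or shared with the new Task is an implementation choice; copying is assumed here):

    # context.py (continued)
    import asyncio

    def ensure_future(coro):
        """Schedule coro in a new Task, carrying the caller's context over to it."""
        current = asyncio.Task.current_task()
        task = asyncio.ensure_future(coro)
        # The new Task has not started running yet, so it is safe to attach the
        # context now; copying keeps the two Tasks from mutating each other's data.
        task.context = dict(getattr(current, "context", {}))
        return task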

This is the part I'm least happy about, because it's not transparent to the user. It will be improved in future versions.

What(,) now?

I've moved this simple code to a GitHub repository called aiotask_context. Right now it only includes the functions for getting and setting variables in the current task. Some future work I'm planning:

  • Implement a mechanism to propagate the context when using asyncio.ensure_future, asyncio.call_soon and similar calls. Candidates are wrapping (meh) or monkey patching (uhm…).
  • Add more control to know whether the current code is being executed inside a Task and act accordingly.
  • Include examples in the repository, like the aiohttp middleware, request passing, log field injection, etc.

Just to finish, if you have any questions or feedback, don’t hesitate to comment and ask!

References

Interesting links I used:

Other projects implementing this pattern (haven’t tried them):