Set in Stone and Cached in Steel
Posted by Robbie Cole
Skyscanner offers its suite of products in more than 30 languages, across more than 40 country-specific domains. Maintaining all the international meta-data we need is a big job, so we have a single authoritative source for the purpose — the Culture Service. Instead of having static tables of data compiled into every independent binary, the Culture Service offers a set of HTTP endpoints that return nice JSON blobs, which are easily consumable across all of our diverse technology stacks. This means that the data can be managed centrally but changes propagate through the ecosystem automagically.
Recently, we got a seemingly innocuous request to improve the basics of that offering. “Can the Culture Service accept If-Modified-Since request headers and return Last-Modified headers in its 200 responses, then return bodiless 304 responses if nothing has changed since the last request?” Not unreasonable – the data inside the Culture Service is volatile enough that it needs to be contained in one place for easy updating, but it’s also stable enough that it should be perfectly safe (and beneficial) to support standard HTTP caching behaviour.
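To make that request concrete, here is a minimal sketch of what a consumer-side conditional GET looks like, written in Python for illustration (the Culture Service itself runs on .NET, and the endpoint URL below is hypothetical, not a real route):

```python
import requests  # third-party HTTP client

# Hypothetical endpoint for illustration; the real Culture Service routes differ.
URL = "https://culture-service.example.net/cultures/markets"

last_modified = None  # remembered from a previous 200 response


def fetch_markets():
    """Fetch the market data, revalidating with If-Modified-Since."""
    global last_modified
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    response = requests.get(URL, headers=headers, timeout=5)

    if response.status_code == 304:
        return None  # nothing changed; keep using the locally cached copy

    response.raise_for_status()
    last_modified = response.headers.get("Last-Modified")
    return response.json()  # fresh JSON blob
```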
Unfortunately, the implementation of the Culture Service ensured that this request was highly impractical to fulfil – so we had to do our biggest refactor ever, migrating to a radical new architecture.
Trees Will Not Grow On Sand
The Culture Service used to refresh its underlying dataset every hour. In practice, a consumer would make a request, the service would find that the desired value was no longer present in the cache, and repopulation would occur.
This had two issues:
- Performance concerns: the unlucky requester that caused the cache to repopulate got a noticeably slower response than other requests. The Culture Service has some critical-path functionality, so these spikes could impact real users performing real interactions, not just back-end functions.
- Wasted refreshes: the vast majority of the time nothing had changed, so these refreshes were totally unnecessary.
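For contrast, the old read-through pattern looked roughly like this (a sketch with hypothetical names, and with the expensive multi-source rebuild reduced to a stub):

```python
import time

CACHE_TTL_SECONDS = 3600  # entries go stale after an hour

_cache = {}  # key -> (value, fetched_at)


def rebuild_from_all_sources(key):
    # Stand-in for the slow marshalling of .Net culture info, databases,
    # CLDR files and overrides into one distilled value.
    time.sleep(2)
    return {"key": key}


def get_cultural_data(key):
    """Old pattern: the first request after expiry pays for the rebuild."""
    entry = _cache.get(key)
    if entry and time.time() - entry[1] < CACHE_TTL_SECONDS:
        return entry[0]  # fast path: cached value is still fresh

    # Slow path: this unlucky request blocks while the dataset is rebuilt.
    value = rebuild_from_all_sources(key)
    _cache[key] = (value, time.time())
    return value
```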
The Culture Service is a little bit slow at cache population because it does not have a single clean data source. It constructs the suite of cultural settings that our products need from various independent places – the .Net/Windows culture information, internal databases, Unicode CLDR data files, more overrides…
That also means it has a hard time tracking when any one of these sources has been modified, let alone whether that change actually affected the final, distilled output. There was simply no practical way to track and serve meaningful Last-Modified dates with the old architecture.
The Push Always Shines On T.V.
The solution is to turn the cache on its head. Rather than constantly refreshing data that we know doesn’t change very often, we should hold onto the data we have until told otherwise.
So we split the single monolithic microservice into two sub-applications: the ‘loader’ application and the web application. The loader is a batch process, a command-line executable run on a schedule that handles all the slow and grungy marshaling of data and pushes it into the cache, while the web application is the plain old RESTful web service that simply serves whatever is in the cache.
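In outline, the split looks something like the sketch below, assuming nothing more than a shared key-value cache between the two halves; all names are illustrative rather than the real ones:

```python
# Shared cache sitting between the two applications. A plain dict stands
# in for whatever external store the real service uses.
shared_cache = {}


def load_from_all_sources():
    # Stand-in for the slow, grungy marshalling of culture data from
    # .Net culture info, databases, CLDR files and overrides.
    return {"market:UK": {"locale": "en-GB", "currency": "GBP"}}


# --- Loader application: a scheduled batch job -------------------------
def run_loader():
    """Do all the slow work up front and push the results into the cache."""
    for key, value in load_from_all_sources().items():
        shared_cache[key] = value


# --- Web application: a thin read-only layer ----------------------------
def handle_request(key):
    """Serve whatever is in the cache; never rebuild on the request path."""
    value = shared_cache.get(key)
    return (200, value) if value is not None else (404, None)
```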
This divided architecture gives us several major advantages, above and beyond fulfilment of the original need to track modification dates.
♫ Please don’t ask me to defend, the shameful lowlands of the way I’ve been caching suboptimally through time, woh-oh-oh… ♫
The first is that the web layer is vastly simplified: most of the Culture Service logic now lives in the data-marshaling loader, leaving the web application leaner and faster. There is no need to instantiate dependencies that only get called during repopulation, there are now well-defined boundaries that reduce the surface area for debugging, test environments can run without database access…
Now that cache repopulation is no longer triggered by user requests, there are no lag spikes — response times are consistent all day long.
Finally, and perhaps best of all, the web layer is now insulated from all the external dependencies. Previously, a single misbehaving dependency (for example, a lost database connection) could cause cache repopulation to fail.
Now, however, a failure in the loader layer simply does not refresh the cache. The only risk is that we might serve stale data for slightly longer, but as I said earlier, our cultural data is stable enough that this would have a very low impact.
The Living Date-lights
Funnily enough, because the loader code now runs outside the normal request flow, it is actually free to slow down: its performance does not directly impact consumers.
Crucially, this increased performance budget means the loader can afford to track modifications from all its myriad data sources in one simple but brute-force way: by reconstructing the entire dataset and comparing it with the existing dataset. If there are changes, they get pushed to the cache, and if not, nothing happens. Either way, the web application picks everything up from the cache without delay.
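Something like the following sketch captures the idea, assuming the cached values are JSON-serialisable so they can be compared by a stable fingerprint (again, the names here are hypothetical):

```python
import hashlib
import json
from datetime import datetime, timezone

cache = {}        # key -> value currently being served
modified_at = {}  # key -> datetime when that value last changed


def fingerprint(value):
    """Stable hash of a JSON-serialisable value, for cheap comparison."""
    blob = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def refresh(new_dataset):
    """Rebuild everything, but only push items that actually changed."""
    now = datetime.now(timezone.utc)
    for key, value in new_dataset.items():
        if key in cache and fingerprint(cache[key]) == fingerprint(value):
            continue  # unchanged: leave the cache (and its date) alone
        cache[key] = value
        modified_at[key] = now  # this becomes the item's Last-Modified date
```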
This brings us right back to the original goal, as we can now accurately track modification dates by noting down when a new or modified item is pushed to the cache. Then we can pass these dates up through the web layer, do a bit of juggling to decide whether to return a 200 or a 304 based on the given If-Modified-Since date and our Last-Modified date, and the job is done.
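That "bit of juggling" amounts to comparing two timestamps. A rough sketch of the decision, independent of any particular web framework:

```python
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(body, last_modified, if_modified_since):
    """Return (status, headers, body) for one cached item.

    last_modified is the timezone-aware UTC datetime recorded by the loader;
    if_modified_since is the raw request header value, or None.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: ignore it and return 200
        # HTTP dates have one-second resolution, so compare at that precision.
        if since and last_modified.replace(microsecond=0) <= since:
            return 304, headers, None  # nothing newer: bodiless response

    return 200, headers, body
```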
With all the other benefits this approach brings, the work was a massive success on every level: we’d recommend this architecture to anyone!