Journey to the centre of MemcachedPosted on by David Oliveira
Journey to the centre of memcached
A few months ago our team noticed that Skippy (Skyscanner’s deep linking engine and one of the components our squad looks after) had started to log an increasing number of
NotInCache exceptions. Those exceptions occur when we try to lookup a trip/hotel/car hire in the cache and it can’t be found, causing a new pricing request to be issued. This results in the redirect taking far longer than usual, so not a desirable situation at all.
After a few hours of investigation, we found that some of the data we were sending to Elasticache/memcached was actually being rejected with a
"SERVER_ERROR Out of memory storing object" error message.
We’ve always relied on memcached as our caching platform and even though we knew there was a practical certainty we would exceed the cluster capacity, we thought memcached would deal with it, dropping the oldest items with low or no impact at all. But instead we found that we couldn’t store new data there, causing many recent items to not be found in cache and effectivel degrading the customer experience on skyscanner.net.
Initial investigations bore little fruit and we quickly concluded that our answers were hidden somewhere deep in the inner workings of memcached itself, thus our low-level investigations began.
How does memcached actually work?
Memcached organises its memory in slabs, predetermined lists for objects of the same size range. This reduces the pain of memory fragmentation, since all the objects in a given slab have a similar size.
By default the memory is split in 42 slabs, with the first slab for items up to 96 bytes (on 64 bit architectures), the second one for items from 96 to 120 bytes, etc… and the last one for items from 753.1K to 1MB (maximum item size). The size range of the next slab is always increased by a factor of 1.25 (default) and rounded up to the next multiple of 8.
A page is a memory area of 1MB which contains as many chunks as will fit. Pages are allocated by slabs to store chunks, each one containing a single item. A slab can allocate as many pages as the ones available according to the
-m parameter (maximum memory to use for items).
A chunk is the minimum allocated space for a single item, i.e. a value with the string “super small” will be assigned to the first slab, which contains the items up to 96 bytes. However, every single item on that slab will use 96 bytes, even though their current size might be smaller. This mechanism obviously wastes some memory but it also reduces the performance impact of value updates.
Figure 1 – A visual representation of how memcached organises its data.
How do we run out of memory ?
Once we understood exactly how memcached organises information we could say that we would run out of memory when all the available pages are allocated to slabs.
However, memcached is designed to evict old/unused items in order to store new ones, so how does that work?
For every single slab, memcached keeps a list of the items in the corresponding slab sorted by use (get/set/update) time – the LRU (least recently used) list. So when memory is necessary to store an item in a given slab, it goes straight to the start of the LRU list of the corresponding slab and tries to remove some items to make space for the new one.
In theory it should be enough to remove a single item to make space for another one, however the item that we want to delete might actually be locked and therefore not able to be deleted – so we try the next one, and so on.
In order to keep a limited response time, memcached only tries to remove the first 5 items of the LRU – after that it simply gives up and answers with “SERVER_ERROR Out of memory storing object”.
Figure 2 – regardless of how the items are distributed amongst pages, they are correctly sorted by usage time on the LRU list.
So why would items get locked ?
Every item operation (get, set, update or remove) requires the item in question to be locked. Yes, even get operations require an item lock. That’s because if you’re getting an item from memcached, you want to prevent it from being modified/removed while it’s being sent through the network, otherwise the item’s data would get corrupted. The same applies to updates, increments, etc… The lock ensures data sanity and operation atomicity. Also some internal housekeeping processes might cause items to get locked but let’s not focus on that for now.
In order to prove that we could run out of memory just by locking 5 items, we ran the following test:
- Launch a memcached instance
- Store 5 items of 1-96 bytes length (that’s important because it will map to the first slab)
- Request the 5 items we just stored but don’t
read()their values (
get X\r\n– being X each one of the values we used on the previous step) – that should keep the 5 items locked
- Store thousands of 1-96 byte items on memcached in order to fill up its available memory and force it to try to evict data on slab #1
On the last step of the test we started to see the
"SERVER_ERROR Out of memory storing object" once the 5 oldest items (the ones on the top of the LRU) were all locked, making it impossible for memcached to release memory for the new items.
Measuring the pressure
To understand better the patterns of data we’re storing on memcached and to better visualise which slabs are getting more pressure to evict data we can use
memcached-tool, a script that comes along with memcached source code. It also allows you to get an overview of the distribution of your data amongst slabs and a few other metrics that are quite important. We’ve improved memcached-tool by making it print slab and global memory efficiency. You can check it out here: memcached-tool-ng
The stats for one of the nodes of our cluster
From the screenshot above we can notice that there are definitely 2 groups of slabs under a lot of pressure: slab #2 for items between 97 and 120 bytes and slabs #13, #14 and #15 for items between 1.2kbytes and 2.3kbytes. Also the slabs around #13 and #15 have some considerable pressure. We can see this based on the number of items (~53 million, these 4 slabs) and evictions – if we have to evict items, it means we’re already already running out of memory. Also those slabs use a considerable amount of space (~46GB altogether).
Other important values are the
OOM (out of memory), which tells us the number of times we weren’t able to evict data, and the
Max_age which is the age of the oldest item in the given slab.
So is it possible to avoid the locks?
Our platform uses memcached quite intensively – we set roughly 2 items on memcached for nearly every single commercial flight combination in the world, multiplied by the number of airlines and agencies selling that flight, plus a few hundred thousand items of other business data. That can go over 500K memcached sets per second per cluster at peak times.
We also run our platform on about a thousand AWS instances making it almost 100k concurrent connections to memcached retrieving and storing data simultaneously.
This combination of circumstances makes it really easy to get into this kind of concurrency problems.
So in order to avoid being unable to store data on memcached due to locked items, we had to tweak a few settings:
lru_crawler: Enables a background thread to check for expired items and remove them – a good thing if you want to keep your slabs clean and avoiding evictions;
lru_maintainer: Splits the LRU’s in 3 sub-LRU’s (HOT, WARM and COLD); New items are put in the HOT LRU; Items flow from HOT/WARM into COLD; A background thread shuffles items withing the sub-LRU’s as limits are reached – it avoids having always the same items at the top of the LRU;
Other settings you might want to check:
-f(chunk size growth factor): Defines the growth factor between slabs (as mentioned on the Slabs topic); It defaults to
1.25; By changing it to a lower value you might end up with more slabs (spreading the pressure of the evictions) but be aware that you can’t have more than 63 slabs – so, for instance, if you change it to
1.10, you will have 63 small slabs but the last one is going to contain all the items between 42k and 1MB (being 1MB the maximum item size), which means every item will take 1MB, causing a really bad memory efficiency;
-I: Defines the maximum item size; For instance, if you don’t store items higher than 500kb, you might want to tweak this setting along with the
slab_chunk_maxsettings so you can spread your data amongst more slabs;
expirezero_does_not_evict: Defines whether an item with expire time=0 is evictable or not; If that’s
1(on), you will get OOM errors as the limits are reached; We didn’t have to tweak this as the default if
slab_automove: If you have a long running memcached instance and your usage pattern (key/item size) has changed overtime, you might be incurring on many evictions. It happens because once memcached reaches its memory limits, it won’t be able to allocate more pages to slabs that need them, so the pattern of allocated pages per slab is set forever. These 2 parameters make memcached take pages from slabs without evictions, when a slab is seen having the highest eviction count 3 times, 10 seconds apart.
Memcached is a widely used distributed caching platform with a relatively small learning curve. It barely requires a configuration before you’re able to use it and its API is extremely simple and straight forward.
Despite its simplicity, the way it stores data and manages its memory limits might have some caveats, eventually leading you to some unexpected behaviour. Understanding how your data looks, how memcached organises it and seeing how your cluster is performing will definitely bring you one step further on the control of your whole platform and knowing its limits, avoiding bad surprises in the future.