One picture is worth a thousand words. So, how does it scale to a million pictures?

Posted by Jacek Wojdel
Well… it probably depends on whether they are all the same or not.
We always knew we wanted our hotel product to be very visual. Booking a hotel isn’t the same as booking a flight; photography really helps bring the hotel experience to life, which is why, on average, when a traveller looks at a hotel on the Skyscanner site, they’ll see around a dozen photos to help them make a decision on where to stay.
However, collecting these images is another matter. Every piece of information we present on our webpage is in fact a consolidated view derived from dozens of different sources. We partner with over a hundred providers, and each of them, for each hotel, will give us the hotel’s details (name, street address, type of accommodation, rating, etc.). It’s then the Hotels Data team’s job to decide which data to use to present it in the best way to our users. The automated process of doing so is what we call ‘Data Release’, so in essence:
If you just thought ‘deduplication’, or ‘entity resolution’, you’re on the right track. An integral part of the data provided to us is the images of the hotels. Our team is tasked with downloading all of them (literally millions) from our partners and figuring out which ones to present on our webpage. Again, this all happens automatically, in the ‘Image Release’ process.
About a year ago, this process ran in one of our data centres, took about three days, and could be initiated roughly once a month. Since then, we have moved to the cloud; it has become a continuously running process, synchronised weekly with the rest of Data Release. As part of this work, we had to figure out how to deduplicate images in a way that would be fast and suited to our needs. Here’s how we did it.
You might wonder what the deal is here. Couldn’t we just take all the pictures from the providers and display them on our website? Well… the result would probably look more or less like this:
Not exactly helpful, and certainly not the kind of experience we want travellers to have on our site. As you can see, most of the images from different providers are in fact all the same. Just to make things a bit more complicated, they might also be resized, recolored, trimmed, watermarked, etc. Effectively, we had to create a system that could automatically tell that:
The following two images are the same, and we should use the bigger one:
The following two are not:
The following are the same for our purposes:
The left is cropped, and the right is better:
The process of finding these image near-duplicates is best done by calculating a so-called image hash, and comparing the hashes of all of the images we have downloaded. There is a multitude of possible hashes: pHash (perceptual hash), aHash (average hash), dHash (difference hash)… and each comparison can be done at a varying level of accuracy… so how do we know which one to choose?
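To give a flavour of how such hashes work, here is a minimal difference hash (dHash) in pure Python. This is a sketch of the general technique, not our production code, and it assumes the image has already been resized down to a small grayscale grid:

```python
def dhash(pixels):
    """Difference hash: one bit per pixel pair, set when a pixel is
    darker than its right-hand neighbour. `pixels` is a small grayscale
    grid (e.g. 8 rows x 9 columns), i.e. the image after resizing; the
    resizing step itself is omitted here."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes; a small distance
    suggests two images are near-duplicates."""
    return bin(a ^ b).count("1")
```

Because the hash only encodes coarse brightness gradients, two copies of the same photo that differ in size, compression or slight recolouring tend to land a small Hamming distance apart, while genuinely different photos disagree on roughly half of their bits.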
Of course, we need to measure. Which brings us finally to the Image Release Corpus.
Image Release Corpus
A corpus is a set of data with accompanying manual labels attached. In our case, the corpus comprises about 1,200 images grouped into some 500 groups, each containing identical visual content. These were grouped manually, in a tedious process involving an HDTV and a small custom script for quick pre-grouping, browsing and labelling of images. Let me tell you: I do not ever want to see a hotel in Dubai again.
Once this work is done, we can run any image-deduplication algorithm on all of the images and measure its performance against the human decisions.
There are several measures that can be used for evaluation of performance:
· Purity – how many generated groups contain only a single manual label
· Completeness – how many generated groups contain all images of the same manual label
· Duplicates – how many identical images we are likely to show to the end user
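For concreteness, the first two measures can be sketched as follows. The data structures and names here are illustrative, not those of our actual dashboard: `groups` maps a generated group id to its image ids, and `labels` maps each image id to its manual label.

```python
def purity(groups, labels):
    """Fraction of generated groups whose images all share one manual label."""
    pure = sum(
        1 for imgs in groups.values()
        if len({labels[i] for i in imgs}) == 1
    )
    return pure / len(groups)


def completeness(groups, labels):
    """Fraction of manual labels whose images all fall in one generated group."""
    group_of = {i: g for g, imgs in groups.items() for i in imgs}
    by_label = {}
    for img, lab in labels.items():
        by_label.setdefault(lab, set()).add(group_of[img])
    return sum(1 for gs in by_label.values() if len(gs) == 1) / len(by_label)
```

A perfect deduplication scores 1.0 on both; over-splitting hurts completeness, over-merging hurts purity.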
In all of the possible approaches, one always has to balance being too strict about image comparison (which leads to a higher number of duplicates shown to the end user) against being too lenient (which leads to grouping different images together, and effectively losing images).
Of course, one of the cool things about being a developer is that you can write tools that will help you write tools for the task at hand. With the tools of your choice.
So, after a bit of fiddling with Jenkins, Django and AngularJS, we came up with a small dashboard that is updated on every push to our code repository and evaluates all of the measures for the current Image Release deduplication process.
In this way, we could quickly evaluate all the available image-hashing methods and experiment with different comparison accuracies. Additionally, for debugging purposes, we could dig further to see exactly what kind of mistakes the algorithm made on each group of images.
And we can even look into the specifics of image to image comparison.
Doing so allowed us to quickly evaluate our approach and choose one that not only worked faster and more reliably than what we started with, but also brought in more than 20% of the images that were previously discarded due to incorrect deduplication. At the same time, the probability of showing a duplicate image to the end user stayed the same.
Simple image deduplication is just the beginning. The potential for image analysis is certainly there, and we already have quite some data to work with. We might, one day, revisit it.