Common Pitfalls in Experimentation
By Colin McFarland
Through experiments we expose our ideas to empirical evaluation. Naturally, uncertainty follows, but the organisational mind-set one develops around this uncertainty is crucial.
More often than not, the things we believe will improve metrics simply don’t. When exposed to real users in controlled experiments, those no-brainer ideas we assume to be obvious often fail to improve the metrics designed for them. In contrast to the case studies published by agencies selling their services, research from Internet Economy organisations we trust reveals the failure statistics of experimentation to be humbling.
In his book Uncontrolled, Jim Manzi revealed that at Google only about 10% of experiments lead to business changes. It’s an arrestingly small number, but it’s a similar story at Microsoft.
In Online Experimentation at Microsoft (PDF), Ron Kohavi confides that of their experiments, only about one third were successful in improving the key metric.
“It’s been humbling to realize how rare it is for [features] to succeed on the first attempt”.
From thousands of experiments during my time leading Experimentation at Shop Direct, intuition was wrong more often than not. At Skyscanner, early evidence suggests around 80% of experiments fail to improve their predicted metrics.
It’s easy to see how this idea could be industry-wide and poorly recognized. If any Internet Economy business is to seriously compete, it needs to begin by abandoning assumptions and moving from a culture of deployment to a culture of learning. As we start out, we believe these features and designs to be valuable – we’re investing time and effort building them – so what’s going wrong? Why do so many fail, and what can we do about it?
Apart from statistics around failure, little has been said about what we can do to tackle it. We can do better. If we understand where experiments go wrong, we can work to improve things. Understanding common pitfalls can help us determine why some experiments failed.
Presented here are common pitfalls in experimentation, loosely in order of impact.
Pitfall 1: Not Sharing or Prioritising Evidence
A common problem I see at scale across businesses is that the learning from experiments isn’t shared widely beyond the team running the experiment. Clearly, winning experiments should be promoted across the organisation. Since we fail more often than not, it’s easier to improve our success rate if we understand winning experiments in one domain so that we can repeat them in another.
When experiments fail it isn’t as straightforward. Hindsight can lead us to realise some experiments were executed poorly. The data from such an experiment could have limited or no value to others; perhaps it’s invalid. Far more important is sharing surprising failures: those ideas that many people would have expected to succeed. If your users have rejected a feature or idea and it surprised you, it will surprise others too. Sharing these surprising results is a wonderful opportunity.
You can save others from repeating the same experiment; they may adapt their method based on your data, or validate your experiment and offer new insights along the way. You’ll evolve your organisation’s understanding of cause and effect this way.
Prioritisation is hard, and sometimes you won’t even spot your misses, because our own ideas often seem a better choice than competing ideas simply because they are our ideas. We shouldn’t be surprised: Behavioural economists have demonstrated the IKEA effect [PDF]; when we construct products ourselves we overvalue them.
You’ll likely have many data sources available to you, and you should use them: data is data. Evidence from other experiments, along with qualitative and quantitative feedback, should challenge priorities at every opportunity. Experiments fail more often than not; taking evidence from others’ winning experiments gives us an opportunity to validate it at a wider scale. Evidence from experiments should challenge our execution of experiment details too.
Pitfall 2: Poor Hypothesis and No Overall Evaluation Criterion
Hypotheses are not beliefs. They are predictions. Too often we deploy an experiment and then try to find the data to prove ourselves right. This is a problem. Look for anything and you’ll find something, but this bias in analysis can lead you to miss a wider detrimental impact and to report many false positives.
A good place to start is by defining an Overall Evaluation Criterion (OEC), a small set of metrics agreed across your organisation. Over time the OEC can evolve. When you experiment against an OEC, most feature ideas are now simply hypotheses to improve it and we can move faster by proving ourselves wrong quickly.
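As an illustration only – the metric names, weights and guardrail below are assumptions, not a recommended OEC – the OEC can be thought of as an agreed, weighted combination of a small set of metrics, so that every experiment becomes a hypothesis about moving that one number:

```python
# Hypothetical OEC: a small, agreed set of metrics combined into one number.
# The metric names and weights are illustrative assumptions.
OEC_WEIGHTS = {
    "sessions_per_user": 0.5,
    "conversion_rate": 0.3,
    "seven_day_retention": 0.2,
}

def oec_score(metrics: dict) -> float:
    """Weighted combination of the agreed metrics (each normalised to control = 1.0)."""
    return sum(weight * metrics[name] for name, weight in OEC_WEIGHTS.items())

# Every experiment is then a hypothesis about moving this single number,
# alongside guardrails (e.g. page load time must not regress).
control   = {"sessions_per_user": 1.00, "conversion_rate": 1.00, "seven_day_retention": 1.00}
treatment = {"sessions_per_user": 1.02, "conversion_rate": 0.99, "seven_day_retention": 1.01}
print(f"OEC delta: {oec_score(treatment) - oec_score(control):+.3f}")
```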
Pitfall 3: Poor Understanding of Significance
Significance is widely misunderstood; even a book by the co-founder and CEO of one of the leading A/B testing platforms gets this wrong: ‘Once the test reaches statistical significance, you’ll have your answer’ and ‘When the test has reached a statistically significant conclusion, the tester sends a follow-up with results and the key takeaways’ are incorrect procedures that will lead to finding many more false positives than expected. The correct procedure is to determine the duration upfront, conduct power calculations to understand the traffic you need, and calculate the significance of your data only once the test has run for the full duration.
Another misconception I see often is the expectation that significance is somehow neatly wrapped with the experiment to prove it. Significance concerns itself only with the differences in the numbers. We are doing null hypothesis significance testing (NHST): the null hypothesis is the inverse of our prediction. With NHST the starting position for most experiments is that nothing interesting is happening, and the test offers a form of evidence, expressed as a p-value, that something interesting might be happening. The lower the p-value, the more you can trust the ‘interestingness’ of your data and draw conclusions accordingly.
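As a hedged sketch of that procedure – the baseline rate, minimum detectable lift and final counts below are illustrative assumptions, and the statsmodels library is assumed to be available – the power calculation happens before launch and the significance test happens once, at the end:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

# 1. Before launch: a power calculation fixes the traffic, and hence the duration.
baseline = 0.040        # assumed baseline conversion rate
target   = 0.042        # smallest treatment rate worth detecting (~5% relative lift)
effect = proportion_effectsize(target, baseline)
n_per_variant = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                             power=0.8, alternative="two-sided")
print(f"Run until ~{n_per_variant:,.0f} users per variant have been exposed, not before.")

# 2. Only after the full, pre-agreed duration: a single significance test.
conversions = [4_120, 4_310]        # control, treatment (illustrative final counts)
exposures   = [103_000, 103_000]
z_stat, p_value = proportions_ztest(conversions, exposures)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")   # evaluated once, at the end
```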
Pitfall 4: Running Inconclusive Experiments Longer
By its nature, running experiments causes waste. Many designs or features won’t survive; a lot of your work, no matter how much you want it to, just won’t land with your users. That can be hard to take, but it is part of our process. We are all passionate about our products, but we need to be careful not to fall in love with our own ideas and focus only on validating ourselves.
This can lead to running inconclusive experiments longer in the hope users will get used to things. While it is possible that novelty effects could cause deltas to change significantly, this is rare in our setting. Avoid running your experiment for longer than your power calculation requires simply because you aren’t getting the data you want.
There’s another problem with this approach. Let’s say you’re aiming to measure a 1% uplift and you already have the power to demonstrate it, but your experiment hasn’t; continuing to look for significance only increases your chances of a false positive.
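A small simulation illustrates the point; the traffic figures below are illustrative assumptions. We run A/A comparisons, where no real difference exists, and compare evaluating the p-value once at the pre-agreed end against stopping on the first day it dips below 0.05:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
days, users_per_day, rate, alpha = 14, 2_000, 0.05, 0.05

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

fp_fixed, fp_peeking = 0, 0
for _ in range(2_000):
    # Cumulative conversions for two identical variants (an A/A test).
    a = rng.binomial(users_per_day, rate, size=days).cumsum()
    b = rng.binomial(users_per_day, rate, size=days).cumsum()
    n = users_per_day * np.arange(1, days + 1)
    daily_p = [p_value(a[d], n[d], b[d], n[d]) for d in range(days)]
    fp_fixed   += daily_p[-1] < alpha      # evaluate once, at the planned end
    fp_peeking += min(daily_p) < alpha     # stop as soon as any day looks 'significant'

print(f"False positives, fixed duration: {fp_fixed / 2_000:.1%}")    # close to 5%
print(f"False positives, daily peeking:  {fp_peeking / 2_000:.1%}")  # far higher
```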
Ryan Singer said it best: “You can’t improve […] when you’re emotionally attached to previous decisions. Improvements come from flexibility and openness.” We’ll modify and repeat to learn further, but we should be humble enough to change tack if users tell us through their clicks that our ideas are off track.
Pitfall 5: Ineffective Ramp Up Strategy
In an uncontrolled setting a feature would ship to 100% of users, and it could take days for user feedback or data to show that something bad has happened. When you run an experiment you have the ability to detect and abort these bad changes quickly. Failure is inevitable in experiments, but expensive failure is not if we design effective ramp-up procedures.
Typically, the first stage will expose a change to around 5% of users to minimise the blast radius and the risk of releasing a bad change. The second stage will expose around 50% of users to measure the change, and a third stage (if your experiment is successful) will ramp the feature to all users (or to 95%, if you keep a holdout group to measure impact over time).
It’s crucial your ramp-up strategy doesn’t slow you down unnecessarily. An effective ramp-up strategy should have clear goals for each stage, and move forward to the next stage unless results are unsatisfactory. For example, stage one may aim to validate, within 24 hours, that there is no significant detrimental impact and that engineering and platform resources are performing reliably.
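As a hypothetical sketch only – the stage names, percentages, goals and durations below are illustrative assumptions, not a prescribed configuration – a ramp-up plan with explicit goals per stage might be written down as simply as this:

```python
from dataclasses import dataclass

@dataclass
class RampStage:
    name: str
    traffic_pct: int    # share of users exposed to the treatment
    goal: str           # what must be demonstrated before moving on
    max_hours: int      # time box, so a stage can't silently slow you down

RAMP_PLAN = [
    RampStage("canary",  5,  "no significant detriment; engineering and platform healthy", 24),
    RampStage("measure", 50, "measure the change against the OEC at full power", 14 * 24),
    RampStage("launch",  95, "ship broadly; keep a 5% holdout to measure impact over time", 0),
]

for stage in RAMP_PLAN:
    print(f"{stage.name}: {stage.traffic_pct}% of users, within {stage.max_hours}h -> {stage.goal}")
```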
Pitfall 6: Method That Doesn’t Match the Expected Uplift
On one hand we can make so many changes at once that their effects cancel each other out; on the other we can make changes so incremental that we can’t realistically measure them in a meaningful timeframe. Consider this as a hill-climbing metaphor. We’re using experiments to climb, but we’re doing it blindfolded. We can’t see upfront whether we’re at the top of a hill (a local maximum) with a much bigger mountain further ahead (the global maximum) that will need a leap to reach, or whether we’re far from the peak of this hill and can make big gains with small steps.
Experiments help us assess the terrain and design our leaps accordingly. Power calculations are our guide: for example, they might tell us that with the traffic we have, if we want to learn the impact of our change within two weeks, we can only detect a lift of 5% or more. We can then decide how to assess the terrain accordingly. For example: we have two weeks; will this background change plausibly give us the 5% uplift we need to detect? Experiment power is not an arbitrary decision like some other parts of the process; this statistical analysis is required to design effective experiments. Dan McKinley created an intuitive online calculator that we have adopted as standard.
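The same calculation can be run in reverse as a hedged sketch: given the traffic we expect over two weeks, what is the smallest lift we could reliably detect? The baseline rate and traffic figures below are illustrative assumptions, and statsmodels is assumed to be available:

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline = 0.04                         # assumed baseline conversion rate
visitors_per_day = 5_000                # assumed traffic per variant per day
n_per_variant = visitors_per_day * 14   # a two-week experiment

# Solve for the minimum detectable effect size (Cohen's h) at 80% power.
h = NormalIndPower().solve_power(nobs1=n_per_variant, alpha=0.05,
                                 power=0.8, alternative="two-sided")

# Convert Cohen's h back into a treatment conversion rate and a relative lift.
treatment = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
relative_lift = (treatment - baseline) / baseline
print(f"Smallest lift detectable in two weeks: ~{relative_lift:.1%}")
```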
Pitfall 7: Failing Ideas Not Experiments
You need to make an important distinction: a failed experiment is not a failed idea; it fails only the concept in its current condition. Failed experiments are not necessarily dead ends. You can learn a lot about your users’ behaviour from an experiment even if it didn’t result in a positive change to the metrics. A modification could turn your idea around. Small changes matter unless you are at a local maximum. Explore negative and neutral experiment data to help inform further iterations of the concept, or new hypotheses to be trialled. With segmentation there is a risk of false positives, so validation with new experiments is important.
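One reason segmentation raises the risk of false positives is that it multiplies the number of comparisons. As a minimal sketch – the segments and p-values below are made up – a standard multiple-comparisons correction can be applied before treating any segment result as a lead worth a follow-up experiment:

```python
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from post-hoc segment cuts of a single experiment.
segment_p_values = {
    "mobile": 0.03,
    "desktop": 0.20,
    "new_users": 0.04,
    "returning_users": 0.60,
    "direct_traffic": 0.01,
}

reject, corrected, _, _ = multipletests(list(segment_p_values.values()),
                                        alpha=0.05, method="holm")
for (segment, raw), adj, keep in zip(segment_p_values.items(), corrected, reject):
    verdict = "worth a follow-up experiment" if keep else "likely noise"
    print(f"{segment}: raw p = {raw:.2f}, corrected p = {adj:.2f} -> {verdict}")
```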
If conditions change, so too could the outcome. Consider that in highly experimented products conditions change often, and as you experiment your product constantly evolves. The implication is that experiments you accept now could change the context for ideas you rejected in the past. Similarly, Booking recognise that some ideas could simply be before their time.
What’s most important is you don’t take a winning experiment implementation and make it a sacred cow never to be explored or challenged further.
Prove Yourself Wrong
Understanding the common pitfalls in experimentation will help you get closer to determining when your experiments failed in execution or when they failed because your intuition about the concept was wrong. If your experiment failed in execution, you can quickly iterate, taking what you learned the first time to improve your approach to better assess the terrain next time.
It’s important to constantly try to increase your iterative capital by making experiments cheaper. If you determine it’s your intuition that’s wrong, this can be humbling but should be celebrated. Through this “prove yourself wrong” culture, new discoveries can be made and innovation can be accelerated.
Want to read more? Check out ‘Design Like You’re Right, Test Like You’re Wrong’, a direct response to the pitfalls above.
Thanks to the many people across Skyscanner and externally who provided feedback for this article. In the latter, special appreciation to Ron Kohavi (Microsoft), Ya Xu (LinkedIn) and Ben Dressler (Spotify).