Design Like You’re Right, Test Like You’re Wrong

Posted by **Colin McFarland**

Trustworthy experiments are how we test our intuition to build better, data-driven products for our users and partners. The theoretical foundations for running controlled experiments are fairly straightforward; in practice, however, getting data you can trust can be challenging, especially at large scale.

To accelerate our learning from experiments it’s crucial that we form valid hypotheses, then embrace the critical thinking required to observe the data and evaluate it without bias. To support this, we’ve developed a new standard for forming hypotheses at Skyscanner.

### Hypothesis Kit

We developed a ‘Hypothesis Kit’ that is easy to use, designed to help experimenters with little or no experience onboard and start running trustworthy experiments quickly.

*Design like you’re right:*

Based on [quantitative/qualitative insight].

We predict that [product change] will cause [impact].

*Test like you’re wrong:*

We will test this by assuming the prediction has no effect (the null hypothesis) and running an experiment for [X week(s)].

If we can measure a [Y%] statistically significant change in [metric(s)] then we reject the null hypothesis and conclude there is an effect.

### Guard Against Bias

*Design like you’re right, test like you’re wrong* is a simple encapsulation of the critical thinking required to run trustworthy experiments. We are attuned to our user needs as travelers ourselves, but when making data driven decisions, we have to recognize and guard against the biases that can impact our decision making.

There’s confirmation bias, the tendency to focus on data that confirms our hypothesis while overlooking data that flies in the face of it. There’s also HARKing (Hypothesising After the Results are Known), the inclination to believe that surprising data is obvious after the fact and to retro-fit our hypothesis in light of it.

We can limit these errors by predicting things up front and analysing objectively. We do this by assuming we’re wrong and concluding there is an effect (rejecting the null hypothesis) only when changes are statistically significant. Most of the time, assuming we’re wrong simply means nothing interesting will happen in the data of our experiment.
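That decision rule can be sketched as a two-proportion z-test using only the Python standard library. This is an illustrative sketch, not Skyscanner's actual tooling, and the conversion counts below are made-up numbers:

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conversions_a, n_a, conversions_b, n_b):
    """Two-sided p-value for the difference between two observed
    conversion rates, using the pooled normal approximation."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Assume we're wrong: reject the null hypothesis only when p < alpha.
alpha = 0.05
p = two_proportion_p_value(1_000, 10_000, 1_060, 10_000)
significant = p < alpha
```

The p-value answers one question only: how surprising would this data be if the change truly had no effect? Anything short of significance is treated as "nothing interesting happened".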

### Gut vs Data

The way in which we should balance data-driven ideas with gut-based ideas is a common discussion point. Gut ideas are of course legitimate sources for a hypothesis; the important thing is that we’re testing them. With many experiments, we build our intuition over time. But we should prioritise objectively based on data. Consider your next experiment: will it be the hypothesis based on gut alone, or one based on strong evidence from prior research?

There are many qualitative and quantitative sources we can use to inform our experiments. We ask experimenters to state the data that informs the hypothesis, on the belief that it is always possible to gather data to inform our prediction. The larger the smallest statistically significant change we can measure in the experiment, the more compelling the supporting data should be, since large uplifts are rarely observed.

### Power Analysis

Power Analysis determines the minimum sample size we need to be reasonably confident of detecting a change in our experiment. Since our method is built upon frameworks of probability and statistics, there are inherent errors related to false positives and false negatives. Effective experiments aim to reduce both types of error; however, the experimenter should note that it is impossible to remove them entirely.
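The calculation itself is short. A minimal sketch using the standard normal approximation for comparing two conversion rates, with the 10% baseline and 1% relative uplift below chosen purely for illustration:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, mde_rel, alpha=0.05, power=0.8):
    """Minimum users per group to detect a relative change of mde_rel
    in a baseline conversion rate p_base (two-sided z-test)."""
    z = NormalDist().inv_cdf
    p_alt = p_base * (1 + mde_rel)
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    n = (z(1 - alpha / 2) + z(power)) ** 2 * variance / (p_alt - p_base) ** 2
    return math.ceil(n)

# A 1% relative uplift on a 10% baseline needs over a million users per
# group at the conventional alpha = 0.05 and 80% power.
n = sample_size_per_group(0.10, 0.01)
```

The defaults encode the two error types: `alpha` bounds the false-positive rate, and `power` (one minus the false-negative rate) is how often a real effect of that size would be detected.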

Conducting Power Analysis later in the experiment process is a common pitfall, perhaps alongside development or as an experiment is due to be deployed. This can lead to experiments testing a change too small to realistically reach the minimum change that can be measured. Or it can lead to the opposite: several changes bundled together that could have been measured individually, since the Power Analysis supports a multi-variant approach. Bringing power calculations to the very start of the process will help us make leaps relative to what we can measure.

### Power Hacking

Another challenge with power analysis is how we make the calculations. Traditionally, power analysis requires the experimenter to estimate how much they predict a metric will change with the experiment. This question is difficult to answer: we know our intuition is often wrong, and predicting a specific metric delta is harder still. Power Hacking often occurs as a result.

Power Hacking looks something like this: estimate an uplift, say 1%; conduct the power analysis and find the required duration is too long; then increase the expected impact of the experiment, to say 3%, to get a duration that meets expectations. This is a problem: we’ve hacked the power, but the change we’re making hasn’t grown to match the new expected uplift. As such, we could declare the experiment a failure when it was simply under-powered.
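The arithmetic shows why this hack is so tempting and so costly. Using the standard normal-approximation sample-size formula (the 10% baseline is an illustrative assumption):

```python
import math
from statistics import NormalDist

def n_required(p_base, rel_uplift, alpha=0.05, power=0.8):
    # Normal-approximation sample size per group for a two-proportion test.
    z = NormalDist().inv_cdf
    p_alt = p_base * (1 + rel_uplift)
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z(1 - alpha / 2) + z(power)) ** 2 * variance
                     / (p_alt - p_base) ** 2)

honest = n_required(0.10, 0.01)  # the original 1% estimate
hacked = n_required(0.10, 0.03)  # the inflated 3% estimate
# Tripling the assumed uplift cuts the required sample roughly ninefold
# (n scales with 1 / effect**2), but the real effect of the change is
# unchanged: the experiment is now under-powered for the 1% we expected.
```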

### Speed vs Certainty

To reduce Power Hacking we’ve made an important change to how we determine power. We base our calculations on the time we have to learn about the change, rather than on an arbitrary expected uplift (usually a guess). The implication is that rather than powering for how long to run an experiment in order to measure a given effect, we power for the minimum statistically significant change we can measure in the time we have to learn from it.

This matters: long-running experiments can measure statistically significant changes which are not necessarily practically significant (or what statisticians would call substantive). By prioritising fast experiments we focus our efforts on impact for our users and partners, ruling out less promising ideas earlier and moving to iteration sooner. We aim for most experiments to run for one or two weeks, so our kit tells us the smallest statistically significant change we can measure in that time. To increase our confidence in some experiments we’ll run longer validation experiments.
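Powering for time means inverting the usual calculation: fix the sample the experiment will see and solve for the smallest detectable change. A standard-library sketch, with the baseline rate and weekly traffic as illustrative assumptions:

```python
import math
from statistics import NormalDist

def min_detectable_change(p_base, users_per_week, weeks,
                          alpha=0.05, power=0.8):
    """Smallest relative change in conversion rate p_base that is
    detectable in the given time, assuming a 50/50 traffic split."""
    n_per_group = users_per_week * weeks / 2
    z = NormalDist().inv_cdf
    se = math.sqrt(2 * p_base * (1 - p_base) / n_per_group)
    mde_abs = (z(1 - alpha / 2) + z(power)) * se
    return mde_abs / p_base

# Doubling the duration shrinks the detectable change only by 1/sqrt(2),
# which is why running longer buys diminishing returns.
one_week = min_detectable_change(0.10, 100_000, 1)
two_weeks = min_detectable_change(0.10, 100_000, 2)
```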

### Multiple metrics

It’s important to consider all the metrics you intend to measure up front. When more than one success metric is involved, calculate the power for every metric and run the experiment for the metric that requires the largest sample. If you power only for metric X (the one you’re most interested in), you need to make sure metric Y (which is also important) can also be measured.
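One way to sketch this, where the metric names, baseline rates, and target changes are all made up for illustration:

```python
import math
from statistics import NormalDist

def n_per_group(p_base, mde_rel, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for one metric."""
    z = NormalDist().inv_cdf
    p_alt = p_base * (1 + mde_rel)
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z(1 - alpha / 2) + z(power)) ** 2 * variance
                     / (p_alt - p_base) ** 2)

# Hypothetical metrics: (baseline rate, smallest relative change we care about).
metrics = {"user-search": (0.30, 0.02), "booking": (0.05, 0.05)}
required = {name: n_per_group(p, mde) for name, (p, mde) in metrics.items()}
# Size the experiment for the most demanding metric, so every metric
# listed up front stays measurable.
n_needed = max(required.values())
```

Here the rarer metric with the wider target dominates; powering only for `user-search` would leave `booking` underpowered.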

Care should be taken with post-hoc segmentation of metrics. As you explore the data to understand more about the *why*, it’s good practice to use any discoveries to raise new hypotheses for further experiments. A further pitfall is segmenting until something interesting appears without realising that the segmented metric is underpowered. The key is to guard against these pitfalls, not just to find a winning metric.

### Peer Review

In any scientific endeavor peer review is important to assess the quality of research. This is usually a slow process. As Nassim Nicholas Taleb remarked, “Science evolves from funeral to funeral”. In other words, peer review is painfully slow but essential. We move fast, but we can benefit from peer review within our #experimentation community, and Hypothesis Kit helps us assess the quality and intent of any experiment quickly.

Let’s put the hypothesis kit to the test across a few different areas:

**An experiment for the homepage:**

*Design like you’re right:*

Based on the insight that confidence messaging has made a big impact in previous experiments.

We predict that by showing statement and provider logos localised for each market we will see an increase in users understanding and trusting our flights product when they land on the homepage.

*Test like you’re wrong:*

We will test this by assuming the prediction has no effect (the null hypothesis) and running an experiment for 1 week.

If we can measure a 1% statistically significant change in user-search then we reject the null hypothesis and conclude there is an effect.

**An experiment for Skyscanner Japan:**

*Design like you’re right:*

Based on the insight that Japan users are more likely to make a decision on popularity than price.

We predict that ordering results by popularity will cause increased user confidence and reduce bounce rate.

*Test like you’re wrong:*

We will test this by assuming the prediction has no effect (the null hypothesis) and running an experiment for 2 weeks.

If we can measure a negative 2% statistically significant change in bounce-rate then we reject the null hypothesis and conclude there is an effect.

**An experiment for email:**

*Design like you’re right:*

Based on the insight that our 30 most popular routes account for the vast majority of exits.

We predict that displaying only destinations in our top 30 routes (rather than a random selection) will cause an increase in clicks.

*Test like you’re wrong:*

We will test this by assuming the prediction has no effect (the null hypothesis) and running an experiment for 2 weeks.

If we can measure a 33.62% statistically significant change in click-through rate then we reject the null hypothesis and conclude there is an effect.

*Hypothesis Kit was developed with Rik Higham with contributions from David Pier, Lukas Vermeer, Ya Xu and Ronny Kohavi. “Design Like You’re Right..” credits Jane Murison and Karl Weick. Original Hypothesis Kit from Craig Sullivan.*
