There’s a 5% chance that these results are total bullshit

Warning! This post has fallen victim to the base rate fallacy and needs to be amended. While the overall message of ‘push for greater statistical significance’ still rings true, the statistical conclusions depicted below are dubious at best. My apologies for spreading a popular misconception, and thank you to Peep Laja and Ryan Thompson for helping to shed light in the darkness. Updated article coming soon.

In the world of A/B testing, determining statistical significance can be a precarious business. We’re taking techniques that were designed for static sample sizes and applying them to continuous datasets – a context they were never intended for. Of course there are going to be issues.

Compounding that complexity, I’m sure many of you have had encounters wherein teammates wanted to call the result of an A/B test early, before it was really done baking.

This is why writers like Evan Miller have penned posts on the perils of A/B testing, and companies have gone so far as to rebuild traditional statistics to produce more bulletproof testing readouts.

It’s why people argue for A/A testing to illustrate these dangers, stirring others to rebel against A/A testing as wasted testing time.

Yet even with these lessons and safeguards in place, so long as testing software continues to visualize anecdotal results, there will come a day when you’re asked to call a test prematurely.

And it’ll be tempting too! The acclaim that comes with a successful test, the unconscious desire to see your hypothesis confirmed, the pressure to show results – it’s enough to bias the very best.

But is that test you’re running really a winner?

As guardians of the sanctity of our test results, we have to be strong when no one else will. So, from one guardian to another, allow me to unveil my favorite new tool in this struggle: a simple rhetorical device to reframe the current state of test results.

Stop saying:
“We’ve reached 95% statistical significance.”

And start saying:
“There’s a 5% chance that these results are total bullshit.”

Seems harsh at first, right?

But both statements are equally accurate. The latter just reminds us of the truth: that 95% statistical significance means there’s still a 1-in-20 chance that our seemingly victorious experiment is actually the result of random variation.[1]

Let me put that another way: if you’re running squeaky-clean A/B tests at 95% statistical significance and you run 20 tests this year, odds are one of the results you report (and act on) is going to be straight-up wrong. You’re not being a jerk by reminding people of this; you’re doing them a favor, even if it doesn’t feel like it.
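If you want to sanity-check that arithmetic, here’s a quick back-of-the-envelope sketch in Python. It assumes the worst case (every variant you test is truly no better than its control) and treats the tests as independent, so take it as an illustration of how often pure chance can hand you a “winner,” not a forecast for your particular roadmap:

```python
# Back-of-the-envelope sketch: how often does pure chance clear the significance bar?
# Assumes every variant is truly no better than control (all null hypotheses hold)
# and that the tests are independent of one another.

def chance_of_at_least_one_false_positive(alpha: float, num_tests: int) -> float:
    """Probability that random noise produces at least one 'significant' result."""
    return 1 - (1 - alpha) ** num_tests

for alpha in (0.05, 0.01):
    p = chance_of_at_least_one_false_positive(alpha, num_tests=20)
    print(f"alpha = {alpha:.2f}: {p:.0%} chance of at least one bogus 'win' in 20 tests")

# alpha = 0.05: 64% chance of at least one bogus 'win' in 20 tests
# alpha = 0.01: 18% chance of at least one bogus 'win' in 20 tests
```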

When your colleagues and co-founders make decisions based on these test results, they’re depending on you to serve as a lens of truth. They deserve to understand the test results they see, and they deserve to work from numbers that they can truly trust.

Yes it will take longer. Yes we all want to move fast. Yes 95% is the industry standard.

I’m here to tell you that you can and should do better than the industry standard. You’re not in this line of work to provide the industry standard. Being wrong 1 time in 100 is a radically better outcome for your company than being wrong 1 time in 20, especially when you’ve got million-dollar decisions on the line.[2]

There are thousands of choices you’ll have to make with insufficient data, but this isn’t one of them. You owe it to your team to achieve superior clarity; all you’re asking for is a little patience to get the job done right.
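How much patience, roughly? Here’s a sketch of the traffic cost using the standard two-proportion sample-size formula. The inputs below (a 5% baseline conversion rate, a hoped-for lift to 6%, 80% power) are illustrative assumptions, not numbers from anyone’s real funnel:

```python
# Rough sketch of the traffic cost of 99% vs. 95% significance, using the
# classic two-proportion sample-size approximation. The 5% baseline, 6% target,
# and 80% power are illustrative assumptions; substitute your own numbers.
from math import ceil
from scipy.stats import norm

def visitors_per_arm(p_control: float, p_variant: float,
                     alpha: float, power: float = 0.80) -> int:
    """Approximate visitors needed in each arm of a two-sided A/B test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_power = norm.ppf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_control - p_variant) ** 2)

for alpha in (0.05, 0.01):
    print(f"alpha = {alpha:.2f}: ~{visitors_per_arm(0.05, 0.06, alpha):,} visitors per arm")

# alpha = 0.05: ~8,155 visitors per arm
# alpha = 0.01: ~12,135 visitors per arm (roughly 50% more traffic)
```

In this toy example, moving from 95% to 99% confidence costs roughly 50% more traffic per arm; your numbers will vary with your baseline rate and the lift you’re chasing.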

Put your own personal spin on this rhetorical technique if you need to, but give it a try — it’s a quick way to remind stakeholders of the fallibility of our A/B tests and push for a higher degree of statistical confidence (hello 99%).

That’s a result we can be proud of, that’s a result we can get behind, that’s a result we can build a culture of continuous testing on.

So say it with me, one more time:
“There’s a 5% chance that these results are total bullshit.”

Good luck out there, and be strong.

[1] That’s assuming you’re running at high statistical power with a single variant! Never mind what happens if you raise your β value or increase the number of experimental variations: the likelihood of Type I and Type II errors mounts very quickly, and you’ll get yourself into all sorts of statistical nastiness.
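For the curious, here’s a minimal sketch of the multiple-variant half of that problem. It treats each variant-versus-control comparison as independent (they aren’t quite, since they share a control group) and ignores the power side entirely, but it shows how quickly the family-wise error rate climbs and what a simple Bonferroni correction does about it:

```python
# Minimal sketch of how extra variants inflate the chance of a false positive.
# Treats the variant-vs-control comparisons as independent, which is an
# approximation since every variant shares the same control group.

def familywise_error_rate(alpha: float, num_variants: int) -> float:
    """Approximate chance of at least one false positive across all comparisons."""
    return 1 - (1 - alpha) ** num_variants

for k in (1, 2, 4, 8):
    fwer = familywise_error_rate(0.05, k)
    bonferroni_alpha = 0.05 / k  # per-comparison alpha that keeps the family near 5%
    print(f"{k} variant(s): ~{fwer:.0%} familywise error rate; "
          f"Bonferroni per-comparison alpha = {bonferroni_alpha:.4f}")
```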

[2] I will of course acknowledge that many companies do not have the luxury of reaching the highest levels of statistical significance due to traffic constraints. If this is the case for your company, it’s still incredibly important to be upfront about the accuracy (or inaccuracy) of whatever test results you do achieve.