Your A/B tests are part of the replication crisis

Experiments guide the direction of your SaaS. Reproducible experimentation is the difference between making changes that move the needle and running in circles trying to figure out what is happening.

A/B testing in your SaaS is part of the replication crisis.

Experiments you run in your SaaS guide its growth. Got more free trials by changing the text on your major buttons? Then you will probably keep testing button text. But if you misinterpret the results, or they are not reproducible, you will spend your time wondering why two similar experiments show wildly different results.

Running reproducible experiments will guide you toward making changes that move the needle and away from running in circles scratching your head.

Even academic researchers struggle with reproducibility

Academia, a field filled with people who are really good at statistics, has a replication problem.

Many academic studies cannot be reproduced, and producing unreproducible research was one of my biggest fears during my PhD. Research built on top of unreproducible studies may itself be incorrect, and studies that cannot be reproduced can push entire scientific fields in unproductive directions.

The average academic is likely better at statistics than the average Data Scientist or Developer at a SaaS company. If academics struggle to produce reproducible results, you should expect your SaaS to struggle too.

Incentives play a large role

If someone is incentivized to show a “positive” result, then they are more likely to push the boundaries to get one.

In academia, researchers are rewarded for publishing in prestigious journals. Publication is the path to career advancement. Scientists are heavily incentivized to produce positive results because journals often publish little else. Some researchers go as far as faking their data.

Similar incentives exist in SaaS companies.

Gary from Marketing manages to get his A/B test with redesigned payment pages running. It took quite a bit of work to go live: design, development, testing. The opportunity cost was around 100K. If more people sign up with the updated payment pages, then there will be a large Return On Investment.

If one of Gary’s variations outperforms the baseline and increases his employer's long-term MRR, then Gary likely gets a pat on the back for a job well done. Well done, Gary. If none of Gary’s variations outperform the baseline, then he has an awkward conversation with his Product Manager.

Gary is incentivized to show his experiments produce results with positive business value.

Policy is not the solution

There is nothing like a good old policy to make everyone conform and fix all of your problems - 11/10 security engineers agree.

A policy like requiring A/B tests to have a p-value under 0.05 is not the solution to the reproducibility problem in your SaaS. Academia has been trying this one for a while, but it didn’t really help. Scientists just started p-value hacking.

If a scientist trained in statistics can misinterpret p-values then so can Gary from marketing.
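
To make one common form of p-value hacking concrete, here is a minimal, hypothetical simulation (the conversion rate, batch sizes, and helper names are my own inventions, not from any real experiment). Both variants are identical, yet peeking after every batch of visitors and stopping at the first p < 0.05 declares a “winner” far more often than the 5% the threshold suggests:

```python
# Simulating "peeking": two identical variants, a p-value check after every
# batch, and stopping as soon as p < 0.05. All numbers are illustrative.
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_experiment(rate=0.05, batch=200, checks=20):
    """Both variants convert at the same true rate; peek after each batch."""
    conv_a = conv_b = n = 0
    for _ in range(checks):
        conv_a += sum(random.random() < rate for _ in range(batch))
        conv_b += sum(random.random() < rate for _ in range(batch))
        n += batch
        if z_test_p_value(conv_a, n, conv_b, n) < 0.05:
            return True  # declared a "winner" between two identical variants
    return False

random.seed(42)
false_positives = sum(peeking_experiment() for _ in range(1000))
print(f"False positive rate with peeking: {false_positives / 1000:.1%}")
# Typically well above the 5% that a single p < 0.05 check would give.
```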

Focus on the process, not results

Good experimentation is more about process than results.

We can't control experiment results. The best we can do is run experiments that are worth running, and make sure they are run fairly. Through good experimentation we learn what kinds of experiments are worth running. If an experiment is not replicable because we are focusing on getting positive results, then the things we "learn" might be incorrect and send us down the wrong path.

Experiment to learn

Experimentation is an excellent way to learn things about the world.

A/B tests are used to figure out what impact a change has. As you run more tests you will learn more about your customers or users. Experiments can build on each other - if you figure out that customers are sensitive to the color of your homepage Call To Actions, then you can try changing the Call To Action colors in other parts of your app.

To maximize learning, don’t let “success” or “failure” depend on whether your variation improved the metric you are measuring.

Look for sensitivity

If users are sensitive to a change, then it might move the needle.

Users will be more sensitive to certain kinds of changes than others. Tried a few different images on your homepage and the signup rate didn’t change? It might be tough to find an image that increases signups. Changed the color of your signup button and signups decreased 10%? Fantastic - you might be able to find a different color that increases signups by 10%. In both cases, we have learned something.

Finding sensitivity paves the way for finding changes that can move the needle on business metrics later.
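
If you want to put a number on sensitivity, one hedged approach is a confidence interval on the change in conversion rate. The sketch below uses the normal approximation, and the visitor and signup counts are invented to mirror the button-color example above; treat them as placeholders:

```python
# A rough sketch: 95% confidence interval for the difference in signup rate
# between a baseline and a variant, using the normal approximation.
import math

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% CI for (variant rate - baseline rate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Invented counts: baseline converts 2000/40000, recolored button 1800/40000.
low, high = diff_confidence_interval(2_000, 40_000, 1_800, 40_000)
print(f"Change in signup rate: [{low:+.2%}, {high:+.2%}]")
# An interval that clearly excludes zero suggests users are sensitive to the
# change; an interval straddling zero means no sensitivity was detected.
```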

Culture helps

Without a culture of experimentation, negative results will “disappear”, and that’s a bad thing.

If Gary has to have an awkward conversation every time an experiment fails to find a better variation, or fails to find something users are sensitive to, then he may stop reporting these results. This can take several forms: he may (accidentally) start p-value hacking, introduce biases into result presentation or experiment design, or simply not report non-results. Some companies may even decide to stop experimenting because “they are not getting anything from it”.

Effective experimentation is not possible without an objective view of all of the results.

Go forth and experiment

I hope that this post has provided some food for thought on reproducibility and incentives in your SaaS company. Reproducibility is both very important and very difficult to get right.

I am keen to hear from, and learn from, anyone at a SaaS company running experiments. You can reach me on Twitter.