Why is A/B Testing so hard to stomach?

Image by Horia Varlan

In my daily life and conversations with other companies, I hear comments like this pretty often: “A/B tests are a waste of time. They take a long time to run, they make us display a bad version of the product to some users, and they usually fail to show anything.”

I’ve thought about this a while, and come up with a response:

A/B tests usually show that whatever you’re doing to improve the product isn’t working. It’s not the A/B test that’s “failing”.

But that can be really hard to stomach. So, I thought I’d unpack some of the common concerns I’ve heard about A/B tests (and their cousins, multivariate tests and A/B/C tests), and try to address what’s going on.

Randomized control trials like A/B tests certainly aren’t perfect, but for testing whether a change in the app actually improves something (e.g., conversion, user engagement, impact), they are the gold standard. They’re the best thing we’ve got.

A/B tests usually fail

Many people get frustrated at test results that come back with no clear difference between the versions they’re testing, and call that “failing”. Assuming you designed and ran the test correctly (see below), a “no difference” result should be celebrated. It tells you that you weren’t changing something important enough for the test to have found a difference. Either be happy with what you have, or try something more radical. It saves you from spending further time on your current approach to improving the product.

What’s a well-designed test? It’s one where you’ve defined success and failure beforehand. It’s not one where you go searching for statistical significance (that’s a prerequisite, not the goal). For example, let’s say you have a potential new feature / button color / cat video. How much of an impact does it need to have before you care? If you improve conversions by 20%, is that your threshold for success? Is it worthwhile to work on this further if you’re only getting a 2% boost? That definition of success and failure, along with the amount of noise in the system, determines how many people you need in the test. If the test comes back with “no difference”, that doesn’t necessarily mean “there’s no effect”; it means there’s no effect big enough for you to care about. You can move on.
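
To make that concrete, here’s a rough back-of-the-envelope sketch in Python (standard library only) of how the threshold you choose translates into the number of users you need per variant. The 5% baseline conversion rate, the lift thresholds, and the function name are my own illustrative assumptions; the formula is just the standard normal-approximation sample size calculation for comparing two proportions.

```python
import math
from statistics import NormalDist

def users_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided test comparing
    two conversion rates (standard normal-approximation formula)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# With a hypothetical 5% baseline conversion rate:
print(users_per_variant(0.05, 0.20))  # roughly 8,000 users per variant for a 20% lift
print(users_per_variant(0.05, 0.02))  # roughly 750,000 per variant for a 2% lift
```

The exact numbers don’t matter; the point is that caring only about large effects keeps the test small, while chasing tiny effects demands orders of magnitude more users.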

[See also: Max Al Farakh’s interesting post on this topic: “Most of your A/B Tests will fail”.]

A/B tests mean you’re showing a bad version of the product to some of your users

If you have a good UX team, then most of the time, no one really knows if a change in the app will improve it. You can’t accurately predict whether the new version will be better or worse.

That’s right: our seemingly solid hunches are usually random guesses, especially when we have a good design team.

There are two reasons why.

First, a good UX team will deliver an initial product that is well designed, and will deliver product improvements that are also well designed.  We all make mistakes, but a good design team will get you in the right ballpark with the first try.  By definition, further iterations are going to have a small impact relative to the initial version of the product.  Don’t be surprised that new versions have similar results (conversion, etc.) to earlier versions – celebrate the fact that the earlier version was a good first shot.

Second, human behavior is just really confusing. I spend most of my days thinking about how to design products to support behavior change, and it’s still confusing. In many cases, we just can’t forecast how people will react to changes in the product. In familiar situations, we can and should use our intuition about a set of changes to say which one is likely to be better – like when we’re applying common lessons we’ve learned in the past. But, when you have a good design team, the common lessons have already been applied. You’re at the cutting edge, and so your intuition can’t help anymore. That’s why you need to test things, and not rely (solely) on your intuition.

A/B tests take a lot of time

One of the reasons that A/B tests and other randomized control trials appear to take a lot of time is that they aren’t always the best tool to use. They are the final arbiter of success or failure; in many cases they should only be used after a series of simpler, quicker tests have pointed you in the right direction. For example, if you just need a rough sense of whether a new direction you’re considering is worthwhile, then do a quick and dirty test – do some interviews or testing with clickable prototypes before building anything. If the results of those quick and dirty tests are exciting, go for it. Build out the app, and A/B test it. If not, maybe it’s not worth further development (or testing) until you have a better option.

The amount of time that a randomized control trial takes depends on the number of people who are flowing past the test, and how big of an impact you need to get in order to care about the result. If you’re looking for a really small change, or you have relatively few users looking at the test, then it’s going to take longer. That’s true, and unavoidable.
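
To put a number on that, here’s a small sketch, following on from the earlier one, that converts a required sample size into calendar days. The visitor counts and sample sizes are made-up figures for illustration, and it assumes every visitor is enrolled and split evenly between two variants.

```python
import math

def test_duration_days(users_per_variant, daily_visitors, traffic_share=1.0):
    """Days needed to fill a two-variant test when `traffic_share` of
    daily visitors are enrolled and split 50/50 between the variants."""
    enrolled_per_day = daily_visitors * traffic_share
    return math.ceil(2 * users_per_variant / enrolled_per_day)

# With a hypothetical 2,000 visitors a day, all enrolled in the test:
print(test_duration_days(8_000, 2_000))    # ~8 days to detect a 20% lift (5% baseline)
print(test_duration_days(750_000, 2_000))  # ~750 days to detect a 2% lift
```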

A/B tests mean you aren’t confident in the changes you’re proposing

This is another issue I’ve heard, and it’s a really tricky one. You naturally expect that any changes that you’re planning to make to the product will improve it. As I mentioned above, that’s often not the case (since it’s hard to make a good product better, and human behavior is inherently complex).

That sets up a problem of cognitive dissonance, though. It’s very uncomfortable to think that some of the changes you’ve carefully planned out, thought about, and decided will help are actually going to do nothing – and you don’t know which ones those are! It would be like admitting a lack of confidence in the changes that you’ve bought into. So, a natural (but dangerous) response is to plough ahead and decide that testing is not needed.

There’s no simple solution to address this situation – the need to confidently build something you shouldn’t actually be confident in. The best approach that I’ve come across is to move the testing process out of the reach of that cognitive dissonance. Make testing part of the culture of the organization; make it a habit that’s followed as standard procedure, and not something that the organization agonizes over and debates each time a new feature is added.

There’s lots more that one could say about A/B tests and other types of randomized control trials. Those are some of the biggest concerns I’ve heard, however. I’d love to hear what you think.

Lessons

  • Results that show no difference between versions of your app (on well-designed tests) are excellent – they tell you that the current approach isn’t working. You’re not making a big enough difference to warrant further effort.
  • Changes to the product are often going to have a small impact.
  • When confronted with novel situations, our intuitions usually aren’t that good.
  • If you can get by with a quick and dirty test before the product is built, do it – save the A/B tests for cases where they are really needed (as a final arbiter of success and failure).
  • Make A/B testing a routine part of product development.

-Steve
Please follow me on twitter @sawendel
