A/B testing has undoubtedly become the buzzword of the marketing world. It has the potential to transform your marketing approach and fundamentally enhance the way you do business online.

It is the only reliable way of establishing cause and effect. In fact, 75% of the internet retailing top 500 use an A/B testing platform, while 61% of organisations plan to bolster their testing services in the next 12 months.

And yet: poor A/B testing methodologies are costing online retailers up to $13bn a year in lost revenue. 

That’s a really big number. It’s no longer enough to say that you use A/B testing. How you do it is far more important. Here are three A/B testing horror stories.

The cases are anonymous, but the scenarios are very real. Avoiding these traps can help you transform an A/B horror story into the marketing fairytale you always dreamed of.

Let’s kiss the toad and turn him into a prince.

Horror story 1: Variation and the case of the flipping coin

One travel website wanted to redesign its entire booking funnel. The plans were drawn up and the design was put live to 50% of its visitors in order to gauge its effectiveness.

Early results looked great and suggested a fairly significant revenue uplift. Within three days, the manager asked why they were not showing the new design to the other 50% of visitors, since the results looked so good on the data collected to date.

So the new funnel was hard-coded into the site for all visitors, despite the small sample size.

A couple of weeks later, that same manager was left wondering why revenues had fallen significantly. The new booking funnel was suspected, and the A/B test was finally run to its full sample size.

The actual result of the full test was a 10% drop in conversion rate.

Moral of the story

Ever flipped a coin to find out who gets the last piece of cake? The first few times you flip it, it's quite possible you'd get all heads or all tails. But if you flipped it 100 times, the results would settle close to 50/50. The more times you run a test, the more accurate your estimate becomes and the less the results are dominated by randomness.

Sample size and test length need to be large enough to shift the balance away from statistical randomness. Looking at test results too early, or every day, means you might see a strong uplift before a significant sample has been collected. Variation is especially pronounced at the start of a test.
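The coin-flip intuition is easy to check in code. A minimal simulation sketch (the function name is illustrative):

```python
import random

def heads_fraction(n_flips, seed=0):
    """Flip a fair coin n_flips times and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# Small samples swing wildly; large samples settle near the true 50%.
for n in (10, 100, 10_000):
    print(n, heads_fraction(n))
```

Ten flips can easily come out 70/30; ten thousand will sit very close to 50/50. The same is true of visitors flowing through a test.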

So, it’s back to the drawing board! But testing properly saved this company a significant amount of money.

Horror story 2: The tale of false positives and the pregnant man

Multivariate testing (or MVT) is a common method used in A/B testing. It serves various combinations of site elements - colour, positioning, copy, etc. - to visitor segments in an attempt to find the most effective variant.
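To see why MVT variant counts grow so quickly, here is a minimal sketch with hypothetical element options - each extra element multiplies the number of combinations you have to serve and compare:

```python
from itertools import product

# Hypothetical element options for a multivariate test.
colours   = ["red", "green", "blue"]
positions = ["top", "bottom"]
copy_text = ["Buy now", "Book today"]

# Every combination becomes a variant that needs its own sample of visitors.
variants = list(product(colours, positions, copy_text))
print(len(variants))  # 3 x 2 x 2 = 12 combinations
```

Twelve variants already means splitting your traffic twelve ways; 100 variants means each one gets a tiny sample, which is exactly where false positives thrive.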

One company undertook an MVT with 100 variants and waited until some results reached statistical significance - (apparently) following testing best practice!

A few experiments produced 'winners', and their ecommerce director happily implemented one variant into the site code. Yet at the end of the year, they had trouble proving any revenue uplift despite the win.

When they came to us, we re-ran the tests as a validation phase: the winning variants either had zero impact or were genuinely damaging conversion. The result of the original test: a loss of time and money.

Moral of the story

Without validating your winning variations through a second testing phase, you don't know whether they won because of a true effect or because of random variation.

The chance of false positives is high with 100 variations - something will 'win' just by chance. Variation means that if a man took a pregnancy test 1,000 times, at least one would almost certainly come back a false positive!
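That 'something wins by chance' effect follows from basic probability. A minimal sketch, assuming independent comparisons each run at a 5% significance level:

```python
def p_false_positive(n_variants, alpha=0.05):
    """Probability that at least one of n independent comparisons
    'wins' purely by chance at significance level alpha."""
    return 1 - (1 - alpha) ** n_variants

print(p_false_positive(1))    # a single test: 5%
print(p_false_positive(100))  # 100 variants: ~99.4%
```

With 100 variants tested at the conventional 5% level, a spurious 'winner' is close to guaranteed even if none of the changes does anything at all.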

If you notice you’ve had lots of winning tests but no increase in your bottom line, ask your testing company: what is your false positive rate?

Horror story 3: Early data and the sad story of leaving at half-time

A different travel firm invested thousands in a new product page design and put it live as a 50/50 experiment.

After the first week, management were calling to halt the test - conversion rates had crashed! But stopping before the test had reached statistical power would have left them with no way of knowing its true effect.

A disgruntled marketer did the calculations and, just before the plug was pulled, showed that the probability of the negative result being a true one was less than 30%.

They showed their working to management, and were given the opportunity to let the test run to completion.

The new product page design was shown to be conclusively more successful, and the company decided to align its marketing KPIs with revenue driven, rather than the number of test uplifts.

Moral of the story

Early data gathered before significance is likely to be the result of random variation. Stopping before an experiment has reached significance reduces your power - the chance of detecting a true effect - so you risk missing a winning variation.
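How big is 'big enough'? A rough sketch using the standard normal-approximation sample-size formula for a two-sided comparison of two conversion rates (the example rates are hypothetical):

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Visitors needed per arm to detect a conversion change from p1 to p2
    (two-sided z-test for two proportions, normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_b = NormalDist().inv_cdf(power)          # critical value for power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Detecting a 10% relative lift on a 3% conversion rate takes tens of
# thousands of visitors per arm - far more than a few days of early data.
print(sample_size_per_arm(0.03, 0.033))
```

Small uplifts on small base rates need surprisingly large samples, which is why "it crashed in week one" tells you almost nothing.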

Also, a new site design that interrupts visitors in the middle of their research phase may have a short-term negative impact, so your test must have a strong hypothesis grounded in the context of the experiment. Make sure stakeholders know this.

A/B testing can be a minefield. Make sure your testing process is rigorously aware of the statistical traps that can lose you time and money.

Don’t count your chickens before they’re hatched! You might be losing at half-time and tempted to leave the stadium...

Published 27 February, 2014 by Ian McCaig

Ian McCaig is Founder at Qubit and a contributor to Econsultancy.


Comments (9)



Thanks for the article!

“Without validating your winning variations through a second testing phase, you don't know whether they won because of a true effect or because of random variation.”

Could you clarify this – are you advocating that for every winning test, you validate the results by re-running that same experiment (immediately following the first test's completion)?

over 4 years ago


Dean Hurcombe, Marketing Executive at LSL CCD

@Brian - I'm not sure if this answers your question, but this is how I've always interpreted and applied secondary testing with specific regard to MVT.

In practical terms, if you ran A/B testing over individual site variables such as colour, point size, typeface, positioning etc., it's easy to view each variable as a channel, and each channel may have a clear statistical leader, so the common-sense approach may be to assume that you can take each 'winning' variable and combine them to create the He-Man of all sites!

But in reality, it is paramount to embark on the secondary testing phase to negate both the false positive factor and what I've come to call the 'Galacticos' factor (WARNING! Ubiquitous sporting analogy alert!) .... taking the best performing player in each position and putting them all in the same team may look like the winning combination on paper, but it's only after you see them play a match on a real pitch (secondary testing) that you know if you have the optimum combination! It may be that some of the lesser rated players, when played in combination with the Galacticos, provide the greatest team!

Ian - this is a fantastic article, I wish I could've explained these points so succinctly, it would've saved me a lot of time trying to explain to Sales Directors why A/B and subsequent UAT testing needs to take so long! I'll be citing many of your points verbatim in future.

over 4 years ago


Dean Hurcombe, Marketing Executive at LSL CCD

I'd like to add that the resistance I've previously faced from Sales Directors was in previous employment, the SD's at my current company are all enlightened scholars, gents and thoroughly nice chaps! ;-)

over 4 years ago


Deri Jones, CEO at SciVisum Ltd

Good article, I guess you're saying overall that Marketers (and indeed most people) don't understand basic statistics and probability theory!

And hence many business decisions are made based on numbers that don't mean what they are thought to mean!

And A/B testing is most prone to this, as it claims to be an evidence-based, numerical, fully non-subjective thing (unlike many of the decisions that marketers have to make).

I'm sure there's a Dilbert cartoon that would make this point better than I have just done!

over 4 years ago

Ian McCaig, CMO & Founder at Qubit

Brian: In an ideal world, we would validate our A/B tests immediately after completion, in order to reduce the probability of false positives. But sometimes this is not possible, and there are other options. At Qubit, when we implement the winning version of a test, we usually keep a 5% control group that exists to monitor the ongoing conversion rate and to calculate incremental revenue.

Dean: Thank you for your kind words. I agree that one of the main difficulties with calculating the effect of multiple tests is the interaction between the changes. Often the thing that has the largest effect is a significant change, but it does not follow that multiple significant changes to a website will lead to an even larger effect. If you want some collateral to show to the non-believers, we have recently published a whitepaper, 'Most Winning A/B Test Results are Illusory', which you can find here: bit.ly/qABresearch.

Deri: This is true. In our experience, many marketers are not even aware of the concept of statistical power and the necessity of waiting until you have a large enough sample size. This is little better than allowing a random process to decide the future of your company. Hypothesis-driven A/B testing carried out with statistical rigour is the way forward.

over 4 years ago

Ian McCaig, CMO & Founder at Qubit

If you would like to see a more visual explanation of the concepts in this blog post you can check out our infographic - http://bit.ly/qABinfographic

If you want to get into the details our data scientist completed a whitepaper on this subject - bit.ly/qABresearch


over 4 years ago

Chris Michael, Digital Transformation Consultant and CTO at CJEM

Some great points made here.

But the biggest causes of failure in A/B testing are:

1. Lack of clear objectives/goals, e.g. are we trying to get signups, sales or both? And what is the desired/expected increase?
2. Jumping to a hypothesis/designing an alternative experience based on opinion rather than facts and data. You need to start with a good understanding of how users currently behave, where the pain points are and what they do/don't respond well to.
3. Testing too many variations at the same time. You have already touched on this above. A series of simple, iterative tests is always better than a single big test. Testing should be always on and BAU.
4. Assuming all customers are the same. You need to break down and test separately by device, user type etc. Mobile users often behave differently from desktop users, and returning/existing customers differently too.
5. Lack of balls. In my experience, simple always wins. You should be prepared to test at least one route with the bare minimum of information (e.g. removing cross-sell info from landing pages and navigation from booking/order pages).

over 4 years ago



Dean: Regarding your 'Galacticos factor' - if this is the trend you are observing, it is likely that either (1) more "better" participants are going to one variation over the other, which is common early on with a small sample size, or if the test tool was set up incorrectly and isn't allocating visitors at random, or (2) you are observing regression toward the mean. The latter is quite common if you aren't segmenting returning visitors/customers, who have a bias due to the novelty of "trying" the new treatment (which tends to skew results positively early on) or who have to re-learn an already known process (which tends to skew results negatively early on). This phenomenon is discussed in the whitepaper Ian referenced (a great read!)

over 4 years ago

Pete Austin, Founder and Author at Fresh Relevance

Re: MVT with several variants reduces the significance.

This is a huge problem in several scientific fields, the most high-profile being climate studies, where some researchers keep tweaking their computer model and comparing it against their data until it fits the figures. They then assume it's proved - not realizing that the number of tweaks they made (or maybe planned to make) has reduced the probability that the fit is a real effect.

This also seriously affects real-time marketing, so does anyone have a link to the math needed to calculate the aggregate probabilities, in a form that can be programmed? Proper math please!

Your solution of re-running the test is completely valid, but I would rather not have to do this in cases where the results of the first run are significant.

over 4 years ago
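For readers wanting the programmable form of the multiple-comparisons math discussed above: assuming independent tests, the Šidák correction (or the simpler, slightly conservative Bonferroni correction) gives the per-test significance level needed to keep the family-wide false positive rate fixed. A minimal sketch:

```python
def sidak_alpha(family_alpha, n_tests):
    """Per-test significance level so that the chance of ANY false positive
    across n independent tests stays at family_alpha (Sidak correction)."""
    return 1 - (1 - family_alpha) ** (1 / n_tests)

def bonferroni_alpha(family_alpha, n_tests):
    """Simpler, slightly more conservative alternative."""
    return family_alpha / n_tests

print(sidak_alpha(0.05, 100))       # ~0.000513
print(bonferroni_alpha(0.05, 100))  # 0.0005
```

With 100 variants, each individual comparison has to clear roughly a 0.05% threshold, not 5%, for the family of tests as a whole to keep its nominal error rate.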
