How long to run an A/B test?

Shi Wah Tse · Published in Bootcamp · 3 min read · Jul 1, 2021

I have an idea I’ve been experimenting with, and it’s at a point where I want to run an A/B test, to see how customers really behave with it and whether it actually hits those viability metrics.

Does it actually convert to more sales? Reduce drop-offs?

I use this calculator to figure out how long to run the A/B test: https://docs.adobe.com/content/target-microsite/testcalculator.html

[Image: the Adobe Target sample size calculator]

How I fill in this calculator:

1. Confidence level

It defaults to 95%. I’m tempted to lower this to 85% to make the test run shorter, but our analytics guru says he can’t give the business an accurate estimate if it’s less than 90%.

“I can’t create a report back to the business that this idea generated $X with anything less than 90% confidence.”

I asked a few analysts why 95%. A few of their answers:

  • One talked about a marble story (too long to repeat here) “If you had a jar of 100 marbles, half of them black and white..”
  • One explained statistical significance “Statistical significance — if we run this test again 100 times, 95 times it will have the same result.”
  • One explained it by showing a graph: “What we really want to see in hypothesis or A/B testing, is the green distribution (exp B) is more to the right, less overlap, ie. the conversion of exp B is always better than control.”
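
For what it’s worth, here’s my rough mental model of where that 95% actually goes. In the standard sample-size maths (a generic textbook sketch, not necessarily what the Adobe calculator does internally), the confidence level only turns into a critical z value, and a lower confidence means a smaller z, which means fewer visitors and a shorter test, which is exactly why I’m tempted to lower it.

```python
# A rough sketch (textbook z-test maths, not Adobe's internals): the
# confidence level only enters the sample-size formula as a critical z value.
from scipy.stats import norm

for confidence in (0.95, 0.90, 0.85):
    alpha = 1 - confidence              # allowed Type I error rate
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical z value
    print(f"{confidence:.0%} confidence -> z = {z_alpha:.2f}")

# 95% confidence -> z = 1.96
# 90% confidence -> z = 1.64
# 85% confidence -> z = 1.44
```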

2. Statistical power

So I leave it at 80%. Here is what our analyst had to say:

“The usage of statistical power in the sample size calculator is to make sure we have enough data, so that we know the detection of the difference between things is not by chance.”

“For estimation of sample size, we usually leave the power as 80%. Once we got the data (good enough data) of an a/b test, we use statistical significance to decide the result.”

She also then mentioned something complicated I don’t really understand:

“From an experiment point of view, or a clinical trial, Type I error (statistical significance) is more crucial than Type II error, so a 5% significance level is used to determine the experiment result, rather than the probability of Type II error (20%).

We also use other methods (e.g. sequential or conditional probability) to check our experiment results.”

I’ll leave it with the analysts :)
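
That said, if you’re curious how the 95% confidence and the 80% power both feed into the sample size, here’s a minimal sketch using the standard formula for comparing two conversion rates. The numbers are hypothetical, and this is the generic textbook version rather than the Adobe calculator’s exact method.

```python
from scipy.stats import norm

def sample_size_per_variant(p_control, p_variant, confidence=0.95, power=0.80):
    """Visitors needed per variant to detect a change from p_control to p_variant."""
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)  # Type I error: significance level
    z_beta = norm.ppf(power)                      # Type II error: 1 - power = 20%
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2

# Hypothetical example: baseline 5% conversion, and we want to detect a lift to 6%
print(round(sample_size_per_variant(0.05, 0.06)))  # ~8,155 visitors per variant
```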

3. Baseline conversion rate

If the business sets the success metric at a +1% uplift in conversion, then I put in 1.

4. Total number of daily visitors

This is a number I get from our current Adobe Analytics: how many users go through this flow daily.

5. Number of Offers Including Control

It means how many variants you want to run (A/B/C/D/E, etc.), including the control. So I put in 2 if I just want to run an A/B test.

Results

After inputting the above, I look at the lift row, find the value that matches our viability success metric (e.g. an uplift of 1%), and see how many days the test takes to complete.
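
To sanity-check the calculator’s output, I sometimes do a quick back-of-the-envelope version of the same calculation. This reuses the sample_size_per_variant sketch from earlier; the baseline, lift and traffic numbers below are made up for illustration, and I’m treating the +1% as a one-percentage-point change.

```python
# Hypothetical inputs; the real baseline and traffic come from Adobe Analytics.
baseline = 0.05          # 3. current conversion rate of the flow
lift = 0.01              # the +1% uplift we want to detect (as percentage points)
daily_visitors = 2000    # 4. total visitors entering the flow per day
offers = 2               # 5. number of offers including control (a plain A/B test)

n_per_variant = sample_size_per_variant(baseline, baseline + lift)
days = n_per_variant * offers / daily_visitors
print(f"~{days:.0f} days to detect a {lift:.0%} lift at 95% confidence, 80% power")
```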

The problem I run into is when it takes too long to run an A/B test to reach at least 90% confidence, in which case we have to get ‘creative’ with the measurement and experience design. I can write another story about that :)


Sydney-based UX designer who plays with code. I crack open ideas for a living!