
How to Read an A/B Test Calculator (Without a Statistics Degree)

_By Atticus Li — Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com._




---

The first time I had to plan an A/B test, I opened a sample-size calculator and stared at five input fields I didn't understand. Baseline conversion rate, OK. Statistical confidence, sure, 95%. But "minimum detectable effect"? "Statistical power"? "Number of variants including control"? I clicked around and got a number that I had no idea how to defend.

Most A/B test calculators assume you already know the statistical vocabulary. This post is the guide I wish someone had handed me — what each input actually means, what each output is telling you, and how to make sensible choices without needing a statistics degree.

The five inputs every calculator asks for

1. Baseline conversion rate

The conversion rate of your control (the unchanged version). If 100 visitors hit the page and 5 convert, the baseline is 5%.

This input is doing two jobs. First, it's telling the calculator the variance of your data — proportions close to 50% have higher variance than proportions near 1% or 99%, and that affects sample size. Second, when you set a _relative_ MDE later, the baseline is what you multiply against to get the absolute target.

A 5% baseline with a 10% relative MDE means: detect a lift from 5.0% to 5.5%. A 50% baseline with a 10% relative MDE means: detect a lift from 50% to 55%. Same "10% lift" — wildly different absolute effects, and very different sample requirements.
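
If it helps to see that arithmetic spelled out, here is a minimal sketch (plain Python, illustrative numbers only):

```python
def absolute_target(baseline, relative_mde):
    """Turn a baseline rate and a relative MDE into the absolute rate to detect."""
    return baseline * (1 + relative_mde)

print(round(absolute_target(0.05, 0.10), 3))  # 0.055 -> detect a lift from 5.0% to 5.5%
print(round(absolute_target(0.50, 0.10), 3))  # 0.55  -> detect a lift from 50% to 55%
```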

Most analysts pull this from analytics over a recent lookback window (30, 60, or 90 days). Use a window long enough to smooth out weekly seasonality but short enough to reflect current product reality. 30 days is a reasonable default.

2. Minimum detectable effect (MDE)

The smallest improvement you want the test to be able to reliably detect. Usually expressed as a relative lift ("we want to detect a 10% lift in conversion rate") or sometimes as absolute percentage points ("we want to detect a 0.5pp lift").

This is the input people misunderstand most often. MDE is not a prediction of how much the variant will lift conversions. MDE is the _resolution limit_ of your test — the smallest effect you'll be powered to detect. The variant might genuinely produce a 3% lift, but if your MDE is 8%, the test will likely return inconclusive.

Lower MDE means the test can detect smaller effects, but requires more sample. There's an inverse-square relationship: cutting MDE in half quadruples the required sample size. Realistic MDE choices in CRO:

  • 5% relative MDE: For very small changes (button color, micro-copy). Requires lots of traffic.
  • 10% relative MDE: The most common default. Reasonable for layout changes, new CTAs, feature additions.
  • 15-20% relative MDE: For larger changes — major redesigns, pricing changes, new flows. Less traffic needed but you'll miss smaller wins.

Pick an MDE that represents a _meaningful_ business outcome, not the smallest theoretically detectable effect. If a 3% lift wouldn't change your decision about shipping the variant, don't power the test for 3%.
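
To make the inverse-square relationship concrete, here is a minimal sketch of one common sample-size formula (a two-sided, two-proportion z-test with a normal approximation; other calculators use slightly different variants, so treat the outputs as ballpark figures):

```python
from scipy.stats import norm

def per_arm_sample_size(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~1.96 + ~0.84 at 95% confidence, 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return z ** 2 * variance / (p2 - p1) ** 2

# Halving the MDE roughly quadruples the required sample:
print(round(per_arm_sample_size(0.05, 0.10)))  # ~31,200 per arm at a 10% relative MDE
print(round(per_arm_sample_size(0.05, 0.05)))  # ~122,000 per arm at a 5% relative MDE
```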

3. Statistical confidence (1 - α, the complement of the significance level α)

How confident you want to be that an observed "winner" isn't a fluke. 95% is the industry standard, meaning a 5% chance (α = 0.05) of declaring a winner when there is no real difference.

Higher confidence means a lower false-positive rate but a larger required sample. At 80% power, a test at 90% confidence needs roughly half the sample of the same test at 99% confidence, and about 20% less than at 95%. For exploratory tests where the cost of being wrong is low, 90% is defensible. For high-stakes pricing or homepage tests, 99% might be worth the extra runtime.

4. Statistical power (1 - β)

The probability that your test will detect the MDE if it's real. 80% is the industry standard, meaning a 20% false-negative rate (β = 0.20). At 80% power, if your variant truly produces an effect equal to the MDE, you'll detect it 4 out of 5 times.

Lower power means cheaper tests but more missed wins. Higher power means more expensive tests but fewer missed wins. 80% is the convention because it's the point where the marginal sample cost of going higher (90%, 95%) starts to outweigh the marginal value of catching one more true positive.
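
Because the required sample scales with the square of (z_α/2 + z_β), you can compare confidence and power settings directly without rerunning a full calculator. A small sketch (multipliers are relative to the 95% confidence, 80% power convention, everything else held equal):

```python
from scipy.stats import norm

def z_factor(confidence, power):
    """Sample size scales with (z_{alpha/2} + z_beta)^2 for a two-sided test."""
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) ** 2

convention = z_factor(0.95, 0.80)  # the 95% confidence / 80% power convention
for confidence in (0.90, 0.95, 0.99):
    for power in (0.80, 0.90):
        multiplier = z_factor(confidence, power) / convention
        print(f"{confidence:.0%} confidence, {power:.0%} power -> {multiplier:.2f}x the sample")
```

At 80% power, 90% confidence lands around 0.8x the conventional sample and 99% around 1.5x, which is where the rough comparison above comes from.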

5. Number of variants

How many cells your test has, including control. A standard A/B test has 2 (control + 1 variant). An A/B/C test has 3.

More variants means more sample required, because the same total traffic gets split more ways. Some calculators also apply a multiple-comparison correction (Bonferroni, Dunnett's) for tests with 3+ variants to control the false-positive rate across the family of comparisons. If you're running 4+ variants, look for whether your calculator handles this.
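
For intuition on why the correction matters, here is a minimal sketch of the family-wise false-positive rate and the simple Bonferroni adjustment (real calculators may use Dunnett's or Holm instead, which are less conservative):

```python
def familywise_alpha(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons against control."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(alpha, comparisons):
    """Stricter per-comparison threshold that keeps the family-wise rate near alpha."""
    return alpha / comparisons

print(round(familywise_alpha(0.05, 3), 3))  # 0.143: three uncorrected comparisons vs control
print(round(bonferroni_alpha(0.05, 3), 4))  # 0.0167: test each comparison at this level instead
```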

The three outputs that matter

Output 1: Sample size per variation

The number of visitors _each_ arm needs before you have enough data to detect the MDE with the specified confidence and power. Per arm, not total. A two-variant test needs roughly twice this many visitors total.

This number tells you whether the test is feasible. If your page gets 500 visitors a week and the calculator says you need 30,000 per arm, that's 60,000 visitors in total, or a 120-week test. That's almost certainly not viable — you should either pick a larger MDE, find a higher-traffic page, or redirect to a different test.
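
A quick feasibility check is worth automating so it happens before anyone commits to a test. A minimal sketch, assuming an even 50/50 split (the traffic numbers are the ones from the paragraph above):

```python
def runtime_weeks(per_arm_sample, n_arms, weekly_traffic):
    """Weeks needed to fill every arm, assuming traffic is split evenly across arms."""
    return per_arm_sample * n_arms / weekly_traffic

print(runtime_weeks(30_000, 2, 500))      # 120.0 weeks: not a viable test
print(runtime_weeks(30_000, 2, 21_000))   # ~2.9 weeks: very much viable
```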

Output 2: Estimated test duration

Sample size per variation × number of variations ÷ daily (or weekly) traffic. Most calculators do this for you if you provide the traffic input.

Two duration thresholds I use:

  • Less than 2 weeks: even if statistically valid, run for at least 2 full weeks to capture day-of-week and weekly cycle effects.
  • More than 6-8 weeks: start questioning whether the test is worth running. Long-running tests are exposed to seasonality, competitor changes, marketing-mix shifts, and bored stakeholders who want to ship something _anyway_.

Output 3: Detectable effect by week

Some calculators (ours, Speero's, others) show the _minimum detectable effect at each week_ of the test. This is the inverse of the sample-size question — if you only have N weeks of traffic, what's the smallest effect you can reliably detect?

This is genuinely useful. Often the right answer to "how long should this test run" isn't a fixed sample size — it's "as long as it takes to be powered for an effect that matters to the business." The week-by-week MDE table tells you, for example, that with your current traffic you can detect a 5% lift in 4 weeks but a 3% lift requires 11 weeks. The decision then becomes: is a 3% lift worth 11 weeks of runtime? Often it is not.
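
If your calculator doesn't produce this table, you can approximate it by inverting the sample-size formula: fix the number of weeks, work out the per-arm sample that buys you, and solve for the smallest detectable lift. A rough sketch (it approximates both arms' variance with the baseline rate, and the traffic figure is hypothetical):

```python
from scipy.stats import norm

def detectable_relative_mde(baseline, per_arm_n, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with per_arm_n visitors per arm (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    absolute_diff = z * (2 * baseline * (1 - baseline) / per_arm_n) ** 0.5
    return absolute_diff / baseline

weekly_per_arm = 10_500  # hypothetical: 21,000 weekly visitors split 50/50
for weeks in (2, 4, 8, 12):
    mde = detectable_relative_mde(0.05, weekly_per_arm * weeks)
    print(f"{weeks:>2} weeks -> ~{mde:.1%} smallest detectable relative lift")
# At a 5% baseline this prints roughly 12%, 8%, 6%, and 5% for 2, 4, 8, and 12 weeks.
```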

Three things calculators are bad at

Even good calculators have blind spots that the analyst has to handle manually:

Different formulas, different numbers

If you use one calculator for sample size planning and a different one for significance analysis, you may get conflicting answers for the same data. Speero's blog acknowledges this directly: "Differences in the calculators' underlying statistics cause this discrepancy, but if you're not aware of it — or why it's happening — you're left only with frustration and uncertainty."

The reason is that there is no single universal sample-size formula in A/B testing. Evan Miller, who maintains the most-cited free calculator, puts it bluntly: "There's no universal Sample Size Formula. Instead, you can't have a Sample Size Formula without a corresponding Analytic Formula, and there are many Analytic Formulas to choose from."

The fix: pick one calculator for both planning and analysis, and stick with it.

Peeking inflates false positives

The sample size your calculator returns assumes you'll wait until the test reaches that size before checking the result. If you peek every day and stop early when significance crosses 95%, you've voided the statistical guarantee.

Evan Miller has a classic post on this that every analyst should read. His warning, in his own words:

"If you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5% significance."

The fix: pre-commit to a sample size, wait for it, then analyze. If you need flexibility to peek, switch to Bayesian methods or sequential testing — both are designed to handle peeking without inflating error rates.
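
If you'd rather see the inflation than take the quote on faith, a small A/A simulation shows it: the arms are identical, yet stopping at the first "significant" daily peek produces a winner far more often than 5% of the time. A rough sketch (all parameters invented for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)

def peeking_false_positive_rate(n_sims=2000, days=20, daily_n=500, p=0.05, alpha=0.05):
    """Share of A/A tests declared significant at ANY daily peek (two-proportion z-test)."""
    z_crit = norm.ppf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        a_conv = b_conv = a_n = b_n = 0
        for _ in range(days):
            a_conv += rng.binomial(daily_n, p)
            b_conv += rng.binomial(daily_n, p)
            a_n += daily_n
            b_n += daily_n
            pooled = (a_conv + b_conv) / (a_n + b_n)
            se = (pooled * (1 - pooled) * (1 / a_n + 1 / b_n)) ** 0.5
            if se > 0 and abs(a_conv / a_n - b_conv / b_n) / se > z_crit:
                false_positives += 1
                break  # the analyst stops the test here and ships the "winner"
    return false_positives / n_sims

print(peeking_false_positive_rate())  # typically well above 0.05 despite zero real difference
```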

Multiple variants need correction

Running an A/B/C/D test? You're now making three pairwise comparisons against control. The chance of at least one false positive across three comparisons is higher than the per-comparison rate. Without correction, your effective false-positive rate is closer to 14% than the 5% you set.

The fix: use a calculator that handles multiple-comparisons correction (Bonferroni, Dunnett's, or Holm-Bonferroni). Or run sequential A/B tests instead of A/B/n.
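
On the analysis side, the corrections are already implemented in standard libraries. A small sketch using statsmodels on made-up p-values for three variants each compared against control:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values: variants B, C, D each compared against control A.
p_values = [0.012, 0.048, 0.210]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for variant, raw, adj, significant in zip("BCD", p_values, p_adjusted, reject):
    print(f"variant {variant}: raw p={raw:.3f}, Holm-adjusted p={adj:.3f}, significant={significant}")
# Only variant B survives the correction at the 0.05 level.
```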

A worked example

Suppose you're testing a homepage CTA change. You pull from analytics:

  • 30 days of traffic: 90,000 visitors → 3,000 visitors per day → about 21,000 per week
  • 30 days of conversions: 4,500 → conversion rate = 5.0%
  • You want to detect a 10% relative lift (so the variant going from 5.0% to 5.5%)
  • Standard 95% confidence, 80% power
  • 2 variants (A/B test, 50/50 split)

A standard Casagrande-Pike calculator returns: ~30,420 visitors per variant. Total = 60,840 visitors. With 21,000 weekly traffic split 50/50, that's 10,500 per variant per week. Required runtime: 30,420 / 10,500 ≈ 2.9 weeks.

Round up to 3 weeks (or extend to a full 4 weeks for clean weekly cycles). That's your test plan: 3-4 weeks of runtime, 30,000+ visitors per arm, powered to detect a 10% relative lift.

If your calculator returns a different sample size (say, 32,000 instead of 30,420), it's probably applying a continuity correction. The difference is statistically defensible but practically immaterial — you're still looking at a 3-week test either way.
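
To sanity-check those numbers yourself, the normal-approximation formula sketched earlier lands in the same ballpark; the small gap against the 30,420 quoted above is exactly the formula-variant difference this post keeps flagging:

```python
from scipy.stats import norm

def per_arm_sample_size(baseline, relative_mde, alpha=0.05, power=0.80):
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

n = per_arm_sample_size(0.05, 0.10)   # ~31,200 per arm with this variant of the formula
weeks = n * 2 / 21_000                # two arms sharing 21,000 visitors per week
print(round(n), round(weeks, 1))      # roughly 31,200 per arm and ~3.0 weeks of runtime
```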

What to do next

Three concrete habits that have served me well:

Pull baseline from a real analytics report. Don't estimate. The accuracy of every other output depends on the baseline being right.

Pick MDE based on what would change a decision. Not the smallest theoretically detectable effect. If a 4% lift wouldn't change whether you ship the variant, don't power the test for 4%.

Pre-commit to sample size before launching. Write it down somewhere visible. Resist the urge to peek and stop early. If you need that flexibility, use a Bayesian tool.

The math behind A/B test calculators isn't simple, but the _use_ of them can be. Five inputs, three outputs, and three habits that prevent the most common ways tests go wrong. That's the practical kit.
