The CRO Analyst's Guide to Trusting (or Not Trusting) Your A/B Test Calculator

_By Atticus Li — Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com._

---

A CRO analyst on my team noticed last week that our pre-test calculator was returning slightly different numbers than the Speero calculator he used to use. Same inputs. Different MDE estimates. The difference was small — about 0.4 percentage points — but it was enough to make him question whether our tool was wrong.

The honest answer turned out to be: neither tool is wrong. They're using different statistical conventions, both defensible. But the question he asked was the right one. Knowing when to trust your calculator, when to question it, and how to decide between conflicting tools is one of the most underrated skills in CRO.

Here's the decision framework I use, built around four questions you should be able to answer about whatever calculator your team relies on.

Question 1: Does the methodology match your analysis?

Every sample-size calculator is making an implicit assumption about how you'll analyze the test at the end. The math only works if those assumptions match.

Three checks:

Same statistical paradigm? If you plan with a frequentist sample-size calculator (Evan Miller, GrowthLayer, AB Tasty, the public side of VWO and Speero) and analyze with a Bayesian tool (VWO SmartStats, Optimizely Stats Engine), you'll get conflicting decisions and not understand why. Pick one paradigm and use it end to end.

Same continuity correction (or lack thereof)? Some calculators apply the Fleiss continuity correction, which inflates the required sample size by roughly 1.5–2.5% (sketched in code below these checks). If your planning calculator uses Fleiss but your analysis test doesn't (Pearson chi-square, plain z-test), you'll over-collect sample. If the reverse, you'll under-collect.

Same multiple-comparisons handling? A/B/C/D tests need a correction (Bonferroni, Dunnett's, Holm) to control the false-positive rate across the family of comparisons. If your calculator doesn't apply one and your analysis tool does (or vice versa), the numbers won't reconcile.
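
To make the continuity-correction check concrete, here's a minimal sketch of the per-variant sample-size math with and without the Fleiss correction. The pooled-variance approximation, the 95%/80% defaults, and the 35% baseline are illustrative assumptions, not the exact formula any particular vendor uses.

```python
# Minimal sketch: per-variant sample size for a two-proportion test,
# with and without the Fleiss continuity correction. Assumes a 50/50
# split, a two-sided test, and analysis with a plain z-test.
from math import sqrt
from statistics import NormalDist

def n_per_variant(p1: float, rel_lift: float, alpha: float = 0.05,
                  power: float = 0.80, fleiss: bool = False) -> float:
    p2 = p1 * (1 + rel_lift)
    delta = abs(p2 - p1)
    p_bar = (p1 + p2) / 2
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    # Pooled-variance approximation (Casagrande-Pike family)
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    if fleiss:
        # Fleiss correction: inflates n by a few percent
        n = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return n

plain = n_per_variant(0.35, 0.10)
corrected = n_per_variant(0.35, 0.10, fleiss=True)
print(f"plain: {plain:.0f}  corrected: {corrected:.0f}  "
      f"inflation: {corrected / plain - 1:.1%}")
```

On inputs like these the correction adds roughly 2% to the required sample; small methodology choices of exactly this size are what the next question is about.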

The cleanest setup: pick one tool that handles both planning and analysis, document its assumptions, and stick with it. Mixing tools is the most common source of calculator confusion in real CRO work.

Question 2: Is the difference between calculators 0.5pp or 5pp?

Not all calculator disagreements are equal. The size of the gap tells you what's going on.

0.1–0.5pp difference in MDE for the same inputs: This is iteration noise and approximation choice. Both calculators are using valid formulas — Casagrande-Pike with slightly different convergence settings, or one with a continuity correction layered on top. Operationally irrelevant. Pick one calculator and move on.

0.5–2pp difference: Meaningful methodology choice. Probably one tool applies continuity correction (Fleiss) and the other doesn't, or one uses Lachin (separate variances) instead of Casagrande-Pike (pooled variance). Worth understanding which one your tool uses, but neither is wrong.

2pp+ difference: Different paradigms entirely. Likely one is fixed-sample frequentist and the other is sequential or Bayesian. These cannot be directly compared — they're answering different questions. Pick the one that matches your analysis paradigm and ignore the other.
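
To see what "answering different questions" means in practice, here is a minimal Beta-Binomial sketch of the kind of output a Bayesian tool reports. The Beta(1, 1) prior, the counts, and the number of posterior draws are illustrative assumptions, not any vendor's implementation.

```python
# Minimal sketch: a Bayesian (Beta-Binomial) read-out reports
# "probability the variant beats control given the data so far",
# not "how many visitors until a fixed-sample z-test is powered".
import numpy as np

rng = np.random.default_rng(1)
conv_a, n_a = 520, 1_480   # control conversions / visitors (made up)
conv_b, n_b = 565, 1_500   # variant conversions / visitors (made up)

# Beta(1, 1) prior on each rate, updated with the observed counts
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)
print(f"P(variant beats control) ≈ {(post_b > post_a).mean():.1%}")
```

A number like that and a fixed-sample MDE aren't interchangeable, which is why gaps of 2pp or more usually mean you're comparing across paradigms rather than across formulas.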

The analyst on my team was looking at a 0.4pp gap between our calculator and Speero's. That's the iteration-noise + Fleiss-correction zone. Not worth rethinking the methodology over. Worth documenting the difference and moving on.

Question 3: Does the calculator surface what you actually need?

Speero's excellent post on calculator features lists four things that matter:

  1. Centralized calculators. Sample size, duration, and significance should use the same statistical methodology. Switching between vendors mid-test creates conflicting answers.
  2. Test duration awareness. Real-time visibility into how long the test still needs to run, given current traffic. Pre-test estimates assume traffic is stable; mid-test re-estimation handles the cases where it's not.
  3. Sharing results. A "share link" feature that lets you send a stakeholder the live calculation without recreating it in a Google Doc. We added this to our calculator after using Speero's and seeing how much faster handoffs got.
  4. Multiple variations. A correction for the multiple-comparison problem when running A/B/C/n tests. The author of Speero's post specifically calls out Dunnett's adjustment as more efficient than the more common Bonferroni method.
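
As a rough illustration of item 4, the sketch below applies a Bonferroni adjustment to the n_per_variant helper from the continuity-correction sketch in Question 1. Dunnett's adjustment, the one the Speero post prefers, needs the multivariate t distribution and isn't shown; the point here is only that more comparisons mean more sample per variant.

```python
# Minimal sketch: Bonferroni splits alpha across the treatment-vs-control
# comparisons in an A/B/n test, so each comparison needs more sample.
# Reuses n_per_variant() from the continuity-correction sketch above.
for comparisons in (1, 2, 3):
    adjusted_alpha = 0.05 / comparisons
    n = n_per_variant(0.35, 0.10, alpha=adjusted_alpha)
    print(f"{comparisons} comparison(s): alpha = {adjusted_alpha:.4f}, "
          f"n per variant ≈ {n:.0f}")
```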

I'd add three more things specifically for analyst handoff:

  1. Visitors per variant per week. Not just the total cumulative sample needed, but the per-week breakdown so you can see when the test crosses each MDE threshold (see the sketch after this list). This is the single most useful column for actually planning runtime.
  2. Methodology documentation. A line on the calculator page that tells you which formula it's using. "Casagrande-Pike-Smith (1978), no continuity correction." Without this, you cannot defend the calculator's output to a stakeholder who's seen a different number from a different tool.
  3. Paste-friendly text export. A copy-text format with one number per line that maps directly to standard input fields. So when you need to verify a result against another tool, you can paste in the values without manually transcribing.
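
Here's what that per-week view can look like, as a minimal sketch. The traffic, baseline rate, and the simple normal-approximation MDE formula (baseline variance assumed for both arms) are all illustrative, so the numbers will differ a little from any given calculator's.

```python
# Minimal sketch: relative MDE by week given steady traffic.
# Inputs are invented; the MDE uses a simple normal approximation
# with the baseline variance for both arms.
from math import sqrt
from statistics import NormalDist

weekly_visitors = 6_000          # total across all variants (illustrative)
variants = 2
baseline = 0.05                  # baseline conversion rate (illustrative)
z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)  # 95% conf, 80% power

for week in range(1, 7):
    n = weekly_visitors / variants * week      # cumulative visitors per variant
    mde_abs = z * sqrt(2 * baseline * (1 - baseline) / n)
    print(f"week {week}: n/variant = {n:,.0f}, "
          f"MDE ≈ {mde_abs:.2%} abs ({mde_abs / baseline:.0%} relative)")
```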

If your current calculator is missing any of these, the time cost of those gaps shows up in every test plan you write.

Question 4: When should you question the calculator entirely?

Three scenarios where the calculator output is the _least_ reliable thing you have:

Sample ratio mismatch (SRM). If your traffic split between variants doesn't match what was configured (e.g., you set a 50/50 split but observed 47/53), the calculator's assumptions are violated: the math assumes the realized allocation matches the configured one. SRM usually indicates a bug, such as incorrect bucketing, an A/A test that isn't actually random, or an ad campaign whose traffic never reaches certain variants. Speero's calculator warns on SRM; good ones do.
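
A basic SRM check is easy to run yourself. The sketch below compares observed bucket counts against the configured split with a chi-square goodness-of-fit test; the counts and the p < 0.001 alert threshold are illustrative assumptions, not a specific vendor's rule.

```python
# Minimal sketch: sample-ratio-mismatch check via a chi-square
# goodness-of-fit test against the configured traffic split.
from scipy.stats import chisquare

observed = [9_412, 10_588]        # visitors actually bucketed per variant (made up)
configured_split = [0.5, 0.5]     # what the testing tool was set to deliver
total = sum(observed)
expected = [total * share for share in configured_split]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM (p = {p_value:.2e}): check bucketing before trusting results")
else:
    print(f"Split looks consistent with configuration (p = {p_value:.3f})")
```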

Peeking and early stopping. Evan Miller's classic post on this is required reading. The headline:

"If you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5% significance."

The sample size your calculator returns assumes you'll wait until that size before checking. Peek and stop early, and you've voided the guarantee. The fix is either to pre-commit to the sample size with discipline, or switch to Bayesian / sequential methods that handle peeking by design.
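
If you'd rather see the effect than take the quote on faith, here is a rough simulation: A/A tests with no real difference, peeked at ten interim points, stopping at the first "significant" z-test. The traffic volume and peek schedule are made up and it won't reproduce Miller's exact figures, but the realized false-positive rate lands well above the nominal 5%.

```python
# Rough simulation of the peeking problem: A/A tests (no true effect),
# checked at 10 interim points, stopping at the first significant z-test.
import numpy as np

rng = np.random.default_rng(0)
alpha, p, n_per_arm, peeks, sims = 0.05, 0.10, 10_000, 10, 2_000
z_crit = 1.96
checkpoints = np.linspace(n_per_arm / peeks, n_per_arm, peeks).astype(int)

false_positives = 0
for _ in range(sims):
    a = rng.random(n_per_arm) < p      # control conversions (true rate p)
    b = rng.random(n_per_arm) < p      # "variant" conversions (same true rate)
    for n in checkpoints:
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(a[:n].mean() - b[:n].mean()) / se > z_crit:
            false_positives += 1       # stopped early on a false positive
            break

print(f"Realized false-positive rate with {peeks} peeks: "
      f"{false_positives / sims:.1%} (nominal {alpha:.0%})")
```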

Non-binomial metrics. Most A/B test calculators assume binary outcomes (converted / didn't convert). If you're testing on revenue per visitor, average order value, or session duration, the underlying statistics are different. You need a calculator that handles continuous metrics — one with standard-error inputs, not just conversion-rate inputs. Speero acknowledges this gap in their own post about working calculator features.
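
For completeness, a continuous-metric version of the sample-size math looks like the sketch below; it needs a standard-deviation input rather than a conversion rate. The dollar figures are invented, and real revenue distributions are heavy-tailed enough that a normal approximation like this is only a first cut.

```python
# Minimal sketch: per-variant sample size for a continuous metric
# (e.g. revenue per visitor), using means and a standard deviation.
from statistics import NormalDist

def n_per_variant_continuous(sigma: float, mde_abs: float,
                             alpha: float = 0.05, power: float = 0.80) -> float:
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return 2 * (z_a + z_b) ** 2 * sigma ** 2 / mde_abs ** 2

# e.g. detect a $0.50 lift in revenue per visitor with a $12 standard deviation
print(f"{n_per_variant_continuous(sigma=12.0, mde_abs=0.50):.0f} visitors per variant")
```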

If any of these three apply to your test, the calculator's number is at best a starting point and at worst actively misleading.

A worked example of the framework

Last quarter our team ran a test that split into a methodology argument. Here's how the framework resolved it.

Test: A pricing page redesign, 14,820 weekly visitors, 5,190 weekly conversions (35.02% baseline), 2 variants, target 10% relative lift.

Calculator A (the one we'd been using): week 4 MDE = 8.86%, sample size per variant = 2,975.

Calculator B (a senior analyst's preferred tool): week 4 MDE = 8.49%, sample size per variant = 2,800.

The 0.4pp gap surfaced as a debate over which calculator was "right."

Question 1 — methodology match: Both calculators were going to be used to plan a test that would be analyzed with our standard z-test of two proportions, no continuity correction. Calculator A was self-consistent (uncorrected planning, uncorrected analysis). Calculator B (Speero) was applying Fleiss continuity correction in planning, but the analyst was going to analyze with the same uncorrected z-test. Mismatch.

Question 2 — size of gap: 0.4pp. Iteration noise + Fleiss correction. Operationally irrelevant.

Question 3 — does it surface what we need: Both showed visitors per variant by week, both let us share results. Tie.

Question 4 — should we question entirely: No SRM concerns, no peeking issue, binary metric. No reason to question either.

Resolution: We kept Calculator A. The 0.4pp gap was real but operationally meaningless, and Calculator A's methodology matched our analysis math. We documented the choice in the team handbook and moved on.

The decision took 20 minutes once we worked through the framework. Without the framework, the same debate had eaten three meetings the previous quarter.

What to put in your team's handbook

If your CRO program has more than two analysts, write down the calculator policy. A one-pager that answers:

  • Which calculator do we use for sample-size planning? Specific tool, specific URL.
  • Which calculator do we use for significance analysis? Same tool, ideally.
  • Which formula does our calculator use? "Casagrande-Pike-Smith (1978), no continuity correction" is a defensible answer. So is "Bayesian Beta-Binomial" if you've gone that direction. The point is it's named.
  • What do we do when results disagree across calculators? Default rule: trust the team's standard tool. Investigate the disagreement, but don't switch tools mid-test.
  • What confidence and power defaults do we use? 95% confidence, 80% power are industry-standard. Document if you've chosen differently.

This document should fit on one page. Its job is to prevent the methodology argument from happening twice.

The bigger principle

The thing I've learned from running 100+ tests per year for the past five years: the calculator is rarely the limiting factor in test quality. Sample-size math, continuity corrections, formula choice — these matter at the margins, but they're not what makes experimentation programs succeed or fail.

What matters more, in my experience: pre-registered hypotheses, disciplined sample-size commitment, SRM monitoring, multiple-comparisons awareness, and stakeholder education on what tests can and cannot tell you.

A team that picks any defensible calculator and uses it consistently will outperform a team that argues about formula choice for every test. The calculator is a tool. The discipline is the moat.
