
Calculator Showdown: Optimizely, VWO, Evan Miller, Speero, and GrowthLayer (Same Inputs, Different Numbers)



I gave the same A/B test inputs to five popular sample-size calculators this week. They returned five different numbers — small differences in some cases, large differences in others. None of them is wrong. All of them are using accepted statistical methods. They just make different assumptions about which method is appropriate.

If you've ever wondered why your team's "we need 30,000 visitors per variant" never matches the 32,500 your stats consultant pulled from a different tool, this is your benchmark.

The setup

Standard CRO scenario. I picked a realistic mid-traffic test rather than a contrived one:

  • Weekly traffic: 12,000 visitors
  • Weekly conversions: 600 (so a 5.00% baseline conversion rate)
  • Variants: 2 (control + 1 variant)
  • Statistical confidence: 95% (α = 0.05, two-tailed)
  • Statistical power: 80%
  • Target relative lift: 10% (so detect 5.00% → 5.50%)

I plugged these inputs into five calculators and recorded what each returned for per-variant sample size required to detect the target lift with the specified power.
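If you want to reproduce the consensus number yourself before reading the comparison, here is a minimal sketch of the standard two-proportion sample-size formula in Python, with no continuity correction. The function name and rounding are mine, and individual tools round the final figure slightly differently.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p_baseline, relative_lift, alpha=0.05, power=0.80):
    """Fixed-sample size per variant for a two-sided test of two proportions,
    no continuity correction. A sketch of the consensus formula, not any
    vendor's exact implementation."""
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for 95%, two-tailed
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

print(sample_size_per_variant(0.05, 0.10))  # ~31,234 for the inputs above
```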

The results

I actually ran the numbers. The first three rows are direct API hits or local calculations against published formulas; the rest are educated reads from each platform's docs, since their proprietary engines aren't fully exposed:

| Calculator | Per-variant sample | vs. baseline | Notes |
| --- | --- | --- | --- |
| [Evan Miller](https://www.evanmiller.org/ab-testing/sample-size.html) | 31,234 | baseline | Casagrande-Pike, no continuity correction (computed against his published formula) |
| [GrowthLayer](/calculator/ab-test-duration-calculator?tab=pretest) | 31,234 | matches | Same formula as Evan Miller. Identical answer. |
| [Speero / CXL](https://calc.speero.com/calculator/?speero=true) | ~30,500 (inferred) | -2.4% | Their MDE-by-week table at week 5 reads 10.06%, slightly under the 10% target — implying their per-variant sample requirement is a touch lower than ours at this baseline. (More on this below.) |
| [Optimizely](https://www.optimizely.com/sample-size-calculator/) | varies | n/a | Sequential testing (Stats Engine, mSPRT) — not directly comparable to fixed-sample tools |
| [VWO](https://vwo.com/tools/ab-test-sample-size-calculator/) | ~31,200 | matches | Standard frequentist on the public calc; their SmartStats product engine uses Bayesian analysis and is a separate paradigm |
| [AB Tasty](https://www.abtasty.com/sample-size-calculator/) | ~31,200 | matches | Standard frequentist; recommends a minimum 14-day runtime regardless of the math |

The first surprise: most of the standard frequentist calculators agree closely (within 1%). The Speero gap is real but small, a few percent shift in required sample size that translates to a fraction of a day of extra (or saved) runtime at most realistic traffic levels; here it's roughly 700 visitors per variant against about 6,000 weekly visitors per variant.

The bigger surprise is Optimizely. They're not using the same formula at all.

The week-by-week MDE table (where the real differences show)

For the same inputs, here's what GrowthLayer and Speero return for cumulative MDE by week. I hit Speero's cloud function endpoint directly with type=mde, monthly_traffic=52140 (12,000 weekly × 4.345), p0=0.05, variants=2, tails=2, confidence=0.95, power=0.80, and read GrowthLayer's output off the pre-test calculator:

| Week | Visitors / variant | GrowthLayer | Speero API | Spread |
| --- | --- | --- | --- | --- |
| 1 | 6,000 | 23.60% | 23.20% | 0.40pp |
| 2 | 12,000 | 16.40% | 16.14% | 0.26pp |
| 3 | 18,000 | 13.40% | 13.08% | 0.32pp |
| 4 | 24,000 | 11.60% | 11.28% | 0.32pp |
| 5 | 30,000 | 10.40% | 10.06% | 0.34pp |
| 6 | 36,000 | 9.40% | 9.18% | 0.22pp |

The visitors-per-variant column matches across both tools because it's just weekly_traffic × week_number / variants. The MDE column differs because each calculator uses a slightly different statistical convention.
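To sanity-check the MDE column, you can invert the same planning formula: for each week's per-variant sample, search for the smallest relative lift whose required sample still fits. A rough sketch, reusing the `sample_size_per_variant` helper from the earlier block; expect values in the neighborhood of the table rather than exact matches, since each tool applies its own rounding and iteration conventions.

```python
# assumes sample_size_per_variant() from the earlier sketch is already defined

def mde_for_sample(n_per_variant, p_baseline, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with a given per-variant sample,
    found by bisection on the fixed-sample formula above."""
    lo, hi = 0.001, 2.0                      # bracket: 0.1% to 200% relative lift
    for _ in range(60):
        mid = (lo + hi) / 2
        if sample_size_per_variant(p_baseline, mid, alpha, power) <= n_per_variant:
            hi = mid                         # detectable: try a smaller lift
        else:
            lo = mid                         # not detectable: need a larger lift
    return hi

for week in range(1, 7):
    n = 6_000 * week                         # 12,000 weekly visitors / 2 variants
    print(f"week {week}: MDE ≈ {mde_for_sample(n, 0.05):.2%}")
```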

Why Optimizely's number doesn't fit on the table

Optimizely runs on their [Stats Engine](https://www.optimizely.com/insights/blog/how-to-calculate-sample-size-of-ab-tests/), which uses a sequential likelihood ratio test (mixture sequential probability ratio test, or mSPRT). This is a fundamentally different paradigm from the fixed-sample frequentist approach the other calculators use.

Stats Engine is designed to let you peek at results without inflating the false-positive rate. The trade-off is that there's no single "sample size required" answer — the test ends when the test statistic crosses a continuously-updating boundary, which depends on the actual data observed.
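To make the moving boundary concrete, here is a toy sketch of a mixture sequential probability ratio test for a normal mean, the general family Optimizely's write-up describes. It is not Stats Engine itself, and the mixture scale `tau` is an arbitrary choice of mine.

```python
from math import exp, sqrt

def msprt_likelihood_ratio(n, xbar, sigma, tau, theta0=0.0):
    """Mixture SPRT statistic for observations ~ N(theta, sigma^2), testing
    H0: theta = theta0 against a N(theta0, tau^2) mixture of alternatives.
    Toy illustration of the sequential family, not Optimizely's Stats Engine."""
    v = sigma ** 2
    return sqrt(v / (v + n * tau ** 2)) * exp(
        n ** 2 * tau ** 2 * (xbar - theta0) ** 2 / (2 * v * (v + n * tau ** 2))
    )

def sequential_decision(stream, sigma, tau, alpha=0.05):
    """Peek after every observation and stop the first time the evidence
    threshold 1/alpha is crossed; there is no single fixed sample size."""
    total = 0.0
    for n, x in enumerate(stream, start=1):
        total += x
        if msprt_likelihood_ratio(n, total / n, sigma, tau) >= 1 / alpha:
            return n      # boundary crossed after n observations
    return None           # never crossed within the observed data
```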

For planning, Optimizely's calculator returns a _recommended_ sample size that's typically larger than the fixed-sample equivalent (because mSPRT pays a sample-size cost in exchange for the peeking flexibility). Their docs explicitly recommend not comparing it directly to fixed-sample calculators.

If you use Optimizely's platform end-to-end, this works well — the planning math matches the analysis math. If you plan with their calculator and then analyze with someone else's, you'll get inconsistent answers.

Why VWO has two different answers

VWO's public sample-size calculator uses standard frequentist math (Casagrande-Pike, similar to Evan Miller). But their actual product, VWO SmartStats, runs Bayesian analysis on tests. The two are not directly comparable.

If you plan a test with VWO's calculator (frequentist) and then analyze it with VWO SmartStats (Bayesian), you'll see different probability statements than you expected. The Bayesian "probability variant beats control" doesn't have a clean equivalent in the frequentist sample-size formula.

In practice this means: if you're a VWO platform user, plan with their calculator and _also_ plan a Bayesian decision threshold (e.g., "ship when probability > 95%") that you'll use for the actual analysis. Don't try to convert between the two paradigms.
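For reference, here is what the Bayesian side of that plan looks like in the simplest generic form: a Beta-Binomial Monte Carlo estimate of "probability the variant beats control." This is not VWO SmartStats' actual model (its priors and corrections aren't fully public); the uniform Beta(1, 1) prior and the example counts are my assumptions.

```python
import random

def prob_variant_beats_control(conv_a, n_a, conv_b, n_b, draws=200_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1)
    priors. A generic Beta-Binomial sketch, not any vendor's production model."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: control 5.0% and variant 5.5% on 30,000 visitors each
print(prob_variant_beats_control(1_500, 30_000, 1_650, 30_000))  # ≈ 0.997
```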

Why Speero looks slightly less conservative here

For this 5% baseline test, Speero's MDE values run 0.2–0.4 percentage points _below_ GrowthLayer / Evan Miller. That direction matters: it means Speero's underlying iteration is a touch less conservative than the consensus formula at low baselines.

The story flips at higher baselines. I separately tested a 35% baseline scenario (a downstream pricing-page experiment we run on a high-conversion segment), and there Speero returned MDE values 0.4 percentage points _higher_ than ours, i.e., more conservative, while Speero's own backend API returned numbers about 1pp above its own UI. That kind of within-tool inconsistency is documented across many user reports, and it's a useful signal: when a tool's UI and its API don't agree on the same inputs, you can't always reproduce the result you're showing a stakeholder.

NCSS, publisher of the gold-standard PASS sample-size software, addresses the likeliest source of tool-to-tool variance (the Fleiss continuity correction) directly:

"Although this adjustment [Fleiss continuity correction] is included in the formula because it was specified by Fleiss, Levin, and Paik (2003), in practice this adjustment is not recommended because it reduces the power and the actual alpha of the test procedure."

Speero is a respected CRO firm and their calculator is one of the better ones (with sample-ratio-mismatch detection, multiple-comparisons handling, etc.). The methodology choices they've made are defensible — they're just not the choices the standards bodies recommend as the default.

Why Evan Miller is the de facto reference

Evan Miller's free sample-size calculator is the single most-cited tool in CRO. There are three reasons:

The math is documented. He's published the underlying formulas, discussed when each is appropriate, and warned about common misuses. Most commercial tools don't.

It uses the standard formula. Casagrande-Pike-Smith without continuity correction, matching the analysis tests almost everyone actually runs (z-test of two proportions, Pearson chi-square).

It's free and pre-existing in everyone's bookmarks. The network effect of being the de facto reference means every CRO consultant has used it, every analyst can verify a result against it, and every methodology argument can be settled by saying "let's just check Evan Miller."

If you're picking a calculator and don't have a strong preference, default to Evan Miller. Anything that matches Evan Miller is using the consensus formula.
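Since the whole point of matching the consensus formula is that planning math should agree with analysis math, here is a minimal sketch of the analysis-side test itself, a two-sided pooled z-test of two proportions; the function name and example counts are mine.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test of two proportions (equivalent to the Pearson
    chi-square test on the 2x2 table). Returns (z, p_value). A sketch of the
    consensus analysis test, the same family as the planning formula above."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts at the planned sample: 5.0% vs 5.5% on 31,234 per variant
print(two_proportion_z_test(1_562, 31_234, 1_718, 31_234))
```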

What this means for your team

Three takeaways from running the benchmark:

The ~1-2% spread between standard calculators is operationally irrelevant. If your tool returns 31,200 and a competing tool returns 31,800, run the test. The difference is less than a day of additional runtime at typical traffic.

Don't mix Bayesian and frequentist tools mid-test. Pick one paradigm. If you plan with frequentist math (Evan Miller, GrowthLayer, AB Tasty) and analyze with Bayesian (VWO SmartStats, Optimizely Stats Engine), you'll get conflicting answers and not understand why.

Document which calculator your team uses. The biggest source of confusion in experimentation programs isn't the formula choice — it's people switching between tools without realizing it. Pick one, write it in your handbook, and call it the team standard.

For our team, the standard is Casagrande-Pike-Smith without continuity correction (Evan Miller / GrowthLayer / AB Tasty all match here). It matches the analysis math we use, it's the consensus per NIST and NCSS PASS, and it doesn't impose a continuity-correction tax that the standards specifically recommend against.

If your team uses something different and it's working — great. The right answer is consistency, not a specific formula.

What I'd love to see from calculator vendors

Three product improvements I'd ask for if I had a magic wand:

Documented methodology. Every calculator should publish which formula it uses, in plain English, on the calculator page itself. Not buried in a FAQ. Not hidden in a whitepaper. Right next to the inputs. Speero gets credit for their blog post explaining the design space, but they should also document their specific choice on the calc page.

Consistent calculators across pre-test and post-test. Most vendors offer separate sample-size and significance calculators. They should be built on the same statistical foundation, not silently use different formulas.

Honest disagreement footnotes. If your calculator gives a different answer than Evan Miller's for the same inputs, say so on the page and explain why. Most users will never notice the discrepancy — but the ones who notice will lose trust in your tool when they figure it out from a different angle.

The gap between A/B test calculators is not a quality problem. It's a documentation problem. Tools that explain their choices are tools you can defend.


About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
