# Why Two A/B Test Calculators Show Different MDEs for the Same Inputs
_By Atticus Li — Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com._
---
A CRO analyst on my team flagged a problem this week: she ran the same A/B test inputs through three different sample size calculators and got three different answers. Not slightly different. Different enough that the team started arguing about which tool was "right."
If you have ever wondered why your team's "sample size needed" doesn't match the number your stats consultant pulled from a different tool, this post is for you. The short answer is: there is no single, universal sample-size formula in A/B testing. Different calculators make different defensible choices, and those choices land you in slightly different places.
To make this concrete, I picked a realistic test scenario and actually ran it through three calculators side by side. Here's what I found, with the receipts.
## The test scenario
A pricing page A/B test on a mid-traffic SaaS marketing site:
- Weekly traffic: 12,000 visitors
- Weekly conversions: 600 (so a 5.00% baseline conversion rate)
- Variants: 2 (control + 1 variant)
- Statistical confidence: 95% (α = 0.05, two-tailed)
- Statistical power: 80%
- Target relative lift: 10% (i.e., detect a move from 5.00% to 5.50%)
Same inputs, three calculators. Here is what each one returned.
## What each calculator actually said
I ran this scenario through three sources. The first two are calculations I controlled directly so the math is reproducible. The third is Speero's live cloud function endpoint, which I called over HTTPS with the same parameters.
1. GrowthLayer pre-test calculator (live tool) — I plugged the inputs into our tool and read off the per-week MDE table:
| Week | Visitors / variant | MDE (relative) |
|------|-------------------|------|
| 1 | 6,000 | 23.60% |
| 2 | 12,000 | 16.40% |
| 3 | 18,000 | 13.40% |
| 4 | 24,000 | 11.60% |
| 5 | 30,000 | 10.40% |
| 6 | 36,000 | 9.40% |
Sample size required for the 10% target lift across the full test: 31,234 per variant.
2. Evan Miller's calculator (sample-size.html) — Evan Miller uses the standard Casagrande-Pike-Smith two-proportion formula with no continuity correction, the same formula GrowthLayer uses. For these inputs his calculator returns 31,234 per variant, identical to GrowthLayer (rounding aside). This is the reference number most CRO consultants reach for as a sanity check.
3. Speero / CXL calculator (calc.speero.com) — I hit Speero's cloud function endpoint directly with type=mde, monthly_traffic=52140 (12,000 weekly × 4.345), p0=0.05, variants=2, tails=2, confidence=0.95, power=0.80. Speero returned:
| Week | Visitors / variant | MDE (relative) |
|------|-------------------|------|
| 1 | 6,000 | 23.20% |
| 2 | 12,000 | 16.14% |
| 3 | 18,000 | 13.08% |
| 4 | 24,000 | 11.28% |
| 5 | 30,000 | 10.06% |
| 6 | 36,000 | 9.18% |
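If you want to reproduce that query, the request looks roughly like the sketch below. The parameter names are exactly the ones listed above; treat the URL path, HTTP method, and response shape as placeholders rather than a documented API, since the exact endpoint is not published here.

```python
import requests

# Placeholder URL: only the host (calc.speero.com) is named in this post;
# the exact cloud-function path is an assumption.
SPEERO_ENDPOINT = "https://calc.speero.com/"

params = {
    "type": "mde",
    "monthly_traffic": 52140,  # 12,000 weekly visitors x 4.345 weeks/month
    "p0": 0.05,                # 5% baseline conversion rate
    "variants": 2,             # control + 1 variant
    "tails": 2,                # two-tailed
    "confidence": 0.95,        # alpha = 0.05
    "power": 0.80,
}

# Assumes a GET endpoint returning JSON; the method and response shape
# are assumptions, not documented behavior.
response = requests.get(SPEERO_ENDPOINT, params=params)
print(response.json())
```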
The visitors-per-variant column is identical across all three by construction; it is just the weekly traffic split evenly between two variants. The disagreement is purely in how each tool converts that available sample into a "minimum detectable effect" estimate.
Side-by-side relative MDE at the 5% baseline:
| Week | GrowthLayer | Evan Miller | Speero | Spread |
|------|------------|-------------|--------|--------|
| 1 | 23.60% | 23.60% | 23.20% | 0.40pp |
| 2 | 16.40% | 16.40% | 16.14% | 0.26pp |
| 3 | 13.40% | 13.40% | 13.08% | 0.32pp |
| 4 | 11.60% | 11.60% | 11.28% | 0.32pp |
| 5 | 10.40% | 10.40% | 10.06% | 0.34pp |
| 6 | 9.40% | 9.40% | 9.18% | 0.22pp |
GrowthLayer and Evan Miller agree to the rounding digit because they're built on the same formula. Speero is consistently 0.2–0.4 percentage points lower at this baseline. None of these is wrong; all are published, peer-reviewed approximations to the same underlying statistical question. They just answer it slightly differently.
## Why the spread exists
Each calculator makes its own choices on the two real knobs in two-proportion sample-size math: which approximation formula to use, and how to numerically invert that formula into an MDE (continuity correction, iteration step size, rounding).
GrowthLayer and Evan Miller both use the Casagrande-Pike-Smith (1978) formula with pooled variance under the null hypothesis and no continuity correction. This is the approach the NIST Engineering Statistics Handbook §7.2.4.2 presents as the default. It's what nearly every academic-leaning calculator and most modern CRO tools default to.
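Here is a minimal Python sketch of that shared formula, assuming scipy for the normal quantiles. Plugging in the scenario above reproduces the 31,234 figure:

```python
import math
from scipy.stats import norm

def required_n_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Two-proportion sample size: pooled variance under the null,
    no continuity correction, two-tailed."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.9600 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.8416 for 80% power
    p_bar = (p1 + p2) / 2               # pooled rate under H0
    pooled_var = 2 * p_bar * (1 - p_bar)
    unpooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    numerator = (z_alpha * math.sqrt(pooled_var)
                 + z_beta * math.sqrt(unpooled_var)) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

print(required_n_per_variant(0.05, 0.055))  # 31234
```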
Speero appears to use a slightly different two-proportion approximation. Based on reverse-engineering their published numbers across many test scenarios, the closest fit is Casagrande-Pike with an iteration step of roughly 0.0001 in absolute lift. At low baselines (like our 5% case) the gap is small, 0.2–0.4pp. At higher baselines (around 30%+) it can grow to roughly 1pp, because their iteration and ours converge to slightly different stop points.
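To see how iteration step size moves the reported MDE, here is a sketch that inverts the same formula by scanning absolute lifts at a fixed step. With a 0.0001 step and week-1 traffic it lands near 23.6%, consistent with the table above; a coarser step stops at a slightly different point, which is exactly the kind of sub-half-point gap in question:

```python
import math
from scipy.stats import norm

def required_n(p1, p2, alpha=0.05, power=0.80):
    # Same pooled-variance formula as the previous sketch, unrounded.
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p2 - p1) ** 2

def relative_mde(n_per_variant, p0, step=0.0001):
    """Smallest relative lift detectable with n_per_variant visitors,
    found by scanning absolute lifts in increments of `step`."""
    delta = step
    while required_n(p0, p0 + delta) > n_per_variant:
        delta += step
    return delta / p0

print(round(relative_mde(6000, 0.05) * 100, 2))  # ~23.6 at week 1
```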
The interesting case is when calculators disagree by more. I ran a separate test against a high-baseline scenario (35% baseline conversion rate) and saw:
- GrowthLayer: 6.11% MDE at week 1
- Speero UI: 6.50%
- Speero's backend API, hit directly: 7.34%
Speero's UI returned a different number than its own backend for the same inputs. That's a sign of internal inconsistency that the analyst can't resolve from outside their codebase — which is itself a useful piece of information when you're choosing which calculator to trust.
## The most-cited expert on this is blunt about it
Evan Miller, who built the calculator most CRO practitioners use as a reference point, has written publicly about why calculators disagree. His framing:
"There's no universal Sample Size Formula. Instead, you can't have a Sample Size Formula without a corresponding Analytic Formula, and there are many Analytic Formulas to choose from."
His specific point is that the sample-size formula has to match the test you plan to use to analyze the data. A z-test of two proportions wants one formula. A chi-square with continuity correction wants a different one. A Fisher's exact test wants a third. A Bayesian posterior wants something else entirely.
The corollary: if your sample-size calculator and your significance calculator come from different vendors, they are quietly making different assumptions, and your "we hit significance" moment will land in a slightly different place than your "we have enough sample" moment. This is the common failure mode behind sentences like "the calculator told us 30,000 visitors per variant, but the test is still showing inconclusive at 32,000."
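To make the matching-formulas point concrete, here is a sketch of the analysis-side test that pairs with the pooled, uncorrected sample-size formula above. The conversion counts are illustrative, chosen to show roughly 5.0% vs 5.5% observed at the planned sample size:

```python
import math
from scipy.stats import norm

def pooled_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test with pooled variance -- the analysis the
    pooled, uncorrected sample-size formula is sized for."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-tailed
    return z, p_value

# Illustrative counts: ~5.0% vs ~5.5% at the planned 31,234 per variant.
z, p = pooled_z_test(1562, 31234, 1718, 31234)
print(round(z, 2), round(p, 4))  # z ~ 2.8, p ~ 0.005
```

Observing exactly the designed lift at exactly the planned sample size lands near z ≈ 2.8, comfortably past 1.96; that headroom is what 80% power buys. Pair this test with a sample-size number from a different formula and the two moments stop lining up.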
## Why the gap usually does not matter (and when it does)
For most CRO programs running tests with sane traffic, the 0.2–0.4pp gap between calculators is operationally irrelevant. If your calculator says you need 31,234 visitors per variant and Speero says 31,800, the difference is less than an extra day of runtime at 12,000 weekly visitors. The decision you make about whether to run the test is the same.
The gap starts mattering in three specific situations:
**Low-traffic tests.** When you are running on a niche page where each week of additional runtime is expensive, the more conservative calculator marks as "not feasible" tests that the less conservative one would let you run. I have seen analysts shelve genuinely high-impact test ideas because their tool inflated the required runtime by 15%.
**High-stakes tests.** If a test result is going to drive a major product decision, the analyst needs to be able to defend the methodology. "Speero said it was significant" is a weaker defense than "we used the standard two-proportion z-test as documented in NIST §7.2.4.2 and pre-registered our sample size." Knowing which formula your tool uses matters here.
**Cross-team consistency.** When different teams use different calculators, the same observed lift gets called "decisive" by one team and "still inconclusive" by another. This is corrosive to trust in the experimentation function. Pick one method, document it, get everyone on it.
## The bottom line for analysts
Three things to take away if you have been seeing this in your own work:
1. **Pick one calculator and stick with it for a given test.** Switching between sample-size and significance calculators from different vendors is the most common way to end up confused about whether you have a winner. The math is consistent within a tool; it is not consistent across tools.
2. **Know which formula your tool uses.** If your vendor cannot tell you whether they apply a continuity correction, that is a signal worth paying attention to. The good ones publish their methodology. Speero has written publicly about the problem space; they describe the same gap I am describing here from the other side.
3. **Treat sub-half-percentage-point differences as noise; treat 1pp+ differences as a methodology question.** When two calculators disagree by less than half a percentage point on MDE for the same inputs, you are looking at iteration step granularity and approximation choice. When they disagree by more than that, one of them is making an assumption you should understand before trusting the output.
GrowthLayer's pre-test calculator is built on the uncorrected Casagrande-Pike formula for exactly the reason the gold-standard sample-size software gives:
"Although this adjustment is included in the formula because it was specified by Fleiss, Levin, and Paik (2003), in practice this adjustment is not recommended because it reduces the power and the actual alpha of the test procedure."
— NCSS PASS / Tests for Two Proportions
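To show what that adjustment does in practice, here is a sketch comparing the uncorrected number with a Fleiss-style continuity-corrected one, assuming the standard n/4 · (1 + √(1 + 4/(n·|p2 − p1|)))² form. In our scenario the correction adds roughly 400 visitors per variant:

```python
import math
from scipy.stats import norm

def n_uncorrected(p1, p2, alpha=0.05, power=0.80):
    # Pooled-variance two-proportion sample size, no correction.
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p2 - p1) ** 2

def n_corrected(p1, p2, alpha=0.05, power=0.80):
    """Fleiss-style continuity correction on top of the uncorrected n."""
    n = n_uncorrected(p1, p2, alpha, power)
    d = abs(p2 - p1)
    return (n / 4) * (1 + math.sqrt(1 + 4 / (n * d))) ** 2

print(math.ceil(n_uncorrected(0.05, 0.055)))  # 31234
print(round(n_corrected(0.05, 0.055)))        # ~31632
```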
In a future post I'll walk through the four most common sample-size formulas, when each was developed, and which one your favorite CRO calculator probably uses under the hood. The differences are smaller than the marketing makes them sound — but they matter when you are trying to defend a result.
## Sources
- NIST Engineering Statistics Handbook §7.2.4.2 — Sample sizes required
- NCSS / PASS — Tests for Two Proportions
- Casagrande, J.T., Pike, M.C., Smith, P.G. (1978). An improved approximate formula for calculating sample sizes for comparing two binomial distributions. Biometrics 34(3): 483–486.
- Fleiss, J.L., Tytun, A., Ury, H.K. (1980). A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics 36: 343–346.
- Evan Miller — Sample Size Calculator
- Evan Miller — On A/B sample size formulas (LinkedIn)
- Speero — The Problem with (Most) A/B Test Calculators
- Speero / CXL — Live calculator
- Chinese University of Hong Kong — Casagrande/Pike sample size reference