
The 4 Sample Size Formulas Inside Every A/B Test Calculator (Casagrande-Pike, Fleiss, Lachin, Cohen-h)

_By Atticus Li — Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com._

GrowthLayer
6 min read

Editorial disclosure

This article lives on the canonical GrowthLayer blog path for indexing consistency. Our review, sourcing, and update rules are documented in our editorial policy and methodology.


---

If you have spent any time comparing A/B test sample size calculators, you have probably noticed that they disagree with each other for the same inputs. The reason is not that one of them is broken. The reason is that there are four different statistical formulas in widespread use for the same question — "how many visitors per variant do I need?" — and each calculator picks one.

This post is the reference card I wish I had when I started running tests. For each of the four formulas: who developed it, what it actually computes, when it's appropriate, and which calculators use it.

The question all four formulas answer

All four formulas are answering the same setup. You have a control group and a variant group. The control converts at rate p₀. You want to detect the variant converting at rate p₁ (where p₁ - p₀ = your minimum detectable effect, or MDE). You want a Type I error rate of α (typically 0.05) and statistical power of 1 - β (typically 0.80). How big does each group need to be?

There is no single closed-form answer. Every "formula" you will see is an approximation to the same underlying question, and the approximations differ in how they handle two specific things:

  1. The variance under the null hypothesis. Some formulas use a pooled variance estimate (assuming p₀ = p₁ under the null). Others use just the control's variance.
  2. Continuity correction. Some formulas add a small adjustment to align the planning calculation with a chi-square test that uses Yates' continuity correction.

These are the two knobs. Combinations of these knobs give you the four formulas in widespread use.

Formula 1: Casagrande-Pike-Smith (1978)

This is the workhorse. Most modern A/B test calculators are built on it.

Reference: Casagrande, J.T., Pike, M.C., Smith, P.G. (1978). "An improved approximate formula for calculating sample sizes for comparing two binomial distributions." Biometrics 34(3): 483–486.

Knob settings: Pooled variance under null, no continuity correction.

The math: For per-arm sample size n with equal allocation:

```
n = (z_α/2 √(2p̄(1-p̄)) + z_β √(p₀(1-p₀) + p₁(1-p₁)))² / (p₁ - p₀)²
```

Where p̄ = (p₀ + p₁) / 2.
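As a minimal sketch in Python (standard library only; the function name is mine, not from any particular calculator):

```python
from math import sqrt
from statistics import NormalDist

def casagrande_pike_n(p0, p1, alpha=0.05, power=0.80):
    """Per-arm sample size: pooled variance under the null, no continuity correction."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_b = NormalDist().inv_cdf(power)          # quantile for the target power
    p_bar = (p0 + p1) / 2                      # pooled proportion under the null
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p0 * (1 - p0) + p1 * (1 - p1)))
    return (numerator / (p1 - p0)) ** 2
```

In practice you would round the result up with `math.ceil`, since a fractional visitor cannot be allocated.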

Accuracy: The Chinese University of Hong Kong's biostatistics reference notes the percentage error is "no greater than 1%" across wide parameter ranges.

Used by: Evan Miller's sample size calculator, most academic-leaning calculators, our own pre-test calculator.

When it's appropriate: Default for nearly any frequentist A/B test analyzed with a standard z-test or uncorrected chi-square. This is what you should use unless you have a specific reason not to.

Formula 2: Fleiss-Tytun-Ury (1980) with continuity correction

The continuity-corrected sibling of Casagrande-Pike, published two years later for a specific use case.

Reference: Fleiss, J.L., Tytun, A., Ury, H.K. (1980). "A simple approximation for calculating sample sizes for comparing independent proportions." Biometrics 36: 343–346. Also Ury and Fleiss (1980) "On approximate sample sizes for comparing two independent proportions with the use of Yates' correction." Biometrics 36: 347–351.

Knob settings: Pooled variance under null, with continuity correction (the "Fleiss adjustment").

The math: Take the Casagrande-Pike formula, then apply a continuity correction:

```
n_corrected = (n_uncorrected / 4) × (1 + √(1 + 4 / (n_uncorrected × |p₁ - p₀|)))²
```

This bumps the required sample size up by roughly 1.5–2.5%.
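A minimal sketch of the adjustment (the helper name is mine, not from the Fleiss paper):

```python
from math import sqrt

def fleiss_corrected_n(n_uncorrected, p0, p1):
    """Apply the Fleiss continuity correction on top of an uncorrected per-arm n."""
    d = abs(p1 - p0)
    return (n_uncorrected / 4) * (1 + sqrt(1 + 4 / (n_uncorrected * d))) ** 2
```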

Why it exists: In 1980, the standard analysis was a chi-square test with Yates' continuity correction (because Fisher's exact test was computationally expensive on the hardware of the time). If the analysis is going to use a continuity correction, the planning formula should match. Otherwise the test will be slightly under-powered.

Used by: Speero / CXL's calculator appears to apply this on top of Casagrande-Pike. Some clinical trial planning tools.

When it's appropriate: When your analysis test will use Yates' continuity correction or Fisher's exact test. Almost never the case in modern A/B testing.

The catch: NCSS, the publisher of PASS sample-size software, is direct about this:

"Although this adjustment is included in the formula because it was specified by Fleiss, Levin, and Paik (2003), in practice this adjustment is not recommended because it reduces the power and the actual alpha of the test procedure."

In modern A/B testing the analysis is almost always a plain z-test of two proportions or a chi-square _without_ Yates correction. Applying the Fleiss adjustment to your planning math without a corresponding correction in your analysis is double-conservatism with no statistical justification — you are paying for sample size protection against an analysis convention you are not using.
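For reference, the analysis that the uncorrected planning math is supposed to match is the plain pooled z-test of two proportions. A minimal sketch, assuming Python's standard library:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """Plain pooled z-test of two proportions -- no Yates continuity correction."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed
    return z, p_value
```

If your analysis looks like this, the Fleiss adjustment in your planning step is buying you nothing.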

Formula 3: Lachin (1981) / Wittes simple normal approximation

The simplest formula in widespread use. Cleaner math, slightly less accurate at extreme proportions.

Reference: Lachin, J.M. (1981). "Introduction to sample size determination and power analysis for clinical trials." Controlled Clinical Trials 2(2): 93–113.

Knob settings: Separate variances (no pooling under the null), no continuity correction.

The math:

```
n = (z_α/2 √(2p₀(1-p₀)) + z_β √(p₀(1-p₀) + p₁(1-p₁)))² / (p₁ - p₀)²
```

Notice the alpha term uses just p₀'s variance, not the pooled estimate. This produces smaller required sample sizes than Casagrande-Pike — about 5–10% smaller depending on the proportions.
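The same sketch as before with the pooled p̄ swapped for the control's p₀ in the alpha term (function name mine):

```python
from math import sqrt
from statistics import NormalDist

def lachin_n(p0, p1, alpha=0.05, power=0.80):
    """Per-arm sample size using the control's variance under the null (no pooling)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    numerator = (z_a * sqrt(2 * p0 * (1 - p0))              # control variance only
                 + z_b * sqrt(p0 * (1 - p0) + p1 * (1 - p1)))
    return (numerator / (p1 - p0)) ** 2
```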

Used by: Some clinical trials software, some R packages, Speero's client-side fallback function (though their server appears to use something else).

When it's appropriate: When you want a quick approximation and the proportions are relatively close together. Less defensible than Casagrande-Pike for formal pre-registration because it's a less common convention.

Formula 4: Cohen-h arc-sine transformation

A different family entirely. Used in psychology and cases where you want a stable effect-size measure.

Reference: Cohen, J. (1988). "Statistical Power Analysis for the Behavioral Sciences" (2nd ed.). Lawrence Erlbaum.

Knob settings: Variance-stabilizing arc-sine transformation, no continuity correction.

The math: First compute Cohen's h (the effect size):

```

h = 2 × arcsin(√p₁) - 2 × arcsin(√p₀)

```

Then the per-arm sample size is:

```
n = 2 × (z_α/2 + z_β)² / h²
```

Why it exists: The arc-sine transformation gives a variance that doesn't depend on the underlying proportion, which simplifies the math and makes Cohen's h portable across studies with different baselines.
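The two steps combine into a few lines of Python (standard library only; function name mine, following the same per-arm convention as `pwr.2p.test`):

```python
from math import asin, sqrt
from statistics import NormalDist

def cohen_h_n(p0, p1, alpha=0.05, power=0.80):
    """Per-arm sample size via Cohen's h (arc-sine variance-stabilizing transform)."""
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p0))  # effect size, independent of baseline
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return 2 * (z_a + z_b) ** 2 / h ** 2
```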

Used by: Some R packages (pwr.2p.test), psychology and education research more than CRO.

When it's appropriate: When you want to report effect sizes that are comparable across studies with different baseline rates. Not commonly used in commercial A/B testing tools.

Side-by-side comparison

For the same realistic CRO inputs (p₀ = 5%, target 10% relative lift so p₁ = 5.5%, α = 0.05 two-tailed, power = 0.80), here is what each formula returns for required sample size per variant:

| Formula | Per-variant n | vs. Casagrande-Pike |
| ------------------------------- | ------------- | ------------------- |
| Lachin (separate variances) | 30,420 | -7% |
| Cohen-h (arc-sine) | 31,540 | -3% |
| Casagrande-Pike (pooled, no CC) | 32,632 | baseline |
| Fleiss (pooled, with CC) | 33,341 | +2% |

The spread between the smallest (Lachin) and largest (Fleiss) is about 9% on required sample size. In practical terms, on a test getting 10,000 visitors per week split evenly across two variants (5,000 per variant), that's the difference between a 6.1-week test and a 6.7-week test. Real, but not enormous.

How to pick

Three rules of thumb that have served me well across hundreds of tests:

Default to Casagrande-Pike without continuity correction. It is the consensus standard, NIST presents it first, NCSS PASS recommends it over Fleiss, and it matches the test you almost certainly run for analysis (a plain z-test of two proportions).

Only use Fleiss if your analysis uses Yates' correction. Which is to say: almost never in CRO. If you find yourself using a tool that applies Fleiss by default, ask whether the corresponding tax on test runtime is buying you anything you actually need.

Pick one formula and stick with it. The biggest source of confusion in experimentation programs is mixing tools that use different formulas. Same inputs into Casagrande-Pike vs Fleiss vs Cohen-h vs Lachin will give you four numbers. Document which one your team uses and put it in the calculator's UI so future analysts don't second-guess the result.

I would much rather see a CRO team use any one of these formulas consistently than see them switch between three depending on which tab they have open. Consistency lets you build calibrated intuition. Switching destroys it.


About the author

GrowthLayer

GrowthLayer is the system of record for experimentation knowledge. We help growth teams capture, organize, and learn from every A/B test they run.
