
Continuity Correction in A/B Testing: The 1980 Convention Modern Calculators Should Skip

_By Atticus Li — Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com._

GrowthLayer
6 min read

Editorial disclosure

This article lives on the canonical GrowthLayer blog path for indexing consistency. Review rules, sourcing rules, and update rules are documented in our editorial policy and methodology.


---

If your A/B test sample-size calculator gives you a number 1.5–2.5% larger than another calculator using the same inputs, the most likely reason is a continuity correction. Specifically, the Fleiss continuity correction layered on top of the standard Casagrande-Pike sample-size formula.

This sounds like a small statistical detail. It is not. It is the most common reason CRO analysts get inconsistent answers from different tools, and the recommended fix is the opposite of what most people expect: not working out which calculator applies the correction, but dropping the correction altogether.

The publishers of PASS, the gold-standard sample-size software, are blunt:

"Although this adjustment is included in the formula because it was specified by Fleiss, Levin, and Paik (2003), in practice this adjustment is not recommended because it reduces the power and the actual alpha of the test procedure."
NCSS PASS / Tests for Two Proportions documentation

NIST's Engineering Statistics Handbook presents the uncorrected formula as the default and treats continuity correction as an option. Modern biostatistics references warn that the related Yates correction is "overly conservative."

Here is the history that explains why this correction got bolted onto so many calculators in the first place, and why it almost never makes sense for a modern A/B test.

Where the correction came from

The continuity correction is not new. It traces back to Frank Yates in 1934, who proposed subtracting 0.5 from the absolute difference between observed and expected frequencies in a 2×2 chi-square test before squaring. The motivation: a chi-square test approximates a discrete distribution (binomial counts) with a continuous one (chi-square). The 0.5 adjustment was an attempt to correct for that discreteness.
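The mechanics of the 0.5 adjustment are easy to see in code. The sketch below is an illustration (not taken from Yates' paper): it computes Pearson's chi-square for a 2×2 table with and without the subtraction. The corrected statistic is always smaller, which is exactly why the corrected test is more conservative.

```python
from itertools import product

def chi_square_2x2(table, yates=False):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]],
    optionally applying Yates' continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, j in product(range(2), range(2)):
        observed = table[i][j]
        expected = row_totals[i] * col_totals[j] / n
        diff = abs(observed - expected)
        if yates:
            # Yates: pull each observed count 0.5 toward its expectation
            diff = max(diff - 0.5, 0.0)
        stat += diff ** 2 / expected
    return stat

# Illustrative table: conversions / non-conversions per arm
table = [[120, 880], [150, 850]]
uncorrected = chi_square_2x2(table)
corrected = chi_square_2x2(table, yates=True)
```

On this table the correction shaves the statistic from about 3.85 down to about 3.60, nudging a borderline result toward non-significance.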

In 1978, Casagrande, Pike, and Smith published the modern workhorse sample-size formula for two proportions in _Biometrics_ 34: 483–486. Their formula was uncorrected — designed to plan a test that would be analyzed with a standard chi-square or z-test.

In 1980, Fleiss, Tytun, and Ury published a parallel approximation in _Biometrics_ 36: 343–346 that aligned the planning formula with Yates-corrected chi-square analysis. The same year, Ury and Fleiss published an even more direct sister paper: "On approximate sample sizes for comparing two independent proportions with the use of Yates' correction."

The logic of these 1980 papers was tight: if your analysis test is going to apply Yates' correction (which shrinks the test statistic and reduces power), your planning math should account for that. Otherwise you'll plan for 80% power assuming an uncorrected test, analyze with a Yates-corrected test, and end up with effective power well below 80%.

The world in 1980 vs. the world in 2025

Three things have changed since 1980 that make the continuity correction obsolete for most A/B testing:

Computing power. In 1980, Fisher's exact test was painful to compute on a 2×2 table. Yates-corrected chi-square was the practical compromise — easier to calculate, more conservative than uncorrected chi-square, less computationally expensive than Fisher's exact. Today every laptop runs Fisher's exact in microseconds, and we don't actually need the Yates compromise at all.

Modern A/B testing tools default to uncorrected z-tests. Look at how Optimizely, VWO, AB Tasty, Evan Miller's calculator, and most CRO tools analyze tests: a two-proportion z-test, no continuity correction. The chi-square equivalent (Pearson chi-square) also doesn't apply Yates. So the planning calculation should match — and that means _not_ using the Fleiss adjustment.
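That uncorrected z-test fits in a few lines. This is a minimal sketch of the pooled-variance form (the input counts are hypothetical, and individual tools may differ in minor details such as pooled vs. unpooled variance):

```python
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test with pooled variance and
    no continuity correction."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical test: 5% vs 6% conversion, 5,000 visitors per arm
z, p = two_proportion_z_test(250, 5000, 300, 5000)
```

Nothing in the statistic pulls the observed difference toward zero, so planning math that assumes a Yates-style shrinkage is planning for a test you are not running.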

The Yates correction itself has fallen out of favor. Multiple modern biostats references describe Yates as "overly conservative." Wikipedia summarizes the consensus: "Despite recommendations to the contrary, medical researchers still routinely use the Yates-corrected chi-square statistic in analyses of 2×2 contingency tables… these 'corrected' statistics are overly conservative." When the underlying analysis correction is itself disputed, layering planning math on top of it stops making sense.

What the tax actually costs

The Fleiss adjustment bumps required sample size by about 1.5–2.5%. That sounds small. In real CRO work, here is what that tax buys you:

For a test with baseline 5%, target 10% relative lift, 95% confidence, 80% power:

  • Casagrande-Pike (uncorrected): 30,420 visitors per variant
  • Fleiss-corrected: 31,159 visitors per variant

739 extra visitors per variant. On a page receiving 5,000 visitors per variant per week, that's an extra day of test runtime. Across 100 tests per year, that's three months of additional cumulative runtime — for protection against a chi-square correction your tool isn't applying anyway.
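The comparison can be sketched with the standard normal-approximation sample-size formula plus the Fleiss-Tytun-Ury adjustment. One hedge: calculators differ in the exact variance term and rounding conventions, so this sketch lands near, but not exactly on, the figures above. The formula variant used here is an assumption, not a reconstruction of any specific tool's math.

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80, fleiss=False):
    """Per-variant sample size to detect p1 vs p2 with a two-sided test,
    using the normal-approximation formula; fleiss=True layers the
    Fleiss-Tytun-Ury continuity correction on top."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_beta = z(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / delta ** 2)
    if fleiss:
        # Fleiss-Tytun-Ury: n' = (n/4) * (1 + sqrt(1 + 4/(n*delta)))^2
        n = (n / 4) * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return ceil(n)

plain = sample_size_two_proportions(0.05, 0.055)               # uncorrected
padded = sample_size_two_proportions(0.05, 0.055, fleiss=True)  # Fleiss
```

Whatever the exact variant, the shape of the result is the same: the Fleiss flag adds a low-single-digit-percent surcharge to every test you plan.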

The cost compounds for low-traffic tests. On a niche page getting 500 visitors per variant per week, the same Fleiss tax pushes runtime from 60.8 weeks to 62.3 weeks. Some of those tests fall off the "feasible" side of the line entirely.

What the standards bodies actually say

Three authoritative references explicitly recommend against using the Fleiss continuity correction in modern A/B testing:

NCSS PASS (the gold-standard commercial sample-size software):

"Although this adjustment is included in the formula because it was specified by Fleiss, Levin, and Paik (2003), in practice this adjustment is not recommended because it reduces the power and the actual alpha of the test procedure."

That is unusually direct language for technical documentation. NCSS includes the Fleiss adjustment as a feature of their software, and then explicitly warns users not to use it.

NIST Engineering Statistics Handbook §7.2.4.2 (source) presents the uncorrected normal approximation as the default sample-size formula for two proportions. Continuity correction appears as an optional alternative, not the recommended path.

Evan Miller, who built the most widely used free A/B test sample-size calculator, has written that "there's no universal Sample Size Formula. Instead, you can't have a Sample Size Formula without a corresponding Analytic Formula." His calculator uses the uncorrected formula because it's the planning math that matches the analysis tools most CRO programs actually use.

When the correction _does_ make sense

To be fair to Fleiss et al., there are three scenarios where the continuity correction is still appropriate:

You analyze with Fisher's exact test. Fisher's exact is more conservative than uncorrected chi-square because it's computing exact probabilities on a discrete distribution rather than approximating with a continuous one. If your analysis uses Fisher's exact (some clinical trials do), continuity-corrected planning math matches.
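To make "exact probabilities on a discrete distribution" concrete, here is a minimal pure-Python two-sided Fisher's exact test (an illustration, not production code): it enumerates every 2×2 table with the observed margins and sums the hypergeometric probabilities of the tables no more likely than the one observed.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    r1 = a + b          # row 1 total
    c1 = a + c          # column 1 total
    denom = comb(n, c1)

    def pmf(k):
        # P(top-left cell = k) under fixed margins (hypergeometric)
        return comb(r1, k) * comb(n - r1, c1 - k) / denom

    p_obs = pmf(a)
    lo = max(0, c1 - (n - r1))
    hi = min(r1, c1)
    # Sum over all achievable tables at most as likely as the observed one
    return sum(pmf(k) for k in range(lo, hi + 1)
               if pmf(k) <= p_obs * (1 + 1e-9))

# Fisher's classic "lady tasting tea" table: p = 34/70, about 0.486
p = fisher_exact_two_sided(1, 3, 3, 1)
```

Because the test walks the discrete distribution directly, continuity-corrected planning math is the matching planning convention here.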

Very small expected cell counts. When your expected count in any cell of the 2×2 table drops below 5, the normal approximation breaks down and conservative adjustments help. This is rare in CRO at typical traffic volumes: at a 5% conversion rate, the expected-conversion cells clear the threshold once each variant has a few hundred visitors.
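Each expected count is (row total × column total) ÷ grand total, so the check takes a few lines. The helper and tables below are hypothetical, chosen to show one comfortable case and one borderline one:

```python
def min_expected_count(table):
    """Smallest expected cell count of a 2x2 table [[a, b], [c, d]]
    under independence: (row total * column total) / grand total."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    return min(r * c / n for r in row_totals for c in col_totals)

# Typical CRO test: 5,000 visitors per arm, ~5-6% conversion
typical = min_expected_count([[250, 4750], [300, 4700]])  # 275.0, far above 5
# Tiny test: 50 visitors per arm, ~5% conversion
tiny = min_expected_count([[3, 47], [2, 48]])             # 2.5, below 5
```

If the minimum drops below 5, reach for an exact test rather than a continuity-corrected approximation.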

Regulatory contexts that require it. Some clinical trial designs explicitly call for Yates-corrected analysis with matching Fleiss-corrected planning. If you are running an FDA-regulated trial, follow the protocol your statistician registered. CRO is not that context.

If none of these three scenarios apply to your work, the Fleiss adjustment is doing nothing for you except costing you sample size.

What to do about it

Three concrete actions for an experimentation team:

Audit your sample-size calculator's methodology. If you can't tell from the documentation whether your tool applies a continuity correction, ask the vendor. The good ones publish their math. The bad ones don't, and you should treat that as a yellow flag.

Standardize on Casagrande-Pike-Smith without continuity correction. It's NIST's default, NCSS recommends it, Evan Miller's calculator uses it, and it's what our pre-test calculator is built on. Document the choice in your team's experimentation handbook so future analysts don't second-guess the result.

Stop trying to match across tools that use different formulas. If your planning calculator and your significance calculator use different conventions, the gap will surface as confusion at decision time. Pick one stack and stick with it.

The 1980 papers that introduced the Fleiss adjustment did so for a real reason — aligning planning math with the analysis convention of that era. The convention has changed. The adjustment is now an artifact, and as NCSS PASS says directly, it's an artifact that "is not recommended."


About the author

GrowthLayer

GrowthLayer is the system of record for experimentation knowledge. We help growth teams capture, organize, and learn from every A/B test they run.
