
When Bayesian Saved the Test Frequentist Would Have Killed: A Practical Guide to Statistical Methods in CRO

6 tests had >95% Bayesian probability but didn't reach frequentist significance. One was shipped and won. Here's the practical guide to choosing your statistical method.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
15 min read


A/B Testing · Experimentation Strategy · Statistical Methods · CRO Methodology · Experimentation at Scale

The frequentist versus Bayesian debate in CRO is usually framed as a philosophical disagreement. Two camps, two worldviews, two sets of tools. Pick one and defend it.

That framing is not useful. Both methods exist, both have legitimate applications, and the question of which to use in a given situation has a practical answer that does not require you to resolve a philosophy of probability.

What I found after auditing an enterprise experimentation program, running both frequentist and Bayesian analysis in parallel on every test, is that the two methods consistently produced different answers to different questions. And the decisions made using only one of them, without reference to the other, were sometimes wrong in ways that had real business consequences.

Six tests in the dataset had Bayesian probability of the variant beating control above 95% — but did not reach frequentist significance at the conventional p-value threshold. They were called "inconclusive" and killed. One of those tests was later revisited, the variant shipped, and it produced hundreds of additional enrollments. The frequentist read had been correct on its own terms — it correctly said the evidence was insufficient to rule out chance. But it answered a different question than the business needed answered.

This article is a practical account of when each method is appropriate, when each will lead you astray, and the unglamorous statistical decisions — one-tailed vs two-tailed tests, Bonferroni corrections, sample ratio mismatch — that matter more to your program's reliability than the frequentist-versus-Bayesian question ever will.

The Test That Bayesian Saved

One test in the dataset stands out as the clearest argument for running both methods in parallel. It was testing a redesigned verification page — the step in the enrollment flow where users confirm their personal information before submitting.

The redesign simplified the layout, reduced the number of visible fields, and added a progress indicator. The hypothesis was that reducing the cognitive load at this step would reduce abandonment. The baseline conversion rate on this page was 89% — already high, which meant the test was chasing a small absolute improvement on a large base.

After running to the planned sample size, the frequentist result was inconclusive. The p-value was above the threshold; the confidence interval on the lift estimate crossed zero. Strictly interpreted, the test did not produce sufficient evidence to reject the null hypothesis. The standard frequentist call was "no significant difference — do not ship."

The Bayesian analysis told a completely different story.

The posterior probability that the variant was better than control was 1.0 at the precision the tooling reports: the posterior distribution had essentially no probability mass on "control is better." The expected loss, meaning the expected degradation in conversion rate if we shipped the variant and it turned out to be wrong, was effectively zero at the level of precision that matters to the business.

The reason the two methods disagreed was the high baseline. With a baseline conversion rate of 89%, a meaningful lift in absolute terms (say, 1.5 percentage points) is a small relative lift (about 1.7%), and detecting a relative improvement that small at conventional significance thresholds requires a very large sample. The Bayesian framework, working directly with the posterior distribution over the conversion rates, expressed the accumulated evidence as a probability that the variant was better, and that probability was decisive even though the sample was too small to rule out chance by frequentist standards.
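
To make this kind of disagreement concrete, here is a minimal sketch on invented counts (not the program's actual data), assuming a pooled two-proportion z-test on the frequentist side and flat Beta(1,1) priors with Monte Carlo sampling on the Bayesian side:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

conv_a, n_a = 4450, 5000   # control: 89.0% conversion (invented counts)
conv_b, n_b = 4505, 5000   # variant: 90.1% conversion

# Frequentist read: pooled two-proportion z-test, two-tailed.
p_a, p_b = conv_a / n_a, conv_b / n_b
pooled = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_two_tailed = 2 * norm.sf(abs(z))

# Bayesian read: Monte Carlo samples from the Beta posteriors.
control = rng.beta(1 + conv_a, 1 + n_a - conv_a, 200_000)
variant = rng.beta(1 + conv_b, 1 + n_b - conv_b, 200_000)
p_variant_wins = np.mean(variant > control)

print(f"two-tailed p    = {p_two_tailed:.3f}")    # ~0.07: "inconclusive"
print(f"P(variant wins) = {p_variant_wins:.3f}")  # ~0.96: directionally strong
```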

The business decision was to ship the variant based on the Bayesian read. It was the right call. The test produced hundreds of additional enrollments over the following quarter — a genuine, durable improvement that would have been permanently discarded if the team had relied solely on the frequentist result.

Key Takeaway: Frequentist significance and Bayesian probability answer different questions. A test can have P(variant wins) effectively at 1.0 and an expected loss near zero while still failing to reach frequentist significance, particularly when baseline conversion rates are high and the absolute effect is small. Running both methods exposes this gap.

The 6 "Bayesian Directional" Tests That Were Killed as Inconclusive

The verification page test was not an isolated case. In total, six tests in the dataset had Bayesian probability of the variant beating control above 95%, but did not reach the frequentist significance threshold used by the program.

All six were called "inconclusive" based on the frequentist read and were not shipped.

I am not arguing that all six should have been shipped. The Bayesian result at 95% probability is not the same as the verification page result at effectively 100% probability with near-zero expected loss. Some of those six tests were genuinely ambiguous — the posterior was wide, the expected loss was not negligible, and the business decision was appropriately cautious.

But across the six, the pattern is clear: a testing program that uses only frequentist analysis will systematically call some real winners inconclusive, particularly when:

- Baseline conversion rates are high, requiring large samples to detect meaningful absolute changes
- Traffic volumes are moderate and the business cannot afford to run tests long enough to reach frequentist significance
- The effect being tested is small in relative terms but meaningful in absolute terms given high traffic volumes

The cost of this systematic error is not just the individual missed wins. It is the signal loss across the program. When six tests with directionally strong positive results are filed as inconclusive, the program's learning accumulates more slowly. Hypotheses that deserved follow-up investigation are not pursued. Teams draw incorrect conclusions about what works.

When Bayesian Is Better: High Baseline, Business Decisions, Expected Loss

The Bayesian framework is particularly well-suited to three scenarios that arise regularly in enterprise CRO.

High baseline conversion rates. When your baseline CVR is above 70%, small absolute improvements translate to small relative improvements, and frequentist tests require enormous samples to achieve significance. Bayesian analysis can express reasonable directional confidence with more modest sample sizes, because the posterior directly quantifies how much of its probability mass favors the variant rather than asking whether chance can be ruled out at a fixed error rate.

Business-oriented decision framing. The frequentist question is: "Can I rule out chance?" The Bayesian question is: "What is the probability this variant is better, and how much do I stand to lose if I ship it and I am wrong?" The second question is almost always more relevant to a business making a shipping decision. Expected loss as a decision criterion — "ship if the expected loss from a wrong decision is less than X% of baseline" — translates statistical evidence directly into risk-adjusted business language.

Continuous monitoring environments. Frequentist tests have a specific pathology around continuous monitoring: peeking at results before the planned sample size is reached inflates false positive rates. Bayesian tests do not have the same fixed-horizon requirement — the posterior can be monitored continuously without inflating error rates, though interpretation requires care.
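
A quick simulation shows the pathology directly. This is a minimal sketch with invented parameters: simulated A/A tests with no true difference, where an analyst peeks at ten evenly spaced interim looks and stops at the first "significant" result:

```python
import numpy as np

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims=2000, n_per_arm=20_000, n_looks=10,
                                base_rate=0.10):
    """Fraction of A/A tests (no true effect) that cross p < 0.05
    at ANY of `n_looks` evenly spaced interim analyses."""
    z_crit = 1.96  # two-tailed critical value for alpha = 0.05
    looks = np.linspace(n_per_arm // n_looks, n_per_arm, n_looks, dtype=int)
    false_positives = 0
    for _ in range(n_sims):
        a = rng.random(n_per_arm) < base_rate  # control conversions
        b = rng.random(n_per_arm) < base_rate  # variant: same true rate
        for n in looks:
            p_a, p_b = a[:n].mean(), b[:n].mean()
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(p_b - p_a) / se > z_crit:
                false_positives += 1
                break  # a peeker would have stopped and "shipped" here
    return false_positives / n_sims

# With ten looks the realized false positive rate lands around 0.15-0.20,
# not the nominal 0.05.
print(peeking_false_positive_rate())
```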

When Frequentist Is Better: Regulatory Contexts, False Positive Control

Frequentist methods retain important advantages in specific contexts.

Regulatory or compliance contexts. When A/B test results will be submitted as evidence to a regulator, compliance function, or external auditor, frequentist p-values are the lingua franca. The null hypothesis testing framework has a long history in scientific and regulatory contexts, and the p<0.05 threshold has institutional meaning that Bayesian posterior probabilities do not yet carry in most compliance environments.

Controlling false positive rates across a program. When you are running many tests and the primary concern is minimizing the rate at which you ship variants that are not actually better than control, frequentist testing with appropriate corrections offers precise control over the Type I error rate. Bayesian testing offers control over expected loss, which is a different guarantee.

Testing on low-baseline metrics where false positives are costly. When the primary metric has a very low baseline (under 5%) and the cost of a wrong decision is high — shipping a variant that turns out to be worse would be difficult to reverse, or would incur significant operational cost — the explicit false positive rate control of frequentist testing is valuable.

Key Takeaway: Bayesian excels when the decision question is "how likely is this variant to be better, and what is the risk of shipping it?" Frequentist excels when the question is "can I rule out chance with a specified error rate?" These are genuinely different questions, and the right method depends on which one you need answered.

The Decisions That Matter More Than Frequentist vs Bayesian

I want to be direct about something: the frequentist-versus-Bayesian debate, in most CRO programs, is less important than several unglamorous statistical decisions that receive far less attention. Here are the four that actually drove meaningful errors in this program.

One-Tailed vs Two-Tailed Tests

A one-tailed test asks: "Is the variant better than control?" A two-tailed test asks: "Is the variant different from control in either direction?"

If you are testing a change and you do not care whether the variant is worse, because you only want to ship it if it is better, a one-tailed test is statistically appropriate and meaningfully more sensitive at the same sample size: at alpha = 0.05 the critical z-value drops from 1.96 to 1.645, and depending on the power level the two-tailed version can require roughly 20-40% more sample to detect the same effect. This is a substantial difference.

In the program I audited, tests were run as two-tailed by default, including several where the one-tailed framing was clearly the correct business question. The consequence was tests that were called inconclusive under two-tailed analysis but would have reached significance under the correct one-tailed framing.

The decision rule for choosing is simple: if you would ship the variant only if it is better (and would not ship it even if it were significantly worse), use a one-tailed test. If a significantly worse result would trigger a different decision than a significantly better one, use a two-tailed test. The sensitivity gap is easy to see numerically, as the sketch below shows.
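
A tiny illustration, with an invented z value, of the same test statistic clearing the one-tailed bar but missing the two-tailed one:

```python
from scipy.stats import norm

z = 1.80  # hypothetical observed z from a variant-vs-control test
print(f"one-tailed p = {norm.sf(z):.4f}")      # 0.0359 -> clears alpha = 0.05
print(f"two-tailed p = {2 * norm.sf(z):.4f}")  # 0.0719 -> misses alpha = 0.05
```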

Bonferroni Corrections for Multi-Variant Tests

When you test three variants against a control, you are running three simultaneous significance tests. Running three tests at the same significance threshold as a single test inflates your program-wide false positive rate: if each test has a 5% false positive rate, three simultaneous tests have a combined false positive rate closer to 14%.

The Bonferroni correction adjusts the significance threshold for each individual test downward to control the family-wise error rate. For three simultaneous comparisons at a 5% family-wise error rate, each individual comparison uses a threshold of approximately 1.67%.

The program I audited rarely applied Bonferroni corrections on multi-arm tests. Several multi-variant results that were called significant at the standard threshold would not have survived the correct multiple-comparison adjustment, and some of them were likely false positives.

This matters most in multi-arm tests where all arms look roughly similar and one arm crosses the significance threshold by a small margin. That is exactly the configuration where the Bonferroni correction changes the call.
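
As a sketch, with invented p-values for a three-arm test, the correction itself is a one-line threshold change:

```python
alpha_family = 0.05
p_values = {"variant_a": 0.009, "variant_b": 0.041, "variant_c": 0.36}

# Bonferroni: divide the family-wise alpha by the number of comparisons.
alpha_per_test = alpha_family / len(p_values)  # ~0.0167 for three arms
print(f"per-comparison threshold: {alpha_per_test:.4f}")

for name, p in p_values.items():
    naive = "win" if p < alpha_family else "inconclusive"
    corrected = "win" if p < alpha_per_test else "inconclusive"
    print(f"{name}: p={p:.3f}  uncorrected={naive}  bonferroni={corrected}")

# variant_b clears the naive 5% threshold but not the corrected one,
# exactly the marginal multi-arm configuration described above.
```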

Sample Ratio Mismatch Detection

Sample ratio mismatch (SRM) occurs when the proportion of users assigned to each variant does not match the intended split. If you set a 50/50 split and your analytics shows 53% in the control and 47% in the variant, something is wrong with the randomization, the data collection, or both.

SRM is one of the most underdiagnosed sources of invalid test results. In the dataset I audited, five tests showed SRM at a level that warranted investigation. Two of those tests were eventually called winners — but in both cases, the SRM was concentrated on one device type (specifically, a known issue with the testing platform's cookie handling on certain mobile browsers) and did not affect the overall sample integrity significantly. The three remaining tests with SRM were more ambiguous and should have been rerun.

An SRM check should be a standard part of every test's results analysis, not something done only when results look suspicious. The chi-square test for sample ratio mismatch takes less than two minutes and can invalidate or qualify a result that might otherwise be called with false confidence.
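
In practice the check is a few lines. A minimal sketch with invented counts, using scipy's chi-square goodness-of-fit test:

```python
from scipy.stats import chisquare

observed = [53_100, 46_900]      # users actually counted in each arm
intended = [0.5, 0.5]            # configured 50/50 allocation
total = sum(observed)
expected = [total * share for share in intended]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.2e}")

# At this scale a 53/47 split is wildly improbable under a true 50/50
# assignment, so the mismatch should be investigated before the test
# result is trusted.
if p_value < 0.001:
    print("SRM detected: check randomization, tracking, and bot filtering.")
```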

The Significance Threshold Problem

An analyst on the team stopped a test after reviewing the dashboard and noting that the test had reached "69% significance." She interpreted this as the test being close to statistically significant — nearly at 95%.

This is a common and consequential misunderstanding. A significance level of 69% means the test has not produced evidence against the null hypothesis. It does not mean the test is two-thirds of the way to being significant. There is no meaningful distance between 69% and 95% on a scale that matters for decision-making. A 69% result is far closer to "no evidence" than to "significant."

The framing "69% significant" is not standard statistical language, and the concept it implies — that significance is a continuous scale you climb toward — is incorrect. Either the test meets the pre-specified threshold or it does not. Watching significance levels tick upward and stopping when they feel high enough is precisely the peeking behavior that inflates false positive rates.

This happened because the team lacked a shared, documented standard for what the significance threshold was and what it meant. The fix is not technical — it is a written decision protocol, shared with everyone who reads test results, that defines what the threshold is and what actions it authorizes.

Key Takeaway: One-tailed vs two-tailed choice, Bonferroni corrections for multi-arm tests, SRM detection, and a documented significance threshold are the four statistical decisions that drove the most errors in this program. These are worth more attention than the frequentist-versus-Bayesian debate in most testing programs.

SRM Detection in Practice: 5 Tests, 2 Still Winners

Because SRM deserves specific attention: here is how the five SRM-affected tests in the dataset broke down.

Two tests showed SRM concentrated on Android Chrome — a specific device-browser combination where the testing platform had a known issue with cookie persistence after certain in-app browser sessions. The SRM affected only this segment, and the segment was small enough (approximately 8% of total traffic) that excluding it did not materially change the overall result. Both tests were called winners after the SRM-affected segment was excluded and the remaining data was validated.

One test showed SRM that appeared to be caused by a bot filtering discrepancy: the control's bot filtering was removing a higher proportion of sessions than the variant's, resulting in slightly different "clean" sample sizes. The fix was to apply consistent filtering to both arms and re-analyze. The result remained positive but at a lower magnitude.

Two tests showed SRM without a clear cause. In both cases, the SRM was substantial enough to invalidate the results — a 46/54 split on a 50/50 test, where the asymmetry was not explained by any known technical issue. Both tests were rerun.

The detection process for all five was identical: compare the observed traffic split to the configured split using a chi-square test, with a threshold that flags anything more than a 2-3% deviation as worth investigating. This check takes minutes and is a mandatory step in the results review process I now build into every program.

Expected Loss as a Decision Framework

The most practically useful output of Bayesian analysis for business decision-making is expected loss: the expected degradation in your primary metric if you ship the variant and the true effect turns out to be zero or negative.

The idea is intuitive. If there is a 95% probability that the variant is better, there is a 5% probability that shipping it results in a loss. The expected loss is the probability-weighted average of those negative outcomes: the posterior mass on "variant is worse," with each outcome weighted by how much worse it is.

For the verification page test described earlier, the expected loss was near zero because the posterior had essentially no probability mass on negative outcomes. For a test with 65% Bayesian probability of winning, the expected loss might be meaningfully large, depending on the width of the posterior distribution.

The decision rule I use: ship if expected loss is less than a threshold you set based on the reversibility of the change and the operational cost of being wrong. For easily reversible changes (copy, layout, CTA text), the threshold can be higher. For changes that are difficult to reverse (structural funnel changes, pricing, contract terms), the threshold should be lower.
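
Here is a minimal sketch of that readout, reusing the invented counts from the earlier example, with flat Beta(1,1) priors and Monte Carlo sampling; this is one common way to compute expected loss, not necessarily what any particular platform implements:

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_readout(conv_a, n_a, conv_b, n_b, n_samples=200_000):
    """Posterior P(variant > control) and expected loss of shipping variant."""
    # Beta(1 + successes, 1 + failures): the posterior under a flat prior.
    control = rng.beta(1 + conv_a, 1 + n_a - conv_a, n_samples)
    variant = rng.beta(1 + conv_b, 1 + n_b - conv_b, n_samples)
    p_win = np.mean(variant > control)
    # Expected loss: mean shortfall in conversion rate if the variant ships
    # and control was actually better (zero in the draws where it wasn't).
    expected_loss = np.mean(np.maximum(control - variant, 0.0))
    return p_win, expected_loss

p_win, exp_loss = bayesian_readout(conv_a=4450, n_a=5000, conv_b=4505, n_b=5000)
threshold = 0.001  # looser bar for an easily reversible change, e.g. copy
print(f"P(variant wins) = {p_win:.3f}, expected loss = {exp_loss:.5f}")
print("ship" if exp_loss < threshold else "hold for more data")
```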

This framework translates the statistical output — a posterior distribution — into a business decision format that stakeholders can engage with directly. "We are 85% confident the variant wins, and even if we are wrong, we expect to lose less than 0.2% of baseline conversion" is a more useful statement for a business decision than "p=0.08, not significant."

A Practical Decision Tree for Choosing Your Statistical Approach

Here is the decision tree I use at the start of every test design:

1. Is the context regulatory or compliance-driven? Yes → Use frequentist with pre-specified significance threshold. Document the threshold before the test starts. No → Proceed to step 2.

2. Is the primary concern controlling false positive rate across a high-volume testing program? Yes → Use frequentist with Bonferroni corrections for multi-arm tests. Apply SRM checks on every test. No → Proceed to step 3.

3. Is the baseline conversion rate above 60%, or is the expected effect small in relative terms? Yes → Run both frequentist and Bayesian in parallel. Use expected loss as the primary decision criterion. No → Proceed to step 4.

4. Is there a strong business reason to make a decision before the frequentist sample size is reached? Yes → Use Bayesian with expected loss threshold as the decision criterion. Document the threshold pre-test. No → Use frequentist. Pre-specify the significance threshold, direction (one-tailed vs two-tailed), and minimum detectable effect before the test starts.

In all cases: run an SRM check before calling results. Document the statistical approach in the test brief before launch. Do not change the statistical approach after seeing preliminary results.
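
For teams that want the tree enforced rather than remembered, it is trivial to encode as a checklist function. The names and flags below are illustrative, not a real API:

```python
def choose_statistical_approach(regulatory: bool,
                                fpr_control_is_primary: bool,
                                high_baseline_or_small_effect: bool,
                                must_decide_early: bool) -> str:
    """Walks the four-step tree above; flag names are illustrative."""
    if regulatory:
        return "frequentist; pre-specified threshold, documented before launch"
    if fpr_control_is_primary:
        return "frequentist with Bonferroni on multi-arm tests; SRM checks"
    if high_baseline_or_small_effect:
        return "both in parallel; expected loss as the primary criterion"
    if must_decide_early:
        return "Bayesian with a pre-documented expected-loss threshold"
    return ("frequentist; pre-specify threshold, tailedness, and minimum "
            "detectable effect")

# Example: the verification page test (high baseline, no regulatory context)
print(choose_statistical_approach(False, False, True, False))
```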

How Running Both Methods in Parallel Works in Practice

The practical implementation of parallel frequentist and Bayesian analysis is less complex than it sounds. Both analyses run on the same data — you do not need separate tests or separate sample splits.

The frequentist analysis produces a p-value and confidence interval. The Bayesian analysis produces a posterior probability and expected loss. Both are available in the results review.

The decision protocol I use in [GrowthLayer](https://growthlayer.app) is: the primary decision metric is expected loss (Bayesian), with a secondary check that the frequentist result is not significantly contrary to the Bayesian read. If the Bayesian analysis says 95%+ probability of winning with near-zero expected loss, the test is shipped regardless of the frequentist p-value. If the frequentist result is significantly negative while the Bayesian result looks positive — a discrepancy worth investigating — the test is held pending investigation of the data quality.

This is not an arbitrary override of frequentist methods. It is a documented, pre-specified decision protocol that uses both analytical frameworks and applies each where it has genuine advantages.

The verification page test would have been shipped under this protocol. The other five "Bayesian directional" tests would have been reviewed using expected loss, and the ones with genuinely large expected loss would have been held for more data. The distinction matters: not all 95%+ Bayesian results are equivalent. The posterior width — how uncertain you are about the magnitude of the effect — matters as much as the probability.

Key Takeaway: Running both methods in parallel is not complicated. The decision protocol — which method governs the final call, and under what conditions — should be documented before the test starts, not after results are in. Expected loss is the most actionable Bayesian output for business decisions.

Conclusion

The frequentist-versus-Bayesian debate is not the most important statistical question in your testing program. The most important statistical decisions are more mundane: whether you use one-tailed or two-tailed tests, whether you apply multiple comparison corrections, whether you detect and investigate sample ratio mismatches, and whether everyone on your team uses the significance threshold consistently.

Get those right first. Then add Bayesian analysis as a parallel track — not to replace frequentist analysis, but to answer the business decision question that frequentist analysis does not address directly: "How likely is this variant to be better, and how much do we stand to lose if we ship it and we are wrong?"

The verification page test produced hundreds of additional enrollments by being shipped on the basis of a Bayesian read that the frequentist framework alone would have buried. That win was not recovered by choosing the "right" statistical method. It was recovered by using both methods, understanding what each one says, and having a documented decision protocol that applied the right question to the right decision.

Run both. Document your protocol. Check for SRM on every test. And stop watching significance levels tick upward and calling it analysis.

Want to run both Bayesian and frequentist analysis on every test, track expected loss as a shipping criterion, and build a decision protocol your whole team can follow? [GrowthLayer](https://growthlayer.app) gives you the infrastructure to run a statistically rigorous testing program without needing a dedicated data scientist on every decision.

_Atticus Li is a CRO Strategist and the Founder of [GrowthLayer](https://growthlayer.app), a platform for managing and improving enterprise experimentation programs._

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
