Expected Loss: The Decision Metric That Replaces "Is It Significant?"
"Is it significant?" is the wrong question for business decisions. Expected loss — how much value you risk by shipping — gives you what p-values never can: a direct answer to the decision you actually face.
There is a test result I still think about.
The variant had an 87% Bayesian probability of being better than control. The posterior distribution showed a clear rightward shift — the variant was very likely producing a positive effect on the primary metric. But the test had not reached the pre-specified frequentist significance threshold (95% confidence, p < 0.05). So under our existing decision rules, it was marked inconclusive and the variant was not shipped.
Several months later, a different team ran a related test on the same page flow. That test produced a statistically significant result in the same direction. We shipped the change. It performed as expected.
Had we shipped the first variant — the 87% probability result — we would have captured those gains months earlier. The conservative statistical framework that was supposed to protect us from false positives had instead cost us months of opportunity. The expected value of shipping was clearly positive. We had the data to see that. But our decision rule was not designed to use it.
That experience accelerated my thinking about expected loss as a decision framework, and it changed how I evaluate test results in our program.
The Problem With "Is It Significant?"
Statistical significance is a threshold test. It answers a binary question: given the null hypothesis of no difference, is the observed result unlikely enough to reject that null? A p-value of 0.04 means: if there were truly no difference, you would see data at least this extreme about 4% of the time.
That is a coherent question. But it is not the question a business decision-maker actually needs to answer.
The question a product or marketing leader needs to answer is: given what I know about this test, what is the cost of being wrong in each direction?
If I ship a variant that is actually neutral, what do I lose? If I do not ship a variant that is actually positive, what do I lose? These two errors have asymmetric costs, and p-values do not help you evaluate them.
P-values are also deeply counterintuitive as decision inputs. They do not tell you the probability that the variant is better. They do not tell you the expected magnitude of the effect. They do not tell you what decision you should make. They tell you the probability of observing data at least as extreme as what you saw, conditional on a null hypothesis you are probably not particularly interested in.
Expected loss addresses the question that matters. It quantifies: if I make this decision and I am wrong about the direction, how much value am I likely to give up?
What Expected Loss Is
Expected loss is a Bayesian decision theory concept. At its core, it measures the expected value of the error you make by choosing a particular action.
After running an A/B test, you have a posterior distribution over the true difference between variant and control. That distribution tells you not just the point estimate of the effect, but the full probability-weighted range of plausible effect sizes.
Expected loss from shipping the variant is calculated as: the probability-weighted average of the control's performance advantage over the variant, across all scenarios where the control is actually better.
Expected loss from not shipping (keeping control) is: the probability-weighted average of the variant's performance advantage over control, across all scenarios where the variant is actually better.
In plain language:
- Expected loss of shipping = on average, how much you lose if you ship and the variant is actually worse
- Expected loss of not shipping = on average, how much you lose if you hold back and the variant is actually better
A decision threshold based on expected loss says: ship when the expected loss of shipping is below some tolerance threshold. Do not ship when the expected loss of shipping exceeds that threshold. The threshold is expressed in the same units as your primary metric — conversion rate points, revenue per visitor, or whatever you are measuring.
Calculating Expected Loss in Practice
For a binary conversion metric (the common case in CRO), the posterior distributions for variant and control conversion rates are typically modeled as Beta distributions. The Beta distribution is the conjugate prior for binomial conversion data, which means the posterior is itself a Beta distribution and the math stays tractable.
After observing conversion data, your posterior for each variant is:
Control: Beta(alpha_c + conversions_c, beta_c + non-conversions_c)
Variant: Beta(alpha_v + conversions_v, beta_v + non-conversions_v)
Where alpha and beta are your prior parameters (typically 1 and 1 for a flat, uninformative prior, or something more informative if you have historical data about typical effect sizes on this type of page).
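As a concrete illustration with hypothetical counts: with a flat Beta(1, 1) prior, a control arm that records 480 conversions from 10,000 visitors has posterior Beta(481, 9521), and a variant arm with 520 conversions from 10,000 visitors has posterior Beta(521, 9481).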
The expected loss of shipping the variant is then:
E[Loss | ship variant] = E[max(theta_c - theta_v, 0)]
In words: integrate over all combinations of theta_c and theta_v, take the difference where control is better, multiply by the probability of that combination, and sum. This is the probability-weighted expected harm from shipping a variant that is actually worse.
For Beta posteriors this double integral has a closed-form solution in terms of Beta functions, but the expression is cumbersome and easy to implement incorrectly.
In practice, this integral is approximated via Monte Carlo simulation. You draw a large number of samples from each posterior (10,000 to 100,000 draws is standard), compute the difference at each draw, and average the clipped differences: the mean of max(theta_c - theta_v, 0) across all draws is the expected loss of shipping, and the mean of max(theta_v - theta_c, 0) is the expected loss of holding.
This sounds more complex than it is. The actual implementation is a few lines of code, and many Bayesian testing frameworks compute it automatically. If you are using a platform that surfaces "probability to be best" alongside a conversion rate distribution, expected loss is derivable from those same posteriors.
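Here is a minimal sketch of what those few lines might look like. The function name, the flat Beta(1, 1) priors, and the default draw count are assumptions for illustration, not taken from any particular platform:

```python
import numpy as np

def expected_loss(conv_c, n_c, conv_v, n_v, draws=100_000, seed=0):
    """Monte Carlo expected loss for a binary conversion metric with flat Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    # Posterior draws for the true conversion rate of each arm
    theta_c = rng.beta(1 + conv_c, 1 + (n_c - conv_c), draws)
    theta_v = rng.beta(1 + conv_v, 1 + (n_v - conv_v), draws)
    # Expected loss of shipping: average of how much control beats the variant,
    # counting zero on draws where the variant is actually better
    loss_ship = np.mean(np.maximum(theta_c - theta_v, 0.0))
    # Expected loss of holding: average of how much the variant beats control,
    # counting zero on draws where control is actually better
    loss_hold = np.mean(np.maximum(theta_v - theta_c, 0.0))
    prob_variant_better = np.mean(theta_v > theta_c)
    return loss_ship, loss_hold, prob_variant_better
```

Calling it with hypothetical counts (for example, expected_loss(480, 10_000, 520, 10_000), the same numbers as above) returns both loss figures in absolute conversion-rate units, alongside the probability that the variant is better.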
A Decision Framework Based on Expected Loss
The threshold question — how low does expected loss need to be before you ship? — is a business calibration, not a statistical one. It depends on what a conversion rate point is worth to your organization.
Here is the framework I use in our program:
Step 1: Convert expected loss to business value. If your expected loss is 0.3 percentage points (absolute) on a metric that converts at 5%, and each conversion is worth some dollar amount, you can express expected loss directly as expected dollars at risk per period (a worked sketch follows these steps).
Step 2: Set a loss tolerance threshold. How much expected value are you willing to risk per decision? This should be calibrated to the size of the change, the reversibility of the shipping decision, and the opportunity cost of waiting. For small, easily reversed UI changes, a relatively generous threshold is appropriate. For pricing changes or structural changes that are difficult to roll back, a tighter threshold makes sense.
Step 3: Compute expected loss in both directions. You need both numbers: expected loss of shipping, and expected loss of not shipping. The decision is not simply "is expected loss low?" It is "which decision has lower expected loss?"
Step 4: Apply the threshold. If the expected loss of shipping is below your tolerance threshold, ship. If the expected loss of not shipping is below your tolerance threshold and the expected loss of shipping is above it, hold. If neither is below threshold, the test needs more data.
Step 5: Document the reasoning. For any test where you deviate from the standard significant/not-significant binary, document the expected loss calculation and the business reasoning behind the threshold. This creates an audit trail and helps calibrate thresholds over time.
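Here is a minimal sketch of Steps 1 through 4, continuing from the Monte Carlo function above. Every dollar figure, traffic number, and threshold is invented for illustration; your own calibration will differ:

```python
# All numbers below are hypothetical.
loss_ship, loss_hold, _ = expected_loss(480, 10_000, 520, 10_000)

visitors_per_month = 200_000      # assumed traffic exposed to the change
value_per_conversion = 40.0       # assumed dollars per conversion
tolerance_dollars = 2_000.0       # assumed monthly loss tolerance for a reversible UI change

# Step 1: convert expected loss (absolute conversion-rate points) into dollars at risk per month
dollars_at_risk_ship = loss_ship * visitors_per_month * value_per_conversion
dollars_foregone_hold = loss_hold * visitors_per_month * value_per_conversion

# Steps 2 through 4: apply the business-calibrated threshold in both directions
if dollars_at_risk_ship < tolerance_dollars:
    decision = "ship"
elif dollars_foregone_hold < tolerance_dollars:
    decision = "hold"
else:
    decision = "collect more data"
```

The point of the sketch is the shape of the comparison, not the specific numbers: both losses end up in the same dollar units as the tolerance, so the decision reads directly off the data.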
The Six Tests We Killed as Inconclusive
In our program, over the span of roughly two years, there were six tests that reached Bayesian probability to be best above 90% — in some cases above 95% — without reaching frequentist significance at the conventional threshold.
Under a purely frequentist decision framework, all six were marked inconclusive. The variants were not shipped.
When I applied expected loss analysis retrospectively to those six tests, the results were instructive. In four of the six cases, the expected loss of shipping was quite low in absolute terms — a fraction of a percentage point of the primary metric. The tests had not reached significance largely because the effects were small and the confidence intervals were wide, not because there was strong evidence the variants were equivalent to control.
For those four tests, expected loss analysis would have supported a shipping decision, or at minimum a decision to collect more data with an explicit end date rather than running indefinitely.
In the remaining two cases, the expected loss calculation told a different story: although the probability to be best was above 90%, the posterior distribution included non-trivial probability mass on scenarios where the variant was meaningfully worse than control. The expected loss of shipping was not trivially small. It was in a range where caution was warranted. The frequentist framework had reached the right conclusion by accident — not because p = 0.06 means "the variant is risky," but because the underlying data happened to show a wide distribution with some downside exposure.
The nuance matters. Expected loss does not simply lower the bar for shipping. It asks a different question, and it sometimes produces more conservative conclusions than a pure probability threshold would.
The Test We Should Have Shipped
One of those six tests deserves more detail because it illustrates the cost of the wrong decision framework most clearly.
The test was on a multi-step enrollment flow. The variant simplified the second step — reducing the number of fields and restructuring the layout. After three weeks, the variant showed a lift in completion rate for that step. The Bayesian probability to be best was 91%. The frequentist p-value was 0.08.
Under our rules at the time, the test was inconclusive. The variant was not shipped.
When I applied expected loss analysis to the data at the three-week mark: the expected loss of shipping the variant — the average harm if the variant was actually worse — was approximately 0.4 percentage points of completion rate on that step. The expected loss of not shipping — the average gain we were foregoing if the variant was actually better — was approximately 1.8 percentage points.
The asymmetry was clear. The expected cost of an incorrect ship decision was less than a quarter of the expected cost of an incorrect hold decision. The rational business decision, accounting for uncertainty, was to ship.
We did not ship. Several months later, a revised version of the same hypothesis was tested, reached significance, and was implemented. The gains we saw from that test were consistent with what the first test had suggested.
The earlier result was not a false positive. It was an accurate signal that our decision framework was too blunt to act on.
When Frequentist Significance Is Still the Right Call
Expected loss analysis is not a universal replacement for significance thresholds. There are contexts where the conservative, pre-specified frequentist framework is clearly correct.
High-stakes, difficult-to-reverse decisions. Pricing changes, legal-adjacent copy, or structural changes to core flows that affect large revenue bases should be held to a higher standard. The cost of a false positive in these contexts may genuinely exceed the cost of a false negative, and a strict significance threshold builds in exactly that bias.
Regulated environments. If test results will be reviewed by legal, compliance, or a regulatory body, frequentist methods with explicit error rate control are the defensible standard. Expected loss is harder to explain in an audit, and the institutional credibility of the frequentist framework matters in those contexts.
Low-probability, high-magnitude downside scenarios. Expected loss averages across the posterior distribution. If your posterior has a long left tail where the variant is substantially worse, that region may carry so little probability mass that the expected loss still looks acceptable, even though the potential harm is severe. For tests where even low-probability harm is unacceptable, the full posterior distribution matters more than the expected value.
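One way to see that exposure is to query the posterior draws directly rather than averaging them. A sketch, using the same hypothetical counts and flat priors as before; the one-point cutoff is an arbitrary illustration, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior draws, same flat-prior setup as the earlier sketch
theta_c = rng.beta(1 + 480, 1 + 9_520, 100_000)
theta_v = rng.beta(1 + 520, 1 + 9_480, 100_000)

# Expected loss averages over the whole posterior...
expected_loss_ship = np.mean(np.maximum(theta_c - theta_v, 0.0))
# ...but tail risk asks a different question: how likely is a loss larger than
# some amount you could not tolerate? (0.01 absolute points here, purely illustrative)
prob_big_loss = np.mean(theta_v < theta_c - 0.01)
```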
The key is matching the decision framework to the decision context. Expected loss is the right tool for the middle of the distribution — tests where you need to make good decisions under uncertainty about typical-sized effects. It is less suited for the extremes.
Building Expected Loss Into Your Workflow
The practical implementation is simpler than the theory suggests. If your testing platform reports probability to be best and a conversion rate distribution, you can approximate expected loss by hand with a few steps.
But the more important shift is conceptual: from asking "did we cross the significance threshold?" to asking "what is the cost of each decision, weighted by our uncertainty about the true effect?"
That question produces better decisions. It forces you to be explicit about the business stakes rather than hiding behind a statistical threshold. It gives you a framework for acting on high-probability, small-effect tests rather than leaving those results on the table. And it gives you a framework for staying cautious on high-probability, high-variance tests where the downside is real.
I track expected loss alongside Bayesian probability and frequentist p-values for every test in our program through GrowthLayer. Having all three metrics in the same view — and watching them tell different stories about the same data — is the best education in the limits of any single framework I have found.
The Bottom Line
"Is it significant?" was designed to answer a specific question about long-run error rates in repeated experiments. It is a useful check, but it is not a decision protocol.
The question that drives good business decisions is different: if I act on this result and I am wrong, what do I lose? Expected loss answers that question directly, in units that matter to the business, accounting for the full uncertainty in your data.
For the six inconclusive tests in our program — and for the one I still think about most often — a cleaner decision framework would have produced better outcomes. Not by lowering standards, but by asking the right question.
If you want to make smarter decisions from the tests you are already running, start there.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.