Most A/B testing content treats results as binary: you either have a winner or a loser. Ship or kill. Green or red.

After running 100+ experiments per year at a Fortune 150 energy company — generating $30M in verified revenue impact in 2025 — I can tell you that framing is dangerously incomplete. Every experiment lands in one of six distinct outcome categories, each demanding a different decision framework.

1. Clear Winner: Stat Sig and Economically Meaningful

Your variant hits statistical significance, and the lift matters to the business. It feels easy, but it carries a trap that catches even experienced teams.

The trap is shipping without checking downstream. We had a test that increased enrollment starts by 14%. The team wanted to ship immediately. But enrollment confirms — actual completions — showed no lift. The variant attracted low-intent clicks that never followed through.

The rule: Never ship on the primary metric alone. Watch enrollment confirms, not just starts. Watch purchase completions, not just add-to-carts. The 24 hours you spend validating downstream metrics save you from shipping a metric mirage.

When to ship: Stat sig at your pre-set threshold (typically 95% confidence), economically meaningful lift confirmed, and downstream metrics neutral or positive.
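The ship rule above can be made mechanical with a small gate that tests the primary metric and then checks a downstream metric before approving. This is a minimal sketch, not a production implementation: the function names, the 2% minimum lift, and the pooled two-proportion z-test are all assumptions for illustration, and "neutral" is read here as "not significantly negative."

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, two_sided_p_value) for variant B vs. control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return (p_b - p_a) / p_a, p_value


def ready_to_ship(primary, downstream, alpha=0.05, min_lift=0.02):
    """Each argument is a (conv_a, n_a, conv_b, n_b) tuple.

    min_lift is a hypothetical "economically meaningful" threshold.
    """
    lift, p = two_proportion_z(*primary)
    d_lift, d_p = two_proportion_z(*downstream)
    primary_wins = p < alpha and lift >= min_lift
    # Assumption: downstream only blocks shipping if it is significantly worse.
    downstream_hurt = d_p < alpha and d_lift < 0
    return primary_wins and not downstream_hurt


# Hypothetical counts: enrollment starts up 14%, confirms significantly down.
starts = (1000, 10_000, 1140, 10_000)
confirms_down = (500, 10_000, 400, 10_000)
print(ready_to_ship(starts, confirms_down))  # blocked by the downstream check
```

A flat downstream metric passes this gate, so a large primary lift paired with flat completions (as in the enrollment-starts example) still deserves a manual look before shipping.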

2. Clear Loser: Stat Sig Negative Result

A clear loser is the second-best outcome you can get. Not the worst — the second best. A stat sig negative gives you high-confidence information about what does not work, eliminating an entire direction of future effort.

Real example: We replaced a generic "Sign up" button with value-prop CTAs like "Start saving today" and "Get your free plan." Enrollment dropped by a statistically significant margin.

The insight: at the action point, users have already decided to act. Specific value propositions add evaluation friction — forcing users to pause and assess whether the promise matches their needs. Generic "Sign up" imposed zero cognitive load at the exact moment when cognitive load is most destructive.

That single loser taught us more about conversion psychology than a dozen winners. We now apply the principle — reduce evaluation friction at action points — across every funnel.

When to act: Document the insight, socialize the learning, update your hypothesis library, kill the variant.

3. Stat Sig but Economically Trivial

This result fools teams that lack a finance partner. Your dashboard shows green because stat sig was reached, but the business impact is negligible.

The CFO framing: a 0.3% relative lift reaching significance translates to $12K/year in revenue. The test cost $40K in engineering, design, and experimentation slot time. You spent $40K to find $12K.

The formula I use to pre-qualify every test:

EBITDA Impact = Brand Monthly EBITDA × Annualized Traffic × Baseline CR × Relative Lift

If the projected EBITDA impact does not clear a threshold that justifies the resources consumed, the result is economically trivial regardless of stat sig. Use this formula in reverse too: before running a test, calculate the minimum detectable effect (MDE) that would make it worthwhile.
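The $40K-for-$12K arithmetic and the reverse (break-even) calculation can be sketched in a few lines. This is a simplified stand-in, not the exact EBITDA formula above: value_per_conversion is a hypothetical per-conversion margin, and the traffic, conversion rate, and dollar inputs are invented so that the output reproduces the $12K example.

```python
def annual_impact(annual_traffic, baseline_cr, relative_lift, value_per_conversion):
    """Projected yearly value of a lift (simplified, assumed model)."""
    return annual_traffic * baseline_cr * relative_lift * value_per_conversion


def break_even_lift(test_cost, annual_traffic, baseline_cr, value_per_conversion):
    """Relative lift at which one year of impact equals the cost of the test."""
    return test_cost / (annual_traffic * baseline_cr * value_per_conversion)


# Hypothetical inputs: 2M visitors/year, 5% baseline CR, $40 per conversion.
traffic, cr, value = 2_000_000, 0.05, 40.0
test_cost = 40_000.0

impact = annual_impact(traffic, cr, 0.003, value)       # about $12K/year
needed = break_even_lift(test_cost, traffic, cr, value)  # about a 1% relative lift

print(f"Projected impact: ${impact:,.0f} vs. cost ${test_cost:,.0f}")
print(f"Break-even relative lift: {needed:.2%}")
```

If the break-even lift is larger than anything the hypothesis could plausibly produce, the test fails pre-qualification before it consumes a slot.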

When to ship: Almost never. The exception is a change that is truly zero-cost to ship and carries no opportunity cost.

4. Directional but Inconclusive

The most common result — roughly 61% of tests end here. The variant trends positive but never reaches your significance threshold before runtime expires.

Critical reframe: "inconclusive" does not mean "failed." It means you lack sufficient evidence at the confidence level you chose.

Decision framework:

Runtime extension: Could 1-2 more weeks reach significance? If the power analysis says 3x more runtime is needed, extending is not the answer.
Bayesian posterior: Even if the frequentist test is inconclusive, what does the posterior show? A 75-80% probability that the variant is better, with bounded downside, might be enough to act.
Business context: How reversible is the decision? A copy change that can be reverted in 5 minutes carries different risk than a pricing experiment.
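For the Bayesian posterior question, one common approach is a Beta-Binomial model sampled by Monte Carlo. The sketch below assumes a uniform Beta(1, 1) prior on each conversion rate and uses invented counts. The expected-loss figure is one possible way to quantify "bounded downside"; it is not something the framework above prescribes.

```python
import random


def prob_variant_better(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Return (P(variant > control), expected loss if the variant ships).

    Expected loss is the average conversion-rate shortfall across the
    posterior draws in which the variant is actually worse.
    """
    rng = random.Random(seed)
    wins = 0
    loss = 0.0
    for _ in range(draws):
        # Posterior under a uniform prior: Beta(1 + successes, 1 + failures).
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if b > a:
            wins += 1
        loss += max(a - b, 0.0)
    return wins / draws, loss / draws


# Hypothetical inconclusive test: 5.0% vs. 5.3% on 10,000 visitors per arm.
p_better, exp_loss = prob_variant_better(500, 10_000, 530, 10_000)
print(f"P(variant better) = {p_better:.1%}, expected loss = {exp_loss:.4%}")
```

A probability in the 75-80% range or above, combined with a small expected loss and an easily reversible change, is the kind of evidence that can support a provisional ship.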

Decision patterns:

High potential, clean data, low downside: Ship provisionally with a scheduled confirmation test.
High potential, mixed data, meaningful downside: Plan a powered re-run with tighter controls.
Low potential, ambiguous data: Archive as weak signal; move to higher-value hypotheses.

5. Flat / True Null Result

This is where most teams say "the test failed" and move on. That is wrong. A true null result, where control and variant show no meaningful difference, is information, not failure.

A flat result tells you something important about hypothesis quality. There are two common interpretations:

"Already optimized." The page or element you tested is already at or near a local optimum for that class of intervention. Marginal changes within the same design paradigm will not move the needle. This signals: go bigger — test a fundamentally different approach, not a tweak.

"Variant was not different enough." Your treatment was too similar to control. Users literally did not notice or care about the difference. This signals: same hypothesis, bolder execution.

The distinction matters: the first interpretation means "stop testing this surface" while the second means "test this surface with a bolder variant."

When to act: Consolidate to the operationally simpler variant. Mark the intervention class as "low leverage at current magnitude" in your backlog. Shift experimentation capacity to structurally different ideas.

6. Mixed Signals: Primary Up, Secondary Down

This is the most dangerous result type because it looks like success if you only check one metric.

Your primary conversion metric goes up — or looks flat — but a secondary or downstream metric goes down. The dashboard says "acceptable" but the full picture tells a more complicated story.

Real example: We ran a homepage redesign test. The primary metric — click-through to enrollment page — was flat. But enrollment confirms dropped. The investigation revealed why: the new design used a modal enrollment flow instead of a dedicated enrollment page. The modal had a higher start rate (reduced friction to begin) but a lower completion rate (felt less serious, users abandoned more easily). Net effect was negative despite surface-level metrics looking acceptable.

This is exactly the kind of result where checking only your primary metric would lead you to ship something harmful. I wrote about a similar case in detail in "We Made the UX Better and Conversion Dropped."

Decision framework for mixed signals:

Never ship on primary metric alone. If secondary metrics are degraded, you need to understand why before making a decision.
Calculate the net impact. Sometimes the primary lift outweighs the secondary loss. Sometimes it does not. Use your EBITDA formula to compare.
Look for design compromises. Often the variant contains a good idea and a bad idea packaged together. Decompose and test the components separately.
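The net-impact step can be illustrated by decomposing the funnel into start rate and completion rate, as in the modal example. The rates below are invented for illustration and are not the figures from that test.

```python
def confirm_rate(start_rate, completion_rate):
    """Confirmed enrollments per visitor = starts per visitor x completions per start."""
    return start_rate * completion_rate


def net_relative_change(control, variant):
    """Relative change in confirms per visitor; inputs are (start, completion) pairs."""
    c = confirm_rate(*control)
    v = confirm_rate(*variant)
    return (v - c) / c


# Hypothetical: the modal raises starts from 10% to 12% but cuts completion from 50% to 38%.
control = (0.10, 0.50)
variant = (0.12, 0.38)

change = net_relative_change(control, variant)
print(f"Net change in confirms per visitor: {change:+.1%}")  # negative despite more starts
```

Multiplying the stages makes the trade-off explicit: a 20% gain in starts does not compensate for a 24% drop in completion.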

The Decision Matrix

Classify every result using two dimensions: confidence level (low vs. high) and business impact (low vs. high).

High confidence + High impact positive: Type 1 — Ship after downstream validation.
High confidence + High impact negative: Type 2 — Kill, document the learning.
High confidence + Low impact: Type 3 — Usually do not ship; protect experimentation capacity.
Low confidence + High potential impact: Type 4 — Provisional ship or powered re-run.
Low confidence + Low impact: Type 5 — Standardize and move on.
Any confidence + Contradictory signals: Type 6 — Investigate before any decision.
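The matrix can be expressed as a small classifier. This is a rough sketch under stated assumptions: the thresholds, the function name, and the tie-breaking rules are hypothetical. In particular, any significant negative result is treated as Type 2, and a non-significant result counts as Type 4 only when the observed lift is at least the minimum meaningful lift.

```python
PLAYBOOK = {
    1: "Ship after downstream validation.",
    2: "Kill, document the learning.",
    3: "Usually do not ship; protect experimentation capacity.",
    4: "Provisional ship or powered re-run.",
    5: "Standardize and move on.",
    6: "Investigate before any decision.",
}


def classify_result(p_value, relative_lift, min_meaningful_lift,
                    secondary_degraded=False, alpha=0.05):
    """Map a test readout to one of the six result types."""
    # Contradictory signals override everything else.
    if secondary_degraded:
        return 6
    if p_value < alpha:
        if relative_lift < 0:
            return 2
        if relative_lift >= min_meaningful_lift:
            return 1
        return 3  # significant but below the economic threshold
    if relative_lift >= min_meaningful_lift:
        return 4  # promising direction, not enough evidence yet
    return 5


result_type = classify_result(p_value=0.20, relative_lift=0.05,
                              min_meaningful_lift=0.02)
print(result_type, PLAYBOOK[result_type])
```

Encoding the rules this way forces the team to agree on alpha and the minimum meaningful lift before the test runs, rather than after the dashboard turns green.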

Key Takeaways

A clear loser is the second-best outcome — it gives you high-confidence learning and eliminates wasted effort on that direction.
Statistical significance without economic significance is a trap — always run the EBITDA impact formula before shipping.
61% of tests are inconclusive, not failed — build a decision framework for incomplete information instead of treating it as binary.
Flat results are signals about hypothesis quality — they tell you whether you are optimizing the right lever at the right magnitude.
Mixed signals are the most dangerous result type — always check secondary and downstream metrics before shipping any winner.
Every outcome type has a specific playbook — when you stop treating experiments as "win or lose," your program becomes a capital allocation engine.

Frequently Asked Questions

What percentage of A/B tests actually produce a clear winner?

In a mature program running well-formed hypotheses, roughly 15-20% of tests produce a clear winner (Type 1). Another 10-15% produce clear losers (Type 2). The remaining 60-70% fall into Types 3-6. If your win rate is significantly higher than 20%, you are probably not testing bold enough hypotheses — you are confirming what you already know rather than learning something new.

How long should I let an inconclusive test run before calling it?

Set your maximum runtime before the test launches based on your power analysis. I typically cap tests at 4-6 weeks for most web experiments. If a test has not reached significance by its pre-set end date, classify it as Type 4 and use the decision framework rather than extending indefinitely. Indefinite extension introduces seasonal confounds, novelty decay, and organizational fatigue.
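A pre-launch runtime estimate can be derived from the standard sample-size formula for comparing two proportions. The sketch below assumes a two-sided test at 95% confidence and 80% power. The baseline rate, MDE, and weekly traffic are placeholder values.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_cr, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect the given relative lift."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def weeks_needed(n_per_arm, weekly_visitors, arms=2):
    """Whole weeks required to fill every arm at the given weekly traffic."""
    return ceil(n_per_arm * arms / weekly_visitors)


# Hypothetical: 5% baseline, 10% relative MDE, 20,000 eligible visitors per week.
n = sample_size_per_arm(0.05, 0.10)
weeks = weeks_needed(n, 20_000)
print(f"{n:,} visitors per arm, about {weeks} weeks")
```

If the estimate exceeds your cap, the options are a bolder variant (a larger MDE), a higher-traffic surface, or not running the test at all.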

Should I use Bayesian or frequentist methods for analyzing results?

Both. I use frequentist significance as the primary decision gate (it is what most stakeholders understand and what most tools report), but I layer in Bayesian posterior probabilities for Type 4 results where the frequentist answer is "inconclusive." The Bayesian posterior gives you a probability distribution that helps with the "ship provisionally vs. re-run" decision. They are complementary tools, not competing philosophies.

How do I prevent mixed-signal results from being shipped as winners?

Build a mandatory checklist into your experiment review process. Before any test is marked "ready to ship," the analyst must confirm that all pre-registered secondary metrics are neutral or positive. If any secondary metric shows a statistically or practically meaningful degradation, the test is automatically classified as Type 6 (Mixed Signals) and requires a deeper investigation before a ship decision.

The 6 Types of A/B Test Results Nobody Explains Clearly