
Non-Inferiority Testing: The A/B Test Framework for When "Better" Is the Wrong Question

4 tests proved their value by NOT hurting the primary metric while generating secondary wins. Here is the complete guide to non-inferiority A/B testing.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
11 min read

Editorial disclosure

This article lives on the canonical GrowthLayer blog path for indexing consistency. Our review, sourcing, and update rules are documented in our editorial policy and methodology.

Fortune 150 experimentation lead · 100+ experiments / year · Creator of the PRISM Method
A/B Testing · Experimentation Strategy · Statistical Methods · CRO Methodology · Experimentation at Scale

Most A/B tests ask: "Does variant beat control?" The primary metric either goes up enough to declare a winner, stays flat, or goes down. Three outcomes, and one of them is what you hoped for.

But there is a category of business problem where that framing is wrong. Where the real question is not "does this increase conversion?" but "does this add value without hurting what we already have?"

Non-inferiority testing is the statistical framework for that question. I have used it in four meaningful cases across a high-consideration enrollment funnel. In each one, the test would have been declared inconclusive or a failure under a standard superiority framework -- and each one would have been the wrong call.

This article explains non-inferiority testing from the ground up: when to use it, how to set it up correctly, what the results actually tell you, and the mistakes I see teams make when they try to apply it without fully understanding the framework.

When Standard Superiority Testing Is the Wrong Tool

A superiority test is designed to detect whether a variant performs meaningfully better than control. That is the right design for most tests. You have a hypothesis about a change that should improve a metric, you measure whether it does, and you make a decision.

Superiority testing is the wrong tool in two specific situations.

First: when you are testing an additive channel. If you are adding a new pathway to conversion -- a phone sales channel alongside a digital enrollment flow, for instance -- the new channel cannot be measured solely by its effect on the existing metric. If the phone channel converts customers who would not have enrolled digitally, the digital conversion rate may stay flat even as total conversions increase. A superiority test on the digital metric would conclude "no impact" and miss the entire point.

Second: when you are testing a change with secondary benefits. Sometimes you want to change the user experience for reasons that are not captured in your primary metric. A design overhaul, a copy simplification, a page restructure. If the new experience is better for users but does not move the primary metric, a superiority test calls it a failure. But "no worse on primary, meaningfully better on secondary" is a perfectly valid outcome if you designed the test correctly.

Non-inferiority testing reframes the question. Instead of asking "is variant better than control?", it asks "is variant no worse than control by more than a defined margin?" If the answer is yes, and variant shows meaningful improvements elsewhere, you have a result you can act on.
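In formal terms, the hypotheses flip relative to a superiority test. Writing $p_v$ and $p_c$ for the variant and control conversion rates, and $\delta > 0$ for the margin discussed below:

$$H_0:\; p_v - p_c \le -\delta \quad \text{(variant is worse by at least the margin)} \qquad H_1:\; p_v - p_c > -\delta \quad \text{(variant is non-inferior)}$$

Rejecting $H_0$ is what "establishing non-inferiority" means; a non-significant result leaves the question open rather than proving equivalence.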

Key Takeaway: Non-inferiority testing is not a consolation framework for tests that fail to show improvement. It is the correct framework for a specific, well-defined business question: does this change cause meaningful harm to what already works?

The Non-Inferiority Margin: Your Most Important Design Decision

Before you run a non-inferiority test, you must define your non-inferiority margin. This is the maximum amount of degradation in the primary metric you are willing to accept in exchange for the secondary benefits you expect.

The margin is a business decision, not a statistical one. Statistics can tell you whether your observed result is within the margin. Only you can determine what margin is acceptable.

In practice, I set the margin by asking: at what level of primary metric decline would we not ship this change regardless of secondary benefits? That threshold is your margin.

For a high-consideration enrollment funnel where the primary metric is the core conversion event, I have used a -2% threshold. This means: if the variant causes a decline in primary conversion larger than 2%, the secondary benefits are not sufficient justification to ship. If the decline is within 2% (or if the variant actually improves the primary metric), the secondary results can drive the decision.

The -2% threshold sounds specific, but it is grounded in the business context. We calculated that a sustained -2% decline in primary conversion across the full traffic volume would be material enough to flag as harm. Anything smaller than that was within the noise level of normal quarter-to-quarter variation and would not constitute a significant adverse effect.

Your margin will depend on your primary metric's baseline, your traffic volume, and the strategic importance of the change you are testing. A test with large expected secondary benefits warrants a somewhat wider margin. A test with modest secondary benefits should use a tighter one.

Statistical Considerations: Power, Sample Size, and One-Sided Testing

Non-inferiority tests require larger sample sizes than their superiority counterparts. This surprises practitioners who assume that proving "not worse" requires less evidence than proving "better." That intuition is backwards: proving that something is definitely not worse requires enough power to rule out harm, which is a higher statistical bar than simply detecting improvement.

Calculate your sample size with the following inputs (a worked sketch follows the list):

  • Your non-inferiority margin (the maximum acceptable decline)
  • Your expected treatment effect (often assumed to be zero -- you expect the variant to be equivalent to control)
  • Your desired power (80% minimum; 90% preferred for high-stakes decisions)
  • Your significance level (one-sided alpha of 0.025, which corresponds to a two-sided alpha of 0.05)
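To make the calculation concrete, here is a minimal per-arm sample size sketch using the standard two-proportion normal approximation. It assumes the margin and expected effect are expressed as absolute rate differences (e.g., -0.02 for a two-point margin); the function name and defaults are illustrative, not taken from any particular library.

```python
import math
from scipy.stats import norm

def ni_sample_size(baseline_rate, margin, expected_diff=0.0,
                   alpha=0.025, power=0.80):
    """Per-arm sample size for a non-inferiority test of two proportions.

    `margin` and `expected_diff` are absolute rate differences
    (e.g., margin=-0.02 for a 2-point non-inferiority margin).
    """
    z_alpha = norm.ppf(1 - alpha)      # one-sided critical value
    z_beta = norm.ppf(power)
    p_c = baseline_rate
    p_v = baseline_rate + expected_diff
    variance = p_c * (1 - p_c) + p_v * (1 - p_v)
    distance = expected_diff - margin  # gap between true effect and margin
    return math.ceil((z_alpha + z_beta) ** 2 * variance / distance ** 2)

# 10% baseline, -2 point margin, true effect assumed zero:
print(ni_sample_size(0.10, -0.02))  # ~3,532 users per arm
```

Note how the margin sits in the denominator: halving the margin roughly quadruples the required sample, which is why tight margins get expensive fast.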

The one-sided versus two-sided distinction matters here. In superiority testing, you typically use a two-sided test because you care about both the possibility that variant is better and the possibility that it is worse. In non-inferiority testing, you are only testing one direction: whether the variant is worse than control by more than your margin. This is a one-sided test, which affects both your sample size calculation and how you interpret the confidence interval.

The key confidence interval interpretation: for a non-inferiority test, you are looking at the lower bound of the confidence interval around the treatment effect. If the lower bound is above your non-inferiority margin (expressed as a negative number), you have established non-inferiority.

For the -2% margin example: if your 95% confidence interval for the treatment effect runs from -0.8% to +1.4%, the lower bound (-0.8%) is above the -2% margin. Non-inferiority is established. You can now look at your secondary metrics to determine whether to ship.
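Here is a minimal sketch of that decision rule, using a normal-approximation (Wald) interval on the rate difference. The counts are made up for illustration; with a one-sided alpha of 0.025, this lower bound matches the lower end of the two-sided 95% interval described above.

```python
import math
from scipy.stats import norm

def ni_lower_bound(conv_v, n_v, conv_c, n_c, alpha=0.025):
    """One-sided lower confidence bound for the (variant - control) rate difference."""
    p_v, p_c = conv_v / n_v, conv_c / n_c
    se = math.sqrt(p_v * (1 - p_v) / n_v + p_c * (1 - p_c) / n_c)
    return (p_v - p_c) - norm.ppf(1 - alpha) * se

MARGIN = -0.02  # pre-specified, in absolute points
lb = ni_lower_bound(conv_v=4_960, n_v=50_000, conv_c=5_000, n_c=50_000)
print(f"lower bound = {lb:+.4f}",
      "-> non-inferiority established" if lb > MARGIN else "-> not established")
```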

Key Takeaway: Non-inferiority tests require larger samples than superiority tests and use one-sided confidence intervals. Design the test correctly upfront -- retrofitting a non-inferiority interpretation onto a test designed for superiority is a methodological error.

Four Cases From a Real Testing Program

Case 1: The Phone Channel CTA -- The Gold Standard

This is the clearest non-inferiority application I have run. The test introduced a prominent call-to-action directing users to a phone sales channel. The business question was not "will this increase digital conversions?" -- it obviously would not. The question was "will this generate incremental phone sales without cannibalizing digital enrollments?"

The primary metric was digital enrollment completion rate. The non-inferiority margin was -2%.

The result: the primary metric held at -0.4% versus control. Well within the -2% margin. Non-inferiority established.

The secondary metric was phone-assisted enrollments, tracked through a dedicated phone number associated with the variant. The variant generated hundreds of incremental phone conversions that would not have occurred without the CTA. These were net new customers who preferred the phone channel and would not have enrolled digitally.

This is the textbook non-inferiority outcome. Primary held, secondary won big, and the test provided clear justification to make the phone CTA permanent. The total revenue impact was substantially larger than any pure digital conversion test we ran that quarter.

I log results like this in GrowthLayer specifically so the reasoning is preserved. A test that "didn't move the primary metric" could look like a failure in a surface-level program review. Having the non-inferiority framing documented alongside the secondary results makes the business case clear to anyone reviewing the history.

Case 2: The Landing Page Redesign

A significant visual redesign of the primary landing page. The existing design had accumulated years of incremental changes and was not cohesive. The new design was cleaner, faster, and better organized.

We did not expect the redesign to move primary conversion -- it was not changing the funnel structure or the value proposition. We were paying down accumulated design debt, not making a targeted conversion play.

Primary metric: product chart views (a mid-funnel engagement proxy). Result: -0.06%. Effectively flat. Non-inferiority established with room to spare.

Secondary metric: phone sales. Result: approximately doubled versus the prior period, from ~234 to ~469. The redesign made the phone CTA more visible and contextually appropriate, driving a channel mix shift that more than compensated for any potential primary impact.

The non-inferiority framework let us ship a redesign that a pure conversion-rate view would have called inconclusive. Without the pre-specified non-inferiority framing, the conversation would have been "why are we making such a big change if the conversion rate did not move?" With it, the conversation was "primary held, phone conversions doubled, ship it."

Case 3: The Satisfaction Guarantee Copy Test

A test of copy modifications to the satisfaction guarantee section of an enrollment page. The hypothesis was that more specific, credible guarantee language would reduce friction and improve both primary and secondary metrics.

The primary result was essentially flat -- inconclusive on a superiority interpretation. But the non-inferiority margin was comfortably maintained.

The secondary metric -- enrollment confirmations at a downstream step -- showed a +3.4% increase with sufficient statistical confidence to take seriously. The mechanism makes sense: clearer guarantee language reduced hesitation at the decision point, leading more users who started enrollment to complete it.

The flat primary with positive secondary is a pattern I call "funnel quality improvement" -- the test did not attract more users into the funnel, but it improved the quality of the users who were already in it. That is a meaningful outcome that non-inferiority testing correctly surfaces as a win.

Case 4: The Pricing Badge Test

A test adding a visual pricing badge to a product card. The badge was designed to create perceived value and reduce pricing friction.

Primary metric: product chart views. Result: -0.3%. Within the non-inferiority margin.

Secondary metric: enrollment start rate. Result: +3.4%. Users who saw the pricing badge were meaningfully more likely to begin the enrollment process after viewing the product information.

Same interpretation pattern as Case 3: primary holds, secondary moves. The badge did not increase mid-funnel engagement, but it increased the conversion rate from mid-funnel to late-funnel. Net positive, and the non-inferiority framing was what allowed us to surface that correctly.

Key Takeaway: The "primary holds, secondary wins" pattern is the most common non-inferiority outcome in practice. It represents genuine value that a superiority-only framework would systematically overlook.

Common Mistakes in Non-Inferiority Testing

Mistake 1: Declaring a winner when the primary metric is just "not worse."

Non-inferiority proves that the variant does not cause meaningful harm to the primary metric. It does not prove that the variant is superior. Do not use a non-inferiority result to claim that your variant is better than control on the primary metric.

The correct language is: "Non-inferiority was established. The primary metric held within the defined margin. Secondary metrics showed [X]." Not "the variant won."

Mistake 2: Choosing the margin after seeing the results.

This is outcome-motivated reasoning dressed up as statistics. Your non-inferiority margin must be specified before the test runs. If you choose the margin after seeing that the primary declined by 1.8% and want to justify shipping, any margin wider than 1.8% is rationalizing a decision you already made, not testing a hypothesis.

Mistake 3: Applying non-inferiority retrospectively to a failed superiority test.

A test designed for superiority that shows a flat primary should not be retroactively declared non-inferior. The sample size was calculated for a different question, the alpha was set differently, and the confidence interval was oriented differently. Retroactive non-inferiority is a different claim than pre-specified non-inferiority.

Mistake 4: Using non-inferiority when you actually need to prove improvement.

If your business requires the primary metric to improve -- if you are at a performance threshold where flat is not acceptable -- then non-inferiority is not the right design. Be honest about what your business actually needs from a test before choosing the framework.

When Non-Inferiority Is Not Appropriate

Non-inferiority is the wrong framework when:

  • The change you are testing is specifically intended to improve the primary metric and should be evaluated on whether it does
  • Your primary metric is under-performing and flat results represent an opportunity cost, not an acceptable outcome
  • You do not have a genuine additive or secondary hypothesis -- you are just hoping the primary holds
  • You cannot pre-specify a business-grounded margin before running the test

The framework should be chosen because it matches the business question, not because you expect the primary metric to be flat and want a way to ship anyway. Used honestly, non-inferiority is a powerful tool. Used as a fallback for tests that did not achieve their original goal, it erodes the credibility of your entire testing program.

How Non-Inferiority Fits Into a Broader Testing Strategy

In a mature testing program, you should have both superiority and non-inferiority tests in your pipeline at the same time. They address different hypotheses and different types of business value.

Superiority tests drive direct conversion improvement. They are your primary engine of optimization.

Non-inferiority tests enable strategic changes that deliver value through channels other than the primary metric. Design improvements, channel additions, funnel quality work, and experience enhancements all belong here.

The two frameworks together give you the ability to make a much wider range of confident decisions. Without non-inferiority testing, you are limited to changes that move the primary metric directly. With it, you can confidently ship changes that improve the overall system even when the effect does not show up in the primary KPI.

If you are building out your testing infrastructure, GrowthLayer includes fields for specifying test type (superiority vs. non-inferiority), the pre-specified margin, and the secondary metrics that will drive the decision. Having that structure in place makes the design discipline easier to maintain across a team.
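As an illustration of what that structure can look like, here is a hypothetical pre-registration record sketched as a Python dataclass. This is not GrowthLayer's actual schema; every field name here is an assumption for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TestDesign:
    """Hypothetical pre-registration record; all field names are illustrative."""
    name: str
    test_type: str                   # "superiority" or "non_inferiority"
    primary_metric: str
    ni_margin: float | None = None   # pre-specified, in absolute points
    secondary_metrics: list[str] = field(default_factory=list)
    decision_rule: str = ""

phone_cta = TestDesign(
    name="Phone channel CTA",
    test_type="non_inferiority",
    primary_metric="digital_enrollment_completion_rate",
    ni_margin=-0.02,
    secondary_metrics=["phone_assisted_enrollments"],
    decision_rule="Ship if primary CI lower bound > margin and phone lift is positive.",
)
```

Writing the decision rule down before launch is cheap insurance against Mistake 2 above.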

Conclusion: Design for the Question You Are Actually Asking

The four non-inferiority tests I described generated substantial business value. One introduced a channel that produced hundreds of incremental conversions. One enabled a major redesign. Two surfaced funnel quality gains that lifted downstream conversion rates.

None of them would have been correctly evaluated under a standard superiority framework.

The key to non-inferiority testing is matching the statistical design to the business question. Not every test is asking "does variant beat control?" When the real question is "does this add value without breaking what works?", you need a different framework.

Pre-specify your margin. Calculate your sample size correctly. Use one-sided confidence intervals. Document your interpretation before you look at the data.

Done right, non-inferiority testing expands the range of questions your experimentation program can answer -- and the range of decisions you can make with confidence.

Ready to build a testing program that goes beyond simple win/loss tracking? GrowthLayer is free to start and gives your team the structure to run both superiority and non-inferiority tests with full documentation of design decisions. Start your first test today.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
