
The Real Cost of Inconclusive Tests (And How Pre-Test Calculations Fix It)

---




By Atticus Li -- Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com

Last year, I audited an experimentation program at a company running about 40 tests per year. Fourteen of those tests -- 35% -- ended as inconclusive. Not losers. Not winners. Just... nothing. No result. Weeks of work with zero signal.

When I tallied the cost -- design time, development hours, QA cycles, traffic opportunity cost, and the analysis time at the end -- those 14 inconclusive tests represented roughly $280,000 in wasted resources. And the worst part: every single one of them was predictable. A 10-minute pre-test calculation would have flagged each one as underpowered before a single line of code was written.

At NRG Energy, where we run 100+ experiments per year, we cannot afford a 35% inconclusive rate. Our target is under 10%, and we consistently hit it. The difference is not luck. It is pre-test math.

What an Inconclusive Test Actually Costs

Teams treat inconclusive tests as a minor disappointment. "Oh well, we did not learn anything. Let us move on." But the costs are real and quantifiable:

Design and research time. Before any test runs, someone spends 5-20 hours on hypothesis development, competitive research, user research synthesis, and variant design. For an inconclusive test, all of that time produced zero actionable output.

Development and implementation. Building test variants, setting up the experiment in your platform, implementing tracking, and configuring goals. At enterprise scale, this is typically 20-60 hours of engineering time per test. For inconclusive tests, those hours could have gone toward a test that would have produced a result.

QA and validation. Cross-browser testing, mobile verification, accessibility checks, analytics validation. Another 5-15 hours per test.

Traffic opportunity cost. This is the big one that nobody calculates. While an underpowered test is running for 6 weeks and going nowhere, that traffic could have been allocated to a properly powered test that would have produced a decision. If you run tests sequentially on the same page, an inconclusive test delays every test behind it in the queue.

Analysis and reporting time. Ironically, inconclusive tests often take more analysis time than conclusive ones, because the team tries to find signal in the noise. "Maybe if we cut it by this segment..." "What if we extend it another week?" These are sunk-cost-driven efforts that rarely produce value.

Add it up for a typical enterprise test:

| Cost Component | Hours | Loaded Cost |
| --- | --- | --- |
| Research and design | 10-20 | $2,000-$4,000 |
| Development | 20-60 | $4,000-$12,000 |
| QA and validation | 5-15 | $1,000-$3,000 |
| Analysis and reporting | 5-10 | $1,000-$2,000 |
| Traffic opportunity cost | -- | $5,000-$15,000 |
| Total per inconclusive test | -- | $13,000-$36,000 |

Multiply that by 10 inconclusive tests per year and you are looking at $130,000-$360,000 in waste. For a program running 100+ tests, even a 10% inconclusive rate means significant lost value.

Why Tests End Up Inconclusive

There are exactly three reasons a test comes back inconclusive:

1. The real effect is smaller than the test can detect. You expected a 5% lift, but the true effect is 1.5%. Your test does not have enough statistical power to distinguish a 1.5% lift from zero. This is the most common reason, and it is entirely preventable with pre-test MDE calculations.

2. There is genuinely no effect. The variant is the same as the control. This is actually a useful result -- you learned that this change does not matter. But many teams code this as "inconclusive" instead of "the variant does not work," which is a framing problem, not a statistical one.

3. The test ran into a technical issue. Tracking broke, segments leaked, SRM invalidated the results. This is preventable with AA testing and monitoring, which I cover in a separate post.

Reason 1 is where the money is. And the fix is simple.

The MDE-First Approach

Minimum Detectable Effect is the smallest improvement your test can reliably detect given your sample size, baseline conversion rate, and desired statistical power.

Before any test enters our queue at NRG, we calculate MDE. The question is not "will this test win?" The question is "can this test produce a conclusive result?"

The calculation:

For a standard two-sample proportion test at 80% power and 95% confidence:

MDE = 2.8 x sqrt(2 x p x (1-p) / n)

Where:

  • p = baseline conversion rate
  • n = sample size per variant
  • 2.8 comes from the z-scores for alpha=0.05 (1.96) and beta=0.20 (0.84)

Example: Your page converts at 4% (p = 0.04). You can get 15,000 visitors per variant in 4 weeks. Your MDE is:

MDE = 2.8 x sqrt(2 x 0.04 x 0.96 / 15,000) = 2.8 x 0.00226 = 0.00633

That is a relative MDE of 0.633 / 4.0 = 15.8%. You can only detect effects larger than roughly a 16% relative improvement. If your hypothesis expects a 5% improvement, this test will almost certainly come back inconclusive. Do not run it.
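
If you want to script this check rather than reach for an online calculator, here is a minimal sketch in Python (the function name and defaults are mine, not from any particular testing library):

```python
from math import sqrt

def mde(baseline_rate: float, n_per_variant: int,
        z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """Absolute minimum detectable effect for a two-sample proportion test
    at 95% confidence (z_alpha = 1.96) and 80% power (z_beta = 0.84)."""
    p = baseline_rate
    return (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n_per_variant)

# The example above: 4% baseline, 15,000 visitors per variant
abs_mde = mde(0.04, 15_000)
print(f"absolute MDE {abs_mde:.4f}, relative MDE {abs_mde / 0.04:.1%}")
# -> absolute MDE 0.0063, relative MDE 15.8%
```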

What to do instead:

Option A: Increase sample size. Can you run the test for 16 weeks instead of 4? Can you expand the audience? (A quick sample-size sketch follows these options.)

Option B: Increase the expected effect size. Can you make a bolder change? Small copy tweaks rarely produce large effects. A structural redesign might.

Option C: Change the metric. A micro-conversion (CTA click) will have a higher base rate and produce a detectable effect faster than a macro-conversion (purchase).

Option D: Do not test it. Some changes are not worth testing. Ship based on best judgment, or do not ship at all.
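
For Option A, the same formula can be inverted to ask how much traffic a given hypothesis actually needs. Here is a sketch under the same 80% power / 95% confidence assumptions (the helper name is mine):

```python
from math import ceil

def required_n_per_variant(baseline_rate: float, abs_mde: float,
                           z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Sample size per variant needed to detect a given absolute lift,
    from rearranging MDE = (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n)."""
    p = baseline_rate
    return ceil(2 * p * (1 - p) * ((z_alpha + z_beta) / abs_mde) ** 2)

# To detect a 5% relative lift on a 4% base rate (0.002 absolute):
print(required_n_per_variant(0.04, 0.04 * 0.05))  # -> 150528 visitors per variant
```

At 15,000 visitors per variant every 4 weeks, that is roughly 40 weeks of traffic, which is exactly the kind of hypothesis that belongs under Option B, C, or D instead.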

Revenue-Per-Customer Projections: Is It Worth Testing?

MDE tells you whether a test can be conclusive. Revenue projections tell you whether it should be.

Even if a test is statistically feasible, it might not be worth the testing slot. At NRG, we pair MDE calculations with revenue projections:

Expected Revenue Impact = (Annual Traffic) x (MDE) x (Revenue Per Conversion) x (Expected Win Rate)

We use our program-level win rate (24%+) as the probability that any given test will win. This gives us the expected value of running the test.

Example: A test on a page with 500,000 annual visitors, an MDE of 10% relative (on a 4% base rate), and $200 revenue per conversion:

Expected Revenue Impact = 500,000 x 0.004 x $200 x 0.24 = $96,000

If the test costs $20,000 to run (design, dev, QA, opportunity cost), the expected ROI is positive. Run it.

But if the page only gets 50,000 annual visitors:

Expected Revenue Impact = 50,000 x 0.004 x $200 x 0.24 = $9,600

Now the expected value barely covers the cost of running the test. Think hard about whether this is the best use of a testing slot. This calculation is a core part of the Revenue Rank step in the PRISM Method.
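
As a sketch, the projection is easy to script alongside the MDE check (the function and variable names are mine; 0.24 is the program-level win rate cited above):

```python
def expected_revenue_impact(annual_traffic: int, abs_mde: float,
                            revenue_per_conversion: float,
                            win_rate: float = 0.24) -> float:
    """Expected annual revenue from running the test: absolute lift in
    conversion rate (relative MDE x base rate) times traffic and revenue
    per conversion, discounted by the probability the test wins."""
    return annual_traffic * abs_mde * revenue_per_conversion * win_rate

# 10% relative MDE on a 4% base rate = 0.004 absolute
print(expected_revenue_impact(500_000, 0.004, 200))  # -> 96000.0
print(expected_revenue_impact(50_000, 0.004, 200))   # -> 9600.0
```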

When NOT to Run a Test

Based on our pre-test calculations, here are the situations where we explicitly choose not to test:

MDE exceeds plausible effect size. If you need a 30% relative improvement to detect anything, and the change is a button color swap, you are not going to get a 30% lift. Do not run the test.

Expected revenue impact is below testing cost. If the best-case scenario does not justify the resources, skip it. Ship the change without testing (if the risk is low) or deprioritize it entirely.

Traffic is too seasonal or volatile. If the page gets 80% of its traffic in one week per year, you cannot run a meaningful test during the other 51 weeks. Plan tests around traffic availability.

The decision has already been made. If leadership has already committed to shipping the change regardless of test results, do not waste a testing slot validating a foregone conclusion. Use that slot for a test that can actually influence a decision.

The test duration exceeds the decision timeline. If you need a result in 2 weeks but the test requires 8 weeks to reach power, the test cannot inform the decision. Find a faster alternative or make the decision without data.

Building This Into Your Process

Here is the pre-test checklist we use at NRG:

  1. State the hypothesis and expected effect size. Be specific: "We expect a X% relative improvement in [metric]."
  2. Calculate MDE. Given available traffic and test duration, what is the smallest effect you can detect?
  3. Compare expected effect to MDE. If expected effect < MDE, redesign the test or do not run it.
  4. Calculate revenue impact at MDE. If the test wins at exactly MDE, what is the revenue impact?
  5. Compare revenue impact to test cost. If impact < cost, deprioritize.
  6. Document the decision. Whether you run the test or not, record why. This builds institutional knowledge about what is worth testing.

This process adds about 30 minutes per test idea. It saves weeks per inconclusive test avoided. The math is not hard. The discipline to do it consistently is the challenge.
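
For teams that want to wire steps 2 through 5 into a single go/no-go check, here is a minimal sketch that combines the two calculations above (the names, thresholds, and messages are illustrative, not part of any formal PRISM tooling):

```python
from math import sqrt

def pretest_check(baseline_rate: float, n_per_variant: int,
                  expected_relative_lift: float, annual_traffic: int,
                  revenue_per_conversion: float, test_cost: float,
                  win_rate: float = 0.24) -> str:
    """Steps 2-5 of the checklist: MDE, effect-vs-MDE, revenue impact
    at the MDE, and impact-vs-cost."""
    p = baseline_rate
    abs_mde = 2.8 * sqrt(2 * p * (1 - p) / n_per_variant)                 # step 2
    if expected_relative_lift * p < abs_mde:                              # step 3
        return "redesign or skip: expected effect is below the MDE"
    impact = annual_traffic * abs_mde * revenue_per_conversion * win_rate  # step 4
    if impact < test_cost:                                                # step 5
        return "deprioritize: revenue impact at MDE is below test cost"
    return "run it"

# The example from earlier: 4% baseline, 15,000 per variant, 5% expected lift
print(pretest_check(0.04, 15_000, 0.05, 500_000, 200, 20_000))
# -> redesign or skip: expected effect is below the MDE
```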

The Bottom Line

Inconclusive tests are the silent killer of experimentation programs. They waste resources, demoralize teams, and erode stakeholder confidence in testing. And the vast majority of them are preventable.

Do the math before you run the test. Calculate MDE. Project revenue impact. Compare to cost. If the numbers do not work, have the discipline to say "this is not worth testing."

Your program's credibility depends not on how many tests you run, but on how many of them produce decisions.

Atticus Li leads enterprise experimentation at NRG Energy with a 24%+ win rate. Pre-test calculations are a core component of the PRISM Method. Learn more at atticusli.com.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
