
The Pre-Test Feasibility Check That Prevents Wasted Experiments

Most tests are underpowered before they start. A pre-test feasibility check — MDE by week, runtime flags, traffic thresholds — prevents months of wasted experiment time.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
13 min read



One of the tests in our program had been running for eleven months.

Not because it was a complex multi-variant design that needed extended data collection. Because nobody had checked, before the test launched, whether the page had enough traffic to reach statistical significance within any reasonable timeframe. The page received a small but consistent stream of organic visitors each week. The hypothesis was sound. The test was designed correctly. It simply had no practical path to a result.

When I ran the power calculation retroactively — something that should have been done before the test was ever built — the math was unambiguous. At the page's observed traffic level, with a realistic minimum detectable effect for a change of that type, reaching 95% confidence would require somewhere between nine and fourteen months of runtime, assuming no seasonality and stable conversion rates throughout.

Nobody had looked at that number before launch. The test was already two months old when I did the calculation. It ran for another nine months before it was eventually stopped — inconclusive, as it had to be — and the development time that had been spent building the variant was never recouped.

That is one test. The broader audit of our program found that a majority of tests across the full portfolio were underpowered at launch — either because the traffic was insufficient for the planned MDE, or because the planned MDE was implausibly large, or because nobody had run a power calculation at all and the test had been greenlit on intuition about how quickly results would come in.

This is not a problem unique to our program. Underpowered tests are the most common failure mode in conversion rate optimization, and they are nearly always preventable.

What "Underpowered" Actually Means in Practice

Statistical power is the probability that a test will detect a real effect of a given size if that effect actually exists. A well-powered test — conventionally, 80% power at the minimum detectable effect — has an 80% chance of producing a statistically significant result if the true effect is at least as large as the MDE you specified.
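To make that concrete, here is a minimal sketch of the normal-approximation power calculation for a two-variant test on a conversion rate. The baseline, lift, and sample size below are hypothetical, chosen only to show what an underpowered configuration looks like numerically.

```python
# A rough, normal-approximation estimate of power for a two-variant test.
# The inputs are illustrative, not drawn from any test described in this article.
from statistics import NormalDist

def approx_power(p_control: float, p_variant: float, n_per_variant: int,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-proportion z-test."""
    se = (p_control * (1 - p_control) / n_per_variant
          + p_variant * (1 - p_variant) / n_per_variant) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_variant - p_control) / se
    return NormalDist().cdf(z_effect - z_crit)

# A true 10% relative lift on a 3% baseline, with 5,000 visitors per variant:
print(round(approx_power(0.03, 0.033, 5_000), 2))  # ~0.14, far below the 0.80 convention
```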

An underpowered test has a lower probability of detecting the real effect. This means one of two outcomes, neither of which is useful:

The first outcome is a false negative: the test runs, the variant performs better, but the difference does not reach statistical significance because the sample size was too small to distinguish a real effect from noise. The test concludes "no significant result" when there actually was one. The variant gets discarded. An intervention that would have improved conversion is not implemented.

The second outcome is the eleven-month test: the test keeps running in search of significance that the traffic level cannot deliver within any reasonable window. Resources are tied up. The development work is committed. The test slot is occupied. And eventually the test is stopped — not because it reached a conclusion, but because it has been running long enough to be embarrassing.

Both outcomes waste resources. The false negative wastes the resources spent designing and building a variant that worked. The indefinite-runtime test wastes those resources plus all the ongoing operational overhead of managing a running experiment that cannot conclude.

The question a feasibility check answers is: given this page's traffic level, this baseline conversion rate, and this effect size, will this test ever reach significance in a timeframe we can actually plan around?

Answering that question before a test is built prevents both failure modes.

The Three Inputs That Determine Feasibility

A pre-test feasibility calculation requires three inputs: baseline conversion rate, expected weekly traffic, and minimum detectable effect. Understanding what each input means — and what happens when any of them is incorrect — is the practical prerequisite for useful feasibility analysis.

Baseline Conversion Rate

The baseline conversion rate is the current rate at which visitors complete the primary metric action — form submissions, checkout completions, enrollments, whatever the test is measuring. It is drawn from historical analytics data, ideally from the most recent period that reflects current conditions.

The baseline rate matters because the power calculation is fundamentally about the variance in conversion events. A very low baseline rate — say, 1% — means that individual data points are almost always zero: a visitor either did not convert, which is the overwhelmingly common case, or did convert, which is rare. Detecting a meaningful difference in rare events requires much more data than detecting a difference in common ones. At 1% baseline conversion, reaching significance on even a substantial effect requires tens of thousands of visitors per variant.
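As a back-of-the-envelope check, the standard normal-approximation sample-size formula for comparing two proportions makes the point. This is a sketch for intuition, not a substitute for a proper power calculator, and the inputs are illustrative.

```python
# Sketch: required visitors per variant to detect a relative lift on a low
# baseline rate (normal approximation, 80% power, alpha = 0.05 two-sided).
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    p1, p2 = baseline, baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# A 1% baseline and a substantial 20% relative lift still needs tens of
# thousands of visitors per variant:
print(n_per_variant(0.01, 0.20))  # roughly 43,000
```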

A common error in feasibility planning is using an inflated baseline rate — pulling it from an unusually strong period, or from an analytics segment that does not match the test audience. An inflated baseline leads to optimistic power calculations: the test appears feasible on paper but is underpowered in practice.

Expected Weekly Traffic

Traffic to the test pages determines how quickly the sample accumulates. The feasibility question is really a time question: given the weekly traffic going into the test, how many weeks will it take to accumulate the sample size required for the specified power and MDE?

Weekly traffic estimates should be conservative. Traffic levels are rarely as stable as they look in a trailing average — seasonality, marketing spend changes, and external events all introduce variance. A weekly average drawn from an unusually high-traffic period will underestimate the runtime required. Seasonality adjustments matter most for tests that span multiple months: a test with a nine-month minimum runtime will span whatever seasonal patterns your traffic shows, and the average traffic across that period may be substantially different from the peak-period traffic that was used in the original estimate.

Traffic should also account for the split: in a simple two-variant test, each variant receives roughly half the traffic going to the test. The per-variant sample is what matters for the power calculation, not the total traffic.
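The runtime arithmetic itself is simple once the required per-variant sample size is known. A sketch, assuming an even 50/50 split and an illustrative traffic figure:

```python
# Sketch: weeks needed for a 50/50 test to accumulate its per-variant sample.
from math import ceil

def weeks_to_sample(n_per_variant: int, weekly_visitors: int,
                    n_variants: int = 2) -> int:
    visitors_per_variant_per_week = weekly_visitors / n_variants
    return ceil(n_per_variant / visitors_per_variant_per_week)

# e.g. ~43,000 needed per variant, 4,000 weekly visitors to the test pages:
print(weeks_to_sample(43_000, 4_000))  # 22 weeks, roughly five months
```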

Minimum Detectable Effect

The minimum detectable effect is the smallest effect size the test is designed to detect. It is the critical input in feasibility planning, and it is the input that produces the most confusion in practice.

The MDE is not the effect size you hope to see. It is the effect size below which you would consider the result inconsequential — too small to justify implementation costs, too small to warrant the ongoing maintenance of the change, too small to represent a meaningful improvement in the user experience or the business metric.

Choosing an inflated MDE — "we expect at least a 15% lift" when a 2% lift would actually be meaningful and worth implementing — produces optimistic feasibility calculations. The test appears feasible at a 15% MDE because the sample size required to detect a 15% lift is much smaller than the sample size required to detect a 2% lift. The test runs. It does not find a 15% lift. It is declared inconclusive. What it actually found — if anything — is unknown, because it was underpowered for the effect size that would have mattered.

The right way to choose an MDE is to work backward from the business context: what is the smallest lift on this metric that would change a decision? If a 2% improvement in form completion is operationally meaningful, the MDE should be set at 2%. If a 10% improvement is the minimum that would justify the development cost of implementing the change, the MDE should be set at 10%.

This calculation is harder than it appears, because it requires honest thinking about what "meaningful" means in the specific context of the test — which varies by page, by metric, and by the cost of the proposed change.

MDE-by-Week: Why a Table Is More Useful Than a Single Number

The conventional output of a power calculation is a single number: the minimum sample size required to detect the specified effect at the specified power and significance level. That number is useful, but it is less actionable than a table that shows MDE as a function of runtime.

An MDE-by-week table answers a different question: not "how long does this test need to run?" but "what effect size can we detect if we run this test for N weeks?"

This framing is more operationally useful because it reframes the decision. Instead of checking whether a planned test is feasible at a fixed MDE, the team can look at the full range of detectable effects across the range of practical runtimes and ask: is any of these outcomes worth pursuing?

Suppose the table for a given page reads: at two weeks, the test can only detect effects larger than 18%; at four weeks, effects larger than 12%; at eight weeks, effects larger than 8%; at twelve weeks, effects larger than 6%.

Now the decision is explicit: is a 12% lift realistic for this type of change on this page? Is a 6% lift, at twelve weeks, a good use of a test slot? The conversation shifts from "let's run it and see" to "what are we actually committing to, and is it worth committing to it?"

That is a more productive planning conversation. It forces the team to be specific about what success looks like at each point in the runtime range, and it surfaces implausible optimism about effect sizes before the test is built.
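Here is a minimal sketch of how such a table can be generated from the three inputs, using the same normal-approximation formula as above and a simple search for the smallest detectable lift at each runtime. The traffic and baseline figures are illustrative.

```python
# Sketch: MDE-by-week, i.e. the smallest relative lift detectable at 80% power
# for each candidate runtime, given a baseline rate and weekly traffic.
from math import sqrt
from statistics import NormalDist

def n_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    p1, p2 = baseline, baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

def mde_for_weeks(weeks, weekly_visitors, baseline, n_variants=2):
    """Binary-search the smallest relative lift the available sample can detect."""
    available_per_variant = weeks * weekly_visitors / n_variants
    lo, hi = 0.001, 2.0  # search between 0.1% and 200% relative lift
    for _ in range(60):
        mid = (lo + hi) / 2
        if n_per_variant(baseline, mid) <= available_per_variant:
            hi = mid
        else:
            lo = mid
    return hi

for weeks in (2, 4, 8, 12, 26):
    mde = mde_for_weeks(weeks, weekly_visitors=4_000, baseline=0.03)
    print(f"{weeks:>2} weeks: detectable relative lift >= {mde:.1%}")
```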

When the Right Answer Is "Don't Test This"

One of the outputs that a pre-test feasibility check should produce — and that most testing programs are uncomfortable producing — is a clear recommendation not to run a test.

Not every hypothesis warrants an experiment. Not every page with a testable element has enough traffic to generate actionable results. Not every change with a plausible mechanism has an effect size that is practically significant and also detectable within a reasonable runtime.

When the feasibility check shows that a proposed test cannot reach significance in under a year at any effect size that would be worth implementing, the correct decision is not to run the test. The correct decision is to find a higher-traffic page that addresses the same hypothesis, to redesign the hypothesis around a higher-traffic moment in the funnel, or to pursue the answer through qualitative research — user sessions, interviews, analytics deep dives — that can generate directional evidence without requiring a large sample.

The discomfort with this recommendation comes from the sunk cost of ideation: the team has spent time developing a hypothesis, and declaring it infeasible before testing feels like waste. But the waste is minimal compared to the waste of running an underpowered test for six or eight months and getting a result that cannot be interpreted.

A pre-test feasibility check that produces a "don't test" recommendation is not a failure. It is the tool working correctly. The alternative — running the test anyway — does not eventually produce a result just because the team is committed to it. It produces months of resource commitment with no interpretable output.

The Overrun Problem: Tests That Run Double Their Planned Duration

The second class of feasibility failure is different from the underpowered-at-launch problem. It is the test that was feasible on paper, ran past its planned duration, and still could not reach significance — because the real effect was much smaller than the effect size that was used in the power calculation.

We had a test in our program that ran for more than double its planned duration before it was stopped. The test had been planned for six weeks, based on a power calculation that assumed a 10% minimum detectable effect. At six weeks, the result was directional but not significant. The team extended the runtime.

At twelve weeks, the result was still directional but not significant. The team extended again.

The test was eventually stopped at fourteen weeks with a final result that was directional, below the significance threshold, and entirely consistent with the interpretation that the true effect was somewhere around 3% to 4% — far below the 10% MDE that had been used in the planning calculation.

The root cause was an MDE that was set optimistically rather than based on the minimum commercially meaningful effect for that change. The team had assumed a 10% lift because "10% feels like a real effect" rather than because 10% was the smallest lift that would change the implementation decision. In fact, a 3% improvement in that metric, at that point in the funnel, would have been worth implementing. The test had been planned to be underpowered for the effect size that actually mattered.

If the feasibility check had been run using an MDE of 3% rather than 10%, the required sample size would have been substantially larger, the planned runtime would have been around twenty weeks, and the team would have faced a decision: is a twenty-week test on this hypothesis the best use of a test slot? That conversation might have led to a more realistic commitment or to a different test design. It did not happen, because the feasibility check was run with an MDE that felt achievable rather than one that was defined by the business context.

Key Takeaway: The MDE used in feasibility planning must be defined by the smallest effect that would be worth implementing — not by the effect the team expects or hopes to see. An MDE set by optimism produces planned runtimes that are too short and tests that run long past their planned completion without reaching significance.

When Qualitative Research Is the Right Next Step

The feasibility check does not just prevent bad experiments. It redirects resources toward approaches that are more likely to produce useful information given the constraints.

When a page has insufficient traffic to power a meaningful test in a reasonable timeframe, the right move is usually qualitative research: session recordings, user interviews, usability testing, or analytics analysis that generates insight into what is driving the current behavior without requiring a controlled experiment.

This is not a consolation prize. For low-traffic pages, qualitative research often produces better insights than underpowered experiments, because the insights from qualitative work describe the behavioral mechanisms at play — why users are hesitating, what they are misunderstanding, what information they are looking for — rather than just measuring whether a proposed change moved a number.

An underpowered test that runs for eight months and produces an inconclusive result tells you nothing about why users behaved as they did. A series of twelve user sessions on the same page, watched carefully with the hypothesis in mind, produces specific behavioral observations that can inform both the next test design and the design of the page itself.

The feasibility check that flags a test as impractical is also, implicitly, a recommendation to pursue qualitative research on the same question. GrowthLayer's feasibility calculator makes this recommendation explicit: when the traffic level is below the threshold for a practical test at any implementable effect size, the output includes a flag recommending qualitative research before returning to experiment design.

How GrowthLayer's Pre-Test Feasibility Calculator Works

The feasibility calculator in GrowthLayer is built into the test ideation and planning flow. Before a test brief is finalized, the system asks for: the URL or page location, the baseline conversion rate for the primary metric, the estimated weekly traffic to the test pages, and the minimum effect size the team would consider worth implementing.

From those inputs, the system generates an MDE-by-week table showing the detectable effect size at each runtime increment from one to twenty-six weeks. It identifies the minimum runtime to reach significance at the specified MDE, and it flags tests where that minimum runtime exceeds twelve weeks as requiring review — not an automatic disqualification, but a deliberate decision.

Tests where the minimum runtime exceeds twenty-six weeks, or where no practically meaningful effect is detectable within twenty-six weeks, are flagged as infeasible with a recommendation to pursue qualitative research or to redesign the hypothesis around a higher-traffic page or metric.
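A sketch of how those thresholds could be applied to the output of a feasibility calculation. The function and argument names here are hypothetical, illustrating the rule rather than GrowthLayer's actual implementation:

```python
# Sketch: mapping a minimum-runtime estimate to a feasibility flag.
# Threshold values follow the rules described above; names are hypothetical.
def feasibility_flag(min_runtime_weeks, review_after=12, infeasible_after=26):
    # None means no implementable effect is detectable within the horizon.
    if min_runtime_weeks is None or min_runtime_weeks > infeasible_after:
        return "infeasible: pursue qualitative research or redesign the hypothesis"
    if min_runtime_weeks > review_after:
        return "requires review: confirm the runtime commitment is deliberate"
    return "feasible at the specified MDE"

print(feasibility_flag(9))     # feasible at the specified MDE
print(feasibility_flag(18))    # requires review: ...
print(feasibility_flag(None))  # infeasible: ...
```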

The output is stored with the test record, so the planned MDE and the power calculation assumptions are part of the permanent test documentation. When the test concludes, the system can compare the actual effect size to the planned MDE and flag tests that ran substantially longer than the planned duration — providing feedback into the feasibility planning process over time.

The goal is not to prevent ambitious tests. It is to ensure that every test in the pipeline has a realistic path to an interpretable result, that the resources committed to the test are commensurate with the information it can realistically produce, and that decisions about runtime and scope are made deliberately rather than by default.

Building the Habit: Pre-Test Feasibility as a Gate

The most common implementation failure for feasibility analysis is treating it as optional. Teams run the calculation when they remember to, skip it when they are under deadline pressure, and use results to confirm decisions that have already been made rather than to inform decisions that have not.

Feasibility analysis produces value only when it is a gate, not a checkbox. It must run before the test is built — not while it is running, not after it has been approved for development. The question "can this test produce an interpretable result in a practical timeframe?" has to be answered before resources are committed, or the analysis cannot prevent the failure it is designed to prevent.

Structurally, this means building the feasibility check into the workflow at the point where a test moves from idea to brief. The brief should not be written until the feasibility analysis is complete. The development request should not be sent until the brief includes the feasibility output.

This is a workflow change, not just a tool change. The tool makes the calculation easy. The workflow ensures the calculation is done before the point where the answer can change the outcome.

Conclusion

The eleven-month test was the most visible example of the feasibility problem in our program, but it was not the only one. The full audit revealed that the majority of tests across the portfolio had launched without a pre-test power calculation, and a substantial share of those would have been flagged as infeasible or as requiring substantially longer runtimes than were planned.

The resources committed to those tests — design time, development time, QA time, analyst time — represent the true cost of skipping the feasibility check. It is not an abstraction. It is weeks of team time committed to experiments that could not produce the information they were run to generate.

The pre-test feasibility check is not complex. It requires three inputs, one calculation, and one honest conversation about whether the expected effect size is realistic and whether the minimum implementable effect is actually what the test is powered to detect.

That conversation, held before the test is built, is the most efficient investment a testing program can make in its own productivity.

GrowthLayer's pre-test feasibility calculator shows MDE-by-week for every proposed test, flags impractical experiments before resources are committed, and recommends qualitative research when traffic is insufficient for a powered experiment. Run your first feasibility check.

Run the same feasibility check yourself with the free Sample Size Calculator and MDE Calculator, or browse all 12 free A/B testing calculators.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
