Most of Our Tests Were Underpowered: The Pre-Test Calculation That Would Have Saved Months of Wasted Runtime
The vast majority of our tests were underpowered, with a coin-flip chance or worse of detecting real effects. One needed nearly a year of runtime. Here's the pre-test calculation that prevents this waste.
I want to tell you about a test we ran on a page with 37 daily visitors.
It was a reasonable-looking test on paper. The page was real, the hypothesis was grounded, the form was a genuine conversion barrier. Someone had done the work to identify the problem and design a solution. The test was built, QA'd, and launched.
What nobody had done was run the sample size calculation.
If they had, they would have discovered that to detect a 10% lift on that page at 80% power with a two-tailed test at 95% confidence, the test needed nearly a year of runtime. Nearly a full year. And that is assuming the page traffic held steady — which it would not, because seasonal variation in the category meant Q4 traffic was roughly 40% lower than Q3.
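For readers who want to check the arithmetic, here is a minimal sketch of that pre-test calculation: a two-arm, 50/50 test evaluated with a two-tailed z-test on proportions. The 20% baseline conversion rate is a hypothetical stand-in, since the page's actual baseline is not stated here; the traffic figure is the page's roughly 37 daily visitors.

```python
# Minimal pre-test runtime sketch. Assumptions: 50/50 split, two-tailed
# z-test on proportions, and a hypothetical 20% baseline conversion rate.
from scipy.stats import norm

def required_runtime(traffic_per_period, baseline_cr, relative_mde,
                     alpha=0.05, power=0.80):
    """Periods of runtime (days, weeks -- whatever unit traffic_per_period
    is expressed in) needed to power the test for the given relative MDE."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # 1.96 + 0.84
    # Standard two-proportion sample size, per arm
    n_per_arm = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return 2 * n_per_arm / traffic_per_period

print(round(required_runtime(37, 0.20, 0.10)))   # ~352 days: nearly a year
```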
The test would never have had a meaningful chance of detecting the effect it was looking for. It was not a test. It was an observational study with a variant toggle.
When we ran a comprehensive power audit of the test portfolio — pulling every completed experiment, calculating retrospective power given actual traffic and the minimum detectable effect that would have been commercially meaningful — the vast majority of tests had achieved statistical power below 50%. Most of them sat between 25% and 45%. A coin flip has 50% power. Most of our tests were worse than a coin flip at detecting real effects.
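The audit calculation itself is the same formula run in reverse. Here is a sketch of the retrospective check, with made-up audit numbers for illustration:

```python
# Retrospective power sketch: given the sample a test actually accumulated,
# what chance did it ever have of detecting the commercially meaningful MDE?
from scipy.stats import norm

def achieved_power(n_per_arm, baseline_cr, relative_mde, alpha=0.05):
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    # Two-tailed power; the lower-tail rejection region is negligible here
    return 1 - norm.cdf(z_crit - abs(p2 - p1) / se)

# Hypothetical audit entry: 4,000 visitors per arm, 12% baseline, 10% MDE
print(f"{achieved_power(4_000, 0.12, 0.10):.0%}")   # ~37%: worse than a coin flip
```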
This is not a story about a poorly run program. It is a story about a structural problem that affects most experimentation programs, including well-funded ones at sophisticated organizations. And it has a solution that is genuinely simple — not easy to implement organizationally, but simple to understand and execute technically.
What Statistical Power Actually Means for Your Test Program
Power is the probability that your test will detect a real effect, given that the effect actually exists. An 80% power test, set up correctly, has an 80% chance of returning a statistically significant result when the true effect size equals or exceeds your minimum detectable effect (MDE). A 45% power test has a 45% chance.
The implication that most teams underestimate: a test with 45% power is not just "less certain" than a test with 80% power. It is actively misleading. When a 45% power test returns a non-significant result, you cannot distinguish between "the hypothesis is wrong" and "the test was too small to see the effect." The result is genuinely uninformative. You have spent testing capacity, developer time, and weeks of real traffic to generate a null result you cannot interpret.
Now multiply that by the vast majority of your portfolio.
What I found when I audited the full dataset was not that we had been running tests and learning slowly. We had been running tests and not learning at all on most of them, or worse, learning the wrong thing. There were six tests in the dataset that showed strong Bayesian directional signals, probabilities of 80% or higher that the variant was genuinely better than control, yet were closed as inconclusive because they never reached frequentist significance. Each of those tests needed only 2-3x more traffic to cross the threshold. They were killed a third to half of the way to a real result.
Each of those six tests represented a winning hypothesis that was abandoned. Some of those hypotheses were later retested. Some were not. The ones that were not represent a permanent knowledge gap in the program.
Key Takeaway: A test with 45% power does not learn slowly; it actively misleads. A non-significant result from an underpowered test cannot distinguish between a wrong hypothesis and an undetectable effect. The vast majority of tests in our audit were in this zone.
Why Pre-Test Traffic Estimates Fail: The Three Root Causes
The power problem almost always starts with a bad traffic estimate. When I traced back through the pre-test planning documents for the underpowered tests, the same failure modes appeared repeatedly.
Root cause 1: Using aggregate site traffic instead of page-specific traffic.
One test estimated 57,611 weekly visitors to the relevant page. The planning assumption had been built from a site-wide or section-wide traffic figure that someone had pulled from the analytics dashboard without filtering to the specific page and the specific audience the test was targeting. The actual weekly traffic to the test page, for the target audience segment, was 6,520. The estimate was off by 88%.
At 57,611 weekly visitors, the test needed roughly 3 weeks to achieve adequate power. At 6,520 weekly visitors, it needed 26 weeks. These are not in the same universe operationally. The test that was planned as a 3-week sprint was either closed prematurely at week 4 with insufficient data, or held open for months with no one paying attention to the statistical situation.
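Reusing the required_runtime sketch from the opening (with a hypothetical 7% baseline and 5% relative MDE, chosen purely for illustration), the gap is immediate:

```python
# Feed the function weekly traffic and it returns weeks of runtime.
print(round(required_runtime(57_611, 0.07, 0.05)))   # ~3 weeks as planned
print(round(required_runtime(6_520, 0.07, 0.05)))    # ~26 weeks in reality
```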
Root cause 2: Using the wrong baseline conversion rate.
A second test used a baseline conversion rate of 21.9% in its sample size calculation. The actual baseline conversion rate, as observed in the control variant during the test, was 18.1%. The discrepancy was traced to a data collection issue: two distinct pages shared the same analytics page name, so the metrics dashboard was blending the conversion data from both. The analyst doing the pre-test calculation was looking at the blended rate, which was inflated by the higher-converting second page.
A 21.9% baseline versus an 18.1% baseline changes the required sample size meaningfully: holding the relative MDE fixed, the lower true baseline needs roughly 25% more traffic for the same power. The test launched with a false sense of adequacy and ran into the same underpowered outcome.
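A quick sensitivity check with the same sketch function confirms the scale of the problem (passing traffic of 1 so it returns the total sample directly; the 10% relative MDE is a hypothetical choice):

```python
# Total sample needed at the true vs. the inflated baseline, same relative MDE
n_true = required_runtime(1, 0.181, 0.10)
n_planned = required_runtime(1, 0.219, 0.10)
print(f"{n_true / n_planned:.2f}")   # ~1.27: roughly a quarter more traffic
```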
Root cause 3: Not accounting for test dilution from multi-page tests.
Several tests ran across multiple pages where not all pages had equal traffic. The sample size calculation had assumed that traffic would be evenly distributed, or had been calculated using the highest-traffic page as the baseline. In practice, traffic was concentrated on one page and conversion events were distributed unevenly. The measurable effect was diluted across pages where the change mattered less, and the power calculation, built on best-case assumptions, never accounted for that dilution.
The fix for all three root causes is the same: validate your traffic estimate and baseline conversion rate against actual analytics data for the specific page and specific audience you are targeting, pulled as close to the test launch date as possible. Not aggregate data. Not historical annual averages. The specific number, for the specific page, from the last 4-8 weeks.
Key Takeaway: Most power failures trace to three root causes: aggregate traffic estimates instead of page-specific ones, wrong baseline conversion rates from blended analytics, and multi-page tests calculated on best-case assumptions. Fix the input, and the power calculation takes care of itself.
The MDE-by-Week Table: The Most Useful Pre-Test Artifact Nobody Makes
The standard pre-test power calculation answers one question: "How long do I need to run this test to detect a given effect size at a given power level?" This is useful. But it answers the question in the wrong direction for most business contexts.
The more useful question is: "Given the traffic I actually have, what is the smallest effect I can reliably detect at each point in time?" This is the MDE-by-week table, and it is the most valuable pre-test artifact I have found.
Here is why it matters: the first question assumes you have decided on an MDE and are calculating the runtime. But in practice, MDEs are not decided — they are assumed. Someone picks 10% because it sounds reasonable. Or 5% because a colleague mentioned that number once. The actual commercial significance of different effect sizes at different pages is never examined.
The MDE-by-week table forces this examination. It shows the business stakeholder, before the test launches: if we run this test for 2 weeks, we can only detect effects of 25% or larger. If we run it for 6 weeks, we can detect effects of 15% or larger. If we run it for 12 weeks, we can detect effects of 10% or larger. Which of these effect sizes would actually be commercially meaningful on this page?
If the answer is "we need to detect 10% effects to justify the test," and the page traffic makes a 10% MDE achievable only at 12 weeks of runtime, the team now has a decision to make: commit to the 12-week runtime, find a higher-traffic equivalent page, or acknowledge that this page is not the right test vehicle and pivot to qualitative research instead.
The MDE-by-week table does not tell you what to do. It tells you what the actual options are. That is the information that was missing from most of the underpowered tests in our audit — not analytical capability, but a clear picture of the tradeoffs before the test launched.
Building this table is not technically complex. It requires the actual traffic number, the actual baseline conversion rate, and a standard power calculation formula iterated across different runtime weeks and MDE values. The output is a 6x4 grid that takes 20 minutes to build and potentially saves months of wasted runtime.
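Here is a sketch of that grid, with hypothetical inputs (about 4,800 weekly visitors at a 5% baseline, chosen so the output lines up with the illustration above):

```python
# MDE-by-week table: smallest detectable relative lift at each runtime,
# using the common equal-variance approximation for a two-proportion z-test.
from math import sqrt
from scipy.stats import norm

def mde_by_week(weekly_visitors, baseline_cr, weeks_list,
                alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    for weeks in weeks_list:
        n_per_arm = weekly_visitors * weeks / 2   # 50/50 split
        abs_mde = z * sqrt(2 * baseline_cr * (1 - baseline_cr) / n_per_arm)
        print(f"{weeks:>2} weeks -> detectable lift >= {abs_mde / baseline_cr:.0%}")

mde_by_week(4_800, 0.05, [2, 4, 6, 8, 12, 16])
# 2 weeks -> 25%, 6 weeks -> 14%, 12 weeks -> 10%, 16 weeks -> 9%
```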
GrowthLayer auto-generates this table for every test based on the traffic and baseline data entered during test setup, and flags any test where the minimum commercially meaningful MDE is not achievable within the planned runtime at 80% power. The flag happens before the test launches — not after you have run for 8 weeks and are wondering why you have no result.
Key Takeaway: The MDE-by-week table is the pre-test artifact that prevents most power failures. It shows what effect sizes are detectable at different runtimes given your actual traffic, forcing an explicit decision about whether the test is worth running before resources are committed.
The 37-Visitor-Per-Day Page: When A/B Testing Is the Wrong Tool
Let me return to the test I opened with.
The 37-visitor-per-day page needed nearly a year to detect a 10% lift. If the actual effect size was 5%, a perfectly plausible outcome for a form optimization test, it would have needed over 1,200 days. That is not a testing problem. That is a statistical impossibility.
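The same runtime sketch from the opening makes the point directly (again with the hypothetical 20% baseline):

```python
print(round(required_runtime(37, 0.20, 0.05)))   # ~1,383 days for a 5% lift
```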
What is the right tool for low-traffic pages?
The answer depends on what kind of question you are trying to answer.
If the question is "does this change improve conversion rates," and you cannot answer that question with statistical confidence given your traffic, then you have two options: find a proxy metric that has sufficient volume (microconversions — scroll depth, click events, form interaction — can sometimes stand in for full conversion), or switch from quantitative to qualitative research.
Qualitative research on low-traffic pages is often dramatically underutilized. A page with fewer than fifty daily visitors might generate 10-15 user sessions per week that could be reviewed via session replay. If the change you are considering involves a form, a content block, or a call-to-action, watching 20-30 sessions on the current design before building a variant will often reveal friction points that no A/B test is needed to confirm. The session replay tells you: users are abandoning at this field, users are reading this section and then leaving, users are missing this CTA entirely because of where it is positioned.
The tactical rule I now apply: any page with fewer than 200 daily visitors in the target audience gets a mandatory qualitative review before an A/B test is scoped. If the qualitative review reveals an obvious, uncontroversial UX problem — something that every session replay shows happening — that problem goes on the fix-it list, not the A/B test list. A/B testing is for testing, not for fixing things you already know are broken.
Key Takeaway: Pages with fewer than 200 daily visitors in the target segment should receive a mandatory qualitative review before an A/B test is scoped. If the session replays reveal an obvious friction point, fix it without testing. Reserve A/B tests for questions that are genuinely uncertain.
The "Run It Longer" Trap: Why Extending Underpowered Tests Rarely Works
When an underpowered test fails to reach significance, the most common organizational response is to extend the runtime. "Let's give it a few more weeks." This sounds reasonable. It is usually the wrong call.
Here is the problem: the reason the test did not reach significance is typically not that it did not run long enough. It is that the real effect is smaller than the planned MDE. The test was powered to detect a 10% lift. The actual effect is 3%. Running longer does not change the effect size; it changes the precision of your estimate of it. Because required sample size scales with the inverse square of the effect, confirming a 3% effect at 95% confidence takes roughly (10/3)² ≈ 11 times the sample you were planning for.
This matters organizationally because extending a test has costs. The developer who maintains the test in production. The potential negative effect on the small percentage of users in the variant. The opportunity cost of the slot on the test platform. And most importantly, the organizational attention cost: a test that has been running for 16 weeks with no clear result is a distraction, a conversation that never resolves, and a piece of technical debt in the codebase.
The cases where extending makes sense: when the test was cut short by a technical incident or a promotional period that distorted the data, and clean data accumulation started late. When the traffic estimate was correct but a tracking issue made the first several weeks of data unreliable. When the test is showing a strong directional signal (above 85% Bayesian probability) and has already accumulated 70-80% of its required sample; in that case, a modest extension is justified.
The cases where extending does not make sense: when the test has accumulated the full planned sample and the observed effect is materially below the planned MDE. This is not a test that needs more time. This is a test that needs a different design — a different page, a different audience, a more aggressive variant, or a different hypothesis entirely.
Key Takeaway: Extending a test that has run to full sample rarely produces a different result — the real effect is smaller than planned, and running longer confirms the size of a small effect rather than detecting a large one. Recognize when to redesign versus when to extend.
Sequential Testing and CUPED: Legitimate Methods to Boost Power Without P-Hacking
There are two legitimate statistical methods for improving the power of your tests without inflating your false positive rate, and both are underused in most CRO programs.
Sequential testing (using methods like alpha-spending or always-valid inference) allows you to check your results continuously during the test while maintaining a controlled false positive rate. Standard frequentist tests require you to commit to a sample size before launching and not peek at the results until the sample is reached; peeking and stopping early inflates your false positive rate. Sequential testing adjusts the significance threshold at each look to account for the repeated analyses, which gives you legitimate early stopping when an effect is very large.
In practice, sequential testing is most valuable for tests on high-volume pages where large effects might be detectable in days rather than weeks. It prevents you from running tests longer than necessary when a strong signal emerges quickly.
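To see why uncorrected peeking is the problem sequential methods solve, here is a small A/A simulation, illustrative only: both arms share the same true conversion rate, so every "significant" result is by construction a false positive.

```python
# Naive daily peeking on an A/A test: nominal alpha is 0.05, but checking
# an uncorrected z-test every day inflates the false positive rate badly.
import numpy as np
rng = np.random.default_rng(0)

def naive_peeking_fpr(n_sims=2_000, days=30, daily_n=500, p=0.05):
    false_pos = 0
    for _ in range(n_sims):
        a = rng.binomial(daily_n, p, size=days).cumsum()   # cumulative conversions, arm A
        b = rng.binomial(daily_n, p, size=days).cumsum()   # cumulative conversions, arm B
        n = daily_n * np.arange(1, days + 1)               # cumulative visitors per arm
        p_pool = (a + b) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        z = (a / n - b / n) / se
        if (np.abs(z) > 1.96).any():   # declared "significant" at any of the 30 peeks
            false_pos += 1
    return false_pos / n_sims

print(naive_peeking_fpr())   # ~0.25-0.30, far above the nominal 0.05
```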
CUPED (Controlled-Experiment Using Pre-Experiment Data) uses pre-experiment covariate data to reduce the variance in your treatment effect estimate. If you have historical behavioral data for each user — previous session behavior, prior conversion events, product category exposure — incorporating that data as a covariate reduces the noise in the experiment outcome measurement, effectively increasing the sensitivity of your test without increasing sample size.
CUPED is particularly powerful in contexts where there is substantial user-level variation in the baseline metric. In enrollment flow tests, where individual users vary enormously in their baseline intent and behavior, CUPED can produce 20-40% variance reduction, which translates to a proportional reduction in required sample size. A test that needed 50,000 visitors at baseline might need only 35,000 with CUPED.
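Here is a minimal CUPED sketch on synthetic data; the covariate, effect size, and noise levels are all invented for illustration, but the adjustment itself is the standard one: regress the outcome on the pre-experiment covariate and subtract the explained part.

```python
# CUPED on synthetic data: theta = cov(X, Y) / var(X), then
# Y_cuped = Y - theta * (X - mean(X)). Treatment assignment stays untouched.
import numpy as np
rng = np.random.default_rng(1)

n = 20_000
x = rng.gamma(2.0, 2.0, size=n)        # pre-experiment covariate (e.g. prior sessions)
treat = rng.integers(0, 2, size=n)     # random 50/50 assignment
y = 0.5 * x + 0.05 * treat + rng.normal(0, 1, size=n)   # outcome with a small true lift

theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

for label, metric in [("raw", y), ("CUPED", y_cuped)]:
    lift = metric[treat == 1].mean() - metric[treat == 0].mean()
    se = np.sqrt(metric[treat == 1].var(ddof=1) / (treat == 1).sum()
                 + metric[treat == 0].var(ddof=1) / (treat == 0).sum())
    print(f"{label:>5}: lift={lift:.3f}, se={se:.4f}")
# The CUPED standard error is markedly smaller at the same sample size,
# which is equivalent to needing fewer visitors for the same power.
```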
Neither sequential testing nor CUPED is a shortcut. Both require correct implementation and appropriate statistical assumptions. But both are legitimate alternatives to the "run it longer and hope" approach that most teams default to when tests are struggling.
Key Takeaway: Sequential testing and CUPED are the two legitimate statistical methods for improving test power without inflating false positive rates. Both require careful implementation but offer real efficiency gains — particularly CUPED in high-variance enrollment funnels.
The Six Directional Tests That Should Have Won
I want to close with a specific accounting of what the power failures in our audit actually cost.
Six tests in the dataset had Bayesian probabilities of 80% or higher that the variant was genuinely better than control. Under a Bayesian framework, these were not inconclusive — they were strongly directional signals that the tested change was probably working. Under the frequentist framework used to make go/no-go decisions, they were closed as inconclusive because they had not reached 95% confidence.
When I calculated what each of these tests needed to cross the significance threshold, the answer was uniformly modest: between 2x and 3x the traffic they had accumulated at the time they were closed. Not 10x. Not a year of additional runtime. Two to three times the sample.
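The check itself is cheap. Here is a minimal sketch of the retrospective Bayesian probability, with made-up conversion counts and uniform Beta(1,1) priors:

```python
# P(variant CR > control CR) from independent Beta posteriors, via sampling
import numpy as np
rng = np.random.default_rng(2)

def prob_variant_beats_control(conv_c, n_c, conv_v, n_v, draws=200_000):
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, size=draws)
    variant = rng.beta(1 + conv_v, 1 + n_v - conv_v, size=draws)
    return (variant > control).mean()

# Hypothetical closed-as-inconclusive test: 18.1% vs 20.1% on 2,000 per arm
print(prob_variant_beats_control(362, 2_000, 401, 2_000))   # ~0.94
```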
Some of those tests were never iterated. The hypotheses were quietly abandoned — not because the data indicated they were wrong, but because the test had been closed as inconclusive and there was no systematic process for flagging high-probability-but-underpowered tests for relaunch.
The aggregate value of those six tests, estimated conservatively from the directional effect sizes and the page traffic volumes involved, represents a meaningful block of conversions that either were not captured or were delayed by the additional time required to re-identify and retest the same hypothesis. This is the hidden cost of the power crisis — not just wasted test runtime, but the compounding cost of valid hypotheses that are permanently lost.
The fix is not complicated. A test closure process that includes a mandatory power review — checking whether the test had adequate power before attributing the null result to hypothesis failure — would have flagged all six. An automated system that calculates retrospective Bayesian probability on test close and surfaces high-probability-but-underpowered tests for relaunch consideration would have caught them automatically.
GrowthLayer flags exactly this scenario: when a test is closed without reaching significance, it calculates the retrospective Bayesian probability and the additional traffic required to reach significance, and surfaces tests above a configurable probability threshold for relaunch review. The goal is to prevent valid hypotheses from being silently abandoned because the original test was underpowered.
Conclusion
The statistical power problem in most experimentation programs is not a knowledge gap. Most CRO practitioners know that underpowered tests are a problem. It is a process gap — the work of building accurate pre-test calculations with actual page-specific traffic and correct baseline conversion rates is not institutionalized, and so it does not happen consistently.
The consequences compound over time. Each underpowered test that returns a null result creates a false belief that the hypothesis was wrong. Each directional test closed at 80%+ probability is a potentially winning change that gets abandoned. Each traffic estimate built from aggregate site data instead of page-level data is a miscalculation that sets the entire test up to fail before it launches.
The MDE-by-week table takes twenty minutes to build. The pre-test traffic validation requires one additional step in the planning process. The retrospective power calculation on test close adds five minutes to the readout. None of these are heavy lifts. All of them, consistently applied, would have saved months of wasted runtime and preserved six hypotheses that the data suggested were probably worth pursuing.
Run the calculation. Before the test launches.
If you want pre-test power calculations built into your test setup workflow, automatic flagging of underpowered tests before launch, and retrospective Bayesian probability on every closed test, GrowthLayer makes statistical rigor a default, not a discipline.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.