
The MDE-by-Week Table: The Single Most Useful Pre-Test Artifact Nobody Makes

The MDE-by-week table shows exactly how detectable your test will be at 1, 2, 4, 8 weeks — the single artifact that prevents underpowered experiments before a line of code is written.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
10 min read



One test in our program needed nearly a year to reach statistical significance at its planned effect size.

Not because it was a difficult test. Not because the traffic was unusually volatile. Because when the test was greenlit, someone had looked at a single sample size number — "we need 22,000 visitors per variant" — noted that the page received some traffic, and concluded the test was feasible. Nobody had translated that visitor count into a runtime estimate. Nobody had asked how long 22,000 visitors per variant would actually take to accumulate.

The answer was eleven months. That is how long it took before somebody finally ran the math.

By the time I joined the program and audited its backlog, the test had been running for seven months with no end in sight. The hypothesis was reasonable, the implementation was clean, and the test would never produce a conclusive result in any timeframe that mattered to the business. We stopped it, wrote off the development investment, and moved on.

The thing that would have prevented it is not a more sophisticated statistical method. It is a two-column table that takes about fifteen minutes to build.

What a Single Sample Size Number Hides

Every sample size calculator gives you one number. You put in your baseline conversion rate, your expected effect size, your desired significance level, and your desired power, and the calculator returns: you need X visitors per variant.

That number is not wrong. It is just incomplete in a way that causes real damage.

The number tells you the threshold. It does not tell you when you will cross it. It gives you the destination without telling you how long the drive takes. And at different traffic levels — which vary week to week, page to page, and season to season — the same required sample size implies wildly different runtimes.

A test requiring 20,000 visitors per variant (40,000 total with a 50/50 split, so the runtime in weeks is simply 40,000 divided by weekly page traffic) reaches that threshold in roughly:

  • 4 weeks on a page receiving 10,000 visitors per week
  • 10 weeks on a page receiving 4,000 visitors per week
  • 20 weeks on a page receiving 2,000 visitors per week
  • Nearly a full year on a page receiving 900 visitors per week

Each of those tests has identical power characteristics. They require the same sample. But only the first one is practically runnable inside a normal planning horizon. The fourth one is a trap.

A single sample size number obscures this. An MDE-by-week table makes it impossible to miss.

What the MDE-by-Week Table Is

An MDE-by-week table inverts the standard sample size calculation. Instead of asking "how many visitors do I need to detect effect size X?", it asks "what is the smallest effect I can detect after W weeks, given my actual weekly traffic?"

You build the table before the test starts. The rows are weeks (week 1, week 2, week 3, week 4, week 6, week 8, week 12), and the key output column is the smallest effect you can detect at the sample you will have accumulated by that week. You are essentially reading the feasibility landscape for your test across its full possible runtime.

Here is an example. A checkout confirmation page receives approximately 1,800 visitors per week. The baseline completion rate for the primary metric is 8%. You are testing a redesigned trust element. The MDEs below assume a 50/50 split, 95% confidence, and 90% power, the same parameters used in the formula section that follows.

Week | Visitors/Variant | MDE (Absolute) | MDE (Relative)
1 | 900 | 4.1% | 51%
2 | 1,800 | 2.9% | 36%
3 | 2,700 | 2.4% | 30%
4 | 3,600 | 2.1% | 26%
6 | 5,400 | 1.7% | 21%
8 | 7,200 | 1.5% | 18%
12 | 10,800 | 1.2% | 15%
16 | 14,400 | 1.0% | 13%

What does this tell you? It tells you that at four weeks, a common default test duration, you can only detect effects of 26% relative or larger. If your hypothesis predicts a 10% relative improvement in the primary metric, this test cannot confirm or deny that hypothesis in four weeks. It cannot confirm or deny it in twelve weeks, when the MDE is still 15% relative. Extending the table, a 10% relative lift (0.8 points absolute) requires roughly 24,000 visitors per variant, which is more than six months of runtime at this traffic level.

That is a test you should not run. Not because the hypothesis is bad, but because the traffic does not support the measurement. The table tells you that before you spend a week in development.

How to Build the Table

The math uses the standard two-sample proportions power formula. For 90% power at a 95% confidence level, the required sample size per variant for an absolute effect of delta is:

n = (2 * p * (1 - p) * (z_alpha/2 + z_beta)^2) / delta^2

Where:

  • p = baseline conversion rate
  • z_alpha/2 = 1.96 (for 95% confidence, two-tailed)
  • z_beta = 1.28 (for 90% power)
  • delta = the absolute effect you want to detect

The combined constant (1.96 + 1.28)^2 ≈ 10.5. So the formula simplifies to:

n = (2 * p * (1 - p) * 10.5) / delta^2

To build the MDE-by-week table, you flip this: given n (your accumulated sample after W weeks), solve for delta:

delta = sqrt((2 * p * (1 - p) * 10.5) / n)

Where n = (weekly visitors per variant) * W.
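To make that concrete with the checkout example above: at week 4, n = 900 * 4 = 3,600, so delta = sqrt((2 * 0.08 * 0.92 * 10.5) / 3600) ≈ 0.021, which is the 2.1% absolute MDE in the table.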

For a spreadsheet implementation, you need three inputs at the top: baseline conversion rate, total weekly visitors to the test page, and split ratio (usually 50/50, so visitors per variant = weekly visitors / 2). Then a row for each week, with n = (weekly visitors per variant) * week number, and the MDE formula applied to that n.

The relative MDE is simply (absolute MDE / baseline conversion rate) * 100.

That is the whole table. It takes about fifteen minutes to set up in a spreadsheet and five minutes to populate for any new test.
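If you prefer to sanity-check the spreadsheet in code, here is a minimal Python sketch of the same calculation. The inputs mirror the checkout example above; the variable names and week list are mine, and the relative column can differ from the table by a point of rounding.

from math import sqrt

baseline = 0.08             # baseline conversion rate (8%)
weekly_visitors = 1800      # total weekly visitors to the test page
split = 0.5                 # 50/50 split
per_variant_weekly = weekly_visitors * split   # 900 visitors per variant per week

for week in (1, 2, 3, 4, 6, 8, 12, 16):
    n = per_variant_weekly * week
    # delta = sqrt((2 * p * (1 - p) * 10.5) / n): 95% confidence, 90% power
    abs_mde = sqrt((2 * baseline * (1 - baseline) * 10.5) / n)
    rel_mde = abs_mde / baseline
    print(f"Week {week:>2}: {int(n):>6,} per variant | "
          f"MDE {abs_mde:.1%} absolute, {rel_mde:.0%} relative")

Change the three inputs at the top and the loop reproduces the table for any page.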

Why Optimism Sets the MDE, Not Math

The second failure mode — which I saw repeatedly in the early stages of our program — is not a traffic problem. It is a psychology problem. Teams set the MDE based on what they want to find, not what they can realistically detect or what the change is likely to produce.

This produces what I think of as the "optimism MDE." The team wants to detect a 5% relative improvement because that seems like a meaningful result. So they set the MDE at 5%. The sample size calculator gives them a required sample. They check the traffic, see that the page could technically reach that sample size in, say, nine weeks, and greenlight the test.

But there is a step they skipped: checking whether a 5% relative improvement is a plausible outcome for the change being tested. If you are testing a button color variation, a 5% relative lift in purchase completion is almost certainly not what you are going to get. The change is too minor. The realistic effect, if there is one, is much smaller: perhaps 1% to 2% relative. And at those effect sizes, the test is wildly underpowered.

The MDE-by-week table forces this conversation into the open. When you look at the table and see that at four weeks you can detect effects of 26% relative or larger, someone on the team has to answer: "Is this change plausibly going to produce a 26% relative improvement?" If the honest answer is no, the test needs to be redesigned or abandoned.

In one case from our program, a test on a high-value enrollment form had been planned to run for six weeks with an MDE of 8% relative. The MDE-by-week table showed that six weeks of traffic on that page would support detection of effects around 14% relative — not 8%. The team had set the MDE at 8% because that was their internal threshold for "a meaningful lift," without checking whether the traffic supported detection at that level.

The test was redesigned. Instead of testing a minor copy change on the existing form, the team redesigned the entire first step of the enrollment flow — a change with a plausible mechanism for producing a larger effect. That test ran for eight weeks and produced a statistically significant result.

The table did not create the insight. It created the forcing function: the need to match hypothesis ambition to detection capability.

The "Runtime Flag" Column

One enhancement that makes the table more actionable is adding a runtime flag column. After computing the MDE for each week, add a column that marks each week with one of three labels:

Too early: The MDE is so large that only implausibly big effects could be detected. No decision should be made, even if results look decisive. For most tests, this is weeks 1 and 2.

Useful range: The MDE is in the range of plausible effects for the change being tested. The test is accumulating meaningful data. For a well-designed test with a realistic effect size hypothesis, this is the operating window.

Diminishing returns: The test has been running long enough that additional runtime produces only marginal MDE improvement. Further runtime is unlikely to change the conclusion. If significance has not been reached by this point, the test is probably detecting a null result.

The flag column translates the math into a decision protocol. At the start of each week, anyone on the team can look at the table and know whether the test is currently in a window where a result means something.

This sounds obvious, but most teams check their test results the moment they feel ready — which is often in the "too early" zone — and make decisions based on directional trends that the sample size cannot support. The runtime flag column makes the "too early" period explicit and defensible.

The Test That Should Have Been Flagged

The eleven-month test I described at the start was never going to work at the traffic levels on that page. If someone had built the MDE-by-week table before the test launched, it would have shown the "useful range" not beginning until somewhere around month six, and even then only for implausibly large effects: the traffic was so low that after a full year of runtime, the MDE would still have been far larger than any realistic effect from the change being tested.

The table would have shown: at four weeks, you can detect effects of 60% relative or larger. At twelve weeks, 35% relative or larger. At twenty-six weeks, 23% relative or larger. At fifty-two weeks, 16% relative or larger.

Nobody designing a test on a low-traffic informational page would honestly claim to expect a 16% relative lift in a completion metric from the change being tested. The table would have made that expectation untenable in the kickoff meeting.

The development resources spent building that test could have been deployed to a page with adequate traffic for the hypothesis. The test slot — occupied for eleven months — could have been used for a test that produced a decision. The MDE-by-week table is not just a statistical artifact. It is a resource allocation tool.

A Template You Can Copy

Here is the structure for a reusable MDE-by-week table. Build it in a spreadsheet with three input cells at the top:

Input cells:

  • Baseline conversion rate (e.g., 0.04 for 4%)
  • Weekly visitors to test (total, before split)
  • Split ratio (enter 0.5 for 50/50)

Derived cell: Weekly visitors per variant = Weekly visitors * Split ratio

Table rows (one per time period):

Column | Formula
Week | 1, 2, 3, 4, 6, 8, 10, 12, 16, 20
Sample per variant | Week * Weekly visitors per variant
Absolute MDE | SQRT((2 * baseline * (1 - baseline) * 10.5) / sample)
Relative MDE | Absolute MDE / baseline
Runtime flag | If relative MDE > 40%: "Too early"; if relative MDE < 8%: "Diminishing returns"; else: "Useful range"

Adjust the thresholds in the runtime flag column based on the typical effect size range for your program. In our program, we use 30% as the "too early" threshold and 5% as the "diminishing returns" threshold, calibrated to our historical win sizes.
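As a concrete sketch of that last column, assuming the relative MDE sits in column D with the first data row in row 2, the spreadsheet flag formula could read:

=IF(D2 > 0.4, "Too early", IF(D2 < 0.08, "Diminishing returns", "Useful range"))

Swap 0.4 and 0.08 for whatever thresholds your program's history supports.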

I track this for every test in our program through GrowthLayer, which stores the pre-test feasibility data alongside the live results — so I can see, in a single view, whether the test is currently in a window where results are interpretable.

What Changes When You Have This Table

The MDE-by-week table changes how pre-test conversations happen. Instead of "do we have enough traffic?", the question becomes "what effects can we detect, and are those effects realistic for this change?"

That is a fundamentally more useful question. It focuses the team on the plausibility of the hypothesis rather than the mechanics of the setup. It catches both the traffic problems (pages where no realistic effect size is detectable in any reasonable window) and the optimism problems (tests where the MDE is technically reachable but only by detecting effects that the change cannot plausibly produce).

And it creates a shared language for post-test analysis. When a test ends without a statistically significant result, the MDE-by-week table lets you distinguish between two very different outcomes: the test ran in the useful range and detected no effect (the hypothesis was probably wrong), versus the test ran below the useful range for most of its life and the result is simply uninformative (the test was structurally incapable of answering the question).

Those are different conclusions. They warrant different responses. The table makes the distinction visible.

Getting Started Today

Building an MDE-by-week table for your next test requires one spreadsheet, three inputs, and about fifteen minutes. If you want to avoid the eleven-month test, the six-week test that produces noise, and the development investment that gets written off as inconclusive, that is the place to start.

For teams running multiple concurrent tests, the table also highlights portfolio-level issues: if most of your tests are on low-traffic pages with high MDEs, your program's win rate will be structurally limited regardless of hypothesis quality. The table aggregates to a diagnostic.

If you want a head start, GrowthLayer includes MDE-by-week projections as part of the pre-test setup flow — so every test in your pipeline has a runtime forecast before it launches.

Run the table before you run the test. Your future self will be grateful.

Build your own MDE-by-week table

Generate the table for your own page with the free MDE Calculator and Sample Size Calculator. See all 12 free A/B testing calculators.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
