
Why the Vast Majority of Our Tests Had Negligible Effect Sizes (And What That Means for Your Testing Roadmap)

When we measured Cohen's h across our full test portfolio, over 90% of tests produced negligible effect sizes. Here is what that reveals about which tests are worth running — and which never had a chance.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
10 min read


Fortune 150 experimentation lead · 100+ experiments / year · Creator of the PRISM Method
A/B Testing · Experimentation Strategy · Statistical Methods · CRO Methodology · Experimentation at Scale

When I ran a Cohen's h analysis across our program's full test history, I expected to find a mixed distribution. Some negligible effects, some small effects, a handful of medium effects, maybe one or two large effects. The kind of spread you would expect from a mature program testing a range of hypothesis types and change magnitudes.

What I found was more stark: over 90% of completed tests produced Cohen's h values below 0.2. Below 0.2 is conventionally classified as a negligible effect size — not merely small, but in the range where the measured difference may not be practically meaningful regardless of statistical significance.

That finding changed how I think about test selection, how I communicate with stakeholders about what testing can and cannot deliver, and what a roadmap should look like if you want to produce results that are actually detectable with available traffic.

What Cohen's h Measures and Why It Matters

Cohen's h is an effect size measure for differences between two proportions. It is the appropriate standardized effect size metric for A/B tests on conversion rate metrics — the most common metric type in CRO.

The formula is:

h = 2 * arcsin(sqrt(p1)) - 2 * arcsin(sqrt(p2))

Where p1 and p2 are the two proportions being compared. The arcsin transformation accounts for the fact that a 1-percentage-point difference at 2% baseline conversion represents a much larger practical change than a 1-percentage-point difference at 50% baseline conversion.
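
As a quick illustration — a minimal Python sketch, with example numbers chosen for illustration rather than taken from our data — the computation is a few lines, and it shows how the same one-point lift translates into very different h values at different baselines:

import math

def cohens_h(p1: float, p2: float) -> float:
    # Cohen's h: the gap between two proportions on the arcsine scale.
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# The same 1-percentage-point lift is a much larger standardized effect
# at a low baseline than at a high one.
print(round(cohens_h(0.02, 0.03), 3))  # ~0.064 at a 2% baseline
print(round(cohens_h(0.50, 0.51), 3))  # ~0.020 at a 50% baseline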

Cohen's h benchmarks:

  • Small effect: h ≈ 0.2
  • Medium effect: h ≈ 0.5
  • Large effect: h ≈ 0.8

Below 0.2 is negligible. Above 0.8 is large by the standards of controlled human behavior research — the kind of effect size associated with major structural interventions, not incremental design changes.

Why does this matter for testing? Because the sample size required to detect an effect scales inversely with the square of the effect size. A test targeting a medium effect (h = 0.5) requires approximately six times fewer visitors per variant than a test targeting a small effect (h = 0.2). A test targeting a negligible effect — which is what the vast majority of our tests turned out to be measuring — requires sample sizes that are difficult or impossible to achieve on most pages within any reasonable planning horizon.
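
The inverse-square relationship is worth making concrete. Because the required sample per variant scales with 1 / h^2, the ratio between two target effect sizes follows directly (a trivial sketch, included only to show the arithmetic):

# Sample size per variant scales with 1 / h**2, so a test powered for a
# small effect (h = 0.2) needs (0.5 / 0.2) ** 2 = 6.25x more visitors
# per variant than one powered for a medium effect (h = 0.5).
print((0.5 / 0.2) ** 2)  # 6.25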

If over 90% of your tests produce negligible effect sizes, and you are running those tests on pages with typical enterprise traffic levels, you have a structural power problem that will persist regardless of how sophisticated your statistical methods are.

The Pattern in What Produced Large Effects

Before I describe what did not work, it is worth describing what did. The tests that produced medium or large Cohen's h values — the roughly 8% of tests that produced effects large enough to be clearly detectable — shared a set of characteristics.

They changed something structural, not cosmetic. The tests that produced the largest effects were not CTA copy tweaks, color changes, or minor layout adjustments. They were changes to the fundamental structure of the page or flow: a redesigned enrollment step that removed fields and restructured the layout, a confirmation page redesign that changed the information hierarchy, a flow change that combined two steps into one.

In one case, a redesign of a mid-funnel confirmation page produced an effect in the medium range. The page had previously shown several pieces of information in a sequence that confused visitors about what they had and had not committed to. The redesign clarified the information hierarchy and removed the ambiguity. The effect on completion rate at that step was large enough that it was detectable within the first two weeks of the test.

They targeted friction with a clear mechanism. The winning tests had specific hypotheses about a mechanism of action: this element confuses visitors about X, removing it will reduce abandonment. This layout buries the primary action, moving it above the fold will increase engagement. The hypothesis was not "this looks better" or "this is cleaner." It was "this specific thing is causing this specific behavior, and removing it will produce this specific outcome."

They operated on high-decision-weight moments. The largest effects came from changes at steps in the flow where visitors were actively deciding whether to continue. Changes at low-stakes moments — informational pages, confirmation screens after commitment — produced smaller effects. Changes at the moment of decision — enrollment step one, checkout review page, upgrade prompt — produced larger effects.

They tested meaningful changes to the choice architecture. A change that gives visitors a clearer sense of what they are getting, reduces cognitive load at a decision moment, or removes a source of doubt will produce a larger effect than a change that shuffles the presentation of information visitors already have.

What Produced Negligible Effects

The over-90% that produced negligible Cohen's h values fell into recognizable categories.

CTA copy variations. In our program, tests on CTA copy — button text, headline variants, subheadline changes — almost universally produced negligible effect sizes. Not no effect, but effects too small to be detected with the available traffic on most pages. Some copy tests did show directional trends, but the Cohen's h values were below 0.1 in the majority of cases.

This does not mean copy does not matter. It means that the marginal difference between two reasonable copy variants is typically small. If you are testing "Get Started" against "Start Your Free Trial," you are measuring the difference between two adequate options, not the difference between a bad option and a good one. The ceiling on the effect is constrained by the similarity of the variants.

Color and visual treatment changes. Button color tests, background changes, typography variations — these produced some of the smallest effect sizes in the portfolio. The pattern was consistent: when a test is isolated to a visual treatment that does not change the information, the choice architecture, or the clarity of the decision, the effect is negligible.

Repositioning elements without changing them. Tests that moved a trust element, a testimonial, or a feature bullet to a different location on the page — without changing the content — produced small to negligible effects. The information was present in both control and variant. The question was only whether its placement made visitors more likely to act. In most cases, the answer was: minimally.

Incremental form field changes. Adding or removing a single optional field on a form produced smaller effects than removing multiple required fields or restructuring the field order to match the visitor's mental model of the sequence. Incremental changes accumulate, but the individual experiment measuring each increment is typically underpowered for the effect it is trying to detect.

The Structural Implication

If over 90% of your tests are producing negligible effect sizes, and your traffic is typical for a mid-to-large enterprise program — hundreds of thousands of monthly visitors on your highest-traffic pages, tens of thousands on your typical test pages — then the majority of your tests are structurally underpowered.

Here is the math. With 80% power at 95% confidence (two-sided), the required sample per variant works out to roughly 15.7 / h^2. To detect a negligible effect at h = 0.15, that is approximately 700 visitors per variant. To detect an effect at h = 0.05, it is approximately 6,300 visitors per variant.

If your test page receives 2,000 visitors per week and you split 50/50, you accumulate 1,000 visitors per variant per week. At h = 0.15, you need less than a week of runtime. At h = 0.05, you need about six weeks. And for the smallest effects in the portfolio — the CTA copy tests running well below h = 0.05 — the required runtime stretches to many months, or more than a year, on a typical page.
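
If you want to reproduce this arithmetic, here is a minimal sketch of the standard normal-approximation sample-size formula for Cohen's h; the 1,000 visitors per variant per week figure echoes the example above and is not a universal constant:

from math import ceil
from statistics import NormalDist

def n_per_variant(h: float, alpha: float = 0.05, power: float = 0.80) -> int:
    # Visitors needed in each variant to detect Cohen's h with a
    # two-sided test at the given alpha and power.
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * z ** 2 / h ** 2)

weekly_per_variant = 1_000  # 2,000 weekly visitors split 50/50
for h in (0.2, 0.15, 0.05, 0.02):
    n = n_per_variant(h)
    print(f"h = {h}: {n} per variant, ~{n / weekly_per_variant:.1f} weeks")
# h = 0.2: 393 per variant, ~0.4 weeks
# h = 0.15: 698 per variant, ~0.7 weeks
# h = 0.05: 6280 per variant, ~6.3 weeks
# h = 0.02: 39245 per variant, ~39.2 weeks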

This is why I argue that the question "is our win rate high enough?" is partially the wrong question. The right question is: "are our tests capable of detecting the effects they are designed to detect?" For most programs, a substantial fraction of tests cannot answer that question — not because the hypothesis was wrong, but because the effect was too small to measure.

The implication for test selection: if you are going to test on lower-traffic pages, test big changes. If you are going to test incremental changes, test them on your highest-traffic pages. And if you cannot construct a high-traffic page or a big change hypothesis, reconsider whether the test belongs in the queue at all.

What This Means for Your Roadmap

The Cohen's h analysis changed how I approach roadmap construction in several specific ways.

Explicit effect size estimation as a pre-test requirement. Before a test enters the queue, I now require an estimated effect size range, not just a traffic and MDE calculation. The MDE calculation tells you whether the test is powered for an assumed effect. The effect size estimate forces the question: is the assumed effect realistic for this type of change on this type of page?

A test that assumes h = 0.3 and relies on that assumption to show adequate power needs to defend that assumption. If it is a CTA copy test, h = 0.3 is not realistic based on historical data. If it is a structural flow redesign, h = 0.3 might be plausible. The estimate does not need to be precise — it needs to be honest.

Calibrating ambition to traffic. Low-traffic pages should not be used to test incremental changes. If a page receives fewer than a few thousand weekly visitors, the only tests worth running are changes large enough to produce at least a small effect (h > 0.2). Anything smaller cannot be detected within a practical runtime. The test slot should either be used for a genuinely large change or left for a higher-traffic page.

Separating "learn" tests from "win" tests. Some tests are designed to learn something specific about visitor behavior, even if the expected effect is too small to detect reliably. These are fundamentally different from tests designed to win — to ship a variant that measurably improves a metric. Confusing the two categories creates frustration. A test designed to learn should be assessed on whether it generated insight, not on whether it reached significance.

Prioritizing structural tests on high-leverage pages. The tests that produced medium and large effects were structural changes on high-decision-weight moments. That combination — structural change, high-decision-weight moment — is where large effects live. It is also where the effort of designing and building the test is highest. The roadmap should front-load these tests, not defer them in favor of quick copy and color experiments that fill the queue but rarely move metrics.

The Counterintuitive Conclusion

The standard advice in experimentation is to test everything: the more tests you run, the more you learn, and small wins compound into large improvements over time. That advice is not wrong, but it is incomplete in a way that leads programs astray.

Testing small changes on low-traffic pages is not a neutral activity. It consumes development resources that could be building a structural test on a high-traffic page. It occupies test slots in the queue. It produces inconclusive results that erode stakeholder confidence in the value of testing. It generates a misleadingly low win rate that makes the program look ineffective, when the real problem is that it is running tests that cannot win.

The productive version of "test everything" is: test everything that has a realistic path to a detectable result. For the majority of incremental changes, on the majority of non-high-traffic pages, that path does not exist.

When I reviewed our program's test history through GrowthLayer and sorted tests by effect size alongside their traffic and runtime data, the pattern was immediately visible: the smallest effects came from the tests with the widest confidence intervals and the longest runtimes. The largest effects came from the tests that reached significance fastest. The roadmap should have been loaded with the latter and light on the former.

A Practical Filter for Your Next Roadmap Review

When reviewing your test backlog, apply this filter to each candidate (a short code sketch of the arithmetic follows the list):

1. Classify the change type. Structural (flow, layout, decision architecture) or cosmetic (copy, color, element reorder). Structural changes have a plausible path to medium effects. Cosmetic changes typically do not.

2. Estimate the realistic Cohen's h range. Based on what type of change this is and what your program history shows for similar changes, what is a realistic h range? For CTA copy, h < 0.15 is typical. For structural redesigns, h = 0.2 to 0.5 is plausible.

3. Check available weekly traffic. At the estimated effect size, how many weeks would the test need to run? If the answer is more than eight weeks, scrutinize hard. If the answer is more than sixteen weeks, consider whether this test belongs in the queue at all.

4. If the test does not pass the filter, make the change bigger. If a test idea does not pass — because the estimated effect is too small for the available traffic — ask whether you can combine it with other changes into a bolder hypothesis that would produce a larger effect. Some tests that fail the filter individually become viable as part of a multi-element redesign.
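
Here is a minimal sketch of the arithmetic behind steps 2 and 3. The week thresholds come from the filter above; the example effect sizes and the 1,000-weekly-visitor page are illustrative assumptions, not fixed rules:

from math import ceil
from statistics import NormalDist

def required_weeks(h: float, weekly_visitors: int,
                   alpha: float = 0.05, power: float = 0.80) -> float:
    # Weeks of runtime needed to detect Cohen's h on a page with this much
    # traffic, assuming a 50/50 split between control and variant.
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_per_variant = ceil(2 * z ** 2 / h ** 2)
    return n_per_variant / (weekly_visitors / 2)

def verdict(weeks: float) -> str:
    if weeks <= 8:
        return "run it"
    if weeks <= 16:
        return "scrutinize hard"
    return "rework into a bolder hypothesis or drop"

# Example: a typical CTA copy test vs. a structural redesign on a page
# with 1,000 weekly visitors.
for label, h in [("copy tweak (h = 0.05)", 0.05), ("structural redesign (h = 0.3)", 0.30)]:
    weeks = required_weeks(h, weekly_visitors=1_000)
    print(f"{label}: ~{weeks:.1f} weeks -> {verdict(weeks)}")
# copy tweak (h = 0.05): ~12.6 weeks -> scrutinize hard
# structural redesign (h = 0.3): ~0.3 weeks -> run it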

The Bottom Line

The finding that the vast majority of our tests had negligible effect sizes is not a failure of our program. It is a property of CRO: most incremental changes produce small effects, and the testing infrastructure we build to measure those effects is often not powerful enough to detect them.

The response is not to stop testing. It is to test smarter. Reserve the test queue for changes with a realistic path to detectable effects. Test structural changes on high-traffic pages. Be honest about effect size when estimating feasibility. And build the kind of test history — logged, tagged, and analyzable by effect size — that allows you to calibrate future estimates against past results.

That calibration is where programs get better over time. Not from running more tests, but from running tests that can teach you something.

If you want to understand the effect size distribution of your own program — and start building a roadmap that targets the changes most likely to produce detectable results — GrowthLayer gives you the infrastructure to log, tag, and analyze your test portfolio over time.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
