The A/B Test Design Checklist That Would Have Saved Us 60% of Our Failed Tests
After auditing dozens of enterprise A/B tests, I built a scoring rubric — then discovered it was flawed. Here's the honest checklist that actually catches failing test designs before they launch.
I want to tell you about a mistake I made — one that took me months to admit and that made me a better practitioner once I did.
After auditing dozens of enterprise A/B tests across a multi-brand energy company, I built a 1-to-10 scoring rubric. I called it a design quality score. I published early versions of the framework. I was proud of it.
Then I noticed the problem: I had assigned those scores after knowing the outcomes.
That is not a minor methodological footnote. It is a fundamental flaw. A score assigned post-hoc — even with good intentions, even using consistent criteria — reflects hindsight, not prediction. The strong correlation I saw between high scores and wins was, at least partially, circular. I was scoring tests highly because they had already worked.
The honest version of what I learned is less tidy than a predictive scoring rubric. But it is more useful. What survives a rigorous self-audit are five process checks — not scores — each one mapped to a specific documented failure mode from the enterprise dataset. Run these checks before any test launches, and you will catch the failure patterns that consumed the majority of our wasted test slots.
This is that checklist.
Why a Checklist, Not a Score
The appeal of a scoring rubric is that it promises a single number you can gate on. Score above 7, ship. Score below 7, revise. It feels scientific.
The problem is that a scoring rubric compresses five different types of failure into one dimension, which makes it harder to act on. When a test scores 5.5, you do not know which failure to fix. When a test fails check three, you know exactly what to address.
A checklist also resists the temptation to average out serious problems. A test that gets 9s on four dimensions and a 0 on one still "scores" reasonably on a rubric. On a checklist, it fails — because one critical structural problem cannot be offset by other dimensions being solid.
The five checks below are each binary: pass or fail. A test with any single failure should not launch. Not because the other four dimensions are unimportant, but because each of these five checks maps to a documented failure type that produces either a loss or an uninterpretable result.
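To make the gating logic concrete, here is a minimal sketch of the difference between a binary checklist gate and an averaged rubric gate. The check names mirror this article's five checks, but the scoring function and the example numbers are my own illustration, not output from the enterprise dataset.

```python
# Minimal sketch: a binary checklist gate versus an averaged rubric gate.
# Check names mirror the article's five checks; example numbers are hypothetical.
CHECKS = ["research_backed", "one_mechanism", "metric_alignment", "feasibility", "variant_qc"]

def checklist_gate(results):
    """Binary gate: a single failed check blocks the launch. No averaging."""
    return all(results[check] for check in CHECKS)

def rubric_gate(scores, threshold=7.0):
    """Rubric gate: one serious structural problem can hide inside a decent average."""
    return sum(scores[check] for check in CHECKS) / len(CHECKS) >= threshold

# A test that is strong on four dimensions but structurally infeasible on one:
scores  = {"research_backed": 9, "one_mechanism": 9, "metric_alignment": 9,
           "feasibility": 0, "variant_qc": 9}
results = {check: scores[check] > 0 for check in CHECKS}

print(rubric_gate(scores))      # True  -- the 7.2 average clears the bar
print(checklist_gate(results))  # False -- the feasibility failure blocks launch
```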
Key Takeaway: A pre-test checklist is more actionable than a predictive score. Each check maps to a specific failure mode. One failure = do not launch.
Check 1: Is the Hypothesis Research-Backed?
The failure this catches: Informed hypotheses won at approximately 48% in the dataset. Uninformed hypotheses won at approximately 7%. That is a meaningful gap — though I want to be honest about what it actually shows.
Research-backed tests also tend to be more carefully designed in other ways. Teams that do the research to ground a hypothesis also tend to think harder about metric selection, audience targeting, and feasibility. The research backing and the win rate are correlated, but the relationship is not purely causal — it is also a proxy for overall test rigor and investment.
A research-backed hypothesis traces directly to a documented user signal. Session recordings showing where users abandon. Form analytics showing which fields produce hesitation. Exit survey responses identifying specific concerns. Customer service transcripts revealing recurring questions.
The check: Can you identify the specific documented user signal that this hypothesis addresses? If not, go back to the research.
Key Takeaway: Research-backed hypotheses win at more than six times the rate of uninformed ones. Require documentation of the signal before a test enters the queue.
Check 2: Do All Changes Serve One Behavioral Mechanism?
This is the check that challenges the most widely held convention in CRO, so I want to be precise about what it says and what it does not say.
What it does not say: "Test one variable at a time, always, no exceptions."
What it says: Every change in the variant must serve the same behavioral mechanism.
Three of the most impactful winners in the enterprise program changed five or more elements simultaneously. One confirmation page redesign that produced over 200% lift changed the layout, the content structure, the language, the visual hierarchy, and the next-steps guidance. Every single change served one mechanism: guide users to complete post-enrollment tasks and reduce buyer’s remorse. Five changes, one mechanism. It won decisively.
Contrast this with a homepage test that bundled two changes — new hero content and new routing architecture. The content change was positive when isolated later. The routing change created downstream friction. The two changes served different mechanisms. When combined, the routing problem neutralized the content benefit.
Key Takeaway: Mechanism coherence, not variable count, is what predicts whether multi-variable tests succeed.
Check 3: Is the Primary Metric the First Downstream Action in the Exposed Population?
The primary metric must be the first measurable action directly downstream of the change, measured only in the population actually exposed to the change. Every word in that rule is load-bearing.
The check: Map from the change to the behavior it affects, then to the population it affects. Define the primary metric as the first thing that population can do in response to that change.
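To see why the exposed-population clause matters, here is a toy illustration of metric dilution. The session counts and rates are assumptions chosen for easy arithmetic, not figures from the dataset: a change that produces a real lift among exposed visitors looks far smaller when the metric is measured across everyone bucketed into the variant.

```python
# Toy illustration of metric dilution (numbers are assumptions, not from the article).
# The change only affects visitors who reach a later step, but the metric is
# measured over every session assigned to the variant.
exposed_sessions = 2_000          # sessions that actually saw the variant change
unexposed_sessions = 18_000       # variant-arm sessions that never reached it
baseline_cr = 0.05                # conversion rate with no change
true_lift_in_exposed = 0.10       # +10% relative lift among exposed sessions only

exposed_cr = baseline_cr * (1 + true_lift_in_exposed)

# Metric defined on the exposed population: the full +10% lift is visible.
lift_exposed_only = exposed_cr / baseline_cr - 1

# Metric measured across everyone in the variant arm: the lift is diluted by
# 18,000 sessions whose behavior the change could not have influenced.
all_sessions = exposed_sessions + unexposed_sessions
diluted_cr = (exposed_sessions * exposed_cr + unexposed_sessions * baseline_cr) / all_sessions
lift_diluted = diluted_cr / baseline_cr - 1

print(f"lift in exposed population: {lift_exposed_only:.1%}")   # 10.0%
print(f"lift measured over everyone: {lift_diluted:.1%}")       # 1.0%
```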
Key Takeaway: Metric dilution is the silent test killer. Always define your exposed population before you define your primary metric.
Check 4: Is Traffic Times MDE Realistic?
In the dataset, several tests were structurally impossible to conclude within any reasonable timeframe. One page received approximately 37 daily visitors. Detecting the smallest effect the team would act on, at 95% confidence, required over 300 days of runtime. The test was scheduled for eight weeks.
The check: Run the sample size calculation with the actual baseline conversion rate, the smallest effect you would act on, and the realistic weekly traffic. If required runtime exceeds eight weeks, revisit the test design.
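As a sketch of that calculation, here is a standard two-proportion sample size approximation turned into a runtime estimate. The baseline rate and MDE in the example are illustrative assumptions, not the figures behind the 300-day case above.

```python
from math import ceil
from statistics import NormalDist

def required_runtime_days(baseline_cr, mde_relative, daily_visitors,
                          alpha=0.05, power=0.80, n_variants=2):
    """Days needed to reach the required sample size for a two-sided,
    two-proportion z-test, given the page's total daily traffic."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)          # smallest effect you would act on
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    # Standard approximation for the required sample size per variant
    n_per_variant = ((z_alpha + z_beta) ** 2 *
                     (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2
    return ceil(n_per_variant * n_variants / daily_visitors)

# The low-traffic page from the dataset saw roughly 37 visitors a day.
# Baseline rate and MDE below are assumptions, not the article's figures.
days = required_runtime_days(baseline_cr=0.04, mde_relative=0.20, daily_visitors=37)
print(f"Required runtime: ~{days} days")   # far beyond an eight-week slot
```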
Key Takeaway: The most expensive test in your program is not the one that loses — it is the one that cannot conclude.
Check 5: Has the Variant Been QC’d Against the Hypothesis?
Implementation errors are invisible in outcome data. A test can fail because the hypothesis was wrong, the metric was misconfigured, the traffic was insufficient — or because the variant simply did not work as intended. Without a hypothesis-aligned QC pass, you will never know which one it was.
The check: Before launch, sit down with the hypothesis statement and the implemented variant side by side. For every element that differs from control, confirm it is visible, functioning, and targeted correctly.
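One lightweight way to make that pass auditable is to record it as a per-element table. The structure below is my own sketch, not a tool the program prescribed, and the element names are hypothetical.

```python
# Hypothetical QC record: one row per element that differs from control,
# each checked against the hypothesis before launch.
qc_rows = [
    {"element": "next-steps module",      "visible": True, "functions": True, "targeted": True},
    {"element": "confirmation page copy", "visible": True, "functions": True, "targeted": True},
    {"element": "reordered task list",    "visible": True, "functions": False, "targeted": True},
]

blocking = [row["element"] for row in qc_rows
            if not (row["visible"] and row["functions"] and row["targeted"])]

if blocking:
    print("Do not launch. Failed QC on: " + ", ".join(blocking))
else:
    print("QC pass complete: every changed element matches the hypothesis.")
```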
Key Takeaway: Implementation errors produce misleading results that can block valid future hypotheses. A hypothesis-aligned QC pass is the last gate before traffic is exposed to the variant.
Putting the Checklist Together
The five checks form a sequential review. Run them in order, because each builds on the previous. You start with research backing because it determines whether there is a real mechanism to test. You move to mechanism coherence because a coherent mechanism is necessary before you can evaluate whether the changes serve it. You address metric selection because the metric must match the mechanism and the population. You run the feasibility calculation because a well-designed test on insufficient traffic is still a wasted slot. And you close with variant QC because even a perfectly designed test can be undone by an implementation gap.
A test that passes all five is not guaranteed to win. What the checklist guarantees is that when a test loses, you have enough structural integrity to learn from the result and iterate.
Conclusion
The scoring rubric I originally built was a mistake, and the right response was to learn from it rather than keep publishing it as truth. What is reliable are process checks grounded in documented failure modes. Research backing, mechanism coherence, metric alignment, feasibility, and hypothesis-aligned QC each address a specific, recoverable problem. Together they catch the structural failures that accounted for the large majority of tests that produced no usable result.
A checklist that prevents those failures is worth more than a score that predicts outcomes you have already seen.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.