The Hypothesis Intake Standard: How to Stop Burning Test Cells on Unfalsifiable Submissions
A submitted A/B test hypothesis is the cheapest place to fix a bad test. Once Design starts building, the cost goes up. Once the test ships, the cost compounds — every week the variant runs is a week the surface is not being tested for something else.
Most CRO teams accept whatever shows up in the intake field. Free-text paragraph, vague IF clause, no data signal, no power calculation. The team builds it because the requester is a stakeholder, the test runs, and the result comes back uninterpretable. Six months later, the test library has fifteen tests tagged "checkout clarity" with no record of what was actually changed in each one.
This is the intake standard I run my own program against. It is built around three diagnostics, three pushback scripts, and one required pre-flight check. It is opinionated, and it should be — the intake form is the highest-leverage process surface a CRO manager controls.
The Three Intake Diagnostics
Every submitted hypothesis goes through three checks before it enters the build queue. If any check fails, the request gets rerouted, not rejected.
Diagnostic 1: Origin — where did this hypothesis come from?
This is the question most intake forms do not ask, and most weak tests fail it.
There are three good origins:
- A funnel signal. Drop-off at a specific step is materially higher than at adjacent steps, or has degraded over time.
- A qualitative signal. Session replays, support tickets, moderated study findings, or sales-call transcripts that cluster around a specific confusion or friction.
- A failed prior test. A losing variant from a previous test that suggested a sharper hypothesis worth testing next.
There are two common bad origins:
- A taste signal. "I think the dashboard layout looks crowded." "The pricing page just doesn't feel premium enough." A genuine instinct from the requester, but with no evidence that the surface is actually underperforming.
- A competitor signal. "Notion does X. We should test X." "Linear's onboarding looks great, let's copy it." Outside-in reasoning with no internal evidence that the same problem exists on our funnel.
The taste and competitor patterns are not rejection-worthy. They are research-stage proposals dressed up as test-stage proposals. The right answer is "let's verify there's a problem on our funnel before we commit a test cell to this."
The intake question: _"What internal signal — quantitative or qualitative — tells us the current state is a problem worth testing?"_ A blank or vague answer routes the request to investigation, not build.
Diagnostic 2: Intervention — could a designer build the variant from the IF clause alone?
The IF clause is the experiment's specification. It must be concrete enough that:
- A designer can build the variant without re-interviewing the requester.
- An analyst can identify the single behavior the change is supposed to move.
- A future reader, six months later, can understand what was tested.
The most common failure mode is the requester writing a goal in the IF slot:
- "IF we make pricing clearer..."
- "IF we improve the onboarding flow..."
- "IF we reduce friction on the trial signup..."
These are themes, not interventions. The fix is one rewrite:
| Theme submission | Intervention rewrite |
| --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| "Make pricing clearer" | "Display monthly price, annual price, and renewal date under each plan card so all costs are visible without an interaction" |
| "Improve onboarding" | "Replace the three-step setup wizard with a single guided checklist that links to each step inline, allowing skipping" |
| "Reduce friction on trial signup" | "Reduce the trial signup form from seven fields to two (email + workspace name), moving the rest to a post-signup profile step" |
| "Make the upgrade CTA more prominent" | "Move the 'Upgrade plan' CTA from the user-menu dropdown to a sticky banner on the dashboard for users within 7 days of trial expiration" |
| "Improve feature discovery" | "Add a 'What's new' banner above the workspace-switcher pointing to the three features most relevant to the user's current plan tier" |
| "Make the dashboard feel less crowded" | "Hide the secondary-action toolbar behind a collapsible accordion that is expanded by default for users with >30 days of tenure" |
Each rewrite names the element, the location, the specific change, and (often) the audience. That is what makes it falsifiable.
Diagnostic 3: Measurement — is the primary metric close enough to the intervention to be interpretable?
The third common mistake is choosing a primary metric three steps downstream of the intervention.
A copy change on the trial-signup CTA most directly affects clicks on that CTA. Not "annual recurring revenue per signup." Not "30-day retention rate." Those metrics will move only if every step between the CTA click and revenue also responds — so a flat downstream result does not falsify a copy-clarity theory by itself.
| Intervention | Bad primary metric | Better primary metric |
| --------------------------------------- | ----------------------------------- | --------------------------------------------------------------------------- |
| Trial-signup CTA copy change | 30-day retention | Click rate on the CTA, then trial-signup completion rate |
| Plan-card pricing copy change | Annual recurring revenue | Plan-card click rate, then trial-start rate from the plan-selection step |
| Onboarding-step removal | 60-day product-qualified lead rate | Completion rate of the next step, then time-to-first-value |
| Feature-discovery banner | Plan upgrades | Click rate on the banner, then activation rate of the discovered feature |
| Dashboard layout simplification | Long-term retention | Time-to-first-action on the dashboard, then 7-day return rate |
Pick the metric one step downstream of the intervention. Add the further-downstream metrics as secondary diagnostics — they explain whether the immediate effect propagated.
The Pre-Flight Power Check
Before any test enters the build queue, run a one-step power calculation on the proposed surface. The inputs:
- Baseline conversion rate on the target surface
- Minimum detectable effect (MDE) the team would care about
- Daily or weekly traffic on the surface
- Confidence level (95% is standard)
- Statistical power (80% is standard)
If the result says the test needs eighteen weeks at current traffic and the requester wants results in six, the test cannot answer the question regardless of how good the variant is. Surface that constraint at intake, not after build.
A worked example, with illustrative numbers:
- Surface: pricing-page plan cards.
- Baseline trial-start rate: 4.2%.
- MDE: 10% relative lift.
- Weekly sessions to the surface: 8,500 (split 50/50 = 4,250 per arm).
- Required sessions per arm at 95% confidence, 80% power: roughly 37,500.
- Weeks required: ~8.8.
That is runnable, if not fast. Change the MDE to 5%, and the required sessions jump to roughly 147,000 per arm, or about 34 weeks (halving the MDE roughly quadruples the sample). That is rarely runnable on a single surface, and the team should know that before committing to the test.
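For teams that want to sanity-check the calculator's output, here is a minimal sketch of the same arithmetic in Python, assuming the standard normal-approximation formula for a two-proportion test. The function name is illustrative, and a dedicated calculator may make slightly different assumptions (pooled variance, one-sided vs two-sided), so treat the output as an order-of-magnitude check rather than the authoritative number.

```python
import math
from statistics import NormalDist

def required_sessions_per_arm(baseline_rate: float, relative_mde: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Sessions per arm for a two-proportion z-test, normal approximation."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)        # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Pricing-page example: 4.2% baseline, 8,500 weekly sessions split 4,250 per arm.
for mde in (0.10, 0.05):
    n = required_sessions_per_arm(0.042, mde)
    print(f"MDE {mde:.0%}: ~{n:,} sessions per arm, ~{n / 4250:.1f} weeks")
# MDE 10%: roughly 37,500 per arm, about 9 weeks
# MDE 5%: roughly 147,000 per arm, about 34 weeks
```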
Run the calculation in the GrowthLayer A/B test duration calculator at intake. If the duration exceeds what the team can wait, the conversation shifts: either accept a larger MDE, find a higher-traffic surface, or treat the request as research rather than a test.
The competitor-imitation pattern fails this check more often than any other. Most teams have one or two surfaces with enough traffic to test small lifts. The rest produce tests that statistically cannot resolve the question being asked, no matter how appealing the variant looks.
The Intake Form Structure
Replace the free-text "hypothesis" field with this structure. Make each field required. Reject submissions where the required fields are empty or filled with theme words.
```
ORIGIN — WHAT INTERNAL SIGNAL JUSTIFIES THIS TEST?
Choose one:
[ ] Funnel data — link or screenshot of the drop-off
[ ] Qualitative — link to session replay, study, or support theme
[ ] Prior test — link to the previous test's readout
[ ] (other — describe the internal data signal here)
CURRENT EXPERIENCE
What does the user see and do on this surface today? Include screenshot.
PROPOSED CHANGE
Specific element, specific change, specific location, specific audience
if applicable. Include mockup or copy spec.
HYPOTHESIS — IF
[specific change to a specific element]
HYPOTHESIS — THEN
[specific user behavior or metric, one step downstream of the change]
HYPOTHESIS — BECAUSE
[specific reason — user psychology, friction, motivation, decision barrier]
PRIMARY METRIC
The behavior closest to the intervention.
SECONDARY METRICS
Diagnostic behaviors that explain whether the primary effect propagated.
GUARDRAIL METRICS
What should NOT get worse, even if the primary metric moves.
POWER CHECK (link to calculator output)
Baseline rate, MDE, weekly traffic, required duration.
If duration > available time → resubmit with a higher MDE or different surface.
LEARNING GOAL
What should the team know after the test, win or lose?
```
The form does the gatekeeping the manager would otherwise do verbally. A stakeholder cannot submit "make pricing clearer" because the IF field rejects vibe-level inputs by structure. A stakeholder cannot submit a competitor-driven request because the ORIGIN field is empty without an internal signal. A stakeholder cannot submit an underpowered test because the POWER CHECK field forces the calculation up front.
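If the form feeds a ticketing or experimentation tool, the same gatekeeping can be automated as a first-pass filter. The sketch below is illustrative: the field names, the theme-word list, and the length heuristic are assumptions rather than a spec for any particular form product, and a human reviewer still makes the final call.

```python
# Field names, theme words, and the length heuristic are illustrative assumptions.
THEME_WORDS = {"better", "clearer", "easier", "simpler", "cleaner",
               "premium", "crowded", "improve", "improved", "friction"}

REQUIRED_FIELDS = ["origin_signal", "if_clause", "then_clause", "because_clause",
                   "primary_metric", "power_check_link"]

def looks_like_theme(if_clause: str, min_words: int = 10) -> bool:
    """Heuristic: a short IF clause built around theme adjectives is a goal, not an intervention."""
    words = [w.strip(".,").lower() for w in if_clause.split()]
    return len(words) < min_words and any(w in THEME_WORDS for w in words)

def intake_problems(submission: dict) -> list[str]:
    """Return the reasons a submission should be routed back; empty means it can enter triage."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS
                if not submission.get(field, "").strip()]
    if looks_like_theme(submission.get("if_clause", "")):
        problems.append("IF clause names a theme, not an intervention: "
                        "specify the element, the change, and the location")
    return problems

print(intake_problems({
    "origin_signal": "",                           # no internal data signal attached
    "if_clause": "IF we make pricing clearer",     # theme, not an intervention
    "then_clause": "THEN more users start trials",
    "because_clause": "",
    "primary_metric": "Annual recurring revenue",
    "power_check_link": "",
}))
```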
The Three Pushback Scripts
Most intake conflicts get resolved in two messages or fewer. The framing matters more than the content. The goal is to keep the requester engaged and reroute their effort, not to mark their submission incomplete.
For a vague IF clause:
"The problem statement here is strong. Can we make the IF clause a little more concrete — are we changing the label, the helper text, the confirmation screen, or some combination? Want me to draft a rewrite we can iterate on?"
For a competitor-imitation request:
"Before we commit a test cell to this, can we look at our own funnel data on that surface? If our data shows the same friction the competitor's pattern is solving, we can write a specific hypothesis around it. If our data shows that surface is fine, we will have saved the cell and learned something about where our actual leverage points are."
For a preference-driven request without internal data:
"I want to make sure we can defend whatever result comes back. Let me pull the funnel data on that surface first — if it confirms a problem, we write a hypothesis around the specific friction. If it shows the surface is performing fine, we redirect to a higher-leverage surface."
Each script does the same three things: acknowledges the requester's underlying instinct as valid, reframes the next step as investigation rather than rejection, and offers a concrete near-term action. Most stakeholders accept this immediately because what they actually want is the outcome the test was supposed to produce, not the test itself.
A Worked Example — Trial-to-Paid Prompt
Origin. Funnel data shows that users in the seven-day trial who reach the dashboard but do not complete the activation checklist convert at less than half the rate of users who complete it. Session-replay review on a sample of these users reveals confusion about which checklist items are required vs optional.
Current experience. The activation checklist on the dashboard shows all twelve items with identical visual weight. There is no indication that the first three are the "critical path" while the rest are nice-to-have.
Proposed change. Visually separate the activation checklist into a "Required to start" group (three items) above a divider, and an "Optional next steps" group (nine items) below. Add a one-line caption above the first group: "Complete these three steps to see your first results."
Hypothesis.
- IF we group the activation checklist into "Required to start" (three items) and "Optional next steps" (nine items), with a caption above the first group explaining what completing them unlocks,
- THEN more trial users will complete the three required items within their first session on the dashboard,
- BECAUSE the visual grouping and caption reduce the user's decision cost about which items to prioritize, eliminating the implicit "all twelve look equally required" friction.
Primary metric. First-session completion rate of the three required checklist items.
Secondary metrics. Completion rate of the optional items, time-to-first-completed-item, session length on the dashboard.
Guardrails. Trial-to-paid conversion rate (do not want to optimize the checklist at the expense of overall conversion), churn-during-trial.
Power check. Baseline first-session completion of the three required items is 38%. MDE is 10% relative. Weekly trial signups reaching the dashboard: 1,200, or 600 per arm. Required sessions per arm at 95% confidence and 80% power: ~2,600. Weeks required: ~4.3. The test is runnable on a reasonable timeline; confirm the expected duration with the requester before build.
Learning goal. Establish whether visual grouping of multi-step checklists reduces friction enough to improve completion of the required subset. If the test wins, the next test should examine whether the same pattern applies on the post-paid onboarding checklist.
Notice the differences from a typical submission: the intervention is buildable from the IF clause alone, the primary metric is one step downstream of the intervention, the mechanism is specific (decision cost reduction via visual grouping), and the power check has produced an explicit duration estimate that the team and requester agree on before the build starts.
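For completeness, the power-check figures above drop out of the same sketch used in the pre-flight section, with the same caveat that a dedicated calculator may differ at the margins.

```python
# Assumes the required_sessions_per_arm() sketch from the pre-flight section.
n = required_sessions_per_arm(baseline_rate=0.38, relative_mde=0.10)
weekly_per_arm = 1200 / 2                 # 1,200 trial users reach the dashboard per week
print(n, round(n / weekly_per_arm, 1))    # roughly 2,600 per arm, ~4.3 weeks
```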
The Six Things That Catch 80% of Bad Submissions
If you do not adopt the full intake form structure today, adopt these six rules:
- Reject any IF clause that uses only theme words. ("better," "clearer," "easier," "more premium," "less crowded.") Specific element + specific change is the minimum.
- Require an internal data signal for every submission. No exceptions for senior stakeholders. The form does not care who submitted it.
- Run a power calculation at intake. If the surface lacks traffic, the test cannot answer the question — surface that before build, not after.
- Pick the primary metric one step downstream of the intervention. Further-downstream metrics are secondary diagnostics, not primaries.
- One intervention per test. Bundled changes are valid business tests but not clean learning tests. Label them as bundles if they ship.
- Document the test for the future reader. Write the hypothesis as if someone six months from now will read it without you in the room. They will.
Analyst Checklist
- Never approve a hypothesis whose IF clause uses only theme words.
- Verify the ORIGIN field is populated with an internal signal, not opinion or competitor screenshots.
- Choose the primary metric closest to the intervention. Push back if the requester insists on a downstream business KPI.
- Run the power check before approving the test. If the duration exceeds the team's runway, return the request with either a larger MDE or a different surface.
- Read every approved hypothesis aloud to yourself before greenlighting. If it sounds like a goal, send it back.
CRO Manager Checklist
- Make the intake form the gate. No test enters build without passing the three diagnostics and the power check.
- Train stakeholders on hypothesis structure, not just analysts. A 30-minute internal session pays for itself within a quarter.
- Audit the existing test library quarterly. Pull the last 20 tests. How many have an IF clause specific enough that you could build the variant from it today? The ones that don't are candidates for retrospective rewriting.
- Tag tests by intervention type and mechanism in the library, not by theme. "Visual grouping," "field reduction," "default pre-selection" are interventions. "Clarity," "friction," "trust" are themes that index poorly later. A sketch of what that index can look like follows this list.
- Connect the intake standard to the readout standard. Tests with strong hypotheses get readouts that name the intervention and the mechanism. The two standards reinforce each other.
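A minimal sketch of an intervention-and-mechanism index, assuming plain in-memory records rather than any particular experimentation tool; the field names and tag vocabulary are illustrative.

```python
# Hypothetical library records: field names and tag values are illustrative.
TEST_LIBRARY = [
    {"id": "EXP-214", "surface": "activation checklist",
     "intervention": "visual grouping", "mechanism": "decision-cost reduction",
     "result": "win"},
    {"id": "EXP-198", "surface": "trial signup form",
     "intervention": "field reduction", "mechanism": "effort reduction",
     "result": "flat"},
]

def find_tests(intervention=None, surface_contains=None):
    """Pull prior tests by intervention and surface rather than by theme tag."""
    hits = TEST_LIBRARY
    if intervention:
        hits = [t for t in hits if t["intervention"] == intervention]
    if surface_contains:
        hits = [t for t in hits if surface_contains in t["surface"]]
    return hits

# "Tests of visual-grouping interventions on activation surfaces"
print(find_tests(intervention="visual grouping", surface_contains="activation"))
```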
FAQ
Won't requiring a power calculation at intake slow down stakeholder requests?
In the short term, slightly. In the long term, no. The alternative is shipping tests that statistically cannot resolve the question being asked, then having to explain to the stakeholder six weeks later why the result is "inconclusive." Surfacing the constraint up front converts the conversation from "did the test work" to "do we have the surface to answer this question," which is a more useful conversation regardless of the answer.
What if a stakeholder submits a competitor-driven request and refuses to do the investigation step?
Hold firm and escalate if needed. The investigation step costs hours; running the test costs weeks of a test cell. If the stakeholder cannot make the case for why our funnel has the problem the competitor's pattern is solving, the team's resources are better spent on a request that can. This is not a popularity contest; it is resource allocation.
How specific does the ORIGIN signal need to be?
Specific enough that the analyst can verify it independently. "Trial-signup drop-off is high" is too vague. "Trial-signup completion has dropped from 67% to 58% over the last 60 days, segmented to mobile users, attached funnel chart" is specific enough.
What about exploratory tests where we genuinely don't know the mechanism?
Label them as exploratory. The hypothesis still needs to specify the intervention; what becomes optional is the BECAUSE clause being a fully formed theory. An exploratory test asks "does this variant move the metric" without committing to why. Valid, but should be the minority of the program.
How do I handle a leadership-directive test?
Translate it into the intake form before building. "Test the new pricing page" is not a hypothesis. The translation is: which specific element on the new pricing page is the team changing first, what behavior should that move, and why? If leadership has only given a directional ask, the translation is the test plan, and the team should walk it back to leadership for confirmation before build.
What's the single highest-leverage change a CRO manager can make on this?
The intake form change. One afternoon to update the form, a 30-minute stakeholder training, a quarterly audit. The compounding return on every test from that point forward is the largest available process change in most programs.
The Real Goal
A program's test library should compound into institutional knowledge. The question to ask, six months from any given test, is: can a new analyst pull this readout, understand what was changed and why, and apply the learning to a related surface? If yes, the program is compounding. If no, the program is accumulating numbers.
The intake form is where compounding begins. Everything downstream — the build, the analysis, the readout, the library tag — derives from how specific the hypothesis was at submission. A vague hypothesis cannot produce a specific readout no matter how good the analyst is.
GrowthLayer enforces the intake standard by default: every test requires an origin signal, a falsifiable IF clause, a power-check confirmation, and an explicit primary metric. The test library indexes by intervention and mechanism rather than theme, so future analysts can search "tests of visual-grouping interventions on activation surfaces" and get the right history back. For the deeper reasoning on why this matters — and why vague hypotheses break meta-analysis at the program level — see the companion piece on falsifiability and the experiment's memory.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.