How AI Generates Better A/B Test Hypotheses Than Most CRO Teams
AI analyzes your entire test history to generate hypotheses grounded in YOUR data — not generic best practices. Here's where AI excels and where humans must lead.
The best hypothesis I ever tested did not come from a heuristics walkthrough, a user interview, or a best practice article.
It came from looking at the full history of our testing program and asking a simple question: what mechanism has produced the most consistent wins in our specific context? The answer — friction removal at action-stage funnel moments — was sitting in our own data for years before I found it. Nobody on the team had seen it, because nobody had compared behavioral mechanisms across the full portfolio simultaneously.
AI changed that. The ability to classify hundreds of test records by behavioral mechanism, compute win rates per mechanism, and generate new hypotheses that prioritize the highest-performing patterns is now operationally feasible.
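The aggregation step is simple once each test record carries a mechanism label. A minimal sketch, assuming a list of hypothetical records where the mechanism tag was assigned upstream (by an LLM classifier or by hand) and `won` records the test outcome:

```python
from collections import defaultdict

# Hypothetical test records; in practice these come from your test
# management tool, with the mechanism label assigned by a classifier.
test_history = [
    {"mechanism": "friction_removal", "won": True},
    {"mechanism": "friction_removal", "won": True},
    {"mechanism": "friction_removal", "won": False},
    {"mechanism": "social_proof", "won": False},
    {"mechanism": "social_proof", "won": False},
    {"mechanism": "urgency", "won": True},
    {"mechanism": "urgency", "won": False},
]

def win_rates_by_mechanism(records):
    """Aggregate win rate per behavioral mechanism across the portfolio."""
    tallies = defaultdict(lambda: [0, 0])  # mechanism -> [wins, total]
    for r in records:
        tallies[r["mechanism"]][1] += 1
        if r["won"]:
            tallies[r["mechanism"]][0] += 1
    return {m: wins / total for m, (wins, total) in tallies.items()}

rates = win_rates_by_mechanism(test_history)
# Rank mechanisms by historical win rate, best first.
ranking = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

The hard part is not this arithmetic; it is labeling hundreds of free-text test descriptions consistently, which is exactly the step AI makes feasible.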
The Core Problem With Most Test Ideation
Most CRO teams generate hypotheses the same way: heuristics evaluations, competitive analysis, user research synthesis, and industry best practice frameworks. These inputs are legitimate starting points, especially for programs without much historical data. But they share a structural weakness: they are not calibrated to your specific program's evidence.
What AI Does Well: Pattern Detection at Scale
The most powerful application of AI to hypothesis generation is surfacing the structured patterns in your program history that should drive ideation but stay invisible inside unstructured test records. When I ran behavioral mechanism classification across our full testing portfolio, friction removal as a mechanism won at rates two to three times higher than the portfolio average. Social proof tests performed well below average.
The Mechanism-First Approach
A mechanism-first approach asks: "Which behavioral mechanisms have produced reliable wins in this funnel stage, for this audience type, in our program history — and what specific treatments could activate those mechanisms here?" The outputs are grounded in your program's evidence about mechanism performance.
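That question translates directly into a filtered version of the portfolio aggregation. A sketch, with hypothetical field names (`funnel_stage`, `audience`) standing in for whatever segmentation your test records actually carry:

```python
def mechanism_performance(records, funnel_stage, audience):
    """Win rate per mechanism, restricted to one funnel stage and audience."""
    subset = [r for r in records
              if r["funnel_stage"] == funnel_stage and r["audience"] == audience]
    tallies = {}  # mechanism -> (wins, total)
    for r in subset:
        wins, total = tallies.get(r["mechanism"], (0, 0))
        tallies[r["mechanism"]] = (wins + r["won"], total + 1)
    return {m: w / t for m, (w, t) in tallies.items()}

# Illustrative records for a checkout-stage, returning-visitor slice.
history = [
    {"mechanism": "friction_removal", "funnel_stage": "checkout",
     "audience": "returning", "won": True},
    {"mechanism": "friction_removal", "funnel_stage": "checkout",
     "audience": "returning", "won": True},
    {"mechanism": "social_proof", "funnel_stage": "checkout",
     "audience": "returning", "won": False},
    {"mechanism": "friction_removal", "funnel_stage": "landing",
     "audience": "new", "won": False},
]
perf = mechanism_performance(history, "checkout", "returning")
```

The AI's job is then to propose treatments that activate the top mechanism in that slice, not to invent mechanisms from scratch.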
Where AI Hypothesis Generation Fails
AI cannot do user research. AI cannot understand organizational constraints. AI cannot interpret novel user behavior. AI will reflect the quality of your historical data. Garbage in, garbage out, regardless of how sophisticated the pattern detection layer is.
The Human-AI Split in Practice
AI generates a prioritized list of hypothesis candidates, classified by mechanism, ranked by historical performance. Humans evaluate each candidate against four criteria that AI cannot assess: Is the underlying user need real? Is the implementation feasible? Does this conflict with a planned product change? Has this variant been tested before?
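The split can be enforced mechanically: the AI produces the ranked candidate list, and a human gate admits only candidates that pass every criterion. A minimal sketch, with hypothetical criterion keys mirroring the four questions above:

```python
def human_review(candidates, assessments):
    """Keep only AI-generated candidates that pass every human-judged criterion."""
    criteria = ("need_is_real", "feasible",
                "no_roadmap_conflict", "not_previously_tested")
    approved = []
    for c in candidates:
        judgment = assessments.get(c["id"], {})
        # A missing judgment counts as a failure: no human sign-off, no test.
        if all(judgment.get(k, False) for k in criteria):
            approved.append(c)
    return approved

# Two AI-ranked candidates; the reviewer flags the second as infeasible.
candidates = [
    {"id": "h1", "mechanism": "friction_removal",
     "hypothesis": "Remove optional address line at checkout"},
    {"id": "h2", "mechanism": "friction_removal",
     "hypothesis": "Collapse coupon field behind a link"},
]
assessments = {
    "h1": {"need_is_real": True, "feasible": True,
           "no_roadmap_conflict": True, "not_previously_tested": True},
    "h2": {"need_is_real": True, "feasible": False,
           "no_roadmap_conflict": True, "not_previously_tested": True},
}
approved = human_review(candidates, assessments)
```

Defaulting a missing judgment to failure keeps the human in the loop: nothing reaches the backlog without explicit sign-off on all four criteria.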
AI generates better hypotheses than most CRO teams — not because it is more creative, but because it has access to the distributional patterns in your historical data that no human analyst can hold simultaneously in mind. Build the foundation. Let the AI read the patterns. Then let humans decide what to test.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.