I Audited Our Testing Program and Found We'd Been Lying to Ourselves: An Honest Retrospective
We claimed dozens of test winners. The real number was a fraction of that. Here's the honest audit that exposed inflated win rates, impossible data, and circular frameworks.
I built a testing program I was proud of. Over three years, we logged 57 winning experiments. We presented those wins to leadership. We used them to justify headcount. We cited the win rate as evidence of a mature, rigorous experimentation practice.
Then I ran an audit.
After four weeks of going back through every test in our history, applying consistent standards to all of them, the 57 winning experiments became 17. A 70 percent reduction. We had not been lying deliberately. We had been doing something more insidious: making small, defensible choices that individually seemed reasonable and collectively amounted to a systematic distortion of our results.
This is that audit.
How 57 Became 17: The Five Causes
1. Duplicate Counting
Twelve of the 57 claimed wins were duplicates of experiments we had already counted. Sometimes we ran the same hypothesis on different segments and counted each segment as a separate win. Sometimes we relaunched a test that had been paused because of a data quality issue, and the second run was counted separately even though the first run had already consumed some of the effect.
When we consolidated these into single experiments, we went from 57 to 45.
2. Early Stopping on Positive Trends
Eleven of our remaining 45 tests had been stopped before they reached the pre-planned sample size. Not dramatically early—usually 60 to 80 percent of the way through. The justification at the time was always some version of "the effect is clear enough" or "we need to ship this feature."
The problem is that p-values are not stable when you check them repeatedly. Stopping a test at the first significant result is a form of data dredging, even if it feels like responsible shipping. We reclassified all 11 of these tests, since none of them had reached its pre-planned sample size; when we applied a threshold corrected for the optional stopping, 7 of the 11 also dropped below statistical significance.
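The inflation is easy to demonstrate with a quick simulation. The sketch below is illustrative only, with made-up traffic numbers rather than anything from our audit: it runs A/A tests in which both arms share the same true conversion rate, checks a two-proportion z-test once a day, and stops at the first significant reading. Every stop it records is a false positive.

```python
# Illustrative simulation of how daily peeking inflates false positives.
# Both arms share the same true conversion rate, so every "win" is spurious.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims=2000, days=30, visitors_per_day=500,
                                base_rate=0.05, alpha=0.05):
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            conv_a += rng.binomial(visitors_per_day, base_rate)
            conv_b += rng.binomial(visitors_per_day, base_rate)
            n_a += visitors_per_day
            n_b += visitors_per_day
            # Two-proportion z-test on today's cumulative counts.
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            if se == 0:           # no conversions yet in either arm
                continue
            z = (conv_b / n_b - conv_a / n_a) / se
            p_value = 2 * (1 - stats.norm.cdf(abs(z)))
            if p_value < alpha:   # stop at the first "significant" day
                false_positives += 1
                break
    return false_positives / n_sims

print(f"False positive rate with daily peeking: {peeking_false_positive_rate():.1%}")
# Lands far above the nominal 5 percent, typically in the 20-30 percent range here.
```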
We were now at 34 wins from 45 reviewed.
3. Underpowered Tests Calling False Positives
We had designed our power calculations to detect a 10 percent relative lift at 80 percent power, which was reasonable. What we had not done was check whether our actual experiment durations matched the required sample sizes.
Of our remaining 34 wins, 9 had collected fewer conversions than our power calculation required. For tests with small baseline rates and ambitious lift targets, this is common—traffic estimates are imperfect, conversion rates fluctuate seasonally, and tests get cut short when a product sprint ends.
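One way to run this check retroactively is to recompute the required sample size from the original power assumptions and compare it with what each test actually collected. The sketch below uses statsmodels; the baseline rate and actual traffic figures are placeholders for illustration, not numbers from our audit.

```python
# Hypothetical check: did a test collect the sample its own power calc required?
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.04           # 4% baseline conversion (illustrative)
relative_lift = 0.10           # the 10% relative lift we powered for
target_rate = baseline_rate * (1 + relative_lift)

effect_size = proportion_effectsize(target_rate, baseline_rate)
required_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)

actual_per_arm = 18_000        # what the test actually collected (illustrative)
print(f"Required visitors per arm: {required_per_arm:,.0f}")
print(f"Actual visitors per arm:   {actual_per_arm:,}")
if actual_per_arm < required_per_arm:
    print("Underpowered: a 'significant' win here deserves extra scrutiny.")
```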
Re-analyzing the underpowered tests with Bayesian credible intervals rather than frequentist p-values revealed that 6 of the 9 had intervals that included zero. The observed lift was positive, but a true lift of zero was entirely plausible.
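For anyone who wants to reproduce this style of re-analysis, a simple version uses a Beta-Binomial posterior for each arm and a Monte Carlo interval on the relative lift. The counts below are made up for illustration, not taken from any test in the audit.

```python
# Illustrative Bayesian re-analysis: Beta posteriors per arm, credible interval on lift.
import numpy as np

rng = np.random.default_rng(7)

control_conversions, control_visitors = 310, 9_800
variant_conversions, variant_visitors = 352, 9_750

# Beta(1, 1) prior -> posterior is Beta(conversions + 1, non-conversions + 1)
control_post = rng.beta(control_conversions + 1,
                        control_visitors - control_conversions + 1, 100_000)
variant_post = rng.beta(variant_conversions + 1,
                        variant_visitors - variant_conversions + 1, 100_000)

relative_lift = variant_post / control_post - 1
low, high = np.percentile(relative_lift, [2.5, 97.5])
prob_positive = (relative_lift > 0).mean()

print(f"95% credible interval for relative lift: [{low:+.1%}, {high:+.1%}]")
print(f"P(variant beats control) = {prob_positive:.1%}")
# A high "chance to beat control" can coexist with an interval that includes zero.
```

A test like this looks like a winner if you only report the probability that the variant beats the control, which is part of why underpowered wins slip through.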
28 wins remaining.
4. Proxy Metric Substitution
This one was the most uncomfortable to document. Seven of our wins had been declared on a proxy metric—typically click-through rate or form start rate—when our planned primary metric was enrollment or purchase.
In each case, there was a post-hoc explanation: "The test was paused before we could collect enough downstream data, but click-through rate was trending positively, so we called it." Or: "The enrollment rate was directionally positive (p = 0.11) and the click rate was significant, so we used click rate as confirmation."
We reviewed what happened to downstream conversions in the weeks after these tests were shipped. Four of the seven showed no enrollment improvement. One showed a regression.
These seven tests were reclassified from wins to inconclusive.
21 wins remaining.
5. Regression to the Mean After Rollout
The last four wins fell into the hardest category: they had been statistically valid, properly powered, and measured on the right metric during the test. But when we reviewed 90-day post-rollout performance, the metric improvements had reversed.
This is regression to the mean in action. Some tests capture real effects that are context-specific to a particular time window—a seasonal peak, a traffic quality shift, a temporary competitive gap. When those conditions change, the "improvement" disappears.
We did not reverse-classify these tests as losses. The methodology was sound. But we flagged them as "context-dependent wins" that could not be generalized to other contexts or used as evidence of durable improvements.
17 durable wins from the original 57.
What This Means for CRO Programs
The audit was not a condemnation of our team or our process. Every one of those five failure modes was the result of reasonable-seeming decisions made under realistic constraints. We were a real team, shipping real products, under real timelines.
The lesson is structural: without systematic audit standards applied consistently at the time tests conclude, every CRO program will drift toward inflation. The incentives all point the same direction. Stakeholders want wins. Teams want to show impact. Stopping a test slightly early does not feel like fraud. Substituting a proxy metric feels pragmatic. Counting the same hypothesis twice feels like thoroughness.
The only fix is a pre-committed definition of what constitutes a win, applied at test design rather than after results come in.
In our program, we now use a pre-registration process: before any test launches, we write down the primary metric, the minimum sample size, the planned duration, and the stopping rule. No test can be called a win if it does not meet these pre-registered criteria—regardless of what the results look like.
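In code, the brief can be as small as a frozen record plus a single check that a result meets every pre-registered criterion. The sketch below is a generic illustration with made-up field names and values; it is not GrowthLayer's schema or our internal tooling.

```python
# Illustrative pre-registration record and win check (hypothetical fields/values).
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)        # frozen: the brief cannot be edited after launch
class ExperimentBrief:
    hypothesis: str
    primary_metric: str
    minimum_sample_per_arm: int
    planned_end_date: date
    stopping_rule: str

def qualifies_as_win(brief: ExperimentBrief, *, metric_used: str,
                     sample_per_arm: int, end_date: date,
                     significant: bool) -> bool:
    """A result counts as a win only if it meets every pre-registered criterion."""
    return (significant
            and metric_used == brief.primary_metric
            and sample_per_arm >= brief.minimum_sample_per_arm
            and end_date >= brief.planned_end_date)

brief = ExperimentBrief(
    hypothesis="Shorter enrollment form increases completed enrollments",
    primary_metric="enrollment_rate",
    minimum_sample_per_arm=21_000,
    planned_end_date=date(2024, 6, 30),
    stopping_rule="No interim stopping; analyze once at the planned end date",
)

# A "significant" click-through result from a test stopped early does not qualify.
print(qualifies_as_win(brief, metric_used="click_through_rate",
                       sample_per_arm=16_500, end_date=date(2024, 6, 12),
                       significant=True))   # -> False
```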
Our apparent win rate dropped from roughly 40 percent to 19 percent after adopting this process. Leadership accepted the trade-off: a lower win rate from a rigorous program is more valuable than a high win rate from a permissive one.
Using GrowthLayer to Prevent Win Rate Inflation
The pattern I described above—teams with good intentions, reasonable local decisions, and systematically inflated aggregate results—is exactly what GrowthLayer's test library is designed to prevent. When every experiment requires a pre-registered hypothesis, a primary metric, and a pre-planned sample size before it can be created, the structural conditions for inflation are removed.
You cannot declare a win on a metric you did not pre-specify. You cannot claim a result from a test that did not reach its planned sample size. The discipline is built into the process rather than depending on individual judgment under pressure.
Explore GrowthLayer's experiment management features to see how pre-registration and structured result documentation work in practice.
Key Takeaways
- 57 claimed wins became 17 after a rigorous audit—a 70 percent reduction. The inflation was not deliberate fraud but the cumulative result of five defensible-seeming local decisions.
- Duplicate counting, early stopping, underpowered tests, proxy metric substitution, and regression to the mean are the five mechanisms through which CRO programs systematically inflate win rates.
- Optional stopping is the most common source of false positives. Stopping a test 20 percent early to ship a feature is not a neutral act—it biases toward inflated significance.
- Proxy metric substitution is the hardest to catch because it always comes with a plausible post-hoc explanation. Pre-specify your primary metric before the test launches and do not change it.
- A lower win rate from a rigorous program is more credible than a high win rate from a permissive one. Stakeholders who understand statistics will trust your program more, not less, when they see you apply consistent standards.
- Pre-registration—documenting your primary metric, sample size, and stopping rules before launch—is the only structural fix. Process discipline at design time prevents inflation that would require painful audits later.
FAQ
What is a realistic A/B test win rate for a mature experimentation program?
After applying consistent audit standards, our program's durable win rate settled at around 17 to 20 percent. Published accounts of online controlled experiments at technology companies suggest 10 to 30 percent is typical, with many mature programs clustering toward the upper end of that range. A program claiming 50 to 70 percent win rates is almost certainly using inconsistent or inflated standards.
What is "optional stopping" in A/B testing and why does it cause false positives?
Optional stopping is the practice of checking your test results while the test is running and stopping when you see a significant p-value. Because p-values fluctuate during a test and will cross the 0.05 threshold by chance multiple times in a long experiment, stopping at the first significant reading inflates false positive rates substantially. If you check daily and stop at any significant reading, your actual false positive rate can exceed 30 percent even though each individual check appears to use a 5 percent threshold.
How do I conduct a win rate audit on my own program?
Start by pulling every experiment your team has called a win in the last two years. For each, check: Did it reach the pre-planned sample size? Was it measured on the primary metric specified before the test launched? Was it counted once or multiple times? Did post-rollout performance sustain the test result? Apply these criteria consistently. The number of wins that survive will tell you how accurate your historical claims have been.
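If your experiment log is machine-readable, that pass can be scripted. The sketch below applies the four questions to a few made-up records; the field names are placeholders to adapt to however your team documents tests.

```python
# Illustrative audit pass over historical "wins" (records are made up).
claimed_wins = [
    {"name": "Checkout copy v2", "duplicate_of": None,
     "reached_planned_sample": True, "won_on_primary_metric": True,
     "sustained_post_rollout": True},
    {"name": "Shorter form", "duplicate_of": None,
     "reached_planned_sample": False, "won_on_primary_metric": False,
     "sustained_post_rollout": False},
    {"name": "Shorter form (rerun)", "duplicate_of": "Shorter form",
     "reached_planned_sample": True, "won_on_primary_metric": True,
     "sustained_post_rollout": True},
]

def survives_audit(test: dict) -> bool:
    return (test["duplicate_of"] is None          # counted once
            and test["reached_planned_sample"]    # properly powered
            and test["won_on_primary_metric"]     # pre-specified metric
            and test["sustained_post_rollout"])   # held up after shipping

durable = [t["name"] for t in claimed_wins if survives_audit(t)]
print(f"{len(durable)} of {len(claimed_wins)} claimed wins survive: {durable}")
```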
Is pre-registration feasible for fast-moving product teams?
Yes. Pre-registration does not require academic-style protocols. A one-paragraph experiment brief that specifies the hypothesis, primary metric, minimum sample size, and planned duration takes five minutes to write and prevents the most common sources of inflation. Teams that resist pre-registration usually do so because they want to preserve the ability to change the rules after seeing early results—which is precisely what pre-registration prevents.
Related Reading
The win rate audit patterns described here connect directly to the broader question of which process-level decisions compound into program failure. The A/B Testing Mistakes That Kill Programs, Not Just Tests covers seven organizational mistakes—including underpowered tests and proxy metric substitution—that create the conditions for the kind of inflation uncovered in this audit.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.