
We Used AI to Audit Our Entire Testing Program in Hours (What Took Weeks Manually)

A manual meta-analysis of our testing program took weeks of spreadsheet work. AI can do the same audit in hours. Here's exactly what it covers and what it finds.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
2 min read

Editorial disclosure

This article lives on the canonical GrowthLayer blog path for indexing consistency. Review rules, sourcing rules, and update rules are documented in our editorial policy and methodology.

Fortune 150 experimentation lead · 100+ experiments / year · Creator of the PRISM Method
A/B Testing · Experimentation Strategy · Statistical Methods · CRO Methodology · Experimentation at Scale

The first time I audited our testing program manually, it took the better part of two weeks.

I exported every test record from the database, pulled the raw data into a spreadsheet, wrote formulas to recompute every significance calculation from scratch, manually reviewed records for duplicates, checked every conversion count against traffic counts for impossible values, and tried to categorize behavioral mechanisms across several years of tests written by different analysts using different terminology.

By the time I finished, I had found enough data integrity problems to reduce our stated win rate by roughly 70%.

What a Proper Testing Program Audit Actually Covers

A rigorous testing program audit covers six distinct layers: statistical recomputation, impossible data detection, duplicate identification, behavioral mechanism classification, cross-test pattern detection, and win rate transparency.

The Statistical Recomputation Layer

For every test record that contains visitor counts and conversion counts for both control and variant, compute the Z-statistic from the raw numbers and compare the resulting p-value to the stored p-value. In our historical database, this check found discrepancies in roughly a quarter of all records. The most systematic cause was test type inconsistency: some analysts had used one-tailed tests when the program standard specified two-tailed.
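Here is a minimal sketch of that recomputation in Python. The field names, the 0.001 tolerance, and the two-tailed default are illustrative assumptions, not the actual schema of our test database.

```python
"""Recompute the z-statistic and p-value for one test record and flag
mismatches against the stored p-value."""
import math
from scipy.stats import norm


def recompute_p_value(control_visitors, control_conversions,
                      variant_visitors, variant_conversions,
                      two_tailed=True):
    """Pooled two-proportion z-test computed from the raw counts."""
    p_c = control_conversions / control_visitors
    p_v = variant_conversions / variant_visitors
    p_pool = (control_conversions + variant_conversions) / (control_visitors + variant_visitors)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_visitors + 1 / variant_visitors))
    z = (p_v - p_c) / se
    # Program standard is two-tailed; a stored one-tailed value will not match.
    p = 2 * norm.sf(abs(z)) if two_tailed else norm.sf(z)
    return z, p


def flag_discrepancy(record, tolerance=1e-3):
    """True when the recomputed p-value disagrees with the stored one."""
    _, p = recompute_p_value(record["control_visitors"], record["control_conversions"],
                             record["variant_visitors"], record["variant_conversions"])
    return abs(p - record["stored_p_value"]) > tolerance
```

Running a check like this over every record is what surfaces the one-tailed versus two-tailed inconsistencies: the raw counts have not changed, but the recomputed p-value no longer matches what was written down.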

Impossible Data and Duplicate Detection

Any test record where the reported number of conversions exceeds the reported number of visitors is factually impossible. The most common source was spreadsheet column transposition during data migration. Duplicates emerge from migration events and reporting conventions, most notably the triple-count pattern: a test result is entered once when concluded, again when validated, and a third time when reported in a quarterly review. Both checks are shown in the sketch below.
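A sketch of both checks over an exported test log, assuming a flat CSV export; the file name and column names are illustrative, not the program's actual export format.

```python
"""Flag impossible rows and likely duplicate entries in the exported test log."""
import pandas as pd

tests = pd.read_csv("test_records.csv")  # hypothetical export

# Impossible data: conversions can never exceed visitors in either arm.
impossible = tests[
    (tests["control_conversions"] > tests["control_visitors"])
    | (tests["variant_conversions"] > tests["variant_visitors"])
]

# Duplicates: identical raw counts under the same test name usually mean the
# triple-count pattern (entered at conclusion, validation, and quarterly review).
duplicates = tests[
    tests.duplicated(
        subset=["test_name", "control_visitors", "control_conversions",
                "variant_visitors", "variant_conversions"],
        keep=False,
    )
].sort_values("test_name")

print(f"{len(impossible)} impossible rows, {len(duplicates)} rows in duplicate groups")
```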

Cross-Test Pattern Detection

With consistent mechanism classification, the pattern detection layer computes win rates and average effect sizes by mechanism category. In our program, friction removal was performing at two to three times the portfolio average win rate. Social proof mechanisms were performing well below average.
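Once every record carries a consistent mechanism label, the pattern layer is a straightforward group-by. The column names and the 0.05 significance threshold below are assumptions for illustration.

```python
"""Win rate and average relative lift by behavioral mechanism."""
import pandas as pd

tests = pd.read_csv("classified_test_records.csv")  # hypothetical, post-classification

tests["control_rate"] = tests["control_conversions"] / tests["control_visitors"]
tests["variant_rate"] = tests["variant_conversions"] / tests["variant_visitors"]
tests["relative_lift"] = tests["variant_rate"] / tests["control_rate"] - 1
tests["win"] = (tests["recomputed_p_value"] < 0.05) & (tests["relative_lift"] > 0)

by_mechanism = (
    tests.groupby("mechanism")
    .agg(test_count=("win", "size"),
         win_rate=("win", "mean"),
         avg_relative_lift=("relative_lift", "mean"))
    .sort_values("win_rate", ascending=False)
)
print(by_mechanism)
```

This is the table that made the friction-removal versus social-proof gap visible: the same win-rate math, just grouped by mechanism instead of averaged across the whole portfolio.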

The Honest Win Rate

After running the full audit, the reduction was roughly 70%: from a reported win count in the high fifties to a verified win count of seventeen. The honest win rate is not a number to be ashamed of. A 35-40% win rate on tests that meet data integrity and statistical validity requirements is a reasonable result for a program operating in a high-consideration funnel.

Run the audit. Trust the lower number. Build on data you can defend.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
