I Audited My Own A/B Testing Program. Here's What Broke.

A 30-test audit revealed data integrity issues in 8 tests, misclassified winners, and a shipping decision based on a post-hoc secondary metric. The methodology fixes follow.

GrowthLayer
6 min read · Updated May 25, 2026

Key takeaways

  • 8 of 30 tests had missing or incorrect raw numbers in storage
  • Peeking at a fixed-horizon p-value inflated the false positive rate from 5% to roughly 25% on about 23% of tests
  • Expected Loss + Bayesian probability flipped 5 inconclusive decisions vs p-value alone
  • Placement tests win 2.4× as often as copy tests; portfolio was 62% copy
  • Larger tests have MORE tracking errors — sample size doesn't fix methodology

TL;DR

I spent 30 hours auditing 30 CTA tests from my own CRO program. The results were humbling. Eight of the tests had data integrity issues. A quarter of the "winners" were actually losers if you used the right decision framework. And one test shipped on a secondary metric after the primary lost. Here's what I found, and what I'm changing.

Why I Audited My Own Program

Last month, I decided to pull all CTA tests from my admin database and validate them. Not because I suspected problems—because I wanted to write about what worked. I assumed the tests were solid. I'd used Speero (a commercial stats engine), logged results in a clean schema, and shipped decisions based on p<0.05.

What I found: systematic gaps in data integrity, decision-making frameworks, and statistical methodology. Not catastrophic gaps. But gaps that meant I'd misinterpreted ~30% of tests and shipped one that shouldn't have shipped.

This article walks through the five most consequential findings, with real test IDs, real numbers, and the process changes I made.

Finding 1: Data Integrity Was Worse Than I Thought

I started by spot-checking five tests. Three had issues:

HPT-66 (DE OAM/HBE, phone CTA copy). The database stored identical numbers for control and variant: 700 conversions out of 84,500 traffic each. Impossible. The real numbers from HeyMarvin (my integration platform): control 82/700 (11.7%), variant 53/700 (7.6%), p=0.049. The stored data would have computed to p=1.0 (no difference). For two years I'd been reading this as "no effect" when the real test showed a significant copy degradation.

HPT-26 (DE homepage phone CTA). Stored clicks as the conversion count and traffic as 0. This test was actually 1,051 phone calls from 158,304 eligible homepage visitors. Re-derived: the control phone CTA drove a 0.66% call rate; the variant drove 0.77%—a +16% relative lift in phone calls.

HPT-24 (DE enroll button). Outcome marked as "loser", p=0.18 on the primary metric (PCV). But the notes said "ship it." The actual story: the primary metric lost, but a secondary metric (enrollment starts) won, and the test shipped on that secondary result. I had never recorded the secondary metric in the schema, so the stored record read as a plain failure and hid the fact that the shipping decision rested on a metric chosen after the data came in.

After auditing all 30, I found 8 tests with missing or incorrect raw numbers. Eight.

The lesson is uncomfortable: if your testing platform stores summary stats and not raw events, you have no way to detect this class of error. The fix isn't more rigor at logging time—it's a periodic raw-data audit against an external source of truth (HeyMarvin, in my case).
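A periodic audit like this can start with mechanical checks on the stored summary rows, before any comparison against the external source. The sketch below is a minimal illustration, not the program's actual schema: `StoredArm`, `integrity_flags`, and the field names are hypothetical. It flags the two failure patterns described above (identical arms, zero traffic) plus one more obvious impossibility (more conversions than traffic).

```python
from dataclasses import dataclass


@dataclass
class StoredArm:
    """Summary numbers for one arm as stored in the test database (hypothetical shape)."""
    traffic: int
    conversions: int


def integrity_flags(control: StoredArm, variant: StoredArm) -> list[str]:
    """Return human-readable warnings for summary rows that cannot be right."""
    flags = []
    for name, arm in (("control", control), ("variant", variant)):
        if arm.traffic <= 0:
            flags.append(f"{name}: traffic is zero or missing")
        elif arm.conversions > arm.traffic:
            flags.append(f"{name}: more conversions than traffic")
    # Two arms with exactly the same counts usually mean a copy/paste or ETL error.
    if control == variant:
        flags.append("control and variant are identical")
    return flags


def matches_source(stored: StoredArm, source: StoredArm) -> bool:
    """True when the stored row agrees with the external source of truth."""
    return stored == source


# The HPT-66 pattern: both arms stored as 700 conversions out of 84,500.
print(integrity_flags(StoredArm(84_500, 700), StoredArm(84_500, 700)))
# → ['control and variant are identical']
```

Checks like these only catch rows that are internally impossible or suspicious. A row that is plausible but wrong still needs `matches_source` against raw counts pulled from the external system.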

Finding 2: The Decision Rule Trap

I used a single decision rule: p<0.05, assume it's a winner.

Speero's p-value is one-tailed and assumes fixed-horizon testing (you stop at the sample size your power calculation set in advance, not when the calendar says stop). But I'd been interpreting it like a two-tailed sequential p-value and checking results before the planned end. For about 23% of tests, that peeking inflated the false positive rate from 5% to roughly 25%.
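The inflation from peeking is easy to see in simulation. The sketch below runs A/A tests (no true difference) and compares one look at the end against stopping at the first of several interim looks with p<0.05. Everything in it is an assumption for illustration: the function names, the 10% base rate, the number of looks, and the two-sided pooled z-test, which is not necessarily what Speero computes. It does not try to reproduce the ~25% figure, since the exact inflation depends on how often you look.

```python
import random
from statistics import NormalDist


def two_sided_p(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(looks: int, n_per_look: int, rate: float = 0.10,
                        sims: int = 1000, alpha: float = 0.05, seed: int = 7) -> float:
    """Share of simulated A/A tests declared 'significant' at any look."""
    rng = random.Random(seed)
    hits = 0
    for _sim in range(sims):
        conv_a = conv_b = n = 0
        for _look in range(looks):
            # Both arms share the same true rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(n_per_look))
            conv_b += sum(rng.random() < rate for _ in range(n_per_look))
            n += n_per_look
            if two_sided_p(conv_a, n, conv_b, n) < alpha:
                hits += 1
                break  # stop at the first significant look, as a peeker would
    return hits / sims
```

Calling `false_positive_rate(looks=1, n_per_look=1000)` and `false_positive_rate(looks=10, n_per_look=100)` uses the same total sample per arm. The first should land near the nominal 5%, and the second noticeably higher.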

Take HPT-65 (GME Pay Later button). Stored as "winner" with -8% lift and p=0.06. But the metric was "number of deposit holds"—lower is better. The test had a 94.9% probability that the variant reduced holds. That's a win. But I'd never marked the metric direction, so I'd read it as a loser.

The core problem: I made up the decision rule after seeing data. For HPT-24, I decided to ship on secondary metrics because the primary lost. For HPT-65, I had to invent a direction convention mid-audit. For every test, the decision logic was "what does p<0.05 tell me?" rather than "I pre-specified that I'd ship if P(B>A)>95% and Expected Loss<0.5pp."

Post-hoc decision rules are how you talk yourself into shipping anything. The cure is pre-registration: lock the rule in writing before the test launches.
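One way to make a pre-registered rule hard to bend is to store it as an immutable record and have the ship decision read only from that record. This is a minimal sketch under assumptions: `PreRegistration`, `improvement`, `decide`, and the field names are hypothetical, and the thresholds mirror the example rule quoted above (P(B>A)>95%, Expected Loss<0.5pp). Storing the metric direction also avoids having to invent a sign convention mid-audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the rule cannot be edited after it is written down
class PreRegistration:
    test_id: str
    primary_metric: str
    lower_is_better: bool        # metric direction, fixed before launch
    min_prob_better: float       # e.g. 0.95
    max_expected_loss_pp: float  # e.g. 0.5 percentage points


def improvement(plan: PreRegistration, raw_lift: float) -> float:
    """Re-sign a raw lift so that a positive value always means 'variant is better'."""
    return -raw_lift if plan.lower_is_better else raw_lift


def decide(plan: PreRegistration, prob_variant_better: float,
           expected_loss_pp: float) -> str:
    """Apply the locked rule; both inputs must already respect the metric direction."""
    if (prob_variant_better > plan.min_prob_better
            and expected_loss_pp < plan.max_expected_loss_pp):
        return "ship"
    return "keep control"


# Hypothetical test on a lower-is-better metric.
plan = PreRegistration("EXAMPLE-1", "deposit holds", True, 0.95, 0.5)
print(improvement(plan, -0.08))  # → 0.08
print(decide(plan, 0.97, 0.2))   # → ship
```

The point of the design is that `decide` takes no secondary metrics and no free-text notes, so there is nowhere to smuggle in a rule invented after the data arrives.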

Finding 3: Expected Loss Changes Decisions

For the 12 tests that were "inconclusive" (p=0.05 to 0.25), I calculated Expected Loss—the downside risk of shipping the variant. This is the second number you need alongside p-value.

Expected Loss = E[max(0, control − variant)] over the joint posterior. Plain English: across the entire distribution of plausible outcomes, how much conversion rate do I lose if I ship the variant and it turns out the control was actually better?

Example: HPT-91 (Reliant value prop CTA, p=0.27). The Bayesian posterior said 59% probability the variant beats control. Expected Loss if I shipped the variant: 0.3pp (absolute). Expected Loss if I kept control: 0.8pp. In expected value terms, shipping was better—even though p>0.05.
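Both numbers can be estimated by sampling from the two posteriors. The sketch below is a minimal Monte Carlo version, assuming independent Beta posteriors with a uniform Beta(1, 1) prior on each arm's conversion rate. That modeling choice, the function name, and the example counts are assumptions for illustration. The raw counts for HPT-91 are not given here, so it does not attempt to reproduce the 59% / 0.3pp / 0.8pp figures.

```python
import random


def posterior_summary(conv_a: int, n_a: int, conv_b: int, n_b: int,
                      draws: int = 20_000, seed: int = 11) -> dict:
    """Estimate P(variant > control) and the expected loss of each decision.

    a = control rate, b = variant rate, each drawn from its Beta posterior.
    """
    rng = random.Random(seed)
    wins = 0
    loss_ship = 0.0  # E[max(0, control - variant)]: cost of shipping a worse variant
    loss_keep = 0.0  # E[max(0, variant - control)]: cost of keeping a worse control
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
        loss_ship += max(0.0, a - b)
        loss_keep += max(0.0, b - a)
    return {
        "prob_variant_better": wins / draws,
        "loss_if_ship_pp": 100 * loss_ship / draws,  # percentage points
        "loss_if_keep_pp": 100 * loss_keep / draws,
    }


# Hypothetical counts: control 100/1,000 vs variant 110/1,000.
summary = posterior_summary(100, 1000, 110, 1000)
```

A useful sanity check is that `loss_if_keep_pp - loss_if_ship_pp` equals the posterior mean difference between variant and control, in percentage points. The two losses are the positive and negative parts of the same quantity.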

For two inconclusive tests, Expected Loss plus Bayesian probability suggested shipping, even though p>0.05. For three, it suggested keeping control even though p looked marginal. The five flipped decisions represent roughly $40K/month in re-allocated lift.

P-value tells you "is there an effect?" Expected Loss tells you "what's the cost of being wrong?" You need both.

Finding 4: Brand × Device Asymmetry

When I grouped wins by brand and device, the program-wide averages hid sharp asymmetries.

| Brand | Copy Win Rate (Mobile) | Copy Win Rate (Desktop) | Placement Win Rate |
|---|---|---|---|
| DE | 67% | 42% | 75% |
| GME | 37% | 63% | n/a (few tests) |
| Reliant | 25% | 14% | 88% |

The pattern: placement beats copy 2.1:1 overall, and copy results split by device. Mobile copy outperforms desktop copy for DE and Reliant, while GME runs the other way. On desktop, users see the full page and can scan CTAs. On mobile, CTA placement (floating, sticky, above the fold) matters more than wording because real estate is scarce.

I'd been treating "CTA testing" as a portfolio. Really, it's three separate problems: DE mobile copy, DE desktop placement, GME desktop copy. The same hypothesis applied to the wrong cell rarely wins.
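Splitting a test log into these cells takes only a few lines. The sketch below assumes each test is a dict with `brand`, `device`, `kind`, and `won` keys. That record shape and the sample rows are hypothetical, not the program's real data.

```python
from collections import defaultdict


def win_rates(tests: list[dict]) -> dict:
    """Win rate per (brand, device, kind) cell instead of one program-wide average."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [wins, total]
    for t in tests:
        cell = (t["brand"], t["device"], t["kind"])
        tally[cell][0] += int(t["won"])
        tally[cell][1] += 1
    return {cell: wins / total for cell, (wins, total) in tally.items()}


# Hypothetical rows for illustration only.
tests = [
    {"brand": "DE", "device": "mobile", "kind": "copy", "won": True},
    {"brand": "DE", "device": "mobile", "kind": "copy", "won": False},
    {"brand": "DE", "device": "desktop", "kind": "placement", "won": True},
    {"brand": "GME", "device": "desktop", "kind": "copy", "won": False},
]
rates = win_rates(tests)
print(rates[("DE", "mobile", "copy")])  # → 0.5
```

With only 30 tests, many cells will hold a handful of results, so the per-cell rates are better read as prompts for hypotheses than as stable estimates.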

Finding 5: The Tracking Validation Paradox

The tests with the biggest sample sizes—most traffic, longest duration—are the ones most likely to have tracking issues. HPT-66 (the identical control/variant issue) was run by the integration team, had 84,500 traffic, and should have been the most reliable. It was the least.

Larger tests touch more systems (analytics, CRM, payment pipeline, email). The bigger the test, the more moving parts, the higher the chance of misconfiguration. Yet larger tests feel safer because the numbers are bigger.

The mitigation is unsexy: run an AA test for the first 3-5 days of every major test. If the two "control" buckets diverge by more than chance, you have a tracking problem. Catch it before it taints the variant comparison.
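"Diverge by more than chance" can be made concrete with a two-proportion test between the two control buckets. The sketch below is one way to do it, assuming a pooled two-sided z-test and a strict alert threshold of 0.01. The function name and that threshold are illustrative choices, not a standard.

```python
from statistics import NormalDist


def aa_check(conv_1: int, n_1: int, conv_2: int, n_2: int,
             alpha: float = 0.01) -> dict:
    """Compare two buckets that received the identical experience."""
    pooled = (conv_1 + conv_2) / (n_1 + n_2)
    se = (pooled * (1 - pooled) * (1 / n_1 + 1 / n_2)) ** 0.5
    if se == 0:
        # No conversions (or all conversions) in both buckets: nothing to compare.
        return {"p_value": 1.0, "suspect_tracking": False}
    z = (conv_2 / n_2 - conv_1 / n_1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"p_value": p_value, "suspect_tracking": p_value < alpha}
```

Even with healthy tracking, about 1% of AA periods will trip a 0.01 threshold by chance. A flag is a reason to inspect the pipeline, not proof that it is broken.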

What I'm Changing

  1. Pre-register before launch. Every test now locks: primary metric, decision rule (Bayesian threshold + Expected Loss bound), guardrail metrics, kill criteria. In writing. No exceptions.
  2. Use two decision numbers. P-value alone is not enough. I now use Bayesian probability + Expected Loss. For inconclusive tests, Expected Loss decides the tiebreaker.
  3. Stratify by brand and device. CTA copy testing applies primarily to mobile. Desktop gets placement and layout tests. Within each brand, copy angles are tailored to the mental model (urgency for DE, confidence for GME, reliability for Reliant).
  4. Audit tracking before launch. Before a test goes live, I validate: sample ratio (is the 50/50 split actually 50/50?), event sequence (do enrollments actually follow the CTA click?), baseline match (does the test-period control match the pre-test baseline?). Three spot checks before launch, plus an AA period.
  5. Backfill phone revenue attribution. HPT-26 proved phone CTAs drive calls. But phone calls weren't in my revenue dashboard. I'm building a phone-call-to-revenue mapping so phone tests have a real KPI, not a proxy.
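Two of the tracking checks in item 4 are simple enough to automate. The sketch below covers the sample ratio check (a chi-square goodness-of-fit test with one degree of freedom) and the event sequence check. The function names, the `(session_id, timestamp, event_name)` tuple shape, and the event names `cta_click` and `enrollment` are all hypothetical. The baseline match check is not shown.

```python
from statistics import NormalDist


def srm_p_value(n_control: int, n_variant: int, expected_share: float = 0.5) -> float:
    """P-value for a sample ratio mismatch against the intended split."""
    total = n_control + n_variant
    exp_c = total * expected_share
    exp_v = total * (1 - expected_share)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # With 1 degree of freedom, P(X > chi2) equals P(|Z| > sqrt(chi2)).
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))


def enrollments_without_click(events: list[tuple]) -> list:
    """Sessions where an enrollment was logged with no earlier CTA click."""
    clicked = set()
    orphans = []
    for session, _ts, name in sorted(events, key=lambda e: e[1]):
        if name == "cta_click":
            clicked.add(session)
        elif name == "enrollment" and session not in clicked:
            orphans.append(session)
    return orphans
```

A very small `srm_p_value` (a threshold around 0.001 is commonly used) suggests the split itself is broken, in which case no downstream comparison can be trusted. A non-empty list from `enrollments_without_click` points to events firing out of order or being attributed to the wrong session.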

The Honest Takeaway

I shipped one test I shouldn't have (HPT-24, secondary metric). I wrote off one test that was actually a win (HPT-65, misread the direction). I misinterpreted five others because of data integrity issues.

Over three years, if 30% of tests were misinterpreted, and each test influenced go-forward strategy, that's real money. The program still worked—directional signal is hard to hide. But precision suffered.

The good news: all five issues are fixable with process, not code. Pre-registration, expected loss, brand stratification, and tracking validation are hygiene, not research. They cost nothing except discipline.

If you're running an A/B testing program, I'd recommend the same audit. You'll probably find similar gaps. And fixing them is worth more than running 20 more tests on top of broken methodology.

About the author

GrowthLayer

GrowthLayer is the system of record for experimentation knowledge. We help growth teams capture, organize, and learn from every A/B test they run.
