
Your A/B Test Stats Are Probably Wrong: Why Automated Recomputation Matters

When we audited our A/B test database, stored p-values disagreed with recomputed values in a significant share of tests. Here's why that happens — and how to prevent it.

Atticus Li
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
12 min read

Editorial disclosure

This article lives on the canonical GrowthLayer blog path for indexing consistency. Review rules, sourcing rules, and update rules are documented in our editorial policy and methodology.

Fortune 150 experimentation lead · 100+ experiments / year · Creator of the PRISM Method
A/B Testing · Experimentation Strategy · Statistical Methods · CRO Methodology · Experimentation at Scale

When I ran the first complete audit of our testing program's statistics, I expected to find some discrepancies. Minor rounding differences from different calculators. A few tests where the data had been entered in slightly different formats. The kind of small inconsistencies that are inevitable when humans are manually logging numbers across a multi-year program.

What I found was significantly worse than that.

Stored p-values disagreed with recomputed values in a meaningful share of all tests in the database. Among those discrepancies, a substantial number were large enough to change the outcome classification — tests labeled as statistically significant winners that, when recomputed from the raw visitor and conversion counts, did not meet the program's stated 95% confidence threshold.

The implications of that finding are not academic. Every one of those mislabeled tests represented a decision — to implement, to brief a follow-up, to cite as evidence for a behavioral hypothesis — that was made on the basis of statistics that were wrong. Some of those decisions had already been implemented in the product. Some had informed subsequent test designs. The error had propagated.

This article is about how that happens and what the only reliable fix looks like.

The Manual Entry Problem

The most common source of statistical errors in testing databases is also the most obvious one: manual entry of computed statistics is error-prone, and the errors are invisible unless the underlying data is also stored.

The typical workflow in a testing program that does not have automated data capture looks like this: a test concludes in the testing platform. An analyst exports the results. The analyst runs the significance calculation — sometimes in the platform's built-in calculator, sometimes in a spreadsheet, sometimes in a dedicated significance tool. The analyst copies the key statistics — p-value, confidence interval, observed lift, sample sizes — into the program's documentation system. The test is classified and filed.

Every step in that chain is a potential point of failure.

The export may contain more than one version of the data — pre-QA, post-QA, segmented by device — and it may not be obvious which version was used. The significance calculation may use a test type — chi-square, Z-test, t-test, one-tailed or two-tailed — that differs from the program standard, and that choice may not be documented. The copy step may introduce transposition errors. And if the input data is later revised — a quality assurance check removes bot traffic after the fact, for example — the stored statistics may not be updated to reflect the revision.

None of these failure modes require carelessness. They are inherent in manual multi-step workflows. The question is not whether your manually entered statistics contain errors — they almost certainly do — but whether your program has any mechanism to detect and correct those errors before they become the official record of what happened.

Most programs have no such mechanism. The stored statistic is the record. Nobody checks it.

The Three Most Common Statistical Errors We Found

When I examined the discrepancies in our database systematically, they sorted into three primary categories, each with a distinct cause and a distinct failure signature.

Column Swaps

The single most common source of large discrepancies was a data migration error: control and variant data had been swapped when test records were imported or copied from a legacy system.

The signature of a column swap is unmistakable once you know to look for it: the control shows higher conversion counts than its traffic would suggest and the variant shows lower counts, or vice versa. When you recompute the statistics with the columns corrected, the direction of the result may not change — but the effect size changes, and the computed significance may change substantially.

In our case, several tests had records where the control had a higher observed conversion rate than the variant, the test was labeled as a variant win, and the stored p-value showed high confidence in the variant's superiority. The only way that combination of values is coherent is if the columns were swapped — the "variant" data was actually the control's performance. When I corrected the column assignment and recomputed, several of these tests were wins in the opposite direction from what the record showed. The variant that had been implemented as the winner was, in fact, the control.

These tests had been cited in pattern analyses. One had been used to justify a design direction in a subsequent test brief. The mislabeling had consequences.

One-Tailed vs. Two-Tailed Disagreements

The second category of discrepancy was more subtle: statistical tests that had been run one-tailed rather than two-tailed, or the reverse.

A one-tailed Z-test answers the question: "Is the variant significantly better than the control?" A two-tailed test answers: "Is there a significant difference between the variant and the control in either direction?" For the same observed data, when the observed effect is in the hypothesized direction, a one-tailed test produces a p-value that is exactly half the two-tailed p-value, because it ignores the probability of an equally extreme result in the opposite direction.
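To make that relationship concrete, here is a minimal sketch of a two-proportion Z-test computed both ways, using only Python's standard library. The visitor and conversion counts are hypothetical and purely illustrative.

```python
# Minimal sketch: the same observed counts evaluated one-tailed and two-tailed.
# The counts are hypothetical; NormalDist is from the Python standard library.
from statistics import NormalDist

def z_statistic(visitors_a, conversions_a, visitors_b, conversions_b):
    """Pooled Z statistic for the difference between two conversion rates."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    return (p_b - p_a) / se

z = z_statistic(10_000, 500, 10_000, 560)          # control vs. variant (hypothetical)
p_one_tailed = 1 - NormalDist().cdf(z)             # "is the variant better?"
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))  # "is there a difference either way?"

print(f"z = {z:.3f}  one-tailed p = {p_one_tailed:.4f}  two-tailed p = {p_two_tailed:.4f}")
# When the observed lift is in the hypothesized direction, the one-tailed
# p-value is exactly half the two-tailed p-value.
```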

Several testing platforms use a one-tailed test as their default significance calculation. When analysts copy p-values from these platforms and the program standard specifies a two-tailed test, the stored p-values are systematically too low — inflating confidence in the results.

In our database, we found tests where a stored p-value of 0.04, apparently significant at the 95% threshold, recomputed to approximately 0.08 using a two-tailed test on the same data. At the program's stated threshold, 0.08 is not significant. The test would not have been classified as a win if the recomputed value had been used.

These discrepancies were not the result of analytical dishonesty. The analysts who recorded the statistics had used the numbers their testing platform reported. They did not know — or did not check — whether the platform's default test type matched the program standard. The platform's documentation was not prominent about this choice, and the program standard document did not specify the test type clearly enough for it to function as an operational check.

P-Values That Were Never Updated After Data Revisions

The third category was the least visible and the most systemic: p-values that had been computed correctly at the time of entry but were never updated when the underlying data was subsequently revised.

The most common trigger for data revision is post-hoc quality assurance: bot traffic removal, session deduplication, exclusion of internal team visits. Testing platforms typically allow retroactive filtering of results, and analysts sometimes apply these filters after the initial statistical analysis. When that happens, the visitor and conversion counts in the record should be updated — and the stored p-value should be recomputed from the new counts.

In practice, the data was often revised and the p-value was not. The record showed updated traffic numbers but a p-value computed from the original, unfiltered data. In cases where the quality assurance removal was significant — removing a large volume of bot traffic, for example — the updated counts produced substantially different statistics. Tests that appeared significant based on the original data were not significant after quality filtering was applied.

The only way to detect this class of error is to recompute the significance from the stored visitor and conversion counts and compare the result to the stored p-value. If they disagree beyond a small tolerance for rounding, the record has a problem that needs investigation.
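As a sketch of what that check can look like in practice, the following recomputes a two-tailed p-value from the stored counts and flags any disagreement beyond a small tolerance. The field names and the tolerance value are assumptions chosen for the example, not a prescribed schema.

```python
# Illustrative audit check: recompute from stored counts and compare to the
# stored p-value. Field names and tolerance are assumptions for this sketch.
from statistics import NormalDist

def two_tailed_p(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-tailed Z-test for the difference between two conversion rates."""
    p_a, p_b = conversions_a / visitors_a, conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs((p_b - p_a) / se)))

def check_stored_p(record, tolerance=0.005):
    """Return a warning string if the stored p-value disagrees with the recomputed one."""
    recomputed = two_tailed_p(record["control_visitors"], record["control_conversions"],
                              record["variant_visitors"], record["variant_conversions"])
    if abs(recomputed - record["stored_p_value"]) > tolerance:
        return (f"stored p = {record['stored_p_value']:.4f}, "
                f"recomputed p = {recomputed:.4f} -- investigate")
    return None
```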

Key Takeaway: The three most common sources of statistical errors in testing databases are column swaps from data migrations, one-tailed versus two-tailed test type mismatches, and p-values that were not updated after post-hoc data revisions. Each class of error is invisible without recomputation. Each can change outcome classifications.

The "Winners" With Zero Actual Traffic Data

Before I describe the recomputation approach, I want to describe one category of error that recomputation cannot fix, because it surfaces a different class of problem: tests with zero underlying data.

In our database, we had records that described clear wins — narrative outcome descriptions, confident implementation decisions, citations in subsequent briefs — that contained no statistical data at all. No visitor counts. No conversion counts. No p-values. Just text: "Variant B showed significant improvement in enrollment rate."

These records had been imported from presentation slides that an analyst had produced as meeting summaries for stakeholder reviews. The slides summarized results in narrative form because the audience was not technical. The summaries were then imported as test records because they described outcomes, and the program's documentation practice at the time was to capture every described outcome in the database.

The records looked like test records. They were not. They were descriptions of what someone had reported in a meeting, with no link to the underlying data.

Several of these narrative imports were in the program's official list of validated wins. When I removed them, the win count dropped and, with it, the apparent success rate of the program.

What makes this class of error especially important is the confidence it generates. A record that says "Variant B showed significant improvement" reads the same as a record with a stored p-value of 0.03. The confidence that a team member draws from that record — in a subsequent discussion, when proposing a follow-up test, when building a best practices document — is not calibrated to the absence of data behind it. The narrative description creates the impression of evidence where there is none.

Recomputation addresses this by making the absence of data visible. If a record contains no visitor counts and no conversion counts, there is nothing to recompute — and that gap is precisely the flag. A knowledge base that stores only computed statistics with no underlying raw data cannot catch this problem. One that requires raw inputs and recomputes from them makes the gap unmissable.

Why Storing Raw Numbers Matters More Than Storing Computed Statistics

The core insight behind automated recomputation comes down to what the minimum viable, verifiable record for a test looks like.

If you store only the p-value, you cannot verify it.

If you store the p-value and the visitor counts and conversion counts for each variant, you can verify it at any time by recomputing.

This seems obvious, but most testing databases — spreadsheets, Confluence wikis, custom documents — are built around storing the results that analysts report, not the inputs that results are derived from. The reason is that the results feel like the important part. The analyst ran the test, the platform reported the statistics, the important thing is to record what the test found. The raw numbers feel like implementation detail.

They are not implementation detail. They are the audit trail. They are the difference between a testing database that produces verifiable records and one that produces assertions. A database of assertions is only as accurate as the original reporting, and the original reporting — as described above — is systematically error-prone in ways that are invisible unless the assertions can be checked.

Storing raw visitor and conversion counts does not add much to the burden of documentation. Most analysts have this data available when they are writing up results. The friction is minimal. The return — a database where every statistical claim can be verified and any future error can be detected — is substantial.
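As an illustration of how small that burden is, here is a sketch of what a minimal verifiable record could look like: the raw counts are mandatory, the reported statistics are optional, and a record without raw counts is immediately identifiable. The field names are hypothetical, chosen only for the example.

```python
# Sketch of a minimal verifiable test record. Field names are illustrative
# assumptions, not a prescribed schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestRecord:
    test_name: str
    control_visitors: int
    control_conversions: int
    variant_visitors: int
    variant_conversions: int
    stored_p_value: Optional[float] = None   # what the analyst reported, if anything
    outcome_label: Optional[str] = None      # e.g. "variant win", "inconclusive"

    def has_impossible_values(self) -> bool:
        """Conversions can never exceed visitors for either arm."""
        return (self.control_conversions > self.control_visitors
                or self.variant_conversions > self.variant_visitors)
```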

Key Takeaway: Storing computed statistics without raw inputs makes verification impossible. Storing raw visitor and conversion counts alongside computed statistics creates a permanent audit trail that can detect errors introduced at any point — at initial entry, during data migration, or after post-hoc revisions.

The Propagation Problem: Why Errors Are Not Self-Correcting

Statistical errors in testing databases do not stay isolated. They propagate.

A mislabeled winner gets implemented. The implementation becomes the reference point for the next test in that area. The next test brief cites the prior win as evidence for the mechanism hypothesis. The follow-up test produces a result. The follow-up result is interpreted in light of the prior win, which is still wrong.

Over time, the mislabeled win becomes part of the program's institutional knowledge. Team members learn that "social proof works on this page" — not because a well-designed test demonstrated it, but because an error-propagated record in the database says so. The error is invisible because the institutional knowledge it generated looks exactly like the institutional knowledge that valid results generate.

This propagation dynamic is why statistical integrity is not just a methodological concern. It is a strategic one. The model your team builds of what works for your users is constructed from your test history. If your test history contains systematic errors, your model is wrong — and your wrong model is directing your prioritization, your ideation, and your resource allocation.

Catching and correcting statistical errors is not cleaning up after the past. It is protecting the validity of every decision the program will make in the future.

Automated Recomputation: What It Does and Does Not Solve

Automated statistical recomputation from raw inputs is the only reliable mechanism for detecting the class of errors described above. Let me be precise about what it does and does not address.

It detects: column swaps, test type mismatches between stored and recomputed values, p-values that were not updated after data revisions, and records where the raw data is internally inconsistent — where conversions exceed visitors, for example.

It does not detect: test designs that were fundamentally flawed, outcomes that were mislabeled before entry (if both the raw data and the outcome label are wrong in the same direction), or narrative imports that contain no data at all. These require different checks — pre-launch documentation requirements for flawed designs, and mandatory raw data fields that cannot be bypassed for the narrative import problem.

The combination of automated recomputation and required raw data fields addresses the majority of the failure modes described in this article. It does not make a testing program error-free. It makes statistical errors visible at the point where they can be caught and corrected, rather than invisible until they have propagated into downstream decisions.

How GrowthLayer Handles Statistical Recomputation

When we built GrowthLayer's test logging flow, automated recomputation from raw inputs was a core design requirement rather than an optional feature. The reason should be clear from everything above: a testing database that stores only computed statistics is a database that cannot be audited. That is not a knowledge base; it is a record of claims.

When you log a test in GrowthLayer, you enter the raw visitor and conversion counts for each variant. The platform recomputes the significance from those inputs using a consistent statistical method — two-tailed Z-test for proportions, which is appropriate for the vast majority of conversion rate tests. If you also enter a stored p-value, the system flags any discrepancy beyond a small rounding tolerance.

Data quality flags are generated automatically for records where conversions exceed visitors, where the computed confidence level does not match the stored outcome classification, or where required fields are absent. These flags appear in the test record and in the program-level data quality dashboard.

The result is a database where every statistical claim is verifiable at the time it is entered, and where errors introduced during data migration or post-hoc revision are surfaced rather than silently absorbed.

That is not a complicated system. The core logic is a significance calculation applied consistently to every record. But the discipline of applying it consistently — and flagging every deviation — is what transforms a collection of test records into a database you can actually trust.

Running Your Own Audit

If you manage a testing program and you have not run a recomputation audit of your historical records, the instructions are straightforward.

Pull every test record that has both stored raw data (visitor and conversion counts) and a stored p-value or confidence level. For each record, recompute the significance from the raw data using your program's stated statistical method. Compare the recomputed value to the stored value. Any discrepancy larger than a rounding tolerance warrants investigation.

Separately, check every record for impossible values: conversion counts that exceed visitor counts for either variant. Check for records that have outcome classifications but no statistical data. These are the records most likely to represent narrative imports or migration errors.
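Here is a hedged sketch of those two structural checks, applied over a list of plain-dictionary records. The field names mirror the hypothetical schema used earlier, and an absent count is represented as None.

```python
# Illustrative structural checks over a list of record dicts: impossible values
# and outcome labels with no statistical data. Field names are assumptions.
def audit_structural_issues(records):
    issues = []
    for r in records:
        counts = [r.get("control_visitors"), r.get("control_conversions"),
                  r.get("variant_visitors"), r.get("variant_conversions")]
        if any(c is None for c in counts):
            # Likely a narrative import or migration error if an outcome exists.
            if r.get("outcome_label"):
                issues.append((r.get("test_name"), "outcome label but no raw data"))
            continue
        if (r["control_conversions"] > r["control_visitors"]
                or r["variant_conversions"] > r["variant_visitors"]):
            issues.append((r.get("test_name"), "conversions exceed visitors"))
    return issues
```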

The audit will take longer for larger programs, but the basic checks are mechanical. What you will find will depend on how your historical records were created, but based on the pattern across programs I have worked with and analyzed, you will find discrepancies.

The question is not whether they are there. The question is whether you catch them before they shape another cycle of decisions.

Conclusion

The test statistics in your database are probably wrong. Not all of them, not dramatically in most cases, but in a meaningful share of records and in ways that can change outcome classifications.

That is not a reflection on the skill of the analysts who ran those tests. It is a reflection on the structural characteristics of manual multi-step data entry, the inconsistency of statistical methods across platforms and analysts, and the absence of any systematic mechanism to catch and correct errors in most testing programs.

The fix is not complicated. Require raw visitor and conversion counts on every test record. Recompute the significance from those inputs automatically. Flag discrepancies. Make data quality visible in the program dashboard.

Those four steps turn a database of assertions into a database of verifiable claims. That is the minimum requirement for a testing program whose conclusions can be trusted — and whose institutional knowledge about what works for your users is actually accurate.

_GrowthLayer recomputes statistical significance from raw inputs on every test import, flags data quality issues automatically, and prevents mislabeled outcomes from contaminating your program's institutional knowledge. Start building a testing database you can trust._

Recompute it yourself

Validate any test result with the free Significance Calculator and Chi-Squared Test Calculator. Browse all 12 free A/B testing calculators.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
