AA Testing: The Step Most Teams Skip (And Why It Ruins Their Results)
---
By Atticus Li -- Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com
I have a confession: I once shipped a test result that was completely wrong. Not slightly wrong. The variant we shipped actually performed worse than the control. We just did not know it because we never verified that our measurement system was working correctly.
That was five years ago. Since then, I have never launched a major experiment without running an AA test first. At NRG Energy, where we run 100+ experiments per year, AA testing is not optional. It is the first step in the PRISM framework for a reason.
If you are not running AA tests, you are building your experimentation program on sand.
What Is an AA Test?
An AA test is an experiment where both groups see the exact same experience. No changes. No variants. The "A" group and the "A" group get identical content, identical functionality, identical everything.
The purpose is to verify that your experimentation platform, your analytics tracking, and your randomization are working correctly. If you split traffic 50/50 and both groups see the same thing, you should see no statistically significant difference in any metric. If you do see a difference, something is broken -- and that something would have contaminated every real test you run.
Think of it like zeroing a scale before you weigh something. If the scale reads 3 pounds with nothing on it, every measurement you take will be off by 3 pounds.
Why Teams Skip It
I get it. AA tests feel wasteful. You are using real traffic and real time to test... nothing. Stakeholders see it as a delay. Product managers want to know why you are "wasting a sprint" before running the real experiment.
Here is what I tell them: an AA test takes 1-2 weeks. A false positive from a broken measurement system wastes months of development time shipping a change that does not help -- or worse, actively hurts the business. The AA test is insurance.
The other reason teams skip it: they do not know what to look for when the results come in. So let me walk through exactly that.
What You Are Looking For
1. Sample Ratio Mismatch (SRM)
This is the most important check. If you split traffic 50/50, you should get approximately 50/50. Not exactly -- randomization introduces natural variance -- but close. A chi-squared goodness-of-fit test on the observed counts against the planned allocation tells you whether the deviation is larger than chance alone should produce.
At NRG, we flag any split where the SRM test returns p < 0.01. In practice, this means if you expect 10,000 visitors per group and one group has 10,400 while the other has 9,600, that is a red flag worth investigating.
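For illustration, here is a minimal sketch of that SRM check in Python with SciPy. The visitor counts are the hypothetical numbers from the example above, and the p < 0.01 threshold matches the one described here.

```python
# Minimal SRM check: chi-squared goodness-of-fit of the observed split
# against the planned 50/50 allocation. Counts are hypothetical placeholders.
from scipy.stats import chisquare

observed = [10_400, 9_600]          # visitors assigned to each identical arm
expected = [sum(observed) / 2] * 2  # what a true 50/50 split would give

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p = {p_value:.2e}")

if p_value < 0.01:                  # flag threshold used in this article
    print("Possible sample ratio mismatch -- investigate before running real tests.")
```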
Common causes of SRM:
- Bot filtering that differs by group. If your bot detection removes different percentages of traffic from each group, your sample sizes will diverge.
- Redirect latency. If one variant involves a redirect and some users leave before the redirect completes, they drop out of one group but not the other.
- Caching issues. CDN caching can serve one variant more than the other if cache keys are not properly configured.
- User-level vs session-level assignment conflicts. If your tool assigns at the user level but your analytics count sessions, users with multiple sessions can inflate one side.
2. Metric Parity
Check your primary metrics between the two identical groups. Conversion rate, click-through rate, revenue per visitor -- whatever you plan to measure in real tests. They should be statistically indistinguishable.
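As one way to run that comparison for a binary metric like conversion rate, here is a minimal sketch using a chi-squared test on a 2x2 contingency table; the conversion and visitor counts are hypothetical placeholders.

```python
# Parity check on conversion rate between two identical AA groups.
# Counts are hypothetical placeholders -- substitute your own.
from scipy.stats import chi2_contingency

conversions = [512, 498]        # converters in group A1 and group A2
visitors = [10_000, 10_000]     # total visitors per group

table = [
    [conversions[0], visitors[0] - conversions[0]],  # A1: converted / did not convert
    [conversions[1], visitors[1] - conversions[1]],  # A2: converted / did not convert
]

chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Two identical experiences differ on conversion -- the measurement is suspect.")
```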
If your AA test shows a "significant" difference in conversion rate between two identical experiences, your measurement is broken. Period. Common causes:
- Event tracking firing inconsistently. The experiment script loads at different times for different groups, causing events to fire for some users but not others.
- Third-party scripts interfering. Ad blockers, consent management platforms, or other scripts that interact with your testing tool can cause measurement differences.
- Goal definition mismatches. Your experimentation platform and your analytics tool might define "conversion" slightly differently, and the discrepancy might correlate with group assignment.
3. Segment Consistency
Break your AA test results down by device type, browser, geography, and any other segments you plan to analyze. Each segment should also show no significant difference. If mobile shows a 15% conversion difference between two identical experiences, you have a mobile-specific instrumentation problem.
This is the check most teams miss even when they do run AA tests. The top-level numbers might look fine, but segment-level issues can hide underneath.
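Here is a minimal sketch of what that segment-level sweep might look like with pandas, assuming a user-level DataFrame with hypothetical "group", "device", and "converted" columns; it simply repeats the parity check within each segment.

```python
# Segment-level parity sweep: repeat the conversion parity check per segment.
# Assumes a user-level DataFrame with hypothetical columns:
#   "group" (A1/A2), a segment column such as "device", and "converted" (0/1).
import pandas as pd
from scipy.stats import chi2_contingency

def segment_parity(df: pd.DataFrame, segment_col: str) -> pd.DataFrame:
    rows = []
    for segment, part in df.groupby(segment_col):
        # 2x2 table of converted vs. not converted for each identical group
        table = pd.crosstab(part["group"], part["converted"])
        if table.shape == (2, 2):
            _, p_value, _, _ = chi2_contingency(table)
            rows.append({"segment": segment, "p_value": p_value, "users": len(part)})
    return pd.DataFrame(rows).sort_values("p_value")

# Example: flag any device segment where the two identical groups look different
# results = segment_parity(users_df, "device")
# print(results[results["p_value"] < 0.01])
```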
How to Run an AA Test: Step by Step
Here is the exact process I follow, which is part of the PRISM framework's calibration step:
Step 1: Set up the experiment exactly as you would a real test. Same platform, same targeting, same traffic allocation. Create two variants, but make them identical. Do not skip the variant creation step by just splitting traffic in your analytics tool -- the point is to test the full experimentation pipeline.
Step 2: Define your metrics in advance. List every metric you plan to measure in upcoming tests. You want to verify all of them, not just your primary KPI.
Step 3: Run for at least one full business cycle. At minimum, 7 days to capture weekday/weekend variation. For B2B sites with monthly cycles, consider running longer. At NRG, we typically run AA tests for 14 days.
Step 4: Analyze with the same rigor as a real test. Do not just glance at the dashboard. Run your full analysis pipeline: check SRM, check primary metrics, check secondary metrics, check segments. Document everything.
Step 5: If something fails, diagnose and fix before proceeding. An AA test failure is not a minor inconvenience. It means your experimentation infrastructure has a problem. Fix it. Run the AA test again. Do not proceed to real tests until you get a clean AA result.
Real Examples of What We Have Caught
The consent management platform incident. An AA test at NRG showed a 7% conversion rate difference between identical experiences. Investigation revealed that our cookie consent banner was interacting with the Optimizely snippet differently depending on group assignment. Users in one group were more likely to get a delayed script load, which meant their conversion events sometimes did not fire. Without the AA test, we would have launched real experiments with systematically biased measurement.
The mobile Safari caching issue. An AA test showed perfect parity on desktop but a significant difference on iOS Safari. The cause: aggressive caching was causing some mobile users to see a stale version of the page that did not include the experiment assignment script. Those users were being counted in the "control" bucket regardless of their actual assignment.
The analytics filter problem. An AA test showed normal traffic splits but a 12% difference in revenue per visitor. The cause: an analytics filter was excluding certain transaction types, and the exclusion correlated with experiment group assignment due to how the data pipeline processed events. The transactions were real -- our measurement just was not counting all of them consistently.
Every one of these would have produced false results in real experiments. Every one was invisible until we looked with an AA test.
Quick Checklist for Running an AA Test
Use this before every new testing initiative or after any infrastructure change:
- [ ] Experiment set up with identical variants (not just traffic splitting)
- [ ] All planned metrics defined and tracking verified
- [ ] Traffic allocation set to match planned real test allocation
- [ ] Run duration: minimum 7 days, ideally 14
- [ ] SRM check performed (chi-squared test, p < 0.01 threshold)
- [ ] Primary metric parity confirmed (no significant difference)
- [ ] Secondary metrics checked
- [ ] Segment-level analysis completed (device, browser, geography)
- [ ] Results documented
- [ ] Any failures diagnosed, fixed, and re-tested
When to Re-Run AA Tests
You do not need to run one before every single test. But you should re-run after:
- Any change to your tag management or analytics setup
- Platform updates to your experimentation tool
- Changes to your consent management or privacy setup
- New third-party scripts added to the page
- Infrastructure changes (CDN, hosting, edge workers)
- A significant period without testing (3+ months)
At NRG, we run quarterly AA tests on our highest-traffic properties as a preventive measure. It takes minimal effort and has caught issues we did not know existed.
The Bottom Line
AA testing is not glamorous. It does not produce exciting results. No one writes case studies about their AA test.
But it is the single most important step in building a trustworthy experimentation program. Every test result you report, every decision you make based on data, every change you ship to production -- all of it depends on your measurement system working correctly.
Verify it. Then verify it again. Your future self will thank you.
Atticus Li leads enterprise experimentation at NRG Energy. AA testing is a core component of his PRISM framework. Learn more at atticusli.com.