
Sample Ratio Mismatch Nearly Invalidated Our Best Test: A Practical Guide to SRM Detection

Our biggest winner of the year (more than a 3x result on the primary metric) had SRM on desktop. We saved the test by detecting the mismatch early and segmenting by device. Here is the practical guide to SRM detection every CRO practitioner needs.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
10 min read


A/B Testing · Experimentation Strategy · Statistical Methods · CRO Methodology · Experimentation at Scale

The test looked too good to be true. A redesigned deposit completion flow was showing more than a 3x result on the primary metric. The team was ready to call it our biggest win of the year.

Then we ran the sample ratio mismatch check.

Desktop had SRM. The traffic split was off in a way that could not be explained by chance. If we had shipped that result without investigating, we would have made a major product decision based on corrupted data. Instead, we caught it early, segmented by device, confirmed the mobile data was clean, and saved the test.

Sample ratio mismatch is one of the most underdiagnosed validity threats in A/B testing. I have seen programs that run for years without a formal SRM detection process. This article covers what SRM is, the five cases I have encountered in my own testing program, and a practical framework for catching it before it corrupts your decisions.

What SRM Is and Why It Matters

Sample ratio mismatch occurs when the observed traffic split between your control and variant groups differs significantly from the intended split.

If you set up a 50/50 test and end up with 49,200 sessions in control and 51,800 in variant, that is a problem. The difference may look small in absolute terms, but it signals that something in the assignment or measurement process is not working as intended.

The reason SRM matters is subtle. When your traffic split is off, it almost certainly means the two groups are no longer comparable. The assignment process is selectively including or excluding certain types of users, and you cannot be sure which ones. Your test is no longer measuring the effect of your treatment. It is measuring a confounded mixture of your treatment and whatever selection bias the assignment process introduced.

This is why SRM does not just add noise to your results. It can produce directionally wrong conclusions. The variant that appears to win may actually be losing if the assignment process systematically routed lower-quality or higher-intent users into one group.

The standard detection method is a chi-square goodness-of-fit test comparing observed group sizes to expected group sizes. For a 50/50 split, the formula is:

chi-squared = sum of ((observed - expected)^2 / expected) for each group

At a threshold of p < 0.01 rather than the usual p < 0.05, you are looking for strong evidence of imbalance before flagging. The more conservative threshold reduces false positives in SRM detection while still catching genuine problems.
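To make that concrete, here is the check applied to the 49,200 / 51,800 example above (the arithmetic is worked out here for illustration):

Expected per group = 101,000 x 0.5 = 50,500
Chi-squared = ((49,200 - 50,500)^2 / 50,500) + ((51,800 - 50,500)^2 / 50,500) ≈ 33.5 + 33.5 = 66.9

With 1 degree of freedom, a chi-squared value of 66.9 corresponds to a p-value far below 0.001 -- that split cannot reasonably be attributed to chance.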

Key Takeaway: SRM does not just add noise -- it can make a losing test look like a winner. Any program running tests without SRM checks is operating without a critical validity guardrail.

Five Real SRM Cases and What We Did With Each

Case 1: The More-Than-3x Winner With Desktop SRM

This was our most consequential SRM discovery. A redesigned deposit completion flow showed a massive lift on the primary metric. But when we ran the chi-square test, desktop sessions had SRM at p = 0.003. The observed split on desktop was materially off from the intended 50/50 allocation.

Mobile, however, was clean. The chi-square on mobile sessions showed p = 0.34 -- well within normal variation.

Rather than throwing out the entire test, we segmented the analysis. Mobile results were treated as valid. Desktop results were marked as untrustworthy and excluded from the decision. The mobile-only result still showed a substantial lift, and we proceeded to ship based on that subset.

The cause turned out to be a redirect configuration issue on desktop browsers. A specific desktop browser type was not receiving the variant cookie correctly, causing it to fall back to control more often. This is one of the most common SRM causes in redirect-based testing architectures.

Case 2: The Form Chunking Test With Borderline SRM

A test breaking a long enrollment form into shorter sequential chunks showed borderline SRM at p = 0.01. That is exactly at our flagging threshold.

This case was instructive because it was not clearly invalid. The imbalance was small in absolute terms. We documented the SRM, noted it in our test record, and analyzed the data with appropriate skepticism. We did not call a definitive winner. Instead, we used the results as directional signal and scheduled a follow-up test with a cleaner implementation.

Borderline SRM cases like this are where judgment matters. A p-value of 0.01 does not mean the data is definitely corrupted. It means there is meaningful evidence of an imbalance that warrants caution. We treat these as "proceed with documented uncertainty" rather than "invalidate entirely."

Case 3: The Enrollment Steps Test -- Full Invalidation on Desktop

A test restructuring the steps in an enrollment flow showed SRM on desktop at p = 0.002. This was not borderline. The desktop data was flagged as untrustworthy and excluded.

Unlike Case 1, mobile data here was also not entirely clean -- the mobile p-value was 0.04. Not technically below our 0.01 threshold, but worth noting. We made the decision to invalidate the desktop data entirely and treat mobile with caution.

The lesson here was not just about the SRM itself but about the cause. The enrollment steps test had been implemented with a redirect on desktop (users were sent to a different URL for the variant) while mobile used an in-page variation. Redirect tests have a fundamentally higher SRM risk because the redirect introduces a second step in the assignment process where users can drop out or get assigned incorrectly.

Case 4: The Credit Check Test With Traffic Imbalance

A test modifying the credit check presentation showed borderline SRM at p = 0.02, combined with a raw traffic imbalance of approximately 7%. Both signals together elevated our concern.

When we see SRM alongside a visible traffic imbalance in the raw counts, the combination is more worrying than either signal alone. A 7% imbalance means roughly 7 out of every 100 users who should have been in the variant ended up elsewhere -- or vice versa. That is enough to meaningfully shift the composition of each group.

We investigated the implementation and found a cookie persistence issue. Users who cleared cookies partway through a session were sometimes being re-assigned to control. The fix was implemented, but the first three weeks of data were unusable.

Case 5: The Streamlined Enrollment Test -- Data Thrown Out

The most expensive SRM case in our program. A test of a streamlined enrollment flow launched with a redirect from an older URL. Within the first week, SRM was detected at p < 0.001.

The redirect implementation was causing a subset of users -- specifically those arriving via bookmarked URLs -- to land on the control version even when they had been assigned to the variant. The test was relaunched with a corrected implementation, but the first four weeks of data were discarded entirely.

Four weeks of traffic is a significant cost. The time lost delayed the decision by over a month. This case drove us to create an explicit pre-launch checklist for redirect tests that must be completed before any redirect-based test goes live.

Key Takeaway: Redirect tests are the single highest-risk implementation pattern for SRM. Every redirect test should be treated as an SRM risk until proven otherwise.

The Chi-Square Test in Practice

You do not need a statistics background to run an SRM check. The calculation is simple and can be done in a spreadsheet.

For a 50/50 split with N total sessions:

Expected control = N x 0.5
Expected variant = N x 0.5
Chi-squared = ((control_observed - control_expected)^2 / control_expected) + ((variant_observed - variant_expected)^2 / variant_expected)

Look up the chi-squared value against a chi-squared distribution with 1 degree of freedom. Or use an online SRM calculator -- several are available that take your observed counts and intended split as inputs and return a p-value directly.

Use p < 0.01 as your flagging threshold, not p < 0.05. The more conservative threshold is standard practice in SRM detection because the cost of a false positive (investigating a valid test) is much lower than the cost of a false negative (making decisions on corrupted data).
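If you prefer to script the check rather than use a spreadsheet or an online calculator, a minimal sketch in Python might look like this (it assumes scipy is available; the function name and inputs are illustrative, not tied to any specific testing platform):

```python
# Minimal SRM check sketch: chi-square goodness-of-fit on observed group sizes.
from scipy.stats import chisquare

def srm_check(observed_counts, intended_split, alpha=0.01):
    """Return (chi2, p_value, flagged) for an SRM check.

    observed_counts: sessions per group, e.g. [49200, 51800]
    intended_split:  intended allocation, e.g. [0.5, 0.5]
    alpha:           flagging threshold (p < 0.01, per the text above)
    """
    total = sum(observed_counts)
    expected = [total * share for share in intended_split]
    chi2, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return chi2, p_value, p_value < alpha

chi2, p, flagged = srm_check([49200, 51800], [0.5, 0.5])
print(f"chi2 = {chi2:.1f}, p = {p:.2g}, SRM flagged: {flagged}")
```

The same function handles uneven intended splits (for example a 90/10 holdback) by changing intended_split; the chi-square comparison against expected counts works the same way.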

Run the check separately for any segment you plan to analyze. If you intend to report desktop and mobile results separately, check for SRM in each segment independently. A test can be clean overall but show SRM within a specific device type, browser, or acquisition channel.

The "SRM on One Device" Pattern: When You Can Save the Test

Cases 1 and 3 illustrate a pattern worth naming explicitly: device-level SRM.

When SRM appears on one device type but not another, and the SRM has a plausible causal explanation tied to that device type (a redirect behavior, a cookie setting, a browser-specific implementation), you can often salvage the test.

The process is:

  1. Confirm the SRM is isolated to one segment (desktop, mobile, a specific browser)
  2. Identify a plausible implementation-based cause for the imbalance in that segment
  3. Confirm the other segment is clean with a separate chi-square test
  4. Analyze and report the clean segment's results as valid
  5. Document the SRM, its cause, and your segmentation decision in the test record

This approach does not apply when the SRM is global (affecting all sessions equally) or when you cannot identify a plausible causal explanation. If the SRM is unexplained, the test data should be treated with significant caution regardless of which segments look clean.
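Where the salvage path applies, steps 1 and 3 amount to running the same chi-square check independently within each segment. A small sketch (the segment counts here are made-up placeholders, not figures from the cases above):

```python
# Illustrative per-segment SRM check for a 50/50 test, flagged at p < 0.01.
from scipy.stats import chisquare

segment_counts = {
    "desktop": (24100, 26900),  # (control, variant) sessions -- hypothetical
    "mobile":  (25050, 24950),
}

for segment, (control, variant) in segment_counts.items():
    total = control + variant
    expected = [total * 0.5, total * 0.5]  # intended 50/50 split
    chi2, p = chisquare([control, variant], expected)
    status = "SRM flagged" if p < 0.01 else "clean"
    print(f"{segment}: chi2 = {chi2:.1f}, p = {p:.3g} -> {status}")
```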

I use GrowthLayer to track SRM findings alongside other test quality indicators. Having the SRM status, the detected p-value, and the resolution documented in the same place as the test results makes post-program audits much easier.

Building SRM Detection Into Your Monitoring Workflow

SRM checks should happen at three points:

At launch (first 48 hours). Run a preliminary SRM check within two days of launch. You will not have much statistical power yet, but extreme SRM from implementation bugs will often be visible early. The sooner you catch a launch bug, the less data you contaminate.

At the 25% sample mark. When you have accumulated approximately one quarter of your target sample size, run a formal SRM check. This is when you still have time to relaunch if needed without losing too much time or data.

At analysis. Always run SRM as part of your standard analysis checklist before reporting results. No test result should be reported without a documented SRM status.

Some testing programs automate these checks. If your testing platform surfaces SRM warnings, pay attention to them. But do not rely solely on platform-generated warnings -- understand the calculation well enough to run it yourself when something looks off.
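If you want each of those checkpoints to leave a documented SRM status on the test record, one lightweight approach is to log the result of the same check at every stage. A hypothetical sketch (the test name, checkpoint labels, and print-based logging are placeholders for whatever your program actually uses):

```python
# Hypothetical checkpoint logging: the same chi-square SRM check, recorded per stage.
from scipy.stats import chisquare

def log_srm_status(test_name, checkpoint, control_sessions, variant_sessions):
    total = control_sessions + variant_sessions
    chi2, p = chisquare([control_sessions, variant_sessions],
                        [total * 0.5, total * 0.5])  # intended 50/50 split
    status = "FLAGGED" if p < 0.01 else "ok"
    # In practice, write this to the test record rather than stdout.
    print(f"[{test_name}] SRM @ {checkpoint}: p = {p:.3g} ({status})")

log_srm_status("deposit-flow-redesign", "launch_48h", 5110, 4890)
log_srm_status("deposit-flow-redesign", "quarter_sample", 25400, 24600)
```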

Key Takeaway: SRM detection is not a one-time check at the end of a test. It should be a monitoring process that runs continuously from launch through analysis.

When SRM Means "Throw Out the Data" vs. "Segment and Analyze Carefully"

This is the judgment call that distinguishes experienced practitioners from those who apply rules mechanically.

Throw out the data (or at minimum, do not ship a decision) when:

  • SRM is global and affects all sessions
  • You cannot identify a plausible causal explanation
  • The imbalance is large (more than 5% of sessions in the wrong group)
  • The SRM is severe (p < 0.001)
  • The affected metric is your primary metric or a critical guardrail

Segment and analyze carefully when:

  • SRM is isolated to a specific, identifiable segment
  • You have a clear implementation-based explanation
  • The clean segment alone provides adequate statistical power
  • The imbalance is modest and the direction of bias can be reasoned about

Document your decision either way. The reasoning behind how you handled SRM is as important as the SRM detection itself. Future audits, leadership reviews, and post-program analyses all benefit from knowing not just that SRM occurred but how it was resolved.

The Redirect Test Rule

If I had to give one rule to a team starting a new testing program, it would be this: treat every redirect test as a category-one SRM risk.

Redirect tests reroute users from one URL to another based on their variant assignment. The problem is that the redirect itself is a second assignment event. Users can drop out at the redirect step, get cached versions, land on the wrong variant, or fail to redirect entirely. Each of those scenarios produces SRM.

The solution is not to avoid redirect tests entirely -- sometimes they are the only practical way to test a major page redesign or a fundamentally different URL structure. But when you run a redirect test, you should:

  • Build extra monitoring time into the launch window specifically for SRM checking
  • Confirm that your tracking fires after the redirect, not before it
  • Check SRM by traffic source, because redirect failures often affect specific channels (email, bookmarks, paid traffic with certain redirect parameters)
  • Have a relaunch plan ready if SRM is detected in the first week

GrowthLayer includes a test quality checklist that flags redirect tests as requiring additional SRM monitoring. Building this kind of structural reminder into your program infrastructure ensures that redirect risk is never something that falls through the cracks.

Conclusion: Make SRM Detection Automatic

Sample ratio mismatch is not a rare edge case. In a mature testing program running diverse implementation types across multiple device types and traffic channels, SRM will occur. The question is whether you catch it before it corrupts your decisions.

The five cases I have described represent a range from borderline concerns worth noting to severe imbalances that invalidated weeks of data. In each case, the cost of detecting SRM was minimal -- a chi-square calculation and a few hours of investigation. The cost of missing it would have been business decisions made on false premises.

Build SRM detection into your launch workflow, your mid-test monitoring, and your analysis checklist. Check by segment, not just overall. Know when to segment and analyze versus when to discard. And treat redirect tests with the extra vigilance they require.

If you want a structured way to log SRM findings alongside your test results, GrowthLayer was built for exactly this kind of test quality tracking. Sign up free and bring your next test into a workflow where data quality is a first-class concern, not an afterthought.

Run an SRM check

Detect SRM in your own tests with the free Sample Ratio Mismatch Calculator and Chi-Squared Test Calculator. Browse all 12 free A/B testing calculators.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
