The Desktop-Mobile Paradox: Why Your A/B Test Results Are Hiding in the Device Split
In a multi-brand enterprise testing audit, every device split told a different story than the aggregate. Form chunking: desktop posted a double-digit gain while mobile declined. Here's what your device split is hiding.
The most dangerous number in your A/B test results is the aggregate conversion rate. Not because it is wrong — it is accurate. But because it hides the test's real story behind a population average, and in that average, real wins get buried and real losses get papered over.
In an audit I ran across a multi-brand enterprise testing program, every single test showed a different result by device than it did in aggregate. Not some tests. Every test. In some cases the direction flipped: a test that looked flat or slightly negative in aggregate was actually a strong win for desktop users and a moderate loss for mobile users, and those two effects happened to cancel each other out when combined.
If your program reads aggregate results without breaking down device splits, you are not reading your results. You are reading the weighted average of two different populations that respond to changes in fundamentally different ways.
This article shows you the specific device-split patterns I found in those enterprises, explains why desktop and mobile users behave differently, and gives you a framework for using device splits to find wins your aggregate is hiding.
The Form Chunking Paradox: Desktop +13%, Mobile -5%, Aggregate -2%
The clearest example of a hidden device story came from a form chunking test. The hypothesis was straightforward: take a long single-page enrollment form and split it into multiple shorter steps. The logic was that a visually shorter form would feel less daunting and improve completion rates.
In aggregate, the test was a loser. Completion rates in the chunked variant were approximately 2% lower than the control, and the result was trending negative with enough confidence to call the test. The team filed it as a failure: "form chunking hurts enrollment."
That conclusion was wrong. When I pulled the device split, the picture was entirely different.
On desktop, the chunked form produced a 13% lift in completion rates. Users who could see the full original form on a desktop screen — with its visual length clearly visible before they started — were meaningfully more likely to complete the shorter-looking chunked variant.
On mobile, the same change produced a 5% decline. Mobile users were already completing the form field by field, scrolling as they went. For them, the multi-step structure added friction — specifically, additional page loads and transitions between steps — without providing the benefit of hiding visual length, because they never saw the full form length anyway.
The aggregate -2% result was a statistical artifact of these two opposite effects averaging out across a mixed device population. The test was not a failure. It was a desktop win and a mobile loss, with the device populations balanced in a way that made both effects invisible in the aggregate.
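To make that averaging concrete, here is a minimal sketch of how a +13% desktop lift and a -5% mobile decline can blend into roughly -2% in aggregate. The session shares and baseline completion rates below are assumptions chosen for illustration; only the per-device lifts come from the test described above.

```python
# A minimal sketch of how opposite per-device lifts blend into a small negative
# aggregate. The traffic mix and baseline rates are hypothetical illustrations,
# not figures from the audited program.

def blended_lift(segments):
    """segments: list of (session_share, baseline_rate, relative_lift) tuples."""
    control = sum(share * rate for share, rate, _ in segments)
    variant = sum(share * rate * (1 + lift) for share, rate, lift in segments)
    return variant / control - 1

segments = [
    (0.15, 0.060, 0.13),   # desktop: 15% of sessions, 6.0% baseline, +13% lift
    (0.85, 0.053, -0.05),  # mobile: 85% of sessions, 5.3% baseline, -5% decline
]

print(f"aggregate lift: {blended_lift(segments):+.1%}")  # roughly -2%
```

With a mix like this, the desktop win contributes too small a share of total conversions to offset the mobile decline, and the aggregate lands near -2% even though neither device actually experienced a -2% effect.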
This test represents a significant missed opportunity. A device-conditional experience — chunked form for desktop, single-page form for mobile — would have captured the desktop win while avoiding the mobile penalty. Instead, the test was killed as a loser.
Key Takeaway: Aggregate results can hide both wins and losses. A flat or slightly negative aggregate result is often two opposing effects canceling each other out across device populations. Always check the device split before concluding a test was a failure or a success.
Why Desktop and Mobile Users Have Fundamentally Different Interaction Patterns
The reason device splits tell different stories is not screen size per se — it is the interaction model, the consumption context, and the amount of the page users actually see.
Desktop users see more of the page at once. A desktop viewport at 1440px wide might show 80% of a typical form, a full plan comparison table, or a complete copy block in a single view. This creates a "browsability" effect: desktop users can visually scan the full page before interacting with any of it. Changes that affect the page's apparent complexity — form length, table structure, copy density — have immediate visual impact on desktop that mobile users simply do not experience the same way.
Mobile users see a narrow slice. On a 390px-wide mobile viewport, the same form might show three fields at a time. The same plan comparison table requires horizontal scrolling. The same copy block requires four or five scrolls to read in full. Mobile users are processing the page sequentially, not scanning it holistically. A change that affects visual complexity is largely invisible to them; a change that affects flow, step count, or interaction friction hits them immediately.
Desktop users read more. Session recording analysis across the program consistently showed that desktop users spent significantly longer reading copy, particularly in consideration-stage contexts. They read guarantees, terms, explanations, and supporting information. Mobile users in the same context spent less time per page and moved more quickly through or off the flow.
Mobile users are more likely to use the phone as a phone. In the tests that included phone CTAs, 88% of clicks came from mobile users and 12% from desktop users. This is not surprising — a user who is already on their phone, interacting with a mobile interface, faces zero friction in initiating a call. A desktop user who wants to call has to pick up a separate device.
These behavioral differences are stable and predictable. They mean that some categories of test changes will consistently perform better on one device than the other, regardless of the specific implementation.
The Satisfaction Guarantee: 10x Stronger on Desktop Than Mobile
A copy test that added a satisfaction guarantee statement to a plan page showed a clear example of the reading behavior gap between devices.
The guarantee was positioned near the primary CTA — a mid-length copy block of about 40 words explaining that users could cancel within a defined window if unsatisfied. The hypothesis was that it would reduce perceived commitment risk and increase conversion.
In aggregate, the test was a modest positive — small but directionally clear. When segmented by device, the desktop result was approximately 10x stronger than the mobile result. The lift on desktop was meaningful and statistically reliable. On mobile, the effect was present but small enough to be within noise.
The reason was reading. Desktop users were reading the guarantee. Mobile users were largely scrolling past it — not because they would not have valued the information, but because the guarantee was positioned in a way that was easy to skip on a mobile scroll and easy to read on a desktop view.
This test generated an important follow-up question: if the guarantee influenced desktop users who read it, what would happen if the mobile presentation was redesigned to make it harder to scroll past? A more prominent mobile treatment — a sticky banner, a modal trigger, a visually distinct card — might have produced a mobile lift closer to the desktop lift.
That test was never run. But the device split made the hypothesis obvious and specific. Without the device split, the aggregate result would have been noted as a modest positive and filed. The insight — that there was a potentially much larger win available if the mobile experience was redesigned to surface the guarantee more prominently — would have been invisible.
The Phone CTA Split: 88% Mobile, 12% Desktop
The phone CTA tests across three brands in the dataset showed the most dramatic device split of any category tested.
When a prominent phone number and call CTA was added to enrollment pages, click-through rates on the phone CTA broke down approximately 88% mobile and 12% desktop, consistently across all three implementations.
This is useful not just as a data point about phone CTA usage, but as a framework for thinking about what device-specific functionality means for your funnel. The phone CTA was not generating equal engagement across devices — it was effectively a mobile intervention. The desktop result existed, but it was minor.
For test design, this has a direct implication: if you are trying to measure the impact of a phone CTA on your conversion funnel, segmenting by device is not optional. An all-device result will be dominated by the mobile population (in most modern traffic mixes, mobile is 55-70% of sessions) but will obscure the very different story on desktop.
The converse is also true: if your test result on a phone CTA looks weak in aggregate, check whether desktop is dragging down a strong mobile result. A desktop user who clicks a phone CTA still has to pick up a separate device to make the call, so the click-to-call friction is much higher. Desktop aggregate results for phone CTAs will almost always understate the mobile value.
Key Takeaway: Phone CTAs are functionally mobile interventions. Pricing transparency, copy-heavy changes, and informational updates tend to have stronger effects on desktop. Form structure changes tend to hurt mobile more than desktop. These patterns are predictable enough to build into your test design.
The Pricing Transparency CTA: A Loss That Was Larger on Desktop
Not every device-split story is about finding a hidden win. Sometimes the split tells you that a loss is concentrated in one population, and that population is the one you care most about commercially.
A test that added pricing transparency language to a CTA button — replacing a generic "Get Started" with specific language about how pricing was structured — showed a negative result in aggregate. Both devices showed losses, but the desktop result was approximately -3% and the mobile result was approximately -0.7%.
The desktop user population in this funnel had significantly higher average contract value than the mobile population — they were more likely to be comparing plans thoroughly, more likely to be evaluating the commercial terms of the product, and more likely to be influenced by copy that specifically addressed pricing framing.
The aggregate result was negative, which would have correctly led the team to reject this change. But the device split told a more specific story: the pricing language was creating uncertainty for desktop users specifically, who were the highest-value cohort in the funnel. The mobile loss was small enough that the intervention might have been neutral or even worth running on mobile alone, but the desktop concentration of the loss was a clear signal about where the framing was landing badly.
This kind of directional specificity — "this change hurts desktop users most" — is actionable. It tells you something about the mechanism (desktop users reading carefully, encountering the pricing language, and finding it ambiguous or off-putting) that you can use to redesign the next test.
When to Run Device-Specific Tests vs All-Device Tests
The natural conclusion from the device-split patterns above might be: "We should always run device-specific tests." That is not quite right. The decision depends on traffic volumes, test complexity, and what question you are trying to answer.
Run all-device tests when:
- You need the combined traffic to reach significance in a reasonable timeframe. Splitting by device cuts your sample size for each segment, potentially doubling or tripling your required runtime (see the runtime sketch after these lists).
- You are running a strategic test where the business question is "does this work for our users" and you have sufficient traffic on each device to read the split as a post-hoc analysis.
- The change you are testing is unlikely to produce opposite effects by device: functional improvements, load time optimizations, accessibility fixes.
Run device-specific tests when:
- You have strong prior evidence (from previous device splits) that the effect will differ significantly by device.
- You are testing an interaction-model change (form structure, step count, navigation pattern) that is likely to hit mobile and desktop users differently.
- You are testing a mobile-specific surface (mobile navigation, sticky CTAs, touch-specific interactions) where desktop behavior is irrelevant.
- You have enough device-specific traffic to run concurrent tests on each population without excessive runtime.
The practical middle ground for most programs: run all-device tests, but make device segmentation a mandatory part of your results analysis, not an optional deep-dive. Build the device split into your results template so it is reviewed for every test, not only when results look surprising.
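To put a rough number on the runtime point from the first list, here is a sketch of per-arm sample size for a two-proportion test, evaluated on all traffic versus each device alone. The baseline rate, target lift, and daily session counts are hypothetical; the z-values correspond to roughly 95% confidence and 80% power.

```python
# Rough sketch: splitting by device stretches runtime because each segment has
# to reach significance on its own traffic. All inputs below are hypothetical.
from math import ceil

def sample_size_per_arm(p_base, rel_lift, alpha_z=1.96, power_z=0.84):
    """Approximate per-arm sample size to detect a relative lift on a proportion."""
    p_var = p_base * (1 + rel_lift)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((alpha_z + power_z) ** 2 * variance / (p_base - p_var) ** 2)

daily_sessions = {"all devices": 10_000, "desktop": 3_000, "mobile": 7_000}
n = sample_size_per_arm(p_base=0.05, rel_lift=0.10)
for segment, sessions in daily_sessions.items():
    days = ceil(2 * n / sessions)  # two arms share the segment's traffic
    print(f"{segment:12s} ~{n:,} per arm, ~{days} days at {sessions:,} sessions/day")
```

Under these assumptions the all-device test reads in about a week, while the desktop-only test takes roughly three weeks on the same lift, which is the trade-off the first list is pointing at.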
Key Takeaway: Device-specific tests are not always necessary, but device-split analysis of all-device tests is always necessary. The question "what did desktop do, and what did mobile do?" should be answered before any test is called a winner or a loser.
The Device-Conditional Approach: Different Experiences by Device
The form chunking paradox (a 13% desktop gain alongside a 5% mobile decline) points toward a strategic option that most testing programs underutilize: device-conditional experiences.
A device-conditional approach uses the same test infrastructure to serve different variants to desktop and mobile users. Desktop users get the chunked multi-step form; mobile users get the original single-page form. This is not a new A/B test — it is a targeted implementation based on test results.
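As a minimal sketch of what that targeted implementation can look like server-side, assuming a simple user-agent check stands in for whatever device detection your stack already uses; the function and form names are placeholders, not the production implementation.

```python
# Minimal sketch of a device-conditional rollout: the experience is chosen by
# device class rather than by random assignment. The user-agent check and the
# form names are placeholders, not the production implementation.

def select_enrollment_form(user_agent: str) -> str:
    mobile_markers = ("Mobi", "Android", "iPhone", "iPad")
    is_mobile = any(marker in user_agent for marker in mobile_markers)
    # Mobile keeps the original single-page form; desktop gets the chunked
    # multi-step form that won in the device split.
    return "single_page_form" if is_mobile else "chunked_multistep_form"
```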
This approach requires more development effort than a single-variant rollout. But in cases where device splits produce strongly opposite effects — and the form chunking case is a clear example — the value of the device-conditional approach is high. Deploying the chunked form universally based on the aggregate result would have been a losing decision. Deploying the single-page form universally would have left the desktop win on the table. The device-conditional approach captures both.
[GrowthLayer](https://growthlayer.app) treats device-conditional results as a specific outcome type in test documentation — separate from "winner" and "loser" — because they require a different kind of rollout decision than a straightforward win or loss.
How to Read Device Splits in Your Existing Results
If you have a backlog of completed tests without device split analysis, here is a practical approach to retroactive review:
Start with the tests that had flat or slightly negative aggregate results. These are the highest-probability candidates for hidden wins. A test with aggregate +8% lift probably has positive results on both devices; a test with aggregate -1% lift might be hiding a +5% desktop win and a -5% mobile loss.
For each test, pull the primary conversion metric segmented by desktop and mobile. You are looking for:
- Direction flip: Desktop positive, mobile negative, or vice versa. These are the most actionable device splits — they suggest a device-conditional rollout.
- Magnitude difference: Both positive, but desktop 5x stronger than mobile (as in the guarantee copy example). These suggest a mobile-specific optimization opportunity.
- Dominant device: One device driving nearly all of the aggregate result (as in the phone CTA case). These clarify where the test's actual mechanism was operating.
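A hedged sketch of what that retroactive review can look like over a results export. It assumes a flat CSV with test_id, device, control_rate, and variant_rate columns, and device values of "desktop" and "mobile"; those names are assumptions about your export format, not a GrowthLayer schema.

```python
# Retroactive device-split review: compute per-device lift and flag the
# patterns described above. Column and device names are assumptions.
import pandas as pd

def classify_split(desktop_lift: float, mobile_lift: float) -> str:
    if desktop_lift * mobile_lift < 0:
        return "direction flip"
    larger, smaller = sorted([abs(desktop_lift), abs(mobile_lift)], reverse=True)
    if smaller == 0 or larger / smaller >= 5:
        return "magnitude difference / dominant device"
    return "similar response"

results = pd.read_csv("completed_tests_by_device.csv")
results["lift"] = results["variant_rate"] / results["control_rate"] - 1
by_device = results.pivot(index="test_id", columns="device", values="lift")
by_device["pattern"] = [
    classify_split(row["desktop"], row["mobile"]) for _, row in by_device.iterrows()
]
print(by_device.sort_values("pattern"))
```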
Document what you find. Even retroactive device split analysis on completed tests builds a library of device-specific insights that will directly inform how you design future tests.
Why Aggregate Results Are Lying to You (and When They Are Not)
To be precise: aggregate results are not lying. They are accurately reporting the average treatment effect across your mixed device population. The question is whether that average is the right number to make a decision from.
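One way to state this precisely: for relative lifts, the aggregate is a weighted blend of the per-device effects, where each weight is that device's share of baseline (control) conversions.

```latex
\text{lift}_{\text{agg}} = w_{\text{desktop}} \cdot \text{lift}_{\text{desktop}} + w_{\text{mobile}} \cdot \text{lift}_{\text{mobile}},
\qquad w_{\text{device}} = \frac{\text{control conversions}_{\text{device}}}{\text{control conversions}_{\text{total}}}
```

When the weights are lopsided toward mobile, as they usually are in modern traffic mixes, even a large desktop effect moves the aggregate only slightly.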
In a testing program where desktop and mobile populations have similar behavioral patterns, similar funnel structures, and similar response to changes, aggregate results are a reasonable basis for decisions. In a program where the device populations are behaviorally distinct — as they are in virtually every high-consideration funnel I have worked with — aggregate results systematically hide the most actionable information your tests produce.
The times aggregate is still the right read: when you are measuring something that does not vary by device (brand perception, post-enrollment satisfaction, NPS), when you are making a decision that applies to all users regardless of device, or when you have explicitly verified through prior split analysis that your device populations respond similarly.
The times aggregate is not the right read: when the test involves interaction patterns (form structure, step count, navigation), copy depth (longer explanatory text, guarantees, pricing language), or device-specific functionality (phone CTAs, click-to-call, mobile-specific UI components).
Practical Implementation: Building Device Split Analysis Into Your Program
The device split analysis described in this article is not technically complex. It requires one additional segmentation in your analytics platform, applied to the same metrics you are already measuring. The complexity is organizational: making it a consistent part of how results are read, not an occasional deep-dive.
Build device split into your results template as a required section, alongside the aggregate result. Make it standard practice that no test is called a winner, a loser, or inconclusive until both the desktop and mobile results have been reviewed.
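A minimal sketch of what "required, not optional" can mean in the template itself. The field names and verdict labels are illustrative, not a GrowthLayer schema; the point is that a verdict cannot be recorded until both device results exist.

```python
# Results record where no verdict can be logged until both device splits are
# present. Field names and verdict labels are illustrative.
from dataclasses import dataclass
from typing import Optional

VERDICTS = {"winner", "loser", "inconclusive", "device-conditional"}

@dataclass
class TestResult:
    test_id: str
    aggregate_lift: float
    desktop_lift: Optional[float] = None
    mobile_lift: Optional[float] = None
    verdict: Optional[str] = None

    def call(self, verdict: str) -> None:
        if self.desktop_lift is None or self.mobile_lift is None:
            raise ValueError("Review the desktop and mobile splits before calling this test.")
        if verdict not in VERDICTS:
            raise ValueError(f"Unknown verdict: {verdict}")
        self.verdict = verdict
```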
When you find device splits with meaningfully different results, document them explicitly and use them to generate device-specific test ideas. The form chunking example should have produced a follow-up test: "Optimizing the mobile enrollment flow for single-page completion." The guarantee copy example should have produced: "Designing a mobile-prominent guarantee presentation that does not rely on users reading mid-page copy."
The device split is not the end of your analysis. It is the beginning of the most specific, actionable part of it.
I built device segmentation as a default dimension in [GrowthLayer](https://growthlayer.app)'s results logging because the aggregate-only read is the single most common way that genuine wins get missed in enterprise testing programs. The form chunking test alone represented hundreds of additional enrollments that could have been captured with a device-conditional rollout. The guarantee test represented a mobile optimization opportunity that was never pursued.
Conclusion
Across the enterprise program audited here, the device split told a different story than the aggregate in every single case. That is not an anomaly; it is the expected result when you test changes on a population that includes two fundamentally different interaction models.
Desktop users see more, read more, and respond to informational depth. Mobile users see less, scroll faster, and respond to structural simplicity. Testing "all devices" and reading the aggregate is averaging across two populations that are not the same. The aggregate hides wins that are real, surfaces losses that are overstated, and misses the specific mechanisms that your tests are actually measuring.
The fix is not complicated. Pull the device split on every test result. Make it a required step, not an optional analysis. Build the device-conditional rollout into your deployment options when splits show opposite effects. And start with the flat results in your completed test backlog — the wins your program already generated but filed as inconclusive are the highest-value place to look first.
Want a testing pipeline that tracks device splits as a default dimension and flags device-conditional rollout opportunities? [GrowthLayer](https://growthlayer.app) is built for exactly this kind of structured results analysis — start tracking your program for free.
_Atticus Li is a CRO Strategist and the Founder of [GrowthLayer](https://growthlayer.app), a platform for managing and improving enterprise experimentation programs._