A/B Testing Is Product Development: How to Run Your Testing Program Like a Product Team
Our biggest wins weren't optimization tweaks — they were product redesigns measured as tests. Here's how to run your testing program like a product team.
The language we use for testing programs has a subtle problem, and it shapes the quality of every test we run.
When we call something an "experiment," we imply a small, controlled manipulation of an existing thing — a variable adjusted, a parameter tweaked, a hypothesis about a marginal improvement tested against the status quo. Experiments are things scientists run in labs. They are careful. They are precise. They are, by design, limited in scope.
When we call something "product development," we imply something larger: identifying a user need, defining a solution, building it, shipping it, measuring whether it worked, and iterating. Product development involves discovery. It involves construction. It involves decisions that change what a product fundamentally does for its users.
In our enterprise testing program, I noticed something: the tests that produced the largest impact were never "experiments" in the conventional sense. They were product development that happened to be measured.
The test that more than tripled a primary metric was not a manipulation of an existing page element — it was the construction of a new dashboard capability that users had never had access to before. The test that dramatically simplified an enrollment flow was not an optimization of the existing steps — it was a product redesign that questioned whether half the steps needed to exist at all. The test that introduced a phone channel as a conversion pathway was not a copy change or a layout adjustment — it was a new product capability that opened a fundamentally different way for users to complete the journey.
These tests won not because they were well-designed experiments. They won because they built things that were genuinely better. The measurement framework confirmed the improvement — but the value came from the product thinking, not the testing methodology.
When you understand this distinction, it changes everything about how you structure a testing program.
The Spectrum: From Micro-Optimization to New Capability
Not every test needs to be product development. The most useful mental model is a spectrum:
At one end are micro-optimizations: changes to individual elements — a button label, a headline variant, a color change, a spacing adjustment. These tests are fast, cheap, and easy to run. They also have a structural limitation: the effect size of a single-element change is almost always small, which means you need large amounts of traffic and time to detect a reliable signal. When they win, the gains are typically incremental. When they lose, you learn almost nothing, because changing one element in isolation tells you very little about what actually drives user behavior.
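To put rough numbers on that structural limitation, here is a minimal sketch of the standard normal-approximation sample-size formula for comparing two proportions. The baseline rate and lifts are illustrative assumptions, not figures from our program.

```python
# Minimal sketch: approximate per-arm sample size for a two-proportion test,
# using only the Python standard library. The numbers below are assumptions.
from statistics import NormalDist

def per_arm_sample_size(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Normal-approximation sample size for detecting a lift over a baseline rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Illustrative only: a 3% baseline conversion rate.
for lift in (0.02, 0.05, 0.20):
    print(f"{lift:.0%} relative lift -> ~{per_arm_sample_size(0.03, lift):,.0f} users per arm")
```

With these assumed numbers, a 20% relative lift is detectable with roughly fourteen thousand users per arm, while a 2% lift needs well over a million. That asymmetry is why single-element changes are slow to read and why their wins tend to be small.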
In the middle are feature tests: changes to a functional component of a product — a new form design, a revised information architecture, a different pricing presentation, a modified checkout sequence. These tests require more build time but create more behavioral change. They can produce meaningful effects because they alter how users engage with a system, not just how they perceive a single element.
At the other end are product redesigns and new capabilities: fundamental changes to what a product does for a user — a new task management capability, a simplified end-to-end flow, a new conversion channel, a new post-conversion experience. These are not "experiments" at all in the traditional sense. They are product decisions. The test measurement confirms or disconfirms whether the product decision was the right one, but the value creation happens at the product level, not at the testing level.
The distribution of value across this spectrum is extremely skewed. In every mature testing program I have analyzed, the vast majority of business impact comes from the far end — the product redesigns and new capabilities — while the vast majority of test volume comes from the other end, the micro-optimizations.
The programs that continue to produce large impacts over time are the ones that have deliberately shifted their testing mix toward the high-value end of this spectrum. The programs that plateau are the ones that mistake testing volume for testing value.
Key Takeaway: The highest-impact tests are never micro-optimizations. They are product changes — new capabilities, redesigned flows, fundamentally different user experiences — that happen to be measured through a testing framework. The distinction between "optimizing" and "building" determines whether a testing program can sustain meaningful impact over time.
Why the Highest-Impact Tests Are Always at the "New Capability" End
The reason new capabilities produce larger effects than element-level tweaks is not mysterious. It is a direct consequence of how much behavioral change the test creates.
An element-level test changes one aspect of one moment in the user journey. Even if users notice the change and respond to it, the response is constrained by the context: a single element in a flow that has many other elements, any one of which might be the actual friction point. The ceiling for an element-level test is low because the ceiling for how much behavioral change a single element can create is low.
A new capability changes the structure of what is possible. The task dashboard that more than tripled a primary metric in our program did not just change how a page looked — it changed what users could do when they arrived at that page. They could see a structured representation of their next actions. They could check off completed steps. They could see their progress in a way that the original page had never provided. The behavioral change was not about responding differently to a stimulus — it was about having access to a fundamentally different type of experience.
A simplified enrollment flow does not just remove friction from an existing process. It changes the cognitive model of what the process requires. When users discover that something they expected to take ten steps takes four, they do not just convert at a higher rate — they arrive at the product with a different set of expectations, a different level of effort expended, and a different relationship to what they just signed up for.
A phone channel as a conversion pathway does not just add a button to a page. It opens an entirely different interaction mode — synchronous, human, real-time — for users who are not ready to complete a self-service flow. Those users would not have converted through any version of the digital experience, regardless of how it was optimized. The new capability reaches them because it is a different product, not a better version of the same product.
The "Test Idea" vs "Product Idea" Distinction
One of the most useful reframings I made in managing the enterprise program was shifting the vocabulary used in ideation sessions from "test ideas" to "product ideas."
The difference sounds trivial. It is not.
A "test idea" frames the question as: what should we change about what exists? This framing inherently limits the solution space to modifications of the current product. It produces incremental thinking because it starts from the current state and asks how to adjust it.
A "product idea" frames the question as: what could be better for the user? This framing is not constrained by what currently exists. It can produce answers that involve adding capabilities, removing steps entirely, changing the structure of the interaction, or opening new pathways that the current product does not support.
The most impactful tests in our program came from product ideation sessions, not from testing ideation sessions. The task dashboard idea emerged from a question about what information users actually needed at the key decision point in the funnel — which led to the insight that what they needed was not a better version of the existing page but a structured view of their situation that the existing page had never tried to provide. The streamlined enrollment emerged from a question about what the enrollment process was actually trying to accomplish — which led to the insight that several steps existed for historical reasons, not user-centered ones, and could be removed without losing anything of value.
When you run ideation as a product exercise rather than a testing exercise, you get fundamentally different ideas. And fundamentally different ideas produce the largest test results.
Structuring a Testing Program Like a Product Team
If the highest-value tests are product decisions measured through a testing framework, then the organizational structure and process of the testing function should look more like product development than like a quality assurance or analytics function.
In practice, this means:
Discovery before ideation. Product teams do not start by brainstorming features — they start by understanding problems. The same should be true of testing programs. Before generating test hypotheses, invest time in behavioral data analysis, user research, session recordings, and support ticket analysis to understand where users are actually struggling. The best test ideas emerge from specific, observed user problems, not from general hunches about what might improve a metric.
Definition with specificity. A product team writes a product brief before building a feature. A testing program should write a test brief before building a variant — and that brief should specify not just what will change but what behavioral mechanism is expected to produce the improvement, what secondary metrics should move and in what direction, and what would constitute a clear learning even if the primary metric does not move. Specificity at the definition stage improves both the test design and the interpretability of the results. A sketch of what such a brief can look like follows this list.
Build for learning, not just for winning. Product teams accept that some features do not work out and build processes to learn from those failures. Testing programs that treat every inconclusive result as a failure cannot accumulate the kind of systematic knowledge that compounds into program-level impact. The goal of any individual test should be to generate a specific, actionable learning — whether or not the variant wins.
Iteration cycles, not isolated experiments. Product development does not stop after a single release. It iterates. The most productive testing programs in our enterprise ran iteration sequences — a first test that established a baseline, a second that built on the learnings from the first, a third that refined further. Each iteration produced a better product and a sharper hypothesis for the next test. The programs that ran isolated tests, logged the result, and moved on to an unrelated next idea never accumulated this compound knowledge.
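For the definition step above, here is one way a test brief could be captured as structured data rather than free text. The fields and example values are illustrative assumptions, loosely paraphrasing the task-dashboard test described earlier; they are not a prescribed template or a GrowthLayer schema.

```python
# Illustrative sketch of a test brief as structured data. Field names and
# example values are assumptions, not a prescribed template.
from dataclasses import dataclass, field

@dataclass
class TestBrief:
    name: str
    user_problem: str           # the specific, observed problem from discovery
    change: str                 # what the variant actually builds or alters
    mechanism: str              # why that change should alter behavior
    primary_metric: str
    expected_direction: str     # "increase" or "decrease"
    secondary_metrics: dict = field(default_factory=dict)  # metric -> expected direction
    learning_if_flat: str = ""  # what a null result would still tell us

brief = TestBrief(
    name="Task dashboard at the key decision point",
    user_problem="Users stall because their next actions are not visible",
    change="Replace the static page with a structured task dashboard",
    mechanism="Making remaining steps visible and checkable reduces abandonment",
    primary_metric="primary_conversion_rate",
    expected_direction="increase",
    secondary_metrics={"time_to_next_action": "decrease"},
    learning_if_flat="Next-step visibility is not the binding constraint here",
)
```

The point is not the format; it is that every field forces a specific commitment before the build starts, which is what makes the result interpretable afterwards.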
Key Takeaway: A testing program structured like product development — with discovery, definition, iteration cycles, and a bias toward building new capabilities rather than tweaking existing elements — will systematically outperform a testing program structured like quality assurance.
Sprint Cycles for Testing: How to Structure the Work
One of the ongoing debates in the enterprise program was about testing velocity: how many tests should a team run per sprint?
The case for high volume is straightforward: more tests mean more bets, more chances for a winner, more data. Teams that ran multiple tests per sprint felt productive. The backlog was moving. The calendar was full.
The case against high volume is subtler but ultimately more persuasive: tests are not lottery tickets. The value of a test is not in the running — it is in the learning, which requires adequate runtime, proper power, careful analysis, and time to incorporate the findings into the next hypothesis. A program that runs at high velocity but shallow depth accumulates a log of results without accumulating understanding.
The teams in our program that produced the most sustained impact over time were not the highest-volume teams. They were the teams that ran fewer, better-designed tests — each one backed by a specific discovery process, each one designed with a clear mechanism and diagnostic secondary metrics, each one analyzed carefully enough to produce a usable learning regardless of outcome.
The rough principle I landed on: it is better to run one test per sprint that is genuinely well-designed — with a specific mechanism, adequate power, and a clear interpretation plan — than to run three or four that are merely adequate and will produce ambiguous results.
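A back-of-the-envelope sketch of that trade-off, under assumed numbers: when the traffic available on one surface in a sprint is split across several concurrent tests, each test's power to detect a realistic lift drops sharply. This ignores real-world details (tests on different surfaces do not split traffic, and sequential monitoring changes the math), so treat it as illustration rather than a planning tool.

```python
# Rough sketch with assumed numbers: power per test when sprint traffic on one
# surface is split across concurrent tests. Standard library only.
from statistics import NormalDist

def achieved_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-proportion z-test with n_per_arm users per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm) ** 0.5
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

sprint_traffic = 120_000     # assumed users available on the surface per sprint
p1, p2 = 0.030, 0.033        # assumed 3% baseline, 10% relative lift

for n_tests in (1, 4):
    n_per_arm = sprint_traffic / n_tests / 2   # two arms per test
    print(f"{n_tests} concurrent test(s): power ~{achieved_power(p1, p2, n_per_arm):.0%}")
```

With these assumptions the single test runs at roughly 85% power, while each of the four concurrent tests runs at roughly 32%: four ambiguous answers instead of one usable one.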
Velocity is a means, not an end. The end is compound learning over time, and compound learning requires slowing down enough at each step to actually extract the insight before moving to the next idea.
The Compound Knowledge Effect
The deepest reason to run a testing program like a product team rather than like an experiment factory is what happens to organizational knowledge over time.
A product team that ships, measures, and iterates builds a progressively richer understanding of what its users need and how to deliver it. Each decision is informed by the ones that came before. The team gets better at identifying the right problems, designing the right solutions, and predicting whether a proposed change will work.
A testing program that runs isolated experiments and moves on builds no such accumulating understanding. The program knows which tests won and which lost, but it does not know why they won or lost, what patterns connect the winners, or how to construct a better hypothesis for the next test. Each test starts from scratch rather than from an incrementally better foundation.
The compound knowledge effect is what distinguishes the testing programs that continue to grow in impact over time from the ones that plateau after the first cycle of easy wins.
Building toward compound knowledge requires treating test results as inputs to a shared knowledge base, not as outputs to be filed and forgotten. It requires asking "what does this result tell us about how our users make decisions?" for every test, not just for the winners. It requires connecting insights across tests — noting when a finding in one test context illuminates a behavior observed in a different test context. And it requires using that accumulated understanding to generate progressively more specific and sophisticated hypotheses.
This is what I designed GrowthLayer to support at its core. The platform is not just a place to track test results — it is a knowledge base that connects test findings to behavioral mechanisms, surfaces patterns across tests, and uses the accumulated history of a program to improve the quality of future hypotheses. The goal is to make compound knowledge systematic rather than dependent on individual analysts holding program context in their heads.
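As a purely illustrative sketch of what "results as inputs to a shared knowledge base" can look like at the data level, consider records that tie each finding to the behavioral mechanism it supports or weakens. This is an assumption for illustration, not GrowthLayer's actual data model; the example entries loosely paraphrase the tests described earlier.

```python
# Illustrative sketch: findings stored against behavioral mechanisms so prior
# evidence can be pulled when drafting a new hypothesis. Not a real schema.
from dataclasses import dataclass

@dataclass
class Finding:
    test_name: str
    outcome: str        # "win", "loss", or "flat"
    mechanism: str      # the behavioral explanation the result supports or weakens
    evidence: str       # which metrics moved, and how

findings = [
    Finding("Task dashboard", "win", "next-step visibility", "primary metric up sharply"),
    Finding("Streamlined enrollment", "win", "perceived-effort reduction", "completion rate up"),
    Finding("Headline variant", "flat", "message framing", "no metric moved outside noise"),
]

def prior_evidence(mechanism_keyword, records):
    """Pull earlier findings touching a mechanism before writing the next brief."""
    return [f for f in records if mechanism_keyword in f.mechanism]

print([f.test_name for f in prior_evidence("visibility", findings)])  # ['Task dashboard']
```

The value is in the query pattern: a new hypothesis starts from what the program already knows about a mechanism, rather than from a blank page.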
Cross-Functional Testing: When Everyone Builds Tests
One of the structural challenges in enterprise testing programs is that test construction happens across multiple teams. The agile product team builds tests as part of their feature development process. The optimization team builds tests as part of their conversion improvement mandate. The analytics team builds tests to validate measurement decisions. In some organizations, the marketing team builds tests for campaign-related changes.
When these teams operate independently without coordination, two problems emerge.
The first is contamination: multiple tests running simultaneously on overlapping user segments can interact in ways that make both results unreliable. A user who is in both a product team test and an optimization team test is seeing a combined experience that neither team designed. If the interaction between the tests is positive, both teams may report wins that are partially attributable to the other. If the interaction is negative, both teams may report inconclusive results for changes that would have performed well in isolation.
The second is knowledge fragmentation: each team accumulates its own understanding of test results without that understanding being visible to the other teams. The product team learns something about how users engage with a new feature, but the optimization team, running tests on adjacent pages, never sees that learning. The optimization team discovers a behavioral pattern in the funnel, but the product team, designing the next feature release, builds against an outdated model of user behavior.
The fix for both problems is coordination infrastructure — a shared testing calendar, a protocol for managing test interactions, and a central knowledge base that all teams contribute to and can draw from.
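As a sketch of the simplest version of that protocol, here is a check that flags scheduled tests whose audiences and run windows overlap. The team names, segments, and dates are made up for illustration; this is not a description of any particular tool.

```python
# Illustrative sketch of a shared-calendar overlap check. All entries are
# made-up examples, not real tests or dates.
from datetime import date

scheduled_tests = [
    {"team": "product", "name": "Task dashboard", "audience": {"new_signups"},
     "start": date(2024, 3, 1), "end": date(2024, 3, 28)},
    {"team": "optimization", "name": "Signup page headline", "audience": {"new_signups", "all_visitors"},
     "start": date(2024, 3, 15), "end": date(2024, 4, 5)},
]

def overlaps(a, b):
    """True when two tests share an audience segment and their run windows intersect."""
    shares_audience = bool(a["audience"] & b["audience"])
    windows_intersect = a["start"] <= b["end"] and b["start"] <= a["end"]
    return shares_audience and windows_intersect

for i, a in enumerate(scheduled_tests):
    for b in scheduled_tests[i + 1:]:
        if overlaps(a, b):
            print(f"Review interaction: {a['name']} ({a['team']}) and {b['name']} ({b['team']})")
```

A flag like this does not decide anything by itself; it just guarantees that two teams never discover an interaction after both tests have already run.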
In our enterprise program, the periods of highest cross-team productivity were the ones where we had a shared testing calendar with explicit protocols for interaction management, and a shared knowledge base where all test findings were documented in a consistent format. During those periods, findings from one team's tests would regularly surface in another team's hypothesis generation. The compound knowledge effect operated at the organizational level, not just within individual teams.
The New Mental Model: Tests Are Bets on Product Directions
Here is the reframe I would offer to any product manager running a testing program: stop thinking of tests as experiments and start thinking of them as bets.
An experiment is something you do to find out the truth about a small question. A bet is something you make about a direction — a commitment of resources based on a belief about where value lies.
When you think of tests as bets, the prioritization question changes. The question is not "what should we test?" but "which product directions should we invest in discovering?" The most valuable bets are the ones that, if they win, open up a significant new direction for the product — and that, if they lose, tell you definitively that a direction you thought was promising is not. The least valuable bets are the ones that neither confirm a direction nor rule one out.
Micro-optimizations often do neither. They might produce a small positive signal, but winning or losing a button color test does not tell you which product directions are worth investing in. Product redesigns and new capability tests are almost always one or the other: they either confirm that a fundamental change is right, or they reveal that the assumption behind the change was wrong. Either outcome is valuable.
Running your testing program like a product team means filling your bet portfolio with the second kind of test — the ones where winning or losing both produce information that changes what you build next.
Conclusion
The most impactful tests in our enterprise program were not experiments. They were product decisions — new capabilities built and measured, flows redesigned from first principles, channels opened that had never existed before. The measurement framework confirmed that the decisions were right, but the value came from the product thinking behind them.
Running a testing program like a product team means shifting the vocabulary from "experiments" to "product decisions," structuring the work around discovery and definition rather than backlog and velocity, treating iteration cycles as the unit of work rather than isolated tests, and building toward compound knowledge rather than isolated result logs.
When you make this shift, the testing program stops being a support function for product development and becomes the primary mechanism for product discovery. Every test is a product bet. Every result is a learning. And the program gets better — in terms of both impact and hypothesis quality — with every cycle it completes.
If you want a platform built for testing programs that operate this way — connecting test results to behavioral mechanisms, surfacing patterns across the experiment history, and supporting the compound knowledge model — GrowthLayer is designed for exactly this. Because the most valuable thing a testing program can build is not a list of winning tests. It is an understanding of how your users make decisions — and what you can build to make those decisions easier.
Frequently Asked Questions
How do you decide whether a test idea is an "optimization" or a "product development" effort? Ask whether the change builds something users have never had access to, or whether it adjusts something that already exists. If the variant introduces a new capability, a new flow structure, or a fundamentally different interaction model — it is product development. If it changes an element within an existing structure — it is optimization. Both are legitimate, but they have different expected effect sizes and require different levels of design and product investment.
Should micro-optimization tests be eliminated entirely? Not necessarily. Micro-optimizations can be valuable for validating specific assumptions, for optimizing high-traffic elements where even small improvements have meaningful scale impact, and for building the team's testing culture and infrastructure. The problem arises when micro-optimizations dominate the mix at the expense of higher-impact product bets. A healthy testing program includes both, with deliberate portfolio management.
How do you manage the tension between the optimization team's pace and the product team's development cycle? The most effective model I have seen is a shared testing calendar with defined interaction windows. Product team tests are typically longer-cycle and higher-implementation-cost; they should be flagged well in advance. Optimization team tests are typically shorter and can be scheduled around the product team's build cycles. The key is that both teams see the same calendar and have a protocol for managing segment overlap.
What is the right ratio of discovery to execution in a testing program? In a mature program, I recommend allocating roughly a third of the program's total capacity to discovery — user research, behavioral data analysis, hypothesis development — and two-thirds to execution and analysis. Early-stage programs may need to invest more in discovery to build the initial knowledge base. Programs that skip discovery and move directly to execution typically run out of high-quality hypotheses within a few cycles and fall back on incremental tweaks that produce diminishing returns.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.