
The Economics of A/B Testing: Opportunity Cost, Sunk Cost Traps, and the True ROI of Your Testing Program

The vast majority of our tests were underpowered — burning traffic that could have powered better experiments. Here's the economics framework for maximizing your testing program's ROI.

Atticus Li · Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
11 min read

Editorial disclosure

This article lives on the canonical GrowthLayer blog path for indexing consistency. Review rules, sourcing rules, and update rules are documented in our editorial policy and methodology.

Fortune 150 experimentation lead · 100+ experiments / year · Creator of the PRISM Method
A/B Testing · Experimentation Strategy · Statistical Methods · CRO Methodology · Experimentation at Scale

Most teams think about A/B testing through a statistics lens: statistical significance, sample size, confidence intervals. That framing is necessary but not sufficient. After running dozens of enterprise-scale experiments across a multi-brand program, I am convinced that the deeper problem in most testing programs is not statistical. It is economic.

The most damaging errors in the program I ran were not Type I errors or Type II errors. They were economic errors: misallocated resources, ignored information, and a persistent attachment to failing ideas because of the time already invested in them. When I applied a basic economic framework to the full enterprise dataset, the pattern of waste became quantifiable and, more importantly, preventable.

This article lays out the economic framework I now use to evaluate testing programs — not just individual tests. If you run a program of any size, these concepts will change how you think about what a test is really worth.

The Opportunity Cost of Traffic

In economics, opportunity cost is the value of the next-best alternative you forgo when you choose a particular action. In A/B testing, every visitor you route into an experiment is a visitor you could have routed into a different experiment. That foregone value is your opportunity cost.

The concept sounds abstract until you look at power calculations.

In our enterprise program, the vast majority of tests were underpowered — designed without sufficient sample size to detect the minimum effect size the team actually cared about. An underpowered test can still reach significance by chance, but its probability of detecting a real effect is below the threshold where the test earns its traffic. The experiment is, in economic terms, burning resources to generate inconclusive information.

Key Takeaway: An underpowered test does not just fail to produce a result — it consumes the traffic that could have powered a conclusive test on a different page, concept, or audience segment. The opportunity cost of running a program where the vast majority of tests are underpowered is not zero. It is the value of the better-powered program you could have run instead.

Here is a concrete illustration. One test in our program ran for 95 days testing a copy change on a page with a very high baseline conversion rate. The measured effect was 0.3%. Reaching significance on a 0.3% effect on a very high baseline requires an enormous sample — and even if the effect is real, it represents a marginal improvement on a page that is already converting nearly everyone who reaches it.

For 95 days, a meaningful portion of organic traffic was allocated to detecting a sub-1% improvement on a page with almost no room to improve. The economic alternative — reallocating that traffic to test a redesign of a page converting at 40% — was never evaluated. The program paid a 95-day opportunity cost to generate a finding of minimal practical value.

Diminishing marginal returns is the formal economic concept here. As conversion rates approach their ceiling, each additional percentage point of improvement requires exponentially more effort, traffic, and time to detect and deliver. Testing copy on a near-ceiling page has near-zero marginal returns. Testing a structural change on a mid-funnel page with a 40% baseline has far higher return potential per unit of traffic invested.
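To make the comparison concrete, here is a minimal sketch using statsmodels' power utilities. The baselines and lifts below are illustrative stand-ins, not the program's actual figures, but they show how steep the sample-size penalty is for chasing a tiny effect on a near-ceiling page versus a meaningful effect on a mid-funnel page.

```python
# Illustrative power comparison: sample needed per arm at alpha=0.05, power=0.8.
# Baselines and lifts are assumptions for the sake of the example.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

def n_per_arm(baseline_cr: float, relative_lift: float,
              alpha: float = 0.05, power: float = 0.8) -> float:
    """Visitors needed per arm to detect a relative lift on a given baseline."""
    effect = proportion_effectsize(baseline_cr, baseline_cr * (1 + relative_lift))
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                        power=power, alternative="two-sided")

# Near-ceiling page: tiny relative effect on an already-high baseline.
print(f"0.3% lift on a ~90% baseline: {n_per_arm(0.90, 0.003):,.0f} visitors per arm")

# Mid-funnel page: structural redesign targeting a meaningful lift.
print(f"10% lift on a 40% baseline:   {n_per_arm(0.40, 0.10):,.0f} visitors per arm")
```

The first test needs roughly two orders of magnitude more visitors per arm than the second — traffic that, per the argument above, could have funded many conclusive experiments elsewhere.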

The Sunk Cost Fallacy in Practice

The sunk cost fallacy is the tendency to continue investing in a course of action because of resources already spent, even when the expected future return no longer justifies continued investment. Kahneman and Tversky's work on loss aversion helps explain why: we weight past investment more heavily than future returns when evaluating whether to continue.

In A/B testing, the sunk cost trap looks like this: a concept gets tested, fails, gets refined, fails again, gets refined further, and keeps accumulating test cycles because the team has already invested quarters of work into the idea.

In our enterprise program, a "recommended plans" concept — presenting users with a curated subset of options rather than the full product catalog — was tested five times. Each iteration accumulated more signal that the concept was not working. Users on this product were in a high-consideration decision context: they wanted to evaluate all their options before selecting one. Curation, in this context, felt like restriction rather than help.

The test-one results were inconclusive. The test-two results leaned negative. By test three, the cumulative evidence was pointing clearly toward the conclusion that curation was the wrong lever for this audience. Tests four and five were refinements of the same concept, not genuine pivots. They produced more negative evidence at the cost of more traffic, more development time, and more weeks in the testing queue.

The economic cost of sunk cost thinking in this case: five test cycles on a concept with accumulating negative evidence, when each of those slots could have been occupied by a new hypothesis drawn from user research that the team actually had available but had not consulted.

Key Takeaway: The relevant question for any test iteration is not "how much have we invested in this concept?" It is "given everything we now know, what is the expected return on the next test of this concept versus the expected return on a genuinely new idea?" Sunk cost is economically irrelevant to this calculation. In a testing program, every slot in the pipeline has an opportunity cost. Filling a slot with a refuted concept means not filling it with a potentially winning one.

This is harder than it sounds. Teams develop attachment to ideas. Stakeholders who championed a concept do not want to hear that five rounds of testing have not supported it. The economic frame helps cut through the attachment: "We have five tests of evidence against this mechanism. What is the expected value of a sixth test versus a genuinely new approach?" The answer, modeled honestly, usually points to moving on.
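One way to make that conversation concrete is to write the comparison down as expected values. The probabilities and dollar figures below are hypothetical placeholders — the point is the structure of the calculation, in which sunk cost never appears.

```python
# A sketch of the "next iteration vs. new idea" comparison. All inputs are
# hypothetical; note that the five cycles already spent on the refuted concept
# appear nowhere in the math.

def expected_value(p_win: float, value_if_win: float, cost_per_test: float) -> float:
    """Expected net return of spending one test slot on a hypothesis."""
    return p_win * value_if_win - cost_per_test

# Refuted concept: five rounds of negative or inconclusive evidence -> low win odds.
ev_sixth_iteration = expected_value(p_win=0.05, value_if_win=200_000, cost_per_test=40_000)

# Fresh hypothesis grounded in existing user research -> closer to base-rate odds.
ev_new_hypothesis = expected_value(p_win=0.25, value_if_win=200_000, cost_per_test=40_000)

print(f"Sixth iteration of refuted concept: {ev_sixth_iteration:+,.0f}")
print(f"New research-backed hypothesis:     {ev_new_hypothesis:+,.0f}")
```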

Information Asymmetry: The Research Nobody Read

Information asymmetry, in economic theory, describes situations where one party to a transaction has materially better information than another. In the Akerlof "Market for Lemons" framework, the party with superior information can exploit the gap, but more broadly, information asymmetry creates market inefficiencies because decisions are made on incomplete or incorrect information.

In a testing program, information asymmetry exists when the team running tests does not have access to — or does not consult — the user research that would predict test outcomes.

In our program, the "recommended plans" concept that was tested five times had user research on file that directly addressed user behavior around plan selection. That research, conducted six months before the first recommended-plans test, showed clearly that users in high-consideration decision contexts preferred full-catalog comparison over curated recommendations. The research predicted the test failures before they happened.

The information existed. The team running the tests did not consult it. The five test cycles that followed were, in economic terms, the cost of an information asymmetry that was self-inflicted — the research was available, the asymmetry was created by process failures that prevented the testing team from accessing and acting on existing knowledge.

This is one of the structural problems that GrowthLayer is built to solve: making accumulated test knowledge and prior research discoverable at the moment a new hypothesis is being formed, so that information asymmetries between "what user research showed" and "what we are about to test" are visible before the test runs.

The economics are clear: spending six weeks and a slice of traffic to re-learn something user research already told you is a pure waste. The better use of that traffic is to test a hypothesis the research has not yet addressed.

Option Value and the Real Return on Iteration

Option theory in finance describes the value of maintaining the right — but not the obligation — to take a future action. A call option on a stock is valuable not because you will certainly exercise it, but because having the option preserves future flexibility.

In experimentation, iteration creates option value. A successful version-one test does not just deliver its measured lift. It creates the option to run a version-two test that builds on the learning, potentially delivering additional lift. The full expected return on a version-one test includes not just the direct lift but the option value of the learning it generates.
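A rough way to express this is to add the option value of the follow-up test to the direct value of version one. The numbers below are hypothetical; the structure of the sum is what matters.

```python
# Hypothetical numbers illustrating the option-value framing of a v1 test.
direct_v1_value   = 120_000   # expected value of the v1 lift itself
p_v2_viable       = 0.5       # chance the v1 learning supports a worthwhile v2
expected_v2_value = 90_000    # expected value of the v2 lift if it materializes

option_value = p_v2_viable * expected_v2_value
total_expected_return = direct_v1_value + option_value
print(f"Direct v1 value: {direct_v1_value:,}")
print(f"Option value of the follow-up: {option_value:,.0f}")
print(f"Full expected return of testing v1: {total_expected_return:,.0f}")
```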

In our program, the biggest winner — a test that delivered a more-than-triple lift — was on a page that had never been tested before. The team had been concentrating testing on high-traffic acquisition pages while a mid-funnel page that was quietly underperforming went untouched for the entire first year of the program.

The economic concept here is comparative advantage. In economics, comparative advantage describes how resources should be allocated not just to areas of absolute advantage but to areas of relative advantage — where your return is highest relative to the alternative. A testing program has a comparative advantage in pages where the baseline conversion rate is far below its potential ceiling, where traffic volume is sufficient to reach significance in a reasonable time, and where no prior tests have already extracted the available learning.

The overlooked mid-funnel page had all three characteristics. It had a low baseline relative to analogous pages, adequate traffic, and no prior test history. The comparative advantage was obvious in retrospect. It was invisible during the program because attention was concentrated on pages that felt more important rather than pages where testing had the highest expected return.

Key Takeaway: Map your testing program not just by traffic volume but by expected marginal return per test. High-traffic, high-baseline pages often have lower marginal returns than mid-traffic, low-baseline pages that have been overlooked. Comparative advantage, not raw traffic size, should drive where you test.
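A lightweight way to operationalize this is to score candidate pages by headroom, detectability, and how much learning has already been extracted, rather than by traffic alone. The scoring function and page data below are an illustrative sketch, not a calibrated model — tune the inputs to your own program.

```python
# Illustrative prioritization sketch: score pages by expected marginal return
# rather than raw traffic. Page data and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class Page:
    name: str
    baseline_cr: float    # current conversion rate
    ceiling_cr: float     # realistic ceiling, estimated from analogous pages
    daily_visitors: int
    prior_tests: int      # how much learning has already been extracted

def marginal_return_score(p: Page) -> float:
    headroom = max(p.ceiling_cr - p.baseline_cr, 0.0)
    detectability = min(p.daily_visitors / 10_000, 1.0)   # crude traffic proxy
    novelty = 1.0 / (1.0 + p.prior_tests)                 # discount well-mined pages
    return headroom * detectability * novelty

pages = [
    Page("High-traffic acquisition page", 0.92, 0.95, 80_000, prior_tests=6),
    Page("Overlooked mid-funnel page",    0.40, 0.65, 15_000, prior_tests=0),
]
for p in sorted(pages, key=marginal_return_score, reverse=True):
    print(f"{p.name}: score = {marginal_return_score(p):.3f}")
```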

The Hidden Cost of Not Having a Holdout

One of the most underappreciated economic concepts in experimentation is the cost of foregone measurement. In our program, a number of high-impact changes were deployed as 100% personalizations — meaning every eligible user received the treatment with no holdout group. These were implemented as permanent changes rather than controlled experiments.

The economic problem is stark: without a holdout, there is no counterfactual. Without a counterfactual, there is no way to measure whether the change actually produced the attributed improvement. Pre/post comparisons are confounded by seasonality, market changes, and concurrent program changes. The measurement of ROI depends entirely on having a clean counterfactual, and a 100% personalization destroys that counterfactual permanently.
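A toy simulation makes the confounding visible. The effect size, seasonal drift, and traffic numbers below are invented for illustration: the pre/post estimate absorbs the seasonal drift, while even a modest holdout isolates the change itself.

```python
# Toy simulation with invented numbers: pre/post comparison vs. a 10% holdout.
import numpy as np

rng = np.random.default_rng(7)
baseline, true_lift, seasonal_drift = 0.40, 0.02, 0.03   # all hypothetical
n = 50_000

# Pre period: everyone on the old experience.
pre_rate = rng.binomial(1, baseline, n).mean()

# Post period, 100% rollout: the change and the seasonal drift are inseparable.
post_rate = rng.binomial(1, baseline + seasonal_drift + true_lift, n).mean()
print(f"Pre/post estimate: {post_rate - pre_rate:+.3f}  (absorbs seasonality)")

# Post period with a 10% holdout: both groups share the seasonal drift.
holdout_rate = rng.binomial(1, baseline + seasonal_drift, n // 10).mean()
treated_rate = rng.binomial(1, baseline + seasonal_drift + true_lift, n).mean()
print(f"Holdout estimate:  {treated_rate - holdout_rate:+.3f}  (isolates the change)")
```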

The opportunity cost of not having a holdout is immeasurable in the literal sense: you cannot calculate what you cannot measure. But the economic implication is real. Every resource spent on a "personalization" that cannot be measured is a resource spent without the ability to evaluate its return. A testing program that produces unmeasurable ROI is not producing learning — it is producing activity.

GrowthLayer's pipeline management features are designed partly around this problem: distinguishing tests that generate genuine causal evidence from changes that produce activity without measurement. The distinction matters economically because only causal evidence compounds. Activity does not compound.

Building an Economically Sound Testing Program

Applying these economic concepts to practice produces a set of operational principles that differ meaningfully from the standard "run more tests, reach significance faster" advice.

Prioritize by expected marginal return, not by traffic volume. The highest-traffic page in your funnel is often the lowest-return testing opportunity if the baseline is already high or if prior tests have already extracted available learning. Calculate expected marginal return: how much lift is theoretically available, given the baseline and the ceiling, and how quickly can you reach significance given current traffic?

Set a concept retirement threshold. Decide in advance how many tests of accumulating negative evidence it takes to retire a concept from the testing queue. Two consecutive negative tests with consistent signal should trigger a deliberate evaluation of whether to continue. Five consecutive tests without a win should trigger a mandatory pivot. This converts the sunk cost question from a judgment call (which is subject to loss aversion) to a policy decision (which is not).
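Written as a policy rather than a judgment call, the rule is only a few lines. The thresholds below are the example values from this section; calibrate them to your own program.

```python
# The retirement rule as an explicit policy rather than a per-meeting debate.
def queue_decision(consecutive_negative: int, tests_without_win: int) -> str:
    if tests_without_win >= 5:
        return "mandatory pivot: retire the concept"
    if consecutive_negative >= 2:
        return "deliberate review before any further iteration"
    return "continue iterating"

print(queue_decision(consecutive_negative=2, tests_without_win=3))
```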

Create a research-to-test handoff process. Every new test hypothesis should be checked against existing user research before entering the queue. If the research already predicts the outcome, the test is not generating new information — it is consuming traffic to confirm what you already know. Information asymmetries inside your own organization are the most expensive kind because they are entirely preventable.

Power every test to detect your minimum meaningful effect. An underpowered test is not a "quick signal" — it is a traffic burn. If the minimum meaningful effect on a page is a 5% lift, power the test to detect 5%. If current traffic levels make that infeasible in a reasonable timeframe, do not run the test at all — reallocate the traffic budget to a page where meaningful effects are detectable.
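A simple feasibility check, again sketched with statsmodels and illustrative traffic numbers, turns this into a go/no-go decision before a test enters the queue.

```python
# Illustrative go/no-go check: how long until this page can support a test
# powered for its minimum meaningful effect? Traffic numbers are assumptions.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

def days_to_power(baseline_cr: float, min_relative_lift: float,
                  daily_visitors_per_arm: int,
                  alpha: float = 0.05, power: float = 0.8) -> float:
    target = baseline_cr * (1 + min_relative_lift)
    effect = proportion_effectsize(baseline_cr, target)
    n_arm = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                         power=power, alternative="two-sided")
    return n_arm / daily_visitors_per_arm

days = days_to_power(baseline_cr=0.40, min_relative_lift=0.05,
                     daily_visitors_per_arm=250)
print(f"~{days:.0f} days to a properly powered test")
if days > 45:   # example traffic-budget policy
    print("Reallocate this slot to a page where the effect is detectable sooner.")
```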

Account for holdout costs in deployment decisions. Before deploying any change at 100%, calculate the cost of not having a holdout. If you cannot measure the ROI of the change without a holdout, the long-run cost of lost measurement may exceed the short-run cost of running the change as a proper experiment with a holdout group.

What the 42-Test Program Cost in Economic Terms

Running the full economic accounting on the enterprise program produces a sobering picture.

The vast majority of tests were underpowered — those tests consumed traffic that could have been allocated to well-powered experiments. One concept was tested five times despite accumulating negative evidence — the sunk cost of those five test cycles includes development time, traffic allocation, and queue slots that displaced potentially winning hypotheses. One test ran 95 days to detect a 0.3% effect on a near-ceiling page — the opportunity cost of those 95 days is the return on whatever test could have run instead. And the biggest winner of the entire program ran on a page that had been overlooked for over a year — the opportunity cost of that delay is the more-than-triple lift that was available but uncaptured.

None of these are statistical errors. They are economic errors. They would not be caught by a standard test review that checks p-values and confidence intervals. They are only visible when you apply the economics lens: opportunity cost, sunk cost, diminishing marginal returns, information asymmetry, and comparative advantage.

Key Takeaway: The ROI of a testing program is not just the sum of the lifts your winning tests deliver. It is the sum of those lifts minus the opportunity cost of traffic burned on underpowered tests, minus the traffic cost of re-running refuted concepts, minus the measurement cost of uncontrolled deployments. Most programs have never done this accounting. When they do, the result is almost always a mandate to run fewer, better-designed tests rather than more tests faster.

Conclusion

The economics of A/B testing are not complicated once you apply the basic framework. Opportunity cost means that every test competes with every other possible test for the same traffic, time, and development resources. Sunk cost means that prior investment in a concept is irrelevant to the question of whether to continue. Diminishing marginal returns means that near-ceiling pages return less per test than mid-funnel pages with room to improve. Information asymmetry means that re-testing what you already know is a pure waste. And comparative advantage means that traffic should be allocated to the contexts where testing generates the highest expected return.

The enterprise program I ran got better as these principles were applied more consistently. The last twelve tests in the program outperformed the first twelve by a significant margin — not because the team got better at running statistics, but because the economic allocation of testing resources improved.

If you are managing a testing program of any size, the first step is to run the accounting. How much of your traffic is going into underpowered tests? How many test slots are occupied by refuted concepts? What is the opportunity cost of not testing the highest-return pages in your funnel? The answers will almost certainly change your priorities.

If you are building a structured testing program and want a framework to track opportunity cost, test ROI, and concept retirement decisions, GrowthLayer is designed for exactly that — a knowledge base that makes the economics of your testing program visible, not just the statistics.

About the author

Atticus Li

Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method

Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.
