5 Tests, 3 Brands, Zero Wins: How We Killed the "Recommended Plans" Concept
We tested "Recommended Plans" five times across three brands. Every test failed. One produced a statistically significant double-digit decline. Here's what this taught us about user choice in high-consideration purchases.
There is a category of test idea that feels so obviously right that running it seems like a formality. "Recommended Plans" was that idea for our testing program.
The logic was airtight: users comparing energy plans face a complex decision with multiple attributes — rate types, contract lengths, usage bands, green energy percentages. Decision fatigue research is clear that more options lead to worse decisions and higher abandonment. Recommendation systems are proven to work in e-commerce. Highlighting the best option should reduce cognitive load, build confidence, and increase conversion.
We ran the test. It failed.
We refined the implementation and ran it again. It failed again.
We brought the concept to a second brand with a different plan structure. Two tests, two failures.
We tried it with a third brand. This time the result reached statistical significance: a double-digit decline.
Five tests. Three brands. Eighteen months of work. Hundreds of thousands of visitors. Zero wins.
This is the story of how we killed "Recommended Plans" — and what the failure taught us about user choice in high-consideration purchase contexts.
Test One: The Initial Optimism
The first implementation was straightforward. The plan selection page showed all available plans in a grid. The variant added a "Recommended" badge to the plan our data indicated was the most popular selection — the one with the highest enrollment rate among comparable customers.
The hypothesis was grounded in legitimate research: recommendation defaults are powerful, as documented in behavioral economics literature on default effects. The framing was clear. The badge was visible. The test was adequately powered.
After six weeks of runtime, the result was flat. Not a small positive, not a small negative — genuinely flat. The recommendation badge had produced no detectable effect on plan selection or enrollment completion.
The post-test review generated a list of implementation critiques: the badge was not prominent enough, the recommended plan was not the right choice for all users, the label "Recommended" was too generic. The consensus was that the concept was sound but the execution had been weak.
This is the normal response to a flat result, and it is often the right response. Most flat tests are flat because of execution gaps, not because the hypothesis was wrong. We went back to design.
Key Takeaway: A flat result after a first attempt is not evidence that a concept is wrong — it is an invitation to strengthen the execution. The mistake is treating a flat result as a green light for indefinite iteration rather than asking at what point accumulated null evidence should change your conclusion about the concept.
Test Two: The Three-Arm Mistake
The second test introduced two variants: a "Recommended" badge on the most popular plan and a "Best Value" badge on the plan with the lowest total annual cost. The logic was that the first test might have failed because "Recommended" was too vague — users might prefer a specific value label.
This was a three-arm test: control, recommended badge, best value badge.
Three-arm tests require substantially more traffic than two-arm tests to reach statistical significance on each arm. The plan selection page received moderate traffic. The sample size calculation — done after the test design, not before — revealed that achieving 80% power on both variant arms at 95% confidence would require approximately 48 days at the observed traffic rate. The test was scheduled for six weeks.
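To make the arithmetic concrete, here is a minimal sketch of the calculation that should have run before launch. The baseline rate, minimum detectable effect, and daily traffic below are placeholders, not the program's actual numbers; the point is that the per-arm requirement is fixed, so every additional arm adds its full share of calendar time.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_arm(p_base: float, mde_rel: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    p_var = p_base * (1 + mde_rel)              # variant rate under the alternative
    p_bar = (p_base + p_var) / 2
    z_a = norm.ppf(1 - alpha / 2)               # two-sided 95% confidence
    z_b = norm.ppf(power)                       # 80% power
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p_base * (1 - p_base) + p_var * (1 - p_var))) ** 2
    return ceil(numerator / (p_base - p_var) ** 2)

# Hypothetical inputs -- the article does not publish its rates or traffic.
daily_visitors = 2_000
n = n_per_arm(p_base=0.08, mde_rel=0.10)
for arms in (2, 3):
    days = ceil(arms * n / daily_visitors)
    print(f"{arms}-arm test: {n:,} per arm, ~{days} days at {daily_visitors:,} visitors/day")
```

With real inputs, this is a one-minute check that would have surfaced the 48-day requirement before the test was scheduled for six weeks.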
Forty-eight days in, neither arm had reached significance. The two variant arms were performing identically to each other. Both were marginally negative relative to control.
We simplified: we dropped the "Best Value" arm, consolidated traffic onto a two-arm test with "Recommended," and extended the runtime by three weeks.
At the end of the extension: flat again. The second test had consumed 48 days of underpowered three-arm runtime plus three weeks of two-arm extension to produce the same result as the first test.
The design decision — to launch three arms on a page with moderate traffic — had wasted over two months of testing calendar before producing evidence that a two-arm test would have produced in three weeks.
This is the three-arm mistake: adding variant complexity without validating that the traffic supports it. The right process is to run the sample size calculation before committing to the test structure, not after. If the calculation shows that three arms require 48 days on available traffic, the question is whether you have enough evidence for three arms to be necessary — or whether a simpler two-arm test should run first.
Key Takeaway: Three-arm tests require roughly 50% more traffic than two-arm tests to maintain equivalent statistical power on each arm. Running three arms on moderate-traffic pages produces extended, underpowered tests. Calculate sample size requirements for every arm before committing to the design.
Tests Three and Four: Bringing the Concept to a Different Brand
By this point, we had two flat results on one brand. The concept had not produced a statistically significant result in either direction.
A senior leader raised the question of brand specificity: maybe "Recommended Plans" fails on this brand's plan selection page because the plan options are too similar in price. On a second brand, where plans varied more significantly in total cost and contract structure, the recommendation might be more meaningful to users.
The third test launched on the second brand. The implementation was stronger this time: the recommended plan was identified using actual customer usage data — matching the customer's historical consumption to the plan that would produce the lowest total annual cost. This was not a generic "most popular" badge. It was a personalized recommendation based on real usage history.
The result was effectively flat: marginally positive, approximately +3% on plan selection, but not statistically significant.
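For intuition on why a +3% relative lift can still read as flat, here is a minimal two-proportion z-test sketch. The visitor and conversion counts are hypothetical; the test's actual numbers were not published.

```python
from math import sqrt
from scipy.stats import norm

def two_prop_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test using a pooled variance estimate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - norm.cdf(abs(z)))        # z statistic and two-sided p-value

# Hypothetical counts: 8.0% control vs 8.24% variant, roughly a +3% relative lift.
z, p = two_prop_z(conv_a=1_600, n_a=20_000, conv_b=1_648, n_b=20_000)
print(f"z = {z:.2f}, p = {p:.3f}")              # p is roughly 0.38, nowhere near 0.05
```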
The fourth test, also on the second brand, tested the recommendation presentation rather than the badge itself. Instead of adding a label to the existing plan grid, the variant surfaced the recommended plan at the top of the list in a visually distinct card format, with a brief explanation of why it was recommended for this customer.
Still flat. The session replay analysis from this test included a finding that became the turning point in our thinking about the concept: "Users prefer a full list of plans when making a selection."
The replay data showed users clicking past the recommended card to view all available plans before making a selection — even when the recommended plan was the plan they eventually chose. Users were not ignoring the recommendation because they disagreed with it. They were setting it aside in order to complete a process they had already decided to follow: review all options, then choose.
Key Takeaway: Session replay data from the fourth test revealed that users bypassed the recommendation to view the full plan list — even when they ultimately selected the recommended plan. This is not decision fatigue behavior. This is deliberate comparison behavior. The behavioral models were different. The test concept was built on the wrong model.
Test Five: The Double-Digit Decline That Ended the Program
By the time the fifth test launched on the third brand, the evidence was substantial: four tests, zero wins, three flat results, one marginally positive result that fell short of significance, and session replay data explicitly showing users bypassing recommendations.
The case for continuing was organizational rather than empirical. The concept had been presented to stakeholders as a promising optimization. Multiple design and development cycles had been invested. There was reluctance to declare a concept dead based on what some stakeholders characterized as "mixed results" — even though three of the four tests had produced essentially zero signal and the fourth had produced behavioral data inconsistent with the recommendation mechanism.
The fifth test was the most robustly designed of the five: clean two-arm structure, correctly powered, appropriate runtime, clear primary metric. It was the cleanest test of the concept we had run.
It reached statistical significance: a double-digit decline in plan enrollment.
The post-test debrief was the most useful conversation we had across the entire eighteen-month program. The question was not "how do we fix this?" It was "what did we get wrong about the mechanism?"
Why "Reduce Decision Fatigue" Does Not Apply Here
The "Recommended Plans" concept was built on the decision fatigue hypothesis: too many options overwhelm users, so curating the choice reduces the cognitive burden and improves the quality of the decision.
Decision fatigue is a real phenomenon, well-documented in behavioral economics. But it applies in a specific context: choices that are perceived as roughly equivalent, where the cost of a wrong choice is recoverable and the decision domain is low-involvement. Classic examples include food choices, product recommendations in e-commerce, and default selections on subscription sign-ups.
Energy plans are not that context. Neither are insurance policies, mortgage products, or SaaS plans with complex feature differentiation.
In high-consideration purchase categories, users are comparing plans across multiple attributes that genuinely matter to them: total annual cost, rate structure, contract commitment, renewable energy percentage, and plan flexibility. These attributes trade off against each other differently for different users. A recommendation system — no matter how good the underlying model — cannot know which attributes the specific user weighs most heavily.
The session replay data confirmed this. Users were not overwhelmed by the number of options. They were executing a deliberate comparison process. They wanted to see all plans because the comparison itself was the decision process — not a barrier to it.
This is the fundamental distinction the decision fatigue hypothesis missed: in low-consideration contexts, choice creates burden because the user does not have strong attribute preferences. In high-consideration contexts, choice creates opportunity because the user does have strong attribute preferences and needs all the options to evaluate them.
Reducing the visible option set in a high-consideration context does not reduce cognitive burden — it frustrates the comparison process users have come to perform intentionally. The recommendation adds a new meta-decision ("should I trust this recommendation or check for myself?") rather than removing a decision. Users were checking for themselves. Every time.
The Difference Between E-commerce and High-Consideration Recommendations
The recommendation hypothesis was a legitimate one in e-commerce — category-level personalization, "customers also bought," "you might like" — and the experimentation literature shows clear wins in those contexts. The analogical reasoning to utility plan selection seemed sound.
The analogy fails on three dimensions.
First, the stakes are different. An e-commerce recommendation might result in a $40 purchase. A plan selection might result in a 12-month energy contract. The user's tolerance for not checking all options scales with the consequence of a wrong choice.
Second, the attribute complexity is different. An e-commerce recommendation typically matches on one or two preference dimensions: category and price, or style and color. A utility plan comparison involves attributes that interact in non-obvious ways — a plan with a lower unit rate may be more expensive for a high-usage customer than a plan with a higher unit rate but a usage credit. Users know this complexity exists. They want to evaluate it themselves.
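A small worked example makes the interaction concrete. The rates, credit, and threshold below are invented for illustration; the structure (a monthly bill credit that applies only above a usage threshold) is a common one in energy plans.

```python
def annual_cost(monthly_kwh: float, unit_rate: float,
                credit: float = 0.0, credit_threshold_kwh: float = 0.0) -> float:
    """Annual cost with an optional monthly bill credit that applies
    only when the month's usage meets the plan's threshold."""
    monthly = monthly_kwh * unit_rate
    if credit_threshold_kwh and monthly_kwh >= credit_threshold_kwh:
        monthly -= credit
    return 12 * monthly

plan_a = dict(unit_rate=0.13)                                           # lower unit rate, no credit
plan_b = dict(unit_rate=0.15, credit=40.0, credit_threshold_kwh=1_000)  # higher rate plus credit

for usage in (700, 1_400):   # low-usage vs high-usage customer, kWh per month
    a, b = annual_cost(usage, **plan_a), annual_cost(usage, **plan_b)
    print(f"{usage:,} kWh/mo: plan A {a:,.0f} vs plan B {b:,.0f}")
# 700 kWh/mo:   plan A (1,092) beats plan B (1,260)
# 1,400 kWh/mo: plan B (2,040) beats plan A (2,184)
```

The lower-rate plan wins for the low-usage customer and loses for the high-usage one, which is exactly the kind of interaction a single "Recommended" badge flattens.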
Third, the trust relationship is different. E-commerce recommendations come from platforms whose business model is to surface relevant products. Users have a generalized trust in that recommendation context. Utility plan recommendations come from the company selling the plans, which creates an obvious commercial conflict. The recommendation is not seen as neutral. It is seen as "which plan does the company want me to choose?"
The double-digit decline on test five and the session replay finding from test four both reflect this trust problem. Users did not believe the recommendation was made in their interest. They made their own comparison.
Key Takeaway: The decision fatigue hypothesis applies to low-consideration, low-stakes, attribute-simple choices. High-consideration purchases with multiple competing attributes and significant consequences are not decision-fatiguing — they are deliberate. Recommendation curation frustrates the comparison process rather than simplifying it.
When Recommendation Actually Works: Personalization on Actual Data
The one near-win in the five-test program was the third test — the one that used actual customer usage data to recommend the plan that minimized the customer's projected annual cost. It produced a +3% directional positive that fell short of significance.
This matters because it points to the conditions under which recommendation can work in high-consideration contexts, even if the effect is smaller than in low-consideration contexts.
Personalization based on genuine customer data performs better than generic labeling because it changes the meta-decision the user faces. With a generic "Recommended" badge, the user's meta-decision is: "Should I trust a recommendation that has no visible basis?" The natural answer is skepticism. With a data-backed recommendation — "Based on your usage last year, this plan would save you approximately £X" — the meta-decision becomes: "Is this calculation right?" That is a tractable question users can evaluate.
The difference is between asking users to take a recommendation on faith and giving users the reasoning behind the recommendation so they can assess it themselves.
In high-consideration contexts, the recommendation pathway that works is transparency, not curation. Show users the comparison. Show them why one plan is better for their specific situation. Let them validate the claim. That is a recommendation mechanism compatible with the deliberate comparison behavior the session replays showed — not a replacement for it.
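A minimal sketch of what that transparency-first mechanism could look like, assuming the same usage-replay projection the third test used. The Plan structure and field names here are ours, invented for illustration, not the production implementation.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    unit_rate: float                  # cost per kWh
    monthly_credit: float = 0.0
    credit_threshold_kwh: float = 0.0

def projected_annual_cost(plan: Plan, monthly_usage_kwh: list[float]) -> float:
    """Project next year's cost by replaying the customer's last 12 months of usage."""
    total = 0.0
    for kwh in monthly_usage_kwh:
        month = kwh * plan.unit_rate
        if plan.credit_threshold_kwh and kwh >= plan.credit_threshold_kwh:
            month -= plan.monthly_credit
        total += month
    return total

def transparent_comparison(plans: list[Plan], usage: list[float]) -> list[tuple[str, float]]:
    """Return every plan with its projected cost, cheapest first, so the user
    can validate the recommendation rather than take it on faith."""
    costs = sorted(((p.name, projected_annual_cost(p, usage)) for p in plans),
                   key=lambda pair: pair[1])
    (best_name, best_cost), (_, next_cost) = costs[0], costs[1]
    print(f"Based on your usage last year, {best_name} would save you "
          f"about £{next_cost - best_cost:,.0f} versus the next-cheapest plan.")
    return costs                      # the full ranked list stays visible
```

The crucial design choice is the return value: the full ranked list, not a single curated pick.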
How to Know When to Kill a Concept Versus When to Iterate
The "Recommended Plans" program was eighteen months long. It should have been eight weeks.
The decision error was consistently reframing null evidence as execution failure. After test one, the team concluded the badge was wrong. After test two, the team concluded the three-arm structure was wrong. After test three, the team concluded the brand context was wrong. After test four, the team had direct behavioral evidence that users were bypassing the recommendation — and still ran test five.
A useful framework for distinguishing "execution failure" from "concept failure" has three questions.
First: does the behavioral data show that users are encountering the intervention and responding to it? In the "Recommended Plans" program, session replay from test four showed users seeing the recommendation card and clicking past it. Users were encountering it. They were making an active choice to bypass it. This is not execution failure — it is behavioral evidence about the concept.
Second: have multiple implementation approaches failed in different contexts? Two flat results on brand one, tested with different implementations, are stronger evidence against the concept than one. Add two more results from brand two and a statistically significant loss on brand three, and the evidence profile is conclusive.
Third: is there a mechanism story that explains why the concept should work for this specific audience in this specific context — not borrowed from a different category? The "Recommended Plans" concept was borrowed from e-commerce without adapting the mechanism story to high-consideration purchasing behavior. When the mechanism story is borrowed rather than derived, the iteration logic tends to fix execution rather than examine whether the mechanism applies.
A testing program that answers all three questions rigorously can distinguish between a concept that needs better execution and a concept that is wrong for the context. Most programs do not ask these questions systematically, which is why they iterate on dead concepts longer than the evidence warrants.
At GrowthLayer, we log the behavioral mechanism explicitly on every test record and require teams to revisit the mechanism statement after each null or negative result. The question is not "what did we implement wrong?" — it is "what does this result tell us about whether the mechanism is operating as predicted?" That question produces faster concept kills and cleaner learning.
The Sunk Cost Trap in Testing Programs
I want to be honest about the organizational dynamics that kept "Recommended Plans" alive for multiple tests.
After the first test, a design team had built recommendation logic. After the second test, a data team had built a usage-based recommendation model. After the third test, a third brand's development team had implemented the feature. By the time test five launched, three development teams, two design teams, and a data team had invested meaningful time in a concept that the evidence had been characterizing as a failure since test two.
The sunk cost fallacy — the tendency to continue investing in a path because of the resources already committed — is well-documented in behavioral economics. It is also pervasive in experimentation programs, because each null result can be reframed as a reason to invest more in getting the execution right rather than a reason to reconsider the concept.
The antidote is a pre-specified stopping rule that is independent of the investment already made. Before the second test of any concept, define the evidence threshold at which you will declare the concept dead: two null results from adequately powered tests, a statistically significant negative result, or direct behavioral data showing the mechanism is not operating as predicted. Commit to that threshold in writing before the test launches.
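Here is a sketch of what a pre-specified stopping rule can look like in practice. The record fields and thresholds are illustrative, not a prescription.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    concept: str
    adequately_powered: bool
    significant: bool
    relative_lift: float            # negative values are losses
    mechanism_contradicted: bool    # e.g. replays show users bypassing the feature

def should_kill(results: list[TestResult], max_nulls: int = 2) -> tuple[bool, str]:
    """Evaluate the stopping rule on evidence alone, ignoring sunk investment."""
    if any(r.significant and r.relative_lift < 0 for r in results):
        return True, "statistically significant negative result"
    if any(r.mechanism_contradicted for r in results):
        return True, "behavioral data contradicts the predicted mechanism"
    nulls = sum(r.adequately_powered and not r.significant for r in results)
    if nulls >= max_nulls:
        return True, f"{nulls} null results from adequately powered tests"
    return False, "threshold not reached; iteration still justified"
```

Applied to this program, a rule like this would have fired by the end of test two on the null-result criterion, and again after test four on the mechanism criterion, months before test five launched.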
When the threshold is pre-specified, the decision to continue or stop becomes a measurement question rather than an organizational politics question. The answer is in the data, not in the retrospective justification of past investment.
Alternative Approaches to Plan Selection Optimization
Killing "Recommended Plans" freed up testing capacity for approaches that were actually grounded in the behavioral data we had collected across multiple tests.
The session replay finding — users prefer a full list of plans when making a selection — pointed toward a different hypothesis: if users want to compare all plans, the optimization opportunity is in improving the comparison experience, not reducing the number of options visible.
Tests that followed explored improved filtering and sorting controls, allowing users to rank plans by attribute. They explored side-by-side comparison views for users who had shortlisted two plans. They explored interactive cost calculators embedded in the plan selection page, which let users input their usage level and see projected annual costs across all plans simultaneously.
These interventions shared a common mechanism: support the deliberate comparison process users are already executing, rather than interrupting it with a pre-made recommendation. They were grounded in the same session replay data that had described the failure mode of "Recommended Plans."
Not all of them won. But the ones that won were based on a mechanism story derived from the actual behavioral data about how users in this category make decisions — not borrowed from a different category and applied by analogy.
Conclusion
Five tests. Three brands. Eighteen months. Zero wins. One statistically significant loss: a double-digit decline.
The "Recommended Plans" concept was not badly executed — by test five, it was very well executed. It failed because the mechanism it was built on — decision fatigue reduction through curation — does not apply to high-consideration purchases where users are executing deliberate, attribute-based comparison processes.
The lesson is not that recommendation systems do not work. They work well in e-commerce, in content recommendation, in product discovery. The lesson is that mechanism transfer across categories requires examining whether the behavioral conditions that make the mechanism work are actually present in the new context.
They were not. The behavioral data told us this by test four. We ran test five anyway.
That decision is the part I would change. Not the concept — you learn by testing. But the organizational discipline to read null evidence as evidence, not as an invitation to iterate, is the capability that would have saved us eight months and two development cycles.
_Ready to build a testing program that kills dead concepts faster and accelerates learning? [GrowthLayer](https://growthlayer.app) tracks behavioral mechanisms across every test — so accumulated null evidence surfaces as a pattern before it becomes an eighteen-month sunk cost._
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.