# Your ICE Score Is Lying to You: Why Prioritization Frameworks Fail at Predicting Test Impact
The test with the highest prioritization score in our entire enterprise program was a promotional deployment that never had a control group.
It scored Impact 10, Confidence 10, Ease 10. A perfect score. The kind of score that makes a prioritization framework feel credible — the team clearly agreed this was important, achievable, and high-impact. So it went to the top of the queue, got built, and got shipped.
There was just one problem: because the promotional element was live before the test infrastructure was in place, there was never a valid control condition. The variant ran. The metrics moved. And we had no way of knowing whether the movement was driven by the change we were testing or by the promotion itself, by seasonality, by an unrelated product change, or by pure coincidence.
The highest-priority test in the program produced zero valid learnings.
Now consider the test at the other end of the spectrum. It had been sitting at the bottom of the backlog for months. Every time prioritization conversations came up, it got passed over. The idea was speculative. The team was skeptical. The confidence score was low. When it finally ran — not because anyone was enthusiastic about it, but because the queue had cleared and there was traffic to spare — it produced the single largest win in the entire program. It more than tripled the primary metric.
The correlation between prioritization score and actual test impact, across the full history of that program, was essentially zero. If anything, it was negative.
I have been turning that finding over ever since, trying to understand what it means for how testing programs should be run. What I have concluded is that ICE scores and their variants are not just imprecise — they are structurally miscalibrated. They optimize for the wrong thing. And until you understand why, you will keep making the same backlog mistake: a queue full of high-confidence, low-impact tests, and a graveyard of high-potential ideas that never get prioritized.
The Inversion Problem: Why High ICE Often Means Low Impact
The pattern I described — highest score, lowest valid impact; lowest score, highest actual impact — is not a coincidence specific to our program. It is a predictable consequence of what the ICE framework actually measures.
Take the "Confidence" dimension first, because it is the most systematically distorting. In ICE scoring, confidence is meant to capture how certain you are that the test will produce a positive result. Teams typically rate confidence higher when they have prior data, analogous tests from other programs, or strong intuition from user research.
Here is the problem: the ideas you are most confident in are the ideas that look most like things you have already tried. High-confidence ideas are familiar ideas. They are the incremental version of what already exists — a button color tweak when the data shows button prominence matters, a headline adjustment when you know this audience responds to urgency, a layout change when you have already tested layout in the same direction. These ideas are high-confidence precisely because they are not surprising.
But the tests that produce dramatic lifts are almost never the obvious ones. They are the tests that challenge a fundamental assumption — that customers need to see pricing before commitment, that the primary conversion action belongs at the top of the page, that the enrollment flow requires the same steps it has always required. These ideas feel uncertain because they are uncertain. The data does not tell you they will work, because no one has tried them in exactly this context before.
When you score uncertainty as a liability, you systematically deprioritize the ideas with the highest breakthrough potential.
Key Takeaway: ICE confidence scores reward familiarity, not potential. High-confidence ideas tend to be incremental because you are most confident in ideas that look like what you have already done. The framework filters out precisely the tests most likely to produce large effects.
The Exploitation-Exploration Tradeoff
There is a well-established framework in decision theory for thinking about this dynamic: the exploitation-exploration tradeoff. Exploitation means doing more of what you already know works. Exploration means investigating things you do not yet understand.
ICE scoring is an exploitation machine. It evaluates ideas by asking how sure you are they will work — which is a question about your existing knowledge base, not about the potential of the unknown. A testing program that scores every idea on confidence will systematically over-invest in exploitation and under-invest in exploration.
This matters enormously at the program level. Early in a testing program, almost everything is exploration — you are learning basic things about your audience, your funnel, your product. As the program matures, the most obvious exploitation opportunities get captured, and continued progress requires going further into uncertain territory. A prioritization framework that penalizes uncertainty will therefore produce diminishing returns over time, not because the opportunity is exhausted, but because the framework is steering the program away from the remaining opportunities.
In our enterprise program, I watched this dynamic play out across multiple brands. The brands that maintained the highest win rates over time were the ones that kept running tests that the broader team was skeptical about — the weird ideas, the big redesigns, the hypotheses that contradicted the conventional wisdom about what that audience wanted. The brands that settled into a rhythm of high-confidence incremental testing showed declining impact over time.
The testing program that explores wins more than the testing program that exploits.
The Ease Trap: Small Tests Produce Undetectable Effects
The second major distortion in standard prioritization frameworks is the Ease dimension. High ease typically means low implementation cost — a copy change, a single-element swap, a configuration flag rather than a build. These are sensible things to favor in a world of constrained development resources. Fast, cheap tests that can be run without engineering support have obvious appeal.
But ease and impact are inversely correlated in a structural way.
Consider what it means for a test to be genuinely "easy." An easy test is usually a small test. A small test changes one element in a minor way. A small change produces a small effect at best. And a small effect in a typical testing environment — where you need to detect a meaningful change in conversion rate from a realistic sample size, in a reasonable runtime — is statistically undetectable.
Underpowered tests are one of the most persistent problems in enterprise testing programs. When an idea is too small to move the needle meaningfully, no amount of careful testing infrastructure will rescue it. The test will run, it will fail to reach significance, and it will be logged as "inconclusive" — which is exactly the right conclusion, but which provides no useful information and consumes traffic that could have been allocated to a test with a detectable effect.
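To make "too small to detect" concrete, here is a minimal sample-size sketch. The 4% baseline conversion rate and the two lift sizes are assumptions chosen purely for illustration, not figures from our program; the formula is the standard normal-approximation estimate for comparing two proportions.

```python
# Rough visitors-per-arm estimate for a two-proportion A/B test
# (normal approximation, two-sided alpha = 0.05, power = 0.80).
from scipy.stats import norm

def visitors_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Assumed 4% baseline conversion rate:
print(f"{visitors_per_arm(0.04, 0.02):,.0f}")   # ~2% relative lift: roughly 950,000 visitors per arm
print(f"{visitors_per_arm(0.04, 0.20):,.0f}")   # ~20% relative lift: roughly 10,000 visitors per arm
```

At most realistic traffic levels, the first test runs for months and still comes back inconclusive; the second resolves in weeks.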
The ease trap works as follows: a team systematically prioritizes easy tests because they score well on the ICE framework. Those easy tests produce mostly inconclusive results, because the effects are too small to detect at realistic sample sizes. The team develops a narrative about "our audience being hard to move" or "conversion rate being near its ceiling," when the actual explanation is that they have been running tests that were too small to produce detectable signals in the first place.
The tests that are genuinely hard to run — the full-page redesigns, the multi-step flow restructurings, the new feature introductions — are the ones that change enough about the experience to produce a measurable effect, in either direction. The sign of the effect is uncertain. But the magnitude, if the hypothesis is right, is detectable.
Key Takeaway: Easy tests are almost always small tests, and small tests produce effects that are too small to detect at realistic sample sizes. Systematically prioritizing ease produces a backlog of statistically underpowered ideas and a program with a chronic inconclusive rate.
How to Fix Prioritization: Weight Uncertainty Higher
The fix is not to abandon prioritization frameworks entirely. The fix is to recalibrate what you are trying to prioritize for.
If you are trying to maximize learning and business impact over a testing program's lifetime, your prioritization framework should reward ideas where:
The potential magnitude is large. Not just "how confident are we it will work" but "if it works, how big could the effect be?" A test that might double a conversion metric deserves more resource than a test that might improve it by a few percent, even if the latter is more certain to show a positive result.
The uncertainty is genuine and productive. Uncertainty is not a flaw in a test idea — it is a signal that the idea contains real information value. If everyone is confident the test will win, you are probably running an exploitation test in a space you already understand. If the team is genuinely uncertain, you might be discovering something new.
The mechanism is specific. The role of research and prior data is not to boost confidence scores — it is to sharpen hypotheses. A test backed by user research does not necessarily score higher on confidence; it scores higher on "we understand why this should work, which means we can learn from it regardless of the result." The mechanism behind a hypothesis is separate from the certainty about the outcome.
The opportunity cost is real. Prioritization should account for what else could run in the traffic and time that a test consumes. A high-confidence incremental test that runs for six weeks and produces a 2% lift has an opportunity cost: six weeks of traffic that could have been used to test a bolder hypothesis.
In practice, I recommend adding a "Potential Magnitude" dimension to replace or supplement the confidence score. Rate ideas on how large the effect could be if the hypothesis is correct. Use a separate "Hypothesis Quality" score that evaluates whether you have a specific, falsifiable mechanism — not whether you are sure it will work.
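As a sketch of what that recalibrated score could look like in practice — the dimension names, weights, and example ideas below are illustrative assumptions, not a prescribed formula or the GrowthLayer scoring model:

```python
from dataclasses import dataclass

@dataclass
class TestIdea:
    name: str
    potential_magnitude: int  # 1-10: how large could the effect be if the hypothesis is right?
    hypothesis_quality: int   # 1-10: is there a specific, falsifiable mechanism?
    ease: int                 # 1-10: implementation cost, kept only as a tiebreaker

def priority(idea: TestIdea) -> float:
    # Illustrative weights: magnitude dominates, ease matters least.
    return 0.5 * idea.potential_magnitude + 0.35 * idea.hypothesis_quality + 0.15 * idea.ease

backlog = [
    TestIdea("Headline urgency tweak", potential_magnitude=2, hypothesis_quality=5, ease=10),
    TestIdea("Remove pricing gate before signup", potential_magnitude=9, hypothesis_quality=8, ease=3),
]
for idea in sorted(backlog, key=priority, reverse=True):
    print(f"{priority(idea):.2f}  {idea.name}")
```

Under this weighting the bold, uncertain idea outranks the easy tweak, which is exactly the inversion of what a confidence-heavy ICE score would produce.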
The Role of Research: Sharpening, Not Confirming
One of the subtler problems with confidence-heavy prioritization is what it does to the role of research in the testing process. When confidence is a primary scoring dimension, teams tend to use research to justify high confidence scores — which creates a perverse incentive to conduct research that confirms what you already believe.
This is backwards. Research does not add value by making you more certain. It adds value by making your hypotheses more specific and therefore more testable.
A user interview that reveals customers are confused about a specific element of a page is not valuable because it makes you confident the test will win. It is valuable because it tells you which element to change, what to change it to, and what mechanism you expect to be driving behavior. That specificity makes the test more interpretable regardless of the outcome. If the test wins, you know why. If it loses, you know what assumption was wrong.
In our enterprise program, the tests that produced the most durable program-level learning were not the tests we were most confident about going in. They were the tests where we had developed a specific behavioral hypothesis — a named mechanism, a predicted pattern of secondary metric movement, a clear statement of what would falsify the idea. Those tests taught us something regardless of whether the variant won.
Product-Led Growth Implications: The Testing Program IS Product Development
Here is the implication that I believe most testing teams miss: when you run your prioritization framework wrong, you are not just wasting experimentation capacity. You are making your product worse than it could be.
In a product-led growth context, the testing program is the primary mechanism for product discovery. Every test that runs is a product decision. The tests you choose to prioritize determine which product directions get explored and which get ignored. A confidence-heavy ICE backlog is, in product terms, a roadmap that favors incremental feature polish over discovery of fundamentally better user experiences.
The most impactful tests in any mature program are not optimizations. They are product changes that happen to be measured. The full-page redesign that tripled a primary metric was not an "experiment" in the conventional sense — it was a product team building a fundamentally different version of the experience and measuring whether it was better. The ICE framework would have deprioritized it because the team was uncertain whether it would work.
This is exactly why I built the prioritization workflow in GrowthLayer the way I did. Rather than asking teams to score ideas on generic ICE dimensions, the platform prompts for hypothesis specificity, potential magnitude, and mechanism quality. It tracks which types of hypotheses have historically produced large effects in your specific testing environment. Over time, teams using GrowthLayer can see that their bold, uncertain ideas outperform their confident incremental ones — and calibrate their backlog accordingly.
Key Takeaway: In a product-led growth program, the testing backlog is the product roadmap. A confidence-heavy prioritization framework produces a roadmap skewed toward incremental polish and away from discovery. This is not just a testing efficiency problem — it is a product strategy problem.
Moving Beyond ICE: A Practical Framework
If you are working with a team that has been running ICE or a variant of it, here is how I would approach the transition:
Audit your historical data first. Before changing anything prospectively, go back through your test results and look at the correlation between historical scores and actual impact. If your program has been running for more than a year and you have more than a dozen completed tests with documented scores and outcomes, you should be able to see the pattern directly. I would be surprised if the correlation is positive.
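A minimal version of that audit, assuming you can export completed tests to a CSV with one row per test, its original score, and the observed lift — the file name and column names below are placeholders for whatever your own export looks like:

```python
import pandas as pd
from scipy.stats import spearmanr

# Placeholder export: one row per completed test, with its original ICE score
# and the observed relative lift on the primary metric (inconclusive tests as 0).
tests = pd.read_csv("completed_tests.csv")
rho, p_value = spearmanr(tests["ice_score"], tests["observed_lift"])
print(f"Spearman rank correlation, score vs. impact: {rho:.2f} (p = {p_value:.2f})")
# A correlation near zero (or negative) means your scores are not predicting impact.
```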
Add a Potential Magnitude dimension explicitly. Ask: if this hypothesis is correct, what is the maximum plausible effect? Not the expected effect, but the ceiling. This dimension should score higher for bold ideas and lower for incremental tweaks.
Separate mechanism quality from outcome confidence. Create a score for "do we have a specific, falsifiable behavioral hypothesis" that is distinct from "are we sure this will win." A well-specified hypothesis deserves to run even if the team is uncertain about the outcome.
Reserve some capacity for high-uncertainty, high-magnitude ideas. Even if you keep a confidence dimension for most of your backlog, deliberately allocate a portion of your testing capacity — I typically suggest at least a quarter — to ideas that score high on potential magnitude and mechanism quality but low on confidence. This is your exploration portfolio. It will have a lower win rate than your exploitation portfolio. It will also produce most of your largest wins.
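One way to make that protected allocation mechanical rather than aspirational — the thresholds, field names, and 25% share here are illustrative assumptions you would tune to your own backlog:

```python
def plan_cycle(ideas, slots, exploration_share=0.25):
    """Fill a testing cycle, reserving a protected share of slots for exploration."""
    def is_exploration(idea):
        # Bold and uncertain: high potential magnitude, low team confidence.
        return idea["potential_magnitude"] >= 7 and idea["confidence"] <= 4

    exploration = sorted((i for i in ideas if is_exploration(i)),
                         key=lambda i: -i["potential_magnitude"])
    exploitation = sorted((i for i in ideas if not is_exploration(i)),
                          key=lambda i: -i["confidence"])

    reserved = max(1, round(slots * exploration_share))  # protected exploration slots
    picks = exploration[:reserved]
    picks += exploitation[:slots - len(picks)]
    return picks

# Example: plan_cycle(backlog, slots=8) fills 2 exploration slots and 6 exploitation slots.
```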
Track the results by category. After twelve months, pull the data on which ideas produced the largest lifts: the high-confidence incremental ones or the uncertain bold ones. Your own data will tell you whether ICE is serving your program or harming it.
Conclusion
The test with the highest ICE score in our program could not prove any ROI. The test with the lowest score more than tripled the primary metric. That is not an anomaly — it is a predictable consequence of what ICE actually measures.
Confidence is inversely correlated with breakthrough potential. Ease is inversely correlated with detectable effect size. A framework that rewards both will systematically steer a testing program away from its highest-value opportunities.
The fix is not complicated. Add potential magnitude. Separate hypothesis quality from outcome certainty. Protect capacity for exploration. And look at your own historical data — the correlation between your prioritization scores and your actual test impacts will tell you everything you need to know about whether your current framework is working.
If you are ready to move beyond ICE scoring and build a testing backlog that reflects what actually produces impact, GrowthLayer is built for exactly this. The platform tracks hypothesis quality, potential magnitude, and historical impact patterns so your prioritization improves over time rather than compounding the same framework errors.
Frequently Asked Questions
Is ICE scoring completely useless? Not entirely — it is better than no prioritization at all. The problem is specific to how confidence and ease are weighted and interpreted. Used as a rough triage tool with awareness of its blind spots, ICE can still provide structure. The issue arises when teams treat it as predictive of test impact rather than as a starting point for conversation.
What prioritization frameworks are better than ICE? PIE (Potential, Importance, Ease) has some advantages over ICE in that it does not explicitly include a confidence dimension. Opportunity scoring frameworks that focus on market importance and current satisfaction levels are useful for larger product decisions. For experimentation specifically, any framework should include a potential magnitude dimension and separate hypothesis quality from outcome certainty.
How do I convince stakeholders to run "uncertain" tests? The argument is straightforward: show them the historical correlation between confidence scores and actual impact from your own program. If the data shows that your low-confidence tests have outperformed your high-confidence tests — which it usually will — you have an evidence-based case for changing the prioritization approach.
How many tests should be in the "exploration" portfolio at any given time? I typically recommend allocating at least a quarter of testing capacity to high-uncertainty, high-potential-magnitude ideas. The exact proportion depends on program maturity — earlier-stage programs should lean more toward exploration, since there are more fundamental unknowns to resolve.
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.