How to Build an Experimentation Roadmap Using Revenue Ranking
---
By Atticus Li -- Applied Experimentation Lead at NRG Energy (Fortune 150). Creator of the PRISM Method. Learn more at atticusli.com
Every experimentation team I have ever worked with has the same problem: too many test ideas and not enough capacity to run them all. The backlog grows. Stakeholders lobby for their favorites. The loudest voice wins.
This is how experimentation programs die -- not from lack of ideas, but from running the wrong ones.
At NRG Energy, we run 100+ experiments per year. That sounds like a lot, but we say no to three ideas for every one we run. The system that makes this work is revenue ranking: a prioritization framework that combines traditional ICE scoring with revenue-per-customer projections and pre-test MDE calculations.
Here is how to build one for your team.
Why ICE Scoring Alone Is Not Enough
Most experimentation teams start with ICE: Impact, Confidence, Ease. Score each on a 1-10 scale, multiply, rank. The highest scores go first.
The problem is that ICE is subjective. Impact of 8 according to whom? Confidence of 7 based on what data? Ease of 6 by whose estimation?
I have watched the same test idea get scored anywhere from 120 to 480 depending on who was in the room. That is not a prioritization framework. That is a negotiation.
ICE is a starting point, not a system. Revenue ranking adds the rigor that makes it useful.
The Revenue Ranking Framework
Revenue ranking adds three quantitative layers on top of ICE:
Layer 1: Revenue-Per-Customer Projection
For every test idea, estimate the annual revenue impact per customer if the test wins. Not the total revenue impact -- the per-customer impact. This forces specificity.
Example: "Redesign the plan comparison page" is too vague. "Increase plan comparison page conversion rate from 3.2% to 3.8%, which, at our average customer lifetime value of $2,400, would add approximately $X per acquired customer per year." Now you have a number you can compare across test ideas.
The formula:
Revenue Impact = (Current Traffic) x (Expected Lift) x (Revenue Per Conversion) x (Time Period)
You do not need precision. You need relative accuracy. Is this a $10K/year opportunity or a $500K/year opportunity? That distinction matters for ranking.
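Here is a minimal sketch of that calculation in Python, using the example numbers above plus an assumed traffic figure (the 5,000 monthly visitors is hypothetical):

```python
# Rough annual revenue projection for one test idea (all inputs hypothetical).
monthly_visitors = 5_000          # assumed traffic to the plan comparison page
baseline_cr = 0.032               # current conversion rate (3.2%)
expected_cr = 0.038               # projected rate if the test wins (3.8%)
revenue_per_conversion = 2_400    # average customer lifetime value, in dollars

expected_lift = expected_cr - baseline_cr                # absolute lift: 0.6 points
extra_customers = monthly_visitors * 12 * expected_lift  # additional conversions/year
annual_revenue_impact = extra_customers * revenue_per_conversion

print(f"~{extra_customers:,.0f} extra customers/year")
print(f"~${annual_revenue_impact:,.0f} projected annual revenue impact")
```

The absolute figure will be rough. What matters is that every idea is computed the same way, so the outputs are comparable on one scale.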
Layer 2: Pre-Test MDE Calculation
Before ranking a test, calculate whether you can actually detect the expected effect. This is where most teams waste the most capacity.
Minimum Detectable Effect (MDE) is the smallest improvement your test can reliably detect given your traffic volume and test duration. If your test idea expects a 2% lift but your MDE at 80% power is 5%, you will never detect the effect even if it exists. The test will come back inconclusive, and you will have wasted the slot.
At NRG, we calculate MDE for every test idea before it enters the ranked backlog:
Required per-variant sample = 16 x (variance / MDE^2)
If the required sample size exceeds what we can get in a reasonable test duration (typically 4-6 weeks), the test either needs to be redesigned (larger expected effect, broader audience) or deprioritized. Running a test you cannot power is worse than not running it.
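Here is a minimal sketch of that pre-test check, assuming a conversion-rate metric (where the variance is approximately p(1 - p)) and hypothetical traffic numbers:

```python
# Rule-of-thumb sample size at alpha = 0.05 and 80% power.
# For a conversion rate, variance is approximately p * (1 - p).
baseline_cr = 0.032                    # current conversion rate
mde_abs = 0.006                        # smallest absolute lift worth detecting
weekly_visitors_per_variant = 5_000    # assumed traffic per variant

variance = baseline_cr * (1 - baseline_cr)
required_per_variant = 16 * variance / mde_abs ** 2

weeks_needed = required_per_variant / weekly_visitors_per_variant
print(f"Need ~{required_per_variant:,.0f} visitors per variant "
      f"(~{weeks_needed:.1f} weeks at current traffic)")
```

If `weeks_needed` comes back above the 4-6 week window, the idea gets redesigned or parked before it ever consumes a slot.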
Layer 3: Opportunity Cost Scoring
Every test that runs displaces another test that could have run. We explicitly calculate this.
For each test, divide the projected revenue impact by the estimated test duration in weeks. This gives you revenue impact per week of testing capacity consumed. A test with $200K projected annual impact that takes 6 weeks (about $33K per week) ranks below a test with $150K impact that takes 2 weeks ($75K per week).
Opportunity Score = Revenue Impact / Test Duration (weeks)
This naturally deprioritizes slow, low-impact tests and surfaces quick wins with disproportionate payoff.
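Ranking a backlog this way is a few lines of code. The numbers below are illustrative, not real test data:

```python
# Rank a hypothetical backlog by revenue impact per week of capacity consumed.
backlog = [
    {"name": "Test A", "annual_impact": 200_000, "weeks": 6},
    {"name": "Test B", "annual_impact": 150_000, "weeks": 2},
    {"name": "Test C", "annual_impact": 90_000,  "weeks": 3},
]

for test in backlog:
    test["opportunity_score"] = test["annual_impact"] / test["weeks"]

for test in sorted(backlog, key=lambda t: t["opportunity_score"], reverse=True):
    print(f"{test['name']}: ${test['opportunity_score']:,.0f}/week")
```

Test B wins at $75K per week despite having the smallest headline impact of the top two, which is exactly the quick-win behavior you want the ranking to surface.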
How to Say No to Stakeholder Pet Projects
This is the real reason you need a quantitative framework. Without one, saying no to the SVP who wants to test their favorite button color is a political battle. With revenue ranking, it is a math problem.
The conversation changes from "we do not think your idea is good" to "your idea scored 340 on revenue ranking, and the current cutoff for Q2 is 520. Here is what would need to change to get it above the line."
Three tactics that work:
Show the tradeoff explicitly. "If we run Test A (your idea), we cannot run Tests B and C, which have a combined projected impact 4x larger. Should we make that trade?" Most stakeholders will back down when they see the numbers.
Offer a lower-cost alternative. "We cannot run a full test on this, but we can add it as a secondary metric on Test D, which is already planned. That way we get directional data without consuming a testing slot." This gives the stakeholder something without derailing the roadmap.
Create a parking lot with criteria. "This test is parked until we have enough traffic on that page to detect the expected effect. When monthly unique visitors hit X, it automatically enters the queue." This is not a no -- it is a not yet, with clear criteria for when it becomes a yes.
The key principle: never make it personal. The framework decides. You are just the messenger.
Tests That Matter to the Bottom Line vs Tests That Matter to One Team
At NRG, we categorize every test idea into one of three tiers:
Tier 1: Revenue-impacting. These tests directly affect acquisition, conversion, retention, or revenue. They get priority and the best traffic. Examples: checkout flow optimization, plan selection page tests, pricing presentation experiments.
Tier 2: Experience-impacting. These tests affect user experience metrics (NPS, satisfaction, task completion) without a direct revenue line. They run when Tier 1 tests are not consuming all capacity. Examples: navigation redesign, content readability improvements, help center optimization.
Tier 3: Internal-stakeholder-driven. These are tests that a specific team wants to run, often to validate a decision they have already made. They run only when there is spare capacity AND they meet minimum MDE requirements. Examples: brand color changes, copy tone adjustments, layout preferences.
This tiering is transparent. Every stakeholder knows it exists and knows where their test idea falls. The transparency reduces friction because the rules apply equally to everyone.
A Template You Can Use Right Now
Here is the scoring template we use at NRG, adapted for general use:
| Criteria | Weight | Score Range | Your Score |
|---|---|---|---|
| Projected annual revenue impact | 30% | $0-$1M+ | |
| Traffic available / MDE feasibility | 25% | Can detect / Cannot detect | |
| Opportunity score (impact/duration) | 20% | Low / Med / High | |
| Strategic alignment | 15% | 1-10 | |
| ICE composite (traditional) | 10% | 1-1000 | |
Scoring rules:
- Any test that cannot meet MDE requirements is automatically deprioritized regardless of other scores
- Revenue impact uses conservative estimates (50th percentile, not optimistic case)
- Opportunity score penalizes tests longer than 6 weeks
- Strategic alignment is the one subjective measure -- used for tiebreakers, not primary ranking
Process:
- All test ideas submitted with a brief (hypothesis, expected metric impact, target audience)
- Experimentation team adds MDE calculation and revenue projection
- Tests scored and ranked quarterly
- Top-ranked tests enter the active queue
- Quarterly review: compare projected vs actual impact to calibrate future scoring
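To make the weighting concrete, here is a minimal sketch of the composite calculation. The normalization choices (capping revenue at $1M, mapping Low/Med/High to 1-3) are illustrative assumptions, not our production scoring code:

```python
# Weighted composite using the template above. Normalization choices here
# (revenue capped at $1M, Low/Med/High mapped to 1-3) are illustrative assumptions.
WEIGHTS = {
    "revenue_impact": 0.30,
    "mde_feasible":   0.25,
    "opportunity":    0.20,
    "strategic":      0.15,
    "ice":            0.10,
}

def composite_score(idea: dict) -> float:
    """Score a test idea on a 0-1 scale; an MDE failure zeroes it out."""
    if not idea["mde_feasible"]:
        return 0.0  # hard gate: cannot-detect is deprioritized regardless of other scores
    normalized = {
        "revenue_impact": min(idea["annual_impact"] / 1_000_000, 1.0),
        "mde_feasible":   1.0,
        "opportunity":    idea["opportunity"] / 3,    # Low=1, Med=2, High=3
        "strategic":      idea["strategic"] / 10,     # 1-10 scale
        "ice":            idea["ice"] / 1000,         # 1-1000 ICE composite
    }
    return sum(WEIGHTS[key] * normalized[key] for key in WEIGHTS)

idea = {"annual_impact": 400_000, "mde_feasible": True,
        "opportunity": 3, "strategic": 7, "ice": 420}
print(f"Composite score: {composite_score(idea):.2f}")  # ~0.72
```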
This is the Revenue Rank step in the PRISM Method. It ensures that every experiment slot is spent on the highest-value opportunity available.
Common Mistakes in Roadmap Building
Ranking once and forgetting. Business conditions change. A test idea that was low-priority in Q1 might become urgent in Q2 because of a competitive move or a traffic pattern shift. Re-rank at least quarterly.
Ignoring maintenance tests. Not every test is a new idea. Some slots need to go to re-testing previous winners (have they held up?), running AA tests for calibration, and validating measurement changes. Budget 15-20% of capacity for maintenance.
Optimizing for quantity over quality. Running 100 bad tests is worse than running 30 great ones. Revenue ranking naturally prevents this, but only if you are honest about the revenue projections.
Not tracking calibration over time. After each test, record the projected impact vs the actual impact. Over quarters, this calibration data tells you whether your team systematically overestimates or underestimates. Adjust your scoring accordingly.
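A minimal sketch of that calibration check, with hypothetical records:

```python
# Calibration check: mean ratio of actual to projected impact (records hypothetical).
history = [
    {"test": "Q1-03", "projected": 120_000, "actual": 80_000},
    {"test": "Q1-07", "projected": 60_000,  "actual": 75_000},
    {"test": "Q2-01", "projected": 300_000, "actual": 150_000},
]

ratios = [record["actual"] / record["projected"] for record in history]
calibration = sum(ratios) / len(ratios)
print(f"Mean actual/projected ratio: {calibration:.2f}")  # below 1.0 -> overestimating
```

A ratio consistently below 1.0 means your projections run hot; scale future revenue estimates down by roughly that factor.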
The Bottom Line
An experimentation roadmap is a resource allocation problem. Revenue ranking turns it from a political negotiation into an evidence-based process. You will still have hard conversations with stakeholders, but those conversations will be grounded in math instead of opinions.
Build the framework. Use it consistently. Calibrate it over time. Your win rate -- and your sanity -- will improve.
Atticus Li leads enterprise experimentation at NRG Energy with a 24%+ win rate across 150+ total experiments. The Revenue Ranking framework is part of the PRISM Method. Learn more at atticusli.com.