False Starts Are Killing Your Testing Program: The Pre-Launch QC Checklist That Prevents 90% of Them
Seven tests in one enterprise program had false starts — bugs and errors caught post-launch. Each cost 1-4 weeks. Here's the pre-launch QC checklist that prevents 90% of them.
A false start is not a failed test. A failed test gives you information. A false start gives you nothing — except wasted time and a dataset you cannot trust.
In an audit I ran across a multi-brand enterprise testing program, seven tests had false starts: bugs, targeting errors, analytics misconfigurations, or naming conflicts discovered after the test was already live. The discovery window ranged from a few hours to five days. The cost ranged from one week of lost runtime to four weeks, plus data that had to be discarded entirely.
Every single one of those false starts was preventable with a structured pre-launch QC process. I know that because one developer on the team had already built exactly that process — a comprehensive checklist covering functional testing, cross-browser validation, analytics, performance, accessibility, and sign-off. Tests that went through his checklist had zero false starts.
This article breaks down each false start, what caused it, what it cost, and the checklist that would have caught it before any traffic saw the test.
The 7 False Starts: What Happened and What Each One Cost
False Start 1: The Form Bug That Stopped Users Mid-Flow
A variant was launched to test a redesigned multi-step enrollment form. The design team had made a change to the logic governing field visibility on the second step — but the change introduced a bug that prevented the form from advancing past step one under certain input conditions.
The bug was not caught in the developer's own browser because it only triggered with specific input combinations that were common in production but not in the test developer's manual walkthrough. It was discovered by a quality analyst three days after launch, when she noticed completion rates in the variant were near zero.
Three days of data had to be discarded. The fix required a new deployment, adding another two days before the test could restart. Total impact: roughly five days of lost runtime, a reset of the traffic split, and a read date pushed back by more than three weeks.
This is the most preventable class of false start. A structured functional testing protocol — including test scenarios with realistic input combinations across different user paths — would have caught it during QA.
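As an illustration, a parameterized end-to-end test can cover the realistic input combinations a single manual walkthrough misses. The sketch below assumes a Playwright setup; the URL, selectors, and input values are hypothetical placeholders for your own form.

```typescript
// functional-paths.spec.ts — a minimal sketch, assuming Playwright.
// The URL, selectors, and input combinations are hypothetical; replace them
// with your enrollment form's real fields and production-realistic values.
import { test, expect } from '@playwright/test';

const inputCombinations = [
  { zip: '77002', plan: 'fixed-12', email: 'qa+a@example.com' },
  { zip: '10001', plan: 'variable', email: 'qa+b@example.com' },
  { zip: '60601', plan: 'fixed-24', email: 'qa+c@example.com' },
];

for (const combo of inputCombinations) {
  test(`step 1 advances with ${combo.zip}/${combo.plan}`, async ({ page }) => {
    await page.goto('https://example.com/enroll?variant=b'); // hypothetical variant URL
    await page.locator('#zip').fill(combo.zip);
    await page.locator('#plan').selectOption(combo.plan);
    await page.locator('#email').fill(combo.email);
    await page.locator('button[type="submit"]').click();

    // The false start above was a form that silently refused to advance past
    // step one for certain inputs — assert the next step actually renders.
    await expect(page.locator('[data-step="2"]')).toBeVisible();
  });
}
```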
False Start 2: The Page Name That Belonged to Two Pages
An analytics implementation for a test used a page name identifier that was already in use on a different page in the same property. Both pages were named identically in the analytics configuration, which meant the reporting dashboard was combining traffic and events from both pages into a single stream.
The test results looked implausible from day one — sample sizes were nearly double what was expected, and the event rates were not consistent with what session recordings showed. The root cause took two days to diagnose: every analytics event from the unrelated page was being counted in the test's dataset.
All data collected up to that point was unusable. The analytics implementation had to be corrected, the test restarted from zero, and the team lost over three weeks on a test that had already consumed significant development time. The correct fix — a simple naming audit before launch — takes under five minutes.
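That audit is easy to script. The sketch below assumes you can export a mapping of page paths to their reported analytics page names (the shape shown is hypothetical); it simply flags any name shared by more than one path.

```typescript
// naming-audit.ts — a minimal sketch of the pre-launch naming audit.
// The pageNames map is a hypothetical export of your analytics configuration;
// build it however your stack allows (tag manager export, CMS metadata, etc.).
const pageNames: Record<string, string> = {
  '/enroll/step-1': 'Enrollment | Step 1',
  '/enroll/step-2': 'Enrollment | Step 2',
  '/account/plans': 'Enrollment | Step 1', // duplicate — would pollute the test dataset
};

// Group paths by the page name they report.
const byName = new Map<string, string[]>();
for (const [path, name] of Object.entries(pageNames)) {
  byName.set(name, [...(byName.get(name) ?? []), path]);
}

// Flag any page name claimed by more than one path.
for (const [name, paths] of byName) {
  if (paths.length > 1) {
    console.warn(`Page name "${name}" is shared by: ${paths.join(', ')}`);
  }
}
```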
False Start 3: Targeting That Was Too Broad
A test designed for users in a specific enrollment stage was accidentally launched with targeting that included users from every stage of the funnel, including post-enrollment confirmation pages and account management pages.
The targeting error was not immediately obvious because early traffic numbers looked plausible. It was only when a data analyst segmented the results by funnel stage — five days into the test — that she found users from post-enrollment stages making up 30% of the sample. Those users had fundamentally different behavioral patterns, and their inclusion was distorting both the control and variant results in ways that could not be corrected retroactively.
The first five days of data were discarded, and the corrected test had to rebuild its sample from scratch, running for three additional weeks.
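One way to make that segmentation a pre-launch step rather than a day-five discovery: pull a small sample of the sessions your platform says qualify and check each one against the intended targeting rule. Everything in the sketch below — the session shape, the rule, the sample data — is hypothetical.

```typescript
// targeting-audit.ts — a minimal sketch of checking platform targeting against
// a sample of real sessions. Session shape, rule, and data are hypothetical.
type Session = { id: string; funnelStage: string; landingPath: string };

// The audience the test is *supposed* to reach.
const intendedAudience = (s: Session) => s.funnelStage === 'enrollment';

// A sample of sessions your testing platform reports as qualifying.
const qualifyingSample: Session[] = [
  { id: 's1', funnelStage: 'enrollment', landingPath: '/enroll/step-1' },
  { id: 's2', funnelStage: 'post-enrollment', landingPath: '/account/confirmation' },
  // ...pull at least ten from your analytics before launch
];

const outOfScope = qualifyingSample.filter((s) => !intendedAudience(s));
if (outOfScope.length > 0) {
  console.warn(
    `${outOfScope.length} qualifying session(s) fall outside the intended audience:`,
    outOfScope,
  );
}
```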
False Start 4: The Third-Party Dependency That Loaded After the Test Element
A variant was designed around a redesigned component that relied on a third-party chat widget for context — the variant moved the widget's launch trigger to a different point in the flow. What the team did not account for was that the third-party widget had an asynchronous load sequence that, under high-traffic conditions, caused it to load after the test's variant JavaScript had already executed.
On fast connections and desktop, the test worked correctly. On mobile and slower connections — which represented over 60% of the traffic — the widget loaded after the test's modified trigger point, producing an experience that was neither the control nor a coherent variant. Users on slower connections saw an inconsistent intermediate state.
The test ran for a full week before the issue was surfaced by a mobile-specific analysis. A performance testing protocol — specifically, testing on throttled connection speeds as part of pre-launch QC — would have caught this within the first hour.
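A throttled-connection check does not require special tooling. The sketch below assumes Playwright on Chromium and uses the Chrome DevTools Protocol to emulate a slow connection; the URL, throttling values, and widget selector are hypothetical.

```typescript
// throttled-load.spec.ts — a minimal sketch of a slow-connection check
// (Chromium only). URL, throttling values, and selector are hypothetical.
import { test, expect } from '@playwright/test';

test('variant renders coherently on a slow connection', async ({ page }) => {
  const cdp = await page.context().newCDPSession(page);
  await cdp.send('Network.emulateNetworkConditions', {
    offline: false,
    latency: 400,                          // ms round-trip, roughly "Slow 4G"
    downloadThroughput: (400 * 1024) / 8,  // bytes per second
    uploadThroughput: (400 * 1024) / 8,
  });

  await page.goto('https://example.com/enroll?variant=b');

  // The false start above only surfaced when the chat widget loaded after the
  // variant script; assert the modified trigger is actually present and usable.
  await expect(page.locator('#chat-launcher')).toBeVisible({ timeout: 15_000 });
});
```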
False Start 5: The Analytics Event That Fired in Both Variants
A primary conversion metric was an analytics event that fired when a user completed a specific form submission. The event implementation was correct for the variant. However, a duplicate event tag from a previous test had not been cleaned up, and it was also firing on the control — resulting in duplicate event counts on the control side only.
The control's conversion rate appeared artificially inflated, making the variant look like it was performing significantly worse than it actually was. When the duplicate tag was discovered and removed, the apparent performance gap closed substantially.
Seven days of data were already contaminated. The test had to be restarted. In a more consequential case, this kind of contamination could lead a team to kill a genuinely winning variant based on the inflated control numbers.
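A duplicate tag is straightforward to catch before launch by counting analytics hits during one scripted conversion on each arm. The sketch below assumes Playwright; the request pattern, event name, and page flow are hypothetical and need to match your analytics vendor's payload format.

```typescript
// event-count.spec.ts — a minimal sketch that counts primary-conversion hits
// during one scripted conversion. Endpoint pattern and event name are hypothetical.
import { test, expect } from '@playwright/test';

test('primary conversion event fires exactly once on the control', async ({ page }) => {
  let conversionHits = 0;

  // Count outgoing analytics requests that carry the conversion event.
  await page.route('**/collect*', async (route) => {
    const body = route.request().postData() ?? route.request().url();
    if (body.includes('enrollment_complete')) conversionHits += 1;
    await route.continue();
  });

  await page.goto('https://example.com/enroll'); // control experience
  // ...fill and submit the form as a real user would (steps omitted)...
  await page.locator('button[type="submit"]').click();

  expect(conversionHits).toBe(1); // a leftover duplicate tag shows up as 2+
});
```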
False Start 6: The A/A Test That Was Actually an A/B Test
An A/A test — designed to validate that the testing platform was randomizing correctly and that both samples were seeing identical experiences — showed a statistically significant difference between the control and the "identical" variant.
Investigation revealed that a configuration change had accidentally introduced a minor style difference in one element. The A/A test was actually an A/B test. The difference was subtle enough that the developer who set up the test had not noticed it during visual review, but the analytics were correctly detecting a real difference in user behavior driven by that element.
The A/A test had to be corrected and rerun, delaying the actual experiment it was designed to validate by two weeks. The lesson: A/A tests require the same visual validation and QC that any A/B test requires.
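The QC step that catches this is a direct visual comparison of the two arms before launch. The sketch below assumes Playwright and hypothetical arm URLs; strict byte equality is shown only for brevity, since a real check would usually allow a tiny pixel-diff threshold to absorb dynamic content.

```typescript
// aa-visual-check.spec.ts — a minimal sketch of visually validating an A/A test.
// Arm URLs are hypothetical; strict equality is illustrative only.
import { test, expect } from '@playwright/test';

test('both A/A arms render identically', async ({ page }) => {
  await page.goto('https://example.com/enroll?arm=a1');
  const armOne = await page.screenshot({ fullPage: true });

  await page.goto('https://example.com/enroll?arm=a2');
  const armTwo = await page.screenshot({ fullPage: true });

  // Any unintended style difference — like the one that turned this A/A test
  // into an accidental A/B test — fails this comparison.
  expect(armTwo.equals(armOne)).toBe(true);
});
```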
False Start 7: The Variant That Launched Without Sign-Off
This one is organizational rather than technical. A test launched by a team operating outside the core testing program — a brand team that had been given access to the testing platform — went live without going through the sign-off process. The test had not been reviewed by the analytics function, and the conversion event it was measuring was misconfigured: it was firing on page load rather than on actual form submission.
The test ran for a week. Results looked promising but were measuring page loads, not conversions. The error was only discovered when someone tried to reconcile the test's reported conversion rate against backend transaction data and found a 400% discrepancy.
This is the organizational pattern that produced 3x more false starts: tests launched by teams operating outside the structured QC process had dramatically higher rates of issues than tests that went through the checklist.
Key Takeaway: False starts are not random bad luck. They are predictable failures with identifiable root causes. Functional bugs, targeting errors, analytics misconfigurations, third-party dependency issues, and duplicate tags are all catchable before a single user sees the test — if you have a structured process to look for them.
The Taxonomy of False Starts
Categorizing all seven false starts surfaces four root-cause categories:
Code bugs (2 of 7): Logic errors in the variant implementation that produce broken or inconsistent user experiences. These are the most obviously preventable with structured functional testing across realistic user paths.
Targeting errors (1 of 7): Incorrect audience definitions that include users outside the intended scope, or exclude users who should be included. These require explicit validation of targeting rules against a sample of actual user segments before launch.
Analytics misconfigurations (3 of 7): Incorrect event implementations, duplicate tags, page naming conflicts, or metric definitions that do not match what they claim to measure. These require an analytics pre-check — verifying that the dashboard shows the right data before the test goes live.
Third-party dependencies (1 of 7): External scripts, widgets, or integrations that behave differently under production conditions (load order, connection speed, ad blockers) than in a controlled test environment.
The Real Cost: Not Just Lost Time
The obvious cost of a false start is the lost runtime. If your test needed four weeks to reach significance and you lose ten days to a false start, your read date moves out by roughly a month once the restart and traffic re-ramp are factored in. That delay compounds if other tests are queued behind it.
But the less obvious cost is data contamination. In three of the seven false starts above, data had to be discarded entirely. Contaminated data is worse than no data — it can lead to incorrect conclusions about which direction to move. The analytics misconfiguration that inflated control rates (False Start 5) is a clear example: if that test had been read at the wrong moment, a winning variant might have been killed.
And there is an organizational cost that does not show up in any metric: trust. When a testing program produces contaminated tests, false positives, and restarted experiments, stakeholders lose confidence in the program's results. I have seen testing programs defunded not because they produced bad results, but because they appeared to produce unreliable results — when the underlying issue was process, not insight quality.
Key Takeaway: The real cost of false starts is not the lost days — it is the contaminated data and eroded stakeholder trust that follows when results cannot be reconciled against reality. A QC process protects the credibility of the entire program.
The Pre-Launch QC Checklist That Catches 90% of False Starts
This checklist was built by one developer on the team based on his own experience with false starts on earlier tests. Tests that went through this checklist had zero false starts across the remainder of the program.
I have organized it into six categories.
Category 1: Functional Testing
- [ ] Walk through every user path in the variant using realistic input data — not just the happy path
- [ ] Test all conditional logic: what happens when required fields are left blank, when input formats are invalid, when optional fields are skipped
- [ ] Verify that every CTA, button, and interactive element behaves correctly in the variant
- [ ] Confirm that the variant does not produce any JavaScript console errors under normal usage (a minimal automated check follows this list)
- [ ] Test the variant with common browser extensions enabled (particularly ad blockers and privacy tools) that may affect script execution
- [ ] Test with third-party scripts disabled to confirm the variant degrades gracefully
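A minimal sketch of the console-error item, assuming Playwright and a hypothetical variant URL — it collects console and page errors while the main interaction is exercised and fails if any appear:

```typescript
// console-errors.spec.ts — a minimal sketch; URL and interaction are hypothetical.
import { test, expect } from '@playwright/test';

test('variant produces no console or page errors', async ({ page }) => {
  const errors: string[] = [];
  page.on('console', (msg) => {
    if (msg.type() === 'error') errors.push(msg.text());
  });
  page.on('pageerror', (err) => errors.push(err.message));

  await page.goto('https://example.com/enroll?variant=b');
  await page.locator('button[type="submit"]').click(); // exercise the main interaction

  expect(errors).toEqual([]);
});
```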
Category 2: Cross-Browser and Cross-Device Testing
- [ ] Test in the three highest-traffic browsers for your audience — check your analytics for the split before choosing (a sample browser/device matrix follows this list)
- [ ] Test on both iOS Safari and Android Chrome at a minimum
- [ ] Test on a throttled connection speed (simulate 3G or Slow 4G) to surface load-order dependency issues
- [ ] Verify that the variant layout does not break at common viewport widths (360px, 768px, 1024px, 1440px)
- [ ] Confirm that interactive elements are usable on touch devices (tap targets are large enough, hover states have touch equivalents)
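One way to encode the browser and device coverage is a project matrix in your test runner. The sketch below assumes Playwright; swap in the browsers and devices that actually match your traffic split.

```typescript
// playwright.config.ts — a sketch of a cross-browser/device matrix.
// Choose projects based on your own analytics, not this hypothetical list.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'chromium-desktop', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox-desktop', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit-desktop', use: { ...devices['Desktop Safari'] } },
    { name: 'ios-safari', use: { ...devices['iPhone 13'] } },
    { name: 'android-chrome', use: { ...devices['Pixel 5'] } },
  ],
});
```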
Category 3: Analytics Validation
- [ ] Verify that the primary conversion metric fires exactly once per qualifying conversion — no more, no fewer
- [ ] Confirm there are no duplicate event tags from previous tests still active on the same page
- [ ] Validate that the page name or identifier used in analytics is unique to this page — not shared with any other page in the property
- [ ] Run a brief QA session in the variant and verify that your analytics dashboard reflects the events you triggered within five minutes
- [ ] Check that the targeting definition, as configured in the testing platform, maps correctly to the intended audience — pull a sample of qualifying sessions and verify at least ten against the targeting rules
- [ ] Confirm that the primary metric definition (what counts as a conversion) is consistent with how the business team will read the result
Category 4: Performance
- [ ] Run a performance audit on both control and variant — the variant's additional script load should not degrade page speed by more than an acceptable threshold (typically 10-15% on Largest Contentful Paint)
- [ ] Confirm that the variant script loads before any third-party elements it depends on, or that it handles asynchronous load gracefully
- [ ] Verify that the variant does not cause layout shifts (Cumulative Layout Shift) that would degrade the user experience or affect how users interact with key elements (a quick in-browser check follows this list)
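For the layout-shift item, a quick in-browser check is enough to spot obvious regressions. This sketch uses the standard Layout Instability API and can be pasted into the DevTools console on the variant (the 10-second window is an arbitrary choice):

```typescript
// A quick CLS spot-check using the Layout Instability API.
// Paste into the DevTools console (or run via page.evaluate) on the variant.
let cls = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries() as any[]) {
    // Shifts caused by recent user input are excluded from CLS by definition.
    if (!entry.hadRecentInput) cls += entry.value;
  }
}).observe({ type: 'layout-shift', buffered: true });

setTimeout(() => console.log('CLS so far:', cls.toFixed(3)), 10_000);
```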
Category 5: Accessibility
- [ ] Verify that the variant does not remove or obscure any ARIA labels that screen readers depend on (an automated scan, sketched after this list, catches many of these issues)
- [ ] Confirm that any new interactive elements in the variant are keyboard-navigable
- [ ] Check that color contrast ratios in new visual elements meet WCAG AA standards
- [ ] Verify that form labels are still correctly associated with their inputs in the variant
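Most of these accessibility items can be automated. The sketch below assumes the @axe-core/playwright package plus a hypothetical variant URL and container selector; it limits the scan to the region the variant changes.

```typescript
// a11y.spec.ts — a minimal sketch of an automated WCAG A/AA scan.
// URL and the scanned region are hypothetical.
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('variant introduces no WCAG A/AA violations', async ({ page }) => {
  await page.goto('https://example.com/enroll?variant=b');

  const results = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa'])
    .include('#enrollment-form') // limit the scan to what the variant changes
    .analyze();

  expect(results.violations).toEqual([]);
});
```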
Category 6: Sign-Off
- [ ] Developer sign-off: functional testing complete, no known bugs, code reviewed by a second developer
- [ ] Analytics sign-off: event implementation verified, dashboard showing correct data, no duplicate tags
- [ ] Test owner sign-off: hypothesis, success metric, and minimum detectable effect are documented
- [ ] Stakeholder sign-off: any brand, legal, or compliance requirements for the variant copy or design have been reviewed
Key Takeaway: The checklist is not bureaucracy — it is structured memory. Without it, QC quality depends entirely on who happens to be working on a given test and what they happen to remember to check. The checklist makes the QC process consistent, regardless of who runs the test.
The Organizational Pattern: QC Is a Team Sport
One of the clearest patterns in the seven false starts was organizational: tests launched by teams operating outside the core testing process had a failure rate approximately three times higher than tests that went through the structured QC workflow.
This is not a statement about the competence of those teams. It is a statement about process. The core testing team had built up institutional knowledge — hard-won through exactly the kinds of false starts documented here — that lived inside the checklist. Teams without that history did not know what to look for, not because they were careless, but because they had not yet run the tests that taught the lessons.
The practical implication: QC gates need to be structural, not optional. If teams can launch tests without going through the QC process, some of them will — especially under deadline pressure. The access controls on your testing platform, and the sign-off requirements in your process, need to make skipping QC more difficult than doing it.
The testing program I audited moved to a required sign-off model — no test could be activated in the platform without an analytics team member confirming the event implementation. This single change would have prevented three of the seven false starts.
The Analytics Pre-Check: Validate Before You Launch
This deserves its own section because it is both the most impactful single step and the most commonly skipped.
Before any test goes live, someone should spend fifteen minutes doing the following:

1. Open the test's live experience in a private or incognito window
2. Go through the variant as a real user would — including triggering the primary conversion event
3. Verify within five minutes that the analytics dashboard shows the event you just fired, in the correct count, attributed to the correct variant
4. Repeat the same validation for the control
This process catches analytics misconfigurations, duplicate tags, naming conflicts, and incorrect metric definitions — which together accounted for three of the seven false starts in this dataset — in about fifteen minutes.
The question this check answers is: "Does my dashboard show what actually happened, or is it showing me something else?" Without this check, you will not know the answer until you try to reconcile your test results against backend data days or weeks later.
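If your stack uses a GTM-style dataLayer, a small console snippet during the pre-check makes that comparison concrete: it logs every event pushed while you walk through the variant, so you can reconcile what actually fired against what the dashboard reports. The snippet below is a sketch under that assumption — adapt it to whatever event bus your analytics uses.

```typescript
// Paste into the browser console during the pre-check walkthrough.
// Assumes a Google Tag Manager-style dataLayer; adapt for other stacks.
(() => {
  const w = window as any;
  w.dataLayer = w.dataLayer || [];
  const originalPush = w.dataLayer.push.bind(w.dataLayer);
  w.dataLayer.push = (...events: unknown[]) => {
    events.forEach((e) => console.log('[pre-check] dataLayer event:', e));
    return originalPush(...events);
  };
})();
```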
I track pre-launch analytics validation as a required checklist item in [GrowthLayer](https://growthlayer.app), specifically because it is the step most likely to be skipped under time pressure and most likely to produce contaminated data when it is.
How to Implement QC Gates Without Slowing Velocity
The objection I hear most often is velocity: "If we require all of this before every launch, tests will take longer to ship."
The counter is empirical. The seven false starts in this dataset consumed a combined total of roughly 18 weeks of lost test runtime, plus the developer hours required to diagnose, fix, and restart each test. The checklist, applied before launch, adds an estimated 2-4 hours per test.
Applied across a full year's testing roadmap, a comprehensive pre-launch QC process would add approximately 84-168 hours of effort. Avoiding those 18 weeks of contaminated or lost runtime would recover many times that in analyst time, developer time, and delayed program value.
Velocity is not how many tests you can launch. It is how many valid results you can generate per quarter. False starts actively destroy velocity by consuming program capacity without producing usable insights.
The practical implementation:
- Build the checklist into your test ticket template so it is not a separate step — it is part of the definition of "done" for every test
- Make analytics sign-off a blocking requirement in your testing platform's workflow, not a nice-to-have
- Assign a specific owner for each checklist category (developer for functional and performance, analyst for analytics, UX for accessibility) so responsibility is clear
- Track false start rate as a program health metric — make it visible so teams have incentive to maintain QC discipline
[GrowthLayer](https://growthlayer.app) was built in part to address this problem at the program level: a structured pipeline where test templates include pre-launch validation steps, and test status tracks which QC stages have been signed off before activation.
Conclusion
Seven false starts in a single enterprise program is a significant contamination rate. It means roughly one in six tests in an unstructured program will produce either no result or a misleading one before the team even knows there is a problem. Across a year of work, that is months of program capacity consumed by preventable failure.
The checklist in this article will not catch every possible issue. But based on the taxonomy of what actually goes wrong in real testing programs — code bugs, targeting errors, analytics misconfigurations, third-party dependencies — it catches the patterns that account for the overwhelming majority of false starts.
Build it into your process before the next test launches. It is the highest-leverage investment a testing program can make.
Ready to run a tighter testing program with built-in QC gates and pre-launch validation? [GrowthLayer](https://growthlayer.app) tracks your test pipeline, flags pre-launch checklist gaps, and gives your team a shared system for running cleaner experiments.
_Atticus Li is a CRO Strategist and the Founder of [GrowthLayer](https://growthlayer.app), a platform for managing and improving enterprise experimentation programs._
Applied Experimentation Lead at NRG Energy (Fortune 150) · Creator of the PRISM Method
Atticus Li leads applied experimentation at NRG Energy (Fortune 150), where he and his team run more than 100 controlled experiments per year on customer-facing surfaces. He is the creator of the PRISM Method, a framework for high-velocity experimentation programs at large enterprises. He writes regularly about the statistical and operational details of A/B testing — the parts most CRO content skips.