
Multiple Comparisons Calculator

Adjust p-values for multiple hypothesis tests using Bonferroni, Holm, or Benjamini-Hochberg correction to control false positives.

The calculator accepts between 2 and 20 metrics with their observed p-values. The example below uses two metrics at a significance threshold of 0.05, the conventional choice for a 95% confidence level.

Results

0 of 2 metrics remain significant after Holm (Step-Down) correction.

1 metric lost significance after correction.

Metric             Original p   Adjusted p   Significant?
Conversion Rate    0.0300       0.0600       No (was significant before correction)
Revenue per User   0.1200       0.1200       No

Before vs After Correction

[Chart: original vs. adjusted p-value for each metric. Blue bars show the original confidence, green/orange bars the adjusted confidence, and a red line marks the 0.05 significance threshold.]

What was applied?

Holm's step-down procedure sorted the p-values in ascending order and applied progressively less severe corrections: with two tests, the smallest p-value was multiplied by 2 and the other by 1. This is more powerful than Bonferroni while providing the same FWER guarantee.
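As a concrete sketch of that arithmetic (pure Python with an illustrative function name, not the calculator's actual source), the step-down adjustment that produced the table above:

```python
# Holm step-down: sort p-values ascending, multiply the smallest by n,
# the next by n-1, and so on, enforcing monotonicity along the way.
def holm_adjust(pvals):
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        p = min(1.0, pvals[i] * (n - rank))
        running_max = max(running_max, p)  # never below a previous adjustment
        adjusted[i] = running_max
    return adjusted

print(holm_adjust([0.03, 0.12]))  # [0.06, 0.12], matching the results above
```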

Methodology

This calculator applies three widely used methods for adjusting p-values when performing multiple hypothesis tests simultaneously.

Family-Wise Error Rate (FWER) methods, Bonferroni and Holm, control the probability of making even one false positive across all tests:

Bonferroni: The simplest correction. Each p-value is multiplied by the number of tests (n): adjusted p = min(1, p × n). While easy to understand, it is the most conservative method and can miss real effects when n is large.

Holm (Step-Down): A sequential improvement over Bonferroni. P-values are sorted in ascending order; the smallest is multiplied by n, the next by n - 1, and so on. Monotonicity is enforced (each adjusted p-value must be at least as large as the previous one). Holm is uniformly more powerful than Bonferroni while maintaining the same FWER guarantee.

The False Discovery Rate (FDR) method, Benjamini-Hochberg, controls the expected proportion of false positives among rejected hypotheses:

Benjamini-Hochberg: P-values are sorted ascending and assigned ranks. Adjusted p_i = p_i × n / rank_i, with monotonicity enforced from the largest rank to the smallest. This method is less conservative than FWER methods and is preferred in exploratory analyses where some false positives are acceptable as long as their overall proportion is controlled.

All three methods guarantee that the adjusted p-value lies between the original p-value and 1.
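To make the remaining two formulas concrete, here is a minimal pure-Python sketch (illustrative names, not the calculator's implementation; Holm is sketched in the previous section):

```python
# Bonferroni: multiply every p-value by the number of tests, capped at 1.
def bonferroni(pvals):
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]

# Benjamini-Hochberg: adjusted p_i = p_i * n / rank_i, with monotonicity
# enforced while walking from the largest p-value down to the smallest.
def benjamini_hochberg(pvals):
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # rank n down to rank 1
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = min(1.0, running_min)
    return adjusted

print(bonferroni([0.03, 0.12]))          # [0.06, 0.24]
print(benjamini_hochberg([0.03, 0.12]))  # [0.06, 0.12]
```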

Frequently Asked Questions

What is the multiple testing problem?
When you test multiple metrics or comparisons simultaneously, the probability of at least one false positive increases dramatically. For example, testing 20 metrics at a 5% significance level gives you a ~64% chance of at least one false positive, even if there is no real effect. Multiple comparison corrections adjust p-values to control this inflated error rate.
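The ~64% figure assumes the 20 tests are independent; it is simply 1 - (1 - 0.05)^20:

```python
# Chance of at least one false positive across m independent tests.
m, alpha = 20, 0.05
print(1 - (1 - alpha) ** m)  # 0.6415..., i.e. roughly a 64% chance
```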
When should I adjust p-values for multiple comparisons?
You should adjust p-values whenever you are testing multiple hypotheses simultaneously — for example, when measuring multiple metrics (conversion rate, revenue, engagement) in the same A/B test, when running an A/B/C/D test with multiple variant-vs-control comparisons, or when looking at results across multiple segments. If you only test one metric per experiment, no correction is needed.
What is the difference between Bonferroni, Holm, and Benjamini-Hochberg?
Bonferroni is the simplest and most conservative: it multiplies each p-value by the number of tests. Holm (step-down) is uniformly more powerful than Bonferroni while still controlling the family-wise error rate — it adjusts p-values sequentially based on rank. Benjamini-Hochberg controls the false discovery rate (FDR) instead of the family-wise error rate, making it less conservative and more appropriate when you expect some true effects among many tests.
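If you would rather not hand-roll the corrections, the multipletests function in statsmodels implements all three methods (among others); a quick side-by-side on the example p-values might look like this:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.03, 0.12]
for method in ("bonferroni", "holm", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, adjusted.round(4), reject)
```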
Does multiple comparison correction apply to A/B/C tests?
Yes. In an A/B/C test (or any multi-variant test), you are implicitly making multiple comparisons: A vs B, A vs C, and possibly B vs C. Each comparison carries a risk of a false positive, so you should apply a correction. The number of comparisons depends on how many pairwise tests you run. For a test with k variants (including control), there are k-1 comparisons against control, or k*(k-1)/2 if you compare all pairs.
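For instance, counting the comparisons for a hypothetical four-variant test, just to illustrate the two formulas:

```python
from math import comb

k = 4                # variants including control, e.g. an A/B/C/D test
print(k - 1)         # 3 variant-vs-control comparisons
print(comb(k, 2))    # 6 comparisons if every pair is tested
```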
What if nothing is significant after correction?
This is a valid and informative result — it means none of your observed effects are strong enough to survive the correction, and you cannot confidently reject the null hypothesis for any metric. This often happens when effect sizes are small or sample sizes are insufficient. You can consider running the test longer to increase power, focusing on fewer primary metrics, or using the less conservative Benjamini-Hochberg method if controlling false discovery rate (rather than family-wise error rate) is acceptable.

Updated for 2026. Built by GrowthLayer.