
Multiple Comparisons Calculator

Adjust p-values for multiple hypothesis tests using Bonferroni, Holm, or Benjamini-Hochberg correction to control false positives.

The calculator accepts between 2 and 20 metrics with their observed p-values. The example below uses two metrics at a significance threshold of 0.05, the conventional choice for a 95% confidence level.

Results

0 of 2 metrics remain significant after Holm (Step-Down) correction.

1 metric lost significance after correction.

Metric             Original p   Adjusted p   Significant?
Conversion Rate    0.0300       0.0600       No (was significant before correction)
Revenue per User   0.1200       0.1200       No

Before vs After Correction

[Chart: original vs. adjusted p-value for each metric. Blue bars show the original confidence, green/orange bars the adjusted confidence, and a red line marks the 0.05 significance threshold.]

What was applied?

Holm's step-down procedure sorted the p-values in ascending order and applied progressively less severe corrections: with two tests, the smallest p-value was multiplied by 2 and the other by 1. This is more powerful than Bonferroni while providing the same FWER guarantee.
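As a concrete sketch of that arithmetic (pure Python with an illustrative function name, not the calculator's actual source), the step-down adjustment that produced the table above:

```python
# Holm step-down: sort p-values ascending, multiply the smallest by n,
# the next by n-1, and so on, enforcing monotonicity along the way.
def holm_adjust(pvals):
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        p = min(1.0, pvals[i] * (n - rank))
        running_max = max(running_max, p)  # never below a previous adjustment
        adjusted[i] = running_max
    return adjusted

print(holm_adjust([0.03, 0.12]))  # [0.06, 0.12], matching the results above
```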

Methodology

This calculator applies three widely used methods for adjusting p-values when performing multiple hypothesis tests simultaneously.

Family-Wise Error Rate (FWER) methods, Bonferroni and Holm, control the probability of making even one false positive across all tests:

Bonferroni: The simplest correction. Each p-value is multiplied by the number of tests (n): adjusted p = min(1, p × n). While easy to understand, it is the most conservative method and can miss real effects when n is large.

Holm (Step-Down): A sequential improvement over Bonferroni. P-values are sorted in ascending order; the smallest is multiplied by n, the next by n - 1, and so on. Monotonicity is enforced (each adjusted p-value must be at least as large as the previous one). Holm is uniformly more powerful than Bonferroni while maintaining the same FWER guarantee.

The False Discovery Rate (FDR) method, Benjamini-Hochberg, controls the expected proportion of false positives among rejected hypotheses:

Benjamini-Hochberg: P-values are sorted ascending and assigned ranks. Adjusted p_i = p_i × n / rank_i, with monotonicity enforced from the largest rank to the smallest. This method is less conservative than FWER methods and is preferred in exploratory analyses where some false positives are acceptable as long as their overall proportion is controlled.

All three methods guarantee that the adjusted p-value lies between the original p-value and 1.
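To make the remaining two formulas concrete, here is a minimal pure-Python sketch (illustrative names, not the calculator's implementation; Holm is sketched in the previous section):

```python
# Bonferroni: multiply every p-value by the number of tests, capped at 1.
def bonferroni(pvals):
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]

# Benjamini-Hochberg: adjusted p_i = p_i * n / rank_i, with monotonicity
# enforced while walking from the largest p-value down to the smallest.
def benjamini_hochberg(pvals):
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # rank n down to rank 1
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = min(1.0, running_min)
    return adjusted

print(bonferroni([0.03, 0.12]))          # [0.06, 0.24]
print(benjamini_hochberg([0.03, 0.12]))  # [0.06, 0.12]
```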

Frequently Asked Questions

What is the multiple testing problem?
When you test multiple metrics or comparisons simultaneously, the probability of at least one false positive increases dramatically. For example, testing 20 metrics at a 5% significance level gives you a ~64% chance of at least one false positive, even if there is no real effect. Multiple comparison corrections adjust p-values to control this inflated error rate.
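The ~64% figure assumes the 20 tests are independent; it is simply 1 - (1 - 0.05)^20:

```python
# Chance of at least one false positive across m independent tests.
m, alpha = 20, 0.05
print(1 - (1 - alpha) ** m)  # 0.6415..., i.e. roughly a 64% chance
```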
When should I adjust p-values for multiple comparisons?
You should adjust p-values whenever you are testing multiple hypotheses simultaneously — for example, when measuring multiple metrics (conversion rate, revenue, engagement) in the same A/B test, when running an A/B/C/D test with multiple variant-vs-control comparisons, or when looking at results across multiple segments. If you only test one metric per experiment, no correction is needed.
What is the difference between Bonferroni, Holm, and Benjamini-Hochberg?
Bonferroni is the simplest and most conservative: it multiplies each p-value by the number of tests. Holm (step-down) is uniformly more powerful than Bonferroni while still controlling the family-wise error rate — it adjusts p-values sequentially based on rank. Benjamini-Hochberg controls the false discovery rate (FDR) instead of the family-wise error rate, making it less conservative and more appropriate when you expect some true effects among many tests.
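If you would rather not hand-roll the corrections, the multipletests function in statsmodels implements all three methods (among others); a quick side-by-side on the example p-values might look like this:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.03, 0.12]
for method in ("bonferroni", "holm", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, adjusted.round(4), reject)
```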
Does multiple comparison correction apply to A/B/C tests?
Yes. In an A/B/C test (or any multi-variant test), you are implicitly making multiple comparisons: A vs B, A vs C, and possibly B vs C. Each comparison carries a risk of a false positive, so you should apply a correction. The number of comparisons depends on how many pairwise tests you run. For a test with k variants (including control), there are k-1 comparisons against control, or k*(k-1)/2 if you compare all pairs.
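For instance, counting the comparisons for a hypothetical four-variant test, just to illustrate the two formulas:

```python
from math import comb

k = 4                # variants including control, e.g. an A/B/C/D test
print(k - 1)         # 3 variant-vs-control comparisons
print(comb(k, 2))    # 6 comparisons if every pair is tested
```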
What if nothing is significant after correction?
This is a valid and informative result — it means none of your observed effects are strong enough to survive the correction, and you cannot confidently reject the null hypothesis for any metric. This often happens when effect sizes are small or sample sizes are insufficient. You can consider running the test longer to increase power, focusing on fewer primary metrics, or using the less conservative Benjamini-Hochberg method if controlling false discovery rate (rather than family-wise error rate) is acceptable.

Updated for 2026. Built by GrowthLayer.