
Meta-Analysis Calculator

Combine results from multiple A/B tests to estimate the true effect size with greater precision. Uses inverse-variance weighted fixed-effects meta-analysis.

Results

Pooled Effect

4.37%

weighted average lift

95% CI

[4.32, 4.43]

confidence interval

p-value

< 0.0001

Significant

Heterogeneity

I² Statistic

99.8%

High heterogeneity

Cochran's Q

875.97

Significant heterogeneity (p < 0.10)

Study Breakdown

Study     Effect (%)   Weight
Test 1    5.20%        23.5%
Test 2    3.80%        65.9%
Test 3    6.10%        10.7%
Pooled    4.37%        100%

High Heterogeneity Detected

The I² value of 99.8% indicates substantial variation between studies. The pooled estimate may not accurately represent any individual context. Consider investigating the sources of variation or using a random-effects model for more conservative estimates.

Methodology

This calculator uses a fixed-effects meta-analysis with inverse-variance weighting, the standard approach for combining A/B test results. The core formulas are:

Weight for study i: w_i = 1 / SE_i²
Pooled effect: θ = Σ(w_i × θ_i) / Σ(w_i)
Pooled standard error: SE(θ) = √(1 / Σ(w_i))

Where:
- θ_i is the observed effect (lift %) for study i
- SE_i is the standard error for study i
- w_i is the inverse-variance weight

The 95% confidence interval is θ ± 1.96 × SE(θ).

The standard error can be provided directly or approximated from the lift and sample size as SE ≈ |lift| / √(sampleSize). This approximation assumes a roughly normal distribution of the effect estimate and works reasonably well for conversion-rate tests.

Heterogeneity is assessed with Cochran's Q statistic:

Q = Σ w_i × (θ_i - θ)²

Under the null hypothesis of homogeneity, Q follows a chi-squared distribution with k - 1 degrees of freedom, where k is the number of studies. The I² statistic quantifies the proportion of total variation due to true heterogeneity rather than sampling error:

I² = max(0, (Q - df) / Q × 100)

I² interpretation: 0–25% low, 25–75% moderate, 75%+ high heterogeneity.
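The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation; the standard errors below are hypothetical values chosen so the weights roughly match the example table, since the page does not report the real SEs.

```python
import math

def fixed_effects_meta(effects, ses, z=1.96):
    """Inverse-variance fixed-effects pooling with heterogeneity stats."""
    weights = [1.0 / se**2 for se in ses]          # w_i = 1 / SE_i²
    total_w = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)           # SE(θ) = √(1 / Σ w_i)
    ci = (pooled - z * pooled_se, pooled + z * pooled_se)
    # Cochran's Q and I² for heterogeneity
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q * 100) if q > 0 else 0.0
    return pooled, pooled_se, ci, q, i2

effects = [5.20, 3.80, 6.10]   # observed lifts (%) from the example tests
ses = [0.10, 0.06, 0.15]       # hypothetical standard errors (not from the page)
pooled, se, ci, q, i2 = fixed_effects_meta(effects, ses)
print(f"pooled={pooled:.2f}%  CI=[{ci[0]:.2f}, {ci[1]:.2f}]  Q={q:.1f}  I²={i2:.1f}%")
```

With these assumed inputs the pooled lift comes out near 4.37% with I² above 99%, mirroring the example results; exact CI bounds depend on the true standard errors.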

Frequently Asked Questions

What is meta-analysis in A/B testing?
Meta-analysis is a statistical technique for combining results from multiple independent experiments that test similar hypotheses. In A/B testing, it lets you pool results from several tests — for example, running the same CTA change across different pages or markets — to get a more precise estimate of the true effect size. This is especially useful when individual tests are underpowered.
When should I combine A/B test results?
Combine results when you have multiple tests that address the same underlying question — such as testing a similar design change across different pages, testing the same hypothesis in different markets, or re-running a previous test. Avoid combining tests that measure fundamentally different things, as this can produce misleading pooled estimates.
What is heterogeneity and why does it matter?
Heterogeneity measures how much the effect sizes vary across your studies beyond what random chance would explain. High heterogeneity (I² > 75%) suggests the true effect differs meaningfully between tests, which means a single pooled estimate may not accurately represent any individual context. In such cases, a random-effects model or investigating the sources of variation may be more appropriate.
What is the difference between fixed and random effects models?
A fixed-effects model assumes all studies share one true effect size, and differences are due to sampling error only. A random-effects model assumes the true effect varies between studies and accounts for this extra variability. This calculator uses a fixed-effects model, which is appropriate when your studies are similar in design and context. If heterogeneity is high, consider a random-effects approach.
How many studies do I need for a meta-analysis?
While technically you can combine as few as two studies, meta-analysis becomes more reliable with more studies. With only 2–3 studies, the heterogeneity tests have low statistical power and the pooled estimate is heavily influenced by each individual study. For robust conclusions, 5 or more studies is generally recommended, though even combining 2–3 well-designed tests provides a better estimate than looking at each in isolation.


Updated for 2026. Built by GrowthLayer.