Why Most A/B Tests Are Lying to You


Thursday. A product manager at a Series B SaaS company opens her A/B testing dashboard for the fourth time that day, a half-drunk cold brew beside her laptop. The screen reads: Variant B, +8.3% conversion lift, 96% statistical significance.

She screenshots the result. Posts it in the #product-wins Slack channel with a party emoji. The head of engineering replies with a thumbs-up and starts planning the rollout sprint.

Here’s what the dashboard didn’t show her: if she had waited three more days (the original planned test duration), that significance would have dropped to 74%. The +8.3% lift would have shrunk to +1.2%. Below the noise floor. Not real.

If you’ve ever stopped a test early because it “hit significance,” you’ve probably shipped a version of this mistake. You’re in good company. At Google and Bing, only 10% to 20% of controlled experiments generate positive results, according to Ronny Kohavi’s research published in the Harvard Business Review. At Microsoft broadly, one-third of experiments prove effective, one-third are neutral, and one-third actively hurt the metrics they were intended to improve. Most ideas don’t work. The experiments that “prove” they do are often telling you what you want to hear.

If your A/B testing tool lets you peek at results daily and stop whenever the confidence bar turns green, it’s not a testing tool. It’s a random number generator with a nicer UI.

The four statistical sins below account for the majority of unreliable A/B test results. Each takes less than 15 minutes to fix. By the end of this article, you’ll have a five-item pre-test checklist and a decision framework for choosing between frequentist, Bayesian, and sequential testing that you can apply to your next experiment Monday morning.


The Peeking Problem: 26% of Your Winners Aren’t Real

Every time you check your A/B test results before the planned end date, you’re running a new statistical test. Not metaphorically. Literally.

Frequentist significance tests are designed for a single look at a pre-determined sample size. When you check results after 100 visitors, then 200, then 500, then 1,000, you’re not running one test. You’re running four. Each look gives noise another chance to masquerade as signal.

Evan Miller quantified this in his widely cited analysis “How Not to Run an A/B Test.” If you check results after every batch of new data and stop the moment you see p < 0.05, the actual false positive rate isn’t 5%.

It’s 26.1%.

One in four “winners” is pure noise.

The mechanics are straightforward. A significance test controls the false positive rate at 5% for a single analysis point. Multiple checks create multiple opportunities for random fluctuations to cross the significance threshold. As Miller puts it: “If you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5%.”
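A quick way to see this inflation is to simulate A/A tests (no true difference between arms) and peek after every batch, stopping the moment p < 0.05. The sketch below uses a standard pooled two-proportion z-test; the batch size, conversion rate, and number of looks are illustrative, not from any real experiment:

```python
import math
import random

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(x1 / n1 - x2 / n2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def peeking_false_positive_rate(n_sims=1000, batch=100, max_n=2000, alpha=0.05, cr=0.05):
    """A/A test: both arms share the same true conversion rate,
    so every 'significant' early stop is a false positive."""
    random.seed(42)
    early_stops = 0
    for _ in range(n_sims):
        x_a = x_b = n = 0
        while n < max_n:
            x_a += sum(random.random() < cr for _ in range(batch))
            x_b += sum(random.random() < cr for _ in range(batch))
            n += batch
            if two_prop_pvalue(x_a, n, x_b, n) < alpha:
                early_stops += 1  # noise crossed the threshold
                break
    return early_stops / n_sims

print(f"False positive rate with peeking: {peeking_false_positive_rate():.1%}")
```

With 20 looks per test, the rate lands far above the nominal 5%; peeking more often pushes it further toward Miller’s 26%.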

Checking results repeatedly and stopping at significance inflates your false positive rate by more than 5x. Image by the author.

This is the most common sin in A/B testing, and the most expensive. Teams make product decisions, allocate engineering resources, and report revenue impact to leadership based on results that had a one-in-four chance of being imaginary.

The fix is simple but unpopular: calculate your required sample size before you start, and don’t look at the results until you hit it. If that discipline feels painful (and for most teams, it does), sequential testing offers a middle path. More on that in the framework below.

Check your test results after every batch of visitors, and you’ll “find” a winner 26% of the time. Even when there isn’t one.


The Power Vacuum: Small Samples, Inflated Effects

Peeking creates false winners. The second sin makes real winners look bigger than they are.

Statistical power is the probability that your test will detect a real effect when one exists. The standard target is 80%, meaning a 20% chance you’ll miss a real effect even when it’s there. To hit 80% power, you need a specific sample size, and that number depends on three things: your baseline conversion rate, the smallest effect you want to detect, and your significance threshold.

Most teams skip the power calculation. They run the test “until it’s significant” or “for two weeks,” whichever comes first. This creates a phenomenon called the winner’s curse.

Here’s how it works. In an underpowered test, the random variation in your data is large relative to the real effect. The only way a real-but-small effect reaches statistical significance in a small sample is if random noise pushes the measured effect far above its true value. So the very act of reaching significance in an underpowered test guarantees that your estimated effect is inflated.
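The inflation is easy to reproduce. The sketch below uses illustrative numbers (a 5% baseline, a real 1 percentage point lift, and a deliberately small 1,000 visitors per arm) and records the observed lift only in the runs that reach significance:

```python
import math
import random

def winners_curse(n_sims=3000, n=1000, base=0.05, true_lift=0.01):
    """Underpowered test with a real but small lift: among the runs that
    reach significance, the average observed lift is inflated."""
    random.seed(0)
    significant_lifts = []
    for _ in range(n_sims):
        x_a = sum(random.random() < base for _ in range(n))
        x_b = sum(random.random() < base + true_lift for _ in range(n))
        diff = x_b / n - x_a / n
        pool = (x_a + x_b) / (2 * n)
        se = math.sqrt(pool * (1 - pool) * 2 / n)
        if se > 0 and diff / se > 1.645:  # one-sided 5% significance
            significant_lifts.append(diff)
    return sum(significant_lifts) / len(significant_lifts)

avg = winners_curse()
print(f"True lift: 1.0pp; average observed lift among significant runs: {avg * 100:.1f}pp")
```

The true lift is 1 percentage point, but conditioning on significance roughly doubles the reported number, which is exactly the +8% → +2% pattern described below.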

When small samples produce significant results, the observed effect is typically inflated well above the true value. Image by the author.

A team might celebrate a +8% conversion lift, ship the change, and then watch the actual number settle at +2% over the following quarter. The test wasn’t wrong exactly (there was a real effect), but the team based their revenue projections on an inflated number. An artifact of insufficient sample size.

An underpowered test that reaches significance doesn’t find the truth. It finds an exaggeration of the truth.

The fix: run a power analysis before every test. Set your Minimum Detectable Effect (MDE) at the smallest change that would justify the engineering and product effort to ship. Calculate the sample size needed at 80% power. Then run the test until you reach that number. No early exits.
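The required sample size per variant can be sketched with the plain normal-approximation formula for a two-proportion test. This is a simplification: calculators like Evan Miller’s add corrections and report somewhat larger numbers, closer to the ~25,000 used in the worked example later in this article.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """n per variant for a two-proportion z-test (plain normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # target power
    p2 = baseline + mde
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# 3.2% baseline, 0.5pp MDE, 80% power, 5% significance
print(sample_size_per_variant(0.032, 0.005))  # ≈ 21,000 per variant
```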


The Multiple Comparisons Trap

The third sin scales with ambition. Your A/B test tracks conversion rate, average order value, bounce rate, time on page, and click-through rate on the call-to-action. Five metrics. Standard practice.

Here’s the problem. At a 5% significance level per metric, the probability of at least one false positive across all five isn’t 5%. It’s 22.6%.

The math: 1 − (1 − 0.05)^5 ≈ 0.226.

Scale that to 20 metrics (common in analytics-heavy teams) and the probability hits 64.2%. You’re more likely to find noise that looks real than to avoid it entirely.
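The arithmetic is one line, assuming the metrics are independent (correlated metrics inflate less, but the direction is the same):

```python
def family_false_positive_rate(n_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics

for k in (1, 5, 20):
    print(f"{k:>2} metrics: {family_false_positive_rate(k):.1%}")
# 1 metric: 5.0%, 5 metrics: 22.6%, 20 metrics: 64.2%
```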

At 20 metrics and a standard 5% threshold, you have a nearly two-in-three chance of celebrating noise. Image by the author.

Test 20 metrics at a 5% threshold and you have a 64% chance of celebrating noise.

This is the multiple comparisons problem, and most practitioners know it exists in theory but don’t correct for it in practice. They declare one primary metric, then quietly celebrate when a secondary metric hits significance. Or they run the same test across four user segments and count a segment-level win as a real result.

Two corrections exist, and major platforms already support them. Benjamini-Hochberg controls the expected proportion of false discoveries among your significant results (less conservative, preserves more power). Holm-Bonferroni controls the probability of even one false positive (more conservative, appropriate when a single wrong call has serious consequences). Optimizely uses a tiered version of Benjamini-Hochberg. GrowthBook offers both.
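Benjamini-Hochberg itself is short enough to sketch: sort the p-values, compare the k-th smallest against k/m · q, and keep everything up to the largest k that passes (a step-up rule). A minimal version, with illustrative p-values:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of p-values declared significant while
    controlling the false discovery rate at q (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            largest_passing_rank = rank
    return sorted(order[:largest_passing_rank])

# five hypothetical metric p-values from one test
pvals = [0.003, 0.013, 0.044, 0.048, 0.251]
print(benjamini_hochberg(pvals))  # [0, 1]
```

A naive 0.05 cutoff would call four of these five significant; BH at q = 0.05 keeps only the first two.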

The fix: declare one primary metric before the test starts. Everything else is exploratory. If you must evaluate multiple metrics formally, apply a correction. If your platform doesn’t offer one, you need a different platform.


When “Significant” Doesn’t Mean Significant

The fourth sin is the quietest and possibly the most expensive. A test can be statistically significant and practically worthless at the same time.

Statistical significance answers exactly one question: “Is this result likely due to chance?” It says nothing about whether the difference is large enough to matter. A test with 2 million visitors can detect a 0.02 percentage point lift on conversion with high confidence. That lift is real. It’s also not worth a single sprint of engineering time to ship.

The gap between “real” and “worth acting on” is where practical significance lives. Most teams never define it.

Before any test, set a practical significance threshold: the minimum effect size that justifies implementation. This should reflect the engineering cost of shipping the change, the opportunity cost of the test’s runtime, and the downstream revenue impact. If a 0.5 percentage point lift translates to $200K in annual revenue and the change takes one sprint to build, that’s your threshold. Anything below it is a “true but useless” finding.
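Translating a lift into revenue is a one-liner worth writing down. The sketch below uses hypothetical traffic and order-value numbers, not figures from any real test:

```python
def annual_lift_value(monthly_traffic, lift_pp, avg_order_value):
    """Incremental annual revenue from a conversion lift given in percentage points."""
    extra_orders_per_month = monthly_traffic * (lift_pp / 100)
    return extra_orders_per_month * avg_order_value * 12

# hypothetical: 100K eligible visitors/month, +0.5pp lift, $80 average order
print(f"${annual_lift_value(100_000, 0.5, 80):,.0f} per year")  # $480,000 per year
```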

The fix: calculate your MDE before the test starts, not just for power analysis (though it’s the same number), but as a decision gate. Even if a test reaches significance, if the measured effect falls below the MDE, you don’t ship. Write this number down. Get stakeholder agreement before launch.


The Bayesian Fix That Doesn’t Fix Anything

If you’ve read this far, a thought might be forming: “I’ll just switch to Bayesian A/B testing. It handles peeking. It gives me ‘probability of being best’ instead of confusing p-values. Problem solved.”

This is the most popular misconception in modern experimentation.

Bayesian A/B testing does solve one real problem: communication. Telling a VP “there’s a 94% probability that Variant B is better” is clearer than “we reject the null hypothesis at α = 0.05.” Business stakeholders understand the first statement intuitively. The second requires a statistics lecture.

But Bayesian testing does not solve the peeking problem.

In October 2025, Alex Molas published a detailed simulation study showing that Bayesian A/B tests with fixed posterior thresholds suffer from the same false positive inflation when you peek and stop on success. Using a 95% “probability to beat control” as a stopping rule, checked after every 100 observations, produced false positive rates of 80%. Not 5%. Not 26%. Eighty percent.
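This setup is straightforward to reproduce in miniature: run A/A tests under Beta(1, 1) priors, estimate P(B beats A) by Monte Carlo after every batch, and stop when either arm crosses 95%. A small, slow-but-simple sketch (sizes reduced so it runs in seconds; the exact rate depends on how often and how long you peek, so treat the printed number as directional):

```python
import random

def prob_b_beats_a(s_a, f_a, s_b, f_b, draws=300):
    """Monte Carlo estimate of P(p_B > p_A) under Beta(1, 1) priors."""
    wins = sum(
        random.betavariate(1 + s_b, 1 + f_b) > random.betavariate(1 + s_a, 1 + f_a)
        for _ in range(draws)
    )
    return wins / draws

def bayesian_peeking_rate(n_sims=200, batch=100, max_n=1500, threshold=0.95, cr=0.05):
    """A/A test with a fixed 95% posterior stopping rule checked every batch:
    there is no true difference, so every stop is a false 'winner'."""
    random.seed(7)
    stops = 0
    for _ in range(n_sims):
        s_a = s_b = n = 0
        while n < max_n:
            s_a += sum(random.random() < cr for _ in range(batch))
            s_b += sum(random.random() < cr for _ in range(batch))
            n += batch
            p = prob_b_beats_a(s_a, n - s_a, s_b, n - s_b)
            if p >= threshold or p <= 1 - threshold:  # either arm "wins"
                stops += 1
                break
    return stops / n_sims

print(f"A/A 'winner' rate with a Bayesian stopping rule: {bayesian_peeking_rate():.0%}")
```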

David Robinson at Variance Explained reached a parallel conclusion: a fixed posterior threshold used as a stopping rule does not control error rates in the way most practitioners assume. The posterior remains interpretable at any sample size. But interpretability is not the same as error control.

None of this means Bayesian methods are useless. For low-stakes directional decisions (picking a blog headline, choosing an email subject line) where Type I error control isn’t critical, the intuitive probability framework is genuinely better. For high-stakes product decisions where you need reliable error guarantees, “just go Bayesian” is not an answer. It’s a costume change on the same problem.

Switching from frequentist to Bayesian doesn’t cure peeking. It just changes the number you’re misinterpreting.

The real solution isn’t a switch in methodology. It’s a pre-test protocol that forces statistical discipline regardless of which framework you choose.


The Pre-Test Protocol

This is the section the rest of the article was building toward. Everything above established why you need it. Everything below shows what changes once you have it.

The 5-Point Pre-Test Checklist

Run through these five items before pressing “Start” on any A/B test. Each one is pass/fail. If any item fails, fix it before launching.

  1. Sample size calculated. Set your MDE (the smallest effect worth shipping). Calculate the required sample size at 80% power and 5% significance using Evan Miller’s free calculator or your platform’s built-in tool. Example: Baseline conversion 3.2%, MDE 0.5 percentage points → ~25,000 per variant.
  2. Runtime fixed and documented. Divide required sample size by daily eligible traffic. Round up. Add buffer for weekday/weekend variation (minimum 7 full days, even if sample size is reached sooner). Write down the end date. Example: 8,300 eligible visitors/day, 50,000 total needed → 6 days minimum, rounded to 14 days to capture weekly cycles.
  3. One primary metric declared. Write it down before the test starts. Secondary metrics are exploratory only. If you must evaluate multiple metrics formally, apply Benjamini-Hochberg or Holm-Bonferroni correction. Example: “Primary: checkout conversion rate. Secondary (exploratory): average order value, cart abandonment rate.”
  4. Practical significance threshold set. Define the minimum effect that justifies implementation. Agree on this with engineering and product stakeholders before launch. If the test reaches statistical significance but falls below this threshold, you don’t ship. Example: “Minimum +0.5 percentage points on conversion (worth ~$200K annually, justifies a 2-week sprint).”
  5. Analysis method chosen. Pick one: Frequentist, Bayesian, or Sequential. Document why. Use the decision matrix below. Example: “Sequential testing. Two planned analyses at day 7 and day 14. Alpha spending via O’Brien-Fleming bounds.”

Worked Example: Checkout Flow Test

A mid-market e-commerce team (500K monthly visitors) wants to test a new single-page checkout against their current multi-step flow. Here’s how they run the checklist:

1. MDE: 0.5 percentage points (from 3.2% baseline to 3.7%). At 500K monthly visitors with a $65 average order value, a 0.5pp lift generates roughly $1.95M in incremental annual revenue. The new checkout costs about 2 weeks of engineering time (~$15K loaded). The ROI clears the bar.

2. Sample size: At 80% power and 5% significance, this requires ~25,000 per variant. 50,000 total.

3. Runtime: 250K monthly visitors reach checkout. That’s ~8,300/day. 50,000 total ÷ 8,300/day = 6 days. Rounded to 14 days to capture weekday/weekend effects.

4. Primary metric: Checkout conversion rate. Average order value and cart abandonment tracked as exploratory (no correction needed since they won’t drive the ship/no-ship decision).

5. Method: Sequential testing. High traffic, and stakeholders want weekly progress updates. Two pre-planned analyses: day 7 and day 14. Alpha spending via O’Brien-Fleming bounds.

Result: At day 7, the observed lift is +0.3 percentage points. The sequential boundary isn’t crossed. Continue. At day 14, the lift is +0.6 percentage points. Boundary crossed. Ship it.
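The two-look plan above can be sanity-checked by simulation. With two equally spaced analyses, the z-statistics are correlated (the second look contains the first look’s data), and Monte Carlo recovers both the inflated error of naively peeking twice at z = 1.96 and the ~5% overall error of the O’Brien-Fleming bounds. A sketch, where 2.797 / 1.977 are the standard two-look O’Brien-Fleming boundary constants:

```python
import math
import random

def overall_alpha(bounds, n_sims=4000):
    """Monte Carlo overall type-I error for an A/A test analyzed at two
    equally spaced looks with the given |z| rejection boundaries."""
    random.seed(1)
    rejections = 0
    for _ in range(n_sims):
        z1 = random.gauss(0, 1)                        # z-stat at the interim look
        z2 = (z1 + random.gauss(0, 1)) / math.sqrt(2)  # final look reuses interim data
        if abs(z1) >= bounds[0] or abs(z2) >= bounds[1]:
            rejections += 1
    return rejections / n_sims

print(f"Two looks at z = 1.96:          alpha ≈ {overall_alpha([1.96, 1.96]):.3f}")
print(f"O'Brien-Fleming (2.797, 1.977): alpha ≈ {overall_alpha([2.797, 1.977]):.3f}")
```

The naive approach spends roughly 8% alpha instead of 5%; the O’Brien-Fleming bounds buy the interim look by making it much harder to stop early.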

Without the protocol: The PM checks daily, sees +1.1 percentage points on day 3 with 93% “significance,” and declares a winner. She ships based on a number that’s nearly double the truth. Revenue projections overshoot by 83%. The actual lift settles at +0.6 points over the next quarter. Leadership loses trust in the experimentation program.

The best A/B test is the one where you wrote down “what would change our mind?” before pressing Start.


What Rigorous Testing Actually Buys You

At Microsoft Bing, an engineer picked up a low-priority idea that had been shelved for months: a small change to how ad headlines displayed in search results. The change seemed too minor to prioritize. Someone ran an A/B test.

The result was a 12% increase in revenue per search, worth over $100 million annually in the U.S. alone. It became the single most valuable change Bing ever shipped.

This story, documented by Ronny Kohavi in the Harvard Business Review, carries two lessons. First, intuition about what matters is wrong most of the time. At Google and Bing, 80% to 90% of experiments show no positive effect. As Kohavi puts it: “Any figure that looks interesting or different is usually wrong.” You need rigorous testing precisely because your instincts aren’t good enough.

Second, rigorous testing compounds. Bing’s experimentation program identified dozens of revenue-improving changes per month, collectively boosting revenue per search by 10% to 25% each year. This accumulation was a major factor in Bing growing its U.S. search share from 8% in 2009 to 23%.

The 15 minutes you spend on a pre-test checklist isn’t overhead. It’s the difference between an experimentation program that compounds real gains and one that ships noise, erodes stakeholder trust, and makes A/B testing look like theater.

That product manager from Thursday? She’s going to run another test next week. So are you.

The dashboard will still show a confidence percentage. It will still turn green when it crosses a threshold. The UI is designed to make calling a winner feel satisfying and definitive.

But now you know what the dashboard doesn’t show. The 26.1%. The winner’s curse. The 64% false alarm rate. The Bayesian mirage.

Your next test starts soon. The checklist takes 15 minutes. The decision matrix takes five. That’s 20 minutes between shipping signal and shipping noise.

Which one will it be?


References

  1. Evan Miller, “How Not To Run an A/B Test”
  2. Alex Molas, “Bayesian A/B Testing Is Not Immune to Peeking” (October 2025)
  3. David Robinson, “Is Bayesian A/B Testing Immune to Peeking? Not Exactly”, Variance Explained
  4. Ron Kohavi, Stefan Thomke, “The Surprising Power of Online Experiments”, Harvard Business Review (September 2017)
  5. Optimizely, “False Discovery Rate Control”, Support Documentation
  6. GrowthBook, “Multiple Testing Corrections”, Documentation
  7. Analytics-Toolkit, “Underpowered A/B Tests: Confusions, Myths, and Reality” (2020)
  8. Statsig, “Effect Size: Practical vs Statistical Significance”
  9. Statsig, “Sequential Testing: How to Peek at A/B Test Results Without Ruining Validity”


