A/B Tests / Experiments

Running controlled experiments on component or pattern changes so the system evolves on evidence — measured impact on real user behavior — rather than on opinion.

Why it matters

Changes at the system layer hit every product, so a “small” tweak to the primary button or a default forms layout moves metrics at enormous scale — for good or ill. An experiment lets you ship that change to a slice of traffic, measure the effect against a holdout, and roll forward only if it wins. It also resolves the recurring stakeholder standoff (“our team needs a bigger CTA”) with data instead of seniority, which is critical for governance credibility.

How it works

An A/B test splits users into a control (current) and one or more variants, then compares a pre-declared metric:

  • One change, one hypothesis — “rounding card corners to 12px lifts card click-through” — stated before you look, so you can’t rationalize noise after.
  • Randomize per user, hold it sticky — bucket by a stable user/session id so a person sees one variant for the whole test; flipping mid-session poisons the result.
  • Power the sample — compute sample size up front from baseline rate, minimum detectable effect, and α=0.05 / power=0.8. Underpowered tests “fail” real wins.
  • Decide on the primary metric — declare it in advance; watch guardrails (errors, latency, bounce) so a CTR win that tanks task completion is caught.
TermMeaningTrap if ignored
Controlunchanged baselineno clean comparison
MDEsmallest effect worth detectingwrong sample size
p-valuechance result is noisepeeking inflates it
Guardrailmetric that must not regresslocal win, global loss

System-level rollout uses the same feature-flag plumbing as a pilot, just split rather than staged.

Example

A team proposes a higher-contrast primary button. Ship variant B to 50% of users, target = checkout-start rate, MDE = 1%, which needs ~45,000 users/arm for significance. After two weeks: B = 8.4% vs control 8.0%, p = 0.02, and the error-rate guardrail is flat. Promote B to the default token; the win is now the system standard, recorded for governance. Pair the readout with component-analytics to see which surfaces drove it.

Pitfalls

  • Peeking and stopping early — calling significance the first day it appears massively inflates false positives; wait for the planned sample.
  • No declared primary metric — testing 12 metrics guarantees one looks “significant” by chance (multiple comparisons).
  • Sticky bucketing skipped — re-randomizing per pageview mixes variants per user and erases the signal.
  • Ignoring guardrails — a flashier component that lifts clicks but raises rage-clicks or errors is a net loss.

See also