This is a very practical course on A/B testing by Udacity & Google, and I have recommended it to many folks interested in A/B testing. Here are my original notes taken from the course. The course has 5 sections:
1. Overview of A/B Testing
What can be tested — Both user-facing (e.g. a red vs. green button) and non-user-facing changes (e.g. backend system changes)
Short vs. Long term A/B tests — some products need more time to measure success
Alternatives — Retrospective (observational) analysis for generating hypotheses, plus other qualitative analyses, can be complementary
Impact — Amazon found that every 100ms of added latency cost ~1% of revenue
Analogy — A/B testing is useful for climbing to the peak of your current mountain, but not really useful for choosing between this mountain and another one — John Lilly, ex-CEO of Mozilla
Offline vs. Online — Online has more samples, but offline (clinical trials) knows much more about each participant
Rate vs. probability — CTR shows whether people find that button. However, it needs de-duping, as there could be double-clicks; click-through probability is already de-duped
Binomial vs. Normal — Binary outcomes follow a binomial distribution, which for large N is approximately normal, with standard error SE = sqrt(p(1−p)/N)
Confidence interval — If you theoretically repeated the experiment many times, the 95% confidence interval would cover the true change 95% of the time
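As a minimal sketch of the binomial standard error and normal-approximation confidence interval above (the click counts are invented for illustration):

```python
import math

def binomial_ci(successes, n, z=1.96):
    """95% normal-approximation CI for a binomial proportion such as CTR."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # binomial standard error
    return p_hat - z * se, p_hat + z * se

# Hypothetical example: 100 clicks out of 1000 pageviews
lo, hi = binomial_ci(100, 1000)  # roughly (0.081, 0.119)
```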
Statistical Significance — When the difference is not by chance (i.e) rejects null hypothesis
H0: P(C)=P(T), i.e., P(C)−P(T)=0
HA: P(C)!=P(T), i.e., P(C)−P(T)!=0
If the p-value < α (typically 0.05), reject the null hypothesis, where the p-value is the probability of observing a difference at least this extreme given H0
Comparing 2 samples -
Ppool=(XC+XT)/(NC+NT), where X is the number of successes (e.g. clicks) and N the number of samples
SEpool=sqrt(Ppool(1−Ppool)(1/NC+1/NT))
H0: d=0, i.e., destimate~N(0,SEpool), where N is the normal distribution
If destimate > 1.96 × SEpool or destimate < −1.96 × SEpool, then reject H0
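The pooled comparison above fits in a few lines of Python; the counts in the example call are illustrative, not from any real experiment:

```python
import math

def two_sample_test(x_c, n_c, x_t, n_t, z=1.96):
    """Pooled z-test for a difference in proportions.
    x_* = successes (e.g. clicks), n_* = samples, for control/treatment."""
    p_pool = (x_c + x_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    d_hat = x_t / n_t - x_c / n_c               # observed difference
    ci = (d_hat - z * se_pool, d_hat + z * se_pool)
    return d_hat, ci, abs(d_hat) > z * se_pool  # reject H0 at alpha = 0.05?

d_hat, ci, significant = two_sample_test(974, 10072, 1242, 9886)
```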
Practical (Substantive) Significance — What % of change matters to the business. E.g. a change of <2% might not be financially effective, as it requires retraining customer support. In medicine, <5% may not be worth retraining nurses and staff.
Type I error (⍺) — Chance of concluding there is a difference when there is none (false positive).
Type II error (β) — Chance of missing an actual lift (false negative). β shrinks as sample size grows.
Statistical Power — (1−β) is the power, or sensitivity, of the experiment
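Together, ⍺, β, the baseline rate, and the minimum detectable lift determine how many samples you need. A rough sketch using the standard two-proportion approximation (function name and defaults are my own, not from the course):

```python
import math
from statistics import NormalDist

def samples_per_group(p_base, d_min, alpha=0.05, beta=0.2):
    """Approximate samples per group to detect an absolute lift d_min
    over baseline rate p_base with significance alpha and power 1-beta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for two-sided 5%
    z_b = NormalDist().inv_cdf(1 - beta)       # ~0.84 for 80% power
    p_alt = p_base + d_min
    var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_a + z_b) ** 2 * var / d_min ** 2)

n = samples_per_group(0.10, 0.02)  # detect a 2-point lift from a 10% baseline
```

Note how n blows up as d_min shrinks: small lifts are expensive to detect.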
2. Policy & Ethics
Participants are real people
Experiments can be harmful or potentially harmful — The Tuskegee study left participants untreated for syphilis. Milgram's obedience study (psychology) pressured participants into administering what they believed were electric shocks, causing psychological damage. A Facebook study exposed users to only positive items in the newsfeed to measure behavior change.
IRB protects participants — (Institutional Review Board) review is often not required for online tests, as they are typically less harmful
Risks — risks the participants are exposed to. In the FB example, removing some feed items could carry some risk
Benefits — from the outcome of study. In medical studies, benefits are higher and so risk is acceptable
Choice — Does the participant have other choices? For cancer patients, death is the outcome of being untreated, hence high risk is acceptable. For online products, time spent and switching costs to another service could be an issue
Privacy — Do participants know what data is being collected (PII, HIPAA)? Would it harm them if the data became public? Anonymous (usually aggregated and stored) vs. pseudo-anonymous (cookie level). Timestamp is one of the 18 HIPAA identifiers.
Informed consent with a choice of participation might be mandatory when risk is high
3. Choosing and Characterizing Metrics
Types of Metrics
Sanity checking metrics (Invariant Metrics) — can be multiple metrics. Eg. When changing the page and testing CTR, latency & load times should be invariant
Evaluation metrics — Better to have one single, common evaluation metric across experiments instead of multiple or composite metrics, which are hard to define. You will anyway dig into other detailed metrics (funnel pass-through) to answer why the single metric moved or not.
Sourcing Metric Ideas
External Data — Use industry metrics from external data (Nielsen, Comscore), survey data (Pew, Forrester), academic research. Useful for validating your internal metrics
Internal Historical data — Try to measure metric movements against past changes (beware: this is correlation), talk to folks in your company or those who came from other companies
Define your Metric
Filters — outlier removal, bots, English content only — needs to be unbiased (validate by looking at distribution of slices and compare slice metrics)
Summary Metric — Sums&Counts ($), Distribution Metric (Mean, median, 95th pctl), Probabilities and Rates, Ratio (CTR)
Numerator & Denominator — For ratio metrics. Denom is usually unit of Analysis (see next lesson)
Relative change — Measuring relative change is better than absolute change, as it is less affected by seasonality, like high order values during holidays.
Practical significance level — analytic/theoretical computation of variance & CI works only for probability or normal data. For other distributions you need to compute empirical variability (by looking at individual observations) or use a non-parametric confidence interval
Sign Test — Easy way to detect if there was any positive change probabilistically. However does not quantify the actual lift.
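The sign test needs nothing but binomial tail probabilities. A small sketch (the day counts are invented for illustration):

```python
from math import comb

def sign_test_p_value(wins, n):
    """Two-sided sign test: p-value for `wins` out of n days under
    H0 that treatment beats control on any given day with prob 0.5."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Treatment beat control on 9 of 10 days
p = sign_test_p_value(9, 10)  # ~0.021, significant at 0.05
```

As the notes say, this tells you a positive change is likely real, but says nothing about its size.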
Sensitivity — Variable should move with the change made
Robustness — Variable should not move much without any change
A/A Test — use A/A tests to find the right metric. A/A also helps test the system and randomization. Run multiple A/A tests (10–20) with different population sizes to estimate variance. However, since SE shrinks with the square root of the number of samples, there are diminishing returns from more A/A tests. Another option is to run one big A/A test and do bootstrapping.
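The one-big-A/A-test-plus-bootstrap option might look like this sketch (the metric here is a simple mean; function name and data are my own):

```python
import random

def bootstrap_ci(values, n_boot=1000, alpha=0.05):
    """Empirical CI for the mean by resampling one big A/A bucket."""
    random.seed(0)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_boot):
        resample = random.choices(values, k=len(values))  # sample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

clicks = [1] * 100 + [0] * 900  # hypothetical A/A bucket with 10% CTR
lo, hi = bootstrap_ci(clicks)   # close to the analytic interval (0.081, 0.119)
```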
4. Designing an Experiment
Unit of Diversion/Randomization — (cookie, user, ip segments)
Unit of Analysis (user, page views) — analytical & empirical variability are close when UoD & UoA are the same.
Empirical Variability — estimate SD directly from repeated A/A tests; margin of error = z × SD
Analytical Variability — SE = sqrt(p(1−p)(1/Ncont + 1/Nexp))
For continuous metrics like revenue per user, empirical variability is easier to compute than analytical variability
Intra vs. Inter User Experiments — Intra: the same user gets control & treatment at different times and is measured against themselves; Inter: different users get control and treatment
Target Population — US only, English users only etc.,
Cohort — creating sub groups with additional conditions to qualify for experiment (eg. Visited in last 2 months)
Size — Using variance learnt from AA
Sizing Trigger — Log who was supposed to see the change and who actually saw it, to make sure the trigger worked
Duration vs. Exposure — Better to run longer on a small share of traffic: you don’t want to expose a large user group, you don’t want to decide based on just 1 or 2 days (what if it is a holiday?), and you want unbiased traffic left over to test future updated versions.
Learning Effect — Some users are averse to change and drop off. Some are excited about change and engage more
5. Analyzing Results
Sanity checks — test setup checks (e.g. no non-English traffic), population size checks (equal allocation to T & C), invariant checks (metrics that shouldn’t change didn’t change). If any fail, check experiment setup, infra & tracking
Single Metric — Use variability to analyze whether the results are stat sig. If not stat sig, try a non-parametric sign test. Try breaking down by various dimensions, which helps identify bugs and also suggests new hypotheses about how different groups react to the experiment. While analyzing by groups, beware of Simpson’s Paradox (individual groups show a lift, but the overall result shows none, or is even negative).
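A tiny numeric sketch of Simpson's paradox, with invented click counts: the treatment wins within every segment yet loses overall, because the traffic mix differs between arms:

```python
# (clicks, pageviews) per segment; numbers are purely illustrative
control = {"new_users": (20, 100), "power_users": (720, 900)}
treatment = {"new_users": (190, 900), "power_users": (90, 100)}

for seg in control:
    c_rate = control[seg][0] / control[seg][1]
    t_rate = treatment[seg][0] / treatment[seg][1]
    assert t_rate > c_rate  # treatment is better in every segment...

c_all = sum(x for x, _ in control.values()) / sum(n for _, n in control.values())
t_all = sum(x for x, _ in treatment.values()) / sum(n for _, n in treatment.values())
assert t_all < c_all        # ...but worse overall (0.28 vs 0.74)
```

The imbalance (control is mostly power users, treatment mostly new users) is exactly the kind of allocation bug the sanity checks above should catch.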
Multiple Metrics/Variants — The more metrics you have or the more things you test, the more likely you are to see differences just by chance. Hence use a multiple-comparisons method like Bonferroni, the closed testing procedure, Boole’s inequality, or Holm–Bonferroni.
Probability of at least 1 of 3 metrics showing a false positive at 95% CI: 1-(0.95*0.95*0.95) ≈ 0.143 (for simplicity, assuming the metrics are independent)
Method 1: Assume independence: ⍺overall = 1-(1-⍺individual)^n
Method 2: Bonferroni correction:
⍺individual = ⍺overall /n
More sophisticated methods:
Control the probability that any metric shows a false positive, i.e., the family-wise error rate.
Control the False Discovery Rate [FDR = E(# of false positives / # of rejections)], i.e., with hundreds of metrics you accept that ~5% of your rejections are false positives.
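The two correction methods above reduce to one-liners:

```python
def family_fpr(alpha_ind, n):
    """Method 1: chance that at least one of n independent metrics is a
    false positive at individual significance level alpha_ind."""
    return 1 - (1 - alpha_ind) ** n

def bonferroni_alpha(alpha_overall, n):
    """Method 2: per-metric level that caps family-wise error at alpha_overall."""
    return alpha_overall / n

fpr3 = family_fpr(0.05, 3)              # ~0.143 for 3 independent metrics
per_metric = bonferroni_alpha(0.05, 3)  # ~0.0167 per metric
```

Bonferroni needs no independence assumption, which is why it is conservative: with many correlated metrics it over-corrects.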
Do you understand the change? — Intuition from others & past experiments help. Eg: Bold works well in Latin alphabets. In Japanese bold is hard to read. Using different color would be better
Do you want to launch the change? — Is it worth it, based on the above learnings? Use a ramp-up to launch, to make sure the experience works well for other users too. If your change helps only 30% of users, or helps 70% and hurts 30%, decide whether to launch or fine-tune first.
Hold out — Even a stat-sig lift may shrink during ramp-up due to seasonal effects. A hold-out helps test long-term behavior. A cohort analysis would help here.
Switchback testing — Instead of using user or cookie, uses slice of time as Unit of Diversion (Eg: https://blog.doordash.com/switchback-tests-and-randomized-experimentation-under-network-effects-at-doordash-f1d938ab7c2a)
Multi-arm bandit — Instead of static diversion of traffic through out the test (20:20:20:20:20), constantly redirect more traffic to the winner to exploit early on (Eg: https://multithreaded.stitchfix.com/blog/2018/11/08/bandits/)
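A minimal epsilon-greedy sketch of the bandit idea (the true conversion rates and parameters are invented for the simulation, not from the linked posts):

```python
import random

def epsilon_greedy(successes, trials, epsilon=0.1):
    """Explore a random arm with prob epsilon, else exploit the arm
    with the best observed conversion rate so far."""
    if random.random() < epsilon:
        return random.randrange(len(trials))
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return rates.index(max(rates))

random.seed(1)
true_rates = [0.05, 0.06, 0.05, 0.10, 0.04]  # hidden per-variant rates
successes, trials = [0] * 5, [0] * 5
for _ in range(5000):
    arm = epsilon_greedy(successes, trials)
    trials[arm] += 1
    successes[arm] += random.random() < true_rates[arm]
# Traffic concentrates on the winning arm instead of a static 20:20:20:20:20 split
```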
Causal Inference -
Synthetic Control -