The Parallel Universe

Finding Causal Effects by Comparing Alternate Realities

Health Economics and Policy

Spring 2026

The Parallel Universe

DiD as a Counterfactual Bridge

The Dream: Seeing the Road Not Taken

The Fundamental Problem

We want to know: “What would have happened if this policy had never been implemented?”

But we can’t rewind history and try both paths.

Difference-in-Differences offers a solution:

Use a comparison group as a stand-in for the counterfactual — what the treated group would have looked like without treatment.

The core idea:

Compare the change over time in the treated group to the change over time in a control group. The difference in these differences is our causal estimate.

The First Parallel Universe Detective: Semmelweis (1847)

The Setting: Vienna General Hospital

  • In 1847, pregnant women were routed to one of two wings:
    • Physician wing: 13-18% maternal mortality
    • Midwife wing: ~3% mortality

Semmelweis’s Insight:

Physicians came from performing autopsies — could “cadaveric particles” be causing deaths?

library(tidyverse)  # tibble, %>%, ggplot2
library(scales)     # label_percent()
# theme_health_econ() and the palette colors (secondary_teal, accent_coral, ...)
# are assumed to be defined in the deck's setup chunk

semmelweis_data <- tibble(
  period = rep(c("Before\nIntervention", "After\nIntervention"), 2),
  group = rep(c("Physician Wing", "Midwife Wing"), each = 2),
  mortality = c(15.5, 2.5, 3.0, 2.8)
) %>%
  mutate(period = factor(period, levels = c("Before\nIntervention", "After\nIntervention")))

ggplot(semmelweis_data, aes(x = period, y = mortality,
                            color = group, group = group)) +
  geom_line(linewidth = 2) +
  geom_point(size = 5) +
  scale_color_manual(values = c(secondary_teal, accent_coral), name = "") +
  scale_y_continuous(labels = label_percent(scale = 1), limits = c(0, 18)) +
  labs(
    title = "Maternal Mortality by Wing",
    subtitle = "Hand-washing intervention in physician wing",
    x = "", y = "Mortality Rate"
  ) +
  theme_health_econ(base_size = 16) +
  theme(legend.position = "bottom")
Figure 1: Semmelweis’s natural experiment

Semmelweis’s Intervention Changed Everything

The Intervention:

Instituted hand-washing with chlorine solution for physicians (but not midwives)

The Result:

A natural DiD design — mortality dropped dramatically in the physician wing while staying stable in the midwife wing

Why this is DiD:

  • Treatment group: Physician wing (received hand-washing)
  • Control group: Midwife wing (no change)
  • Before vs. after: Mortality before and after the intervention

The Cost of Being Right Too Early

Evidence Rejected

Despite compelling evidence, Semmelweis’s findings were dismissed.

His superiors attributed improvements to changes in ventilation — not hand-washing.

Semmelweis faced professional ruin and died in an asylum.

A Parallel Trends Objection Before Its Time

The ventilation critique was essentially a parallel trends violation claim: “Both wings would have improved due to better ventilation — the physician wing just happened to get it first.”

If Semmelweis had the vocabulary of parallel trends, could he have defended his findings more convincingly?

It took another 20 years for germ theory to become accepted.

The Second Detective: John Snow (1854)

The Crisis:

  • Three major cholera waves devastated London
  • Dominant theory: Miasma (bad air causes disease)

Snow’s Insight:

Cholera was waterborne, not airborne — transmitted through contaminated Thames water

Snow’s Natural Experiment

The Setup:

  • London ordered water companies to move intake pipes upstream
  • Companies complied at different times
  • Lambeth moved pipes in 1852; Southwark & Vauxhall didn’t until later

The Identification:

Same neighborhoods, different water sources — a natural DiD design

Map of London water utilities, 1854

Snow’s Table: The First DiD Calculation

Table 1

| Company | Period | Deaths per 10,000 | Outcome Expression | First Difference |
|---|---|---|---|---|
| Lambeth | Before (1849) | 150 | \(Y = L\) | |
| Lambeth | After (1854) | 37 | \(Y = L + {\color{#e74c3c}{L_t}} + D\) | \(D + {\color{#e74c3c}{L_t}}\) |
| Southwark & Vauxhall | Before (1849) | 118 | \(Y = SV\) | |
| Southwark & Vauxhall | After (1854) | 147 | \(Y = SV + SV_t\) | \(SV_t\) |

Snow’s Numbers Tell the Story

The DiD Calculation (with numbers):

\[\delta^{DiD} = (37 - 150) - (147 - 118) = -113 - 29 = \mathbf{-142} \text{ deaths per 10,000}\]

The DiD Calculation (with algebra):

\[\delta^{DiD} = \Big[(L + {\color{#e74c3c}{L_t}} + D) - L\Big] - \Big[(SV + SV_t) - SV\Big] = D + ({\color{#e74c3c}{L_t}} - SV_t)\]

If \({\color{#e74c3c}{L_t}} = SV_t\) (parallel trends), then \(\delta^{DiD} = D\) — the true treatment effect!
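The whole calculation fits in a few lines of R (a toy check using the cell means from Table 1):

# Snow's four cell means (deaths per 10,000), from Table 1
lambeth_pre <- 150; lambeth_post <- 37
sv_pre      <- 118; sv_post      <- 147

# Change in Lambeth minus change in Southwark & Vauxhall
(lambeth_post - lambeth_pre) - (sv_post - sv_pre)  # -113 - 29 = -142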

Two Physicians, One Method

What Semmelweis and Snow Discovered

Both physicians were doing the same thing — comparing changes across groups over time.

Neither had the formal vocabulary, but both intuitively understood: a single before-after comparison isn’t enough. You need a control group to benchmark against.

Their legacy: The difference-in-differences framework formalizes exactly what they discovered.

Snow’s table is a DiD table. Now let’s see the same logic expressed in regression form.

From Intuition to Formalism

Semmelweis and Snow saw into the parallel universe before anyone had a name for it. Now let’s formalize what they discovered — turning intuition into a formula we can estimate, test, and defend.

From Motivation to Design


Building the Bridge Between Universes

The Difference-in-Differences Setup

Anatomy of the Bridge: Four Averages, Three Subtractions

# Create DiD visualization data
did_viz <- tibble(
  time = c(0, 1, 0, 1),
  group = c("Control", "Control", "Treated", "Treated"),
  outcome = c(2, 3, 3, 5.5),
  counterfactual = c(NA, NA, NA, 4)
)

# Add counterfactual line for treated
cf_line <- tibble(
  time = c(0, 1),
  outcome = c(3, 4),
  group = "Counterfactual"
)

ggplot() +
  # Control group
  geom_line(data = filter(did_viz, group == "Control"),
            aes(x = time, y = outcome), color = slate_gray, linewidth = 2) +
  geom_point(data = filter(did_viz, group == "Control"),
             aes(x = time, y = outcome), color = slate_gray, size = 6) +
  # Treated group
  geom_line(data = filter(did_viz, group == "Treated"),
            aes(x = time, y = outcome), color = secondary_teal, linewidth = 2) +
  geom_point(data = filter(did_viz, group == "Treated"),
             aes(x = time, y = outcome), color = secondary_teal, size = 6) +
  # Counterfactual
  geom_line(data = cf_line, aes(x = time, y = outcome),
            color = accent_coral, linewidth = 1.5, linetype = "dashed") +
  geom_point(data = filter(cf_line, time == 1),
             aes(x = time, y = outcome), color = accent_coral, size = 5, shape = 1, stroke = 2) +
  # Treatment effect arrow
  annotate("segment", x = 1.05, y = 4, xend = 1.05, yend = 5.5,
           arrow = arrow(length = unit(0.3, "cm"), ends = "both"),
           color = warm_gold, linewidth = 1.5) +
  annotate("text", x = 1.15, y = 4.75, label = "Treatment\nEffect",
           color = warm_gold, fontface = "bold", size = 5, hjust = 0) +
  # Labels
  annotate("text", x = -0.05, y = 2, label = "Control (Pre)", hjust = 1, color = slate_gray, size = 4) +
  annotate("text", x = 1.05, y = 3, label = "Control (Post)", hjust = 0, color = slate_gray, size = 4) +
  annotate("text", x = -0.05, y = 3, label = "Treated (Pre)", hjust = 1, color = secondary_teal, size = 4) +
  annotate("text", x = 1.05, y = 5.5, label = "Treated (Post)", hjust = 0, color = secondary_teal, size = 4) +
  annotate("text", x = 1.05, y = 4, label = "Counterfactual", hjust = 0, color = accent_coral, size = 4) +
  # Vertical line at treatment
  geom_vline(xintercept = 0.5, linetype = "dotted", color = slate_gray, linewidth = 1) +
  annotate("text", x = 0.5, y = 6.2, label = "Treatment\nOccurs", color = slate_gray, size = 4) +
  scale_x_continuous(breaks = c(0, 1), labels = c("Pre", "Post"), limits = c(-0.2, 1.5)) +
  scale_y_continuous(limits = c(1.5, 6.5)) +
  labs(
    title = "The DiD Estimator: Parallel Trends in Action",
    subtitle = "The control group trend estimates what would have happened to treated units",
    x = "", y = "Outcome"
  ) +
  theme_health_econ(base_size = 18) +
  theme(panel.grid = element_blank())

Figure 3: The difference-in-differences estimator

The Blueprint: One Regression Captures the Design

The 2×2 DiD Estimator:

\[\delta^{DiD} = \underbrace{(Y_{Treated,Post} - Y_{Treated,Pre})}_{\text{Change in treated}} - \underbrace{(Y_{Control,Post} - Y_{Control,Pre})}_{\text{Change in control}}\]

As a regression:

\[Y_{it} = \alpha + \beta_1 \cdot \text{Treat}_i + \beta_2 \cdot \text{Post}_t + \delta \cdot (\text{Treat}_i \times \text{Post}_t) + \varepsilon_{it}\]

  • \(\beta_1\): Baseline difference between groups
  • \(\beta_2\): Common time trend
  • \(\delta\): The DiD estimate — our causal effect!

Graduate Students: OLS Equivalence

The 2×2 DiD can be written as: \(\delta = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})\), which is algebraically identical to the OLS coefficient on \(\text{Treat} \times \text{Post}\) in the saturated model.
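A quick simulation illustrates the equivalence (toy data, not from the slides; the true effect is set to 1.5):

set.seed(42)
n <- 200

# Toy 2x2 panel: half the units treated, two periods, true effect = 1.5
df <- expand.grid(id = 1:n, post = 0:1) %>%
  mutate(
    treat = as.integer(id <= n / 2),
    y = 2 + 1 * treat + 0.5 * post + 1.5 * treat * post + rnorm(n * 2)
  )

# (1) Double difference of the four group means
cell_means <- df %>%
  group_by(treat, post) %>%
  summarise(ybar = mean(y), .groups = "drop")
with(cell_means,
     (ybar[treat == 1 & post == 1] - ybar[treat == 1 & post == 0]) -
       (ybar[treat == 0 & post == 1] - ybar[treat == 0 & post == 0]))

# (2) OLS coefficient on the interaction -- numerically identical to (1)
coef(lm(y ~ treat * post, data = df))["treat:post"]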

The Parallel Universe Test

The DiD Golden Rule

Ask yourself: “If the treatment had never happened, would the treated group have followed the same path as the control group?”

If the answer is yes, the control group is a valid window into the treated group’s alternate universe.

The Parallel Universe Test checks three things:

  1. Were the universes aligned before the split? (If not, parallel trends fails.)
  2. Did anyone peek ahead and change course? (If yes, the pre-period is contaminated.)
  3. Did anything leak between universes? (If yes, your control group is partially treated.)

Just as the RDD “Twins Test” asks “Would these two people be indistinguishable without the cutoff?”, the Parallel Universe Test asks “Would these two groups have traveled the same road without the policy?”

The Bridge Works Only If the Universes Were Aligned

Express DiD in potential outcomes notation:

\[\delta^{DiD} = \underbrace{E[Y^1_{Treated} - Y^0_{Treated}]}_{\text{ATT (what we want)}} + \underbrace{\Big[E[\Delta Y^0_{Treated}] - E[\Delta Y^0_{Control}]\Big]}_{\text{Bias if trends differ}}\]

The Key Insight

DiD gives us the Average Treatment Effect on the Treated (ATT) only if the bias term equals zero.

This happens when parallel trends holds.

Graduate Students: Formal Identification Result

Under parallel trends, \(E[\Delta Y^0 | D=1] = E[\Delta Y^0 | D=0]\), the bias term \(E[\Delta Y^0_{Treated}] - E[\Delta Y^0_{Control}]\) equals zero by assumption, and therefore \(\delta^{DiD} = ATT\).

From Formula to Assumptions

The formula tells us how to build the bridge between universes. But when does it lead to the right universe? Three rules determine whether the connection is valid.

From Setup to Assumptions


Three Rules for Parallel Universes

Parallel Trends, No Anticipation, No Spillovers

Three Rules for Parallel Universes

Table 2

| # | Assumption | What It Means | Testable? | If Violated... |
|---|---|---|---|---|
| 1 | Parallel Trends | Without treatment, groups would follow the same trajectory | Partially | DiD estimate is biased |
| 2 | No Anticipation | Units don't change behavior before treatment | Sometimes | Pre-period is contaminated |
| 3 | SUTVA | No spillovers between units; treatment is well-defined | Sometimes | Treatment effect is ill-defined |

Rule 1: The Universes Must Have Been Aligned

The Core Assumption

\[E[\Delta Y^0_{Treated}] = E[\Delta Y^0_{Control}]\]

In the absence of treatment, both groups would have followed the same trajectory.

What Parallel Trends IS:

  • Control group trend approximates treated counterfactual
  • Changes over time would be similar absent treatment

What It’s NOT:

  • Does not require groups to be similar at baseline
  • Does not require random assignment (though that would guarantee it)
set.seed(123)
time <- -5:5
control <- 10 + 0.5 * time + rnorm(11, 0, 0.2)
treated <- 12 + 0.5 * time + ifelse(time >= 0, 2, 0) + rnorm(11, 0, 0.2)

pt_data <- tibble(
  time = rep(time, 2),
  group = rep(c("Control", "Treated"), each = 11),
  outcome = c(control, treated)
)

ggplot(pt_data, aes(x = time, y = outcome, color = group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  scale_color_manual(values = c(slate_gray, secondary_teal), name = "") +
  annotate("text", x = -0.5, y = 16, label = "Treatment", color = slate_gray, size = 4) +
  labs(
    title = "Parallel Pre-Trends",
    x = "Period", y = "Outcome"
  ) +
  theme_health_econ(base_size = 14)

Rule 2: No One Peeked into the Future

Definition

Units do not change their behavior before treatment in anticipation of it.

Violations:

  • Buying a house today because you expect a housing subsidy next month
  • Hospitals changing staffing before a merger is announced
  • Patients stocking up on medications before a policy change

Why It Matters:

If the pre-period is already affected, we don’t have a clean baseline for comparison

antic_data <- tibble(
  time = rep(-4:4, 2),
  group = rep(c("Control", "Treated (with anticipation)"), each = 9),
  outcome = c(
    10 + 0.5 * (-4:4),  # Control
    c(12, 12.5, 13, 14, 15, 17, 18, 19, 20)  # Treated - anticipates at t=-2
  )
)

ggplot(antic_data, aes(x = time, y = outcome, color = group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  annotate("rect", xmin = -2.5, xmax = -0.5, ymin = 9, ymax = 21,
           fill = accent_coral, alpha = 0.15) +
  annotate("text", x = -1.5, y = 20.5, label = "Anticipation\nContamination",
           color = accent_coral, size = 4, fontface = "bold") +
  scale_color_manual(values = c(slate_gray, accent_coral), name = "") +
  labs(
    title = "Anticipation Violates No-Anticipation",
    x = "Period", y = "Outcome"
  ) +
  theme_health_econ(base_size = 14)
Figure 6: Anticipation effects contaminate the baseline

Rule 3: The Universes Don’t Leak

Stable Unit Treatment Value Assumption

  1. No interference: One unit’s treatment doesn’t affect another’s outcome
  2. No hidden variation: Treatment is applied consistently across units

Potential Violations:

  • Hospital in treated state affects patients in control state (geographic spillovers)
  • New policy changes market equilibrium (general equilibrium effects)
  • Treatment intensity varies across “treated” units

Why It Matters:

  • If spillovers exist, the “control” group is partially treated
  • DiD underestimates the true effect
  • Need to think carefully about unit of analysis

From Rules to Violations

If all three rules hold, DiD identifies the ATT. But in practice, each rule can fail — and the violations have distinctive signatures. Let’s see what happens when the universes aren’t truly parallel.

From Assumptions to Violations


When the Universes Aren’t Parallel

How and Why DiD Breaks

Semmelweis’s Critics Broke Rule 1

His critics claimed improved ventilation — not hand-washing — caused mortality to drop.

What they were really arguing:

“The midwife wing would have improved too if it had gotten better ventilation. You can’t attribute the improvement to hand-washing.”

This is a parallel trends violation claim. They argued the counterfactual trend for the physician wing differed from the observed control trend.

The Stakes of Parallel Trends

Without the vocabulary to defend parallel trends, Semmelweis lost the argument — and thousands more mothers died.

Modern DiD gives us tools to preemptively address these objections.

The Shapeshifter: When Your Window Changes Shape

The Problem

The composition of your sample changes over time in ways correlated with treatment.

Example: Hong (2013) — Napster and Music Sales

  • Study uses Consumer Expenditure Survey data on music spending
  • Compares internet users vs. non-users before/after Napster (June 1999)
  • Problem: Who uses the internet changed dramatically over this period!

Hong (2013): The Window Shifted

Internet diffusion changed the composition of “internet users”

Demographic shifts between periods

The Issue: Early internet adopters were young music fans. As internet access diffused, the “internet user” group became less music-focused.

The Demographics Tell the Story

Demographic characteristics by period

Important

Compositional changes can make parallel trends fail even if the underlying causal effect is real.

The Crystal Ball: When Policy Sees the Future

The Problem

Policies are enacted in response to pre-existing trends, not randomly.

Examples:

  • States with rising obesity rates pass soda taxes
  • Hospitals with declining quality join health systems
  • Cities with rising crime adopt new policing strategies

The Issue:

The very trends that prompted the policy violate parallel trends

endo_data <- tibble(
  time = rep(-5:5, 2),
  group = rep(c("Control", "Treated"), each = 11),
  outcome = c(
    10 + 0.3 * (-5:5),  # Control - steady growth
    c(12, 13, 14.5, 16.5, 19, 21, 22, 22.5, 23, 23.5, 24)  # Treated - was growing faster, policy slows it
  )
)

ggplot(endo_data, aes(x = time, y = outcome, color = group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  annotate("text", x = -3, y = 18, label = "Pre-existing\ntrend difference",
           color = accent_coral, size = 4, fontface = "bold") +
  scale_color_manual(values = c(slate_gray, accent_coral), name = "") +
  labs(
    title = "Policy Endogeneity",
    subtitle = "Treated group was already on a different trajectory",
    x = "Period", y = "Outcome"
  ) +
  theme_health_econ(base_size = 14)
Figure 8: Policy endogeneity (the treated group was already on a different trajectory)

The Ghost: Invisible Forces Push the Universes Apart

The Problem

Missing factors that differentially affect treated and control groups over time.

Common scenarios:

  • Economic shocks hit regions differently
  • Demographic changes vary across areas
  • Other policies implemented simultaneously

Solution: Conditional parallel trends — control for covariates that restore the parallel trends assumption:

\[E[\Delta Y^0_{Treated} | X] = E[\Delta Y^0_{Control} | X]\]
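In the simplest implementation, the covariates just enter the DiD regression linearly; a minimal sketch with hypothetical variable names (more careful approaches, such as doubly robust DiD or the covariate options in the did package, condition the counterfactual trend directly):

# Covariate-adjusted DiD -- assumes age and income enter linearly
# and are themselves unaffected by treatment (hypothetical names)
lm(y ~ treat * post + age + income, data = df)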

From Violations to Diagnostics

We can’t directly observe the parallel universe. But we can check whether the universes were aligned before the split — and event studies are our diagnostic tool.

From Violations to Diagnostics


Checking the Alignment

Event Studies and Validation Checks

Learning from History

Can You Trust Your Window?

The Fundamental Challenge

Parallel trends is untestable — we never observe the treated counterfactual.

But we can examine pre-treatment trends as a diagnostic.

The Logic:

  • If groups had parallel trends before treatment, they might after too
  • If pre-trends differ, we should be very worried
  • Event studies also show dynamic treatment effects over time

Learning from Semmelweis

Event studies exist because we’ve learned from cases like Semmelweis. We need to preemptively address the parallel trends objection — not wait for critics to raise it after the fact.

The Alignment Check: Event Study Regression

\[ \begin{aligned} Y_{it} = \gamma_i + \lambda_t &+ \sum_{\tau=-q}^{-2} \mu_{\tau} (D_i \times \mathbf{1}\{\tau_t = \tau\}) + \sum_{\tau=0}^{m} \delta_{\tau} (D_i \times \mathbf{1}\{\tau_t = \tau\}) + \varepsilon_{it} \end{aligned} \]

Components:

  • \(\gamma_i\): Unit fixed effects
  • \(\lambda_t\): Time fixed effects
  • \(\mu_{\tau}\): Pre-treatment coefficients (should be ~0)
  • \(\delta_{\tau}\): Post-treatment effects (our estimates)

Key points:

  • Omit one period (usually \(\tau = -1\)) as reference
  • Pre-period \(\mu\)’s test for differential pre-trends
  • Post-period \(\delta\)’s trace dynamic treatment effects

Important Caveat

\(\mu_{\tau} = 0\) does NOT prove parallel trends holds — only that we can’t detect a violation.
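In practice this regression is typically estimated with fixest; a minimal sketch, assuming a panel data frame panel_df with a rel_time variable counting periods relative to treatment (all names hypothetical):

library(fixest)

# Event study: unit and year FEs, treatment-group dummies by relative period
# ref = -1 omits the period just before treatment as the reference
es <- feols(
  y ~ i(rel_time, treat, ref = -1) | id + year,
  data = panel_df,
  cluster = ~id
)
iplot(es)  # plots the mu (pre) and delta (post) coefficients with 95% CIs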

Reading the Alignment Scan

set.seed(123)
event_time <- -4:4
coeff <- c(-0.02, 0.05, -0.03, 0, -0.15, -0.28, -0.35, -0.38, -0.42)
se <- c(0.06, 0.05, 0.04, NA, 0.06, 0.07, 0.08, 0.09, 0.10)

event_data <- tibble(
  period = event_time,
  estimate = coeff,
  se = se,
  lower = coeff - 1.96 * se,
  upper = coeff + 1.96 * se,
  phase = ifelse(period < 0, "Pre-Treatment", "Post-Treatment")
)

ggplot(event_data, aes(x = period, y = estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = slate_gray) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  geom_errorbar(data = filter(event_data, period != -1),
                aes(ymin = lower, ymax = upper, color = phase), width = 0.2, linewidth = 1) +
  geom_point(data = filter(event_data, period != -1),
             aes(color = phase), size = 4) +
  geom_point(data = filter(event_data, period == -1),
             shape = 1, size = 4, color = slate_gray, stroke = 1.5) +
  scale_color_manual(values = c(secondary_teal, accent_coral), name = "") +
  annotate("rect", xmin = -4.5, xmax = -0.5, ymin = -0.6, ymax = 0.3,
           fill = secondary_teal, alpha = 0.08) +
  annotate("rect", xmin = -0.5, xmax = 4.5, ymin = -0.6, ymax = 0.3,
           fill = accent_coral, alpha = 0.08) +
  annotate("text", x = -2.5, y = 0.25, label = "Pre-trends\n(should be ~0)",
           color = secondary_teal, size = 4, fontface = "bold") +
  annotate("text", x = 2.5, y = 0.25, label = "Treatment effects\n(our estimates)",
           color = accent_coral, size = 4, fontface = "bold") +
  annotate("text", x = -1, y = -0.15, label = "Reference\nperiod",
           color = slate_gray, size = 3.5) +
  labs(
    title = "Anatomy of an Event Study",
    subtitle = "Period -1 is the reference (normalized to zero)",
    x = "Event Time (Periods Relative to Treatment)",
    y = "Coefficient Estimate"
  ) +
  theme_health_econ(base_size = 18) +
  theme(legend.position = "none")

Figure 10: Interpreting event study coefficients

Graduate Students: Pre-Testing Pitfalls

Roth (2022) shows that conditioning on passing a pre-trends test can distort inference. Pre-trends tests have low power against violations that would meaningfully bias the DiD estimate. The absence of evidence is not evidence of absence.

From Diagnostics to Modern DiD


When the Split Gets Complicated

Staggered Adoption and New Estimators

When Everyone Enters Their Universe at Different Times

The Modern DiD Challenge

Many studies have staggered adoption: different units receive treatment at different times.

The standard two-way fixed effects (TWFE) estimator can be severely biased in this setting.

The Standard Approach:

\[Y_{it} = \delta D_{it} + \alpha_i + \gamma_t + \varepsilon_{it}\]

  • \(\alpha_i\): Unit fixed effects
  • \(\gamma_t\): Time fixed effects
  • \(D_{it}\): Treatment indicator

The Problem:

  • TWFE is a weighted average of many 2×2 comparisons
  • Some comparisons use already-treated units as controls
  • Some weights can be negative!

TWFE Uses Already-Treated Units as Controls

Goodman-Bacon (2021) showed that the TWFE estimate decomposes into a weighted average of 2×2 comparisons of three types — treated vs. never-treated, early vs. later-treated, and later vs. already-treated:

Table 3

| Comparison | Control Group | Problem |
|---|---|---|
| Early vs. Never Treated | Never treated | Clean |
| Late vs. Never Treated | Never treated | Clean |
| Early vs. Late (Not-Yet-Treated) | Late treated (pre-treatment period) | Clean |
| Late vs. Early (Already Treated) | Early treated (post-treatment period) | PROBLEMATIC: treated units used as controls |

Key insight: When early-treated units serve as “controls” for late-treated units, and treatment effects grow over time, TWFE can even get the wrong sign.

Graduate Students: Goodman-Bacon Decomposition

The TWFE coefficient \(\hat{\delta}\) is a weighted average: \(\hat{\delta} = \sum_{k} w_k \hat{\delta}_k\) where weights \(w_k\) depend on group sizes and timing variance. Negative weights arise when already-treated groups serve as controls, which can reverse the sign of the estimate.
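The bacondecomp package implements this decomposition; a minimal sketch with hypothetical variable names:

library(bacondecomp)

# Decompose the TWFE DiD estimate into its constituent 2x2 comparisons
decomp <- bacon(
  outcome ~ treated,   # outcome and binary treatment indicator
  data = panel_df,
  id_var = "state",
  time_var = "year"
)
decomp  # shows each comparison type, its estimate, and its weight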

Modern Estimators Fix the Staggered Problem

Table 4

| Method | Key Innovation | Best For |
|---|---|---|
| Callaway & Sant'Anna (2021) | Estimates group-time effects ATT(g, t) separately, then aggregates | Heterogeneous effects across cohorts |
| Sun & Abraham (2021) | Interaction-weighted estimator for event studies | Dynamic event study designs |
| De Chaisemartin & d'Haultfoeuille (2020) | Identifies conditions under which TWFE fails; proposes an alternative | Diagnosing TWFE problems |

Practical Advice

With staggered adoption: (1) Use the Goodman-Bacon decomposition to check for problematic comparisons, (2) Apply modern estimators as robustness checks, (3) Be transparent about which approach you use.

Advanced Upgrades: Inference and Sensitivity

Clustered Standard Errors

Problem: Serial correlation within units inflates t-statistics

Solution: Cluster at the level of treatment assignment (Bertrand, Duflo & Mullainathan, 2004)

Honest DiD

Problem: How robust are results to parallel trends violations?

Solution: Rambachan & Roth (2023) bound the ATT under assumptions about maximum pre-trend deviations

Graduate Students: Wild Cluster Bootstrap

When the number of clusters is small (< 50), cluster-robust SEs can be severely size-distorted. Cameron, Gelbach & Miller (2008) propose the wild cluster bootstrap. In R: fwildclusterboot::boottest().
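A minimal sketch, assuming a 2×2 DiD with state-level treatment assignment (data and variable names hypothetical):

library(fixest)
library(fwildclusterboot)

m <- feols(y ~ treat * post | unit + time, data = df, cluster = ~state)

# Wild cluster bootstrap inference for the DiD interaction
boottest(m, param = "treat:post", clustid = "state", B = 9999)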

Advanced Upgrades: Triple Differences

Triple Differences (DDD)

Problem: Parallel trends is violated but affects both treatment and placebo groups similarly

Solution: Add a third difference to net out the bias:

\[\delta^{DDD} = \delta^{DiD}_{main} - \delta^{DiD}_{placebo}\]

Example: Compare age-eligible vs. age-ineligible within treatment vs. control states

From Theory to Evidence

We have the toolkit — the bridge, the rules, the alignment check. Now let’s see them work on real health policy questions.

From Methods to Application


From Methods to Practice

Applying DiD to Real Policy Evidence

Miller et al. (2019)

Miller et al. (2019): Discussion Questions

Skim the paper and think about the following:

  1. What is the main research question?
  2. What is the treatment and what is the comparison group?
  3. How do they establish that the “bite” of the policy was real?
  4. What falsification test do they run and why?
  5. What are the main results and mechanisms?

Context

Under the Affordable Care Act, some states expanded Medicaid while others did not. This creates variation in treatment timing across states.

The Setting: Did Medicaid Expansion Save Lives?

The Policy:

  • ACA allowed states to expand Medicaid to adults up to 138% of poverty
  • Some states expanded starting in 2014; others did not
  • Focus: near-elderly adults (ages 55-64)

The Question:

Does Medicaid expansion reduce mortality?

The Challenge:

States that expanded may differ systematically from those that didn’t

Medicaid eligibility changes

First Check: Did the Policy Actually Change Anything?

Medicaid coverage increased

Uninsured rate fell

The “bite”: Expansion states saw large increases in Medicaid enrollment and large decreases in uninsurance among the target population.

The Falsification Test: Looking Through the Wrong Window

The Logic:

  • Near-elderly (55-64) are the treatment group
  • Elderly (65+) are already on Medicare
  • If Medicaid expansion affects elderly mortality, something is wrong

Why This Works:

  • Elderly are exposed to same state-level shocks
  • But Medicaid expansion is irrelevant to them
  • Finding an effect would suggest confounding

No effect on elderly mortality

The Verdict: Medicaid Expansion Saved Lives

Event study: Mortality declines after expansion

Key Results:

  • 9.4% reduction in annual mortality among near-elderly
  • 0.13 percentage point decline in mortality rate
  • Roughly 1 death prevented per 239–316 newly insured adults
  • Effect driven by disease-amenable causes of death

Event Study Features:

  • Pre-trends are flat (parallel trends supported)
  • Effect grows over time (consistent with health insurance improving health)

Assembling the Evidence Package

Table 5

| Evidence Type | Finding | Supports... |
|---|---|---|
| Bite | Large increase in Medicaid enrollment; large decrease in uninsurance | Policy actually changed insurance coverage |
| Parallel Pre-Trends | Event study shows no differential pre-trends | Parallel trends assumption is plausible |
| Falsification | No effect on elderly (already covered by Medicare) | Results not driven by state-level confounders |
| Main Result | 9.4% reduction in mortality among near-elderly | Medicaid expansion saves lives |
| Mechanism | Effect driven by disease-amenable causes of death | Health insurance improves health outcomes |

From First to Second Application


A Second Application

Extending the Framework to New Settings

Gaynor et al. (2021)

Gaynor et al. (2021): Discussion Questions

Skim the paper and think about the following:

  1. What is the main research question about hospital mergers?
  2. What makes this study’s data unique (hint: surveys)?
  3. How do they establish parallel trends in their event study?
  4. What is the “new toy” effect and why might it matter?
  5. Do the promised efficiencies of the merger materialize?

Context

Hospital mergers promise efficiency gains and quality improvements. This study examines whether those promises are kept.

The Question: Do Hospital Mergers Deliver on Their Promises?

Paper: The Anatomy of a Hospital System Merger: The Patient Did Not Respond Well to Treatment

Context:

  • Hospital mergers are increasingly common
  • Promised benefits: efficiency gains, quality improvements
  • Concerns: market power, higher prices, reduced quality

Innovation:

  • Combines administrative data with survey of hospital leadership
  • Opens the “black box” of management practices
  • Can compare stated goals vs. actual outcomes

Research Questions:

  1. Do promised efficiencies materialize?
  2. What happens to staffing and quality?
  3. How do target hospitals perform post-merger?

The Evidence: Mergers Hurt More Than They Help

Key Findings:

  • Panel C: Physician exit rates jump immediately post-merger
  • Parallel pre-trends, then divergence
  • Target hospitals show negative profits by year 6

The “New Toy” Effect (Schoar, 2002):

  • Resources shift to acquired hospitals
  • Incumbent divisions suffer
  • Net effect may be negative

Event study of merger effects

Two Studies, One Toolkit

Miller et al. (2019)

Strengths:

  • Clear policy variation
  • Strong falsification test
  • Plausible mechanism

DiD Design Elements:

  • Treatment: State Medicaid expansion
  • Control: Non-expansion states
  • Pre-trends: Event study shows parallel

Gaynor et al. (2021)

Strengths:

  • Rich data on mechanisms
  • Management survey adds insight
  • Multiple outcome measures

DiD Design Elements:

  • Treatment: Hospital acquisition
  • Control: Non-acquired hospitals
  • Pre-trends: Parallel in most outcomes

From Evidence to Synthesis


Key Takeaways

What to Carry into Applied Work

The Regression Table Is the Claim — Evidence Is the Smoking Gun

The regression coefficient is just the headline. Convincing DiD requires an evidence package — a collection of tests that together build the causal case.

Think of it like building a case in court: the main result is your argument, but the jury needs corroborating evidence.

Components of the Evidence Package:

  • Bite: Did the policy actually shift treatment uptake?
  • Pre-trends: Event study with flat pre-period
  • Falsification: No effect on groups that shouldn’t be affected
  • Mechanisms: Does the story make biological/economic sense?
  • Robustness: Results survive alternative specifications

The Complete Evidence Package: What to Build

Core Requirements

  1. Parallel Trends: Would groups have evolved similarly?
  2. No Anticipation: Is the pre-period clean?
  3. SUTVA: No spillovers or interference?

Evidence to Show

  • Bite: Did the policy actually change something?
  • Pre-trends: Event study with flat pre-period
  • Falsification: No effect where none expected
  • Mechanisms: Does the story make sense?

The Complete Evidence Package: What to Watch For

Questions to Ask

  1. Why did some units get treated and others didn’t?
  2. What else changed at the same time?
  3. Who is the comparison group, really?
  4. Is there a good falsification test?
  5. How sensitive are results to specification?

Red Flags

  • Diverging pre-trends
  • Policy enacted in response to trends
  • Compositional changes over time
  • Spillovers across groups

Where Parallel Universes Work Best

| Setting | Treatment | Control | Key Concern |
|---|---|---|---|
| State policy adoption | Adopting states | Non-adopting states | Policy endogeneity |
| Hospital mergers | Acquired hospitals | Non-acquired hospitals | Selection into acquisition |
| Insurance expansions | Newly eligible | Ineligible | Compositional change |
| Payment reforms | Affected providers | Unaffected providers | Spillovers to controls |
| Geographic variation | Exposed areas | Unexposed areas | Local shocks |

The Parallel Universe Principle

The Core Insight

“You can never visit the universe where the policy didn’t happen — but if you find a group that was traveling the same path, the gap between their journeys reveals the causal effect.”

DiD works because:

  • Policy creates a fork between treated and untreated universes
  • The control group’s journey approximates the road not taken
  • The gap at the fork is causal — as long as the universes were truly aligned

Looking Ahead

RDD exploited a spatial edge — comparing units on either side of a cutoff. DiD exploits a temporal fork — comparing trajectories before and after treatment.

But what if no single control group follows the same path?

Next: Synthetic Control Methods

When you can’t find a parallel universe, maybe you can build one — constructing a custom counterfactual from a weighted portfolio of control units.

From the parallel universe to the synthetic one — our causal toolkit keeps growing.

Discussion Questions

  1. Design thinking: You want to study whether hospital price transparency laws reduce healthcare spending. What would be your treatment and control groups? What threats to parallel trends worry you most?

  2. Falsification: In the Miller et al. study, why is the elderly population a good falsification test? Can you think of another falsification test they could have used?

  3. Staggered adoption: Many health policies are adopted by states at different times. When does this help identification (more variation) vs. hurt it (TWFE problems)?

  4. Anticipation: A state announces a hospital price transparency law 18 months before implementation. You plan a DiD study using the implementation date. What could go wrong? How would you adjust your design?

From Takeaways to Appendix


Appendix: Technical Details

The Honest DiD Framework

Rambachan & Roth (2023): What if parallel trends is slightly violated?

The Approach:

  1. Measure maximum deviation in pre-treatment coefficients: \(\bar{M}\)
  2. Assume post-treatment violations are bounded: \(|\text{bias}| \leq \bar{M}\)
  3. Construct confidence intervals that account for potential bias

Interpretation:

“If post-treatment trend violations are no worse than what we observed pre-treatment, the ATT lies in this interval.”

Provides a sensitivity analysis for the parallel trends assumption.

R Package

The HonestDiD package implements these methods. Increasingly expected in top journals.
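A sketch of the workflow, assuming betahat and sigma are the event-study coefficient vector and variance-covariance matrix from your estimation, here with 4 pre- and 5 post-periods (all inputs hypothetical):

library(HonestDiD)

# Bound post-period violations at Mbar times the largest pre-period violation
sens <- createSensitivityResults_relativeMagnitudes(
  betahat = betahat,
  sigma = sigma,
  numPrePeriods = 4,
  numPostPeriods = 5,
  Mbarvec = seq(0, 2, by = 0.5)
)
sens  # robust confidence sets for the ATT at each Mbar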

Triple Differences: Formal Setup

When to use: Parallel trends is violated, but you have an additional comparison dimension.

\[Y_{igt} = \alpha + \beta_1 \text{Treat}_g + \beta_2 \text{Post}_t + \beta_3 \text{Eligible}_i + \text{(two-way interactions)} + \delta^{DDD} (\text{Treat}_g \times \text{Post}_t \times \text{Eligible}_i) + \varepsilon_{igt}\]

Example: Miller et al. could have done:

  • \(g\): Expansion vs. non-expansion states
  • \(t\): Pre vs. post-ACA
  • \(i\): Near-elderly (eligible) vs. elderly (ineligible)

The DDD coefficient nets out any state-time trends common to both age groups.
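A hedged fixest sketch of this regression (variable names hypothetical; the * operator expands to all main effects and two-way interactions automatically):

library(fixest)

# Triple differences: the three-way interaction coefficient is delta^DDD
ddd <- feols(y ~ treat * post * eligible, data = df, cluster = ~state)
coef(ddd)["treat:post:eligible"]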

Callaway & Sant’Anna (2021): Group-Time ATT

The Innovation: Instead of one pooled estimate, estimate ATT for each cohort at each time:

\[ATT(g, t) = E[Y_t(g) - Y_t(\infty) | G = g]\]

where \(g\) is the treatment timing group and \(Y_t(\infty)\) is the never-treated potential outcome.

Aggregation Options:

  1. Simple average across \((g,t)\)
  2. Event-study aggregation (by time since treatment)
  3. Cohort-specific effects (by treatment timing)

R Implementation:

library(did)

# Estimate group-time ATTs: one ATT(g, t) per treatment cohort g and period t
att_results <- att_gt(
  yname = "outcome",      # outcome variable
  tname = "year",         # time period
  idname = "id",          # unit identifier
  gname = "first_treat",  # first period treated (0 for never-treated units)
  data = mydata
)
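The group-time estimates are usually aggregated afterward; a short sketch using the package's aggte() helper (the dynamic option produces an event-study-style summary):

# Aggregate ATT(g, t) by time since treatment (event-study form)
es <- aggte(att_results, type = "dynamic")
summary(es)
ggdid(es)  # plot the aggregated event-study coefficients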

Inference in DiD: Clustered Standard Errors

Bertrand, Duflo & Mullainathan (2004): Standard errors in DiD are often way too small.

The Problem:

  • Outcomes are serially correlated within units
  • Standard errors don’t account for this
  • Type I error rates can be 3-4x the nominal level!

The Solution:

Cluster standard errors at the level of treatment assignment (usually state or firm).

# In R with fixest
feols(Y ~ treat*post | unit + time, data = df, cluster = "state")

Rule of Thumb

You need at least 30-50 clusters for cluster-robust inference to work well. With fewer clusters, use wild cluster bootstrap.