The Parallel Universe

Finding Causal Effects by Comparing Alternate Realities

Health Economics and Policy

Spring 2026

The Parallel Universe

DiD as a Counterfactual Bridge

The Dream: Seeing the Road Not Taken

The Fundamental Problem

We want to know: “What would have happened if this policy had never been implemented?”

But we can’t rewind history and try both paths.

Difference-in-Differences offers a solution:

Use a comparison group as a stand-in for the counterfactual — what the treated group would have looked like without treatment.

The core idea:

Compare the change over time in the treated group to the change over time in a control group. The difference in these differences is our causal estimate.

The First Parallel Universe Detective: Semmelweis (1847)

The Setting: Vienna General Hospital

  • In 1847, pregnant women were routed to one of two wings:
    • Physician wing: 13-18% maternal mortality
    • Midwife wing: ~3% mortality

Semmelweis’s Insight:

Physicians came from performing autopsies — could “cadaveric particles” be causing deaths?

library(tidyverse)  # tibble, %>%, ggplot2
library(scales)     # label_percent()
# theme_health_econ() and the palette colors (secondary_teal, accent_coral, ...)
# are assumed to be defined in the deck's setup chunk

semmelweis_data <- tibble(
  period = rep(c("Before\nIntervention", "After\nIntervention"), 2),
  group = rep(c("Physician Wing", "Midwife Wing"), each = 2),
  mortality = c(15.5, 2.5, 3.0, 2.8)
) %>%
  mutate(period = factor(period, levels = c("Before\nIntervention", "After\nIntervention")))

ggplot(semmelweis_data, aes(x = period, y = mortality,
                            color = group, group = group)) +
  geom_line(linewidth = 2) +
  geom_point(size = 5) +
  scale_color_manual(values = c(secondary_teal, accent_coral), name = "") +
  scale_y_continuous(labels = label_percent(scale = 1), limits = c(0, 18)) +
  labs(
    title = "Maternal Mortality by Wing",
    subtitle = "Hand-washing intervention in physician wing",
    x = "", y = "Mortality Rate"
  ) +
  theme_health_econ(base_size = 16) +
  theme(legend.position = "bottom")
Figure 1: Semmelweis’s natural experiment

Semmelweis’s Intervention Changed Everything

The Intervention:

Instituted hand-washing with chlorine solution for physicians (but not midwives)

The Result:

A natural DiD design — mortality dropped dramatically in the physician wing while staying stable in the midwife wing

Why this is DiD:

  • Treatment group: Physician wing (received hand-washing)
  • Control group: Midwife wing (no change)
  • Before vs. after: Mortality before and after the intervention

The Cost of Being Right Too Early

Evidence Rejected

Despite compelling evidence, Semmelweis’s findings were dismissed.

His superiors attributed improvements to changes in ventilation — not hand-washing.

Semmelweis faced professional ruin and died in an asylum.

A Parallel Trends Objection Before Its Time

The ventilation critique was essentially a parallel trends violation claim: “Both wings would have improved due to better ventilation — the physician wing just happened to get it first.”

If Semmelweis had the vocabulary of parallel trends, could he have defended his findings more convincingly?

It took another 20 years for germ theory to become accepted.

The Second Detective: John Snow (1854)

The Crisis:

  • Three major cholera waves devastated London
  • Dominant theory: Miasma (bad air causes disease)

Snow’s Insight:

Cholera was waterborne, not airborne — transmitted through contaminated Thames water

Snow’s Natural Experiment

The Setup:

  • London ordered water companies to move intake pipes upstream
  • Companies complied at different times
  • Lambeth moved pipes in 1852; Southwark & Vauxhall didn’t until later

The Identification:

Same neighborhoods, different water sources — a natural DiD design

Map of London water utilities, 1854

Snow’s Table: The First DiD Calculation

Table 1

| Company | Period | Deaths per 10,000 | Outcome Expression | First Difference |
|---|---|---|---|---|
| Lambeth | Before (1849) | 150 | \(Y = L\) | |
| Lambeth | After (1854) | 37 | \(Y = L + {\color{#e74c3c}{L_t}} + D\) | \(D + {\color{#e74c3c}{L_t}}\) |
| Southwark & Vauxhall | Before (1849) | 118 | \(Y = SV\) | |
| Southwark & Vauxhall | After (1854) | 147 | \(Y = SV + SV_t\) | \(SV_t\) |

Snow’s Numbers Tell the Story

The DiD Calculation (with numbers):

\[\delta^{DiD} = (37 - 150) - (147 - 118) = -113 - 29 = \mathbf{-142} \text{ deaths per 10,000}\]

The DiD Calculation (with algebra):

\[\delta^{DiD} = \Big[(L + {\color{#e74c3c}{L_t}} + D) - L\Big] - \Big[(SV + SV_t) - SV\Big] = D + ({\color{#e74c3c}{L_t}} - SV_t)\]

If \({\color{#e74c3c}{L_t}} = SV_t\) (parallel trends), then \(\delta^{DiD} = D\) — the true treatment effect!
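The whole calculation fits in a few lines of R (a toy check using the cell means from Table 1):

# Snow's four cell means (deaths per 10,000), from Table 1
lambeth_pre <- 150; lambeth_post <- 37
sv_pre      <- 118; sv_post      <- 147

# Change in Lambeth minus change in Southwark & Vauxhall
(lambeth_post - lambeth_pre) - (sv_post - sv_pre)  # -113 - 29 = -142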

Two Physicians, One Method

What Semmelweis and Snow Discovered

Both physicians were doing the same thing — comparing changes across groups over time.

Neither had the formal vocabulary, but both intuitively understood: a single before-after comparison isn’t enough. You need a control group to benchmark against.

Their legacy: The difference-in-differences framework formalizes exactly what they discovered.

Snow’s table is a DiD table. Now let’s see the same logic expressed in regression form.

From Intuition to Formalism

Semmelweis and Snow saw into the parallel universe before anyone had a name for it. Now let’s formalize what they discovered — turning intuition into a formula we can estimate, test, and defend.

From Motivation to Design


Building the Bridge Between Universes

The Difference-in-Differences Setup

Anatomy of the Bridge: Four Averages, Three Subtractions

# Create DiD visualization data
did_viz <- tibble(
  time = c(0, 1, 0, 1),
  group = c("Control", "Control", "Treated", "Treated"),
  outcome = c(2, 3, 3, 5.5),
  counterfactual = c(NA, NA, NA, 4)
)

# Add counterfactual line for treated
cf_line <- tibble(
  time = c(0, 1),
  outcome = c(3, 4),
  group = "Counterfactual"
)

ggplot() +
  # Control group
  geom_line(data = filter(did_viz, group == "Control"),
            aes(x = time, y = outcome), color = slate_gray, linewidth = 2) +
  geom_point(data = filter(did_viz, group == "Control"),
             aes(x = time, y = outcome), color = slate_gray, size = 6) +
  # Treated group
  geom_line(data = filter(did_viz, group == "Treated"),
            aes(x = time, y = outcome), color = secondary_teal, linewidth = 2) +
  geom_point(data = filter(did_viz, group == "Treated"),
             aes(x = time, y = outcome), color = secondary_teal, size = 6) +
  # Counterfactual
  geom_line(data = cf_line, aes(x = time, y = outcome),
            color = accent_coral, linewidth = 1.5, linetype = "dashed") +
  geom_point(data = filter(cf_line, time == 1),
             aes(x = time, y = outcome), color = accent_coral, size = 5, shape = 1, stroke = 2) +
  # Treatment effect arrow
  annotate("segment", x = 1.05, y = 4, xend = 1.05, yend = 5.5,
           arrow = arrow(length = unit(0.3, "cm"), ends = "both"),
           color = warm_gold, linewidth = 1.5) +
  annotate("text", x = 1.15, y = 4.75, label = "Treatment\nEffect",
           color = warm_gold, fontface = "bold", size = 5, hjust = 0) +
  # Labels
  annotate("text", x = -0.05, y = 2, label = "Control (Pre)", hjust = 1, color = slate_gray, size = 4) +
  annotate("text", x = 1.05, y = 3, label = "Control (Post)", hjust = 0, color = slate_gray, size = 4) +
  annotate("text", x = -0.05, y = 3, label = "Treated (Pre)", hjust = 1, color = secondary_teal, size = 4) +
  annotate("text", x = 1.05, y = 5.5, label = "Treated (Post)", hjust = 0, color = secondary_teal, size = 4) +
  annotate("text", x = 1.05, y = 4, label = "Counterfactual", hjust = 0, color = accent_coral, size = 4) +
  # Vertical line at treatment
  geom_vline(xintercept = 0.5, linetype = "dotted", color = slate_gray, linewidth = 1) +
  annotate("text", x = 0.5, y = 6.2, label = "Treatment\nOccurs", color = slate_gray, size = 4) +
  scale_x_continuous(breaks = c(0, 1), labels = c("Pre", "Post"), limits = c(-0.2, 1.5)) +
  scale_y_continuous(limits = c(1.5, 6.5)) +
  labs(
    title = "The DiD Estimator: Parallel Trends in Action",
    subtitle = "The control group trend estimates what would have happened to treated units",
    x = "", y = "Outcome"
  ) +
  theme_health_econ(base_size = 18) +
  theme(panel.grid = element_blank())

Figure 3: The difference-in-differences estimator

The Blueprint: One Regression Captures the Design

The 2×2 DiD Estimator:

\[\delta^{DiD} = \underbrace{(Y_{Treated,Post} - Y_{Treated,Pre})}_{\text{Change in treated}} - \underbrace{(Y_{Control,Post} - Y_{Control,Pre})}_{\text{Change in control}}\]

As a regression:

\[Y_{it} = \alpha + \beta_1 \cdot \text{Treat}_i + \beta_2 \cdot \text{Post}_t + \delta \cdot (\text{Treat}_i \times \text{Post}_t) + \varepsilon_{it}\]

  • \(\beta_1\): Baseline difference between groups
  • \(\beta_2\): Common time trend
  • \(\delta\): The DiD estimate — our causal effect!

Graduate Students: OLS Equivalence

The 2×2 DiD can be written as: \(\delta = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})\), which is algebraically identical to the OLS coefficient on \(\text{Treat} \times \text{Post}\) in the saturated model.
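A quick simulation illustrates the equivalence (toy data, not from the slides; the true effect is set to 1.5):

set.seed(42)
n <- 200

# Toy 2x2 panel: half the units treated, two periods, true effect = 1.5
df <- expand.grid(id = 1:n, post = 0:1) %>%
  mutate(
    treat = as.integer(id <= n / 2),
    y = 2 + 1 * treat + 0.5 * post + 1.5 * treat * post + rnorm(n * 2)
  )

# (1) Double difference of the four group means
cell_means <- df %>%
  group_by(treat, post) %>%
  summarise(ybar = mean(y), .groups = "drop")
with(cell_means,
     (ybar[treat == 1 & post == 1] - ybar[treat == 1 & post == 0]) -
       (ybar[treat == 0 & post == 1] - ybar[treat == 0 & post == 0]))

# (2) OLS coefficient on the interaction -- numerically identical to (1)
coef(lm(y ~ treat * post, data = df))["treat:post"]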

The Parallel Universe Test

The DiD Golden Rule

Ask yourself: “If the treatment had never happened, would the treated group have followed the same path as the control group?”

If the answer is yes, the control group is a valid window into the treated group’s alternate universe.

The Parallel Universe Test checks three things:

  1. Were the universes aligned before the split? (If not, parallel trends fails.)
  2. Did anyone peek ahead and change course? (If yes, the pre-period is contaminated.)
  3. Did anything leak between universes? (If yes, your control group is partially treated.)

Just as the RDD “Twins Test” asks “Would these two people be indistinguishable without the cutoff?”, the Parallel Universe Test asks “Would these two groups have traveled the same road without the policy?”

The Bridge Works Only If the Universes Were Aligned

Express DiD in potential outcomes notation:

\[\delta^{DiD} = \underbrace{E[Y^1_{Treated} - Y^0_{Treated}]}_{\text{ATT (what we want)}} + \underbrace{\Big[E[\Delta Y^0_{Treated}] - E[\Delta Y^0_{Control}]\Big]}_{\text{Bias if trends differ}}\]

The Key Insight

DiD gives us the Average Treatment Effect on the Treated (ATT) only if the bias term equals zero.

This happens when parallel trends holds.

Graduate Students: Formal Identification Result

Under parallel trends, \(E[\Delta Y^0 | D=1] = E[\Delta Y^0 | D=0]\), the bias term \(E[\Delta Y^0_{Treated}] - E[\Delta Y^0_{Control}]\) equals zero by assumption, and therefore \(\delta^{DiD} = ATT\).

From Formula to Assumptions

The formula tells us how to build the bridge between universes. But when does it lead to the right universe? Three rules determine whether the connection is valid.

From Setup to Assumptions


Three Rules for Parallel Universes

Parallel Trends, No Anticipation, No Spillovers

Three Rules for Parallel Universes

Table 2

| # | Assumption | What It Means | Testable? | If Violated... |
|---|---|---|---|---|
| 1 | Parallel Trends | Without treatment, groups would follow the same trajectory | Partially | DiD estimate is biased |
| 2 | No Anticipation | Units don't change behavior before treatment | Sometimes | Pre-period is contaminated |
| 3 | SUTVA | No spillovers between units; treatment is well-defined | Sometimes | Treatment effect is ill-defined |

Rule 1: The Universes Must Have Been Aligned

The Core Assumption

\[E[\Delta Y^0_{Treated}] = E[\Delta Y^0_{Control}]\]

In the absence of treatment, both groups would have followed the same trajectory.

What Parallel Trends IS:

  • Control group trend approximates treated counterfactual
  • Changes over time would be similar absent treatment

What It’s NOT:

  • Does not require groups to be similar at baseline
  • Does not require random assignment (though that would guarantee it)
set.seed(123)
time <- -5:5
control <- 10 + 0.5 * time + rnorm(11, 0, 0.2)
treated <- 12 + 0.5 * time + ifelse(time >= 0, 2, 0) + rnorm(11, 0, 0.2)

pt_data <- tibble(
  time = rep(time, 2),
  group = rep(c("Control", "Treated"), each = 11),
  outcome = c(control, treated)
)

ggplot(pt_data, aes(x = time, y = outcome, color = group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  scale_color_manual(values = c(slate_gray, secondary_teal), name = "") +
  annotate("text", x = -0.5, y = 16, label = "Treatment", color = slate_gray, size = 4) +
  labs(
    title = "Parallel Pre-Trends",
    x = "Period", y = "Outcome"
  ) +
  theme_health_econ(base_size = 14)

Rule 2: No One Peeked into the Future

Definition

Units do not change their behavior before treatment in anticipation of it.

Violations:

  • Buying a house today because you expect a housing subsidy next month
  • Hospitals changing staffing before a merger is announced
  • Patients stocking up on medications before a policy change

Why It Matters:

If the pre-period is already affected, we don’t have a clean baseline for comparison

antic_data <- tibble(
  time = rep(-4:4, 2),
  group = rep(c("Control", "Treated (with anticipation)"), each = 9),
  outcome = c(
    10 + 0.5 * (-4:4),  # Control
    c(12, 12.5, 13, 14, 15, 17, 18, 19, 20)  # Treated - anticipates at t=-2
  )
)

ggplot(antic_data, aes(x = time, y = outcome, color = group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  annotate("rect", xmin = -2.5, xmax = -0.5, ymin = 9, ymax = 21,
           fill = accent_coral, alpha = 0.15) +
  annotate("text", x = -1.5, y = 20.5, label = "Anticipation\nContamination",
           color = accent_coral, size = 4, fontface = "bold") +
  scale_color_manual(values = c(slate_gray, accent_coral), name = "") +
  labs(
    title = "Anticipation Violates No-Anticipation",
    x = "Period", y = "Outcome"
  ) +
  theme_health_econ(base_size = 14)
Figure 6: Anticipation effects contaminate the baseline

Rule 3: The Universes Don’t Leak

Stable Unit Treatment Value Assumption

  1. No interference: One unit’s treatment doesn’t affect another’s outcome
  2. No hidden variation: Treatment is applied consistently across units

Potential Violations:

  • Hospital in treated state affects patients in control state (geographic spillovers)
  • New policy changes market equilibrium (general equilibrium effects)
  • Treatment intensity varies across “treated” units

Why It Matters:

  • If spillovers exist, the “control” group is partially treated
  • DiD underestimates the true effect
  • Need to think carefully about unit of analysis

From Rules to Violations

If all three rules hold, DiD identifies the ATT. But in practice, each rule can fail — and the violations have distinctive signatures. Let’s see what happens when the universes aren’t truly parallel.

From Assumptions to Violations


When the Universes Aren’t Parallel

How and Why DiD Breaks

Semmelweis’s Critics Broke Rule 1

His critics claimed improved ventilation — not hand-washing — caused mortality to drop.

What they were really arguing:

“The midwife wing would have improved too if it had gotten better ventilation. You can’t attribute the improvement to hand-washing.”

This is a parallel trends violation claim. They argued the counterfactual trend for the physician wing differed from the observed control trend.

The Stakes of Parallel Trends

Without the vocabulary to defend parallel trends, Semmelweis lost the argument — and thousands more mothers died.

Modern DiD gives us tools to preemptively address these objections.

The Shapeshifter: When Your Window Changes Shape

The Problem

The composition of your sample changes over time in ways correlated with treatment.

Example: Hong (2013) — Napster and Music Sales

  • Study uses Consumer Expenditure Survey data on music spending
  • Compares internet users vs. non-users before/after Napster (June 1999)
  • Problem: Who uses the internet changed dramatically over this period!

Hong (2013): The Window Shifted

Internet diffusion changed the composition of “internet users”

Demographic shifts between periods

The Issue: Early internet adopters were young music fans. As internet access diffused, the “internet user” group became less music-focused.

The Demographics Tell the Story

Demographic characteristics by period

Important

Compositional changes can make parallel trends fail even if the underlying causal effect is real.

The Crystal Ball: When Policy Sees the Future

The Problem

Policies are enacted in response to pre-existing trends, not randomly.

Examples:

  • States with rising obesity rates pass soda taxes
  • Hospitals with declining quality join health systems
  • Cities with rising crime adopt new policing strategies

The Issue:

The very trends that prompted the policy violate parallel trends

endo_data <- tibble(
  time = rep(-5:5, 2),
  group = rep(c("Control", "Treated"), each = 11),
  outcome = c(
    10 + 0.3 * (-5:5),  # Control - steady growth
    c(12, 13, 14.5, 16.5, 19, 21, 22, 22.5, 23, 23.5, 24)  # Treated - was growing faster, policy slows it
  )
)

ggplot(endo_data, aes(x = time, y = outcome, color = group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  annotate("text", x = -3, y = 18, label = "Pre-existing\ntrend difference",
           color = accent_coral, size = 4, fontface = "bold") +
  scale_color_manual(values = c(slate_gray, accent_coral), name = "") +
  labs(
    title = "Policy Endogeneity",
    subtitle = "Treated group was already on a different trajectory",
    x = "Period", y = "Outcome"
  ) +
  theme_health_econ(base_size = 14)
Figure 8: Policy endogeneity (the treated group was already on a different trajectory)

The Ghost: Invisible Forces Push the Universes Apart

The Problem

Missing factors that differentially affect treated and control groups over time.

Common scenarios:

  • Economic shocks hit regions differently
  • Demographic changes vary across areas
  • Other policies implemented simultaneously

Solution: Conditional parallel trends — control for covariates that restore the parallel trends assumption:

\[E[\Delta Y^0_{Treated} | X] = E[\Delta Y^0_{Control} | X]\]
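In the simplest implementation, the covariates just enter the DiD regression linearly; a minimal sketch with hypothetical variable names (more careful approaches, such as doubly robust DiD or the covariate options in the did package, condition the counterfactual trend directly):

# Covariate-adjusted DiD -- assumes age and income enter linearly
# and are themselves unaffected by treatment (hypothetical names)
lm(y ~ treat * post + age + income, data = df)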

From Violations to Diagnostics

We can’t directly observe the parallel universe. But we can check whether the universes were aligned before the split — and event studies are our diagnostic tool.

From Violations to Diagnostics


Checking the Alignment

Event Studies and Validation Checks

Learning from History

Can You Trust Your Window?

The Fundamental Challenge

Parallel trends is untestable — we never observe the treated counterfactual.

But we can examine pre-treatment trends as a diagnostic.

The Logic:

  • If groups had parallel trends before treatment, they might after too
  • If pre-trends differ, we should be very worried
  • Event studies also show dynamic treatment effects over time

Learning from Semmelweis

Event studies exist because we’ve learned from cases like Semmelweis. We need to preemptively address the parallel trends objection — not wait for critics to raise it after the fact.

The Alignment Check: Event Study Regression

\[ \begin{aligned} Y_{it} = \gamma_i + \lambda_t &+ \sum_{\tau=-q}^{-2} \mu_{\tau} (D_i \times \mathbf{1}\{\tau_t = \tau\}) + \sum_{\tau=0}^{m} \delta_{\tau} (D_i \times \mathbf{1}\{\tau_t = \tau\}) + \varepsilon_{it} \end{aligned} \]

Components:

  • \(\gamma_i\): Unit fixed effects
  • \(\lambda_t\): Time fixed effects
  • \(\mu_{\tau}\): Pre-treatment coefficients (should be ~0)
  • \(\delta_{\tau}\): Post-treatment effects (our estimates)

Key points:

  • Omit one period (usually \(\tau = -1\)) as reference
  • Pre-period \(\mu\)’s test for differential pre-trends
  • Post-period \(\delta\)’s trace dynamic treatment effects

Important Caveat

\(\mu_{\tau} = 0\) does NOT prove parallel trends holds — only that we can’t detect a violation.
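In practice this regression is typically estimated with fixest; a minimal sketch, assuming a panel data frame panel_df with a rel_time variable counting periods relative to treatment (all names hypothetical):

library(fixest)

# Event study: unit and year FEs, treatment-group dummies by relative period
# ref = -1 omits the period just before treatment as the reference
es <- feols(
  y ~ i(rel_time, treat, ref = -1) | id + year,
  data = panel_df,
  cluster = ~id
)
iplot(es)  # plots the mu (pre) and delta (post) coefficients with 95% CIs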

Reading the Alignment Scan

set.seed(123)
event_time <- -4:4
coeff <- c(-0.02, 0.05, -0.03, 0, -0.15, -0.28, -0.35, -0.38, -0.42)
se <- c(0.06, 0.05, 0.04, NA, 0.06, 0.07, 0.08, 0.09, 0.10)

event_data <- tibble(
  period = event_time,
  estimate = coeff,
  se = se,
  lower = coeff - 1.96 * se,
  upper = coeff + 1.96 * se,
  phase = ifelse(period < 0, "Pre-Treatment", "Post-Treatment")
)

ggplot(event_data, aes(x = period, y = estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = slate_gray) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = slate_gray) +
  geom_errorbar(data = filter(event_data, period != -1),
                aes(ymin = lower, ymax = upper, color = phase), width = 0.2, linewidth = 1) +
  geom_point(data = filter(event_data, period != -1),
             aes(color = phase), size = 4) +
  geom_point(data = filter(event_data, period == -1),
             shape = 1, size = 4, color = slate_gray, stroke = 1.5) +
  scale_color_manual(values = c(secondary_teal, accent_coral), name = "") +
  annotate("rect", xmin = -4.5, xmax = -0.5, ymin = -0.6, ymax = 0.3,
           fill = secondary_teal, alpha = 0.08) +
  annotate("rect", xmin = -0.5, xmax = 4.5, ymin = -0.6, ymax = 0.3,
           fill = accent_coral, alpha = 0.08) +
  annotate("text", x = -2.5, y = 0.25, label = "Pre-trends\n(should be ~0)",
           color = secondary_teal, size = 4, fontface = "bold") +
  annotate("text", x = 2.5, y = 0.25, label = "Treatment effects\n(our estimates)",
           color = accent_coral, size = 4, fontface = "bold") +
  annotate("text", x = -1, y = -0.15, label = "Reference\nperiod",
           color = slate_gray, size = 3.5) +
  labs(
    title = "Anatomy of an Event Study",
    subtitle = "Period -1 is the reference (normalized to zero)",
    x = "Event Time (Periods Relative to Treatment)",
    y = "Coefficient Estimate"
  ) +
  theme_health_econ(base_size = 18) +
  theme(legend.position = "none")

Figure 10: Interpreting event study coefficients

Graduate Students: Pre-Testing Pitfalls

Roth (2022) shows that conditioning on passing a pre-trends test can distort inference. Pre-trends tests have low power against violations that would meaningfully bias the DiD estimate. The absence of evidence is not evidence of absence.

From Diagnostics to Modern DiD


When the Split Gets Complicated

Staggered Adoption and New Estimators

When Everyone Enters Their Universe at Different Times

The Modern DiD Challenge

Many studies have staggered adoption: different units receive treatment at different times.

The standard two-way fixed effects (TWFE) estimator can be severely biased in this setting.

The Standard Approach:

\[Y_{it} = \delta D_{it} + \alpha_i + \gamma_t + \varepsilon_{it}\]

  • \(\alpha_i\): Unit fixed effects
  • \(\gamma_t\): Time fixed effects
  • \(D_{it}\): Treatment indicator

The Problem:

  • TWFE is a weighted average of many 2×2 comparisons
  • Some comparisons use already-treated units as controls
  • Some weights can be negative!

TWFE Uses Already-Treated Units as Controls

Goodman-Bacon (2021) showed that the TWFE estimate decomposes into a weighted average of 2×2 comparisons of three types — treated vs. never-treated, early vs. later-treated, and later vs. already-treated:

Table 3

| Comparison | Control Group | Problem |
|---|---|---|
| Early vs. Never Treated | Never treated | Clean |
| Late vs. Never Treated | Never treated | Clean |
| Early vs. Late (Not-Yet-Treated) | Late treated (pre-treatment period) | Clean |
| Late vs. Early (Already Treated) | Early treated (post-treatment period) | PROBLEMATIC: treated units used as controls |

Key insight: When early-treated units serve as “controls” for late-treated units, and treatment effects grow over time, TWFE can even get the wrong sign.

Graduate Students: Goodman-Bacon Decomposition

The TWFE coefficient \(\hat{\delta}\) is a weighted average: \(\hat{\delta} = \sum_{k} w_k \hat{\delta}_k\) where weights \(w_k\) depend on group sizes and timing variance. Negative weights arise when already-treated groups serve as controls, which can reverse the sign of the estimate.
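The bacondecomp package implements this decomposition; a minimal sketch with hypothetical variable names:

library(bacondecomp)

# Decompose the TWFE DiD estimate into its constituent 2x2 comparisons
decomp <- bacon(
  outcome ~ treated,   # outcome and binary treatment indicator
  data = panel_df,
  id_var = "state",
  time_var = "year"
)
decomp  # shows each comparison type, its estimate, and its weight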

Modern Estimators Fix the Staggered Problem

Table 4

| Method | Key Innovation | Best For |
|---|---|---|
| Callaway & Sant'Anna (2021) | Estimates group-time effects ATT(g, t) separately, then aggregates | Heterogeneous effects across cohorts |
| Sun & Abraham (2021) | Interaction-weighted estimator for event studies | Dynamic event study designs |
| De Chaisemartin & d'Haultfoeuille (2020) | Identifies conditions under which TWFE fails; proposes an alternative | Diagnosing TWFE problems |

Practical Advice

With staggered adoption: (1) Use the Goodman-Bacon decomposition to check for problematic comparisons, (2) Apply modern estimators as robustness checks, (3) Be transparent about which approach you use.

Advanced Upgrades: Inference and Sensitivity

Clustered Standard Errors

Problem: Serial correlation within units inflates t-statistics

Solution: Cluster at the level of treatment assignment (Bertrand, Duflo & Mullainathan, 2004)

Honest DiD

Problem: How robust are results to parallel trends violations?

Solution: Rambachan & Roth (2023) bound the ATT under assumptions about maximum pre-trend deviations

Graduate Students: Wild Cluster Bootstrap

When the number of clusters is small (< 50), cluster-robust SEs can be severely size-distorted. Cameron, Gelbach & Miller (2008) propose the wild cluster bootstrap. In R: fwildclusterboot::boottest().
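A minimal sketch, assuming a 2×2 DiD with state-level treatment assignment (data and variable names hypothetical):

library(fixest)
library(fwildclusterboot)

m <- feols(y ~ treat * post | unit + time, data = df, cluster = ~state)

# Wild cluster bootstrap inference for the DiD interaction
boottest(m, param = "treat:post", clustid = "state", B = 9999)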

Advanced Upgrades: Triple Differences

Triple Differences (DDD)

Problem: Parallel trends is violated but affects both treatment and placebo groups similarly

Solution: Add a third difference to net out the bias:

\[\delta^{DDD} = \delta^{DiD}_{main} - \delta^{DiD}_{placebo}\]

Example: Compare age-eligible vs. age-ineligible within treatment vs. control states

From Theory to Evidence

We have the toolkit — the bridge, the rules, the alignment check. Now let’s see them work on real health policy questions.

From Methods to Application


From Methods to Practice

Applying DiD to Real Policy Evidence

Miller et al. (2019)

Miller et al. (2019): Discussion Questions

Skim the paper and think about the following:

  1. What is the main research question?
  2. What is the treatment and what is the comparison group?
  3. How do they establish that the “bite” of the policy was real?
  4. What falsification test do they run and why?
  5. What are the main results and mechanisms?

Context

Under the Affordable Care Act, some states expanded Medicaid while others did not. This creates variation in treatment timing across states.

The Setting: Did Medicaid Expansion Save Lives?

The Policy:

  • ACA allowed states to expand Medicaid to adults up to 138% of poverty
  • Some states expanded starting in 2014; others did not
  • Focus: near-elderly adults (ages 55-64)

The Question:

Does Medicaid expansion reduce mortality?

The Challenge:

States that expanded may differ systematically from those that didn’t

Medicaid eligibility changes

First Check: Did the Policy Actually Change Anything?

Medicaid coverage increased

Uninsured rate fell

The “bite”: Expansion states saw large increases in Medicaid enrollment and large decreases in uninsurance among the target population.

The Falsification Test: Looking Through the Wrong Window

The Logic:

  • Near-elderly (55-64) are the treatment group
  • Elderly (65+) are already on Medicare
  • If Medicaid expansion affects elderly mortality, something is wrong

Why This Works:

  • Elderly are exposed to same state-level shocks
  • But Medicaid expansion is irrelevant to them
  • Finding an effect would suggest confounding

No effect on elderly mortality

The Verdict: Medicaid Expansion Saved Lives

Event study: Mortality declines after expansion

Key Results:

  • 9.4% reduction in annual mortality among near-elderly
  • 0.13 percentage point decline in mortality rate
  • Roughly 1 death prevented per 239–316 newly insured adults
  • Effect driven by disease-amenable causes of death

Event Study Features:

  • Pre-trends are flat (parallel trends supported)
  • Effect grows over time (consistent with health insurance improving health)

Assembling the Evidence Package

Table 5

| Evidence Type | Finding | Supports... |
|---|---|---|
| Bite | Large increase in Medicaid enrollment; large decrease in uninsurance | Policy actually changed insurance coverage |
| Parallel Pre-Trends | Event study shows no differential pre-trends | Parallel trends assumption is plausible |
| Falsification | No effect on elderly (already covered by Medicare) | Results not driven by state-level confounders |
| Main Result | 9.4% reduction in mortality among near-elderly | Medicaid expansion saves lives |
| Mechanism | Effect driven by disease-amenable causes of death | Health insurance improves health outcomes |

From First to Second Application


A Second Application

Extending the Framework to New Settings

Gaynor et al. (2021)

Gaynor et al. (2021): Discussion Questions

Skim the paper and think about the following:

  1. What is the main research question about hospital mergers?
  2. What makes this study’s data unique (hint: surveys)?
  3. How do they establish parallel trends in their event study?
  4. What is the “new toy” effect and why might it matter?
  5. Do the promised efficiencies of the merger materialize?

Context

Hospital mergers promise efficiency gains and quality improvements. This study examines whether those promises are kept.

The Question: Do Hospital Mergers Deliver on Their Promises?

Paper: The Anatomy of a Hospital System Merger: The Patient Did Not Respond Well to Treatment

Context:

  • Hospital mergers are increasingly common
  • Promised benefits: efficiency gains, quality improvements
  • Concerns: market power, higher prices, reduced quality

Innovation:

  • Combines administrative data with survey of hospital leadership
  • Opens the “black box” of management practices
  • Can compare stated goals vs. actual outcomes

Research Questions:

  1. Do promised efficiencies materialize?
  2. What happens to staffing and quality?
  3. How do target hospitals perform post-merger?

The Evidence: Mergers Hurt More Than They Help

Key Findings:

  • Panel C: Physician exit rates jump immediately post-merger
  • Parallel pre-trends, then divergence
  • Target hospitals show negative profits by year 6

The “New Toy” Effect (Schoar, 2002):

  • Resources shift to acquired hospitals
  • Incumbent divisions suffer
  • Net effect may be negative

Event study of merger effects

Two Studies, One Toolkit

Miller et al. (2019)

Strengths:

  • Clear policy variation
  • Strong falsification test
  • Plausible mechanism

DiD Design Elements:

  • Treatment: State Medicaid expansion
  • Control: Non-expansion states
  • Pre-trends: Event study shows parallel

Gaynor et al. (2021)

Strengths:

  • Rich data on mechanisms
  • Management survey adds insight
  • Multiple outcome measures

DiD Design Elements:

  • Treatment: Hospital acquisition
  • Control: Non-acquired hospitals
  • Pre-trends: Parallel in most outcomes

From Evidence to Synthesis


Key Takeaways

What to Carry into Applied Work

The Regression Table Is the Claim — Evidence Is the Smoking Gun

The regression coefficient is just the headline. Convincing DiD requires an evidence package — a collection of tests that together build the causal case.

Think of it like building a case in court: the main result is your argument, but the jury needs corroborating evidence.

Components of the Evidence Package:

  • Bite: Did the policy actually shift treatment uptake?
  • Pre-trends: Event study with flat pre-period
  • Falsification: No effect on groups that shouldn’t be affected
  • Mechanisms: Does the story make biological/economic sense?
  • Robustness: Results survive alternative specifications

The Complete Evidence Package: What to Build

Core Requirements

  1. Parallel Trends: Would groups have evolved similarly?
  2. No Anticipation: Is the pre-period clean?
  3. SUTVA: No spillovers or interference?

Evidence to Show

  • Bite: Did the policy actually change something?
  • Pre-trends: Event study with flat pre-period
  • Falsification: No effect where none expected
  • Mechanisms: Does the story make sense?

The Complete Evidence Package: What to Watch For

Questions to Ask

  1. Why did some units get treated and others didn’t?
  2. What else changed at the same time?
  3. Who is the comparison group, really?
  4. Is there a good falsification test?
  5. How sensitive are results to specification?

Red Flags

  • Diverging pre-trends
  • Policy enacted in response to trends
  • Compositional changes over time
  • Spillovers across groups

Where Parallel Universes Work Best

| Setting | Treatment | Control | Key Concern |
|---|---|---|---|
| State policy adoption | Adopting states | Non-adopting states | Policy endogeneity |
| Hospital mergers | Acquired hospitals | Non-acquired hospitals | Selection into acquisition |
| Insurance expansions | Newly eligible | Ineligible | Compositional change |
| Payment reforms | Affected providers | Unaffected providers | Spillovers to controls |
| Geographic variation | Exposed areas | Unexposed areas | Local shocks |

The Parallel Universe Principle

The Core Insight

“You can never visit the universe where the policy didn’t happen — but if you find a group that was traveling the same path, the gap between their journeys reveals the causal effect.”

DiD works because:

  • Policy creates a fork between treated and untreated universes
  • The control group’s journey approximates the road not taken
  • The gap at the fork is causal — as long as the universes were truly aligned

Looking Ahead

RDD exploited a spatial edge — comparing units on either side of a cutoff. DiD exploits a temporal fork — comparing trajectories before and after treatment.

But what if no single control group follows the same path?

Next: Synthetic Control Methods

When you can’t find a parallel universe, maybe you can build one — constructing a custom counterfactual from a weighted portfolio of control units.

From the parallel universe to the synthetic one — our causal toolkit keeps growing.

Discussion Questions

  1. Design thinking: You want to study whether hospital price transparency laws reduce healthcare spending. What would be your treatment and control groups? What threats to parallel trends worry you most?

  2. Falsification: In the Miller et al. study, why is the elderly population a good falsification test? Can you think of another falsification test they could have used?

  3. Staggered adoption: Many health policies are adopted by states at different times. When does this help identification (more variation) vs. hurt it (TWFE problems)?

  4. Anticipation: A state announces a hospital price transparency law 18 months before implementation. You plan a DiD study using the implementation date. What could go wrong? How would you adjust your design?

From Takeaways to Appendix


Appendix: Technical Details

The Honest DiD Framework

Rambachan & Roth (2023): What if parallel trends is slightly violated?

The Approach:

  1. Measure maximum deviation in pre-treatment coefficients: \(\bar{M}\)
  2. Assume post-treatment violations are bounded: \(|\text{bias}| \leq \bar{M}\)
  3. Construct confidence intervals that account for potential bias

Interpretation:

“If post-treatment trend violations are no worse than what we observed pre-treatment, the ATT lies in this interval.”

Provides a sensitivity analysis for the parallel trends assumption.

R Package

The HonestDiD package implements these methods. Increasingly expected in top journals.
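A sketch of the workflow, assuming betahat and sigma are the event-study coefficient vector and variance-covariance matrix from your estimation, here with 4 pre- and 5 post-periods (all inputs hypothetical):

library(HonestDiD)

# Bound post-period violations at Mbar times the largest pre-period violation
sens <- createSensitivityResults_relativeMagnitudes(
  betahat = betahat,
  sigma = sigma,
  numPrePeriods = 4,
  numPostPeriods = 5,
  Mbarvec = seq(0, 2, by = 0.5)
)
sens  # robust confidence sets for the ATT at each Mbar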

Triple Differences: Formal Setup

When to use: Parallel trends is violated, but you have an additional comparison dimension.

\[Y_{igt} = \alpha + \beta_1 \text{Treat}_g + \beta_2 \text{Post}_t + \beta_3 \text{Eligible}_i + \text{(two-way interactions)} + \delta^{DDD} (\text{Treat}_g \times \text{Post}_t \times \text{Eligible}_i) + \varepsilon_{igt}\]

Example: Miller et al. could have done:

  • \(g\): Expansion vs. non-expansion states
  • \(t\): Pre vs. post-ACA
  • \(i\): Near-elderly (eligible) vs. elderly (ineligible)

The DDD coefficient nets out any state-time trends common to both age groups.
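A hedged fixest sketch of this regression (variable names hypothetical; the * operator expands to all main effects and two-way interactions automatically):

library(fixest)

# Triple differences: the three-way interaction coefficient is delta^DDD
ddd <- feols(y ~ treat * post * eligible, data = df, cluster = ~state)
coef(ddd)["treat:post:eligible"]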

Callaway & Sant’Anna (2021): Group-Time ATT

The Innovation: Instead of one pooled estimate, estimate ATT for each cohort at each time:

\[ATT(g, t) = E[Y_t(g) - Y_t(\infty) | G = g]\]

where \(g\) is the treatment timing group and \(Y_t(\infty)\) is the never-treated potential outcome.

Aggregation Options:

  1. Simple average across \((g,t)\)
  2. Event-study aggregation (by time since treatment)
  3. Cohort-specific effects (by treatment timing)

R Implementation:

library(did)

# Estimate group-time ATTs: one ATT(g, t) per treatment cohort g and period t
att_results <- att_gt(
  yname = "outcome",      # outcome variable
  tname = "year",         # time period
  idname = "id",          # unit identifier
  gname = "first_treat",  # first period treated (0 for never-treated units)
  data = mydata
)
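The group-time estimates are usually aggregated afterward; a short sketch using the package's aggte() helper (the dynamic option produces an event-study-style summary):

# Aggregate ATT(g, t) by time since treatment (event-study form)
es <- aggte(att_results, type = "dynamic")
summary(es)
ggdid(es)  # plot the aggregated event-study coefficients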

Inference in DiD: Clustered Standard Errors

Bertrand, Duflo & Mullainathan (2004): Standard errors in DiD are often way too small.

The Problem:

  • Outcomes are serially correlated within units
  • Standard errors don’t account for this
  • Type I error rates can be 3-4x the nominal level!

The Solution:

Cluster standard errors at the level of treatment assignment (usually state or firm).

# In R with fixest
feols(Y ~ treat*post | unit + time, data = df, cluster = "state")

Rule of Thumb

You need at least 30-50 clusters for cluster-robust inference to work well. With fewer clusters, use wild cluster bootstrap.