The Art of Comparison

Matching, Weighting, and the Quest for Valid Counterfactuals

Jonathan Seward

Spring 2026

The Comparison Problem

Why Naive Comparisons Fail

The Fundamental Challenge

The Core Question

How do we know what would have happened if someone hadn’t received treatment?

We never observe the same person both treated and untreated at the same time.

Our solution: Find someone similar enough who wasn’t treated and use their outcome as the counterfactual.

But “similar enough” is doing a lot of heavy lifting here…

When Comparisons Go Wrong

Consider a study of hospital quality:

Show R Code
hospital_summary <- hospital_data %>%
  group_by(high_spending) %>%
  summarize(
    mean_outcome = mean(outcome_score),
    se = sd(outcome_score) / sqrt(n()),
    mean_severity = mean(severity),
    .groups = "drop"
  ) %>%
  mutate(hospital_type = ifelse(high_spending == 1, "High Spending", "Lower Spending"))

ggplot(hospital_summary, aes(x = hospital_type, y = mean_outcome, fill = hospital_type)) +
  geom_col(width = 0.6) +
  geom_errorbar(aes(ymin = mean_outcome - 1.96*se, ymax = mean_outcome + 1.96*se),
                width = 0.2, linewidth = 1) +
  geom_text(aes(label = round(mean_outcome, 1)), vjust = -0.5, size = 6,
            fontface = "bold", color = primary_blue) +
  scale_fill_manual(values = c(slate_gray, accent_coral), guide = "none") +
  labs(
    title = "Average Patient Outcomes by Spending Group",
    subtitle = "But wait - who goes to high-spending hospitals?",
    x = "", y = "Patient Health Score"
  ) +
  theme_health_econ(base_size = 18) +
  coord_cartesian(ylim = c(180, 260))

Figure 1: Naive comparison suggests high-spending hospitals are worse!

The Hidden Story: Selection

Show R Code
p1 <- ggplot(hospital_data, aes(x = severity, fill = severity_group)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = c(secondary_teal, accent_coral),
                    name = "Severity Group") +
  labs(title = "Patient Severity Distribution",
       x = "Illness Severity", y = "Density") +
  theme_health_econ(base_size = 16) +
  theme(legend.position = "bottom")

p2 <- ggplot(hospital_data, aes(x = spending, y = outcome_score, color = severity_group)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
  scale_color_manual(values = c(secondary_teal, warm_gold, accent_coral),
                     name = "Severity Group") +
  labs(title = "Outcomes by Spending",
       subtitle = "Within severity levels, higher spending helps!",
       x = "Hospital Spending", y = "Health Outcome") +
  theme_health_econ(base_size = 16) +
  theme(legend.position = "bottom")

p1 + p2

Figure 2: High-spending hospitals treat sicker patients

The Selection Bias Formula

The naive difference in outcomes can be decomposed:

\[\underbrace{E[Y|D=1] - E[Y|D=0]}_{\text{What we observe}} = \underbrace{E[Y^1 - Y^0|D=1]}_{\text{ATT (what we want)}} + \underbrace{E[Y^0|D=1] - E[Y^0|D=0]}_{\text{Selection Bias}}\]

Selection bias arises when treated and untreated groups differ in their baseline potential outcomes.

In our hospital example: sicker patients end up in higher-spending hospitals, so \(E[Y^0|D=1] < E[Y^0|D=0]\).
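A quick simulation makes the decomposition concrete. The sketch below uses made-up potential outcomes (not the hospital data): sicker patients select into treatment, the true effect is +10, and the naive difference equals ATT plus a negative selection bias term.

# Minimal simulation of the decomposition (hypothetical data)
set.seed(1)
n <- 1e5
severity <- rnorm(n)
d <- as.integer(severity + rnorm(n) > 0)       # sicker patients more likely treated
y0 <- 200 - 20 * severity + rnorm(n, 0, 5)     # baseline outcome falls with severity
y1 <- y0 + 10                                  # constant treatment effect of +10
y <- ifelse(d == 1, y1, y0)                    # observed outcome

naive     <- mean(y[d == 1]) - mean(y[d == 0])    # E[Y|D=1] - E[Y|D=0]
att       <- mean(y1[d == 1] - y0[d == 1])        # = 10 by construction
selection <- mean(y0[d == 1]) - mean(y0[d == 0])  # negative: treated are sicker
c(naive = naive, decomposed = att + selection)    # identical, by the identity above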

Graduate Students: Full Decomposition

The simple difference in outcomes (SDO) can also be written as \(SDO = ATT + [E[Y^0|D=1] - E[Y^0|D=0]]\) or, equivalently, \(SDO = ATU + [E[Y^1|D=1] - E[Y^1|D=0]]\). The second term in each case captures selection on levels (baseline differences), while treatment effect heterogeneity appears in the difference between ATT and ATE.

From Comparison to Unconfoundedness

Figure 3

The Unconfoundedness Solution

Designing Valid Counterfactuals

Design Trumps Analysis

The Golden Rule

The quality of your causal estimate depends more on your design than your statistical sophistication.

Good Design

  • Thoughtful selection of comparison group
  • Identifying relevant confounders before analysis
  • Understanding the assignment mechanism

Risky Practice

  • Running many models until one “works”
  • Adding controls post-hoc
  • Ignoring institutional details

The Unconfoundedness Assumption

Core Idea

If we observe all the variables that affect both treatment and outcome, we can eliminate selection bias by comparing “like with like.”

Formally: \[\left(Y^1, Y^0\right) \perp\!\!\!\perp D \mid X\]

(the \(\perp\!\!\!\perp\) symbol means “is independent of”)

In words: Conditional on covariates \(X\), treatment assignment is as good as random.

Other names for this assumption:

  • Selection on observables
  • Conditional independence assumption (CIA)
  • Ignorability

Two Key Assumptions

Table 1
Assumption Formal Statement Plain English Testable
Unconfoundedness \((Y^1, Y^0) \perp\!\!\!\perp D \mid X\) No unmeasured confounders No
Common Support \(0 < P(D=1 \mid X) < 1\) Overlap in covariate distributions Yes

The uncomfortable truth: Unconfoundedness is never testable. We must argue for it based on theory and institutional knowledge.

Graduate Students: Strong vs. Weak Unconfoundedness

Strong: \((Y^1, Y^0) \perp\!\!\!\perp D \mid X\) - needed for ATE

Weak: \(Y^0 \perp\!\!\!\perp D \mid X\) - sufficient for ATT

Weak unconfoundedness only requires that we can predict the untreated potential outcome; we don’t need to predict treated outcomes for controls.

Common Support: The Overlap Requirement

Show R Code
# Generate propensity score-like data
set.seed(123)
n <- 500
ps_data <- tibble(
  x = c(rnorm(n/2, 0.3, 0.15), rnorm(n/2, 0.7, 0.15)),
  treated = rep(c(0, 1), each = n/2)
)

ggplot(ps_data, aes(x = x, fill = factor(treated))) +
  geom_density(alpha = 0.6) +
  geom_vline(xintercept = c(0.15, 0.85), linetype = "dashed", color = slate_gray) +
  annotate("rect", xmin = 0.15, xmax = 0.85, ymin = 0, ymax = Inf,
           fill = secondary_teal, alpha = 0.1) +
  annotate("text", x = 0.5, y = 3.5, label = "Common Support Region",
           color = secondary_teal, fontface = "bold", size = 5) +
  annotate("text", x = 0.05, y = 2, label = "No treated\nunits here",
           color = accent_coral, size = 4) +
  annotate("text", x = 0.95, y = 2, label = "No control\nunits here",
           color = accent_coral, size = 4) +
  scale_fill_manual(values = c(slate_gray, warm_gold),
                    labels = c("Control", "Treated"),
                    name = "") +
  labs(title = "The Common Support Problem",
       subtitle = "Causal inference requires overlap in covariate distributions",
       x = "Propensity Score (or Covariate Value)", y = "Density") +
  theme_health_econ(base_size = 18) +
  theme(legend.position = "bottom")

Figure 4: We can only make comparisons where both groups exist

From Assumptions to Matching

Figure 5

The Matching Toolbox

Methods for Like-with-Like Comparison

Strategy 1: Subclassification

The Idea

Divide the sample into groups (strata) based on confounders, then compare treated vs. untreated within each stratum.

Steps:

  1. Create strata based on covariate values
  2. Calculate treatment effect within each stratum
  3. Aggregate using appropriate weights

Advantage: Simple, transparent, and forces you to think about comparisons.

Limitation: Curse of dimensionality - with many covariates, strata become sparse.

The Titanic Example: Setup

Show R Code
# Load Titanic data
read_data <- function(df) {
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/", df, sep = "")
  read_dta(full_path)
}

titanic <- read_data("titanic.dta") %>%
  mutate(
    d = case_when(class == 1 ~ 1, TRUE ~ 0),
    sex_label = ifelse(sex == 0, "Female", "Male"),
    age_label = ifelse(age == 1, "Adult", "Child")
  )

# Simple difference in outcomes
ey1 <- titanic %>% filter(d == 1) %>% pull(survived) %>% mean()
ey0 <- titanic %>% filter(d == 0) %>% pull(survived) %>% mean()
sdo <- ey1 - ey0

Research question: Did first-class passengers have higher survival rates?

Table 2
Group Survival Rate
First Class 62.5%
Other Classes 27.1%
Simple Difference 35.4%

But wait - who was in first class?

The Titanic Example: Confounders

Show R Code
titanic_summary <- titanic %>%
  group_by(d, sex_label, age_label) %>%
  summarize(n = n(), survival = mean(survived), .groups = "drop") %>%
  mutate(class_label = ifelse(d == 1, "First Class", "Other Classes"))

p1 <- titanic %>%
  group_by(d) %>%
  summarize(pct_female = mean(sex_label == "Female"), .groups = "drop") %>%
  mutate(class_label = ifelse(d == 1, "First Class", "Other Classes")) %>%
  ggplot(aes(x = class_label, y = pct_female, fill = class_label)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = scales::percent(pct_female, accuracy = 1)),
            vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c(warm_gold, slate_gray)) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.8), expand = expansion(mult = c(0, 0.08))) +
  labs(title = "% Female by Class", x = "", y = "") +
  theme_health_econ(base_size = 14)

p2 <- titanic %>%
  group_by(d) %>%
  summarize(pct_adult = mean(age_label == "Adult"), .groups = "drop") %>%
  mutate(class_label = ifelse(d == 1, "First Class", "Other Classes")) %>%
  ggplot(aes(x = class_label, y = pct_adult, fill = class_label)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = scales::percent(pct_adult, accuracy = 1)),
            vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c(warm_gold, slate_gray)) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1), expand = expansion(mult = c(0, 0.08))) +
  labs(title = "% Adult by Class", x = "", y = "") +
  theme_health_econ(base_size = 14)

p3 <- titanic %>%
  group_by(sex_label, age_label) %>%
  summarize(survival = mean(survived), .groups = "drop") %>%
  ggplot(aes(x = interaction(sex_label, age_label), y = survival,
             fill = sex_label)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = scales::percent(survival, accuracy = 1)),
            vjust = -0.5, size = 4, fontface = "bold") +
  scale_fill_manual(values = c(accent_coral, secondary_teal), name = "") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1), expand = expansion(mult = c(0, 0.08))) +
  labs(title = "Survival by Sex & Age", x = "", y = "") +
  theme_health_econ(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p1 + p2 + p3

Figure 6: First class carried proportionally more women - the group with far higher survival rates

Subclassification in Action

Show R Code
# Create strata (sex: 0 = female, 1 = male; age: 1 = adult, 0 = child)
titanic <- titanic %>%
  mutate(stratum = case_when(
    sex == 0 & age == 1 ~ "Female Adult",
    sex == 0 & age == 0 ~ "Female Child",
    sex == 1 & age == 1 ~ "Male Adult",
    sex == 1 & age == 0 ~ "Male Child"
  ))

# Calculate within-stratum effects
stratum_effects <- titanic %>%
  group_by(stratum, d) %>%
  summarize(survival = mean(survived), n = n(), .groups = "drop") %>%
  pivot_wider(names_from = d, values_from = c(survival, n), names_prefix = "d") %>%
  mutate(
    effect = survival_d1 - survival_d0,
    weight = (n_d0 + n_d1) / sum(n_d0 + n_d1)
  )

# Weighted average treatment effect
wate <- sum(stratum_effects$effect * stratum_effects$weight)
Table 3
Stratum Control Survival Treated Survival Effect Weight
Female Adult 62.6% 97.2% 34.6% 19.3%
Female Child 61.4% 100.0% 38.6% 2.0%
Male Adult 18.8% 32.6% 13.7% 75.7%
Male Child 40.7% 100.0% 59.3% 2.9%

Weighted Average Treatment Effect: 19.6% (vs. naive SDO of 35.4%)

First class still helped survival, but less than the naive comparison suggested.

Strategy 2: Exact Matching

The Idea

For each treated unit, find a control unit with identical covariate values.

Identical to subclassification when feasible, but:

  • Fails with continuous covariates
  • Fails when strata are sparse (curse of dimensionality)
  • Many treated units may go unmatched

When it works: Binary/categorical confounders with sufficient sample size in each cell.
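As a concrete sketch, exact matching is a single call in MatchIt (the data frame df, treatment treat, and covariate names here are hypothetical):

# Exact matching with MatchIt (hypothetical df with categorical covariates)
library(MatchIt)
m_exact <- matchit(treat ~ sex + age_group + region,
                   data = df, method = "exact")
summary(m_exact)           # how many treated units found an exact match
md <- match.data(m_exact)  # matched sample with stratum weights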

Strategy 3: Nearest Neighbor Matching

When exact matching fails, find the closest match instead.

Distance Metrics:

Table 4
Metric Formula When to Use
Euclidean \(\sqrt{\sum_k (X_{ik} - X_{jk})^2}\) Similar scales
Normalized Euclidean \(\sqrt{\sum_k \left(\frac{X_{ik} - X_{jk}}{\sigma_k}\right)^2}\) Different scales
Mahalanobis \(\sqrt{(X_i - X_j)' S^{-1} (X_i - X_j)}\) Correlated covariates

Warning

Nearest neighbor matching introduces matching bias when matches aren’t exact. Bias correction methods (Abadie & Imbens, 2006) can help.
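A minimal MatchIt sketch of 1:1 nearest neighbor matching on Mahalanobis distance (data frame and variable names hypothetical):

# 1:1 nearest neighbor matching on Mahalanobis distance (hypothetical df)
library(MatchIt)
m_nn <- matchit(treat ~ age + income + severity,
                data = df,
                method = "nearest",
                distance = "mahalanobis",
                ratio = 1)
summary(m_nn)  # inspect balance and the number of matched pairs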

Strategy 4: Coarsened Exact Matching (CEM)

The Clever Compromise

Bin continuous variables into categories, then do exact matching on the bins.

Steps:

  1. Coarsen: Convert continuous variables to bins (e.g., age -> age groups)
  2. Match: Exact match on coarsened values
  3. Prune: Drop strata without both treated and control units
  4. Weight: Compute stratum weights for remaining observations

Key insight: You control the bias-variance tradeoff by choosing bin widths.

  • Wider bins -> more matched units (lower variance) but more within-bin heterogeneity (higher bias)
  • Narrower bins -> closer matches (lower bias) but fewer matched units (higher variance)
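MatchIt also implements CEM. A minimal sketch, with illustrative cutpoints on a hypothetical data frame:

# Coarsened exact matching (hypothetical df; cutpoints chosen for illustration)
library(MatchIt)
m_cem <- matchit(treat ~ age + income,
                 data = df,
                 method = "cem",
                 cutpoints = list(age = c(35, 45, 55, 65)))
summary(m_cem)  # note how pruning redefines the estimand to the matched sample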

Visualizing CEM

Show R Code
set.seed(456)
n <- 100
cem_data <- tibble(
  age = c(runif(n/2, 25, 65), runif(n/2, 30, 70)),
  income = c(rnorm(n/2, 50, 15), rnorm(n/2, 60, 15)),
  treated = rep(c(0, 1), each = n/2)
) %>%
  mutate(
    age_bin = cut(age, breaks = c(25, 35, 45, 55, 65, 75), include.lowest = TRUE),
    income_bin = cut(income, breaks = c(0, 40, 55, 70, 100, 140), include.lowest = TRUE)
  )

p1 <- ggplot(cem_data, aes(x = age, y = income, color = factor(treated))) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(values = c(slate_gray, warm_gold),
                     labels = c("Control", "Treated"), name = "") +
  labs(title = "Before: Continuous Space", x = "Age", y = "Income ($K)") +
  theme_health_econ(base_size = 14) +
  theme(legend.position = "bottom")

p2 <- ggplot(cem_data, aes(x = age, y = income, color = factor(treated))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_vline(xintercept = c(35, 45, 55, 65), linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = c(40, 55, 70), linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c(slate_gray, warm_gold),
                     labels = c("Control", "Treated"), name = "") +
  labs(title = "After: Coarsened Bins", x = "Age", y = "Income ($K)") +
  theme_health_econ(base_size = 14) +
  theme(legend.position = "bottom")

p1 + p2

Figure 7: CEM bins continuous variables then matches exactly within bins

From Matching to Propensity Scores

Figure 8

The Propensity Score Revolution

Balancing Many Covariates with One Score

The Dimensionality Problem

With \(K\) covariates, matching becomes exponentially harder:

  • 2 binary covariates -> 4 strata
  • 5 binary covariates -> 32 strata
  • 10 binary covariates -> 1,024 strata

Rosenbaum & Rubin’s insight (1983): We don’t need to match on all covariates separately. We can collapse them into a single number.

The Propensity Score

Definition

The propensity score is the probability of receiving treatment given observed covariates:

\[e(X) = \Pr(D = 1 \mid X)\]

Key theorem: If unconfoundedness holds given \(X\), it also holds given \(e(X)\).

\[(Y^1, Y^0) \perp\!\!\!\perp D \mid X \implies (Y^1, Y^0) \perp\!\!\!\perp D \mid e(X)\]

Implication: Instead of matching on many variables, we can match on this single scalar!

Estimating Propensity Scores

Show R Code
# Generate example data with multiple confounders
set.seed(789)
n <- 1000
ps_example <- tibble(
  age = rnorm(n, 45, 12),
  income = rnorm(n, 55, 20),
  education = rbinom(n, 1, 0.4),
  chronic = rbinom(n, 1, 0.3)
) %>%
  mutate(
    # True propensity score model
    ps_true = plogis(-2 + 0.03 * age + 0.02 * income + 0.5 * education + 0.8 * chronic),
    treated = rbinom(n, 1, ps_true)
  )

# Estimate propensity score
ps_model <- glm(treated ~ age + income + education + chronic,
                data = ps_example, family = binomial)
ps_example$ps_hat <- predict(ps_model, type = "response")
Show R Code
ggplot(ps_example, aes(x = ps_hat, fill = factor(treated))) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 40) +
  scale_fill_manual(values = c(slate_gray, warm_gold),
                    labels = c("Control", "Treated"), name = "") +
  labs(
    title = "Distribution of Estimated Propensity Scores",
    subtitle = "Good overlap is essential for valid matching",
    x = "Propensity Score", y = "Count"
  ) +
  theme_health_econ(base_size = 18) +
  theme(legend.position = "bottom")

Figure 9: Propensity scores should overlap between groups

Using Propensity Scores

Two main approaches:

Matching

  • Match treated to control with similar \(\hat{e}(X)\)
  • Can use nearest neighbor, caliper, etc.
  • Discards unmatched units
# With MatchIt
m <- matchit(treat ~ x1 + x2,
             data = df,
             method = "nearest",
             distance = "glm")

Weighting (IPW)

  • Upweight “surprising” observations
  • Treated: \(w = 1/\hat{e}(X)\) (low-propensity treated are informative)
  • Control: \(w = 1/(1-\hat{e}(X))\)
  • Keeps all observations
# Create IPW weights
df$ipw <- ifelse(df$treat == 1,
                 1/df$ps,
                 1/(1-df$ps))
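With the weights in hand, the ATE can be estimated by a weighted difference in means, for example via weighted least squares (a sketch, assuming df also contains an outcome y; pair with robust standard errors in practice):

# IPW estimate of the ATE via weighted least squares (sketch)
fit_ipw <- lm(y ~ treat, data = df, weights = ipw)
coef(fit_ipw)["treat"]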

Graduate Students: ATT vs. ATE Weights

For ATT: Weight controls by \(\hat{e}(X)/(1-\hat{e}(X))\), treated units get weight 1. For ATE: Weight treated by \(1/\hat{e}(X)\), controls by \(1/(1-\hat{e}(X))\). Normalized versions divide by sum of weights to improve stability.
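In code, the ATT weights described above look like this (same hypothetical df, treat, ps, and y as in the snippets above):

# ATT weights: treated units get weight 1; controls are reweighted
# by ps/(1 - ps) to resemble the treated group
df$w_att <- ifelse(df$treat == 1, 1, df$ps / (1 - df$ps))
fit_att <- lm(y ~ treat, data = df, weights = w_att)  # sketch of the ATT estimate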

Checking Balance

After matching/weighting, check that covariates are balanced:

Show R Code
# Perform matching
m_out <- matchit(
  treated ~ age + income + education + chronic,
  data = ps_example,
  method = "nearest",
  distance = "glm",
  caliper = 0.2,
  replace = TRUE,
  estimand = "ATT"
)

# Extract balance statistics (pre and post) with cobalt's bal.tab()
bal_tab <- bal.tab(m_out, un = TRUE)
bal_stats <- bal_tab$Balance
vars <- c("age", "income", "education", "chronic")

balance_wide <- tibble(
  variable = factor(c("Age", "Income", "Education", "Chronic"),
                    levels = rev(c("Age", "Income", "Education", "Chronic"))),
  before = abs(bal_stats[vars, "Diff.Un"]),
  after = abs(bal_stats[vars, "Diff.Adj"])
)

balance_long <- balance_wide %>%
  pivot_longer(cols = c(before, after), names_to = "timing", values_to = "smd") %>%
  mutate(timing = factor(timing, levels = c("before", "after"),
                         labels = c("Before Matching", "After Matching")))

ggplot(balance_long, aes(y = variable)) +
  geom_segment(
    data = balance_wide,
    aes(x = before, xend = after, yend = variable),
    color = slate_gray, linewidth = 1, alpha = 0.6
  ) +
  geom_point(aes(x = smd, color = timing), size = 4) +
  geom_vline(xintercept = 0.1, linetype = "dashed", color = slate_gray) +
  annotate("rect", xmin = 0, xmax = 0.1, ymin = 0.5, ymax = 4.5,
           fill = secondary_teal, alpha = 0.1) +
  scale_color_manual(values = c(accent_coral, secondary_teal), name = "") +
  labs(
    title = "Covariate Balance Before and After Matching",
    subtitle = "Target: standardized mean difference < 0.1",
    x = "Absolute Standardized Mean Difference", y = ""
  ) +
  theme_health_econ(base_size = 18) +
  theme(legend.position = "bottom") +
  xlim(0, 0.4)

Figure 10: Standardized mean differences should be small (< 0.1) after matching

From Scores to Regression

Figure 11

Regression as Matching

What OLS Is Doing Under the Hood

OLS: The Implicit Matching Estimator

The familiar regression:

\[Y_i = \alpha + \delta D_i + \beta X_i + \varepsilon_i\]

Under the hood, OLS is doing a form of covariate adjustment - but with strong assumptions:

  1. Linearity: The outcome is linear in the covariates \(X\)
  2. Constant effects: The treatment effect does not vary with \(X\)
  3. Correct specification: The functional form is right

When OLS Works (and When It Doesn’t)

Show R Code
set.seed(101)
n <- 200
ols_data <- tibble(
  x = c(runif(n/2, 0, 0.4), runif(n/2, 0.6, 1)),
  treated = rep(c(0, 1), each = n/2),
  y = 2 + 3 * x + 2 * treated + 0.5 * x * treated + rnorm(n, 0, 0.3)
)

p1 <- ggplot(ols_data, aes(x = x, y = y, color = factor(treated))) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
  annotate("rect", xmin = 0.4, xmax = 0.6, ymin = -Inf, ymax = Inf,
           fill = accent_coral, alpha = 0.2) +
  annotate("text", x = 0.5, y = 6.5, label = "No overlap!\nOLS extrapolates",
           color = accent_coral, fontface = "bold", size = 4) +
  scale_color_manual(values = c(slate_gray, warm_gold),
                     labels = c("Control", "Treated"), name = "") +
  labs(title = "Poor Overlap: OLS Extrapolates",
       x = "Covariate X", y = "Outcome Y") +
  theme_health_econ(base_size = 14) +
  theme(legend.position = "bottom")

# Good overlap data
ols_data2 <- tibble(
  x = runif(n, 0.2, 0.8),
  treated = rbinom(n, 1, 0.5),
  y = 2 + 3 * x + 2 * treated + rnorm(n, 0, 0.3)
)

p2 <- ggplot(ols_data2, aes(x = x, y = y, color = factor(treated))) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
  scale_color_manual(values = c(slate_gray, warm_gold),
                     labels = c("Control", "Treated"), name = "") +
  labs(title = "Good Overlap: OLS is Reasonable",
       x = "Covariate X", y = "Outcome Y") +
  theme_health_econ(base_size = 14) +
  theme(legend.position = "bottom")

p1 + p2

Figure 12: OLS extrapolates; matching interpolates

Summary: Choosing Your Method

Table 5
Method Best For Key Limitation
Subclassification Few discrete covariates Curse of dimensionality
Exact Matching Categorical variables, large N Sparse cells
Nearest Neighbor Continuous covariates Matching bias
CEM Mix of variable types Changes estimand
Propensity Score Many covariates Model dependence
OLS Good overlap, linear effects Extrapolation risk

Best practice: Use multiple methods as sensitivity checks. If results are robust across approaches, you can be more confident.

From Method to Application

Figure 13

Application: Maternal Bed Rest

Putting the Design into Practice

Durrance and Guldi (2015)

Maternal Bed Rest and Infant Health Outcomes

Skim the paper and think about the following:

  1. What is the main objective or research question?
  2. What methods are they using?
  3. What are the primary issues with the method(s) and how do they solve it? Think CIA.
  4. What are the main results?
  5. What are the policy implications?

The Research Question

Durrance & Guldi (2015): Does prescribed bed rest during pregnancy improve infant health?

The Challenge:

  • Women prescribed bed rest have pregnancy complications
  • Complications also cause poor infant outcomes
  • Naive comparison is confounded!

Data: Pregnancy Risk Assessment Monitoring System (PRAMS)

Figure 14

The Naive Result

Figure 15: Naive comparison suggests bed rest is harmful!

Key Findings

After propensity score matching on pregnancy complications and demographics:

Results

  • Very low birth weight: down 15.4%
  • Very preterm birth: down 7.7%
  • Infant death: down (significant)

The lesson: Selection bias made an effective treatment look harmful.

Without proper comparison groups, we would have concluded the opposite of the truth.

Key Takeaways

What We Learned

  1. Selection bias distorts naive comparisons
  2. Unconfoundedness requires observing all confounders
  3. Common support limits where we can make inferences
  4. Multiple methods exist - each with tradeoffs
  5. Balance checking is essential

The Big Picture

Matching and weighting are tools for constructing counterfactuals.

They work when:

  • We observe the right variables
  • Groups have overlap
  • We check our assumptions

“All models are wrong, but some are useful.” - George Box

Looking Ahead

Coming Up

  • Instrumental Variables: When unconfoundedness fails
  • Regression Discontinuity: Exploiting sharp thresholds
  • Difference-in-Differences: Before-after with comparison groups

Discussion Questions

  1. Hospital quality study: What confounders would you want to control for when comparing outcomes across hospitals?

  2. Propensity score paradox: Why might controlling for too many variables actually hurt your analysis?

  3. Bed rest study: What unmeasured confounders might still bias the propensity score results?

References & Resources

Key Papers: Rosenbaum & Rubin (1983) “The Central Role of the Propensity Score”; Abadie & Imbens (2011) “Bias-Corrected Matching Estimators”

Textbooks: Angrist & Pischke (2009) Mostly Harmless Econometrics; Cunningham (2021) Causal Inference: The Mixtape

This Deck: Companion R script (unconfoundedness_walkthrough.R) available for hands-on practice

From Application to Appendix

Figure 16

Appendix: Advanced Topics

Doubly Robust Estimation

Combines propensity score weighting with outcome regression:

\[\hat{\tau}_{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(D_i - \hat{e}(X_i))\hat{\mu}_1(X_i)}{\hat{e}(X_i)}\right] - \frac{1}{n}\sum_{i=1}^{n}\left[\frac{(1-D_i) Y_i}{1-\hat{e}(X_i)} + \frac{(D_i - \hat{e}(X_i))\hat{\mu}_0(X_i)}{1-\hat{e}(X_i)}\right]\]

Why “doubly robust”? Consistent if either the propensity score model or the outcome model is correctly specified.

Practical Advantage

You get two chances to get the model right. Even if one model is misspecified, you may still get consistent estimates.
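A hand-rolled sketch of the AIPW formula above, for a hypothetical data frame dat with outcome y, treatment d, and covariate x (in practice, use a vetted implementation):

# Doubly robust (AIPW) estimate following the formula above (hypothetical dat)
ps_fit <- glm(d ~ x, data = dat, family = binomial)
e_hat  <- predict(ps_fit, type = "response")                          # propensity model
mu1 <- predict(lm(y ~ x, data = subset(dat, d == 1)), newdata = dat)  # outcome model, treated
mu0 <- predict(lm(y ~ x, data = subset(dat, d == 0)), newdata = dat)  # outcome model, control
tau_dr <- mean(dat$d * dat$y / e_hat - (dat$d - e_hat) * mu1 / e_hat) -
  mean((1 - dat$d) * dat$y / (1 - e_hat) + (dat$d - e_hat) * mu0 / (1 - e_hat))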

Sensitivity Analysis

Since unconfoundedness is untestable, we can ask: How much unmeasured confounding would be needed to overturn our results?

Rosenbaum bounds: For matched pairs, test how sensitive results are to a hypothetical unmeasured confounder.

  • \(\Gamma = 1\): No unmeasured confounding
  • \(\Gamma = 2\): Unmeasured confounder doubles odds of treatment
  • If results hold at \(\Gamma = 2\), fairly robust

Interpretation

“Results would be overturned if an unmeasured confounder increased treatment odds by 50% and also affected outcomes.” - This gives readers a sense of how worried to be.
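For matched pairs, the rbounds package provides an implementation; a sketch assuming y_treated and y_control are vectors of matched-pair outcomes (names hypothetical):

# Rosenbaum bounds for matched pairs (hypothetical outcome vectors)
library(rbounds)
psens(y_treated, y_control, Gamma = 2, GammaInc = 0.25)  # p-value bounds up to Gamma = 2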

Machine Learning for Propensity Scores

Modern approaches use flexible ML methods:

Advantages:

  • Capture nonlinear relationships automatically
  • Less model dependence than logistic regression
  • Can handle many covariates

Common methods:

  • Random forests / Gradient boosting
  • LASSO / Elastic net
  • Ensemble methods (SuperLearner)

Caution

ML optimizes prediction, not balance. Always check covariate balance after matching, regardless of how propensity scores were estimated.
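As one sketch of the idea, a random forest propensity score (hypothetical df; the balance check afterwards is not optional):

# Random forest propensity scores (sketch; hypothetical df)
library(randomForest)
rf_fit <- randomForest(factor(treat) ~ age + income + education + chronic,
                       data = df)
df$ps_rf <- predict(rf_fit, type = "prob")[, 2]  # estimated P(treat = 1 | X)
# Match or weight on ps_rf, then re-check covariate balance as usual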

Variance Estimation

Standard errors for matching estimators are tricky:

Issues:

  • Matching creates dependence between observations
  • Propensity scores are estimated, not known
  • Weights may be extreme

Solutions:

  1. Abadie-Imbens variance estimator: Accounts for matching
  2. Bootstrap: Resample and re-match each iteration
  3. Linearization: Approximate variance analytically

Practical Advice

Use package defaults (e.g., MatchIt + marginaleffects) which implement appropriate variance estimators. For IPW, use robust standard errors.
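For example, a post-matching workflow along the lines recommended in the MatchIt documentation (a sketch: m_nr is a hypothetical matchit object from 1:1 matching without replacement, y a hypothetical outcome; matching with replacement needs get_matches() instead):

# Effect estimate with matching-appropriate SEs (sketch)
library(marginaleffects)
md  <- match.data(m_nr)  # matched sample with weights and subclass (pair) IDs
fit <- lm(y ~ treat * (age + income), data = md, weights = weights)
avg_comparisons(fit, variables = "treat",
                vcov = ~subclass,                  # cluster on matched pair
                newdata = subset(md, treat == 1),  # average over treated (ATT)
                wts = "weights")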