The Detective’s Toolkit

Finding Hidden Causes with Instrumental Variables

Jonathan Seward

Spring 2026

The Problem: Correlation Lies

Why Naive Associations Mislead

A Medical Mystery

The Puzzle

Hospitals that spend more money have worse patient outcomes.

Should we cut hospital budgets to improve health?

Obviously not. But this is exactly what the raw data suggests.

What’s going on here?

“The most important things are often invisible to simple statistics.”

Why Correlations Deceive Us

Consider this scenario from hospital data:

Show R Code
# assumes the deck's setup chunk defines hospital_data, the color palette
# objects, and theme_health_econ(); dollar_format() is from the scales package
ggplot(hospital_data, aes(x = spending, y = outcome_score)) +
  geom_point(alpha = 0.5, color = slate_gray, size = 2.5) +
  geom_smooth(method = "lm", se = TRUE, color = accent_coral, linewidth = 1.5) +
  labs(
    title = "Hospital Spending vs. Patient Outcomes",
    subtitle = "More spending appears to worsen outcomes!",
    x = "Hospital Spending ($)",
    y = "Patient Health Score (higher = better)"
  ) +
  scale_x_continuous(labels = dollar_format()) +
  theme_health_econ(base_size = 18)

Figure 1: The naive correlation suggests spending hurts patients

The Hidden Variable: Severity

What We See

  • Spending goes up
  • Outcomes go down
  • Correlation: negative

What We Miss

  • Sicker patients need more care
  • Sicker patients have worse outcomes
  • Severity drives both variables

Figure 2: The hidden confounder creates a spurious correlation

Revealing the Truth

When we account for patient severity, the picture changes completely:

Show R Code
hospital_data %>%
  mutate(severity_group = cut(severity,
                              breaks = quantile(severity, c(0, 0.33, 0.67, 1)),
                              labels = c("Low Severity", "Medium Severity", "High Severity"),
                              include.lowest = TRUE)) %>%
  ggplot(aes(x = spending, y = outcome_score, color = severity_group)) +
  geom_point(alpha = 0.6, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
  scale_color_manual(values = c(secondary_teal, warm_gold, accent_coral),
                     name = "Patient Severity") +
  scale_x_continuous(labels = dollar_format()) +
  labs(
    title = "The Real Story: Spending Helps (Within Severity Groups)",
    subtitle = "Once we compare similar patients, the true effect emerges",
    x = "Hospital Spending ($)",
    y = "Patient Health Score"
  ) +
  theme_health_econ(base_size = 18) +
  theme(legend.position = "right")

Figure 3: Within severity groups, spending actually HELPS

The Core Problem

Endogeneity

Endogeneity arises when the variable we’re studying (spending) is correlated with hidden factors (severity) that also affect the outcome.

Three Sources of Endogeneity:

  1. Omitted Variables - We can’t measure everything that matters
  2. Reverse Causality - Maybe outcomes affect spending?
  3. Measurement Error - Our data is imperfect

The Big Question: How do we find causal effects when we can’t see everything?

From Correlation to Instruments


Enter: Instrumental Variables

A Strategy for Hidden Confounding

The Detective’s Insight

The Key Idea

Find something that affects treatment but has no other connection to outcomes.

Think of it like this:

  • You want to know if coffee improves productivity
  • But motivated people drink more coffee AND are more productive
  • What if coffee prices varied randomly across cities?
  • Price changes who drinks coffee, but doesn’t directly affect productivity

Price is an “instrument” - it creates as-if-random variation in coffee drinking.

What Makes a Good Instrument?

An instrument \(Z\) must satisfy two conditions:

Figure 5: The instrument affects outcomes ONLY through the treatment

Requirement #1: Relevance (First Stage)

The instrument must actually affect treatment:

\[\text{Cov}(Z, D) \neq 0\]

Quick Stats Refresher

Covariance measures how two variables move together. If \(Z\) and \(D\) have positive covariance, when \(Z\) goes up, \(D\) tends to go up too.

Good news: This is testable! We can check with data whether the instrument actually moves the treatment.
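
As a minimal sketch (simulated data with hypothetical variable names, base R only), the check is a single regression of treatment on the instrument:

# Hypothetical example: z shifts d, so the first stage is strong
set.seed(1)
dat <- data.frame(z = rbinom(500, 1, 0.5))
dat$d <- 1 + 0.8 * dat$z + rnorm(500)

first_stage <- lm(d ~ z, data = dat)
summary(first_stage)$fstatistic[1]  # first-stage F; rule of thumb: want > 10

We return to what “strong enough” means in the weak-instruments section.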

Requirement #2: Exclusion Restriction

The instrument affects outcomes only through treatment: \(Z \rightarrow Y\) only via \(D\)

There must be no direct path from \(Z\) to \(Y\) that bypasses treatment.

The Hard Part

This is not testable. We must argue it convincingly with theory and institutional knowledge.

Graduate Students: Formal Statement

Exclusion states \(Y_i(d, z) = Y_i(d, z') \; \forall z, z'\). Potential outcomes depend only on treatment, not instrument value.

A Classic Example: Family Size and Work

Question: Does having more children cause mothers to work less?

Problem: Women who prefer staying home have more kids AND work less

Solution (Angrist & Evans, 1998):

  • Parents often want at least one child of each sex
  • First two same-sex → more likely to have third
  • Sex of children is random!

Figure 6: Same-sex children predict larger families

Why This Works: The Grandma Test

The Grandma Principle

A good instrument seems puzzling to someone who doesn’t know the research question.

Imagine telling your grandmother:

“The sex of a woman’s first two children predicts whether she works.”

Her reaction: “That makes no sense! Why would that matter?”

Exactly. It only makes sense through family size:

Same-sex kids → More kids → Less work

If the correlation made sense on its own, the instrument might not be valid.

From Idea to Estimation


The Detective’s Method: Calculating the Clues

From Intuition to the Wald Estimator

From Intuition to Calculation

Now that we understand what makes a good instrument, let’s see how to use it.

The detective has gathered the clues — now it’s time to solve the case.

The Wald Estimator: Ratios Tell the Story

The simplest IV estimator is just a ratio:

\[\hat{\delta}_{IV} = \frac{\text{Change in Outcome when } Z \text{ changes}}{\text{Change in Treatment when } Z \text{ changes}}\]

Or in statistics notation:

\[\hat{\delta}_{IV} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(D, Z)} = \frac{\text{Reduced Form}}{\text{First Stage}}\]

  • Reduced Form: Effect of instrument on outcome (numerator)
  • First Stage: Effect of instrument on treatment (denominator)

Graduate Students: Deriving the Wald Estimator

Starting with \(Y = \alpha + \delta D + \gamma A + \nu\): \(\text{Cov}(Y, Z) = \delta \cdot \text{Cov}(D, Z) + \gamma \cdot \text{Cov}(A, Z) + \text{Cov}(\nu, Z)\). If exclusion holds, the last two terms are zero, giving \(\hat{\delta}_{Wald} = \text{Cov}(Y, Z) / \text{Cov}(D, Z)\).
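
A quick numerical check of this identity, using simulated data (names are illustrative):

# Binary instrument, a treatment it shifts, and an outcome with true effect 1.5
set.seed(2024)
z <- rbinom(5000, 1, 0.5)
d <- 2 + 3 * z + rnorm(5000)
y <- 10 + 1.5 * d + rnorm(5000)

cov(y, z) / cov(d, z)                  # covariance form
(mean(y[z == 1]) - mean(y[z == 0])) /
  (mean(d[z == 1]) - mean(d[z == 0]))  # reduced form / first stage

With a binary instrument the two expressions are algebraically identical, and both land near the true effect of 1.5.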

Visual Intuition for Wald

Show R Code
# Generate IV data
set.seed(123)
n <- 400
iv_demo <- tibble(
  instrument = rbinom(n, 1, 0.5),
  treatment = 2 + 3 * instrument + rnorm(n, 0, 1.5),
  outcome = 10 + 1.5 * treatment + rnorm(n, 0, 2)
)

# Calculate group means
group_means <- iv_demo %>%
  group_by(instrument) %>%
  summarize(mean_treat = mean(treatment), mean_out = mean(outcome), .groups = "drop")

# Calculate Wald
delta_treat <- diff(group_means$mean_treat)
delta_out <- diff(group_means$mean_out)
wald_est <- delta_out / delta_treat

ggplot(iv_demo, aes(x = treatment, y = outcome)) +
  geom_point(aes(color = factor(instrument)), alpha = 0.4, size = 2) +
  geom_point(data = group_means, aes(x = mean_treat, y = mean_out,
                                      fill = factor(instrument)),
             size = 10, shape = 21, color = "white", stroke = 2) +
  # Draw one arrow between the two group means (annotate() avoids
  # redrawing the same segment once per row of group_means)
  annotate("segment",
           x = group_means$mean_treat[1], y = group_means$mean_out[1],
           xend = group_means$mean_treat[2], yend = group_means$mean_out[2],
           linewidth = 2, color = primary_blue,
           arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
  annotate("label", x = mean(group_means$mean_treat) + 0.3,
           y = mean(group_means$mean_out) + 1.5,
           label = paste0("Wald Estimate = ", round(wald_est, 2)),
           size = 6, fill = warm_gold, color = "white", fontface = "bold") +
  scale_color_manual(values = c(slate_gray, secondary_teal),
                     labels = c("Z = 0", "Z = 1"), name = "Instrument") +
  scale_fill_manual(values = c(slate_gray, secondary_teal), guide = "none") +
  labs(title = "The Wald Estimator: Connecting the Dots",
       subtitle = "Large circles are group means; the slope is our causal estimate",
       x = "Treatment Level", y = "Outcome") +
  theme_health_econ(base_size = 18) +
  theme(legend.position = "right")

Figure 8: The Wald estimator is the slope connecting group means

Two-Stage Least Squares (2SLS)

Stage 1: Predict Treatment

\[\hat{D} = \hat{\pi}_0 + \hat{\pi}_1 Z\]

Creates “cleaned” treatment reflecting only instrument-driven variation.

Stage 2: Use Predictions

\[Y = \hat{\alpha} + \hat{\delta} \cdot \hat{D}\]

The coefficient \(\hat{\delta}\) is our causal estimate!

Why This Works

\(\hat{D}\) only varies because of \(Z\), removing confounder variation.

Graduate Students: 2SLS Standard Errors

Don’t run two regressions manually — \(\hat{D}\) is estimated, not observed! Use ivreg() in R or ivregress in Stata for proper SEs.
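
A minimal sketch of the contrast, assuming the AER package and simulated data (hypothetical names): the point estimates agree, but only ivreg() gets the standard errors right.

library(AER)  # provides ivreg()

set.seed(7)
z <- rbinom(1000, 1, 0.5)
u <- rnorm(1000)                    # unobserved confounder
d <- 1 + 0.9 * z + u + rnorm(1000)
y <- 2 + 1.5 * d + u + rnorm(1000)

# Manual 2SLS: same coefficient, but the second stage treats d_hat as
# data rather than an estimate, so its standard errors are wrong
d_hat <- fitted(lm(d ~ z))
coef(lm(y ~ d_hat))["d_hat"]

# One call with correct standard errors
summary(ivreg(y ~ d | z))$coefficients["d", ]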

From Estimation to Diagnostics


When the Detective’s Tools Fail

Weak Instruments and Exclusion Violations

Danger #1: Weak Instruments

Warning

If your first-stage F-statistic is below 10, your instrument is weak.

Show R Code
# Simulate IV with varying instrument strength
simulate_iv <- function(n = 150, strength = 1.0) {
  tibble(
    u = rnorm(n),
    z = rnorm(n),
    # Treatment depends on confounder u and instrument z
    d = 0.8 * u + strength * z + rnorm(n, 0, 0.3),
    # True effect of d is 1.5; confounder u biases OLS upward
    y = 2 + 1.5 * d + 1.2 * u + rnorm(n, 0, 0.5)
  )
}

set.seed(456)

# 15 strength values x 20 replications = 300 simulations, giving a
# near-continuous spread of first-stage F-statistics
# (map_dfr() is from purrr; ivreg() from AER)
iv_results <- map_dfr(seq(0.02, 1.5, length.out = 15), function(str) {
  map_dfr(1:20, function(i) {
    dat <- simulate_iv(n = 150, strength = str)
    fs <- lm(d ~ z, data = dat)
    iv <- ivreg(y ~ d | z, data = dat)
    ols <- lm(y ~ d, data = dat)
    tibble(
      strength = str,
      iv_est = coef(iv)["d"],
      ols_est = coef(ols)["d"],
      f_stat = summary(fs)$fstatistic[1]
    )
  })
})

# Calculate OLS bias for reference
ols_biased <- mean(iv_results$ols_est)

ggplot(iv_results, aes(x = f_stat, y = iv_est)) +
  # Shade the "danger zone" where F < 10
  annotate("rect", xmin = 0, xmax = 10, ymin = -2, ymax = 6,
           fill = accent_coral, alpha = 0.15) +
  annotate("text", x = 5, y = 5.2, label = "Danger Zone\n(F < 10)",
           color = accent_coral, fontface = "bold", size = 4.5) +
  # Add points
  geom_point(alpha = 0.5, size = 2.5, color = slate_gray) +
  # True effect line
  geom_hline(yintercept = 1.5, linetype = "solid", color = secondary_teal, linewidth = 1.5) +
  annotate("text", x = 80, y = 1.5 + 0.3, label = "True Effect = 1.5",
           color = secondary_teal, fontface = "bold", size = 4.5, hjust = 1) +
  # OLS (biased) line
  geom_hline(yintercept = ols_biased, linetype = "dashed", color = accent_coral, linewidth = 1.2) +
  annotate("text", x = 80, y = ols_biased + 0.3, label = paste0("OLS Estimate = ", round(ols_biased, 2), " (biased)"),
           color = accent_coral, fontface = "bold", size = 4.5, hjust = 1) +
  # F = 10 threshold
  geom_vline(xintercept = 10, linetype = "dashed", color = slate_gray, linewidth = 1) +
  # Smooth trend to show bias pattern
  geom_smooth(method = "loess", se = FALSE, color = primary_blue, linewidth = 1.5, span = 0.5) +
  scale_x_continuous(limits = c(0, 90), breaks = c(0, 10, 20, 40, 60, 80)) +
  scale_y_continuous(limits = c(-2, 6)) +
  labs(
    title = "Weak Instruments: Bias Toward OLS + High Variance",
    subtitle = "Each dot is one simulation. Blue curve shows the average IV estimate at each F level.",
    x = "First-Stage F-Statistic",
    y = "IV Estimate"
  ) +
  theme_health_econ(base_size = 18)

Figure 10: As F-statistic drops, IV estimates become biased toward OLS and more variable

Danger #2: Exclusion Violations

The exclusion restriction is untestable. You must argue it convincingly.

Good Arguments: Mechanism knowledge, literally random, no alternative pathways

Red Flags: “Seems unrelated”, complex chains, multiple pathways

Table 1

Instrument            Potential Violation
Lottery (draft)       Affects education choices
Birth quarter         Correlates with family SES
Distance to college   Correlates with labor markets
Weather               Affects mood directly

The Five Assumptions

Table 2

#   Assumption     Plain English                              Testable
1   SUTVA          No interference between units              Sometimes
2   Independence   Instrument is as-good-as random            Partially
3   Exclusion      Instrument affects Y only through D        No
4   Relevance      Instrument actually changes D              Yes!
5   Monotonicity   Instrument moves everyone same direction   Rarely

Key insight: Only relevance is fully testable. The others require theory.

Graduate Students: Formal Statements

  1. SUTVA: \(Y_i = Y_i(D_i)\)
  2. Independence: \((Y^0, Y^1, D^0, D^1) \perp\!\!\!\perp Z\)
  3. Exclusion: \(Y_i(d, z) = Y_i(d, z')\)
  4. Relevance: \(E[D^1 - D^0] \neq 0\)
  5. Monotonicity: \(D^1 \geq D^0 \; \forall i\)

From Failure Modes to LATE


Who Did the Detective Actually Identify?

Interpreting the Complier Population

The LATE Framework

Critical Insight

IV estimates the effect for compliers only — not everyone!

Table 3

Type           Definition                             IV Applies?
Always Takers  Take treatment regardless of Z         No
Never Takers   Never take treatment regardless of Z   No
Compliers      Take treatment only when Z is 'on'     Yes!
Defiers        Do the opposite of Z                   Violates monotonicity
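
A minimal simulation sketch of these types (effect sizes are hypothetical, chosen to mirror the next figure) shows the Wald ratio recovering the compliers’ effect, not the population average:

set.seed(101)
n <- 100000
type <- sample(c("never", "complier", "always"), n, replace = TRUE,
               prob = c(0.3, 0.4, 0.3))
z <- rbinom(n, 1, 0.5)
d <- ifelse(type == "always", 1, ifelse(type == "never", 0, z))
effect <- c(never = 0.8, complier = 1.5, always = 2.3)[type]
y <- 1 + effect * d + rnorm(n)

mean(effect)                          # population ATE, about 1.53
(mean(y[z == 1]) - mean(y[z == 0])) /
  (mean(d[z == 1]) - mean(d[z == 0])) # Wald IV, about 1.5: compliers only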

Visualizing LATE

Show R Code
set.seed(789)
late_data <- tibble(
  type = rep(c("Never Takers", "Compliers", "Always Takers"), each = 200),
  effect = c(rnorm(200, 0.8, 0.4), rnorm(200, 1.5, 0.4), rnorm(200, 2.3, 0.4))
)

late_summary <- late_data %>%
  group_by(type) %>%
  summarize(mean_eff = mean(effect), .groups = "drop") %>%
  mutate(type = factor(type, levels = c("Never Takers", "Compliers", "Always Takers")))

pop_ate <- mean(late_data$effect)

ggplot(late_summary, aes(x = type, y = mean_eff, fill = type)) +
  geom_col(width = 0.65, show.legend = FALSE) +
  geom_hline(yintercept = pop_ate, linetype = "dashed", color = accent_coral, linewidth = 1.2) +
  geom_hline(yintercept = 1.5, linetype = "solid", color = primary_blue, linewidth = 1.5) +
  annotate("text", x = 2.9, y = pop_ate + 0.12, label = "Population ATE",
           color = accent_coral, fontface = "bold", size = 5, hjust = 1) +
  annotate("text", x = 2.9, y = 1.5 - 0.12, label = "LATE (IV estimates this)",
           color = primary_blue, fontface = "bold", size = 5, hjust = 1) +
  annotate("segment", x = 1.7, xend = 2.3, y = 0.35, yend = 0.35,
           arrow = arrow(length = unit(0.2, "cm"), ends = "both"),
           color = secondary_teal, linewidth = 1.5) +
  annotate("text", x = 2, y = 0.2, label = "IV focuses here",
           color = secondary_teal, fontface = "bold", size = 5) +
  scale_fill_manual(values = c(slate_gray, secondary_teal, warm_gold)) +
  labs(title = "Local Average Treatment Effect (LATE)",
       subtitle = "IV tells us about compliers — those whose behavior changes with the instrument",
       x = "", y = "Treatment Effect") +
  theme_health_econ(base_size = 18) +
  coord_cartesian(ylim = c(0, 2.9))

Figure 12: IV estimates the LATE, the effect for compliers only

LATE: Blessing or Curse?

The Good News

  • LATE is a real causal effect
  • For people whose behavior can be changed
  • Often policy-relevant (marginal responders)

Draft lottery: effect for those who would serve if drafted but wouldn’t otherwise — exactly who a draft affects!

The Caution

  • LATE ≠ ATE
  • May not generalize
  • Different instruments → different LATEs

Birth quarter IV: effect for minimum-age dropouts — is that who we care about?

Graduate Students: The LATE Theorem

Under assumptions 1-5: \(\delta_{IV} = E[Y^1 - Y^0 | D^1 > D^0]\) — the causal effect for compliers. Different instruments identify different complier populations.

From Interpretation to Practice


The Detective’s Report: Practical Guidance

What to Report and How to Defend It

What to Report in IV Studies

Table 4

Analysis      What It Shows         Why Include
Naive OLS     Baseline (biased)     Shows the bias to fix
First Stage   Z → D effect          Proves relevance
Reduced Form  Z → Y effect          Itself a causal effect
IV/2SLS       D → Y causal effect   Main result
F-statistic   Instrument strength   Weak-IV diagnostic

Pro tip: Always show the reduced form — it’s the causal effect of \(Z\), easier to defend than IV.
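
A minimal end-to-end sketch of this table in R, assuming the AER package (data and names are hypothetical):

library(AER)

set.seed(99)
z <- rbinom(800, 1, 0.5); u <- rnorm(800)
dat <- data.frame(z = z, d = 1 + z + u + rnorm(800))
dat$y <- 2 + 1.5 * dat$d + u + rnorm(800)

ols          <- lm(y ~ d, data = dat)         # naive baseline (biased)
first_stage  <- lm(d ~ z, data = dat)         # relevance
reduced_form <- lm(y ~ z, data = dat)         # causal effect of Z on Y
iv           <- ivreg(y ~ d | z, data = dat)  # main result

summary(first_stage)$fstatistic[1]            # weak-IV diagnostic
coef(reduced_form)["z"] / coef(first_stage)["z"]  # equals coef(iv)["d"]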

Summary: The IV Checklist

Before trusting an IV analysis, ask:

Checklist

  1. Is there a strong first stage? (F > 10)
  2. Is the exclusion restriction plausible? (No alternative pathways)
  3. Is independence credible? (Instrument as-good-as random)
  4. Who are the compliers? (Is LATE what we want?)
  5. Are the reduced form results sensible? (Right direction?)

Key Takeaways

What We Learned

  1. Correlation lies due to confounders
  2. Instruments create as-if-random variation
  3. IV = Reduced Form / First Stage
  4. Only relevance is testable
  5. IV estimates LATE, not ATE

The Big Picture

IV = detective using indirect evidence:

  • Can’t see the confounder
  • Found something that moves D randomly
  • Follow that thread to causation

“A good instrument is a gift from nature.”

Looking Ahead

Coming Up

  • Regression Discontinuity: When rules create quasi-experiments
  • Difference-in-Differences: Before-after comparisons done right
  • Applications: How these tools solve real health economics puzzles

Discussion Questions

  1. Hospital spending revisited: What might be a valid instrument for hospital spending in studying patient outcomes?

  2. Your research: Think of a causal question you’re interested in. What confounders might bias a naive analysis? Can you imagine an instrument?

  3. Critical thinking: A study uses “distance to hospital” as an instrument for receiving treatment. What are potential threats to the exclusion restriction?

References & Resources

Key Papers: Angrist & Krueger (2001) “IV and the Search for Identification” JEP; Angrist & Evans (1998) “Children and Parents’ Labor Supply” AER

Textbooks: Angrist & Pischke (2009) Mostly Harmless Econometrics; Cunningham (2021) Causal Inference: The Mixtape

This Deck: Companion R script (iv_walkthrough.R) available for hands-on practice

From Practice to Appendix


Appendix: Advanced Topics

Leniency Designs

A leniency design uses random assignment to decision-makers as an instrument.

How It Works:

  1. Individuals move through a system (courts, hospitals, schools)
  2. They’re randomly assigned to different decision-makers
  3. Decision-makers differ in their propensity to apply treatment

Examples:

  • Judicial sentencing: Judges randomly assigned; some are stricter (Kling 2006, Mueller-Smith 2015)
  • Foster care: Caseworkers differ in removal decisions (Doyle 2007, 2008)
  • Radiologist diagnosis: Random assignment to radiologists who differ in diagnosis rates (Chan et al. 2022)
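
A minimal implementation sketch (hypothetical judge/case simulation; dplyr assumed). The standard instrument is each decision-maker’s leave-one-out treatment rate:

library(dplyr)

# Cases randomly assigned to judges who differ only in their propensity
# to incarcerate (pure "preferences")
set.seed(42)
judges <- tibble(judge_id = 1:20, strictness = runif(20, 0.2, 0.8))
cases <- tibble(judge_id = sample(1:20, 1000, replace = TRUE)) %>%
  left_join(judges, by = "judge_id") %>%
  mutate(incarcerated = rbinom(n(), 1, strictness)) %>%
  # Leave-one-out leniency: the judge's rate excluding the focal case,
  # avoiding mechanical correlation with the case's own outcome
  group_by(judge_id) %>%
  mutate(leniency_iv = (sum(incarcerated) - incarcerated) / (n() - 1)) %>%
  ungroup()

leniency_iv then instruments for incarcerated in a 2SLS regression of the downstream outcome.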

Key Challenge

Leniency designs assume differences are due to preferences, not skill. If some judges are simply better at identifying who needs treatment, exclusion may fail.

Sensitivity Analysis

Since exclusion is untestable, we analyze sensitivity to violations:

Sources of Bias in IV:

  1. Direct effect of \(Z\) on \(Y\) for compliers
  2. Direct effect of \(Z\) on \(Y\) for non-compliers, weighted by non-compliance probability

Bias Formula:

\[\text{plim } \hat{\delta}_{IV} = \delta + \frac{\text{Cov}(\eta, Z)}{\text{Cov}(D, Z)}\]

where \(\eta\) is the composite error. Bias depends on:

  • How strong the instrument is (denominator)
  • How correlated the instrument is with unobservables (numerator)

Practical Implication

Strong instruments are more robust to small exclusion violations because the denominator is large.
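
A back-of-the-envelope illustration of the formula, with assumed values:

direct_cov      <- 0.05               # Cov(eta, Z): a small exclusion violation
first_stage_cov <- c(0.1, 0.5, 1.0)   # Cov(D, Z): weak to strong instrument
direct_cov / first_stage_cov          # asymptotic bias: 0.50, 0.10, 0.05

The same violation that would wreck a weak design barely moves a strong one.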

Monotonicity Violations

What if there are defiers?

Defiers do the opposite of what the instrument suggests:

  • Low draft number → avoid military (to spite the system)
  • High draft number → volunteer (perverse response)

Consequences:

\[\text{IV Bias} \propto \frac{\text{Pr(Defier)}}{\text{Pr(Complier)}} \times (ATE_{Defiers} - ATE_{Compliers})\]

If defiers and compliers have similar treatment effects, bias is small even with some defiers.

Testing Monotonicity:

  • Cannot directly test (we don’t observe compliance types)
  • Can check for plausibility: Does the instrument logically push everyone the same direction?
  • Relaxations exist: “Average monotonicity” (Frandsen et al. 2019)

Weak Instrument Inference

When \(F < 10\), standard IV inference fails:

Problems:

  • 2SLS is biased toward OLS
  • Standard errors are too small
  • t-tests have wrong size (reject true nulls too often)

Solutions:

  1. Anderson-Rubin (AR) test: Valid inference regardless of instrument strength
  2. Conditional Likelihood Ratio (CLR): More powerful than AR
  3. LIML estimator: Less biased than 2SLS with weak instruments
  4. tF adjustment: Corrects critical values based on first-stage F
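
The AR test (item 1 above) is simple enough to hand-roll in the one-instrument, no-covariate case. A sketch, reusing the simulate_iv() helper defined earlier (grid bounds are arbitrary):

# Under H0: delta = delta0, y - delta0 * d is unrelated to z, so an
# ordinary F-test of z gives inference valid at any instrument strength
ar_pvalue <- function(delta0, dat) {
  f <- summary(lm(I(y - delta0 * d) ~ z, data = dat))$fstatistic
  pf(f[1], f[2], f[3], lower.tail = FALSE)
}

set.seed(456)
dat <- simulate_iv(n = 150, strength = 0.2)   # deliberately weak
grid <- seq(-2, 5, by = 0.01)
keep <- sapply(grid, ar_pvalue, dat = dat) > 0.05
range(grid[keep])  # 95% AR confidence set: wide (even unbounded) when F is low

Inverting the test over a grid of candidate effects is what makes the resulting confidence set robust: it never relies on the 2SLS point estimate being well-behaved.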

Rule of Thumb Update

The F > 10 rule (Stock & Yogo 2005) assumes you want 2SLS bias < 10% of OLS bias. For stricter standards, you need F > 20 or higher.

Two-Sample IV

When to use: First stage and reduced form come from different datasets.

Common scenarios:

  • Survey data (first stage) + admin data (reduced form)
  • Historical instrument differs from outcome period
  • Privacy prevents individual-level linking

The Mechanics:

  1. First stage in Sample A: \(\hat{\pi}_1\)
  2. Reduced form in Sample B: \(\hat{\rho}\)
  3. IV estimate: \(\hat{\delta}_{TSIV} = \hat{\rho} / \hat{\pi}_1\)
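
A minimal sketch of the three steps, with two simulated samples drawn from the same (hypothetical) population, where sample A lacks the outcome and sample B lacks the treatment:

set.seed(321)
make_pop <- function(n) {
  u <- rnorm(n); z <- rbinom(n, 1, 0.5)
  d <- 1 + 0.8 * z + 0.5 * u + rnorm(n)
  y <- 2 + 1.5 * d + u + rnorm(n)
  data.frame(z, d, y)
}
sample_a <- make_pop(2000)                     # pretend y is unobserved here
sample_b <- make_pop(2000)                     # pretend d is unobserved here

pi1 <- coef(lm(d ~ z, data = sample_a))["z"]   # first stage (Sample A)
rho <- coef(lm(y ~ z, data = sample_b))["z"]   # reduced form (Sample B)
unname(rho / pi1)                              # TSIV estimate, near 1.5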

Warning

Both samples must be from the same population — different complier populations break identification.

SEs: Use Inoue & Solon (2010) or bootstrap to account for two-stage uncertainty.

2SLS vs. LIML

Table 5

Property   2SLS         LIML
Bias       Toward OLS   ≈ Median-unbiased
Weak IV    Poor         Better
Many IVs   Overfits     Robust
Speed      Fast         Slower

Prefer LIML when:

  • F < 20 (weak instrument)
  • Many instruments
  • Robustness check needed

2SLS is fine when:

  • F > 20 (strong first stage)
  • Single instrument
  • Simplicity matters

Tip

Report both. If they differ, you have a weak instrument problem.
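
One way to see both at once in R, assuming the ivmodel package’s interface (its summary reports OLS, TSLS, LIML, and Fuller estimates together, plus the AR and CLR tests from the previous slide):

library(ivmodel)

# Hypothetical weak-instrument data
set.seed(11)
u <- rnorm(200); z <- rnorm(200)
d <- 0.15 * z + u + rnorm(200)
y <- 1.5 * d + u + rnorm(200)

summary(ivmodel(Y = y, D = d, Z = z))  # compare the TSLS and LIML rows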

Health Economics Application: Chan et al. (2022)

Research Question: How much diagnostic variation is due to physician skill vs. preferences?

Setting: VA radiologists diagnosing pneumonia from chest X-rays

Design: Patients quasi-randomly assigned to radiologists who differ in diagnosis rates

Key Finding: ~39% of diagnostic variation is due to skill differences

Implications for IV:

  • Traditional leniency designs assume all variation is preferences
  • If skill varies, exclusion restriction may fail
  • Skilled radiologists affect outcomes through better diagnosis accuracy, not just diagnosis rates

Methodological Contribution

The paper develops partial identification bounds when skill and preferences cannot be separated — a more honest approach when exclusion is questionable.