Week 8: Instrumental Variables

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

March 8, 2026


Last three weeks

  • Identification under conditional ignorability
    • Treatment assignment is independent of the potential outcomes given observed confounders \(\mathbf{X}\)
    • “Selection-on-observables”
  • “Selection-on-observables” isn’t a testable assumption
    • Relies on theory to decide which \(\mathbf{X}\) to include.
    • DAGs can help here.
  • Lots of estimation strategies
    • Stratify with low-dimensional \(\mathbf{X}\)
    • IPTW to eliminate treatment-covariate relationship, regression to model the outcome-covariate relationship.
    • Matching to reduce model dependence.
      • Or consider more modern flexible modelling techniques for \(\mathbb{E}[Y_i(d) | X_i]\)

This week

  • Can we estimate a treatment effect when neither ignorability nor conditional ignorability holds for the treatment?
    • Can we get rid of unobserved confounding?
  • “Instrumental variables” designs are one way of dealing with this
  • We can identify some average of treatment effects if…
    • There does exist an ignorable or conditionally ignorable instrument which…
    • …has a monotonic effect on the treatment…
    • …and has no effect on the outcome except through its effect on the treatment.
  • What’s the average? The “Local Average Treatment Effect”
    • Average effect among those who are moved to take treatment by the instrument

Instrumental Variables

Treatment non-compliance

  • Often experiments suffer from treatment non-compliance
    • Participants randomized to receive a phone call don’t pick up.
    • Participants randomized to wear surgical masks choose not to.
  • New notation!
    • Let \(Z_i\) denote whether \(i\) is assigned to receive a treatment.
    • Let \(D_i\) denote the treatment actually taken by an individual.
  • Can we just take the simple difference-in-means between \(D_i = 1\) and \(D_i = 0\)?
    • No! Non-compliance is driven by other factors that may also affect the outcome.
    • We’re stuck with an observational design.

  • Unless…

Intent-to-treat effect

  • We can first just change the question - instead of the effect of treatment, we can make our estimand the effect of being assigned to treatment.

  • Our estimator for the ITT is just the difference in means between the \(Z_i = 1\) and \(Z_i = 0\) arms

    \[\hat{\tau}_{\text{ITT}} = \hat{E}[Y_i | Z_i = 1] - \hat{E}[Y_i | Z_i = 0]\]

  • Identified under randomization of \(Z_i\) even if \(D_i\) is not randomized.

    • But combines two effects: the actual effect of \(D_i\) and the effect of \(Z_i\) on \(D_i\).

Instrumental variables

  • Suppose though that we’re interested in the actual effect of receiving treatment (the effect of \(D_i\)). What can we do?

Instrumental variables

  • Start by writing down potential outcomes for \(D_i\) along with joint potential outcomes of \(Y_i\) in terms of \(Z_i\) and \(D_i\)

    \[D_i(z) = D_i \text{ if } Z_i = z\] \[Y_i(d, z) = Y_i \text{ if } D_i = d, Z_i = z\]

  • Observed treatment \(D_i\) is a function of treatment assignment (\(Z_i\)) - it’s a post-treatment quantity (and so has potential outcomes).

Assumptions

  1. Randomization of instrument
  2. Exclusion restriction
  3. Non-zero first-stage relationship
  4. Monotonicity

Assumption 1: Randomization

  • \(Z_i\) is independent of both sets of potential outcomes (potential outcomes for the treatment and potential outcomes for the outcome).

    \[\{D_i(1), D_i(0)\} {\perp \! \! \! \perp} Z_i\]

    \[\{Y_i(d, z) \forall d, z\} {\perp \! \! \! \perp} Z_i\]

  • We can weaken this to conditional ignorability (where \(Z_i\) is randomized conditional on \(X_i\)), which is common in observational settings.

  • Sufficient to identify the intent-to-treat (ITT) effect

Assumption 1: Randomization

  • The randomization assumption eliminates any arrows from \(U\) to \(Z\).

Assumption 2: Exclusion restriction

  • \(Z_i\) only affects \(Y_i\) by way of its effect on \(D_i\).

  • In other words, if \(D_i\) were set at some level \(d\), the potential outcome for \(Y_i(d, z)\) does not depend on \(z\).

    \[Y_i(d, z) = Y_i(d, z^{\prime}) \text{ for any } z \neq z^{\prime}\]

  • Not a testable assumption! - we have to justify this with substantive knowledge.

    • Easiest in the treatment non-compliance case
    • But consider what might happen in a non-blinded situation where respondents knew their treatment assignments?
  • “Surprise” factor - If I told you \(Z\) was associated with \(Y\), would you think “that’s odd”?

Assumption 2: Exclusion restriction

  • The exclusion restriction eliminates any causal paths from \(Z\) to \(Y\) except for \(Z \to D \to Y\).

Assumption 3: Non-zero first stage

  • \(Z_i\) has an effect on \(D_i\)

    \[\mathbb{E}[D_i(1) - D_i(0)] \neq 0\]

  • Seems trivial, but we need this to make the estimator work.

  • Magnitude matters for estimator performance - a “weak” first-stage \(\leadsto\) biased IV ratios in finite samples.

    • IV estimators are consistent but not unbiased.

Assumption 3: Non-zero first stage

  • The non-zero first stage assumption requires a path from \(Z\) to \(D\).

Assumption 4: Monotonicity

  • \(Z_i\)’s effect on \(D_i\) only goes in one direction at the individual level

    \[D_i(1) - D_i(0) \ge 0\]

  • If it goes the other way, we can always flip the direction of the treatment to make this hold

    • The key is that the instrument does not have a positive effect on \(D_i\) for some units and a negative effect for others.
  • Not a testable assumption

Assumption 4: Monotonicity

  • In binary instrument/binary treatment world, this is sometimes called a “no defiers” assumption.
  Stratum            \(D_i(1)\)   \(D_i(0)\)
  “Always-takers”    \(1\)        \(1\)
  “Never-takers”     \(0\)        \(0\)
  “Compliers”        \(1\)        \(0\)
  “Defiers”          \(0\)        \(1\)
  • Under no defiers, every unit with \(D_i = 1\) and \(Z_i = 0\) is an always-taker, and every unit with \(D_i = 0\) and \(Z_i = 1\) is a never-taker.

Assumption 4: Monotonicity

  • Can’t represent the monotonicity assumption in a DAG - it’s an assumption about the form of the relationship between \(Z\) and \(D\).

Interpreting the IV estimand

  • The classic IV estimand with one instrument is a ratio of covariances.

    \[\tau_{\text{IV}} = \frac{Cov(Y, Z)}{Cov(D, Z)}\]

  • With a binary instrument, this is sometimes called the “Wald” ratio - a ratio of differences in means

    \[\tau_{\text{IV}} = \frac{\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0]}{\mathbb{E}[D_i | Z_i = 1] - \mathbb{E}[D_i | Z_i = 0]}\]

Interpreting the IV estimand

  • What does the Wald estimand correspond to in terms of causal effects?

    \[\tau_{\text{IV}} = \frac{\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0]}{\mathbb{E}[D_i | Z_i = 1] - \mathbb{E}[D_i | Z_i = 0]}\]

  • Under our identification assumptions:

    • The numerator is the ITT
    • The denominator is the first-stage effect

Interpreting the IV estimand

  • Let’s decompose the denominator first - under randomization:

    \[\begin{align*} \mathbb{E}[D_i | Z_i = 1] - \mathbb{E}[D_i | Z_i = 0] &= \mathbb{E}[D_i(1) | Z_i = 1] - \mathbb{E}[D_i(0) | Z_i = 0]\\ &= \mathbb{E}[D_i(1)] - \mathbb{E}[D_i(0)]\\ &= \mathbb{E}[D_i(1) - D_i(0)] \end{align*}\]

Interpreting the IV estimand

  • With binary treatment/binary instrument, we can use the law of total expectation to decompose by principal stratum

    \[\begin{align*}\mathbb{E}[D_i(1) - D_i(0)] = \underbrace{\mathbb{E}[D_i(1) - D_i(0) | D_i(1) = D_i(0)] \times P(D_i(1) = D_i(0))}_{\text{(always/never-takers)}} + \\ \underbrace{\mathbb{E}[D_i(1) - D_i(0) | D_i(1) > D_i(0)] \times P(D_i(1) > D_i(0))}_{\text{(compliers)}} + \\ \underbrace{\mathbb{E}[D_i(1) - D_i(0) | D_i(1) < D_i(0)] \times P(D_i(1) < D_i(0))}_{\text{(defiers)}}\end{align*}\]

  • The first term is \(0\)

  • And by no defiers, the last term is \(0\) since \(P(D_i(1) < D_i(0)) = 0\)

    \[\mathbb{E}[D_i(1) - D_i(0)] = Pr(D_i(1) > D_i(0))\]

Interpreting the IV estimand

  • Next, the numerator (the ITT). Under the exclusion restriction and randomization:

    \[\mathbb{E}[Y_i | Z_i = 1] = \mathbb{E}\bigg[Y_i(0) + \bigg(Y_i(1) - Y_i(0)\bigg)D_i(1)\bigg]\] \[\mathbb{E}[Y_i | Z_i = 0] = \mathbb{E}\bigg[Y_i(0) + \bigg(Y_i(1) - Y_i(0)\bigg)D_i(0)\bigg]\]

  • The difference (with some algebra) is

    \[\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0] = \mathbb{E}\bigg[\bigg(Y_i(1) - Y_i(0)\bigg) \times \bigg(D_i(1) - D_i(0)\bigg)\bigg]\]
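
    Spelled out, the omitted algebra is just cancellation of the \(\mathbb{E}[Y_i(0)]\) terms and linearity of expectation:

    ```latex
    \begin{align*}
    \mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0]
    &= \mathbb{E}\big[Y_i(0) + (Y_i(1) - Y_i(0))D_i(1)\big] - \mathbb{E}\big[Y_i(0) + (Y_i(1) - Y_i(0))D_i(0)\big]\\
    &= \mathbb{E}\big[(Y_i(1) - Y_i(0))D_i(1)\big] - \mathbb{E}\big[(Y_i(1) - Y_i(0))D_i(0)\big]\\
    &= \mathbb{E}\big[(Y_i(1) - Y_i(0))(D_i(1) - D_i(0))\big]
    \end{align*}
    ```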

Interpreting the IV estimand

  • Conditioning on the principal strata again:

    \[\begin{align*} = \underbrace{\mathbb{E}\bigg[(Y_i(1) - Y_i(0)) \times (0) | (D_i(1) = D_i(0))\bigg] \times P(D_i(1) = D_i(0))}_{\text{always/never-takers}} + \\ \underbrace{\mathbb{E}\bigg[(Y_i(1) - Y_i(0)) \times (1) | (D_i(1) > D_i(0))\bigg] \times P(D_i(1) > D_i(0))}_{\text{compliers}} + \\ \underbrace{\mathbb{E}\bigg[(Y_i(1) - Y_i(0)) \times (-1) | (D_i(1) < D_i(0))\bigg] \times P(D_i(1) < D_i(0))}_{\text{defiers}}\end{align*}\]

  • Again, first term is zero because \(D_i(1) - D_i(0) = 0\), third is zero by “no defiers” and we have

    \[\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0] = \mathbb{E}\bigg[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)\bigg] \times P(D_i(1) > D_i(0))\]

  • The ITT is the product of a conditional average treatment effect and the proportion of compliers.

The LATE Theorem

  • The IV estimand, under our identification assumptions, is a Local Average Treatment Effect (LATE):

    \[\frac{\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0]}{\mathbb{E}[D_i | Z_i = 1] - \mathbb{E}[D_i | Z_i = 0]} = \mathbb{E}[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)]\]

  • The LATE is a conditional average treatment effect within the subpopulation of compliers

  • If treatment effects are constant, we can generalize this to the whole sample.

    • But if effects are heterogeneous, we are not necessarily getting a “representative” treatment effect.
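
  As a sanity check on the LATE theorem, here is a minimal simulation sketch (strata shares and effect sizes are made up for illustration) in which the Wald ratio recovers the complier-specific effect rather than the overall ATE:

```r
set.seed(813)
n <- 2e5
# Hypothetical principal strata: 20% always-takers, 40% never-takers, 40% compliers
stratum <- sample(c("always", "never", "complier"), n, replace = TRUE,
                  prob = c(0.2, 0.4, 0.4))
z <- rbinom(n, 1, 0.5)                                # randomized instrument
d <- ifelse(stratum == "always", 1,
            ifelse(stratum == "never", 0, z))         # compliers follow z
# Heterogeneous effects: compliers gain 2, everyone else would gain 0.5
tau_i <- ifelse(stratum == "complier", 2, 0.5)
y <- tau_i * d + rnorm(n)

wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(d[z == 1]) - mean(d[z == 0]))
wald   # close to 2 (the complier ATE), not mean(tau_i) = 1.1
```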

Better LATE than never?

  • How should we interpret the LATE?
    • It’s not necessarily the quantity we care about - we care about the effect of the treatment in the entire sample.
    • On the other hand, in many policy applications the group responsive to treatment is also the group we want to learn about.
  • External validity question
    • Compliers are those compelled to take treatment by our encouragement. Would estimates generalize to those who are less encourageable?
    • The LATE is design-specific. If we came up with a different instrument, that changes the population on which we’re estimating an effect!
  • What can we do?
    • We can actually get the distribution of any covariate for the compliers - can compare to the rest of the sample!
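
  For instance, the mean of any covariate among compliers is identified by a Wald-type ratio with \(X_i D_i\) in place of the outcome. A minimal sketch on simulated data (the variable names and distributions are hypothetical):

```r
set.seed(813)
n <- 1e5
stratum <- sample(c("always", "never", "complier"), n, replace = TRUE,
                  prob = c(0.2, 0.4, 0.4))
# In this made-up example, compliers are older on average (mean 50 vs. 40)
age <- ifelse(stratum == "complier", rnorm(n, 50, 10), rnorm(n, 40, 10))
z <- rbinom(n, 1, 0.5)
d <- ifelse(stratum == "always", 1, ifelse(stratum == "never", 0, z))

# E[X | complier] = (E[X*D | Z=1] - E[X*D | Z=0]) / (E[D | Z=1] - E[D | Z=0])
complier_mean <- function(x, d, z) {
  (mean((x * d)[z == 1]) - mean((x * d)[z == 0])) /
    (mean(d[z == 1]) - mean(d[z == 0]))
}
complier_mean(age, d, z)   # close to 50; compare to mean(age), about 44
```

  Comparing the complier profile to the full-sample means tells us how unusual the subpopulation driving the LATE is.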

Example: The effect of media on voting

  • Gerber, Karlan and Bergan (2009, AEJ:AE) estimate the effect of reading the Washington Post (or Washington Times) on political attitudes and voting behavior.
    • \(Z_i\): Random assignment to receive a free subscription to the Washington Post
    • \(D_i\): Actually subscribing to the Washington Post (as measured by a post-encouragement survey)
    • \(Y_i\): 2005 Turnout (measured in the survey)
  • Assumptions:
    • Assignment to get the free subscription offer is ignorable/exogenous
    • Getting the free subscription offer affects actual subscriptions (non-zero first stage)
    • No one would subscribe to the Post without the free offer yet decline to subscribe with it. (monotonicity/no defiers)
    • Assignment to get the free subscription offer doesn’t affect voting except through actually subscribing to the Post (exclusion restriction)

Example: The effect of media on voting

  • First, subset the data to WaPo or control observations that completed the follow-up survey
green <- read_dta("assets/data/publicdata.dta")
wapost <- green %>% filter(treatment != "TIMES" & !is.na(getpost) & !is.na(voted))

Example: The effect of media on voting

  • Is there a first-stage effect?
lm_robust(getpost ~ post, data=wapost)
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    0.203     0.0189   10.73 4.06e-25    0.166    0.240 760
post           0.341     0.0341    9.99 3.82e-22    0.274    0.408 760
  • About 34 percent of the sample is a “complier” - quite substantial!

Example: The effect of media on voting

  • Is there an ITT?
lm_robust(voted ~ post, data=wapost)
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  0.72627      0.021 34.6303 1.91e-158   0.6851   0.7674 760
post        -0.00135      0.033 -0.0409  9.67e-01  -0.0661   0.0634 760
  • ITT is essentially zero.

Example: The effect of media on voting

  • Compare with the naive OLS estimate
lm_robust(voted ~ getpost, data=wapost)
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    0.703     0.0204   34.45 2.11e-157 0.663119    0.743 760
getpost        0.066     0.0332    1.99  4.70e-02 0.000877    0.131 760
  • Post subscribers are 6pp more likely to vote in the 2005 VA gubernatorial election.
    • But is this causal? No!

Example: The effect of media on voting

  • Let’s estimate the LATE using the Wald estimator
(mean(wapost$voted[wapost$post == 1]) - mean(wapost$voted[wapost$post == 0])) /
  (mean(wapost$getpost[wapost$post == 1]) - mean(wapost$getpost[wapost$post == 0]))
[1] -0.00396
  • Equivalent to a ratio of regression coefficients
coef(lm_robust(voted ~ post, data=wapost))[2] /
  coef(lm_robust(getpost ~ post, data=wapost))[2]
    post 
-0.00396 

Example: The effect of media on voting

  • We’ll talk about inference later on, but it’s important to note that the SE for the LATE can be much larger than the SE for the ITT
iv_robust(voted ~ getpost | post, data=wapost)
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  0.72707     0.0368 19.7765 1.51e-70    0.655    0.799 760
getpost     -0.00396     0.0968 -0.0409 9.67e-01   -0.194    0.186 760

Review: The IV Ratio

IV in observational studies

  • Most applications of IV are not treatment non-compliance.
  • But all follow the same underlying logic.
    • Treatment of interest is not randomized…but there exists a real or “natural” experiment that is.
    • And this natural experiment affects the outcome only through its effect on the treatment of interest.
  • Examples:
    • Angrist (1990) - Vietnam draft lottery number as an instrument for the effect of military service on income.
    • Angrist and Krueger (1991) - Birth quarter as an instrument for education’s effect on income.
    • Acemoglu et al. (2001) - European settler mortality as an instrument for the effect of institutional quality on GDP per capita.

IV in observational studies

  • Challenges
    • Exogeneity/ignorability of the instrument isn’t guaranteed - need to make the usual “selection-on-observables” arguments from theory
    • IV double-bind
      • If the instrument has a huge effect on your treatment…
      • …it might also affect a lot of other stuff
      • …all of those are potential exclusion restriction violations!

Discussion: The rainfall instrument

  • Miguel, Satyanath, and Sergenti (2004, JPE) look at the effect of economic growth on civil conflict in 41 African countries.
    • Growth and conflict are confounded (e.g. by political institutions).
    • Instrument for GDP growth using the annual change in rainfall
      • For heavily agrarian countries, rainfall fluctuations determine crop yields, which are a large component of GDP.
    • Reduced form - Negative rainfall shocks increase civil conflict.
      • First stage - Negative rainfall shocks reduce GDP growth
  • Should we conclude that growth shocks increase conflict?
    • Exogeneity? Is rainfall as-good-as randomly assigned?
    • Monotonicity? Do positive rainfall shocks strictly boost GDP per capita?
    • Exclusion restriction? Is rainfall’s effect transmitted only through the mechanism the authors define?

Discussion: The rainfall instrument

Mellon (2025) “Rain, Rain, Go Away: 194 Potential Exclusion-Restriction Violations for Studies Using Weather as an Instrumental Variable”

Estimation and inference for IV

The IV ratio estimator

  • We’ve talked about what the IV estimand means…

    • …now let’s talk about estimation
  • Remember, the IV ratio estimand (with a single instrument and a single treatment)

    \[\tau_{\text{IV}} = \frac{Cov(Y_i, Z_i)}{Cov(D_i, Z_i)}\]

  • What is a consistent estimator of this? Let’s use the plug-in principle

    \[\hat{\tau}_{\text{IV}} = \frac{\widehat{Cov}(Y_i, Z_i)}{\widehat{Cov}(D_i, Z_i)}\]

  • Numerator: “Reduced form”/ITT

  • Denominator: “First stage”

The IV ratio estimator

  • If the sample covariances are consistent for the population covariances (e.g. under i.i.d. sampling), then by continuous mapping theorem

    \[\hat{\tau}_{\text{IV}} \overset{p}{\to} \tau_{\text{IV}}\]

  • And, by our usual assumptions from regression, we also have asymptotic normality

    \[\sqrt{n}(\hat{\tau}_{\text{IV}} - \tau_{\text{IV}}) \overset{d}{\to} \mathcal{N}(0, V)\]

  • What’s the variance \(V\)?

    • This is easiest to explain by writing the IV estimator in matrix form.
    • We can use the same tools from our analysis of linear regression!

IV ratio in matrix form

  • Let \(\mathbf{Z}\) be our matrix of instruments and covariates

  • Let \(\mathbf{X}\) be our matrix of treatments and covariates

    • “Covariates” here refers to variables that are neither treatments nor instruments
    • Possibly variables that are needed to satisfy ignorability for the instrument.
  • For identification, we need as many instruments as we have treatments

    • Most typical case in political science: single-instrument, single-treatment case
    • More “structural equation”-oriented disciplines also consider multiple-instrument/multiple-treatment cases
    • Sometimes we might have more instruments than treatments (“overidentification”)
  • We can write the just-identified (no. instruments = no. treatments) IV estimator as:

    \[\hat{\tau}_{\text{IV}} = (\mathbf{Z}^{\prime}\mathbf{X})^{-1}(\mathbf{Z}^{\prime} Y)\]

IV ratio in matrix form

  • We can recover the asymptotic variance using the delta method!

    \[Var(\hat{\tau}_{\text{IV}}) = (\mathbf{Z}^{\prime}\mathbf{X})^{-1}(\mathbf{Z}^{\prime} \Sigma \mathbf{Z})(\mathbf{X}^{\prime}\mathbf{Z})^{-1}\]

  • \(\Sigma\) is the variance-covariance matrix of the errors

    • Under i.i.d. sampling, a heteroskedasticity-robust plug-in estimator for \(\hat{\Sigma}\) is a diagonal matrix with the squared residuals.
    • Other “robust” SEs w/ different specifications for \(\Sigma\) (e.g. clustering, spatial correlation, etc…) - we’ll talk about these later!

IV ratio inference

  • To illustrate with our Washington Post example
iv_est <- iv_robust(voted ~ getpost | post, data=wapost, se_type = "HC0")
iv_est
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  0.72707     0.0367  19.800 1.11e-70    0.655    0.799 760
getpost     -0.00396     0.0967  -0.041 9.67e-01   -0.194    0.186 760
  • Let’s calculate the point estimate by hand!
Z <- model.matrix(~post, data=wapost) # Add the intercept
D <- model.matrix(~getpost, data=wapost) # Add the intercept
Y <- wapost$voted
point <- solve(t(Z)%*%D)%*%(t(Z)%*%Y)
point
                [,1]
(Intercept)  0.72707
getpost     -0.00396

IV ratio inference

  • To illustrate with our Washington Post example
iv_robust(voted ~ getpost | post, data=wapost, se_type = "HC0")
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  0.72707     0.0367  19.800 1.11e-70    0.655    0.799 760
getpost     -0.00396     0.0967  -0.041 9.67e-01   -0.194    0.186 760
  • Let’s do the variance!
u <- Y - iv_est$fitted.values
vcov <- solve(t(Z)%*%D)%*%(t(Z)%*%diag(u^2)%*%Z)%*%t(solve(t(Z)%*%D))
sqrt(diag(vcov))
[1] 0.0367 0.0967

IV as “two-stage” least squares

  • Another way of interpreting \(\hat{\tau}_{\text{IV}}\) is in terms of two ordinary least squares regressions

    • Accommodates the “overidentified” case (more instruments than treatments)
  • Intuition - We want to regress \(Y\) on the part of the treatment that is explained entirely by the instrument (the “exogenous” variation)

  • First stage - Project \(\mathbf{X}\) into the space of the instrument \(\mathbf{Z}\)

    \[\widehat{\mathbf{X}} = \underbrace{\mathbf{Z}(\mathbf{Z}^\prime\mathbf{Z})^{-1}\mathbf{Z}^\prime}_{P_{\mathbf{Z}}}\mathbf{X}\]

  • Second stage - Regress \(Y\) on the “fitted values” \(\widehat{\mathbf{X}}\)

    \[\hat{\beta}_{\text{2SLS}} = (\widehat{\mathbf{X}}^{\prime}\mathbf{X})^{-1}(\widehat{\mathbf{X}}^{\prime}Y)\]

  • Putting it all together, gives the 2SLS estimator!

    \[\hat{\tau}_{\text{2SLS}} = (\mathbf{X}^{\prime}\mathbf{Z}(\mathbf{Z}^{\prime}\mathbf{Z})^{-1}\mathbf{Z}^\prime\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{Z}(\mathbf{Z}^{\prime}\mathbf{Z})^{-1}\mathbf{Z}^\prime Y)\]
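
  A minimal sketch on simulated data (the DGP is made up for illustration) showing that the one-shot 2SLS formula matches the two-step procedure of regressing \(Y\) on first-stage fitted values:

```r
set.seed(813)
n <- 20000
u <- rnorm(n)                      # unobserved confounder
z <- rbinom(n, 1, 0.5)             # exogenous instrument
d <- as.numeric(0.8 * z + u + rnorm(n) > 0.5)
y <- 2 * d + u + rnorm(n)          # true effect of d is 2

X <- cbind(1, d)                   # treatments (plus intercept)
Z <- cbind(1, z)                   # instruments (plus intercept)

# One-shot 2SLS: project X onto the column space of Z, regress y on the projection
Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))
beta_2sls <- solve(crossprod(Xhat, X), crossprod(Xhat, y))

# Two-step version with lm(): identical point estimate
dhat <- fitted(lm(d ~ z))
beta_two_step <- coef(lm(y ~ dhat))

c(beta_2sls[2], beta_two_step[2])  # same number twice, near the true effect 2
```

  Equality holds because \(P_{\mathbf{Z}}\) is idempotent, so \(\widehat{\mathbf{X}}^{\prime}\mathbf{X} = \widehat{\mathbf{X}}^{\prime}\widehat{\mathbf{X}}\). Note the SEs from the manual second-stage `lm()` would be wrong - they use the wrong residuals.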

Illustrating 2SLS

  • Let’s include covariates in our Gerber, Karlan and Bergan (2009) example
    • \(Z_i\): Random assignment to receive a free subscription to the Washington Post
    • \(D_i\): Actually subscribing to the Washington Post (as measured by a post-encouragement survey)
    • \(Y_i\): 2005 Turnout (measured in the survey)
    • \(X_i\): Gender, Age
  • Load and subset out missing covariates
green <- read_dta("assets/data/publicdata.dta")
wapost <- green %>% filter(treatment != "TIMES" & !is.na(getpost) & !is.na(voted) & !is.na(Bfemale) & !is.na(reportedage))

Illustrating 2SLS

  • Our first stage regresses subscription on assignment + covariates
first_stage <- lm_robust(getpost ~ post + Bfemale + reportedage , data= wapost)
summary(first_stage)

Call:
lm_robust(formula = getpost ~ post + Bfemale + reportedage, data = wapost)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  CI Lower CI Upper  DF
(Intercept)  0.12097    0.06406   1.889 5.94e-02 -0.004786  0.24673 729
post         0.35233    0.03482  10.118 1.31e-22  0.283965  0.42069 729
Bfemale     -0.00435    0.03505  -0.124 9.01e-01 -0.073156  0.06445 729
reportedage  0.00170    0.00125   1.360 1.74e-01 -0.000756  0.00416 729

Multiple R-squared:  0.134 ,    Adjusted R-squared:  0.13 
F-statistic: 35.4 on 3 and 729 DF,  p-value: <2e-16

Illustrating 2SLS

  • Let’s actually run 2SLS - I like two routines: iv_robust in estimatr (does 2SLS with robust SEs) and ivmodel in ivmodel (does robust 2SLS and weak-instrument robust tests + other diagnostics)
wapo_2sls <- iv_robust(voted ~ getpost  + Bfemale + reportedage | post + Bfemale + reportedage, data= wapost)
summary(wapo_2sls)

Call:
iv_robust(formula = voted ~ getpost + Bfemale + reportedage | 
    post + Bfemale + reportedage, data = wapost)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  0.22562    0.07334  3.0766 2.17e-03  0.08165   0.3696 729
getpost      0.00428    0.09086  0.0471 9.62e-01 -0.17410   0.1827 729
Bfemale     -0.03495    0.03354 -1.0423 2.98e-01 -0.10079   0.0309 729
reportedage  0.01040    0.00127  8.1879 1.19e-15  0.00791   0.0129 729

Multiple R-squared:  0.093 ,    Adjusted R-squared:  0.0893 
F-statistic: 23.3 on 3 and 729 DF,  p-value: 2.15e-14

Illustrating 2SLS

wapo_2sls2 <- ivmodelFormula(voted ~ getpost  + Bfemale + reportedage | post + Bfemale + reportedage, data= wapost, heteroSE=T)
summary(wapo_2sls2)

Call:
ivmodel(Y = Y, D = D, Z = Z, X = X, intercept = intercept, beta0 = beta0, 
    alpha = alpha, k = k, manyweakSE = manyweakSE, heteroSE = heteroSE, 
    clusterID = clusterID, deltarange = deltarange, na.action = na.action)
sample size: 733
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

First Stage Regression Result:

F=111, df1=1, df2=729, p-value is <2e-16
R-squared=0.132,   Adjusted R-squared=0.131
Residual standard error: 0.444 on 730 degrees of freedom
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Coefficients of k-Class Estimators:

             k Estimate Std. Error t value Pr(>|t|)
OLS    0.00000  0.04708    0.03247    1.45     0.15
Fuller 0.99863  0.00472    0.08979    0.05     0.96
TSLS   1.00000  0.00428    0.09060    0.05     0.96
LIML   1.00000  0.00428    0.09060    0.05     0.96
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Alternative tests for the treatment effect under H_0: beta=0.

Anderson-Rubin test (under F distribution):
F=0.00223, df1=1, df2=729, p-value is 1
95 percent confidence interval:
 [-0.178674420106811, 0.183689569448123]

Conditional Likelihood Ratio test (under Normal approximation):
Test Stat=0.00223, p-value is 1
95 percent confidence interval:
 [-0.17867442296152, 0.183689572193089]

The weak instrument problem

  • The general probability limit of our ratio estimator is

    \[\hat{\tau}_{\text{IV}} = \frac{\widehat{Cov}(Y_i, Z_i)}{\widehat{Cov}(D_i, Z_i)} \overset{p}{\to} \tau + \frac{Cov(U_i, Z_i)}{Cov(D_i, Z_i)}\]

  • Under exogeneity \(Cov(Z_i, U_i)\) is zero.

  • However, when there are small violations of exogeneity, a weak instrument will amplify them.

  • More generally, with a weak instrument, our t-ratio hypothesis tests assuming asymptotic normality will have incorrect type-1 error rates.

    • Why? Distributions of ratios are poorly behaved.

The weak instrument problem

  • Let’s use a simulation to see how bad the bias can be in IV versus just a simple OLS regression of outcome on treatment under unobserved confounding.

  • Let \(U_i \sim \mathcal{N}(0, 1)\) be an unobserved confounder. \(Z_i \sim \text{Bern}(.5)\) is an exogenous instrument.

  • The probability of treatment is modeled via a logit

    \[\text{log}\bigg(\frac{P(D_i = 1 | Z_i, U_i)}{1-P(D_i = 1 | Z_i, U_i)}\bigg) = \gamma Z_i + U_i\]

    \(\gamma\) here captures the relationship between the exogenous instrument \(Z_i\) and the treatment

  • The outcome is a function of \(U\) and a mean zero error term \(\epsilon_i\) only, so the true treatment effect is \(0\)

    \[Y_i = U_i + \epsilon_i\]
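
  The slides report summaries of vectors firststage, firststageF, naive, and IV; here is a minimal simulation sketch consistent with the DGP above (the replication count is my choice):

```r
set.seed(813)
n_sims <- 500; N <- 1000; gamma <- 3   # set gamma <- .2 for the weak-instrument case
firststage <- firststageF <- naive <- IV <- rep(NA_real_, n_sims)
for (s in 1:n_sims) {
  U <- rnorm(N)                        # unobserved confounder
  Z <- rbinom(N, 1, 0.5)               # exogenous instrument
  D <- rbinom(N, 1, plogis(gamma * Z + U))
  Y <- U + rnorm(N)                    # true treatment effect is 0
  firststage[s]  <- mean(D[Z == 1]) - mean(D[Z == 0])
  firststageF[s] <- summary(lm(D ~ Z))$fstatistic[1]
  naive[s] <- coef(lm(Y ~ D))[2]       # confounded OLS of Y on D
  IV[s]    <- (mean(Y[Z == 1]) - mean(Y[Z == 0])) / firststage[s]  # Wald ratio
}
c(mean(firststage), mean(naive), mean(IV))
```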

The weak instrument problem

  • Let’s see how the Wald estimator performs when we have a pretty large effect of \(Z_i\) on \(D_i\): \(\gamma = 3\) and \(N = 1000\)
## First stage effect
mean(firststage)
[1] 0.43
## F-statistic from the first stage
mean(firststageF)
[1] 295
## Bias of the naive OLS Y ~ X
mean(naive)
[1] 0.65
## Bias of IV
mean(IV)
[1] -0.0117

The weak instrument problem

  • Sampling distribution of the naive OLS estimator

The weak instrument problem

  • Sampling distribution of the IV estimator

The weak instrument problem

  • Now, what happens when our instrument is weak: \(\gamma = .2\) and \(N = 1000\)
## First stage effect
mean(firststage)
[1] 0.0399
## F-statistic from the first stage
mean(firststageF)
[1] 2.58
## Bias of the naive OLS Y ~ X
mean(naive)
[1] 0.826
## Bias of IV
mean(IV)
[1] 2.68

The weak instrument problem

  • Sampling distribution of the naive OLS estimator

The weak instrument problem

  • Sampling distribution of the IV estimator

The weak instrument problem

  • When is an instrument too weak?
  • Classic result: Stock and Yogo (2005) use first stage F-statistic thresholds
    • \(\leadsto\) heuristic of first-stage \(F > 10\)
    • Problem! These are benchmarks under homoskedasticity
  • Montiel Olea and Pflueger (2013)
    • Robust F-statistics
    • A threshold comparable to the Stock and Yogo bias benchmark is \(F > 23.1\)
  • Lee, McCrary, Moreira, and Porter (2020)
    • The conventional thresholds only guarantee that the bias is “small” (approximately 10%)
    • For the first-stage F-statistic to screen for a valid \(p < .05\) t-test, we actually need \(F > 104.7\)
  • Angrist and Pischke (2009) - with “just-identified” IV bias is usually overwhelmed by the large standard errors.

Permutation test

  • When the assignment process of \(Z_i\) is known, we can construct hypothesis tests using permutation inference assuming a constant treatment effect \(\tau\) (Imbens and Rosenbaum, 2005).

    • With a single, de-meaned instrument \(\tilde{Z_i} = Z_i - \bar{Z}\), we can construct a test statistic based on the sample covariance between \(Z_i\) and \(Y_i\) with the effect removed:

    \[T(\tau) = \frac{1}{N} \sum_{i=1}^N \tilde{Z_i} \times (Y_i - \tau D_i)\]

  • If the instrument is valid, under the null hypothesis that \(\tau = \tau_0\), we can get the randomization distribution of the test statistic by simply re-randomizing the instrument according to the known assignment process.

    • Construct confidence intervals by “inverting the test” - what values of \(\tau_0\) does the test fail to reject?
  • Alternative test statistics based on ranks of \(Y_i - \tau D_i\) (possibly within strata) can also be used.
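
  A minimal sketch of this procedure on simulated data (the DGP and the constant effect \(\tau = 1.5\) are made up; assuming a completely randomized design, re-randomizing amounts to permuting \(Z\)):

```r
set.seed(813)
N <- 2000
U <- rnorm(N)
Z <- rbinom(N, 1, 0.5)                     # the known assignment
D <- rbinom(N, 1, plogis(2 * Z + U))
Y <- 1.5 * D + U + rnorm(N)                # constant effect tau = 1.5

# T(tau): sample covariance of de-meaned Z with the adjusted outcome
T_stat <- function(Y, D, Z, tau) mean((Z - mean(Z)) * (Y - tau * D))

# Permutation p-value for H0: tau = tau0 (re-randomize by permuting Z)
perm_pvalue <- function(Y, D, Z, tau0, n_perm = 1000) {
  obs <- T_stat(Y, D, Z, tau0)
  perms <- replicate(n_perm, T_stat(Y, D, sample(Z), tau0))
  mean(abs(perms) >= abs(obs))
}
perm_pvalue(Y, D, Z, tau0 = 0)     # typically small: reject a zero effect
perm_pvalue(Y, D, Z, tau0 = 1.5)   # typically large: the true value survives
```

  Sweeping tau0 over a grid and collecting the values where the test fails to reject gives the test-inversion confidence interval.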

Anderson-Rubin Test

  • Even when the assignment process is not known, the IV assumptions allow us to construct a test statistic that does not depend on the first stage.

    • This is the Anderson-Rubin (1949) approach
    • Andrews, Stock and Sun (2019) provide a good explanation especially for the “just-identified” case
  • Let \(\hat{\delta}_{\text{ITT}}\) be the reduced form or intent-to-treat estimate.

  • Under the IV assumptions, the reduced form is the product of first stage \(\gamma\) and the treatment effect \(\tau\)

    \[\delta_{\text{ITT}} = \gamma \times \tau\]

  • Assuming a particular null \(H_0: \tau = \tau_0\) implies that

    \[\delta_{\text{ITT}} - \gamma \times \tau_0 = 0\]

Anderson-Rubin Test

  • We can construct a test statistic from the difference between the estimated ITT and the estimated first stage scaled by the null value; this difference is asymptotically normal.

    \[g(\tau_0) = \hat{\delta}_{\text{ITT}} - \hat{\gamma} \tau_0 \sim \mathcal{N}(0, \Omega(\tau_0))\]

  • The Anderson-Rubin (1949) test statistic is:

    \[AR(\tau) = g(\tau)^\prime \Omega(\tau)^{-1}g(\tau)\]

    Under the null \(H_0: \tau = \tau_0\), this has a chi-squared distribution which does not directly depend on the value of the first stage \(\gamma\)!

  • Intuitively: the statistical properties of a difference of two normal random variables are well-known and easy. The statistical properties of ratios are much more complicated!

    • Again, invert the test to get a confidence interval
    • Can get infinite confidence bounds with a weak instrument - the test never rejects for any value of \(\tau_0\)
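
  In the just-identified binary-instrument case, the AR test reduces to a two-sample test on the adjusted outcome \(Y_i - \tau_0 D_i\). A minimal sketch on simulated data (the DGP is made up; ar_test is a hypothetical helper, not a function from the ivmodel package):

```r
set.seed(813)
N <- 1000
U <- rnorm(N)
Z <- rbinom(N, 1, 0.5)
D <- rbinom(N, 1, plogis(2 * Z + U))
Y <- 1.5 * D + U + rnorm(N)     # true effect is 1.5

# AR test of H0: tau = tau0 with a single binary instrument:
# g(tau0) is the Z=1 vs Z=0 difference in means of Y - tau0 * D,
# standardized by the two-sample variance and squared (chi-squared, 1 df)
ar_test <- function(Y, D, Z, tau0) {
  W <- Y - tau0 * D
  g <- mean(W[Z == 1]) - mean(W[Z == 0])
  v <- var(W[Z == 1]) / sum(Z == 1) + var(W[Z == 0]) / sum(Z == 0)
  stat <- g^2 / v
  c(stat = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}
ar_test(Y, D, Z, tau0 = 0)      # typically rejects: a zero effect is implausible
ar_test(Y, D, Z, tau0 = 1.5)    # the true value is typically retained
```

  Nothing here estimates the first stage, which is why the test keeps its size even when the instrument is weak.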

Example: Strong instrument

wapo_iv <- ivmodelFormula(voted ~ getpost  | post , data= wapost, heteroSE=T)
print(AR.test(wapo_iv))
$Fstat
[1] 0.0171

$df
[1]   1 731

$p.value
[1] 0.896

$ci.info
[1] "[-0.2058034696935, 0.175035583124674]"

$ci
      lower upper
[1,] -0.206 0.175

Example: Weak instrument

weak_iv_data <- data.frame(Y = Y, D= D, Z=Z)
weak_iv <- ivmodelFormula(Y ~ D | Z , data= weak_iv_data, heteroSE=T)
print(AR.test(weak_iv))
$Fstat
[1] 0.215

$df
[1]   1 998

$p.value
[1] 0.643

$ci.info
[1] "Whole Real Line"

$ci
     lower upper
[1,]  -Inf   Inf

Conclusion

  • Instrumental variables lets us leverage alternative sources of randomness to learn about an otherwise confounded causal relationship.
  • An instrument:
    • Affects treatment
    • Doesn’t affect the outcome except through treatment
    • Is ignorable w.r.t the outcome.
  • LATE theorem: The IV estimand is the ATE among those who would take treatment due to the instrument.
    • With continuous treatment/instrument - a weighted average of LATEs (Angrist and Imbens, 1995)
    • With covariates - a weighted average of covariate-specific LATEs
    • But be careful with this interpretation when the model is not fully saturated (Słoczyński, 2022)
  • Statistical inference is tricky
    • Beware weak instruments - typical large-sample asymptotics do poorly when instruments are irrelevant.
    • Consider weak-instrument robust tests (Anderson-Rubin)

Next week

  • Another method of dealing with unobserved confounding - differences-in-differences
  • What if we can observe a period when both treated and control units are unexposed?
    • Then any differences are attributable to bias
  • Assume the bias is time-invariant
    • Then we can just subtract it out!
  • “Parallel trends” - Observed trends in the control group are equal to counterfactual trends in the treated group!