PS 813 - Causal Inference
March 8, 2026
We can first just change the question - instead of the effect of treatment, we can make our estimand the effect of being assigned to treatment.
Our estimator for the ITT is just the difference in means between the \(Z_i = 1\) and \(Z_i = 0\) arms
\[\hat{\tau}_{\text{ITT}} = \hat{E}[Y_i | Z_i = 1] - \hat{E}[Y_i | Z_i = 0]\]
Identified under randomization of \(Z_i\) even if \(D_i\) is not randomized.
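As a quick numeric illustration, here is a minimal Python sketch of the difference-in-means ITT estimator. The data are simulated (a hypothetical encouragement design), not from the lecture's application:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encouragement design: Z is the randomized assignment;
# Y is simulated so that the true ITT is 0.1
n = 10_000
Z = rng.integers(0, 2, size=n)
Y = 0.2 + 0.1 * Z + rng.normal(0, 1, size=n)

# ITT estimator: difference in means between the Z = 1 and Z = 0 arms
itt_hat = Y[Z == 1].mean() - Y[Z == 0].mean()
print(round(itt_hat, 2))  # close to the true ITT of 0.1
```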
Start by writing down potential outcomes for \(D_i\) along with joint potential outcomes of \(Y_i\) in terms of \(Z_i\) and \(D_i\)
\[D_i(z) = D_i \text{ if } Z_i = z\] \[Y_i(d, z) = Y_i \text{ if } D_i = d, Z_i = z\]
Observed treatment \(D_i\) is a function of treatment assignment (\(Z_i\)) - it’s a post-treatment quantity (and so has potential outcomes).
\(Z_i\) is independent of both sets of potential outcomes (potential outcomes for the treatment and potential outcomes for the outcome).
\[\{D_i(1), D_i(0)\} {\perp \! \! \! \perp} Z_i\]
\[\{Y_i(d, z) \forall d, z\} {\perp \! \! \! \perp} Z_i\]
We can weaken this to conditional ignorability (where \(Z_i\) is randomized conditional on \(X_i\)), which is common in observational settings.
Sufficient to identify the intent-to-treat (ITT) effect
\(Z_i\) only affects \(Y_i\) by way of its effect on \(D_i\).
In other words, if \(D_i\) were set at some level \(d\), the potential outcome \(Y_i(d, z)\) does not depend on \(z\).

\[Y_i(d, z) = Y_i(d, z^{\prime}) \text{ for any } z \neq z^{\prime}\]
Not a testable assumption! - we have to justify this with substantive knowledge.
“Surprise” factor - If I told you \(Z\) was associated with \(Y\), would you think “that’s odd”?
\(Z_i\) has an effect on \(D_i\)
\[\mathbb{E}[D_i(1) - D_i(0)] \neq 0\]
Seems trivial, but we need this to make the estimator work.
Magnitude matters for estimator performance - a “weak” first-stage \(\leadsto\) biased IV ratios in finite samples.
\(Z_i\)’s effect on \(D_i\) only goes in one direction at the individual level
\[D_i(1) - D_i(0) \ge 0\]
If it goes the other way for all units, we can always recode the instrument (or the treatment) to make this hold
Not a testable assumption
| Stratum | \(D_i(1)\) | \(D_i(0)\) |
|---|---|---|
| “Always-takers” | \(1\) | \(1\) |
| “Never-takers” | \(0\) | \(0\) |
| “Compliers” | \(1\) | \(0\) |
| “Defiers” | \(0\) | \(1\) |
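Under "no defiers," the observed \((Z_i, D_i)\) cell only partially identifies a unit's stratum. A small sketch (the helper function is hypothetical, written for illustration) makes the mapping explicit:

```python
def possible_strata(z, d):
    """Principal strata consistent with an observed (Z, D) cell,
    assuming monotonicity (no defiers)."""
    if z == 1 and d == 0:
        return {"never-taker"}          # refused despite encouragement
    if z == 0 and d == 1:
        return {"always-taker"}         # took treatment without encouragement
    if z == 1 and d == 1:
        return {"complier", "always-taker"}
    return {"complier", "never-taker"}  # z == 0, d == 0

print(possible_strata(1, 0))  # {'never-taker'}
```

Only two of the four observed cells pin down the stratum exactly; the other two are mixtures, which is why the strata proportions must be identified indirectly.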
The classic IV estimand with one instrument is a ratio of population covariances.
\[\tau_{\text{IV}} = \frac{Cov(Y, Z)}{Cov(D, Z)}\]
With a binary instrument, this is sometimes called the “Wald” ratio - a ratio of differences in means
\[\tau_{\text{IV}} = \frac{\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0]}{\mathbb{E}[D_i | Z_i = 1] - \mathbb{E}[D_i | Z_i = 0]}\]
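For a binary instrument, the covariance ratio and the Wald ratio coincide exactly in-sample. A quick check on simulated data (the DGP here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
Z = rng.integers(0, 2, size=n)
D = (rng.random(n) < 0.2 + 0.5 * Z).astype(float)   # first stage
Y = 1.0 + 2.0 * D + rng.normal(0, 1, size=n)

# Covariance-ratio form
cov_form = np.cov(Y, Z)[0, 1] / np.cov(D, Z)[0, 1]

# Wald form: ratio of differences in means
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())

# The two forms agree up to floating point for a binary instrument
print(abs(cov_form - wald) < 1e-8)  # True
```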
What does the Wald estimand correspond to in terms of causal effects?
Under our identification assumptions:
Let’s decompose the denominator first - under randomization:
\[\begin{align*} \mathbb{E}[D_i | Z_i = 1] - \mathbb{E}[D_i | Z_i = 0] &= \mathbb{E}[D_i(1) | Z_i = 1] - \mathbb{E}[D_i(0) | Z_i = 0]\\ &= \mathbb{E}[D_i(1)] - \mathbb{E}[D_i(0)]\\ &= \mathbb{E}[D_i(1) - D_i(0)] \end{align*}\]
With a binary treatment and a binary instrument, we can use the law of total expectation to decompose by principal stratum
\[\begin{align*}\mathbb{E}[D_i(1) - D_i(0)] = \underbrace{\mathbb{E}[D_i(1) - D_i(0) | D_i(1) = D_i(0)] \times P(D_i(1) = D_i(0))}_{\text{(always/never-takers)}} + \\ \underbrace{\mathbb{E}[D_i(1) - D_i(0) | D_i(1) > D_i(0)] \times P(D_i(1) > D_i(0))}_{\text{(compliers)}} + \\ \underbrace{\mathbb{E}[D_i(1) - D_i(0) | D_i(1) < D_i(0)] \times P(D_i(1) < D_i(0))}_{\text{(defiers)}}\end{align*}\]
The first term is \(0\), since \(D_i(1) - D_i(0) = 0\) for always-takers and never-takers
And by no defiers, the last term is \(0\) since \(P(D_i(1) < D_i(0)) = 0\)
\[\mathbb{E}[D_i(1) - D_i(0)] = P(D_i(1) > D_i(0))\]
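A quick simulation (strata drawn directly, purely for illustration) confirms that with no defiers the first stage equals the complier share:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Draw principal strata directly; defiers are excluded by construction
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.5, 0.3])
D1 = (types != "never").astype(int)    # D_i(1): everyone except never-takers
D0 = (types == "always").astype(int)   # D_i(0): only always-takers

# E[D(1) - D(0)] equals the proportion of compliers exactly
print((D1 - D0).mean(), (types == "complier").mean())
```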
Next, the numerator (the ITT). Under the exclusion restriction and randomization:
\[\mathbb{E}[Y_i | Z_i = 1] = \mathbb{E}\bigg[Y_i(0) + \bigg(Y_i(1) - Y_i(0)\bigg)D_i(1)\bigg]\] \[\mathbb{E}[Y_i | Z_i = 0] = \mathbb{E}\bigg[Y_i(0) + \bigg(Y_i(1) - Y_i(0)\bigg)D_i(0)\bigg]\]
The difference (with some algebra) is
\[\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0] = \mathbb{E}\bigg[\bigg(Y_i(1) - Y_i(0)\bigg) \times \bigg(D_i(1) - D_i(0)\bigg)\bigg]\]
Conditioning on the principal strata again:
\[\begin{align*}= \underbrace{\mathbb{E}\bigg[(Y_i(1) - Y_i(0)) \times (0) | (D_i(1) = D_i(0))\bigg] \times P(D_i(1) = D_i(0))}_{\text{always/never-takers}} + \\ \underbrace{\mathbb{E}\bigg[(Y_i(1) - Y_i(0)) \times (1) | (D_i(1) > D_i(0))\bigg] \times P(D_i(1) > D_i(0))}_{\text{compliers}} + \\ \underbrace{\mathbb{E}\bigg[(Y_i(1) - Y_i(0)) \times (-1) | (D_i(1) < D_i(0))\bigg] \times P(D_i(1) < D_i(0))}_{\text{defiers}}\end{align*}\]
Again, first term is zero because \(D_i(1) - D_i(0) = 0\), third is zero by “no defiers” and we have
\[\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0] = \mathbb{E}\bigg[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)\bigg] \times P(D_i(1) > D_i(0))\]
The ITT is the product of a conditional average treatment effect and the proportion of compliers.
The IV estimand, under our identification assumptions, is a Local Average Treatment Effect (LATE):
\[\frac{\mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0]}{\mathbb{E}[D_i | Z_i = 1] - \mathbb{E}[D_i | Z_i = 0]} = \mathbb{E}[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)]\]
The LATE is a conditional average treatment effect within the subpopulation of compliers
If treatment effects are constant, we can generalize this to the whole sample.
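The derivation above can be checked by simulation: with heterogeneous effects (stratum-specific effects chosen arbitrarily for illustration), the Wald ratio recovers the complier average effect, not the ATE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Strata with different treatment effects: the LATE targets compliers only
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.5, 0.3])
tau = np.where(types == "complier", 2.0, np.where(types == "always", 5.0, -1.0))

Z = rng.integers(0, 2, size=n)
D1 = (types != "never").astype(float)   # D_i(1)
D0 = (types == "always").astype(float)  # D_i(0)
D = np.where(Z == 1, D1, D0)
Y = tau * D + rng.normal(0, 1, size=n)

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
# Complier effect is 2.0 by construction; the ATE is 1.1
print(round(wald, 2), round(tau.mean(), 2))
```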
First stage - effect of the encouragement (post) on treatment take-up (getpost):

```
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    0.203     0.0189   10.73 4.06e-25    0.166    0.240 760
post           0.341     0.0341    9.99 3.82e-22    0.274    0.408 760
```

Reduced form (ITT) - effect of the encouragement on turnout (voted):

```
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  0.72627      0.021 34.6303 1.91e-158   0.6851   0.7674 760
post        -0.00135      0.033 -0.0409  9.67e-01  -0.0661   0.0634 760
```

Naive OLS - turnout regressed directly on treatment take-up:

```
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    0.703     0.0204   34.45 2.11e-157 0.663119    0.743 760
getpost        0.066     0.0332    1.99  4.70e-02 0.000877    0.131 760
```

Wald ratio (ITT divided by first stage):

```
[1] -0.00396
```
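The printed ratio is just the reduced-form (ITT) coefficient divided by the first-stage coefficient, as the Wald formula says. Plugging in the two estimates shown above:

```python
# Wald ratio from the output above: reduced-form coefficient on the
# encouragement (-0.00135) divided by the first-stage coefficient (0.341)
itt = -0.00135
first_stage = 0.341

print(round(itt / first_stage, 5))  # -0.00396
```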
Mellon (2025) “Rain, Rain, Go Away: 194 Potential Exclusion-Restriction Violations for Studies Using Weather as an Instrumental Variable”
We’ve talked about what the IV estimand means…
Remember, the IV ratio estimand (with a single instrument and a single treatment)
\[\tau_{\text{IV}} = \frac{Cov(Y_i, Z_i)}{Cov(D_i, Z_i)}\]
What is a consistent estimator of this? Let’s use the plug-in principle
\[\hat{\tau}_{\text{IV}} = \frac{\widehat{Cov}(Y_i, Z_i)}{\widehat{Cov}(D_i, Z_i)}\]
Numerator: “Reduced form”/ITT
Denominator: “First stage”
If the sample covariances are consistent for the population covariances (e.g. under i.i.d. sampling), then by continuous mapping theorem
\[\hat{\tau}_{\text{IV}} \overset{p}{\to} \tau_{\text{IV}}\]
And, by our usual assumptions from regression, we also have asymptotic normality
\[\sqrt{n}(\hat{\tau}_{\text{IV}} - \tau_{\text{IV}}) \overset{d}{\to} \mathcal{N}(0, V)\]
What’s the variance \(V\)?
Let \(\mathbf{Z}\) be our matrix of instruments and covariates
Let \(\mathbf{X}\) be our matrix of treatments and covariates
For identification, we need at least as many instruments as we have treatments
We can write the just-identified (no. instruments = no. treatments) IV estimator as:
\[\hat{\tau}_{\text{IV}} = (\mathbf{Z}^{\prime}\mathbf{X})^{-1}(\mathbf{Z}^{\prime} Y)\]
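A numeric check of this matrix formula on simulated data (a hypothetical DGP): with a single binary instrument plus an intercept, the slope from \((\mathbf{Z}^{\prime}\mathbf{X})^{-1}\mathbf{Z}^{\prime}Y\) coincides with the Wald ratio.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.integers(0, 2, size=n).astype(float)
d = (rng.random(n) < 0.3 + 0.4 * z).astype(float)
y = 0.5 + 1.5 * d + rng.normal(0, 1, size=n)

# Z: instrument matrix (intercept + instrument); X: treatment matrix
Zm = np.column_stack([np.ones(n), z])
Xm = np.column_stack([np.ones(n), d])

tau_iv = np.linalg.solve(Zm.T @ Xm, Zm.T @ y)   # (Z'X)^{-1} Z'y

# The slope equals the Wald ratio
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(np.isclose(tau_iv[1], wald))  # True
```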
We can recover the asymptotic variance using the delta method!
\[Var(\hat{\tau}_{\text{IV}}) = (\mathbf{Z}^{\prime}\mathbf{X})^{-1}(\mathbf{Z}^{\prime} \Sigma \mathbf{Z})(\mathbf{Z}^{\prime}\mathbf{X})^{-1}\]
\(\Sigma\) is the variance-covariance matrix of the errors
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.72707 0.0367 19.800 1.11e-70 0.655 0.799 760
getpost -0.00396 0.0967 -0.041 9.67e-01 -0.194 0.186 760
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.72707 0.0367 19.800 1.11e-70 0.655 0.799 760
getpost -0.00396 0.0967 -0.041 9.67e-01 -0.194 0.186 760
Another way of interpreting \(\hat{\tau}_{\text{IV}}\) is in terms of two ordinary least squares regressions
Intuition - We want to regress \(Y\) on the part of the treatment that is explained entirely by the instrument (the “exogenous” variation)
First stage - Project \(\mathbf{X}\) into the space of the instrument \(\mathbf{Z}\)
\[\widehat{\mathbf{X}} = \underbrace{\mathbf{Z}(\mathbf{Z}^\prime\mathbf{Z})^{-1}\mathbf{Z}^\prime}_{P_{\mathbf{Z}}}\mathbf{X}\]
Second stage - Regress \(Y\) on the “fitted values” \(\widehat{\mathbf{X}}\)
\[\hat{\beta}_{\text{2SLS}} = (\widehat{\mathbf{X}}^{\prime}\mathbf{X})^{-1}(\widehat{\mathbf{X}}^{\prime}Y)\]
Putting it all together gives the 2SLS estimator!
\[\hat{\tau}_{\text{2SLS}} = (\mathbf{X}^{\prime}\mathbf{Z}(\mathbf{Z}^{\prime}\mathbf{Z})^{-1}\mathbf{Z}^\prime\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{Z}(\mathbf{Z}^{\prime}\mathbf{Z})^{-1}\mathbf{Z}^\prime Y)\]
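A sketch verifying the two-stage construction against the one-shot formula, on simulated data with an unobserved confounder (all names and values hypothetical). Note that \(\widehat{\mathbf{X}}^{\prime}\widehat{\mathbf{X}} = \widehat{\mathbf{X}}^{\prime}\mathbf{X}\) because \(P_{\mathbf{Z}}\) is an idempotent projection, so regressing \(Y\) on the fitted values gives the same coefficients:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
z = rng.normal(size=n)                      # continuous instrument
u = rng.normal(size=n)                      # unobserved confounder
d = 0.8 * z + u + rng.normal(size=n)        # endogenous treatment
y = 2.0 * d + u + rng.normal(size=n)        # true effect is 2.0

Zm = np.column_stack([np.ones(n), z])
Xm = np.column_stack([np.ones(n), d])

# Stage 1: project X onto the column space of Z
Xhat = Zm @ np.linalg.solve(Zm.T @ Zm, Zm.T @ Xm)

# Stage 2: regress y on the fitted values
beta_2sls = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)

# One-shot formula from the text: (X' P_Z X)^{-1} (X' P_Z y)
A = Xm.T @ Zm @ np.linalg.solve(Zm.T @ Zm, Zm.T @ Xm)
b = Xm.T @ Zm @ np.linalg.solve(Zm.T @ Zm, Zm.T @ y)
beta_formula = np.linalg.solve(A, b)

print(np.allclose(beta_2sls, beta_formula))  # True
```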
```
Call:
lm_robust(formula = getpost ~ post + Bfemale + reportedage, data = wapost)

Standard error type:  HC2

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  CI Lower CI Upper  DF
(Intercept)   0.12097    0.06406   1.889 5.94e-02 -0.004786  0.24673 729
post          0.35233    0.03482  10.118 1.31e-22  0.283965  0.42069 729
Bfemale      -0.00435    0.03505  -0.124 9.01e-01 -0.073156  0.06445 729
reportedage   0.00170    0.00125   1.360 1.74e-01 -0.000756  0.00416 729

Multiple R-squared:  0.134 ,  Adjusted R-squared:  0.13
F-statistic: 35.4 on 3 and 729 DF,  p-value: < 2e-16
```
iv_robust in the estimatr package (2SLS with robust SEs) and ivmodel in the ivmodel package (robust 2SLS plus weak-instrument-robust tests and other diagnostics)
```
Call:
iv_robust(formula = voted ~ getpost + Bfemale + reportedage |
    post + Bfemale + reportedage, data = wapost)

Standard error type:  HC2

Coefficients:
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  0.22562    0.07334  3.0766 2.17e-03  0.08165   0.3696 729
getpost      0.00428    0.09086  0.0471 9.62e-01 -0.17410   0.1827 729
Bfemale     -0.03495    0.03354 -1.0423 2.98e-01 -0.10079   0.0309 729
reportedage  0.01040    0.00127  8.1879 1.19e-15  0.00791   0.0129 729

Multiple R-squared:  0.093 ,  Adjusted R-squared:  0.0893
F-statistic: 23.3 on 3 and 729 DF,  p-value: 2.15e-14
```
```
Call:
ivmodel(Y = Y, D = D, Z = Z, X = X, intercept = intercept, beta0 = beta0,
    alpha = alpha, k = k, manyweakSE = manyweakSE, heteroSE = heteroSE,
    clusterID = clusterID, deltarange = deltarange, na.action = na.action)

sample size: 733
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

First Stage Regression Result:

F=111, df1=1, df2=729, p-value is <2e-16
R-squared=0.132, Adjusted R-squared=0.131
Residual standard error: 0.444 on 730 degrees of freedom
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Coefficients of k-Class Estimators:

             k Estimate Std. Error t value Pr(>|t|)
OLS    0.00000  0.04708    0.03247    1.45     0.15
Fuller 0.99863  0.00472    0.08979    0.05     0.96
TSLS   1.00000  0.00428    0.09060    0.05     0.96
LIML   1.00000  0.00428    0.09060    0.05     0.96
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Alternative tests for the treatment effect under H_0: beta=0.

Anderson-Rubin test (under F distribution):
F=0.00223, df1=1, df2=729, p-value is 1
95 percent confidence interval:
 [-0.178674420106811, 0.183689569448123]

Conditional Likelihood Ratio test (under Normal approximation):
Test Stat=0.00223, p-value is 1
95 percent confidence interval:
 [-0.17867442296152, 0.183689572193089]
```
Our ratio estimator converges in probability to the true effect plus a bias term
\[\hat{\tau}_{\text{IV}} = \frac{\widehat{Cov}(Y_i, Z_i)}{\widehat{Cov}(D_i, Z_i)} \overset{p}{\to} \tau + \frac{Cov(U_i, Z_i)}{Cov(D_i, Z_i)}\]
Under exogeneity \(Cov(Z_i, U_i)\) is zero.
However, when there are small violations of exogeneity, a weak instrument will amplify them.
More generally, with a weak instrument, t-ratio hypothesis tests that rely on asymptotic normality will have incorrect Type I error rates.
Let’s use a simulation to see how bad the bias can be in IV versus just a simple OLS regression of outcome on treatment under unobserved confounding.
Let \(U_i \sim \mathcal{N}(0, 1)\) be an unobserved confounder. \(Z_i \sim \text{Bern}(.5)\) is an exogenous instrument.
The probability of treatment is modeled via a logit
\[\log\bigg(\frac{P(D_i = 1 | Z_i, U_i)}{1-P(D_i = 1 | Z_i, U_i)}\bigg) = \gamma Z_i + U_i\]
\(\gamma\) here captures the relationship between the exogenous instrument \(Z_i\) and the treatment
The outcome is a function of \(U\) and a mean zero error term \(\epsilon_i\) only, so the true treatment effect is \(0\)
\[Y_i = U_i + \epsilon_i\]
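The DGP above can be coded directly. A minimal Python sketch (the \(\gamma\) values and simulation sizes are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(6)

def one_draw(gamma, n=1000):
    """One simulated dataset from the DGP above; returns (OLS, IV) estimates."""
    u = rng.normal(size=n)                           # unobserved confounder
    z = rng.integers(0, 2, size=n).astype(float)     # Bernoulli(.5) instrument
    p = 1 / (1 + np.exp(-(gamma * z + u)))           # logit first stage
    d = (rng.random(n) < p).astype(float)
    y = u + rng.normal(size=n)                       # true treatment effect is 0
    ols = np.cov(y, d)[0, 1] / np.var(d, ddof=1)
    iv = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]
    return ols, iv

strong = np.array([one_draw(gamma=2.0) for _ in range(500)])
weak = np.array([one_draw(gamma=0.05) for _ in range(500)])

# OLS is badly biased by U; IV with a strong first stage is centered near 0;
# IV with a weak first stage is wildly unstable across draws
print(np.median(strong[:, 0]), np.median(strong[:, 1]), np.median(weak[:, 1]))
```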
When the assignment process of \(Z_i\) is known, we can construct hypothesis tests using permutation inference assuming a constant treatment effect \(\tau\) (Imbens and Rosenbaum, 2005).
\[T(\tau) = \frac{1}{N} \sum_{i=1}^N \tilde{Z}_i \times (Y_i - \tau D_i)\]
If the instrument is valid, under the null hypothesis that \(\tau = \tau_0\), we can get the randomization distribution of the test statistic by simply re-randomizing treatment according to the known assignment process.
Alternative test statistics based on ranks of \(Y_i - \tau D_i\) (possibly within strata) can also be used.
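A sketch of this permutation test on simulated data. Here \(\tilde{Z}_i\) is taken to be the de-meaned instrument (one common centering), and re-randomization is approximated by permuting the assignment vector; both choices are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

def perm_pvalue(y, d, z, tau0, n_perm=2000, rng=rng):
    """Permutation p-value for H0: tau = tau0, using the de-meaned
    instrument times the adjusted outcome Y - tau0*D as the statistic."""
    adj = y - tau0 * d

    def stat(zvec):
        return np.mean((zvec - zvec.mean()) * adj)

    t_obs = stat(z)
    # Re-randomize assignment (here: permute the observed vector)
    t_null = np.array([stat(rng.permutation(z)) for _ in range(n_perm)])
    return np.mean(np.abs(t_null) >= np.abs(t_obs))

# Simulated data with a constant true effect tau = 1
n = 400
z = rng.integers(0, 2, size=n).astype(float)
d = (rng.random(n) < 0.2 + 0.6 * z).astype(float)
y = 1.0 * d + rng.normal(size=n)

p_true = perm_pvalue(y, d, z, tau0=1.0)   # true null: large p expected
p_false = perm_pvalue(y, d, z, tau0=0.0)  # false null: small p expected
print(p_true, p_false)
```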
Even when the assignment process is not known, the IV assumptions allow us to construct a test statistic that does not depend on the first stage.
Let \(\hat{\delta}_{\text{ITT}}\) be the reduced form or intent-to-treat estimate.
Under the IV assumptions, the reduced form is the product of first stage \(\gamma\) and the treatment effect \(\tau\)
\[\delta_{\text{ITT}} = \gamma \times \tau\]
Assuming a particular null \(H_0: \tau = \tau_0\) implies that
\[\delta_{\text{ITT}} - \gamma \times \tau_0 = 0\]
We can construct a test statistic based on the difference between the estimated ITT and the estimated first stage scaled by the null value, a difference which we know is normal in large samples.
\[g(\tau_0) = \hat{\delta}_{\text{ITT}} - \hat{\gamma} \tau_0 \sim \mathcal{N}(0, \Omega(\tau_0))\]
The Anderson-Rubin (1949) test statistic is:
\[AR(\tau) = g(\tau)^\prime \Omega(\tau)^{-1}g(\tau)\]
Under the null \(H_0: \tau = \tau_0\), this has a chi-squared distribution which does not directly depend on the value of the first stage \(\gamma\)!
Intuitively: Statistical properties of differences in two normal random variables are well-known and easy. Statistical properties of ratios are much more complicated!
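With a single binary instrument, \(g(\tau_0)\) is just the difference in means of the adjusted outcome \(Y_i - \tau_0 D_i\) across assignment arms, so the AR statistic reduces to a squared t-type statistic. A sketch on simulated data (the DGP is invented for illustration and assumes a constant effect):

```python
import numpy as np

rng = np.random.default_rng(8)

def ar_stat(y, d, z, tau0):
    """Scalar Anderson-Rubin statistic for a binary instrument:
    g(tau0) = delta_hat - gamma_hat * tau0, the difference in means of
    the adjusted outcome Y - tau0*D across the two Z arms."""
    a = y - tau0 * d
    a1, a0 = a[z == 1], a[z == 0]
    g = a1.mean() - a0.mean()
    omega = a1.var(ddof=1) / len(a1) + a0.var(ddof=1) / len(a0)
    return g**2 / omega            # approximately chi-squared(1) under H0

# Simulated data with confounding U and constant true effect tau = 2
n = 2000
u = rng.normal(size=n)
z = rng.integers(0, 2, size=n).astype(float)
d = (rng.random(n) < 1 / (1 + np.exp(-(1.5 * z + u)))).astype(float)
y = 2.0 * d + u + rng.normal(size=n)

ar_true = ar_stat(y, d, z, tau0=2.0)    # small: true value not rejected
ar_false = ar_stat(y, d, z, tau0=0.0)   # large: zero rejected despite confounding
print(ar_true, ar_false)
```

Note the statistic never divides by the first-stage estimate, which is exactly why its distribution does not degrade as the instrument weakens.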
PS 813 - University of Wisconsin - Madison