PS 813 - Causal Inference
February 16, 2026
\[ \require{cancel} \]
Recall that conditional ignorability gives independence only conditional on \(\mathbf{X}_i\); unconditionally, \(\{Y_i(1), Y_i(0)\} \cancel{{\perp \! \! \! \perp}} D_i\)
Therefore:
\[\mathbb{E}[Y_i(1) | D_i = 1] \neq \mathbb{E}[Y_i(1)]\]
So the difference in means alone will not identify the ATE – we need to condition on the covariates \(\mathbf{X}_i\)
Iterated expectations:
\[\mathbb{E}_X\bigg[\mathbb{E}[Y_i | D_i = 1, \mathbf{X}_i = x]\bigg] - \mathbb{E}_X\bigg[\mathbb{E}[Y_i | D_i = 0, \mathbf{X}_i = x]\bigg]\]
Consistency:
\[\mathbb{E}_X\bigg[\mathbb{E}[Y_i(1) | D_i = 1, \mathbf{X}_i = x]\bigg] - \mathbb{E}_X\bigg[\mathbb{E}[Y_i(0) | D_i = 0, \mathbf{X}_i = x]\bigg]\]
Conditional ignorability:
\[\mathbb{E}_X\bigg[\mathbb{E}[Y_i(1) | \mathbf{X}_i = x]\bigg] - \mathbb{E}_X\bigg[\mathbb{E}[Y_i(0) | \mathbf{X}_i = x]\bigg]\]
Law of iterated expectations:
\[\mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] = \tau\]
With infinite data, it would be possible to simply plug in sample analogues for \(\mathbb{E}[Y_i | D_i = 1, \mathbf{X}_i = x]\) for each unique value of \(x\) (as long as positivity holds).
But as the dimensionality of \(\mathbf{X}_i\) grows, this may become impossible within our sample (few or no observations at any given value of \(\mathbf{X}_i\))
If our \(\mathbf{X}_i\) are sufficiently low-dimensional, we don’t really need any strong modeling assumptions to estimate the ATE.
\[\hat{\tau}(x) = \widehat{\mathbb{E}}[Y_i | D_i = 1, \mathbf{X}_i = x] - \widehat{\mathbb{E}}[Y_i | D_i = 0, \mathbf{X}_i = x]\]
\[\hat{\tau} = \sum_{x \in \mathcal{X}} \hat{\tau}(x) \widehat{P}(\mathbf{X}_i = x)\]
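The plug-in estimator in the two displays above can be sketched directly in code. A minimal illustration (in Python rather than the R used elsewhere in these notes; the function name `stratified_ate` is mine):

```python
import numpy as np

def stratified_ate(y, d, x):
    """Subclassification (plug-in) estimator of the ATE.

    For each unique covariate value x, compute tau-hat(x) as the
    within-stratum difference in means, then average the tau-hat(x)
    weighted by the empirical distribution P-hat(X = x).
    """
    y, d, x = map(np.asarray, (y, d, x))
    tau_hat = 0.0
    for val in np.unique(x):
        stratum = x == val
        treated = stratum & (d == 1)
        control = stratum & (d == 0)
        if not treated.any() or not control.any():
            raise ValueError(f"positivity fails in stratum x = {val}")
        tau_x = y[treated].mean() - y[control].mean()
        tau_hat += tau_x * stratum.mean()  # weight by P-hat(X = x)
    return tau_hat
```

Each stratum contributes its difference in means, weighted by its share of the sample; the `ValueError` flags exactly the positivity problem noted above.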
What happens if \(\mathbf{X}_i\) is high-dimensional? Coarsen into bins:
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    49.47       3.94  12.554 4.93e-31     41.7    57.22 432
anygirls       -2.83       4.59  -0.617 5.38e-01    -11.8     6.19 432
Cross-tabulation of number of girls (rows) by total number of children (columns):

       0   1   2   3   4   5   6   7   8   9  10
0     60  15  28  12   3   0   1   0   0   0   0
1      0  25  79  33  14   6   0   0   0   0   0
2      0   0  31  37  24   7   1   1   0   0   0
3      0   0   0  12  13  12   1   1   1   0   0
4      0   0   0   0   3   5   2   1   1   0   1
5      0   0   0   0   0   1   0   1   0   1   0
7      0   0   0   0   0   0   0   0   1   0   0
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    60.15       3.51   17.13 1.39e-50    53.25    67.05 432
totchi         -5.11       1.08   -4.73 3.09e-06    -7.23    -2.98 432
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    2.232      0.103   21.71 3.53e-71    2.030    2.434 432
repub          0.499      0.156    3.21 1.42e-03    0.194    0.805 432
term estimate std.error p.value
1 anygirls 5.58 6.62 0.4
# A tibble: 3 × 6
totchi term estimate std.error df n
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 anygirls -15.8 13.5 38 40
2 2 anygirls 14.8 9.18 136 138
3 3 anygirls 19.1 11.8 92 94
wash_subset <- wash_subset %>%
  mutate(totchi1 = as.numeric(totchi == 1),
         totchi2 = as.numeric(totchi == 2),
         totchi3 = as.numeric(totchi == 3))

# Adjusted
washington_strat <- lm_robust(aauw ~ anygirls*I(totchi2 - mean(totchi2)) +
                                anygirls*I(totchi3 - mean(totchi3)), wash_subset)
tidy(washington_strat) %>% filter(term == "anygirls") %>%
  select(term, estimate, std.error, p.value)

      term estimate std.error p.value
1 anygirls     11.8       6.5  0.0712
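The adjusted estimate can also be recovered by hand: weighting each stratum-specific `anygirls` estimate by its share of the sample reproduces the interacted-regression estimate. A quick check (in Python, using the numbers from the tibble of stratified estimates above):

```python
# Stratum-specific anygirls estimates and sample sizes (totchi = 1, 2, 3),
# taken from the tibble of stratified regressions above.
estimates = [-15.8, 14.8, 19.1]
n = [40, 138, 94]

# Weight each stratum by its share of the sample.
weights = [n_k / sum(n) for n_k in n]
tau_hat = sum(w * e for w, e in zip(weights, estimates))
print(round(tau_hat, 1))  # 11.8, matching the adjusted estimate above
```

This works because interacting `anygirls` with de-meaned stratum indicators makes the main-effect coefficient a sample-share-weighted average of the within-stratum differences in means.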
\[\underbrace{\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]}_{\text{Difference-in-means}} = \underbrace{\mathbb{E}[Y_i(1) - Y_i(0) | D_i = 1]}_{\text{ATT}} + \bigg(\underbrace{\mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0]}_{\text{Selection-into-treatment bias}}\bigg)\]
Let’s write the selection-into-treatment bias conditioning on \(U_i\):
\[\begin{align*}\text{Selection Bias} = &\sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0) | D_i = 1, U_i = u] Pr(U_i = u | D_i = 1) \\&- \sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0) | D_i = 0, U_i = u] Pr(U_i = u | D_i = 0)\end{align*}\]
Ignorability conditional on \(U_i\):
\[\text{Selection Bias} = \sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0) | U_i = u] Pr(U_i = u | D_i = 1) - \sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0) | U_i = u] Pr(U_i = u | D_i = 0)\]
Combining terms:
\[\text{Selection Bias} = \sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0) | U_i = u] \times \bigg(Pr(U_i = u | D_i = 1) - Pr(U_i = u | D_i = 0)\bigg)\]
Selection bias is thus a product of two elements, and it vanishes if either is zero. First, if treatment assignment is independent of the confounder, then the bias is 0
\[\begin{align*}\text{Selection Bias} &= \sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0) | U_i = u] \times \bigg(Pr(U_i = u) - Pr(U_i = u)\bigg)\\ &= \sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0) | U_i = u] \times 0 = 0 \end{align*}\]
Second, if \(Y_i(0)\) is independent of \(U_i\), we have:
\[\begin{align*}\text{Selection Bias} &= \sum_{u \in \mathcal{U}} \mathbb{E}[Y_i(0)] \times \bigg(Pr(U_i = u | D_i = 1) - Pr(U_i = u | D_i = 0)\bigg)\\ &= \mathbb{E}[Y_i(0)] \times \bigg(\sum_{u \in \mathcal{U}}Pr(U_i = u | D_i = 1) - \sum_{u \in \mathcal{U}} Pr(U_i = u | D_i = 0)\bigg)\\ &= \mathbb{E}[Y_i(0)] \times \bigg(1 - 1\bigg) = 0\end{align*}\]
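The derivation above is easy to check numerically. A small simulation (in Python; the data-generating process is invented for illustration) computes the selection bias directly and via the combined-terms formula, and confirms it vanishes when treatment is independent of \(U_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Binary confounder U; treatment D depends on U; Y(0) depends on U.
u = rng.integers(0, 2, n)
d = rng.binomial(1, np.where(u == 1, 0.8, 0.2))
y0 = 10 + 5 * u + rng.normal(0, 1, n)

# Selection bias computed directly: E[Y(0) | D = 1] - E[Y(0) | D = 0]
direct = y0[d == 1].mean() - y0[d == 0].mean()

# Selection bias via the formula:
# sum_u E[Y(0) | U = u] * (P(U = u | D = 1) - P(U = u | D = 0))
formula = sum(
    y0[u == val].mean()
    * ((u[d == 1] == val).mean() - (u[d == 0] == val).mean())
    for val in (0, 1)
)
# Here E[Y(0) | U] is roughly {10, 15} and P(U=1|D=1) - P(U=1|D=0) = 0.6,
# so both quantities come out near 5 * 0.6 = 3.

# If D is instead independent of U, the bias vanishes.
d_rand = rng.binomial(1, 0.5, n)
no_bias = y0[d_rand == 1].mean() - y0[d_rand == 0].mean()  # near 0
```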
A parent node is a direct cause of a child node.
An ancestor node is a direct or indirect cause of a descendant node.
Path: A sequence of nodes connected by edges (either direction!)
\[\begin{align*} Y &= f_Y(D, X, \epsilon_Y)\\ D &= f_D(X, \epsilon_D)\end{align*}\]
\[P(X_1, X_2, \dotsc, X_K) = \prod_{k=1}^K P(X_k | \text{parents}(X_k))\]
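The factorization can be made concrete with the three-variable model above. A small sketch (in Python; the conditional probability tables are invented for illustration):

```python
# Hypothetical conditional tables for the DAG X -> D, X -> Y, D -> Y,
# so that P(X, D, Y) = P(X) * P(D | X) * P(Y | D, X). All variables binary.
P_X = {0: 0.6, 1: 0.4}                       # P(X = x)
P_D1_given_X = {0: 0.2, 1: 0.7}              # P(D = 1 | X = x)
P_Y1_given_DX = {(0, 0): 0.1, (0, 1): 0.3,
                 (1, 0): 0.5, (1, 1): 0.9}   # P(Y = 1 | D = d, X = x)

def joint(x, d, y):
    """P(X = x, D = d, Y = y): the product of each node given its parents."""
    px = P_X[x]
    pd = P_D1_given_X[x] if d == 1 else 1 - P_D1_given_X[x]
    py = P_Y1_given_DX[(d, x)] if y == 1 else 1 - P_Y1_given_DX[(d, x)]
    return px * pd * py

# A valid factorization sums to 1 over all eight configurations.
total = sum(joint(x, d, y) for x in (0, 1) for d in (0, 1) for y in (0, 1))
```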
So, what is it about epidemiologists that drives them to seek the light of new tools, while economists (at least those in Imbens’s camp) seek comfort in partial blindness, while missing out on the causal revolution? Can economists do in their heads what epidemiologists observe in their graphs? Can they, for instance, identify the testable implications of their own assumptions? Can they decide whether the IV assumptions (i.e., exogeneity and exclusion) are satisfied in their own models of reality? Of course they can’t; such decisions are intractable to the graph-less mind. (I have challenged them repeatedly to these tasks, to the sound of a pin-drop silence)
PS 813 - University of Wisconsin - Madison