PS 813 - Causal Inference
February 23, 2026
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 44.0 0.946 46.47 1.88e-252 42.12 45.8 1004
black 7.7 1.230 6.26 5.71e-10 5.28 10.1 1004
Our goal is to estimate the average treatment effect
\[\mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]\]
Recall that under selection-on-observables
\[\mathbb{E}[Y_i(1)] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i(1)| X_i]\bigg] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i| X_i, D_i = 1]\bigg]\]
\[\mathbb{E}[Y_i(0)] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i(0)| X_i]\bigg] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i| X_i, D_i = 0]\bigg]\]
So to estimate the ATE, we need to: estimate the conditional expectation of \(Y_i\) given \(X_i\) within each treatment group, average each fitted CEF over the distribution of \(X_i\), and take the difference.
So far, we’ve shown that the OLS estimator is consistent and asymptotically normal for the Best Linear Predictor
To do inference on the CEF with OLS, we assume that it’s equal to the BLP!
Assumption 1: Linearity
\[Y_i = X_i^\prime\beta + \epsilon_i\]
Assumption 2: Strict exogeneity of the errors
\[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]
When is linearity guaranteed to hold?
Consider a model with two binary covariates \(X_{i1}\) and \(X_{i2}\).
Four possible unique values: \((X_{i1}, X_{i2}) \in \{(0,0),\ (0,1),\ (1,0),\ (1,1)\}\)
With the way we’ve defined these parameters, the CEF can be written as:
\[\mathbb{E}[Y_i | X_{i1}, X_{i2}] = \alpha + \beta X_{i1} + \gamma X_{i2} + \delta X_{i1} X_{i2}\]
It’s linear by construction! Each level of \(\mathbb{E}[Y_i | X_{i1}, X_{i2}]\) is estimated separately by taking the mean.
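A quick sketch of this on simulated data (illustrative variable names, not the slides' dataset): OLS on the saturated model reproduces the subgroup means exactly.

```r
# Saturated model with two binary covariates: fitted values = cell means
set.seed(813)
n <- 2000
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.4)
y <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(n)

fit <- lm(y ~ x1 * x2)  # alpha + beta*x1 + gamma*x2 + delta*x1*x2

# Subgroup mean for (x1 = 1, x2 = 1) vs. the model's fitted value there
cell_mean <- mean(y[x1 == 1 & x2 == 1])
fitted_11 <- sum(coef(fit))  # alpha + beta + gamma + delta
all.equal(cell_mean, fitted_11)
```

The equality is exact (up to floating point), not approximate: with a fully saturated specification, least squares has enough parameters to match every cell of the CEF.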
\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \end{align*}\]
\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\ &= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\ &= \beta \end{align*}\]
Lastly, by law of total expectation
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]
Therefore
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]
What have we still not assumed?
Assumption 4 - Spherical errors (Assumption 3, no perfect collinearity, i.e. \(\mathbf{X}^{\prime}\mathbf{X}\) is invertible, is already implicit in the formula for \(\hat{\beta}\))
\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \mathbf{I}\]
Benefits: a simple closed-form variance estimator (the default standard errors reported by lm())
Drawbacks: homoskedasticity is rarely plausible in observational data
Under spherical errors, the variance formula simplifies
\[Var(\hat{\beta}) = (\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime (\sigma^2\mathbf{I}) \mathbf{X} (\mathbf{X}^\prime\mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]
An unbiased estimator for \(\sigma^2\) is the sample variance of the residuals (mean squared error) adjusted by a degrees of freedom correction
\[\hat{\sigma}^2 = \frac{1}{N-K} \sum_{i=1}^N (Y_i - \hat{Y}_i)^2\]
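A quick check on simulated data (illustrative, not the slides' application) that this degrees-of-freedom-corrected estimator is exactly what lm() reports as the residual standard error, squared:

```r
# Residual variance with the N - K correction vs. summary(lm)$sigma^2
set.seed(813)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

K <- length(coef(fit))  # number of parameters (here 2: intercept and slope)
sigma2_hat <- sum(residuals(fit)^2) / (n - K)

all.equal(sigma2_hat, summary(fit)$sigma^2)
```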
\(K\) is the number of parameters. We refer to \(N-K\) as the “degrees of freedom”
Proof: Suppose there’s another linear estimator that diverges from the OLS projection of \(Y\) by a non-zero \(K \times N\) matrix \(D\)
\[\tilde{\beta} = ((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D)Y\]
Let’s take the expectation
\[\mathbb{E}[\tilde{\beta}] = \mathbb{E}[\hat{\beta}_{\text{OLS}}] + \mathbb{E}[DY]\]
We know OLS is unbiased. Next, we plug in the expression for \(Y\) (linearity)
\[\mathbb{E}[\tilde{\beta}] = \beta + \mathbb{E}[D\mathbf{X}\beta] + \mathbb{E}[D\epsilon]\]
Pull out the constants and use \(\mathbb{E}[\epsilon] = 0\)
\[\mathbb{E}[\tilde{\beta}] = (\mathbf{I} + D\mathbf{X})\beta\]
If \(\tilde{\beta}\) is unbiased, then
\[(\mathbf{I} + D\mathbf{X})\beta = \beta \quad \Longleftrightarrow \quad D\mathbf{X}\beta = 0\]
This holds for all values of \(\beta\) only if \(D\mathbf{X} = 0\)
Under homoskedasticity, the variance is
\[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)^{\prime}\]
Using properties of matrix transposes
\[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg(\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + D^\prime\bigg)\]
Expanding
\[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + (\mathbf{X}^\prime\mathbf{X})^{-1}(D\mathbf{X})^\prime + D\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + DD^\prime\bigg)\]
Using \(D\mathbf{X} = 0\) from above along with the definition of the variance of \(\hat{\beta}_{\text{OLS}}\)
\[Var(\tilde{\beta}) = Var(\hat{\beta}_{\text{OLS}}) + \sigma^2DD^\prime\]
The variance of our alternative linear, unbiased estimator exceeds the variance of OLS by the matrix \(\sigma^2DD^\prime\), which is positive semi-definite since \(v^\prime DD^\prime v = \|D^\prime v\|^2 \geq 0\) for any \(v\)
Therefore OLS is the lowest-variance linear unbiased estimator
Assumption 5 - Normality of the errors
\[\epsilon | \mathbf{X} \sim \mathcal{N}(0, \sigma^2\mathbf{I})\]
Not one of the Gauss-Markov assumptions; not needed for OLS to be BLUE
Not needed to do asymptotic inference on \(\hat{\beta}\)
Benefits? Exact finite-sample inference: \(\hat{\beta}\) is exactly normal and t-statistics follow exact t distributions
Drawbacks: a strong distributional assumption that is rarely credible in practice
Let’s estimate the ATE via regression imputation
\[\hat{\tau} = \frac{1}{N}\sum_{i=1}^N \bigg(\widehat{Y_i(1)} - \widehat{Y_i(0)}\bigg)\]
Recall that the Lin (2013) estimator is equivalent to this!
term estimate std.error p.value
1 black 4.07 1.59 0.0107
reg_ate_cube <- lm_lin(black_turnout ~ black,
  covariates = ~ pop90 + blackpop_pct1990 + I(blackpop_pct1990^2) + I(blackpop_pct1990^3) +
    unemp + college_pct + hs_pct + income + poverty + home +
    as.factor(year),
  data = turn)
tidy(reg_ate_cube) %>% filter(term == "black") %>% dplyr::select(term, estimate, std.error, p.value)
   term estimate std.error p.value
1 black    -2.71      2.89   0.347
Often you will see researchers estimate a single regression model with the form:
\[\mathbb{E}[Y_i | D_i, X_i] = \tau D_i + X_i^{\prime}\beta\]
Does \(\tau\) identify the ATE? Not necessarily!
Even if the model for the outcome is correct, under effect heterogeneity, the regression coefficient \(\tau\) is not the ATE.
Suppose the potential outcomes can be written as
\[Y_i(d) = Y_i(0) + \tau_i \times d\]
The ATE is \(\mathbb{E}[\tau_i] = \tau\)
Suppose we then fit the typical linear regression model that you see in many research papers
\[Y_i = \alpha + \tau_{\text{R}} D_i + X_i^{\prime}\beta + \epsilon_i\]
Will our OLS estimate \(\hat{\tau}_{\text{R}}\) be consistent for \(\mathbb{E}[\tau_i]\)?
An important regression theorem (that you will see many times) is that the coefficients in a multiple linear regression can be written by partialling out correlations with the other covariates.
Consider estimating the following by OLS:
\[Y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_{i}\]
The standard least-squares estimator for \(\beta_1\) can be expressed as the ratio of the sample covariance of \(X_{i1}\) and \(Y_i\) and the variance of \(X_{i1}\)
\[\hat{\beta_1} = \frac{\widehat{Cov}(Y_i, X_{i1})}{\widehat{Var}(X_{i1})} \to \frac{Cov(Y_i, X_{i1})}{Var(X_{i1})}\]
Now what happens with a multiple linear regression - how do we write \(\beta_1\) when the model is
\[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_{i}\]
It turns out that \(\beta_1\) can also be written in terms of a bivariate regression involving residuals from auxiliary regressions
\[\hat{\beta_1} = \frac{\widehat{Cov}(Y_i^{{\perp X_{i2}}}, X_{i1}^{{\perp X_{i2}}})}{\widehat{Var}(X_{i1}^{{\perp X_{i2}}})} \to \frac{Cov(Y_i^{{\perp X_{i2}}}, X_{i1}^{{\perp X_{i2}}})}{Var(X_{i1}^{{\perp X_{i2}}})}\]
\(Y_i^{{\perp X_{i2}}}\) is the residual from a regression of \(Y_i\) on all of the other covariates (here, just \(X_{i2}\))
\(X_{i1}^{{\perp X_{i2}}}\) is the residual from a regression of \(X_{i1}\) on all of the other covariates (\(X_{i2}\) here)
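A sketch of this equivalence on simulated data (variable names are illustrative): the coefficient on \(X_{i1}\) from the full regression equals the coefficient from a bivariate regression of residualized \(Y_i\) on residualized \(X_{i1}\).

```r
# Frisch-Waugh-Lovell: full-regression coefficient = residual-on-residual coefficient
set.seed(813)
n <- 1000
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)  # x1 correlated with x2
y <- 1 + 2 * x1 - x2 + rnorm(n)

beta1_full <- coef(lm(y ~ x1 + x2))["x1"]

y_perp  <- residuals(lm(y ~ x2))   # y with x2 partialled out
x1_perp <- residuals(lm(x1 ~ x2))  # x1 with x2 partialled out
beta1_fwl <- coef(lm(y_perp ~ x1_perp))["x1_perp"]

all.equal(unname(beta1_full), unname(beta1_fwl))
```

The two coefficients agree to machine precision; this is an algebraic identity, not an asymptotic one.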
Using the Frisch-Waugh-Lovell theorem, we can write an expression for the treatment coefficient as
\[\tau_\text{R} = \frac{Cov(Y_i, D_i^{\perp X_i})}{Var(D_i^{\perp X_i})}\]
\(D_i^{\perp X_i}\) is the residual from a regression of \(D_i\) on all of the other covariates \(X_i\)
In this case, that’s just:
\[\tau_\text{R} = \frac{Cov(Y_i, D_i - \mathbb{E}[D_i|X_i])}{Var(D_i - \mathbb{E}[D_i|X_i])} \]
We can re-arrange terms (see Samii and Aronow, 2016 for the math) to get an expression for the weights placed on individual treatment effects \(\tau_i\)
\[\tau_\text{R} = \frac{E[w_i \tau_i]}{E[w_i]}\]
where \(w_i = (D_i - E[D_i | X_i])^2\)
Intuition: Effects for units whose treatment status is poorly predicted by the covariates get more weight.
The typical multiple regression estimator does not recover the ATE under effect heterogeneity.
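A simulated sketch of this weighting result (the design below, with one binary covariate and propensity scores known by construction, is purely illustrative): the regression coefficient lands on the variance-weighted average \(E[w_i \tau_i]/E[w_i]\), not on the ATE.

```r
# Heterogeneous effects: tau = 4 when x = 1, tau = 1 when x = 0 (ATE = 2.5),
# and treatment variance differs across strata, so regression overweights x = 1
set.seed(813)
n <- 100000
x <- rbinom(n, 1, 0.5)
e_x <- ifelse(x == 1, 0.5, 0.1)  # true P(D = 1 | X); Var(D|X) is .25 vs .09
d <- rbinom(n, 1, e_x)
tau_i <- ifelse(x == 1, 4, 1)
y <- 1 + x + tau_i * d + rnorm(n)

tau_reg <- unname(coef(lm(y ~ d + x))["d"])  # the "one regression" estimate

w <- (d - e_x)^2                              # w_i = (D_i - E[D_i|X_i])^2
tau_weighted <- mean(w * tau_i) / mean(w)     # E[w tau] / E[w]

c(ATE = mean(tau_i), tau_reg = tau_reg, tau_weighted = tau_weighted)
```

With this design the ATE is about 2.5, while both `tau_reg` and `tau_weighted` come out near 3.2: the \(x=1\) stratum has higher \(Var(D_i|X_i)\), so its larger effect gets more weight.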
Fearon and Laitin (2003) “Ethnicity, Insurgency, and Civil War”
It is best to design a regression to estimate one effect.
Don’t read all the coefficients from a regression as causal!
Intuition: We’ve selected \(X_i\) that affect our treatment \(D\).
For more, see
Westreich, Daniel, and Sander Greenland. “The table 2 fallacy: presenting and interpreting confounder and modifier coefficients.” American journal of epidemiology 177.4 (2013): 292-298.
Hünermund, Paul, and Beyers Louw. “On the Nuisance of Control Variables in Causal Regression Analysis.” Organizational Research Methods (2023)
As an alternative to modelling the outcome, can we instead adjust for treatment?
Yes: The propensity score
\[e(X_i) = P(D_i = 1 | X_i)\]
Rosenbaum and Rubin (1983) show that conditional on the propensity score, the treatment is independent of the covariates.
\[D_i {\perp \! \! \! \perp} X_i | e(X_i)\]
Therefore, conditioning on the propensity score suffices to adjust for confounding due to \(X\)
\[\{Y_i(1), Y_i(0)\} {\perp \! \! \! \perp} D_i | e(X_i)\]
The propensity score is a type of balancing score in that conditioning on it breaks the relationship between the covariates and treatment.
One way of adjusting for the propensity score is to use it as a weight on each observation.
We weight each unit by the inverse propensity of it receiving its particular treatment
This yields the Horvitz-Thompson estimator for the ATE (with known weights)
\[\hat{\tau} = \frac{1}{n}\sum_{i=1}^n \bigg(\frac{D_i Y_i}{e(X_i)} - \frac{(1-D_i)Y_i}{1 - e(X_i)}\bigg)\]
Intuition: Weighting constructs a “pseudo-population” where the link between treatment and the observed covariates is broken
By iterated expectations
\[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[E\left[\frac{D_i Y_i}{e(X_i)} \bigg| X_i \right]\bigg]\]
Consistency
\[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[\frac{E[D_i Y_i(1) | X_i]}{e(X_i)} \bigg]\]
Conditional ignorability
\[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[\frac{E[D_i |X_i] E[Y_i(1) | X_i]}{e(X_i)} \bigg]\]
Definition of the propensity score
\[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[\frac{e(X_i) E[Y_i(1) | X_i]}{e(X_i)} \bigg]\]
Iterated expectations
\[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E[Y_i(1)]\]
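We can check this identification argument on simulated data where the true propensity score is known and the ATE is 2 by construction (an illustrative sketch, not the slides' application):

```r
# Horvitz-Thompson IPW with the *true* propensity score
set.seed(813)
n <- 50000
x <- rnorm(n)
e_x <- plogis(0.5 * x)            # true propensity score P(D = 1 | X)
d <- rbinom(n, 1, e_x)
y <- 1 + x + 2 * d + rnorm(n)     # constant treatment effect tau = 2

tau_ht <- mean(d * y / e_x - (1 - d) * y / (1 - e_x))
tau_ht  # close to 2
```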
Horvitz-Thompson behaves especially poorly when propensity scores are close to \(0\) and \(1\)
\[\hat{\tau}_{ht} = \frac{1}{n}\sum_{i=1}^n \bigg(\frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1-D_i)Y_i}{1 - \hat{e}(X_i)}\bigg)\]
Common practice is to normalize the weights within the treated and control groups to sum to \(1\), which gives us the Hajek estimator
\[\hat{\tau}_{\text{hajek}} = \frac{1}{\sum_{i=1}^n \frac{D_i}{\hat{e}(X_i)}} \sum_{i=1}^n\ \frac{D_i Y_i}{\hat{e}(X_i)} - \frac{1}{\sum_{i=1}^n \frac{1-D_i}{1-\hat{e}(X_i)}} \sum_{i=1}^n\ \frac{(1-D_i) Y_i}{1-\hat{e}(X_i)}\]
Weighted least squares will do this if we regress outcome on a binary indicator for treatment and supply the inverse propensity weights.
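A sketch on simulated data of the Hajek estimator as normalized weighted means, and of its numerical equivalence to weighted least squares (illustrative data; ATE is 2 by construction):

```r
# Hajek = difference in weighted means = WLS coefficient on the treatment dummy
set.seed(813)
n <- 50000
x <- rnorm(n)
e_x <- plogis(0.5 * x)            # true propensity score
d <- rbinom(n, 1, e_x)
y <- 1 + x + 2 * d + rnorm(n)

w <- ifelse(d == 1, 1 / e_x, 1 / (1 - e_x))  # inverse propensity weights

# Hajek: weights renormalized (by weighted.mean) within each treatment arm
tau_hajek <- weighted.mean(y[d == 1], w[d == 1]) -
  weighted.mean(y[d == 0], w[d == 0])

# Same number via weighted least squares on the treatment indicator alone
tau_wls <- unname(coef(lm(y ~ d, weights = w))["d"])

all.equal(tau_wls, tau_hajek)
```

With a single binary regressor, the WLS coefficient is exactly the difference in weighted group means, so the two computations agree to machine precision.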
With high-dimensional \(X\), we’ll often need a parametric model for the propensity score
We don’t want to assume that the CEF is linear.
\[\mathbb{E}[Y_i| X_i] = X_i^\prime\beta\]
Instead, we might assume that a function of the CEF is linear
\[g(\mathbb{E}[Y_i | X_i]) = X_i^\prime\beta\]
This is where generalized linear models come in.
A parametric distribution on \(Y_i|X_i\) (“stochastic component”)
A linear predictor: \(\eta_i = X_i^{\prime}\beta = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \dots + \beta_kX_{ik}\) (“systematic component”)
A link function \(g()\) applied to the CEF \(E[Y_i|X_i]\) that yields the linear predictor
\[g(E[Y_i | X_i]) = \eta_i\]
Alternatively, we can write the CEF in terms of the “inverse-link” \(g^{-1}()\) applied to the linear predictor
\[E[Y_i | X_i] = g^{-1}(\eta_i)\]
The “logit” or “logistic” GLM models the log-odds of a binary outcome as a function of the linear predictor \(X_i^{\prime}\beta\)
\[E[Y_i | X_i] = Pr(Y_i = 1 | X_i) = \pi_i\]
\[\log\bigg(\frac{\pi_i}{1-\pi_i}\bigg) = X_i^{\prime}\beta\]
Alternatively, this is written in terms of the “inverse-link” function (the logistic function), which gives \(\pi_i = g^{-1}(X_i^{\prime}\beta)\)
\[\pi_i = \frac{\exp(X_i^{\prime}\beta)}{1 + \exp(X_i^{\prime}\beta)} = \frac{1}{1 + \exp(-X_i^{\prime}\beta)}\]
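In R, this inverse link is plogis() and the log-odds link itself is qlogis(); a minimal check that the closed-form expression matches:

```r
# Logistic inverse link by hand vs. plogis(); qlogis() recovers the linear predictor
eta <- c(-2, 0, 1.5)
pi_manual <- 1 / (1 + exp(-eta))

all.equal(pi_manual, plogis(eta))   # inverse link g^{-1}
all.equal(qlogis(pi_manual), eta)   # link g: log(pi / (1 - pi))
```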
# Subset
turn_sub <- turn %>% filter(blackpop_pct1990 > .1 & blackpop_pct1990 < .4)
# Regression adjustment
regression <- lm_lin(black_turnout ~ black,
covariates = ~ pop90 + blackpop_pct1990 +
college_pct + hs_pct + unemp + income + poverty + home +
year, data = turn_sub)
tidy(regression) %>% filter(term == "black")
   term estimate std.error statistic     p.value conf.low conf.high  df       outcome
1 black 4.219744  1.297229  3.252891 0.001208061  1.67196  6.767528 586 black_turnout
# Fit a propensity score model - we'll use a logit
pscore_model <- glm(black ~ pop90 + blackpop_pct1990 +
college_pct + hs_pct + unemp + income + poverty + home +
year, data=turn_sub, family= binomial(link='logit'))
# Predict the propensity score
turn_sub$e <- predict(pscore_model, type="response")
# Generate the weights
turn_sub$iptw_weight <- 1/turn_sub$e
turn_sub$iptw_weight[turn_sub$black == 0] <- 1/(1-turn_sub$e[turn_sub$black == 0])
   black 
4.270128 
set.seed(53706)
nIter <- 1000
boot_iptw <- rep(NA,nIter)
for (i in 1:nIter){
turn_boot <- turn_sub[sample(1:nrow(turn_sub), size = nrow(turn_sub), replace=T),]
# Fit a propensity score model
pscore_model_boot <- glm(black ~ pop90 + blackpop_pct1990 +
unemp + college_pct + hs_pct + income + poverty + home +
year, data=turn_boot, family= binomial(link='logit'))
# Predict the propensity score
turn_boot$e <- predict(pscore_model_boot, type="response")
# Generate the weights
turn_boot$iptw_weight <- 1/turn_boot$e
turn_boot$iptw_weight[turn_boot$black == 0] <- 1/(1-turn_boot$e[turn_boot$black == 0])
# Fit a model
iptw_model_boot <- lm(black_turnout ~ black, data=turn_boot, weights=iptw_weight)
boot_iptw[i] <- coef(iptw_model_boot)[2]
}
In both of these approaches, we need to get a model correct in order for our estimator to be consistent for the ATE
In outcome regression, we need to get \(m_d(X_i) = \mathbb{E}[Y_i | D_i = d, X_i]\) correct
\[\hat{\tau}_{\text{reg}} = \frac{1}{n}\sum_{i=1}^n \bigg\{\hat{m}_1(X_i) - \hat{m}_0(X_i) \bigg\}\]
In inverse-propensity of treatment weighting, we need to get \(e(X_i) = P(D_i = 1 | X_i)\) correct
\[\hat{\tau}_{\text{ht}} = \frac{1}{n}\sum_{i=1}^n\ \frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1-D_i)Y_i}{1 - \hat{e}(X_i)}\]
Could we construct an estimator that’s still consistent even if we get one of these wrong?
The Augmented IPTW estimator combines the regression estimator with an IPW estimator that replaces \(Y_i\) with the residuals \(Y_i - \hat{m}_d(X_i)\)
\[\hat{\tau}_{\text{aipw-ht}} = \frac{1}{n}\sum_{i=1}^n\bigg\{\bigg(\hat{m}_1(X_i) - \hat{m}_0(X_i)\bigg) + \bigg(\frac{D_i (Y_i - \hat{m}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{m}_0(X_i))}{1 - \hat{e}(X_i)}\bigg)\bigg\}\]
This is doubly-robust in the sense that it’s consistent for the ATE if either the outcome model \(m_d(X_i)\) or the propensity score model \(e(X_i)\) is correctly specified; we do not need both.
As with IPW, we’ll usually use the Hajek form for the IPW part, with the weights normalized to sum to 1 within each treatment condition, to avoid instability when propensity scores are near 0 or 1.
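A simulated sketch of double robustness (illustrative data; the ATE is 2 by construction): even with a badly misspecified outcome model, AIPW with the true propensity score recovers the ATE, while the naive difference in means does not.

```r
# Double robustness: intercept-only (wrong) outcome model + true propensity score
set.seed(813)
n <- 50000
x <- rnorm(n)
e_x <- plogis(0.5 * x)            # true propensity score
d <- rbinom(n, 1, e_x)
y <- 1 + x + 2 * d + rnorm(n)     # x confounds d and y; tau = 2

# Misspecified outcome models m_1, m_0: ignore x entirely
m1 <- mean(y[d == 1])
m0 <- mean(y[d == 0])

tau_aipw <- mean((m1 - m0) +
  d * (y - m1) / e_x -
  (1 - d) * (y - m0) / (1 - e_x))

tau_naive <- m1 - m0  # confounded difference in means

c(aipw = tau_aipw, naive = tau_naive)
```

The IPW correction term re-centers the (wrong) regression predictions, so `tau_aipw` sits near 2 while `tau_naive` is biased upward by the confounding through \(x\).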
Consider this form of the AIPW estimator. What happens if the outcome model is correct?
\[\hat{\tau}_{\text{aipw-ht}} = \frac{1}{n}\sum_{i=1}^n\bigg\{\bigg(\hat{m}_1(X_i) - \hat{m}_0(X_i)\bigg) + \bigg(\frac{D_i (Y_i - \hat{m}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{m}_0(X_i))}{1 - \hat{e}(X_i)}\bigg)\bigg\}\]
Alternatively, we can rearrange the terms to write the estimator in a different form.
\[\hat{\tau}_{\text{aipw-ht}} = \frac{1}{n}\sum_{i=1}^n\bigg\{\bigg(\frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1-D_i)Y_i}{1 - \hat{e}(X_i)}\bigg)+ \bigg(\hat{m}_1(X_i)\bigg(1 - \frac{D_i}{\hat{e}(X_i)}\bigg) - \hat{m}_0(X_i)\bigg(1 - \frac{1 -D_i}{1 -\hat{e}(X_i)}\bigg)\bigg)\bigg\}\]
# Outcome models
turn_sub$mu_1 <- predict(regression, newdata = turn_sub %>% mutate(black = 1))
turn_sub$mu_0 <- predict(regression, newdata = turn_sub %>% mutate(black = 0))
# AIPW estimator
aipw <- mean(turn_sub$mu_1 - turn_sub$mu_0) + # Regression (we could just pull this from the Lin estimate)
(weighted.mean(turn_sub$black_turnout[turn_sub$black == 1] - turn_sub$mu_1[turn_sub$black == 1], turn_sub$iptw_weight[turn_sub$black == 1]) - # Hajek IPW
weighted.mean(turn_sub$black_turnout[turn_sub$black == 0] - turn_sub$mu_0[turn_sub$black == 0], turn_sub$iptw_weight[turn_sub$black == 0]))
aipw
[1] 3.81323
set.seed(53706)
nIter <- 1000
boot_aipw <- rep(NA,nIter)
for (i in 1:nIter){
turn_boot <- turn_sub[sample(1:nrow(turn_sub), size = nrow(turn_sub), replace=T),]
# Fit a propensity score model
pscore_model_boot <- glm(black ~ pop90 + blackpop_pct1990 +
unemp + college_pct + hs_pct + income + poverty + home +
year, data=turn_boot, family= binomial(link='logit'))
# Predict the propensity score
turn_boot$e <- predict(pscore_model_boot, type="response")
# Generate the weights
turn_boot$iptw_weight <- 1/turn_boot$e
turn_boot$iptw_weight[turn_boot$black == 0] <- 1/(1-turn_boot$e[turn_boot$black == 0])
# Fit a regression model
regression_boot <- lm_lin(black_turnout ~ black,
covariates = ~ pop90 + blackpop_pct1990 +
college_pct + hs_pct + unemp + income + poverty + home +
year, data = turn_boot)
turn_boot$mu_1 <- predict(regression_boot, newdata = turn_boot %>% mutate(black = 1))
turn_boot$mu_0 <- predict(regression_boot, newdata = turn_boot %>% mutate(black = 0))
# AIPW
boot_aipw[i] <- mean(turn_boot$mu_1 - turn_boot$mu_0) + # Regression (we could just pull this from the Lin estimate)
(weighted.mean(turn_boot$black_turnout[turn_boot$black == 1] - turn_boot$mu_1[turn_boot$black == 1], turn_boot$iptw_weight[turn_boot$black == 1]) - # Hajek IPW
weighted.mean(turn_boot$black_turnout[turn_boot$black == 0] - turn_boot$mu_0[turn_boot$black == 0], turn_boot$iptw_weight[turn_boot$black == 0]))
}
PS 813 - University of Wisconsin - Madison