Week 6: Regression and Weighting

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

February 23, 2026


Last week

  • Selection-on-observables
    • Identification under conditional ignorability
    • We have to adjust for factors predicting treatment and predicting outcome
  • Directed acyclic graphs
    • Represent causal relationships in terms of nodes and edges
    • Characterize conditional independence (d-separation)
    • Adjustment criterion

This week

  • How can we estimate effects under selection-on-observables assumptions?
    • What happens when we can’t just stratify?
  • Regression modeling
    • Estimate \(\mathbb{E}[Y | X, D = 1]\) and \(\mathbb{E}[Y | X, D = 0]\)
    • What additional assumptions do we impose? Do some specifications impose more assumptions than others?
  • Inverse-propensity of treatment weighting
    • Can we just adjust for a single quantity?
    • Yes, the propensity score: \(P(D = 1 | X)\)
    • Weighting by the inverse propensity of treatment.

Regression Adjustment

Running Example: Keele et al. (2017)

  • Do minority candidates drive minority voter turnout?
    • Keele, Shah, White and Kay (2017, JOP) “Black Candidates and Black Turnout: A Study of Viability in Louisiana Mayoral Elections”
    • Examine mayoral elections in Louisiana from 1988 to 2011 and compare differences in Black turnout among elections with a Black candidate and elections without a Black candidate (all-white candidate slate)
turn <- haven::read_dta("assets/data/match-all.dta")
lm_robust(black_turnout ~ black, data=turn)
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)     44.0      0.946   46.47 1.88e-252    42.12     45.8 1004
black            7.7      1.230    6.26  5.71e-10     5.28     10.1 1004
  • Identification challenge: Black candidates are not randomly assigned to elections!
    • Strategic entry: Black candidates might be more likely to run in cities with larger Black population shares.
    • City demographics might just generally correlate with turnout.

Running Example: Keele et al. (2017)

  • Let’s take a look at the data to see if this is the case!
    • A very useful tool for visualizing CEFs is the binned scatterplot

Regression adjustment

  • Our goal is to estimate the average treatment effect

    \[\mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]\]

  • Recall that under selection-on-observables

    \[\mathbb{E}[Y_i(1)] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i(1)| X_i]\bigg] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i| X_i, D_i = 1]\bigg]\] \[\mathbb{E}[Y_i(0)] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i(0)| X_i]\bigg] = \mathbb{E}_{X}\bigg[\mathbb{E}[Y_i| X_i, D_i = 0]\bigg]\]

  • So what we need to do to estimate the ATE is:

    • Fit a regression model in the treated group to estimate \(\mathbb{E}[Y_i | X_i, D_i = 1]\)
    • Fit a regression model in the control group to estimate \(\mathbb{E}[Y_i | X_i, D_i = 0]\)
    • Average the estimates from these models over the sample distribution of \(X_i\)
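The three steps can be sketched in R; the data-generating process below is purely illustrative (simulated, not the Keele et al. data):

```r
# Illustrative sketch of regression imputation (simulated data, true ATE = 1.5)
set.seed(60637)
n <- 5000
X <- rnorm(n)
D <- rbinom(n, 1, plogis(X))       # treatment probability depends on X
Y <- 2 + X + 1.5 * D + rnorm(n)
dat <- data.frame(Y, D, X)

# Step 1: outcome model among the treated
fit1 <- lm(Y ~ X, data = subset(dat, D == 1))
# Step 2: outcome model among the controls
fit0 <- lm(Y ~ X, data = subset(dat, D == 0))
# Step 3: predict both potential outcomes for every unit and average
tau_hat <- mean(predict(fit1, newdata = dat) - predict(fit0, newdata = dat))
```

With this linear DGP the outcome models are correctly specified, so `tau_hat` lands close to the true ATE of 1.5.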

Classical regression

  • So far, we’ve shown that the OLS estimator is consistent and asymptotically normal for the Best Linear Predictor

    • But what about estimating the conditional expectation function \(\mathbb{E}[Y_i|X_i]\)?
    • In non-experimental settings, we often do need to get this right; otherwise some unadjusted-for bias will remain.
  • To do inference on the CEF with OLS, we assume that it’s equal to the BLP!

  • Assumption 1: Linearity

    \[Y_i = X_i^\prime\beta + \epsilon_i\]

  • Assumption 2: Strict exogeneity of the errors

    \[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]

  • When is linearity always true?

    • When we have a fully-saturated model - as many parameters as unique covariate values (e.g. (post)-stratification)

Fully saturated model

  • Consider a model with two binary covariates \(X_{i1}\) and \(X_{i2}\).

  • Four possible unique values:

    • \(\mathbb{E}[Y_i | X_{i1} = 0, X_{i2} = 0] = \alpha\)
    • \(\mathbb{E}[Y_i | X_{i1} = 1, X_{i2} = 0] = \alpha + \beta\)
    • \(\mathbb{E}[Y_i | X_{i1} = 0, X_{i2} = 1] = \alpha + \gamma\)
    • \(\mathbb{E}[Y_i | X_{i1} = 1, X_{i2} = 1] = \alpha + \beta + \gamma + \delta\)
  • With the way we’ve defined these parameters, the CEF can be written as:

    \[\mathbb{E}[Y_i | X_{i1}, X_{i2}] = \alpha + \beta X_{i1} + \gamma X_{i2} + \delta X_{i1} X_{i2}\]

  • It’s linear by construction! Each level of \(\mathbb{E}[Y_i | X_{i1}, X_{i2}]\) is estimated separately by taking the mean.
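A quick numerical check on simulated data: the fitted values from a saturated OLS regression coincide exactly with the sample cell means.

```r
# Saturated model with two binary covariates: OLS reproduces the cell means
set.seed(60637)
n <- 2000
X1 <- rbinom(n, 1, 0.5)
X2 <- rbinom(n, 1, 0.5)
Y <- 1 + 2 * X1 - X2 + 3 * X1 * X2 + rnorm(n)

fit <- lm(Y ~ X1 * X2)   # main effects plus the interaction: fully saturated

# Prediction in the (X1 = 1, X2 = 1) cell is alpha + beta + gamma + delta
cell_pred <- sum(coef(fit))
cell_mean <- mean(Y[X1 == 1 & X2 == 1])
all.equal(cell_pred, cell_mean)   # identical up to numerical error
```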

Classical regression

  • Assumption 3: No perfect collinearity
    • \(\mathbf{X}^{\prime}\mathbf{X}\) is invertible
    • \(\mathbf{X}\) has full column rank
  • When does this fail? How can we check?

Classical regression

  • Under assumptions 1-3, our OLS estimator \(\hat{\beta}\) is also unbiased for \(\beta\)
  • Let’s do a quick proof for unbiasedness

\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \end{align*}\]

  • Then we can take the conditional expectation \(\mathbb{E}[\hat{\beta} | \mathbf{X}]\)

\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\ &= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\ &= \beta \end{align*}\]

Classical regression

  • Lastly, by law of total expectation

    \[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]

  • Therefore

    \[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]

  • What have we still not assumed?

    • Anything about the distribution of the errors (aside from finite 1st/2nd moments)

Classical regression

  • Assumption 4 - Spherical errors

    \[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \mathbf{I}\]

  • Benefits

    • Simple, unbiased estimator for \(Var(\hat{\beta})\) - classical OLS standard errors (what you get in lm())
    • Completes Gauss-Markov assumptions \(\leadsto\) OLS is BLUE (Best Linear Unbiased Estimator)
  • Drawbacks

    • Basically never is true (at least in social science settings!)

Classical regression

  • Under spherical errors, the variance formula simplifies

    \[Var(\hat{\beta} | \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime Var(\epsilon | \mathbf{X}) \mathbf{X} (\mathbf{X}^\prime\mathbf{X})^{-1}\] \[Var(\hat{\beta} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]

  • An unbiased estimator for \(\sigma^2\) is the sample variance of the residuals (mean squared error) adjusted by a degrees of freedom correction

    \[\hat{\sigma}^2 = \frac{1}{N-K} \sum_{i=1}^N (Y_i - \hat{Y}_i)^2\]

  • \(K\) is the number of parameters. We refer to \(N-K\) as the “degrees of freedom”
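A small check in R that this degrees-of-freedom correction is exactly what `summary.lm` reports (as the squared residual standard error); data simulated for illustration:

```r
# Unbiased estimator of sigma^2 with a degrees-of-freedom correction
set.seed(60637)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

K <- length(coef(fit))                        # number of parameters (here K = 2)
sigma2_hat <- sum(residuals(fit)^2) / (n - K)
all.equal(sigma2_hat, summary(fit)$sigma^2)   # TRUE
```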

OLS is BLUE

  • The Gauss-Markov Theorem
    • Under Assumptions 1-4, no other linear, unbiased estimator for \(\beta\) has lower variance than \(\hat{\beta}_{\text{OLS}}\)

OLS is BLUE

  • Proof: Suppose there’s another linear estimator that diverges from the OLS projection of \(Y\) by a non-zero \(K \times N\) matrix \(D\)

    \[\tilde{\beta} = ((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D)Y\]

  • Let’s take the expectation

    \[\mathbb{E}[\tilde{\beta}] = \mathbb{E}[\hat{\beta}_{\text{OLS}}] + \mathbb{E}[DY]\]

  • We know OLS is unbiased. Next, we plug in the expression for \(Y\) (linearity)

    \[\mathbb{E}[\tilde{\beta}] = \beta + \mathbb{E}[D\mathbf{X}\beta] + \mathbb{E}[D\epsilon]\]

OLS is BLUE

  • Pull out the constants and use \(\mathbb{E}[\epsilon] = 0\)

    \[\mathbb{E}[\tilde{\beta}] = (\mathbf{I} + D\mathbf{X})\beta\]

  • If \(\tilde{\beta}\) is unbiased, then

    \[(\mathbf{I} + D\mathbf{X})\beta = \beta \quad \Longrightarrow \quad D\mathbf{X}\beta = 0\]

  • This holds for all values of \(\beta\) only if \(D\mathbf{X} = 0\)

OLS is BLUE

  • Under homoskedasticity, the variance is

    \[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)^{\prime}\]

  • Using properties of matrix transposes

    \[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg(\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + D^\prime\bigg)\]

  • Expanding

    \[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + (\mathbf{X}^\prime\mathbf{X})^{-1}(D\mathbf{X})^\prime + D\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + DD^\prime\bigg)\]

OLS is BLUE

  • Using \(D\mathbf{X} = 0\) from above along with the definition of the variance of \(\hat{\beta}_{\text{OLS}}\)

    \[Var(\tilde{\beta}) = Var(\hat{\beta}_{\text{OLS}}) + \sigma^2DD^\prime\]

  • The variance of our alternative linear, unbiased estimator exceeds the variance of OLS by \(\sigma^2DD^\prime\), which is positive semi-definite (since \(DD^\prime\) is a Gram matrix)

  • Therefore OLS is the lowest-variance linear unbiased estimator

Classical regression

  • Assumption 5 - Normality of the errors

    \[\epsilon | \mathbf{X} \sim \mathcal{N}(0, \sigma^2\mathbf{I})\]

  • Not necessary even for Gauss-Markov assumptions

  • Not needed to do asymptotic inference on \(\hat{\beta}\)

    • Why? Central Limit Theorem!
  • Benefits?

    • Finite-sample inference (no reliance on asymptotics for hypothesis tests)
    • Inference for predictions from the model.
  • Drawbacks

    • Again, almost never true unless our outcome is a sum or mean of a large number of i.i.d. random variables

Regression review

  • What do we need for OLS to be consistent for the “best linear approximation” to the CEF?
    • Very little!
  • What do we need for OLS to be consistent and unbiased for the conditional expectation function?
    • Truly linear CEF
    • Guaranteed if we have a saturated regression
    • But still no assumptions about the outcome distribution!
  • What do we need to do inference:
    • We almost never assume homoskedasticity because “robust” SE estimators are ubiquitous
    • Even some forms of error correlation are permitted (“cluster” robust SEs)
    • Sample sizes are usually large enough where CLT approximations are good.
    • Otherwise, we’ll generally use something like a Fisher randomization test.

Regression imputation in practice

  • Let’s visualize \(\mathbb{E}[Y_i(1) | X_i]\) and \(\mathbb{E}[Y_i(0) | X_i ]\) with a single covariate.
  • Is the linear CEF approximation reasonable?

Regression imputation in practice

  • What happens if we use a cubic fit?
  • Lots of sensitivity to model specification in the extremes of \(X\).

Regression imputation in practice

  • If we have little overlap between treatment and control in the covariates, our counterfactual predictions will be based heavily on extrapolations - lots of model dependence
  • Seems concerning here.

Regression imputation

  • Let’s estimate the ATE via regression imputation

    \[\hat{\tau} = \frac{1}{N}\sum_{i=1}^N \left(\widehat{Y_i(1)} - \widehat{Y_i(0)}\right)\]

  • Recall that the Lin (2013) estimator is equivalent to this!

reg_ate <- lm_lin(black_turnout ~ black, 
                  covariates = ~ pop90 + blackpop_pct1990 + 
                    unemp + college_pct + hs_pct + income + poverty + home + 
                    as.factor(year), 
                  data = turn)

tidy(reg_ate) %>% filter(term == "black") %>% dplyr::select(term, estimate, std.error, p.value)
   term estimate std.error p.value
1 black     4.07      1.59  0.0107

Regression imputation

  • What happens if we use a different model specification and add a cubic polynomial instead?
reg_ate_cube <- lm_lin(black_turnout ~ black, 
                  covariates = ~ pop90 + blackpop_pct1990 + I(blackpop_pct1990^2) + I(blackpop_pct1990^3) +
                    unemp + college_pct + hs_pct + income + poverty + home +
                    as.factor(year), 
                  data = turn)

tidy(reg_ate_cube) %>% filter(term == "black") %>% dplyr::select(term, estimate, std.error, p.value)
   term estimate std.error p.value
1 black    -2.71      2.89   0.347
  • Our estimates are really sensitive to our choice of functional form!

Regression with constant effects

  • Often you will see researchers estimate a single regression model with the form:

    \[\mathbb{E}[Y_i | D_i, X_i] = \tau D_i + X_i^{\prime}\beta\]

  • Does \(\tau\) identify the ATE? Not necessarily!

    • We need to assume the model is correct (e.g. no treatment-covariate interactions)
    • And we need to assume constant treatment effects!
  • Even if the model for the outcome is correct, under effect heterogeneity, the regression coefficient \(\tau\) is not the ATE.

    • “Regression weighting problem” (Samii and Aronow, 2016)

Regression weighting

  • Suppose the potential outcomes can be written as

    \[Y_i(d) = Y_i(0) + \tau_i \times d\]

  • The ATE is \(\mathbb{E}[\tau_i] = \tau\)

  • Suppose we then fit the typical linear regression model that you see in many research papers

    \[Y_i = \alpha + \tau_{\text{R}} D_i + X_i^{\prime}\beta + \epsilon_i\]

  • Will our OLS estimate \(\hat{\tau}_{\text{R}}\) be consistent for \(\mathbb{E}[\tau_i]\)?

    • No!

Frisch-Waugh-Lovell

  • An important regression theorem (that you will see many times) is that each coefficient in a multiple linear regression can be obtained by partialling out the other covariates from both the outcome and that regressor.

    • The Frisch-Waugh-Lovell theorem!
  • Consider estimating the following by OLS:

    \[Y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_{i}\]

  • The standard least-squares estimator for \(\beta_1\) can be expressed as the ratio of the sample covariance of \(X_{i1}\) and \(Y_i\) and the variance of \(X_{i1}\)

    \[\hat{\beta_1} = \frac{\widehat{Cov}(Y_i, X_{i1})}{\widehat{Var}(X_{i1})} \to \frac{Cov(Y_i, X_{i1})}{Var(X_{i1})}\]

Frisch-Waugh-Lovell

  • Now what happens with a multiple linear regression - how do we write \(\beta_1\) when the model is

    \[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_{i}\]

  • It turns out that \(\beta_1\) can also be written in terms of a bivariate regression involving residuals from auxiliary regressions

    \[\hat{\beta_1} = \frac{\widehat{Cov}(Y_i^{{\perp X_{i2}}}, X_{i1}^{{\perp X_{i2}}})}{\widehat{Var}(X_{i1}^{{\perp X_{i2}}})} \to \frac{Cov(Y_i^{{\perp X_{i2}}}, X_{i1}^{{\perp X_{i2}}})}{Var(X_{i1}^{{\perp X_{i2}}})}\]

  • \(Y_i^{{\perp X_{i2}}}\) is the residual from a regression of \(Y_i\) on all of the other covariates (here, just \(X_{i2}\))

  • \(X_{i1}^{{\perp X_{i2}}}\) is the residual from a regression of \(X_{i1}\) on all of the other covariates (\(X_{i2}\) here)
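The theorem can be verified numerically on simulated data: the coefficient on \(X_{i1}\) from the full regression equals the coefficient from the residuals-on-residuals regression.

```r
# Numerical check of the Frisch-Waugh-Lovell theorem
set.seed(60637)
n <- 500
X2 <- rnorm(n)
X1 <- 0.5 * X2 + rnorm(n)          # X1 correlated with X2
Y  <- 1 + 2 * X1 - 3 * X2 + rnorm(n)

# Coefficient on X1 from the full multiple regression
b1_full <- unname(coef(lm(Y ~ X1 + X2))["X1"])

# Partial X2 out of both Y and X1, then regress residuals on residuals
Y_perp  <- residuals(lm(Y ~ X2))
X1_perp <- residuals(lm(X1 ~ X2))
b1_fwl  <- unname(coef(lm(Y_perp ~ X1_perp))["X1_perp"])

all.equal(b1_full, b1_fwl)   # TRUE: the two coefficients are identical
```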

Regression weighting

  • Using the Frisch-Waugh-Lovell theorem, we can write an expression for the treatment coefficient as

    \[\tau_\text{R} = \frac{Cov(Y_i, D_i^{\perp X_i})}{Var(D_i^{\perp X_i})}\]

  • \(D_i^{\perp X_i}\) is the residual from a regression of \(D_i\) on all of the other covariates \(X_i\)

  • In this case, that’s just:

    \[\tau_\text{R} = \frac{Cov(Y_i, D_i - \mathbb{E}[D_i|X_i])}{Var(D_i - \mathbb{E}[D_i|X_i])} \]

Regression weighting

  • We can re-arrange terms (see Samii and Aronow, 2016 for the math) to get an expression for the weights placed on individual treatment effects \(\tau_i\)

    \[\tau_\text{R} = \frac{E[w_i \tau_i]}{E[w_i]}\]

    where \(w_i = (D_i - E[D_i | X_i])^2\)

  • Intuition: Effects for units whose treatment status is poorly predicted by the covariates get more weight.

    • Why? Because OLS is a minimum-variance estimator - estimates are more precise for strata where there is more residual variation in \(D_i\)
  • The typical multiple regression estimator does not recover the ATE under effect heterogeneity.

    • But in standard selection-on-observables settings, these weights are at least non-negative
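A small simulation makes the weighting problem concrete. With two equal-sized strata, stratum effects \(4\) and \(0\), and propensity scores \(0.9\) and \(0.5\) (illustrative choices), the ATE is \(2\), but the implied estimand \(E[w_i\tau_i]/E[w_i]\) with weights proportional to \(Var(D_i|X_i)\) is about \(1.06\):

```r
# Simulation: with heterogeneous effects, the regression coefficient is a
# variance-weighted average of stratum effects, not the ATE
set.seed(60637)
n <- 100000
X <- rbinom(n, 1, 0.5)               # two equal-sized strata
e <- ifelse(X == 1, 0.9, 0.5)        # propensity score by stratum
D <- rbinom(n, 1, e)
tau_i <- ifelse(X == 1, 4, 0)        # stratum effects; ATE = 2
Y <- X + tau_i * D + rnorm(n)

tau_R <- unname(coef(lm(Y ~ D + X))["D"])

# Implied estimand: weights proportional to Var(D | X) = e(1 - e)
w1 <- 0.9 * 0.1; w0 <- 0.5 * 0.5
tau_implied <- (0.5 * w1 * 4 + 0.5 * w0 * 0) / (0.5 * w1 + 0.5 * w0)
c(tau_R = tau_R, implied = tau_implied, ATE = 2)   # tau_R is near 1.06, not 2
```

The stratum whose treatment is hardest to predict (propensity 0.5, variance 0.25) dominates, pulling the coefficient toward its effect of 0.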

Beware the Table 2 fallacy

Fearon and Laitin (2003) “Ethnicity, Insurgency, and Civil War”

Beware the Table 2 fallacy

  • It is best to design a regression to estimate one effect.

  • Don’t read all the coefficients from a regression as causal!

  • Intuition: We’ve selected \(X_i\) that affect our treatment \(D\).

    • So the coefficient on \(X_i\) in our regression will - by construction - suffer from post-treatment bias since we’re controlling for \(D\) in that same regression!
    • We also justified our choices of \(X_i\) based on what affects \(D\), but the confounders for \(D\) are not the same as the confounders for some other \(X\)
    • Effect heterogeneity also messes this up – remember FWL!
  • For more, see

    Westreich, Daniel, and Sander Greenland. “The table 2 fallacy: presenting and interpreting confounder and modifier coefficients.” American journal of epidemiology 177.4 (2013): 292-298.

    Hünermund, Paul, and Beyers Louw. “On the Nuisance of Control Variables in Causal Regression Analysis.” Organizational Research Methods (2023)

Beware the Table 2 fallacy

  • Depending on the DAG, you might get multiple effects out of a single analysis, but it requires a very specific set of assumptions.

Propensity scores

Propensity scores

  • As an alternative to modelling the outcome, can we instead model the treatment?

    • Is there a single scalar quantity that we can adjust for regardless of how high-dimensional \(\mathbf{X}\) is?
  • Yes: The propensity score

    \[e(X_i) = P(D_i = 1 | X_i)\]

  • Rosenbaum and Rubin (1983) show that conditional on the propensity score, the treatment is independent of the covariates.

    \[D_i {\perp \! \! \! \perp} X_i | e(X_i)\]

  • Therefore, conditioning on the propensity score suffices to adjust for confounding due to \(X\)

    \[\{Y_i(1), Y_i(0)\} {\perp \! \! \! \perp} D_i | e(X_i)\]

  • The propensity score is a type of balancing score in that conditioning on it breaks the relationship between the covariates and treatment.

Weighting estimators

  • One way of adjusting for the propensity score is to use it as a weight on each observation.

  • We weight each unit by the inverse propensity of it receiving its particular treatment

    • Treated units receive a weight \(\frac{1}{e(X_i)}\)
    • Control units receive a weight \(\frac{1}{1 - e(X_i)}\)
  • This yields the Horvitz-Thompson estimator for the ATE (with known weights)

    \[\hat{\tau} = \frac{1}{n}\sum_{i=1}^n\ \frac{D_i Y_i}{e(X_i)} - \frac{(1-D_i)Y_i}{1 - e(X_i)}\]

  • Intuition: Weighting constructs a “pseudo-population” where the link between treatment and the observed covariates is broken

    • Which treated units get more weight? Those whose treatment status is counter to what we would predict from the propensity score.
    • Downweight the “overrepresented” and upweight the “underrepresented”
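A minimal simulation of this pseudo-population logic, with the true propensity score known (purely illustrative DGP):

```r
# Horvitz-Thompson IPW with the true propensity score recovers the ATE
set.seed(60637)
n <- 100000
X <- rnorm(n)
e <- plogis(0.5 * X)                 # true propensity score
D <- rbinom(n, 1, e)
Y <- X + 2 * D + rnorm(n)            # true ATE = 2

naive  <- mean(Y[D == 1]) - mean(Y[D == 0])      # confounded by X
tau_ht <- mean(D * Y / e - (1 - D) * Y / (1 - e))
c(naive = naive, ipw = tau_ht)       # naive is biased upward; ipw is near 2
```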

Why weighting works

  • By iterated expectations

    \[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[E\left[\frac{D_i Y_i}{e(X_i)} \bigg| X_i \right]\bigg]\]

  • Consistency

    \[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[\frac{E[D_i Y_i(1) | X_i]}{e(X_i)} \bigg]\]

  • Conditional ignorability

    \[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[\frac{E[D_i |X_i] E[Y_i(1) | X_i]}{e(X_i)} \bigg]\]

Why weighting works

  • Definition of the propensity score

    \[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E\bigg[\frac{e(X_i) E[Y_i(1) | X_i]}{e(X_i)} \bigg]\]

  • Iterated expectations

    \[E\left[\frac{D_i Y_i}{e(X_i)}\right] = E[Y_i(1)]\]

Normalizing the propensity score

  • Horvitz-Thompson behaves especially poorly when propensity scores are close to \(0\) and \(1\)

    \[\hat{\tau}_{ht} = \frac{1}{n}\sum_{i=1}^n\ \frac{D_i Y_i}{\widehat{e(X_i)}} - \frac{(1-D_i)Y_i}{1 - \widehat{e(X_i)}}\]

  • Common practice is to normalize the weights within the treated and control groups to sum to \(1\), which gives us the Hajek estimator

    \[\hat{\tau}_{\text{hajek}} = \frac{1}{\sum_{i=1}^n \frac{D_i}{\widehat{e(X_i)}}} \sum_{i=1}^n\ \frac{D_i Y_i}{\widehat{e(X_i)}} - \frac{1}{\sum_{i=1}^n \frac{1-D_i}{1-\widehat{e(X_i)}}} \sum_{i=1}^n\ \frac{(1-D_i) Y_i}{1-\widehat{e(X_i)}}\]

  • Weighted least squares will do this if we regress the outcome on a binary indicator for treatment and supply the inverse propensity weights.

Inference

  • We want to estimate the sampling variance \(\text{Var}(\hat{\tau})\) in order to do inference
    • One option is to assume the weights are known and use a conventional “robust” SE estimator.
    • But this fails to account for the uncertainty in the estimation of the propensity scores themselves (Reifeis and Hudgens, 2022).
  • Closed form estimators do exist…but it’s often easier to just use the bootstrap

Bootstrapping

  • Bootstrapping is a useful tool for learning about the sampling distribution of a parameter when we don’t want to use theory to figure out \(\widehat{\text{Var}(\hat{\tau})}\)
    • Works well for estimators that are smooth functionals of the data (they don’t change drastically under minor perturbations of the input)
  • Intuition: What drives uncertainty in \(\hat{\tau}\) when we’re doing population inference?
    • If we took another sample, we would calculate a different value.
    • We can’t go and re-draw a sample from the population again (that’s just a hypothetical).
    • But our sample is a pretty good approximation for the population distribution…and we can resample from that.
  • Bootstrap procedures approximate the distribution of \(\hat{\tau}\) under repeated draws from the population by taking repeated draws from the empirical distribution of the data in the sample.
    • Under i.i.d. observations, we’ll often use the “nonparametric” bootstrap in which we resample complete observations with replacement

The bootstrap

  • The bootstrap procedure for IPTW:
    1. Draw a sample of the data by resampling observations (the rows of the data \(\{Y_i, D_i, X_i\}\)) with replacement
    2. Using the bootstrapped sample, estimate the propensity scores \(\hat{e}(X_i)\).
    3. Using those propensity scores and the bootstrapped sample, compute an estimate of the average treatment effect \(\hat{\tau}\). Store that estimate
    4. Repeat for many iterations.
  • At the end, we have a bunch of draws that allow us to approximate the sampling distribution of \(\hat{\tau}\).

Estimating the propensity scores

  • With high-dimensional \(X\), we’ll often need a parametric model for the propensity score

    • We need to predict a probability - but OLS might give us fitted values above one or below zero!
  • We don’t want to assume that the CEF is linear.

    \[\mathbb{E}[Y_i| X_i] = X_i^\prime\beta\]

  • Instead, we might assume that a function of the CEF is linear

    \[g(\mathbb{E}[Y_i | X_i]) = X_i^\prime\beta\]

  • This is where generalized linear models come in.

Generalized Linear Models

  • Generalized linear models (GLMs) have three components:
    1. A parametric distribution on \(Y_i|X_i\) (“stochastic component”)

    2. A linear predictor: \(\eta_i = X_i^{\prime}\beta = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \dots + \beta_kX_{ik}\) (“systematic component”)

    3. A link function \(g()\) applied to the CEF \(E[Y_i|X_i]\) that yields the linear predictor

      \[g(E[Y_i | X_i]) = \eta_i\]

    • Alternatively, we can write the CEF in terms of the “inverse-link” \(g^{-1}()\) applied to the linear predictor

      \[E[Y_i | X_i] = g^{-1}(\eta_i)\]

  • Consistent estimation of the parameters via maximum likelihood methods (we’ll cover this in PS 818).

Logistic regression

  • The “logit” or “logistic” GLM models the log-odds of a binary outcome as a function of the linear predictor \(X_i^{\prime}\beta\)

    \[E[Y_i | X_i] = Pr(Y_i = 1 | X_i) = \pi_i\]

    \[\log\bigg(\frac{\pi_i}{1-\pi_i}\bigg) = X_i^{\prime}\beta\]

  • Alternatively, this is written in terms of the “inverse-link” function (the logistic function) that relates \(\pi_i\) to \(g^{-1}(X_i^{\prime}\beta)\)

    \[\pi_i = \frac{\exp(X_i^{\prime}\beta)}{1 + \exp(X_i^{\prime}\beta)} = \frac{1}{1 + \exp(-X_i^{\prime}\beta)}\]

The logistic function

  • The logistic function maps inputs on \(\mathbb{R}\) to \((0, 1)\)
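In R, this inverse-link is built in as `plogis()`, with `qlogis()` as the logit link:

```r
# The logistic function maps any real number into (0, 1)
eta <- c(-5, 0, 5)
plogis(eta)                                   # approx 0.0067, 0.5, 0.9933
all.equal(plogis(eta), 1 / (1 + exp(-eta)))   # TRUE: same function
all.equal(qlogis(plogis(eta)), eta)           # TRUE: qlogis() inverts it
```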

Illustration: Keele et al. (2017)

  • Let’s go back to our earlier example on the effect of Black mayoral candidates on Black turnout
    • Let’s work here with the subset of elections with Black population shares between \(.1\) and \(.4\) (otherwise we saw we had terrible overlap!)
    • Incidentally, we run into some modelling issues with year fixed effects (essentially only one observation in some off-cycle years), so let’s just fit a linear time trend.
  • First, what’s the regression-adjusted estimate
# Subset
turn_sub <- turn %>% filter(blackpop_pct1990 > .1 & blackpop_pct1990 < .4)

# Regression adjustment
regression <- lm_lin(black_turnout ~ black,
       covariates = ~  pop90 + blackpop_pct1990 +
                    college_pct + hs_pct + unemp + income + poverty + home + 
                    year, data = turn_sub)

tidy(regression) %>% filter(term == "black")
   term estimate std.error statistic     p.value conf.low conf.high  df
1 black 4.219744  1.297229  3.252891 0.001208061  1.67196  6.767528 586
        outcome
1 black_turnout

Illustration: Keele et al. (2017)

  • Now let’s fit the propensity score model
# Fit a propensity score model - we'll use a logit
pscore_model <- glm(black ~ pop90 + blackpop_pct1990 +
                    college_pct + hs_pct + unemp + income + poverty + home + 
                    year, data=turn_sub, family= binomial(link='logit'))

# Predict the propensity score
turn_sub$e <- predict(pscore_model, type="response")

# Generate the weights
turn_sub$iptw_weight <- 1/turn_sub$e
turn_sub$iptw_weight[turn_sub$black == 0] <- 1/(1-turn_sub$e[turn_sub$black == 0])

Illustration: Keele et al. (2017)

  • Histogram of the weights.

Illustration: Keele et al. (2017)

  • Did we improve balance?

Illustration: Keele et al. (2017)

  • Calculate IPTW estimate of the ATE
# Weighted regression is equivalent to the Hajek estimator
iptw_model <- lm_robust(black_turnout ~ black, data=turn_sub, weights=iptw_weight)
coef(iptw_model)[2] # Ignore the SE
   black 
4.270128 
# Equivalently, compute the Hajek estimator directly as a difference in weighted means
iptw_est <- weighted.mean(turn_sub$black_turnout[turn_sub$black == 1], turn_sub$iptw_weight[turn_sub$black == 1]) -
  weighted.mean(turn_sub$black_turnout[turn_sub$black == 0], turn_sub$iptw_weight[turn_sub$black == 0])
iptw_est
[1] 4.270128

Illustration: Keele et al. (2017)

  • Bootstrap
set.seed(53706)
nIter <- 1000
boot_iptw <- rep(NA,nIter)
for (i in 1:nIter){
  turn_boot <- turn_sub[sample(1:nrow(turn_sub), size = nrow(turn_sub), replace=T),]
   # Fit a propensity score model
  pscore_model_boot <- glm(black ~ pop90 + blackpop_pct1990 + 
                      unemp + college_pct + hs_pct + income + poverty + home + 
                      year, data=turn_boot, family= binomial(link='logit'))
  
  # Predict the propensity score
  turn_boot$e <- predict(pscore_model_boot, type="response")
  
  # Generate the weights
  turn_boot$iptw_weight <- 1/turn_boot$e
  turn_boot$iptw_weight[turn_boot$black == 0] <- 1/(1-turn_boot$e[turn_boot$black == 0])
  
  # Fit a model
  iptw_model_boot <- lm(black_turnout ~ black, data=turn_boot, weights=iptw_weight)
  boot_iptw[i] <- coef(iptw_model_boot)[2]
}

Illustration: Keele et al. (2017)

  • Bootstrap SE and percentile intervals
# Estimated standard error
boot_se <- sd(boot_iptw)
boot_se
[1] 1.460552
# Asymptotic CI
c(iptw_est - qnorm(.975)*boot_se, iptw_est + qnorm(.975)*boot_se)
[1] 1.407497 7.132758
# Percentile interval
quantile(boot_iptw, c(.025, .975))
    2.5%    97.5% 
1.294929 6.957699 

Augmented IPTW

  • In both of these approaches, we need to get a model correct in order for our estimator to be consistent for the ATE

  • In outcome regression, we need to get \(m_d(X_i) = \mathbb{E}[Y_i | D_i = d, X_i]\) correct

    \[\hat{\tau}_{\text{reg}} = \frac{1}{n}\sum_{i=1}^n \bigg\{\hat{m}_1(X_i) - \hat{m}_0(X_i) \bigg\}\]

  • In inverse-propensity of treatment weighting, we need to get \(e(X_i) = P(D_i = 1 | X_i)\) correct

    \[\hat{\tau}_{\text{ht}} = \frac{1}{n}\sum_{i=1}^n\ \frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1-D_i)Y_i}{1 - \hat{e}(X_i)}\]

  • Could we construct an estimator that’s still consistent even if we get one of these wrong?

Augmented IPTW

  • The Augmented IPTW estimator combines the regression estimator with an IPW estimator that replaces \(Y_i\) with the residuals \(Y_i - \hat{m}_d(X_i)\)

    \[\hat{\tau}_{\text{aipw-ht}} = \frac{1}{n}\sum_{i=1}^n\bigg\{\bigg(\hat{m}_1(X_i) - \hat{m}_0(X_i)\bigg) + \bigg(\frac{D_i (Y_i - \hat{m}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{m}_0(X_i))}{1 - \hat{e}(X_i)}\bigg)\bigg\}\]

  • This is doubly-robust in the sense that it’s consistent for the ATE if either

    • The outcome models \(\hat{m}_1(X_i)\) and \(\hat{m}_0(X_i)\) are consistent for the true outcome models
    • The propensity score model \(\hat{e}(X_i)\) is consistent for the true propensity score
  • As with IPW, we’ll usually use the Hajek estimator for the IPW part, with the weights normalized to sum to 1 within each treatment condition, to avoid instability with propensity scores near 0 or 1.

Augmented IPTW

  • Consider this form of the AIPW estimator. What happens if the outcome model is correct?

    • The IPW part is mean zero!

    \[\hat{\tau}_{\text{aipw-ht}} = \frac{1}{n}\sum_{i=1}^n\bigg\{\bigg(\hat{m}_1(X_i) - \hat{m}_0(X_i)\bigg) + \bigg(\frac{D_i (Y_i - \hat{m}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{m}_0(X_i))}{1 - \hat{e}(X_i)}\bigg)\bigg\}\]

  • Alternatively, we can rearrange the terms to write the estimator in a different form.

    • What happens if the treatment model is correct? The regression part is mean zero!

    \[\hat{\tau}_{\text{aipw-ht}} = \frac{1}{n}\sum_{i=1}^n\bigg\{\bigg(\frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1-D_i)Y_i}{1 - \hat{e}(X_i)}\bigg)+ \bigg(\hat{m}_1(X_i)\bigg(1 - \frac{D_i}{\hat{e}(X_i)}\bigg) - \hat{m}_0(X_i)\bigg(1 - \frac{1 -D_i}{1 -\hat{e}(X_i)}\bigg)\bigg)\bigg\}\]

Augmented IPTW

  • Let’s implement this!
# Outcome models
turn_sub$mu_1 <- predict(regression, newdata = turn_sub %>% mutate(black = 1))
turn_sub$mu_0 <- predict(regression, newdata = turn_sub %>% mutate(black = 0))

# AIPW estimator
aipw <- mean(turn_sub$mu_1 - turn_sub$mu_0) + # Regression (we could just pull this from the Lin estimate)
  (weighted.mean(turn_sub$black_turnout[turn_sub$black == 1] - turn_sub$mu_1[turn_sub$black == 1], turn_sub$iptw_weight[turn_sub$black == 1]) - # Hajek IPW
  weighted.mean(turn_sub$black_turnout[turn_sub$black == 0] - turn_sub$mu_0[turn_sub$black == 0], turn_sub$iptw_weight[turn_sub$black == 0]))

aipw
[1] 3.81323

Augmented IPTW

  • Bootstrap
set.seed(53706)
nIter <- 1000
boot_aipw <- rep(NA,nIter)
for (i in 1:nIter){
  turn_boot <- turn_sub[sample(1:nrow(turn_sub), size = nrow(turn_sub), replace=T),]
   # Fit a propensity score model
  pscore_model_boot <- glm(black ~ pop90 + blackpop_pct1990 + 
                      unemp + college_pct + hs_pct + income + poverty + home + 
                      year, data=turn_boot, family= binomial(link='logit'))
  
  # Predict the propensity score
  turn_boot$e <- predict(pscore_model_boot, type="response")
  
  # Generate the weights
  turn_boot$iptw_weight <- 1/turn_boot$e
  turn_boot$iptw_weight[turn_boot$black == 0] <- 1/(1-turn_boot$e[turn_boot$black == 0])
  
  # Fit a regression model
  regression_boot <- lm_lin(black_turnout ~ black,
       covariates = ~  pop90 + blackpop_pct1990 +
                    college_pct + hs_pct + unemp + income + poverty + home + 
                    year, data = turn_boot)
  
  turn_boot$mu_1 <- predict(regression_boot, newdata = turn_boot %>% mutate(black = 1))
  turn_boot$mu_0 <- predict(regression_boot, newdata = turn_boot %>% mutate(black = 0))

  # AIPW
  boot_aipw[i] <-  mean(turn_boot$mu_1 - turn_boot$mu_0) + # Regression (we could just pull this from the Lin estimate)
  (weighted.mean(turn_boot$black_turnout[turn_boot$black == 1] - turn_boot$mu_1[turn_boot$black == 1], turn_boot$iptw_weight[turn_boot$black == 1]) - # Hajek IPW
  weighted.mean(turn_boot$black_turnout[turn_boot$black == 0] - turn_boot$mu_0[turn_boot$black == 0], turn_boot$iptw_weight[turn_boot$black == 0]))
}

Illustration: Keele et al. (2017)

  • Bootstrap SE and percentile intervals
# Estimated standard error
boot_se_aipw <- sd(boot_aipw)
boot_se_aipw
[1] 1.431131
# Asymptotic CI
c(aipw - qnorm(.975)*boot_se_aipw, aipw + qnorm(.975)*boot_se_aipw)
[1] 1.008264 6.618196
# Percentile interval
quantile(boot_aipw, c(.025, .975))
    2.5%    97.5% 
1.046829 6.508168 

Summary

  • Outcome regression and inverse propensity weighting are two different strategies for adjusting for confounders \(X_i\)
    • One approach models the mean of \(Y_i\) given \(X_i\) within each treatment arm
    • The other models the probability of treatment given \(X_i\)
  • Under correct model specification, both approaches are consistent for the ATE
    • Careful with the high variance/instability of IPW - especially with propensity scores close to \(1\) or \(0\).
  • We can get an added double-robustness property by combining the two - Augmented Inverse Propensity Weighting (AIPW) estimator
    • Connected to the residuals-on-residuals regression
    • You’ll see it in the context of “double machine learning”

Next week

  • One more method for adjusting for covariates - matching
    • For every treated unit, find the closest control
    • “Pre-processing” or “Pruning” the data
  • Can think of this as a very basic machine learning method!
    • No strong modelling assumptions
  • But beware of finite sample biases!
    • Matching on more than one covariate w/o additional regression adjustment creates problems!