PS 813 - Causal Inference
March 2, 2026
We want to estimate the sample average treatment effect
\[\tau = \frac{1}{N}\sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big)\]
Many estimators \(\hat{\tau}\) can be written as differences in imputed potential outcomes
\[\hat{\tau} = \frac{1}{N} \sum_{i=1}^N \big(\widehat{Y}_i(1) - \widehat{Y}_i(0)\big)\]
Previously, we used regression to generate the imputations
An alternative approach is to use some notion of “closeness” to impute the potential outcomes.
For unit \(i\), if it’s treated (\(D_i = 1\)), we observe \(Y_i(1) = Y_i\) and impute \(\widehat{Y}_i(0)\) from the outcomes of “close” control units.
Vice-versa for the control units.
Central question: How do we decide what “close” means?
We need to choose a distance metric \(Q_{ij}\) between pairs of units.
Common metrics:
Exact: \(Q_{ij} = 0\) if \(X_i = X_j\) and \(Q_{ij} = \infty\) if \(X_i \neq X_j\)
Standardized Euclidean:
\[Q_{ij} = \sqrt{\sum_{k=1}^K \frac{(X_{ik} - X_{jk})^2}{s_k^2}}\]
Most common: Mahalanobis:
\[Q_{ij} = \sqrt{(X_i - X_j)^{\prime}S^{-1}(X_i - X_j)}\]
where \(S\) is the sample variance-covariance matrix.
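The Mahalanobis distance can be computed directly in base R. A minimal sketch on made-up data (the covariate matrix `X` and all values here are hypothetical):

```r
# Mahalanobis distance between two units, given a covariate matrix X
# (rows = units) and the sample variance-covariance matrix S.
set.seed(42)
X <- matrix(rnorm(100 * 3), ncol = 3)   # 100 hypothetical units, K = 3
S <- cov(X)                             # sample variance-covariance matrix

mahal_dist <- function(i, j, X, S) {
  d <- X[i, ] - X[j, ]
  sqrt(drop(t(d) %*% solve(S) %*% d))   # sqrt((X_i - X_j)' S^{-1} (X_i - X_j))
}

d12 <- mahal_dist(1, 2, X, S)
# Note: base R's mahalanobis() returns the *squared* distance
d12_sq_base <- as.numeric(mahalanobis(X[1, ], X[2, ], S))
```

One design note: base R's `mahalanobis()` returns the squared distance, so the two quantities agree after squaring.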
For a treated unit with \(D_i = 1\), we impute the potential outcomes as:
\[\widehat{Y}_i(1) = Y_i\]
\[\widehat{Y}_i(0) = \frac{1}{M} \sum_{j \in \mathcal{J}_M(i)} Y_j\]
where \(\mathcal{J}_M(i)\) is the set of \(M\) closest matches to \(i\) among the control observations.
Do the same for the controls (but impute \(\widehat{Y}_i(1)\) using matched treated units)
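The imputation scheme above can be sketched in a few lines of base R. This is a hypothetical \(M = 1\) nearest-neighbor example on a single standardized covariate (the data-generating process here is made up for illustration, not the one used later in the slides):

```r
# M = 1 nearest-neighbor matching (with replacement): impute the missing
# potential outcome for each unit from its closest opposite-arm match,
# then average the imputed differences to estimate the ATE.
set.seed(813)
N <- 400
X <- rnorm(N)
D <- rbinom(N, 1, plogis(X))     # treatment depends on X (confounding)
Y <- 2 * D + X + rnorm(N)        # true effect = 2

Xs <- X / sd(X)                  # standardize so distances are scale-free
impute <- function(i) {
  pool <- which(D != D[i])                     # opposite treatment arm
  j <- pool[which.min(abs(Xs[pool] - Xs[i]))]  # closest match to unit i
  Y[j]
}
Yhat1 <- ifelse(D == 1, Y, sapply(seq_len(N), impute))  # observed or matched
Yhat0 <- ifelse(D == 0, Y, sapply(seq_len(N), impute))
tau_hat <- mean(Yhat1 - Yhat0)
```

With confounding through \(X\), the naive difference in means is biased upward, while the matched estimate lands near the true effect of 2.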
We can think of matching as a kind of weighting estimator that assigns a weight of \(1 + \frac{K_M(i)}{M}\) to each unit, where \(K_M(i)\) is the number of times unit \(i\) is used as a match.
\[\hat{\tau}^m_M = \frac{1}{N}\sum_{i=1}^N (2D_i - 1) \bigg(1 + \frac{K_M(i)}{M}\bigg) Y_i\]
In many settings where we might want to use matching, we have a handful of treated units and many controls.
So instead of trying to estimate the ATE, we could try to estimate the ATT instead - using the controls only to impute.
\[\hat{\tau}^m_{\text{ATT}} = \frac{1}{N_t}\sum_{i: D_i = 1} \big(Y_i - \widehat{Y}_i(0)\big)\]
In terms of the “matching weights”, this is equivalent to
\[\hat{\tau}^m_{\text{ATT}} = \frac{1}{N}\sum_{i=1}^N \bigg(D_i - (1 - D_i)\frac{K_M(i)}{M}\bigg) Y_i\]
ATT in an observational study is often the more policy-relevant quantity
Unless matching is exact, Abadie and Imbens (2006) show that matching exhibits a bias.
\[B_M = \frac{1}{N}\sum_{i=1}^N (2D_i - 1) \bigg[\frac{1}{M} \sum_{m=1}^M \big(\mu_{1-D_i}(X_i) - \mu_{1-D_i}(X_{\mathcal{J}_m(i)})\big)\bigg]\]
where \(\mu_1(X_i) = E[Y_i(1) | X_i]\) and \(\mu_0(X_i) = E[Y_i(0) | X_i]\) are the CEFs of the two potential outcomes.
Intuitively, the bias term captures the differences in the conditional expectation function between observation \(i\)’s covariates and the covariates of its \(M\) matches in \(\mathcal{J}_M(i)\).
But does this bias go away in large samples?
Let’s construct a simulation with confounding. Start with \(K=8\) i.i.d. covariates \(X_1, X_2, \dotsc X_K\) each distributed \(\mathcal{N}(0,1)\).
Treatment probability is modeled as a logit
\[\log\bigg(\frac{e(X_i)}{1-e(X_i)}\bigg) = \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \dotsb + \beta_K X_K\]
We’ll assume the coefficients are \(\beta_k = \frac{1}{k}\)
Outcome is linear w/ same coefficients \(\beta_k\) and a constant treatment effect of \(2\)
\[Y_i = 2D_i + \mathbf{X}_i^{\prime}\beta + \epsilon_i\]
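The data-generating process described above can be sketched directly (sample size `N` and the seed are arbitrary choices here, not from the slides):

```r
# Sketch of the simulation DGP: K = 8 i.i.d. standard-normal covariates,
# logit treatment assignment with beta_k = 1/k, and a linear outcome with
# the same coefficients plus a constant treatment effect of 2.
set.seed(2026)
N <- 1000
K <- 8
beta <- 1 / (1:K)                       # beta_k = 1/k
X <- matrix(rnorm(N * K), ncol = K)     # covariates
e_X <- plogis(drop(X %*% beta))         # true propensity score e(X_i)
D <- rbinom(N, 1, e_X)                  # treatment assignment
Y <- drop(2 * D + X %*% beta + rnorm(N))
```

Because treatment probability and the outcome share the same covariates, a naive difference in means will be confounded, which is what makes this a useful test bed for matching.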
Instead of imputing with just the average of the matched outcomes, Abadie and Imbens (2011) propose a “bias-corrected” imputation.
For \(D_i = 1\)
\[\widehat{Y}_i(1) = Y_i\]
\[\widehat{Y}_i(0) = \frac{1}{M}\sum_{j \in \mathcal{J}_M(i)} \big(Y_j + \hat{\mu}_0(X_i) - \hat{\mu}_0(X_j)\big)\]
For \(D_i = 0\)
\[\widehat{Y}_i(0) = Y_i\]
\[\widehat{Y}_i(1) = \frac{1}{M}\sum_{j \in \mathcal{J}_M(i)} \big(Y_j + \hat{\mu}_1(X_i) - \hat{\mu}_1(X_j)\big)\]
Intuition: we combine regression and matching! The regression models adjust for the residual imbalance that matching doesn’t remove, while matching limits the consequences of regression model misspecification.
Ho, Daniel E., et al. “Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference.” Political Analysis 15.3 (2007): 199–236.
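A minimal sketch of the bias-corrected imputation for the ATT with \(M = 1\), using a made-up one-covariate DGP (the data, sample size, and linear specification for \(\hat{\mu}_0\) are illustrative assumptions, not the slides’ simulation):

```r
# Bias-corrected imputation (Abadie and Imbens 2011) for treated units:
# adjust each matched control outcome by the fitted control-arm CEF
# evaluated at X_i versus at the match's X_j.
set.seed(7)
N <- 500
X <- rnorm(N)
D <- rbinom(N, 1, plogis(X))
Y <- 2 * D + X + rnorm(N)                 # true (constant) effect = 2

mu0_fit <- lm(Y ~ X, subset = D == 0)     # \hat{mu}_0 fit on controls only
treated  <- which(D == 1)
controls <- which(D == 0)

# nearest control match for each treated unit (M = 1, with replacement)
match_j <- sapply(treated, function(i) {
  controls[which.min(abs(X[controls] - X[i]))]
})

# \hat{Y}_i(0) = Y_j + \hat{mu}_0(X_i) - \hat{mu}_0(X_j)
adj <- predict(mu0_fit, newdata = data.frame(X = X[treated])) -
       predict(mu0_fit, newdata = data.frame(X = X[match_j]))
Y0_hat <- Y[match_j] + adj
tau_att_bc <- mean(Y[treated] - Y0_hat)
```

The regression-based adjustment term `adj` is exactly the \(\hat{\mu}_0(X_i) - \hat{\mu}_0(X_j)\) correction from the formulas above, applied match by match.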
Abadie and Imbens (2008) show that the standard pairs bootstrap doesn’t work well for variance estimation.
Otsu and Rai (2017) develop a weighted bootstrap using the residuals from the linearized version of the matching estimator
\[\hat{\tau} = \frac{1}{N}\sum_{i=1}^N (2D_i -1) \bigg(1 + \frac{K_M(i)}{M}\bigg) Y_i = \frac{1}{N}\sum_{i=1}^N \tilde{\tau}_i\]
Alternatively, Abadie and Imbens (2006) derive the asymptotic variance and implement a matching estimator for the outcome variance terms.
The Matching package implements this estimator. For post-matching inference when matching without replacement, Abadie and Spiess (2020) show that matching induces dependence within matched sets.
A lot of other methods have been developed (primarily in the medicine/biostats literature) that are essentially extensions of matching without replacement
Optimal matching
Full matching
Other parts of the matching literature try tweaking the distance metric
Genetic matching (Diamond and Sekhon, 2013)
Estimate... 0.48066
AI SE...... 1.3318
T-stat..... 0.36093
p.val...... 0.71815
Original number of observations.............. 1006
Original number of treated obs............... 356
Matched number of observations............... 356
Matched number of observations (unweighted). 371
Estimate... 5.698
AI SE...... 1.4385
T-stat..... 3.9612
p.val...... 7.4571e-05
Original number of observations.............. 1006
Original number of treated obs............... 356
Matched number of observations............... 356
Matched number of observations (unweighted). 358
Number of obs dropped by 'exact' or 'caliper' 0
match_results_bc <- Matching::Match(
  Y = turn$black_turnout, Tr = turn$black,
  X = turn %>% dplyr::select(year, pop90, blackpop_pct1990,
                             college_pct, hs_pct,
                             unemp, income, poverty, home),
  exact = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  M = 1, Weight = 2,            # Weight = 2: Mahalanobis distance
  estimand = "ATT", BiasAdjust = TRUE)
Estimate... -1.028
AI SE...... 1.5732
T-stat..... -0.65343
p.val...... 0.51348
Original number of observations.............. 1006
Original number of treated obs............... 356
Matched number of observations............... 356
Matched number of observations (unweighted). 358
Number of obs dropped by 'exact' or 'caliper' 0
PS 813 - University of Wisconsin - Madison