PS 813 - Causal Inference
April 27, 2026
We observe an outcome \(Y_i\) and a \(K \times 1\) vector of predictors \(X_i\) (one element of which is a constant, the “intercept”) for each of \(N\) units
Consider the ordinary least squares estimator
\[\hat{\beta} = \bigg(\frac{1}{N}\sum_{i=1}^N X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{N}\sum_{i=1}^N X_iY_i\bigg) = (\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime Y\]
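As a concrete check of this formula, a minimal numpy sketch (the design, coefficients, and seed below are illustrative, not from the course):

```python
import numpy as np

# Illustrative data: N units, K predictors including an intercept column
rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=N)

# beta_hat = (X'X)^{-1} X'Y; solve() avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```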
When is OLS consistent for the true conditional expectation function \(\mathbb{E}[Y_i | X_i]\)?
\[Y_i = X_i^\prime \beta + \epsilon_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \dotsc + \beta_K X_{iK} + \epsilon_i\]
\[\mathbb{E}[\epsilon \mid \mathbf{X}] = 0\]
That is, OLS is consistent for the CEF when the CEF is linear in \(X_i\) and the errors are mean-zero conditional on the regressors (exogeneity).
The fitted values from OLS can be written as a linear projection of \(Y\) onto the column space of \(\mathbf{X}\)
\[\hat{Y} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime Y = \mathbf{H}Y\]
\(\mathbf{H} = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime\) is the projection matrix or hat matrix (“hat” because it puts the hat on \(Y\)).
\(\mathbf{H}\) is symmetric (\(\mathbf{H}^\prime = \mathbf{H}\)).
\(\mathbf{H}\) is idempotent (\(\mathbf{H}\mathbf{H} = \mathbf{H}\)).
One way to think about the values of \(\mathbf{H}\) is that they define the weights that generate \(\hat{Y}\) from \(Y\):
\[\hat{Y}_i = \sum_{j=1}^N h_{ij} Y_j\]
Note that these weights are not convex: the rows of \(\mathbf{H}\) sum to one when the model includes an intercept, but individual \(h_{ij}\) can be negative.
The diagonal elements \(h_{ii}\) of the matrix are referred to as the leverage of each observation
“High leverage” observations are those that are extreme in the covariate space.
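Continuing the sketch above, we can verify these properties and pull out the leverages directly:

```python
# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y                                   # Y_hat_i = sum_j h_ij Y_j

assert np.allclose(H, H.T)                      # symmetric
assert np.allclose(H @ H, H)                    # idempotent
leverage = np.diag(H)                           # h_ii: leverage of observation i
assert np.isclose(leverage.sum(), X.shape[1])   # trace(H) = K
```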
We have an estimator, \(\hat{\beta}\), and we want to estimate its variance. Two questions: what is the true conditional variance \(\text{Var}(\hat{\beta} \mid \mathbf{X})\), and how can we estimate it from the data?
Let’s first derive the variance. Plugging \(Y = \mathbf{X}\beta + \epsilon\) into the OLS formula:
\[\hat{\beta} - \beta = (\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime\epsilon\]
Taking the conditional variance:
\[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime \, \mathbb{E}[\epsilon\epsilon^\prime \mid \mathbf{X}] \, \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\]
Define \(\mathbf{\Sigma} \equiv \mathbb{E}[\epsilon\epsilon^\prime \mid \mathbf{X}]\) - the \(N \times N\) variance-covariance matrix of the errors:
\[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} \, \mathbf{X}^\prime \mathbf{\Sigma} \mathbf{X} \, (\mathbf{X}^\prime\mathbf{X})^{-1}\]
Equivalently:
\[\text{Var}(\hat{\beta} \mid \mathbf{X}) = \frac{1}{N}\bigg(\frac{1}{N}\sum_{i=1}^N X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \sigma_{ij} X_iX_j^\prime\bigg)\bigg(\frac{1}{N}\sum_{i=1}^N X_iX_i^\prime\bigg)^{-1}\]
where \(\sigma_{ij} = \text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X})\) is the \((i, j)\) entry of \(\mathbf{\Sigma}\).
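A quick numerical check that the matrix and summation forms agree (continuing the sketch, with an arbitrary illustrative \(\mathbf{\Sigma}\)):

```python
# Any valid N x N error covariance will do for the check
Sigma = np.diag(rng.uniform(0.5, 2.0, size=N))

bread = np.linalg.inv(X.T @ X)
V_matrix = bread @ (X.T @ Sigma @ X) @ bread

Abar = (X.T @ X) / N                  # (1/N) sum_i X_i X_i'
Bbar = (X.T @ Sigma @ X) / N          # (1/N) sum_i sum_j sigma_ij X_i X_j'
V_sum = (1 / N) * np.linalg.inv(Abar) @ Bbar @ np.linalg.inv(Abar)
assert np.allclose(V_matrix, V_sum)
```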
Our various approaches to estimation will make different assumptions on \(\mathbf{\Sigma}\) to facilitate estimation.
The classical OLS variance estimator imposes an assumption of spherical errors
\[\text{Var}(\epsilon \mid \mathbf{X}) = \sigma^2\mathbf{I}\]
Two underlying assumptions: homoskedasticity (\(\text{Var}(\epsilon_i \mid \mathbf{X}) = \sigma^2\) for all \(i\)) and no correlation across observations (\(\text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0\) for \(i \neq j\)).
Under spherical errors, substituting \(\mathbf{\Sigma} = \sigma^2\mathbf{I}\) collapses the sandwich form:
\[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \sigma^2 \mathbf{I} \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]
An unbiased estimator for \(\sigma^2\) uses the residuals with a degrees-of-freedom correction
\[\hat{\sigma}^2 = \frac{1}{N - K} \sum_{i=1}^N \hat{e}_i^2\]
Yielding the classical variance estimator
\[\widehat{\text{Var}(\hat{\beta})} = \hat{\sigma}^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]
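In the running sketch, this is a few lines:

```python
# Classical variance estimate: sigma2_hat * (X'X)^{-1}
e_hat = Y - X @ beta_hat
sigma2_hat = (e_hat @ e_hat) / (N - X.shape[1])   # 1/(N-K) sum_i e_i^2
V_classical = sigma2_hat * np.linalg.inv(X.T @ X)
se_classical = np.sqrt(np.diag(V_classical))
```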
Our first relaxation is to allow \(\text{Var}(\epsilon_i \mid X_i) = \sigma_i^2\) to vary across \(i\)
A consistent estimator of \(\text{Var}(\hat{\beta})\) replaces \(\sigma_i^2\) with the squared OLS residual \(\hat{e}_i^2\)
\[\widehat{\text{Var}(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} \bigg(\sum_{i=1}^N \hat{e}_i^2 X_i X_i^\prime\bigg) (\mathbf{X}^\prime\mathbf{X})^{-1}\]
This is the Eicker-Huber-White heteroskedasticity-consistent variance estimator
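In code, continuing the sketch (this is the basic HC0 variant, with no small-sample correction):

```python
# Meat: sum_i e_i^2 X_i X_i', built by scaling each row of X by e_i^2
bread = np.linalg.inv(X.T @ X)
meat_hc0 = (X * e_hat[:, None] ** 2).T @ X
V_hc0 = bread @ meat_hc0 @ bread
se_hc0 = np.sqrt(np.diag(V_hc0))
```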
Now let’s relax the assumption that \(\text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0\) for \(i \neq j\)
Index observations by cluster \(g = 1, \dotsc, G\) and order the observations by cluster.
We assume \(\text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0\) if \(i\) and \(j\) do not share a cluster.
This makes \(\mathbf{\Sigma}\) block-diagonal - \(\sigma_{ij} = 0\) for \(i, j\) in different clusters.
\[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} \bigg(\sum_{g=1}^G X_g^\prime \mathbf{\Sigma}_g X_g\bigg)(\mathbf{X}^\prime\mathbf{X})^{-1}\]
where \(X_g\) is the \(N_g \times K\) matrix of stacked covariates for cluster \(g\) and \(\mathbf{\Sigma}_g = \mathbb{E}[\epsilon_g \epsilon_g^\prime \mid \mathbf{X}]\) is the \(N_g \times N_g\) within-cluster block of \(\mathbf{\Sigma}\).
The Liang and Zeger (1986) cluster-robust estimator plugs in the residual vector \(\hat{e}_g\) for each cluster
\[\widehat{\text{Var}(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} \bigg(\sum_{g=1}^G X_g^\prime \hat{e}_g \hat{e}_g^\prime X_g\bigg) (\mathbf{X}^\prime\mathbf{X})^{-1}\]
Arbitrary correlation structure within each cluster is allowed, but we restrict the correlation across clusters to zero.
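A sketch of the plug-in estimator, assuming a length-\(N\) array of cluster labels (no small-\(G\) correction applied):

```python
def cluster_robust_V(X, e_hat, cluster):
    """Liang-Zeger variance: sum over clusters of X_g' e_g e_g' X_g."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        idx = cluster == g
        s_g = X[idx].T @ e_hat[idx]     # X_g' e_g, the K x 1 cluster score
        meat += np.outer(s_g, s_g)      # X_g' e_g e_g' X_g
    return bread @ meat @ bread
```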
A simulation: \(G\) clusters with \(n_g = 10\) units each. Treatment \(X_g\) is assigned at the cluster level (half of the clusters treated).
Errors decompose into a cluster shock and an idiosyncratic shock
\[\epsilon_{ig} = \alpha_g + u_{ig}, \quad \alpha_g \sim \mathcal{N}(0, \sigma_\alpha^2), \quad u_{ig} \sim \mathcal{N}(0, \sigma_u^2)\]
with \(\sigma_\alpha^2 + \sigma_u^2 = 1\), so the intra-class correlation is \(\rho = \sigma_\alpha^2\).
Outcome model with no treatment effect:
\[Y_{ig} = \beta_0 + \epsilon_{ig}\]
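A sketch of this DGP (the values of \(G\), \(\rho\), and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g, rho = 50, 10, 0.5                       # rho = sigma_alpha^2 (ICC)
cluster = np.repeat(np.arange(G), n_g)
treated = rng.permutation(np.arange(G) < G // 2).astype(float)
D = treated[cluster]                            # cluster-level treatment
alpha = rng.normal(0.0, np.sqrt(rho), size=G)[cluster]   # cluster shock
u = rng.normal(0.0, np.sqrt(1 - rho), size=G * n_g)      # idiosyncratic shock
Y = 1.0 + 0.0 * D + alpha + u                   # beta_0 = 1, true effect = 0
```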
Fractional weighted bootstrap: randomly assign each unit a weight, drawing the weight vector from a uniform Dirichlet distribution scaled by \(N\), then run the weighted regression.
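A minimal sketch (the function name and defaults are mine; a flat Dirichlet scaled by \(N\) is equivalent to normalized Exponential(1) draws):

```python
def fractional_weighted_bootstrap(X, Y, B=999, seed=2):
    rng = np.random.default_rng(seed)
    N = len(Y)
    draws = []
    for _ in range(B):
        w = rng.dirichlet(np.ones(N)) * N       # weights sum to N, never exactly zero
        Xw = X * w[:, None]
        draws.append(np.linalg.solve(Xw.T @ X, Xw.T @ Y))  # (X'WX)^{-1} X'WY
    return np.asarray(draws)                    # SEs = column std. dev. of draws
```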
Wild bootstrap (Wu 1986; Mammen 1993): hold \(X_i\) fixed, perturb the residuals by random sign-flips
\[Y_i^{(b)} = X_i^\prime\hat{\beta} + \omega_i^{(b)} \hat{e}_i, \quad \omega_i^{(b)} \in \{-1, +1\}\]
where the \(\omega_i^{(b)}\) are i.i.d. sign flips (e.g. Rademacher: \(\pm 1\) with equal probability).
Wild cluster bootstrap (Cameron, Gelbach, and Miller 2008): same idea, but flip the sign of the entire cluster’s residuals together. This performs well even when \(G\) is small.
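A sketch of the wild cluster bootstrap with Rademacher sign flips (treating each unit as its own cluster recovers the plain wild bootstrap):

```python
def wild_cluster_bootstrap(X, Y, cluster, B=999, seed=3):
    rng = np.random.default_rng(seed)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e_hat = Y - X @ beta_hat
    _, inv = np.unique(cluster, return_inverse=True)    # map labels to 0..G-1
    G = inv.max() + 1
    draws = []
    for _ in range(B):
        omega = rng.choice([-1.0, 1.0], size=G)[inv]    # one sign flip per cluster
        Y_b = X @ beta_hat + omega * e_hat              # perturbed outcomes
        draws.append(np.linalg.solve(X.T @ X, X.T @ Y_b))
    return np.asarray(draws)
```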
For estimating the “meat”, use the cross-products of the residuals weighted by a kernel \(w(d_{ij}, d^*)\) that downweights pairs by the distance between the two observations (and is typically zero beyond the cutoff \(d^*\)), as in Conley (1999).
\[\widehat{\mathbf{X}^\prime \mathbf{\Sigma} \mathbf{X}} = \sum_{i=1}^N \sum_{j=1}^N w(d_{ij}, d^*) \hat{e}_i \hat{e}_j X_i X_j^\prime\]
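A sketch of the meat with a uniform kernel, \(w(d_{ij}, d^*) = \mathbf{1}\{d_{ij} \le d^*\}\) (the kernel choice here is an assumption; smoother kernels such as the Bartlett are also common):

```python
def spatial_meat(X, e_hat, D, d_star):
    """D is the N x N matrix of pairwise distances d_ij."""
    W = (D <= d_star).astype(float)      # w(d_ij, d*) weights
    S = W * np.outer(e_hat, e_hat)       # S_ij = w_ij * e_i * e_j
    return X.T @ S @ X                   # sum_ij w_ij e_i e_j X_i X_j'
```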
\(25 \times 25\) regular grid of points indexed by location (\(N = 625\)). \(d_{ij}\) = Euclidean distance between locations \(i\) and \(j\).
Draw treatment from a Gaussian random field \(x = (x_1, \dotsc, x_N)^\prime\)
\[x \sim \mathcal{N}(0, \, \mathbf{\Sigma}_\rho), \qquad [\mathbf{\Sigma}_\rho]_{ij} = \exp(-d_{ij} / \rho)\]
Draw the errors from the same random field, independently of the treatment, so the errors are spatially correlated as well
Outcome: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) with \(\beta_1 = 0\).
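A sketch of this spatial DGP (the range parameter \(\rho\) and the seed are illustrative; the small jitter term stabilizes the Cholesky factorization):

```python
import numpy as np

rng = np.random.default_rng(4)
side, rho = 25, 3.0
coords = np.stack(np.meshgrid(np.arange(side), np.arange(side)), -1).reshape(-1, 2).astype(float)
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # d_ij
Sigma_rho = np.exp(-D / rho)
L = np.linalg.cholesky(Sigma_rho + 1e-8 * np.eye(side * side))
x = L @ rng.normal(size=side * side)     # spatially correlated treatment
eps = L @ rng.normal(size=side * side)   # errors: same field, drawn independently
Y = 1.0 + 0.0 * x + eps                  # beta_1 = 0
```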