Week 14, Part 1: Regression, Dependence and Standard Errors

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

April 27, 2026


Today

  • Going back to estimation with ordinary least squares regression.
    • Estimand: \(\mathbb{E}[Y|X]\)
  • What happens when we have dependence across units
    • Dependence due to cluster-based sampling from a target population
    • Dependence due to cluster-based assignment of treatment
  • Various forms of “robust” standard errors
    • Conventional “heteroskedasticity” robust SEs
    • Cluster-robust standard errors
    • Spatial HAC
  • When do these work? What sample sizes do I need for valid inference? How do I implement a bootstrap?

Regression Review

  • We observe an outcome \(Y_i\) and a \(K \times 1\) vector of predictors \(X_i\) for each of \(N\) units (one of which is a constant – the “intercept”)

    • We’ll stack these into the \(N \times 1\) outcome vector \(Y\) and the \(N \times K\) design matrix \(\mathbf{X}\)
  • Consider the ordinary least squares estimator

    \[\hat{\beta} = \bigg(\frac{1}{N}\sum_{i=1}^N X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{N}\sum_{i=1}^N X_iY_i\bigg) = (\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime Y\]
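
  • As a quick illustration, here is a minimal numpy sketch of this estimator (the simulated data and variable names are my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # first column is the intercept
beta = np.array([1.0, 2.0, -0.5])   # hypothetical "true" coefficients
Y = X @ beta + rng.normal(size=N)

# beta_hat = (X'X)^{-1} X'Y, computed via the normal equations rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # close to beta
```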

Regression Review

  • When is OLS consistent for the true conditional expectation function \(\mathbb{E}[Y_i | X_i]\)?

    1. Linear CEF

    \[Y_i = X_i^\prime \beta + \epsilon_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \dotsc + \beta_K X_{iK} + \epsilon_i\]

    2. Zero Conditional Mean Error

    \[\mathbb{E}[\epsilon \mid \mathbf{X}] = 0\]

    3. No perfect collinearity (\(\mathbf{X}^\prime\mathbf{X}\) is invertible)

Aside: Leverage/Hat values

  • The fitted values from OLS can be written as a linear projection of \(Y\) onto the column space of \(\mathbf{X}\)

    \[\hat{Y} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime Y = \mathbf{H}Y\]

  • \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime\) is the projection matrix or hat matrix (“hat” because it puts the hat on \(Y\)).

  • \(\mathbf{H}\) is symmetric

  • \(\mathbf{H}\) is idempotent (\(\mathbf{H}\mathbf{H} = \mathbf{H}\)).
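
  • A small numerical check of these properties (on a simulated design of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H, H.T))                   # symmetric: True
print(np.allclose(H @ H, H))                 # idempotent: True
print(np.isclose(np.trace(H), X.shape[1]))   # trace(H) = K, the rank of the projection: True
```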

Aside: Leverage/Hat values

  • One way to think about the values of \(\mathbf{H}\) is that they define the weights that generate \(\hat{Y}\) from \(Y\)

    \[\hat{Y}_i = \sum_{j=1}^N h_{ij} Y_j\]

  • Note that these weights are not convex: individual \(h_{ij}\) can be negative.

  • The diagonal elements \(h_{ii}\) of the matrix are referred to as the leverage of each observation

    • Captures how much the prediction for unit \(i\) changes in response to a change in unit \(i\)’s own outcome: \(\partial \hat{Y}_i / \partial Y_i = h_{ii}\)
  • “High leverage” observations are those that are extreme in the covariate space.

Regression Standard Errors

  • We have an estimator, \(\hat{\beta}\). We want to estimate its variance. Two questions:

    • What is the actual variance? \(Var(\hat{\beta})\)
    • Can we come up with a good estimator of it? \(\widehat{Var}(\hat{\beta})\)
  • Let’s first derive the variance. Plugging \(Y = \mathbf{X}\beta + \epsilon\) into the OLS formula:

    \[\hat{\beta} - \beta = (\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime\epsilon\]

  • Taking the conditional variance:

    \[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime \, \mathbb{E}[\epsilon\epsilon^\prime \mid \mathbf{X}] \, \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\]

  • Define \(\mathbf{\Sigma} \equiv \mathbb{E}[\epsilon\epsilon^\prime \mid \mathbf{X}]\) - the \(N \times N\) variance-covariance matrix of the errors:

    \[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} \, \mathbf{X}^\prime \mathbf{\Sigma} \mathbf{X} \, (\mathbf{X}^\prime\mathbf{X})^{-1}\]

Regression Standard Errors

  • Equivalently:

    \[\text{Var}(\hat{\beta} \mid \mathbf{X}) = \frac{1}{N}\bigg(\frac{1}{N}\sum_{i=1}^N X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \sigma_{ij} X_iX_j^\prime\bigg)\bigg(\frac{1}{N}\sum_{i=1}^N X_iX_i^\prime\bigg)^{-1}\]

    where \(\sigma_{ij} = \text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X})\) is the \((i, j)\) entry of \(\mathbf{\Sigma}\).

  • Our various approaches to estimation will make different assumptions on \(\mathbf{\Sigma}\) to facilitate estimation.
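
  • One way to see that the sandwich formula is right: fix \(\mathbf{X}\) and a known \(\mathbf{\Sigma}\), redraw \(\epsilon\) many times, and compare the simulated variance of \(\hat{\beta}\) to the formula. A sketch under an AR(1)-style \(\mathbf{\Sigma}\) of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])                 # fixed design
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))  # AR(1)-type error covariance
L = np.linalg.cholesky(Sigma)
XtX_inv = np.linalg.inv(X.T @ X)

# Sandwich formula: (X'X)^{-1} X' Sigma X (X'X)^{-1}
V_formula = XtX_inv @ X.T @ Sigma @ X @ XtX_inv

# Monte Carlo: beta_hat - beta = (X'X)^{-1} X' eps, redrawing eps each time
draws = np.array([XtX_inv @ X.T @ (L @ rng.normal(size=N)) for _ in range(20000)])
print(np.sqrt(np.diag(V_formula)))  # formula-based SEs
print(draws.std(axis=0, ddof=1))    # simulated SEs -- match up to Monte Carlo error
```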

Classical Standard Errors

  • The classical OLS variance estimator imposes an assumption of spherical errors

    \[\text{Var}(\epsilon \mid \mathbf{X}) = \sigma^2\mathbf{I}\]

  • Two underlying assumptions:

    • Homoskedasticity: \(\text{Var}(\epsilon_i \mid X_i) = \sigma^2\) for all \(i\)
    • Zero error covariance: \(\text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0\) for \(i \neq j\)
  • Under spherical errors, the sandwich form simplifies as \(\mathbf{\Sigma} = \sigma^2\mathbf{I}\):

    \[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \sigma^2 \mathbf{I} \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]

Classical Standard Errors

  • An unbiased estimator for \(\sigma^2\) uses the residuals with a degrees-of-freedom correction

    \[\hat{\sigma}^2 = \frac{1}{N - K} \sum_{i=1}^N \hat{e}_i^2\]

  • Yielding the classical variance estimator

    \[\widehat{\text{Var}(\hat{\beta})} = \hat{\sigma}^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]
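
  • A sketch of the classical estimator on simulated (homoskedastic, independent) data of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
e = Y - X @ (XtX_inv @ X.T @ Y)       # OLS residuals
sigma2_hat = e @ e / (N - K)          # degrees-of-freedom-corrected error variance
V_classical = sigma2_hat * XtX_inv
print(np.sqrt(np.diag(V_classical)))  # classical standard errors
```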

Heteroskedasticity-robust SEs

  • Our first relaxation is to allow \(\text{Var}(\epsilon_i \mid X_i) = \sigma_i^2\) to vary across \(i\)

    • Still assume zero covariance across observations.
    • The variance-covariance matrix of the errors is a diagonal matrix \(\mathbf{\Sigma} = \text{diag}(\sigma_1^2, \dotsc, \sigma_N^2)\).
  • A consistent estimator of \(Var(\hat{\beta})\) replaces \(\sigma_i^2\) with the squared OLS residual \(\hat{e}_i^2\)

    \[\widehat{\text{Var}(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} \bigg(\sum_{i=1}^N \hat{e}_i^2 X_i X_i^\prime\bigg) (\mathbf{X}^\prime\mathbf{X})^{-1}\]

  • This is the Eicker-Huber-White heteroskedasticity-consistent variance estimator

Heteroskedasticity-robust SEs

  • HC-robust SEs are biased in finite samples.
    • A number of corrections have been proposed which change how the \(\sigma_i^2\) are imputed
    • These estimators tend to perform better in smaller samples
  • HC0: \(\hat{e}_i^2\) - the original estimator
  • HC1: \(\frac{N}{N - K}\hat{e}_i^2\) - a constant re-scaling by the degrees of freedom
  • HC2: \(\frac{\hat{e}_i^2}{1 - h_{ii}}\) - scale the residuals by the leverage
  • HC3: \(\frac{\hat{e}_i^2}{(1 - h_{ii})^2}\) - use the “jackknife” residual
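
  • A sketch computing all four variants by hand on a heteroskedastic DGP of my own (statsmodels’ `cov_type='HC0'` through `'HC3'` reproduce these):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N) * (1 + np.abs(X[:, 1]))  # heteroskedastic

XtX_inv = np.linalg.inv(X.T @ X)
e = Y - X @ (XtX_inv @ X.T @ Y)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii without forming the full hat matrix

variants = {
    'HC0': e**2,
    'HC1': e**2 * N / (N - K),
    'HC2': e**2 / (1 - h),
    'HC3': e**2 / (1 - h)**2,
}
for name, w in variants.items():
    V = XtX_inv @ (X.T * w) @ X @ XtX_inv     # meat: sum_i w_i x_i x_i'
    print(name, np.sqrt(np.diag(V)))
```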

Cluster-robust SEs

  • Now let’s relax the assumption that \(Cov(\epsilon_i, \epsilon_j | \mathbf{X}) = 0\) for \(i \neq j\)

    • A common structure of dependence is that errors are correlated within groups but independent across groups
  • Index observations by cluster \(g = 1, \dotsc, G\) and order the observations by cluster.

    • Let \(X_g\) be the \(N_g \times K\) design matrix and \(\hat{e}_g\) the residual vector for cluster \(g\).
  • We assume \(Cov(\epsilon_i, \epsilon_j | \mathbf{X}) = 0\) if \(i\) and \(j\) do not share a cluster.

  • This makes \(\mathbf{\Sigma}\) block-diagonal - \(\sigma_{ij} = 0\) for \(i, j\) in different clusters.

    • The “meat” of the sandwich becomes a sum over clusters:

    \[\text{Var}(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} \bigg(\sum_{g=1}^G X_g^\prime \mathbf{\Sigma}_g X_g\bigg)(\mathbf{X}^\prime\mathbf{X})^{-1}\]

    where \(\mathbf{\Sigma}_g = \mathbb{E}[\epsilon_g \epsilon_g^\prime \mid \mathbf{X}]\) is the \(N_g \times N_g\) within-cluster block of \(\mathbf{\Sigma}\).

Cluster-robust SEs

  • The Liang-Zeger (1986) cluster-robust estimator plugs in residuals at the cluster level

    \[\widehat{\text{Var}(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} \bigg(\sum_{g=1}^G X_g^\prime \hat{e}_g \hat{e}_g^\prime X_g\bigg) (\mathbf{X}^\prime\mathbf{X})^{-1}\]

  • Arbitrary correlation structure within each cluster is allowed, but we restrict correlation across clusters

    • e.g. in many DiD/TWFE regression settings, we assume independence across units but allow for arbitrary correlation over time.
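
  • In practice this is a one-liner in statsmodels; a sketch on a cluster-assigned-treatment DGP of my own:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
G, n_g = 40, 10
g = np.repeat(np.arange(G), n_g)                      # cluster ids
x = np.repeat(rng.binomial(1, 0.5, size=G), n_g)      # treatment assigned at the cluster level
eps = np.repeat(rng.normal(size=G), n_g) + rng.normal(size=G * n_g)  # cluster + idiosyncratic shocks
Y = 1.0 + eps                                         # no treatment effect

X = sm.add_constant(x.astype(float))
# Liang-Zeger cluster-robust vcov (statsmodels applies a small-sample correction by default)
fit = sm.OLS(Y, X).fit(cov_type='cluster', cov_kwds={'groups': g})
print(fit.bse)
```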

Cluster-robust SEs

  • Crucially, now the consistency results for cluster-robust standard errors depend on \(G\) going to infinity, not \(N\)
    • With a few clusters (less than \(30\)-ish), the conventional Liang-Zeger estimator will tend to under-cover
  • Small sample corrections (cluster-level analogues of HC1/HC2):
    • CR1 - multiply CR0 (the Liang-Zeger plug-in) by \(\frac{G}{G-1} \cdot \frac{N-1}{N-K}\) - a degrees-of-freedom rescaling at the cluster level
    • CR2 (Bell-McCaffrey 2002): replace \(\hat{e}_g \hat{e}_g^\prime\) with \(\mathbf{A}_g \hat{e}_g \hat{e}_g^\prime \mathbf{A}_g^\prime\), where \(\mathbf{A}_g = (\mathbf{I}_{N_g} - \mathbf{H}_{gg})^{-1/2}\) uses the within-cluster block \(\mathbf{H}_{gg}\) of the hat matrix
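
  • A sketch of CR2 following the formula above (design of my own; the inverse square root of \(\mathbf{I}_{N_g} - \mathbf{H}_{gg}\) is taken via an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
G, n_g = 20, 10
N = G * n_g
g = np.repeat(np.arange(G), n_g)
X = np.column_stack([np.ones(N), np.repeat(rng.binomial(1, 0.5, G), n_g).astype(float)])
Y = 1.0 + np.repeat(rng.normal(size=G), n_g) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
e = Y - X @ (XtX_inv @ X.T @ Y)

meat = np.zeros((X.shape[1], X.shape[1]))
for c in range(G):
    idx = g == c
    Xg, eg = X[idx], e[idx]
    Hgg = Xg @ XtX_inv @ Xg.T                  # within-cluster block of the hat matrix
    w, V = np.linalg.eigh(np.eye(idx.sum()) - Hgg)
    Ag = V @ np.diag(w ** -0.5) @ V.T          # A_g = (I - H_gg)^{-1/2}
    eg_adj = Ag @ eg
    meat += Xg.T @ np.outer(eg_adj, eg_adj) @ Xg
V_cr2 = XtX_inv @ meat @ XtX_inv
print(np.sqrt(np.diag(V_cr2)))
```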

Simulation evidence: Setup

  • \(G\) clusters with \(n_g = 10\) units each. Treatment \(X_g\) assigned at the cluster level (half of clusters treated).

  • Errors decompose into a cluster shock and an idiosyncratic shock

    \[\epsilon_{ig} = \alpha_g + u_{ig}, \quad \alpha_g \sim \mathcal{N}(0, \sigma_\alpha^2), \quad u_{ig} \sim \mathcal{N}(0, \sigma_u^2)\]

    with \(\sigma_\alpha^2 + \sigma_u^2 = 1\), so the intra-class correlation is \(\rho = \sigma_\alpha^2\).

  • Outcome model with no treatment effect:

    \[Y_{ig} = \beta_0 + \epsilon_{ig}\]
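
  • A sketch of this DGP plus a size simulation (my own implementation; 500 replications at nominal level 0.05):

```python
import numpy as np
import statsmodels.api as sm

def one_rep(rng, G=50, n_g=10, rho=0.5):
    g = np.repeat(np.arange(G), n_g)
    x = np.repeat(rng.binomial(1, 0.5, size=G), n_g).astype(float)  # cluster-level treatment
    eps = (np.sqrt(rho) * np.repeat(rng.normal(size=G), n_g)        # cluster shock alpha_g
           + np.sqrt(1 - rho) * rng.normal(size=G * n_g))           # idiosyncratic shock u_ig
    Y = 1.0 + eps                                                   # true treatment effect is zero
    X = sm.add_constant(x)
    p_hc1 = sm.OLS(Y, X).fit(cov_type='HC1').pvalues[1]
    p_cr = sm.OLS(Y, X).fit(cov_type='cluster', cov_kwds={'groups': g}).pvalues[1]
    return p_hc1 < 0.05, p_cr < 0.05

rng = np.random.default_rng(0)
rej = np.mean([one_rep(rng) for _ in range(500)], axis=0)
# HC1 rejects far too often; the cluster-robust rate should sit near 0.05
print('HC1 rejection rate:', rej[0], '| cluster-robust:', rej[1])
```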

Simulation: Varying ICC

  • HC1 over-rejects severely as soon as \(\rho > 0\). CR1 and CR2 hold size at \(G = 50\) across all ICC levels.

Simulation: Varying number of clusters

  • HC1 over-rejects everywhere. CR1 over-rejects badly with few clusters; CR2 is much better-behaved but still imperfect at \(G = 5\).

Bootstrap methods

  • We can extend the bootstrap beyond the conventional “pairs” bootstrap to deal with clustered covariance structure
    • Block/cluster bootstrap: resample whole clusters with replacement, keeping all units within a sampled cluster.
  • We can also implement bootstrap procedures that don’t rely on resampling units entirely
    • Fractional weighted bootstrap: randomly assign each unit a weight drawn from a flat (uniform) Dirichlet distribution scaled by \(N\), then run the weighted regression.

    • Wild bootstrap (Wu 1986; Mammen 1993): hold \(X_i\) fixed, perturb the residuals by random sign-flips

      \[Y_i^{(b)} = X_i^\prime\hat{\beta} + \omega_i^{(b)} \hat{e}_i, \quad \omega_i^{(b)} \in \{-1, +1\}\]

    • Wild cluster bootstrap (Cameron, Gelbach, Miller 2008): same idea, but flip the sign of the entire cluster’s residuals together. Works well in small \(G\) cases.
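
  • A minimal sketch of the wild cluster bootstrap with Rademacher sign-flips (a simple variance version; Cameron, Gelbach, and Miller recommend a percentile-\(t\) variant that imposes the null):

```python
import numpy as np

def wild_cluster_se(X, Y, g, B=999, seed=0):
    """Bootstrap SEs that flip the sign of each cluster's residuals as a block."""
    rng = np.random.default_rng(seed)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    e = Y - X @ beta_hat
    clusters = np.unique(g)
    cluster_of = np.searchsorted(clusters, g)                # map each row to its cluster index
    betas = np.empty((B, X.shape[1]))
    for b in range(B):
        omega = rng.choice([-1.0, 1.0], size=clusters.size)  # one sign flip per cluster
        Y_b = X @ beta_hat + e * omega[cluster_of]           # perturbed outcomes
        betas[b] = XtX_inv @ X.T @ Y_b
    return betas.std(axis=0, ddof=1)
```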

Spatial autocorrelation

  • Errors can also be correlated in space: observations on neighboring regions are exposed to common shocks.
  • Cluster-robust SEs require a partition of the units, but spatial dependence often does not have clear boundaries
  • Conley (1999) “spatial HAC”
    • For estimating the “meat”, use the cross-product of the residuals weighted by the distance between the two observations.

      \[\widehat{\mathbf{X}^\prime \mathbf{\Sigma} \mathbf{X}} = \sum_{i=1}^N \sum_{j=1}^N w(d_{ij}, d^*) \hat{e}_i \hat{e}_j X_i X_j^\prime\]

    where \(d_{ij}\) is the distance between \(i\) and \(j\) and \(w(\cdot)\) is a kernel that goes to zero past cutoff distance \(d^*\).
  • The bandwidth \(d^*\) is a researcher choice.
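
  • A sketch of the Conley meat with a Bartlett (triangular) kernel, one common choice of \(w(\cdot)\) (implementation details are my own; \(O(N^2)\) in memory, fine for small \(N\)):

```python
import numpy as np

def conley_vcov(X, e, coords, cutoff):
    """Spatial HAC sandwich: weight residual cross-products by distance."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # pairwise distances
    w = np.clip(1 - d / cutoff, 0.0, None)     # Bartlett kernel: 1 at d=0, 0 beyond the cutoff
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (w * np.outer(e, e)) @ X      # sum_ij w(d_ij, d*) e_i e_j x_i x_j'
    return XtX_inv @ meat @ XtX_inv
```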

Spatial simulation: Setup

  • \(25 \times 25\) regular grid of points indexed by location (\(N = 625\)). \(d_{ij}\) = Euclidean distance between locations \(i\) and \(j\).

  • Draw treatment from a Gaussian random field \(x = (x_1, \dotsc, x_N)^\prime\)

    \[x \sim \mathcal{N}(0, \, \mathbf{\Sigma}_\rho), \qquad [\mathbf{\Sigma}_\rho]_{ij} = \exp(-d_{ij} / \rho)\]

  • Draw errors independently from the same field (spatially correlated)

    • Both \(X\) and \(\epsilon\) inherit the same spatial structure
  • Outcome: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) with \(\beta_1 = 0\).
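
  • A sketch of this DGP (my own implementation; the field is simulated via a Cholesky factor of \(\mathbf{\Sigma}_\rho\)):

```python
import numpy as np

rng = np.random.default_rng(0)
side, rho = 25, 3.0
xs, ys = np.meshgrid(np.arange(side), np.arange(side))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)  # N = 625 grid points
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma_rho = np.exp(-d / rho)                                      # exponential covariance
L = np.linalg.cholesky(Sigma_rho + 1e-8 * np.eye(len(coords)))    # jitter for numerical stability

x = L @ rng.normal(size=len(coords))     # spatially correlated treatment
eps = L @ rng.normal(size=len(coords))   # independent draw from the same field
Y = 0.0 + 0.0 * x + eps                  # beta_0 = beta_1 = 0 here
```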

Visualizing spatial dependence

  • Small \(\rho\) looks like noise; large \(\rho\) produces large coherent “patches”. Both \(X_i\) and \(\epsilon_i\) are independent realizations of this latent field at the same \(\rho\).

Spatial simulation: HC1 vs Conley

  • HC1 over-rejects as \(\rho\) grows. Conley does better, but as the “range” of dependence \(\rho\) gets large, we also over-reject!

Design-based approaches

  • We’ve focused on thinking about dependence from the standpoint of model-based inference
    • Clustered SEs are valid under unobserved cluster-level shocks uncorrelated with the regressors.
  • But when are these estimators justified under design-based grounds?
  • Descriptive inference
    • Clustering reflects the structure of sampling (e.g. sampling villages then individuals)
  • Causal inference - Abadie, Athey, Imbens, Wooldridge (2020; 2023)
    • Can avoid clustering if treatment assignment is independent (e.g. complete randomization) and we’re targeting the SATE.
    • If there is clustering in either the treatment or the sampling scheme, then ignoring clustering will tend to under-cover.
  • Xu and Wooldridge (2022) extend the design-based approach to spatial SEs

Conclusion

  • What standard errors should you use when you fit a linear regression?
  • Classical SEs - Basically never
  • Heteroskedasticity-robust SEs - Genuine i.i.d. sampling/treatment assignment
  • Cluster-robust SEs - i.i.d. sampling/treatment assignment by cluster
    • Can just do the analysis at the cluster level as an alternative.
  • Dyadic/Multi-way SEs (Aronow, Samii, and Assenova 2015; Cameron, Gelbach, and Miller 2011)
    • Dependence between units that share a member (e.g. dyads)
  • Spatial SEs - Dependence is smooth in some underlying space.
    • Need pretty rapid decay, otherwise you’re functionally in an \(N=1\) setting!
  • Be wary of finite sample problems!
    • Angrist and Pischke (2009): Rule of 42

Next Lecture

  • Come with questions!
    • We have lots of time for review!
  • Concluding topic - interference in treatment assignment
    • How does the estimand change when there is arbitrary (unknown) interference?
    • Are there designs that facilitate estimation under known interference?
    • A common design with interference: the “shift-share” design.