Week 4: Experiments - Covariate Adjustment

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

February 8, 2026

\[ \require{cancel} \DeclareMathOperator*{\argmin}{arg\,min} \]

\[ \DeclareMathOperator*{\argmax}{arg\,max} \]

Where we’ve been

  • Causal effects as contrasts in potential outcomes
    • Average causal effects as estimands that we can identify under assumptions on the experimental design
  • Randomized experiments identify average treatment effects
    • Observed treated (control) groups are representative of what would happen on average if whole sample received treatment (control)
  • Conditioning on post-treatment covariates breaks your experiment
    • Endogenous selection bias - treated group no longer comparable to the controls
    • For non-compliance - we’ll learn how to identify the Complier Average Causal Effect
  • External validity is a question of effect heterogeneity

Some thoughts on identification assumptions

  • (Nearly) every causal identification assumption can be thought of as a variation on these three ideas

  • Randomization - Something is statistically independent of something else

    • (e.g.) Ignorability of treatment for a conventional randomized experiment
  • Effect homogeneity - We know something about the individual treatment effects

    • (e.g.) Monotonicity - effects are only in one direction.
  • Correct model - We know the true model for something (e.g. outcome \(Y\))

    • Hard to credibly justify in most social science settings, but it underpins a lot of “pre-credibility revolution” empirical work even when not made explicit.

This week

  • What if we want more precision in our experiments?
    • We can increase the sample size…but that’s expensive!
    • For a given sample size, can we do better than complete randomization?
  • Covariate adjustment eliminates variability in \(Y\) driven by non-treatment characteristics
    • Adjustment ex-ante via stratification/blocking
    • Adjustment ex-post via post-stratification
  • Linear regression in its agnostic form.
    • OLS as an estimator of the Best Linear Predictor
    • Can we get the model wrong for \(E[Y | X]\) and still get consistent estimates of treatment effects? Yes!

Stratification

Power and Precision

  • In a conventional hypothesis test, we control the Type I error - the false positive rate

    • Given that the null is true, what’s the probability that we reject it?
    • We choose this number (e.g.) \(\alpha = .05\)
  • But we also want to know the Type II error - the false negative rate

    • Given that the null is false, what’s the probability that we fail to reject it?
  • The false negative rate in our typical hypothesis test depends on:

    • The true effect size
    • The sampling variance of the estimator
  • For the Neyman variance, what makes it smaller?

    \[Var(\hat{\tau}) = \frac{S^2_t}{N_t} + \frac{S^2_c}{N_c}\]
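The pieces of this formula can be computed directly. A minimal simulated sketch (hypothetical data, not from the lecture) showing the difference-in-means and its Neyman variance estimate:

```r
# Simulated illustration: difference-in-means and the Neyman variance
# estimator under complete randomization.
set.seed(60637)
N <- 1000
D <- sample(rep(c(0, 1), each = N / 2))   # complete randomization: 500/500
Y <- 2 + 1.5 * D + rnorm(N)               # true ATE = 1.5

tau_hat <- mean(Y[D == 1]) - mean(Y[D == 0])
# Var-hat(tau) = s^2_t / N_t + s^2_c / N_c
neyman_var <- var(Y[D == 1]) / sum(D == 1) + var(Y[D == 0]) / sum(D == 0)
se_hat <- sqrt(neyman_var)
```

Increasing \(N\) or shrinking the within-arm variances \(S^2_t, S^2_c\) both reduce this quantity, which is the lever covariate adjustment pulls.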

Covariate adjustment

  • How do we reduce the outcome variance?

    • Often we’ll transform the outcome (e.g. differencing using a pre-treatment measure of that outcome)
  • But our most useful technique is to find factors that predict the outcome and eliminate the variance attributable to those factors.

  • Recall from the law of total variance

    \[Var(Y) = \underbrace{\mathbb{E}[Var(Y | X)]}_{\text{unexplained by } X} + \underbrace{Var(\mathbb{E}[Y|X])}_{\text{explained by } X}\]

  • Adjusting for covariates \(X\) in an experiment eliminates variance in \(Y\) that is attributable to \(X\)
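The decomposition can be checked numerically. A simulated sketch (illustrative values) verifying that the within-\(X\) and between-\(X\) pieces add up to the total variance:

```r
# Numerical check of the law of total variance (simulated data):
# Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
set.seed(60637)
N <- 100000
X <- sample(1:3, N, replace = TRUE)
Y <- 2 * X + rnorm(N)

cond_mean <- ave(Y, X)               # E[Y | X_i] attached to each unit
cond_var  <- ave(Y, X, FUN = var)    # Var(Y | X_i) attached to each unit
total      <- var(Y)
decomposed <- mean(cond_var) + var(cond_mean)
# total and decomposed agree up to sampling error
```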

Stratification/Blocking

  • It is common to build the covariate adjustment directly into the experimental design through blocking or stratification
  • Before assigning treatment, partition the \(N\) units into \(G\) mutually-exclusive strata
    • With discrete covariates, each bin \(g\) is a unique combination of the covariates
    • With continuous covariates, we can coarsen into discrete bins (e.g. age \(\to\) “25-40”)
    • Each stratum \(g\) has \(N_g\) units.
  • Within each stratum treatment is completely randomized
    • \(N_{t, g}\) units receive treatment
    • \(N_{c, g} = N_g - N_{t, g}\) units receive control.
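The assignment scheme above can be sketched in a few lines (stratum labels and sizes are made up for illustration):

```r
# Block randomization sketch: within each stratum, completely randomize a
# fixed share of units into treatment (here, exactly half).
set.seed(60637)
df <- data.frame(id = 1:200,
                 g  = rep(c("a", "b", "c"), times = c(40, 60, 100)))
# ave() applies the function within each stratum and stitches results back
df$D <- ave(rep(0, nrow(df)), df$g,
            FUN = function(x) sample(rep(c(0, 1), length.out = length(x))))
table(df$g, df$D)   # each stratum is split exactly in half
```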

Quiz

  • If \(N_{t, g}/N_g\) is the same across all strata, can I analyze the experiment like I would with complete randomization?

  • How about if \(N_{t,g}/N_g\) varies between strata?

Estimation

  • Our estimator is the stratified difference-in-means

  • We start by estimating each stratum-specific treatment effect \(\tau_g\) using the difference-in-means in that stratum

    \[\hat{\tau_g} = \frac{1}{N_{t,g}}\sum_{i: G_i = g} Y_i D_i - \frac{1}{N_{c,g}}\sum_{i: G_i = g} Y_i (1 - D_i)\]

  • And we can estimate the sampling variance using the within-stratum Neyman variance estimator

    \[\widehat{Var(\hat{\tau_g})} = \frac{s_{t,g}^2}{N_{t,g}} + \frac{s_{c,g}^2}{N_{c,g}}\]

  • \(s_{t,g}^2\), \(s_{c,g}^2\) are the sample variances of \(Y\) within the treated and control groups, respectively, in stratum \(g\).

  • Imagine: We ran \(G\) independent mini-experiments and analyzed them separately. Each is unbiased for the conditional ATE (CATE) and has its own standard error.

    • But we care about the average treatment effect

Estimation under block-randomization

  • How do we aggregate to get an estimate of the ATE? Take a weighted average by stratum size

    \[\hat{\tau} = \sum_{g = 1}^G \hat{\tau_g} \times \frac{N_g}{N}\]

  • And the sampling variance?

    \[\widehat{Var_{\text{strat}}(\hat{\tau})} = \sum_{g = 1}^G \widehat{Var(\hat{\tau_g})} \times \left(\frac{N_g}{N}\right)^2\]

  • When is the variance of the blocked design going to be lower than the variance under complete randomization?

    • When the strata explain some of the variance in \(Y\) (the population \(S^2_{t,g} < S^2_{t}\) and \(S^2_{c,g} < S^2_{c}\) )
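The aggregation formulas above can be hand-rolled on simulated data (two strata, a strongly prognostic stratum variable, true ATE of 2):

```r
# Stratified difference-in-means: stratum-specific estimates aggregated by
# stratum share, with the variance formula from the slides.
set.seed(60637)
N <- 600
g <- rep(c("a", "b"), each = N / 2)
D <- ave(rep(0, N), g,
         FUN = function(x) sample(rep(c(0, 1), length.out = length(x))))
Y <- 1 + 2 * D + 3 * (g == "b") + rnorm(N)   # true ATE = 2

per_stratum <- lapply(split(data.frame(Y, D), g), function(s) {
  c(tau_g = mean(s$Y[s$D == 1]) - mean(s$Y[s$D == 0]),
    var_g = var(s$Y[s$D == 1]) / sum(s$D == 1) +
            var(s$Y[s$D == 0]) / sum(s$D == 0),
    N_g   = nrow(s))
})
est <- do.call(rbind, per_stratum)
w <- est[, "N_g"] / N
tau_hat <- sum(w * est[, "tau_g"])        # weighted average of CATEs
var_hat <- sum(w^2 * est[, "var_g"])      # variance of the weighted sum
```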

Can blocking hurt?

  • Long debate over whether it’s possible to go wrong by blocking. Athey and Imbens (2017) argue there is no downside to blocking.
  • This answer depends on the framework for inference.
    • Pashley and Miratrix (2021) give an extensive review under alternative sampling/inference schemes.
  • The Athey and Imbens result holds under stratified random sampling from the population and equal treatment probability within strata.
    • Intuition: In the worst case scenario, stratification is just a two-stage randomization process equivalent to complete randomization.
  • This also does not guarantee that the estimated standard error will be smaller
    • With an irrelevant covariate, we will have fewer degrees of freedom as we are estimating multiple parameters. Estimated SEs under stratification might be higher.
  • Athey and Imbens suggest falling back on the conservative complete-randomization SE.

Post-stratification

  • Suppose we didn’t stratify ex-ante but observe some covariates. Can we analyze the experiment as though we had stratified on them?
    • Yes: Post-stratification
  • Key difference from stratification: the number of treated/control units within each stratum is random rather than fixed. Stratum sizes are also not fixed.
    • Not as efficient as if we had blocked ex-ante!
  • Miratrix, Sekhon and Yu (2013)
    • Usually not a problem - relative to blocking ex-ante, the differences in variances are small.
    • Problems with many strata + poorly predictive strata.
    • Unlike the Athey and Imbens setting, benefits not guaranteed, but often doesn’t hurt with good covariate choice.

Example: Nyhan and Reifler (2015)

  • Nyhan and Reifler (2015) conduct a field experiment on state legislators in November 2012 to study whether politicians reacted to external monitoring of their statements.
    • Treatment - Legislators received a letter warning them of the risks of having false statements exposed by fact-checkers
    • Placebo - Legislators received a letter warning that their statements would be observed.
    • Control - No mailer
  • Treatment was block-randomized
    • Exact blocking on state, political party, legislative chamber and existence of a previous PolitiFact rating
    • Coarsened matching on previous vote share and fundraising
  • Outcome:
    • Did the legislator receive a negative PolitiFact rating?
    • Was there media coverage of a legislator’s inaccurate statements?

Example: Nyhan and Reifler (2015)

  • For the main analysis, the paper combines the Placebo/Control conditions.
library(tidyverse)   # %>%, filter(), group_by(), mutate()
library(estimatr)    # lm_robust(), lm_lin()
library(broom)       # tidy()
factcheck <- haven::read_dta("assets/data/nyhan-reifler-fe.dta")
table(factcheck$treatment)

  0   1 
777 392 
  • Let’s check balance
    • We know balance is exact on state/party/chamber and whether there is a past fact check.
    • Coarse balance on fundraising and vote share
    • Other covariates observed but not explicitly “blocked” on (e.g. leadership position)

Example: Nyhan and Reifler (2015)

  • The original paper estimates the treatment effect on each of the three outcomes using a weighted difference-in-means.
    • The weights are nearly uniform but address the slight non-uniformity in treatment assignment across randomization strata.
  • Here, we replicate the original estimate for the combined “any PolitiFact check or LexisNexis news article questioning accuracy” outcome
dm_est <- tidy(lm_robust(combined ~ treatment, data=factcheck, weights=aw))
dm_est %>% filter(term == "treatment") %>% select(estimate, std.error, statistic, p.value)
     estimate   std.error statistic    p.value
1 -0.01568843 0.008225802 -1.907223 0.05673708
  • But this is kind of a waste of blocking.
    • Why do the blocking if you’re not going to incorporate it into the analysis?!
    • Again, the point of blocking in an experiment is not bias reduction but variance reduction!
    • This paper uses an overly conservative variance estimator.

Example: Nyhan and Reifler (2015)

  • The paper does include the randomization strata. Unfortunately many of those strata have only a single treated observation.
    • Why is this a problem? We can’t estimate the stratum-specific variance with a single observation!
  • Advice: If you’re stratifying, make sure to have at least two or so observations per condition per stratum
    • Otherwise, if you are pair-randomizing, just take the within-pair differences and use a conventional sample mean estimator.
  • Let’s take a look at the strata and see what covariate levels they capture - maybe we can combine some?
block_summaries <- factcheck %>% group_by(blockdummy) %>% summarize(state = state[1],
                                                                    chamber = chamber[1],
                                                                    gop = mean(gop), 
                                                                    anycheck = mean(anycheck),
                                                                    N_Treated = sum(treatment),
                                                                    N_Control = sum(treatment == 0))

Example: Nyhan and Reifler (2015)

  • Basically a lot of strata have too few members within a party who are previously fact checked (fact checks are rare). We’ll combine some of these strata for the analysis.
    • Could go through and merge them manually; to save time, I’ll just ignore chamber (it’s not that prognostic)
  • Let’s make those strata (paste() just concatenates the raw variable values)
factcheck <- factcheck %>% mutate(groupedblocks = paste(state, gop, anycheck, sep="-"))
factcheck$groupedblocks[factcheck$groupedblocks == "Virginia-1-1"] <- "Virginia-0-1" # Merge some Virginia strata that still have too few units
  • Now, we can calculate the stratified difference-in-means estimate and compute the correct sampling variance!
    • You can do this with group_by and mean() and var() (or for this case, weighted.mean() and weighted.var())
    • But it’s so much easier to do this with a (special) regression.
    • We’ll explain why this works later, but for now, let’s show you the estimate!

Example: Nyhan and Reifler (2015)

  • We use the Lin (2013) estimator in the estimatr package
    • Mathematically equivalent to our stratified difference-in-means estimator when using dummy indicators for strata (we’ll show you why later)
stratified_est <- tidy(estimatr::lm_lin(combined ~ treatment, covariates = ~ groupedblocks, data=factcheck, weights=aw))
stratified_est %>% filter(term == "treatment") %>% select(estimate, std.error, statistic, p.value)
     estimate   std.error statistic p.value
1 -0.01568843 0.008138779 -1.927615 0.05416
  • Compare with the case where we use the conservative variance estimator that ignores the strata
dm_est %>% filter(term == "treatment") %>% select(estimate, std.error, statistic, p.value)
     estimate   std.error statistic    p.value
1 -0.01568843 0.008225802 -1.907223 0.05673708

Summary

  • Stratification/Blocking is a tool for reducing the variance of your treatment effect estimator
    • Ex-ante - Doesn’t hurt
    • Ex-post - Usually doesn’t hurt if the covariates are predictive enough.
  • Take care to use an estimator that actually leverages the variance-reduction
    • A stratified difference-in-means
    • Equivalent to OLS with de-meaned dummies for the strata interacted with the treatment indicator (the “Lin” estimator)
  • What to block on?
    • Things that predict the outcome
    • Surprisingly hard to find good ones! Best ones tend to be prior measures of the outcome.

Linear Regression

Linear Regression

  • We often want to adjust for additional covariates when trying to estimate a treatment effect

    • In an RCT - prognostic covariates that reduce variance in Y
    • (Next week) - In an observational design - both prognostic covariates and confounders of Y and X
  • By the law of total expectation, we can write the ATE as an expectation over the distribution of CATEs

    \[\tau = \mathbb{E}_X\bigg[\mathbb{E}[Y_i(1) | X_i] - \mathbb{E}[Y_i(0) | X_i]\bigg]\]

  • Under our consistency + ignorability assumptions:

    \[\mathbb{E}[Y_i(1) | X_i] = \mathbb{E}[Y_i | D_i = 1, X_i]\]

  • Now we have a statistical problem, estimating the CEF of \(Y_i\) given \(X_i\) and \(D_i\)

    • This is the task of regression
    • A ubiquitous regression estimator is ordinary least squares

Set-up

  • We’ll denote our vector of \(N\) observed outcomes with \(Y\)

  • We observe \(K\) predictors for each unit \(i\) collected in a matrix \(\mathbf{X}\)

    • Each row is a \(K\)-length vector \(X_i\) denoting the covariates observed for unit \(i\)

    \[\mathbf{X} = \begin{pmatrix} X_1^\prime \\ X_2^\prime \\ \vdots \\ X_N^\prime \end{pmatrix}_{N \times K} \qquad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}_{N \times 1}\]

  • We’ll use the \(\prime\) notation to denote the transpose of a matrix and treat vectors by default as column vectors (so \(X_i^\prime\) is a \(1 \times K\) row-vector)

  • We assume an \(i.i.d.\) random sample from the target population of interest (though can relax this!)

Best Linear Predictor (BLP)

  • The Best Linear Predictor or population regression is a function of an input vector \(x\) of length \(K\)

    \[m(x) = x^\prime\beta = \beta_1 x_1 + \beta_2 x_2 + \dotsc + \beta_K x_K\]

    • Typically the first element of \(x\) is a constant \(1\), so \(\beta_1\) plays the role of the intercept

  • The \(\beta\) parameters minimize the expected prediction error

    \[\beta = \argmin_b \ \mathbb{E}[(Y_i - X_i^\prime b)^2]\]

  • Note that this is a population quantity

    • \(m(x)\) is an estimand like any other estimand we’ve talked about
    • \(m(x)\) and \(E[Y|X]\) might differ substantially!

Best Linear Predictor (BLP)

  • Is there a closed form expression for the population regression coefficients \(\beta\)?

    • Yes - with some algebra (solving the optimization problem), we get

    \[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]

  • Recall from intro that in the bivariate case, the expression for the slope coefficient \(\beta_1\) can be written as:

    \[\beta_1 = \frac{Cov(X_i, Y_i)}{Var(X_i)}\]

  • What have we assumed so far?

    • Finite mean/variance/covariances for \(Y\) and \(X\)
    • \(\mathbb{E}[X_iX_i^\prime]\) is invertible (no perfect collinearity in columns of \(X\))
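The bivariate identity above has an exact sample analogue, which we can confirm on simulated data: the slope `lm()` reports equals the ratio of the sample covariance to the sample variance.

```r
# The lm() slope equals cov(x, y) / var(x) exactly (the n-1 factors cancel).
set.seed(60637)
x <- rnorm(500)
y <- 1 + 2 * x + rnorm(500)
slope_lm  <- unname(coef(lm(y ~ x))[2])
slope_cov <- cov(x, y) / var(x)
```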

Projection error

  • What can we say about the difference between the BLP and \(Y\)?

    \[e_i = Y_i - m(X_i)\]

  • Rewriting, we have

    \[Y_i = X_i^\prime\beta + e_i\]

  • \(e_i\) is the projection error since the BLP is the “best” projection of \(Y\) into the space of linear combinations of the columns of \(\mathbf{X}\)

Projection error

  • We can show that the projection error is mechanically uncorrelated with the predictors

\[\begin{align*} \mathbb{E}[X_i e_i] &= \mathbb{E}[X_i(Y_i - X_i^\prime\beta)]\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime]\beta\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime](\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iY_i] = 0\\ \end{align*}\]

  • Since \(X_{ik} = 1\) is typically one of the regressors (the “intercept”), this also implies \(\mathbb{E}[e_i] = 0\)
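The same orthogonality holds exactly in-sample for the OLS residuals, which we can verify on simulated data:

```r
# Sample analogue of E[X_i e_i] = 0: OLS residuals are orthogonal to every
# regressor, including the constant.
set.seed(60637)
n <- 300
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(0.5, 1.5)) + rnorm(n)
e_hat <- residuals(lm(y ~ X[, 2]))
orth <- crossprod(X, e_hat)   # X'e: zero up to floating-point error
```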

Projection error

  • This means that the covariance of \(X_i\) and the projection error \(e_i\) is zero!

\[\begin{align*} Cov(X_i, e_i) = \mathbb{E}[X_ie_i] - \mathbb{E}[X_i]\mathbb{E}[e_i] = 0 - 0 = 0 \end{align*}\]

  • Importantly, we have derived these properties of the Best Linear Predictor and the projection error without making any assumptions about the error itself!
    • This is mechanically true by the way we’ve defined the BLP

Estimating the BLP

  • Remember, \(m(X)\) is a population quantity - can we come up with an estimator in our sample that is consistent for it?

    • plug-in principle - Where we have population expectations, plug in their sample equivalents
  • Our estimand

    \[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]

  • Our estimator

    \[\hat{\beta} = \bigg(\frac{1}{N} \sum_{i=1}^N X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{N} \sum_{i=1}^N X_iY_i\bigg)\]

  • Often written as:

    \[\hat{\beta} = (\mathbf{X}^\prime\mathbf{X})^{-1}(\mathbf{X}^\prime Y)\]
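This matrix formula can be computed directly and checked against `lm()` (simulated data; `solve(A, b)` avoids forming the inverse explicitly):

```r
# The plug-in estimator (X'X)^{-1} X'y computed with matrix algebra.
set.seed(60637)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))          # first column: intercept
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))
beta_lm  <- unname(coef(lm(y ~ X[, 2] + X[, 3])))
```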

Estimating the BLP

  • To visualize how \(\mathbf{X}^\prime\mathbf{X}\) is a sum over the outer products \(X_{i}X_{i}^\prime\)

\[\mathbf{X}^\prime \mathbf{X} = \begin{pmatrix} x_{11} & x_{21} & x_{31} & \cdots & x_{N1} \\ x_{12} & x_{22} & x_{32} & \cdots & x_{N2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{1K} & x_{2K} & x_{3K} & \cdots & x_{NK} \end{pmatrix}_{K \times N} \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ x_{31} & x_{32} & \cdots & x_{3K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NK} \end{pmatrix}_{N \times K}\]

\[= \underbrace{\begin{pmatrix} x_{11}^2 & x_{11}x_{12} & \cdots & x_{11}x_{1K} \\ x_{12}x_{11} & x_{12}^2 & \cdots & x_{12}x_{1K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1K}x_{11} & x_{1K}x_{12} & \cdots & x_{1K}^2 \end{pmatrix}}_{X_1 X_1^\prime} + \cdots + \underbrace{\begin{pmatrix} x_{N1}^2 & x_{N1}x_{N2} & \cdots & x_{N1}x_{NK} \\ x_{N2}x_{N1} & x_{N2}^2 & \cdots & x_{N2}x_{NK} \\ \vdots & \vdots & \ddots & \vdots \\ x_{NK}x_{N1} & x_{NK}x_{N2} & \cdots & x_{NK}^2 \end{pmatrix}}_{X_N X_N^\prime}\]

Estimating the BLP

  • And likewise for \(\mathbf{X}^\prime Y\)

\[\mathbf{X}^\prime Y = \begin{pmatrix} x_{11} & x_{21} & x_{31} & \cdots & x_{N1} \\ x_{12} & x_{22} & x_{32} & \cdots & x_{N2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{1K} & x_{2K} & x_{3K} & \cdots & x_{NK} \end{pmatrix}_{K \times N} \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_N \end{pmatrix}_{N \times 1} = \begin{pmatrix} x_{11}Y_1 + x_{21}Y_2 + x_{31}Y_3 + \cdots + x_{N1}Y_N \\ x_{12}Y_1 + x_{22}Y_2 + x_{32}Y_3 + \cdots + x_{N2}Y_N \\ \vdots \\ x_{1K}Y_1 + x_{2K}Y_2 + x_{3K}Y_3 + \cdots + x_{NK}Y_N \end{pmatrix}_{K \times 1}\]

Estimating the BLP

  • Is \(\hat{\beta}\) consistent for \(\beta\)?

  • Let’s write \(\hat{\beta}\) as a function of \(\beta\) and the projection error

    \[\hat{\beta} = \beta + \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_ie_i\bigg) \]

  • We can use the weak law of large numbers

    \[\frac{1}{N} \sum_{i=1}^N X_iX_i^\prime \overset{p}{\to} \mathbb{E}[X_iX_i^\prime]\]

    \[\frac{1}{N} \sum_{i=1}^N X_ie_i \overset{p}{\to} \mathbb{E}[X_ie_i] = 0\]

Estimating the BLP

  • Plugging it all in, we have

    \[\hat{\beta} \overset{p}{\to} \beta\]

  • Ordinary least squares is consistent for the best linear predictor under relatively mild assumptions

    • i.i.d. sample (can weaken this with LLNs for dependent random variables)
    • finite expectation/variance of \(Y\), \(X\)
    • no perfect collinearity (you can check if this is a problem!)
  • What have we not assumed?

    • Linear CEF
    • Homoskedastic errors
    • Normal outcome

Asymptotic normality

  • In large samples, what is the distribution of \(\hat{\beta}\)?

    \[\sqrt{n}(\hat{\beta} - \beta) = \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i\bigg)\]

  • We know that \(\bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1} \overset{p}{\to} \mathbb{E}[X_iX_i^\prime]^{-1}\)

Asymptotic normality

  • We can apply the CLT to the other term. We know the expectation is zero, what about the variance?

    \[Var(X_ie_i) = \mathbb{E}[X_i e_i (X_i e_i)^\prime] = \mathbb{E}[e_i^2X_i X_i^\prime]\]

  • By the CLT, we have

    \[\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i \overset{d}{\to} \mathcal{N}(0, \mathbb{E}[e_i^2X_i X_i^\prime])\]

  • Combining with our other convergence result using Slutsky’s theorem gives

    \[\sqrt{n}(\hat{\beta} - \beta) \overset{d}{\to} \mathcal{N}\bigg(0, (\mathbb{E}[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (\mathbb{E}[X_iX_i^\prime])^{-1}\bigg)\]

Variance estimation

  • Again, we have a bunch of population expectations, let’s plug in their sample equivalents!

  • Our estimand

    \[Var(\hat{\beta}) = \frac{1}{n} (\mathbb{E}[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (\mathbb{E}[X_iX_i^\prime])^{-1}\]

  • Our estimator

    \[\widehat{Var(\hat{\beta})} = \frac{1}{n} \bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n \hat{e_i}^2 X_i X_i^\prime \bigg)\bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\]

  • Equivalently

    \[\widehat{Var(\hat{\beta})} = \bigg(\sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\bigg(\sum_{i=1}^n \hat{e_i}^2 X_i X_i^\prime \bigg)\bigg(\sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\]

Variance estimation

  • This is often known as the “sandwich” estimator

    \[\widehat{Var(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \hat{\Sigma} \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1}\]

  • where \(\hat{\Sigma} = \text{diag}(\hat{e_1}^2, \hat{e_2}^2, \dotsc \hat{e_n}^2)\)

  • You’ll see this referred to as the Eicker-Huber-White heteroskedasticity-consistent variance estimator (or just the robust standard errors)

    • Notably, we have made no assumptions on the variance of \(e_i\), we’re just plugging in the regression residuals!
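The sandwich can be assembled by hand in base R. This sketch uses simulated heteroskedastic data; `estimatr::lm_robust(..., se_type = "HC0")` should produce the same numbers:

```r
# Eicker-Huber-White (HC0) sandwich estimator built from the formula above.
set.seed(60637)
n <- 500
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2)) + rnorm(n) * exp(X[, 2] / 2)   # heteroskedastic
e2 <- residuals(lm(y ~ X[, 2]))^2
bread <- solve(crossprod(X))          # (X'X)^{-1}
meat  <- crossprod(X, e2 * X)         # X' diag(e^2) X
vcov_hc0 <- bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc0))
```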

OLS and Projections

  • One way to think about the fitted values from OLS is that they are a projection of the n-dimensional vector \(Y\) into the column space of \(\mathbf{X}\)

    • A column space is the set of all linear combinations of the columns of \(\mathbf{X}\)
  • We define the projection matrix or hat matrix

    \[\mathbf{P}_{\mathbf{X}} = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X^\prime}\]

  • Our fitted values \(\hat{Y}\) are a projection of \(Y\) to the “closest” vector that can be represented as a linear combination of the columns of \(\mathbf{X}\)

    \[\hat{Y} = \mathbf{P}_{\mathbf{X}}Y = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X^\prime}Y = \mathbf{X}\hat{\beta}\]
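The hat matrix can be constructed explicitly and checked against `lm()`'s fitted values (simulated data); it is also idempotent, as projections must be:

```r
# P_X %*% y reproduces the OLS fitted values, and P_X is idempotent.
set.seed(60637)
n <- 100
X <- cbind(1, rnorm(n))
y <- rnorm(n)
P <- X %*% solve(crossprod(X)) %*% t(X)
yhat_proj <- drop(P %*% y)
yhat_lm   <- unname(fitted(lm(y ~ X[, 2])))
```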

Regression adjustment in randomized experiments

  • We’ve focused on a minimal assumption approach to justifying linear regression.

  • But usually you see regression taught using the Gauss-Markov assumptions (plus normal errors)

    1. Linearity of the CEF
    2. Strict exogeneity of the errors
    3. No perfect collinearity
    4. Spherical errors (homoskedasticity)
    5. Normal errors
  • We’ll review these later in this lecture, but do we need these assumptions to use regression in a randomized experiment?

    • Freedman (2008) “…randomization does not justify the assumptions behind the OLS model”

Regression adjustment in randomized experiments

  • Lin (2013) shows that even a misspecified ordinary least squares regression will yield a consistent estimator of the sample ATE if the regression estimator:
    1. De-means the covariates
    2. Interacts the covariates with the treatment indicator
  • Mathematically, this is equivalent to…
    1. Fitting two regressions - one in the treated group, one in the control group
    2. Predicting the potential outcome under treatment for every unit in the sample using the treatment model
    3. Predicting the potential outcome under control for every unit in the sample using the control model
    4. Taking the average difference between the two predictions!

Imputation estimators

  • Many of the treatment effect estimators we’ll see can be written in terms of imputed potential outcomes

    \[\hat{\tau} = \frac{1}{N} \sum_{i=1}^N \widehat{Y_i(1)} - \widehat{Y_i(0)}\]

  • Our simple difference-in-means estimator is a simple case of this - we just impute the mean for every \(i\)

    • \(\widehat{Y_i(1)} = \bar{Y}_1 = \frac{1}{N_t}\sum_{i = 1}^N Y_i D_i\)
    • \(\widehat{Y_i(0)} = \bar{Y}_0 = \frac{1}{N_c}\sum_{i = 1}^N Y_i (1 - D_i)\)
  • But we could also generate predictions of \(Y_i(1)\) and \(Y_i(0)\) from a regression model!

    • Let \(\beta^{(1)}\) be the coefficients from regressing \(Y_i\) on \(X_i\) among the treated group
    • Let \(\beta^{(0)}\) be the coefficients from regressing \(Y_i\) on \(X_i\) among the control group
    • Our regression imputation estimator is: \(\widehat{Y_i(1)} = X_i^\prime \beta^{(1)}\) and \(\widehat{Y_i(0)} = X_i^\prime \beta^{(0)}\)

Lin (2013) estimator

  • We can recover this difference between the two regression functions from a single model regressing \(Y_i\) on the de-meaned \(X_i\) fully interacted with the treatment \(D_i\)
    • This is what is known as the Lin (2013) regression
    • Implemented in estimatr::lm_lin() which handles the de-meaning for you!
  • Let’s show the equivalence using the Nyhan and Reifler (2015) experiment
    • Suppose that in addition to the strata, we want to adjust for stuff we didn’t explicitly block on
adjusted_est <- tidy(estimatr::lm_lin(combined ~ treatment, 
                                      covariates = ~ groupedblocks + partyleader + commleader + fundraising + voteshare, data=factcheck, weights=aw))
adjusted_est %>% filter(term == "treatment") %>% select(estimate, std.error, statistic, p.value)
    estimate   std.error statistic    p.value
1 -0.0152531 0.008249513  -1.84897 0.06473204

Lin (2013) estimator

  • Fitting the regression in the treated group and control group
treated_reg <- lm(combined ~ groupedblocks + partyleader + commleader + fundraising + voteshare, 
                  data=factcheck %>% filter(treatment == 1), weights=aw)
control_reg <- lm(combined ~ groupedblocks + partyleader + commleader + fundraising + voteshare, 
                  data=factcheck %>% filter(treatment == 0), weights=aw)
  • Predicting for all units
factcheck$Y_1 <- predict(treated_reg, newdata=factcheck)
factcheck$Y_0 <- predict(control_reg, newdata=factcheck)
  • Verify that we get the same result as the Lin (2013) regression!
weighted.mean(factcheck$Y_1, factcheck$aw) - weighted.mean(factcheck$Y_0, factcheck$aw)
[1] -0.0152531

Next week

  • Introduction to observational designs
    • Treatment is not directly randomized by researchers!
  • Selection-on-observables
    • Can we assume that we have adjusted for all factors predicting treatment and outcome?
    • Simple adjustment using a stratification estimator.
  • Encoding our selection-on-observables assumptions using directed acyclic graphs
    • Adjustment criterion - what is a sufficient set of “controls” that would allow us to identify a treatment effect?