\(s_{t,g}^2\), \(s_{c,g}^2\) are the sample variance of \(Y\) within the treated group and control group respectively in stratum \(g\).
Imagine: We ran \(G\) independent mini-experiments and analyzed them separately. Each is unbiased for the conditional ATE (CATE) and has its own standard error.
But we care about the average treatment effect
Estimation under block-randomization
How do we aggregate to get an estimate of the ATE? Take a weighted average by stratum size
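Concretely, using the notation above (with \(N_g\) the size of stratum \(g\), \(N = \sum_g N_g\), and \(n_{t,g}\), \(n_{c,g}\) the treated and control counts within stratum \(g\)), the stratified estimator and its Neyman-style variance estimate are:

\[
\hat{\tau}_{\text{block}} = \sum_{g=1}^{G} \frac{N_g}{N}\,\hat{\tau}_g, \qquad \widehat{\mathbb{V}}\left[\hat{\tau}_{\text{block}}\right] = \sum_{g=1}^{G} \left(\frac{N_g}{N}\right)^2 \left(\frac{s_{t,g}^2}{n_{t,g}} + \frac{s_{c,g}^2}{n_{c,g}}\right)
\]

where \(\hat{\tau}_g\) is the difference-in-means within stratum \(g\).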
When is the variance of the blocked design going to be lower than the variance under complete randomization?
When the strata explain some of the variance in \(Y\) (the population variances satisfy \(S^2_{t,g} < S^2_{t}\) and \(S^2_{c,g} < S^2_{c}\))
Can blocking hurt?
Long debate over whether it’s possible to go wrong by blocking. Athey and Imbens (2017) argue there is no downside to blocking.
This answer depends on the framework for inference.
Pashley and Miratrix (2021) give an extensive review under alternative sampling/inference schemes.
The Athey and Imbens result holds under stratified random sampling from the population and equal treatment probabilities w/in strata.
Intuition: In the worst case scenario, stratification is just a two-stage randomization process equivalent to complete randomization.
This also does not guarantee that the estimated standard error will be smaller
With an irrelevant covariate, we will have fewer degrees of freedom as we are estimating multiple parameters. Estimated SEs under stratification might be higher.
Athey and Imbens suggest falling back on the conservative complete randomization SE.
Post-stratification
Suppose we didn’t stratify ex-ante but observe some covariates. Can we analyze as though we had stratified on these?
Yes: Post-stratification
Key difference from stratification: Number of treated/control w/in stratum is random and not fixed. Stratum sizes also not fixed.
Not as efficient as if we had blocked ex-ante!
Miratrix, Sekhon and Yu (2013)
Usually not a problem - relative to blocking ex-ante, the differences in variances are small.
Problems with many strata + poorly predictive strata.
Unlike the Athey and Imbens setting, benefits not guaranteed, but often doesn’t hurt with good covariate choice.
Example: Nyhan and Reifler (2015)
Nyhan and Reifler (2015) conduct a field experiment on state legislators in November 2012 to study whether politicians reacted to external monitoring of their statements.
Treatment - Legislators received a letter warning them of the risks of having false statements exposed by fact-checkers
Placebo - Legislators received a letter warning that their statements would be observed.
Control - No mailer
Treatment was block-randomized
Exact blocking on state, political party, legislative chamber and existence of a previous PolitiFact rating
Coarsened matching on previous vote share and fundraising
Outcome:
Did the legislator receive a negative PolitiFact rating?
Was there media coverage of a legislator’s inaccurate statements?
Example: Nyhan and Reifler (2015)
For the main analysis, the paper combines the Placebo/Control conditions.
Basically, a lot of strata have too few members within a party who were previously fact-checked (fact-checks are rare). We’ll combine some of these strata for the analysis.
Could go through and merge them manually; to save time, I’ll just ignore chamber (it’s not that prognostic)
Let’s make those strata (paste() just concatenates the raw variable values)
factcheck <- factcheck %>%
  mutate(groupedblocks = paste(state, gop, anycheck, sep = "-"))
# Merge some Virginia strata that still have too few units
factcheck$groupedblocks[factcheck$groupedblocks == "Virginia-1-1"] <- "Virginia-0-1"
Now, we can calculate the stratified difference-in-means estimate and compute the correct sampling variance!
You can do this with group_by() and mean() and var() (or, for the weighted versions, weighted.mean() and a weighted variance such as Hmisc::wtd.var())
But it’s so much easier to do this with a (special) regression.
We’ll explain why this works later, but for now, let’s show you the estimate!
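As a sketch of that arithmetic, here is a minimal base-R version on simulated data (the data and variable names here are hypothetical, not the actual factcheck variables):

```r
set.seed(42)
# Simulated blocked experiment: 3 strata, treatment randomized within each
g <- rep(c("A", "B", "C"), times = c(40, 30, 30))
d <- unlist(lapply(c(40, 30, 30), function(n) sample(rep(0:1, n / 2))))
y <- 2 * d + as.numeric(factor(g)) + rnorm(100)  # true ATE = 2

# Within-stratum difference-in-means and its estimated sampling variance
tau_g <- tapply(seq_along(y), g, function(i) {
  mean(y[i][d[i] == 1]) - mean(y[i][d[i] == 0])
})
var_g <- tapply(seq_along(y), g, function(i) {
  var(y[i][d[i] == 1]) / sum(d[i] == 1) + var(y[i][d[i] == 0]) / sum(d[i] == 0)
})

# Aggregate with stratum-size weights N_g / N
w <- table(g) / length(y)
tau_hat <- sum(w * tau_g)
se_hat <- sqrt(sum(w^2 * var_g))
```

The weights are the stratum shares \(N_g/N\), so the variance of the aggregate is the weighted sum of the within-stratum variances.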
Example: Nyhan and Reifler (2015)
We use the Lin (2013) estimator in the estimatr package
Mathematically equivalent to our stratified difference-in-means estimator when using dummy indicators for strata (we’ll show you why later)
We’ll use the \(\prime\) notation to denote the transpose of a matrix and treat vectors by default as column vectors (so \(X_i^\prime\) is a \(1 \times K\) row-vector)
We assume an \(i.i.d.\) random sample from the target population of interest (though can relax this!)
Best Linear Predictor (BLP)
The Best Linear Predictor or population regression is a function of some input vector \(x\) of length \(K\)
Importantly, we have derived these properties on the Best Linear Predictor and the projection error without making any assumptions on the error itself!
This is mechanically true by the way we’ve defined the BLP
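In symbols, the BLP solves a population least-squares problem:

\[
\beta = \arg\min_{b} \mathbb{E}\left[(Y_i - X_i^\prime b)^2\right] = \mathbb{E}\left[X_i X_i^\prime\right]^{-1} \mathbb{E}\left[X_i Y_i\right], \qquad m(x) = x^\prime \beta
\]

and the first-order condition gives \(\mathbb{E}[X_i e_i] = 0\) for the projection error \(e_i = Y_i - X_i^\prime \beta\). This orthogonality is a consequence of the definition, not an assumption that \(e_i\) is mean-zero given \(X_i\), normal, or homoskedastic.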
Estimating the BLP
Remember, \(m(X)\) is a population quantity - can we come up with an estimator in our sample that is consistent for it?
plug-in principle - Where we have population expectations, plug in their sample equivalents
where \(\hat{\Sigma} = \text{diag}(\hat{e}_1^2, \hat{e}_2^2, \dotsc, \hat{e}_n^2)\)
You’ll see this referred to as the Eicker-Huber-White heteroskedasticity-consistent variance estimator (or just the robust standard errors)
Notably, we have made no assumptions on the variance of \(e_i\), we’re just plugging in the regression residuals!
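To make the plug-in logic concrete, here is a base-R sketch on simulated data (hypothetical variables) that computes the OLS coefficients and the robust "sandwich" variance \((\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime\hat{\Sigma}\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\) by hand:

```r
set.seed(1)
n <- 200
x <- rnorm(n)
# Heteroskedastic errors: the error variance grows with |x|
y <- 1 + 2 * x + rnorm(n, sd = abs(x))
X <- cbind(1, x)  # design matrix with intercept

# Plug-in (OLS) coefficients: (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# HC0 robust variance: (X'X)^{-1} X' Sigma_hat X (X'X)^{-1},
# where Sigma_hat is the diagonal matrix of squared residuals
e_hat <- y - X %*% beta_hat
bread <- solve(t(X) %*% X)
meat <- t(X) %*% diag(as.vector(e_hat)^2) %*% X
V_robust <- bread %*% meat %*% bread
se_robust <- sqrt(diag(V_robust))
```

The coefficients match `lm()` exactly; only the variance estimate differs from the classical (homoskedastic) one.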
OLS and Projections
One way to think about the fitted values from OLS is that they are a projection of the n-dimensional vector \(Y\) into the column space of \(\mathbf{X}\)
A column space is the set of all linear combinations of the columns of \(\mathbf{X}\)
Our fitted values \(\hat{Y}\) are a projection of \(Y\) to the “closest” vector that can be represented as a linear combination of the columns of \(\mathbf{X}\)
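A small numerical sketch (simulated data) makes this concrete: building the projection matrix \(P = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime\) by hand reproduces the fitted values from `lm()`, and \(P\) is idempotent, as a projection must be:

```r
set.seed(7)
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))  # n x 3 design matrix
y <- rnorm(n)

# Projection ("hat") matrix onto the column space of X
P <- X %*% solve(t(X) %*% X) %*% t(X)

# Fitted values: the closest point to y within col(X)
y_hat <- P %*% y
```

Applying \(P\) twice is the same as applying it once: once \(Y\) is in the column space, projecting again does nothing.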
We’ve focused on a minimal assumption approach to justifying linear regression.
But usually you see regression taught using the Gauss-Markov assumptions (plus normal errors)
Linearity of the CEF
Strict exogeneity of the errors
No perfect collinearity
Spherical errors (homoskedasticity)
Normal errors
We’ll review these later in this lecture, but do we need these assumptions to use regression in a randomized experiment?
Freedman (2008) “…randomization does not justify the assumptions behind the OLS model”
Regression adjustment in randomized experiments
Lin (2013) shows that even a misspecified ordinary least squares regression will yield a consistent estimator of the sample ATE if the regression estimator:
De-means the covariates
Interacts the covariates with the treatment indicator
Mathematically, this is equivalent to…
Fitting two regressions - one in the treated group, one in the control group
Predicting the potential outcome under treatment for every unit in the sample using the treatment model
Predicting the potential outcome under control for every unit in the sample using the control model
Taking the average difference between the two predictions!
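The steps above can be sketched in base R on simulated data (hypothetical variables): the two-regression imputation estimator matches, to numerical precision, the coefficient on treatment from a single regression of \(Y_i\) on \(D_i\) fully interacted with the de-meaned covariate:

```r
set.seed(3)
n <- 100
dat <- data.frame(x = rnorm(n), d = rbinom(n, 1, 0.5))
dat$y <- 1 + dat$x + dat$d * (2 + dat$x) + rnorm(n)  # heterogeneous effects

# Two regressions: one in the treated group, one in the control group
fit1 <- lm(y ~ x, data = dat, subset = d == 1)
fit0 <- lm(y ~ x, data = dat, subset = d == 0)

# Impute both potential outcomes for every unit, then average the difference
y1_hat <- predict(fit1, newdata = dat)
y0_hat <- predict(fit0, newdata = dat)
tau_imp <- mean(y1_hat - y0_hat)

# Equivalent: one regression of y on d fully interacted with de-meaned x
dat$xc <- dat$x - mean(dat$x)
tau_lin <- coef(lm(y ~ d * xc, data = dat))["d"]
```

Because the fully interacted regression is numerically identical to fitting the two groups separately, the coefficient on `d` equals the average imputed difference.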
Imputation estimators
Many of the treatment effect estimators we’ll see can be written in terms of imputed potential outcomes
But we could also generate predictions of \(Y_i(1)\) and \(Y_i(0)\) from a regression model!
Let \(\beta^{(1)}\) be the coefficients from regressing \(Y_i\) on \(X_i\) among the treated group
Let \(\beta^{(0)}\) be the coefficients from regressing \(Y_i\) on \(X_i\) among the control group
Our regression imputation estimator imputes \(\widehat{Y_i(1)} = X_i^\prime \beta^{(1)}\) and \(\widehat{Y_i(0)} = X_i^\prime \beta^{(0)}\) and averages the difference: \(\hat{\tau} = \frac{1}{N}\sum_{i=1}^{N} \left(\widehat{Y_i(1)} - \widehat{Y_i(0)}\right)\)
Lin (2013) estimator
We can recover this difference between the two regression functions from a single model regressing \(Y_i\) on the de-meaned \(X_i\) fully interacted with the treatment \(D_i\)
This is what is known as the Lin (2013) regression
Implemented in estimatr::lm_lin() which handles the de-meaning for you!
Let’s show the equivalence using the Nyhan and Reifler (2015) experiment
Suppose that in addition to the strata, we want to adjust for stuff we didn’t explicitly block on