Week 2: Potential Outcomes and Experiments

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

January 26, 2026

This week

  • Defining causal estimands
    • The “potential outcomes” model of causation
  • Causal identification
    • Linking causal estimands to observable quantities
  • Randomized experiments as a solution to the identification problem
    • Treatment assignment is independent of the potential outcomes
  • Statistical inference for completely randomized experiments
    • Neyman’s approach
    • Fisher’s approach

The potential outcomes model

Thinking about causal effects

  • Two types of causal questions (Gelman and Rubin, 2013)

  • Causes of effects

    • What are the factors that generate some outcome \(Y\)?
    • “Why?” questions: Why do states go to war? Why do politicians get re-elected?
  • Effects of causes

    • If \(X\) were to change, what might happen to \(Y\)?
    • “What if?” questions: If a politician were an incumbent, would they be more likely to be re-elected compared to if they were a non-incumbent?
  • Our focus in this class is on effects of causes

    • Why? We can connect them to well-defined statistical quantities of interest (e.g. an “average treatment effect”)
    • “Causes of effects” are still important questions, but they’re more questions of theory

Defining a causal effect

  • Historically, causality was seen as a deterministic process.
    • Hume (1740): Causes are regularities in events of “constant conjunctions”
    • Mill (1843): Method of difference
  • This became problematic – empirical observation alone does not demonstrate causality.
    • Russell (1913): Scientists aren’t interested in causality!
  • How do we talk about causation that both incorporates uncertainty in measurement and clearly defines what we mean by a “causal effect”?

The potential outcomes model

  • Rubin (1974) - formalizes a framework for understanding causation from a statistical perspective.

    • Inspired by earlier work by Neyman (1923) and Fisher (1935) on randomized experiments.
  • We’ll spend most of our time with this approach, often called the Rubin Causal Model or potential outcomes framework.

  • Core idea:

    • Causal effects are effects of interventions
    • Causal effects are contrasts in counterfactuals
  • The potential outcomes framework clarifies:

    1. What action is doing the causing?
    2. Compared to what alternative action?
    3. On what outcome metric?
    4. How would we learn about the effect from data?

Statistical setup.

  • Population of units
    • Finite population or infinite super-population
  • Sample of \(N\) units from the population indexed by \(i\)
  • Observed outcome \(Y_i\)
  • Binary treatment indicator \(D_i\).
    • Units receiving “treatment”: \(D_i = 1\)
    • Units receiving “control”: \(D_i = 0\)
  • Covariates (observed prior to treatment) \(X_i\)

Potential outcomes

  • The potential outcome \(Y_i(d)\) is the value that the outcome would take if \(D_i\) were set to \(d\).
    • For binary \(D_i\): \(Y_i(1)\) is the value we would observe if unit \(i\) were treated.
    • \(Y_i(0)\) is the value we would observe if unit \(i\) were under control
  • We model the potential outcomes as fixed attributes of the units.
    • “Potential” in the sense that what is actually observed is a function of treatment assignment
  • Notation alert! – Sometimes you’ll see potential outcomes written as:
    • \(Y_i^1\), \(Y_i^0\) or \(Y_i^{d=1}\), \(Y_i^{d=0}\)
    • \(Y_{i0}\), \(Y_{i1}\)
    • \(Y_1(i)\), \(Y_0(i)\)
  • Causal effects are contrasts in potential outcomes.
    • Individual treatment effect: \(\tau_i = Y_i(1) - Y_i(0)\)
    • Can consider ratios or other transformations (e.g. \(\frac{Y_i(1)}{Y_i(0)}\))

Consistency/SUTVA

  • How do we link the potential outcomes to observed ones?

  • Consistency/Stable Unit Treatment Value (SUTVA) assumption

    \[Y_i(d) = Y_i \text{ if } D_i = d\]

  • Sometimes you’ll see this w/ binary \(D_i\) (often in econometrics)

    \[Y_i = Y_i(1)D_i + Y_i(0)(1-D_i)\]

  • Implications

    1. No interference - other units’ treatments don’t affect \(i\)’s potential outcomes.
    2. Single version of treatment
    3. \(D\) is in principle manipulable - a “well-defined intervention”
    4. The means by which treatment is assigned is irrelevant (a version of 2)

Positivity/Overlap

  • We also need some assumptions on the treatment assignment mechanism \(D_i\).

  • In order to be able to observe some units’ values of \(Y_i(1)\) or \(Y_i(0)\) treatment can’t be deterministic. For all \(i\):

    \[ 0 < Pr(D_i = 1) < 1 \]

  • If no unit could ever receive treatment (or control), it would be impossible to learn about \(\mathbb{E}[Y_i | D_i = 1]\) (or \(\mathbb{E}[Y_i | D_i = 0]\))

  • This is sometimes called a positivity or overlap assumption.

    • Pretty trivial in a randomized experiment, but can be tricky in observational studies when \(D_i\) is perfectly determined by some covariates \(X_i\)

A missing data problem

  • It’s useful to think of the causal inference problem in terms of missingness in the complete table of potential outcomes.
| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|---|---|---|---|---|
| \(1\) | \(1\) | \(5\) | ? | \(5\) |
| \(2\) | \(0\) | ? | \(-3\) | \(-3\) |
| \(3\) | \(1\) | \(9\) | ? | \(9\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(N\) | \(0\) | ? | \(8\) | \(8\) |
  • If we could observe both \(Y_i(1)\) and \(Y_i(0)\) for each unit, then this would be easy!
  • But we can’t - we only observe what we’re given by \(D_i\)
  • Holland (1986) calls this “The Fundamental Problem of Causal Inference”

Causal Estimands

  • The individual causal effect: \(\tau_i\) (can’t identify this w/o strong assumptions!)

    \[\tau_i = Y_i(1) - Y_i(0)\]

  • The sample average treatment effect (SATE): \(\tau_s\)

    \[\tau_s = \frac{1}{N}\sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big)\]

  • The population average treatment effect (PATE) \(\tau_p\)

    \[\tau_p = \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]\]
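  • To make these estimands concrete, here is a minimal R sketch (my illustration, not from the slides) using a hypothetical table in which both potential outcomes are visible; the PATE would be the same kind of average taken over the population rather than the sample.

# Hypothetical toy table of potential outcomes (both columns visible)
po <- data.frame(Y_1 = c(5, -1, 9, 2, 4, 8),
                 Y_0 = c(3, -3, 4, 2, 0, 8))

tau_i <- po$Y_1 - po$Y_0   # individual treatment effects (never all observed in practice)
mean(tau_i)                # the SATE: the average of the individual effects in this sample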

Sample vs. Population Estimands

  • With the SATE and PATE, we’ve made an important distinction between two sources of uncertainty
    • Random assignment of treatment (unobserved P.O.s)
    • Sampling from a population.
  • Even if we’re just interested in the treatment effect within our sample, there’s still uncertainty
  • When can we go from SATE to PATE?
    • If we have a random sample from the target population
    • If there are no sources of effect heterogeneity that differ between sample and target population
    • We’ll spend Week 3 talking about this problem - external validity

Causal vs. Associational Estimands

Causal Identification

  • Causal identification: Can we learn about the value of a causal effect from the observed data?
    • Can we express the causal estimand (e.g. \(\tau_p = \mathbb{E}[Y_i(1) - Y_i(0)]\)) entirely in terms of observable quantities?
  • Causal identification comes prior to questions of estimation
    • It doesn’t matter whether you’re using regression, weighting, matching, doubly-robust estimation, double-LASSO, etc…
    • If you can’t answer the question “What’s your identification strategy?” then no amount of fancy stats will solve your problems.
  • Identification requires assumptions about the connection between the observed data \(Y_i\), \(D_i\) and the unobserved counterfactuals \(Y_i(d)\)
    • (e.g.) Under what assumptions will the observed difference-in-means identify the average treatment effect?

Identifying the ATT

  • Suppose we want to identify the (population) Average Treatment Effect on the Treated (ATT)

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) | D_i = 1]\]

  • Let’s see what our consistency/SUTVA assumption gets us!

  • First, let’s use linearity:

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1]\]

  • Next, consistency

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1]\]

Identifying the ATT

  • Still not enough though. We have an unobserved term \(\mathbb{E}[Y_i(0) | D_i = 1]\). Why can’t we observe this directly?

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1]\]

  • Let’s see what the difference would be between the ATT and the simple difference-in-means \(\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\). Add and subtract \(\mathbb{E}[Y_i | D_i = 0]\)

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i | D_i = 0] + \mathbb{E}[Y_i | D_i = 0]\]

  • Rearranging terms

    \[\tau_{\text{ATT}} = \bigg(\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\bigg) - \bigg(\mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\bigg)\]

Identifying the ATT

  • Applying consistency once more to the control group, \(\mathbb{E}[Y_i | D_i = 0] = \mathbb{E}[Y_i(0) | D_i = 0]\), we now have an expression for the ATT in terms of the difference-in-means and a bias term

    \[\tau_{\text{ATT}} = \underbrace{\bigg(\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\bigg)}_{\text{Difference-in-means}} - \underbrace{\bigg(\mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0]\bigg)}_{\text{Selection-into-treatment bias}}\]

  • What does this bias term represent? How can we interpret it?

    • How much higher (or lower) the potential outcomes under control are for units that receive treatment vs. those that receive control.
    • Sometimes called a selection-into-treatment problem - units that choose treatment may have higher or lower potential outcomes than those that choose control.
  • Can do the same analysis for the average treatment effect on the controls (ATC) and, by extension, the average treatment effect (ATE)

Selection-into-treatment bias

  • Can use theory to “sign the bias”:
    • Suppose \(Y_i\) was an indicator of whether someone voted in an election and \(D_i\) was an indicator for whether they received a political mailer.
    • Consider a world where the mailer was sent out non-randomly to everyone who had signed up for a politician’s mailing list.
    • If we took the difference in turnout rates between voters who received the mailer and voters who did not receive the mailer, would we be over-estimating or under-estimating the effect of treatment?

Ignorability/Unconfoundedness

  • What if we want point identification and not just bounds on the causal effect?

    • What assumption can we make such that the difference-in-means identifies the ATT (or ATE)?
  • We assume that the selection-into-treatment bias is \(0\):

    \[\mathbb{E}[Y_i(0) | D_i = 1] = \mathbb{E}[Y_i(0) | D_i = 0]\]

    \[\mathbb{E}[Y_i(1) | D_i = 1] = \mathbb{E}[Y_i(1) | D_i = 0]\]

  • This will be true if treatment is independent of the potential outcomes.

    \[\{Y_i(1), Y_i(0)\} {\perp \! \! \! \perp} D_i\]

  • Common names for this assumption: exogeneity, unconfoundedness, ignorability

    • In simple terms: Treatment is not systematically more/less likely to be assigned to units that have higher/lower potential outcomes.

Ignorability/Unconfoundedness

  • The difference-in-means identifies the average treatment effect under three assumptions:

    1. Consistency/SUTVA
    2. Positivity/Overlap
    3. Ignorability/Unconfoundedness
  • Consistency gives us:

    \[\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0] = \mathbb{E}[Y_i(1) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0]\]

  • And ignorability gives us:

    \[\mathbb{E}[Y_i(1) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] = \tau\]

Experiments

Randomized Experiments

  • What sort of research design justifies ignorability?

    • A randomized experiment!
  • An experiment is any study where a researcher knows and controls the treatment assignment probability \(Pr(D_i = 1)\)

  • A randomized experiment is an experiment that satisfies:

    • Positivity: \(0 < Pr(D_i = 1) < 1\) for all units
    • Ignorability: \(Pr(D_i = 1| \mathbf{Y}(1), \mathbf{Y}(0)) = Pr(D_i = 1)\)
      • Another implication of \(\mathbf{Y}(1), \mathbf{Y}(0) {\perp \! \! \! \perp} D_i\)
      • Treatment assignment probabilities do not depend on the potential outcomes.

Types of experiments

  • Lots of ways in which we could design a randomized experiment where ignorability holds:
  • Let \(N_t\) be the number of treated units, \(N_c\) number of controls
  • Bernoulli randomization:
    • Independent coin flips for each \(D_i\). \(Pr(D_i = 1) = p\)
    • \(D_i {\perp \! \! \! \perp} D_j\) for all \(i\), \(j\).
    • \(N_t\), \(N_c\) are random variables
  • Complete randomization
    • Fix \(N_t\) and \(N_c\) in advance. Randomly select \(N_t\) units to be treated.
    • Each unit has an equal probability to be treated.
    • Each assignment with \(N_t\) treated units is equally likely to occur
    • \(D_i\) is independent of the potential outcomes, but treatment assignments are weakly (negatively) dependent across units (see the sketch below).
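  • A minimal R sketch of these two schemes (my illustration, with a hypothetical \(N = 10\) and \(p = 1/2\), \(N_t = 5\)):

set.seed(813)
N <- 10

# Bernoulli randomization: independent coin flips, so the number treated is random
D_bernoulli <- rbinom(N, size = 1, prob = 0.5)
sum(D_bernoulli)   # varies from draw to draw

# Complete randomization: fix N_t = 5 and randomly permute the assignment vector
D_complete <- sample(rep(c(1, 0), times = c(5, N - 5)))
sum(D_complete)    # always exactly 5 treated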

Types of experiments

  • Stratified randomization
    • Using covariates \(X_i\), form \(J\) total blocks or strata of units with similar or identical covariate values.
    • Completely randomize within each of the \(J\) blocks
    • In the extreme case, matched-pair randomization with strata of size \(2\).
  • Cluster randomization
    • Each unit \(i\) belongs to some larger cluster \(C_i \in \{1, 2, \dotsc, C\}\), with \(C < N\).
    • Treatment is assigned by complete randomization at the cluster level
      • Randomly select some number of clusters to be treated; the remainder get control.
    • Units that share cluster membership get the same treatment (\(C_i = C_j \Rightarrow D_i = D_j\)); see the sketch below.
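  • And a sketch of stratified and cluster assignment (again my illustration, with hypothetical blocks and clusters):

set.seed(813)

# Stratified (block) randomization: complete randomization within each block
blocks <- data.frame(id = 1:12, block = rep(1:3, each = 4))
blocks$D <- ave(rep(0, nrow(blocks)), blocks$block,
                FUN = function(x) sample(rep(c(1, 0), each = length(x)/2)))

# Cluster randomization: randomize whole clusters; units inherit their cluster's treatment
units <- data.frame(id = 1:12, cluster = rep(1:4, each = 3))
treated_clusters <- sample(unique(units$cluster), size = 2)   # treat 2 of the 4 clusters
units$D <- as.numeric(units$cluster %in% treated_clusters)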

Complete randomization

  • How do we do estimation and inference under complete randomization?

    • We’ll start with the finite-sample setting and illustrate the Neyman (1923) approach to inference for the SATE.
  • Define our quantity of interest, the sample average treatment effect

    \[\tau_{\text{s}} = \frac{1}{N}\sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big)\]

  • Our estimator is the sample difference-in-means.

    \[\hat{\tau} = \frac{1}{N_t} \sum_{i=1}^N Y_i D_i - \frac{1}{N_c} \sum_{i=1}^N Y_i (1 - D_i)\]

Finite sample inference

  • Consider a study with \(N_t = 3\), \(N_c = 3\) and suppose we could see the true “table of science”
    • Under one realization of the treatment \(\mathbf{D}\), we have:
| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|---|---|---|---|---|
| \(1\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(2\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(4\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(5\) | \(0\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | \(1\) | \(1\) |
  • For this assignment, our realization of \(\hat{\tau}\) (our estimate) would be:

    \[\frac{1 + 1 + 1}{3} - \frac{1 + 1 + 0}{3} = \frac{1}{3}\]

Finite sample inference

  • How about another, equally likely realization?
| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|---|---|---|---|---|
| \(1\) | \(0\) | \(1\) | \(0\) | \(0\) |
| \(2\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(4\) | \(1\) | \(0\) | \(1\) | \(0\) |
| \(5\) | \(1\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(0\) | \(1\) | \(1\) | \(1\) |
  • For this randomization, our realization of \(\hat{\tau}\) would be:

    \[\frac{1 + 0 + 0}{3} - \frac{0 + 1 + 1}{3} = -\frac{1}{3}\]

Finite sample inference

  • Over all possible randomizations, what is the distribution?
    • We can run a quick simulation and find out
### Define the data frame
data <- data.frame(Y_1 = c(1,0,1,0,0,1), 
                   Y_0 = c(0, 1, 0, 1, 0, 1))

## Simulate the sampling distribution
nIter = 10000
sate_est = rep(NA, nIter)
set.seed(53703)
for(i in 1:nIter){
  data$D = sample(rep(c(0,1), each=3))
  data$Y = data$D*data$Y_1 + (1-data$D)*data$Y_0
  sate_est[i] = mean(data$Y[data$D==1]) - mean(data$Y[data$D==0])
}

Finite sample inference

  • First, what’s the expectation of our estimator?
mean(sate_est)
[1] -0.00387
  • Next, what’s the variance?
var(sate_est)
[1] 0.0661

Finite sample inference

  • Of course, in real data, we only get one estimate.
    • Need to rely on theory to understand the distribution that estimate came from in order to do inference.
  • Is \(\hat{\tau}|\mathbf{Y}(1), \mathbf{Y}(0)\) unbiased for the SATE?
    • Under complete randomization: Yes!
  • What is the sampling variance \(Var(\hat{\tau})\) under a finite sample (fixed \(\mathbf{Y}(1)\), \(\mathbf{Y}(0)\))?
    • Surprisingly, depends on the amount of effect heterogeneity
  • What should our estimator of the sampling variance \(\widehat{Var(\hat{\tau})}\) be?
    • Our conventional variance estimator for the difference-in-means is conservative
    • Sadly, we can’t leverage the effect heterogeneity part (w/o more assumptions)!
    • Fundamental problem of causal inference strikes again!

Unbiasedness

  • Let’s show \(\hat{\tau}\) is unbiased for the SATE. First, by linearity of expectations:

    \[\mathbb{E}[\hat{\tau} | \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N \mathbb{E}\bigg[Y_i D_i \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg] - \frac{1}{N_c} \sum_{i=1}^N \mathbb{E}\bigg[Y_i (1 - D_i) \bigg| \mathbf{Y}(1), \mathbf{Y}(0) \bigg]\]

  • By consistency \(Y_iD_i = Y_i(1)D_i\) and \(Y_i(1-D_i) = Y_i(0)(1-D_i)\)

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N \mathbb{E}\bigg[Y_i(1) D_i \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg] - \frac{1}{N_c} \sum_{i=1}^N \mathbb{E}\bigg[Y_i(0) (1 - D_i) \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg]\]

  • Conditional on the potential outcomes, \(Y_i(1)\) and \(Y_i(0)\) are constants

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N Y_i(1) \mathbb{E}\bigg[ D_i\bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg] - \frac{1}{N_c} \sum_{i=1}^N Y_i(0) \mathbb{E}\bigg[(1 - D_i) \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg]\]

Unbiasedness

  • \(D_i\) has a known distribution under complete randomization and its expectation is \(Pr(D_i = 1)\), which is just \(N_t/N\)

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N Y_i(1) \frac{N_t}{N} - \frac{1}{N_c} \sum_{i=1}^N Y_i(0) \frac{N_c}{N}\]

  • Pulling out the constants

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N} \sum_{i=1}^N Y_i(1) - \frac{1}{N} \sum_{i=1}^N Y_i(0)\]

  • And we have the SATE!

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N} \sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big) = \tau_s\]

Sampling variance

  • What’s the variance of \(\hat{\tau}\), conditional on the sample? Slightly tricky: \(D_i\) is not independent of \(D_j\)

    \[Var\bigg(\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)\bigg) = \frac{S^2_t}{N_t} + \frac{S^2_c}{N_c} - \frac{S^2_{\tau_i}}{N}\]

  • The outcome variances are:

    \[S_t^2 = \frac{1}{N-1} \sum_{i=1}^N \bigg(Y_i(1) - \overline{Y(1)}\bigg)^2\] \[S_c^2 = \frac{1}{N-1} \sum_{i=1}^N \bigg(Y_i(0) - \overline{Y(0)}\bigg)^2\]

  • And the third term is the sample variance of the treatment effects

    \[S_{\tau_i}^2 = \frac{1}{N-1} \sum_{i=1}^N \bigg(\big(Y_i(1) - Y_i(0)\big) - \big(\overline{Y(1)} - \overline{Y(0)}\big) \bigg)^2\]

Sampling variance

  • Can we estimate the sampling variance?

  • \(S^2_t\) and \(S^2_c\) can be estimated from their sample analogues (just the sample variances within treated/control groups)

    \[s_t^2 = \frac{1}{N_t-1} \sum_{i:D_i = 1} \bigg(Y_i(1) - \overline{Y}_t^{\text{obs}}\bigg)^2\]

    \[s_c^2 = \frac{1}{N_c-1} \sum_{i:D_i = 0} \bigg(Y_i(0) - \overline{Y}_c^{\text{obs}}\bigg)^2\]

  • But…we can’t estimate \(S^2_{\tau_i}\) directly from the sample!

    • The fundamental problem of causal inference! Can’t observe individual treatment effects.

Neyman variance

  • Neyman suggested just ignoring that third term and using our familiar estimator

    \[\widehat{\mathbb{V}}_{\text{Neyman}} = \frac{s_t^2}{N_t} + \frac{s_c^2}{N_c}\]

  • What are its properties?

    • We know it’s conservative since \(S_{\tau_i}^2 \ge 0\).
    • Confidence intervals using the Neyman standard error \(\sqrt{\widehat{\mathbb{V}}_{\text{Neyman}}}\) will be no smaller than they should be.
    • If treatment effects are constant, it’s unbiased!

Neyman variance

  • Why do we see a difference between the true variance and the Neyman variance?
    • Let’s go back to our \(N=6\) example!
| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|---|---|---|---|---|
| \(1\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(2\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(4\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(5\) | \(0\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | \(1\) | \(1\) |

Neyman variance

  • The variance from our simulation was…
var(sate_est)
[1] 0.0661
  • We can verify our exact variance calculation from before…
var(data$Y_1)/3 + var(data$Y_0)/3 - var(data$Y_1 - data$Y_0)/6
[1] 0.0667
  • But if we ignore the variance of the treatment effects…
var(data$Y_1)/3 + var(data$Y_0)/3
[1] 0.2

Neyman variance

  • When is the Neyman variance estimator unbiased/consistent for the true variance of \(\hat{\tau}\)?
    • When effects are constant
    • …or when we’re targeting the population ATE (under a “random sampling” assumption)
  • Intuition: With random sampling from a target population, we can think of the treated and control groups as two independent samples of sizes \(N_t\) and \(N_c\) from the population distributions of \(Y_i(1)\) and \(Y_i(0)\), respectively.

Illustration: Gerber, Green and Larimer (2008)

  • Gerber, Green and Larimer (2008) want to know what causes people to vote.
    • What sorts of encouragements will get people to turn out more or less?
  • Five treatment conditions in a randomized GOTV mailer experiment:
    • No mailer (0)
    • “Researchers will be studying your turnout” mailer (Hawthorne) (1)
    • “Voting is a civic duty” mailer (Civic Duty) (2)
    • “Your and your neighbors’ voting history” mailer (Neighbors) (3)
    • “Your turnout history” mailer (Self) (4)
  • We’re going to analyze at the household level
    • Treatment is randomized by household - useful to analyze at the level of randomization.
    • Is \(Y_i(d)\) well-defined for an individual? Somewhat tricky - likely spillovers across household members.

Illustration: Gerber, Green and Larimer (2008)

# Packages: haven for read_dta(), dplyr for the pipe/group_by, knitr for kable()
library(haven)
library(dplyr)
library(knitr)

# Load the data
data <- read_dta('assets/data/ggr_2008_individual.dta')

# Aggregate to the household level
data_hh <- data %>% group_by(hh_id) %>% summarize(treatment = treatment[1], voted = mean(voted))

# For each treatment condition, calculate N and share voting
hh_means <- data_hh %>% group_by(treatment) %>% summarize(N = n(), voted = mean(voted))
kable(hh_means)
| treatment | N | voted |
|---|---|---|
| 0 | 99999 | 0.304 |
| 1 | 20002 | 0.332 |
| 2 | 20001 | 0.325 |
| 3 | 20000 | 0.389 |
| 4 | 20000 | 0.357 |

Illustration: Gerber, Green and Larimer (2008)

  • Let’s estimate the ATE of the “Neighbors” treatment relative to control
# Estimated ATE of Neighbors (3) vs. Control (0)
ate <- mean(data_hh$voted[data_hh$treatment == 3]) -
  mean(data_hh$voted[data_hh$treatment == 0])
ate
[1] 0.0848
  • And let’s compute the Neyman variance
# Estimate the sampling variance
var_ate = var(data_hh$voted[data_hh$treatment == 3])/sum(data_hh$treatment == 3) +
  var(data_hh$voted[data_hh$treatment == 0])/sum(data_hh$treatment == 0)

# Square root to get estimated SE
sqrt(var_ate)
[1] 0.0034

Illustration: Gerber, Green and Larimer (2008)

  • 95% asymptotic confidence interval and p-value against null of no ATE.
# Confidence interval (assuming asymptotic normality)
ate_95CI = c(ate - qnorm(.975)*sqrt(var_ate),
  ate + qnorm(.975)*sqrt(var_ate))
ate_95CI
[1] 0.0781 0.0915
# P-value H_0: \tau = 0, H_a: \tau \neq 0
p_val = 2*pnorm(-abs(ate/sqrt(var_ate)))
p_val
[1] 3.64e-137

Illustration: Gerber, Green and Larimer (2008)

  • Fun fact: You can get this via OLS regression!
    • OLS with a single binary regressor is just the difference-in-means.
    • Take care w/ inference: the classic OLS standard errors impose too many additional assumptions (homoskedasticity)
    • Heteroskedasticity-“robust” SEs are close to the Neyman variance in large samples…
    • …and Samii and Aronow (2012) show that the HC2 finite sample correction gets you the Neyman variance exactly.
  • That’s the default for lm_robust() in the estimatr package!
library(estimatr)  # provides lm_robust() with HC2 standard errors by default
lm_robust(voted ~ I(treatment==3), data=data_hh %>% filter(treatment == 3|treatment == 0))
                      Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper
(Intercept)             0.3043    0.00132   230.2  0.00e+00   0.3017   0.3069
I(treatment == 3)TRUE   0.0848    0.00340    24.9 8.13e-137   0.0781   0.0915
                          DF
(Intercept)           119997
I(treatment == 3)TRUE 119997

Fisher’s Exact Test

Fisher’s Exact Test

  • Neyman’s framework:
    • Estimand: Average Treatment Effect: \(\tau = \mathbb{E}[Y_i(1) - Y_i(0)]\)
    • Difference-in-means estimator: Known expectation and sampling variance
    • Hypothesis test. Null of no ATE \(H_0: \tau = 0\)
    • Large sample/asymptotic theory to get the distribution of \(\hat{\tau}\) to calculate p-values
  • Fisher’s framework:
    • Can we get a p-value under (some) null hypothesis without the large-sample assumptions?
    • What does randomization alone justify?
    • Exact p-values under a sharp null of no individual treatment effect \(Y_i(1) = Y_i(0)\) for all \(i\).
    • “Randomization test” approach to p-values under a known randomization scheme.

Hypothesis testing review

  1. Define the null hypothesis.
    • Neyman framework: \(H_0: \tau = \tau_0\)
    • The probability of observing a particular value of the test statistic depends on what is “true” about the underlying parameter.
    • Our thought experiment: If the null were true, how likely would we see what we observe (or more extreme).
  2. Choose a test statistic
    • In classical hypothesis testing, we pick something that has useful statistical properties:

      \[T = (\hat{\tau} - \tau_0)/\sqrt{\widehat{Var({\hat{\tau}})}}\]

Hypothesis testing review

  3. Determine the distribution of the test statistic under the null
    • In classical testing, by CLT, in large samples, \(T \sim \mathcal{N}(0, 1)\)
    • In smaller samples, you may have made further assumptions (e.g. outcome is normally distributed) to get an exact distribution (e.g. \(t-\) distribution)
    • We need a distribution to compute probabilities from the CDF!
  4. What is the probability of observing the test statistic \(T\) that you observe in-sample (or a more extreme value) given the known distribution under the null?
    • That’s a p-value!

What’s different about randomization testing?

  1. Different null hypothesis (a “sharp” null)
  2. No distributional assumptions/asymptotics to derive the distribution of \(T\)
    • We can instead literally just calculate the value under each possible realization of treatment.
    • And we know the distribution of treatment assignments because we control them in an experiment.

Sharp null of no effect.

  • The sharp null hypothesis states:

    \[H_0: \tau_i = Y_i(1) - Y_i(0) = 0 \text{ } \forall i\]

  • Zero individual-level treatment effect for all units

  • Sharp null implies zero ATE

    • But zero ATE does not imply sharp null!
  • Why do we make this assumption?

    • Because now the observed data tells us everything we need to know about the potential outcomes

The sharp null

  • Remember our table of science? For a single realization of \(\mathbf{D}\), we only observe half of the potential outcomes.
| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|---|---|---|---|---|
| \(1\) | \(1\) | \(1\) | ? | \(1\) |
| \(2\) | \(0\) | ? | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | ? | \(1\) |
| \(4\) | \(0\) | ? | \(1\) | \(1\) |
| \(5\) | \(0\) | ? | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | ? | \(1\) |
  • But what does the sharp null imply about the unobserved potential outcomes? Can we fill in those question marks?

The sharp null

  • Yes! Under the sharp null, \(Y_i(1) = Y_i(0)\)
| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|---|---|---|---|---|
| \(1\) | \(1\) | \(1\) | \(1\) | \(1\) |
| \(2\) | \(0\) | \(1\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(1\) | \(1\) |
| \(4\) | \(0\) | \(1\) | \(1\) | \(1\) |
| \(5\) | \(0\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | \(1\) | \(1\) |
  • Why is this useful?
    • We can calculate the value of our test statistic not only under the observed \(\mathbf{D}\) but under all other possible realizations of \(\mathbf{D}\)!

The test statistic

  • Our test statistic is any function of treatment assignments \(\mathbf{D}\) and observed outcomes \(\mathbf{Y}\).

  • Lots of choices with different degrees of power for different kinds of treatment effects.

  • We want to pick a test statistic that will return large values when the null is false and small values when it is true.

  • A reasonable default: (absolute) difference-in-means

    \[t(\mathbf{D}, \mathbf{Y}) = \bigg|\frac{1}{N_t} \sum_{i=1}^N Y_i D_i - \frac{1}{N_c} \sum_{i=1}^N Y_i (1 - D_i) \bigg|\]

  • What sorts of alternatives might this be bad for?

    • Offsetting positive and negative individual effects will yield small values of \(t(\mathbf{D}, \mathbf{Y})\), just as no effects would.
    • We might pick a different test statistic in this case!

The randomization distribution

  • Under the sharp null, we can calculate \(t(\mathbf{D}, \mathbf{Y})\) for every possible realization of \(\mathbf{D}\).

    • Why? Because under the sharp null, observed \(Y_i\) is unaffected by treatment assignment.
  • Then to get a distribution for \(t(\mathbf{D}, \mathbf{Y})\), we just need to know the distribution of \(\mathbf{D}\). We know this by designing the experiment!

  • We get our p-value by comparing the observed test statistic for our particular sample \(t^*\) to the distribution of \(t(\mathbf{D}, \mathbf{Y})\)

    • For complete randomization, each of the \(K\) possible realizations of \(\mathbf{D}\) is equally likely, so we just enumerate all possible assignments \(\mathbf{d} \in \Omega\) and calculate the share with a test statistic at least as large as the one we observed.

      \[Pr(t(\mathbf{D}, \mathbf{Y}) \ge t^*) = \frac{\sum_{\mathbf{d} \in \Omega} \mathbb{I}(t(\mathbf{d}, \mathbf{Y}) \ge t^*)}{K}\]

  • This is our p-value, which we compare to some threshold level \(\alpha\) and reject the null when it’s below that level.
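  • For the \(N = 6\) example, this enumeration can be done exactly; here is a minimal R sketch (my code, not the slides'), using the absolute difference-in-means as the test statistic:

# Observed outcomes and assignment from the N = 6 example (outcomes fixed under the sharp null)
Y <- c(1, 1, 1, 1, 0, 1)
D_obs <- c(1, 0, 1, 0, 0, 1)

abs_diff_means <- function(d, y) abs(mean(y[d == 1]) - mean(y[d == 0]))
t_obs <- abs_diff_means(D_obs, Y)

# Enumerate all choose(6, 3) = 20 assignments with exactly 3 treated units
treated_sets <- combn(6, 3)
t_all <- apply(treated_sets, 2, function(idx) {
  d <- as.numeric(seq_along(Y) %in% idx)
  abs_diff_means(d, Y)
})

# Exact p-value: share of assignments with a statistic at least as extreme as observed
mean(t_all >= t_obs)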

Monte Carlo approximation

  • We could enumerate every possible treatment vector and calculate \(t(\mathbf{d}, \mathbf{Y})\) for each one.
  • Even in fairly small samples this can involve quite a lot of computations! (e.g. \({20 \choose 10 } = 184756\))
  • We’ll typically use a Monte Carlo approximation to the exact p-value.
    • This is also easier for more complicated randomization schemes.
  • Procedure:
    • For \(K\) iterations:
      1. Draw a realization of the treatment vector \(\mathbf{d}_k\) from the known distribution of \(\mathbf{D}\).
      2. Calculate the test statistic \(t_k = t(\mathbf{d}_k, \mathbf{Y})\)
    • Our p-value is the share of these \(K\) test statistics that are greater than or equal to the observed \(t^*\)

Putting it all together

  • To do randomization inference under the sharp null
    1. Choose a test statistic
    2. Calculate the observed test statistic in your sample \(t^* = t(\mathbf{D}, \mathbf{Y})\)
    3. Draw another treatment vector \(\mathbf{d}_1\) from the known distribution of \(\mathbf{D}\)
    4. Calculate \(t_1 = t(\mathbf{d}_1, \mathbf{Y})\)
    5. Repeat 3 and 4 as long as you want to get \(K\) samples from the distribution of the test statistic under the null
    6. Calculate \(p = \frac{1}{K}\sum_{k=1}^K \mathbb{I}(t_k \ge t^*)\) (see the sketch below)
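  • A minimal sketch of this procedure as an R function (my code, not the slides'), assuming complete randomization and the absolute difference-in-means statistic:

# Monte Carlo randomization test under the sharp null of no effect
# Assumes complete randomization: permuting D preserves N_t and N_c
ri_pvalue <- function(Y, D, K = 10000, seed = 813) {
  set.seed(seed)
  t_stat <- function(d, y) abs(mean(y[d == 1]) - mean(y[d == 0]))
  t_obs <- t_stat(D, Y)                           # step 2: observed test statistic
  t_null <- replicate(K, t_stat(sample(D), Y))    # steps 3-5: redraw D, recompute t
  mean(t_null >= t_obs)                           # step 6: p-value
}

# Example: the N = 6 table from earlier
ri_pvalue(Y = c(1, 1, 1, 1, 0, 1), D = c(1, 0, 1, 0, 0, 1), K = 5000)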

Inverting tests to get confidence intervals

  • Randomization tests alone give us p-values but no confidence intervals.
  • One approach: “invert” the test - for what values of a hypothesized “treatment effect” would we fail to reject the null?
    • A \(100(1-\alpha)\%\) confidence interval contains the set of parameter values for which an \(\alpha\)-level hypothesis test would fail to reject the null.
  • Slight complication: We now need to actually define a “treatment effect” parameter:
    • For example, assume a constant additive effect for all units

      \[Y_i(1) - Y_i(0) = \tau_0\]

    • Our confidence set would be all of the values of \(\tau_0\) for which we’d fail to reject the null

    • Calculate via a grid search through possible values of \(\tau_0\), as in the sketch below.
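  • A sketch of that grid search (my code, reusing the ri_pvalue() function above and assuming a constant additive effect): subtract \(\tau_0 D_i\) from the observed outcomes, test the sharp null of no effect on the adjusted outcomes, and keep the values of \(\tau_0\) that are not rejected.

# Invert the randomization test to get a confidence interval
# Assumes Y_i(1) - Y_i(0) = tau_0 for all i and reuses ri_pvalue() from the sketch above
ri_confint <- function(Y, D, grid, alpha = 0.05, K = 5000) {
  keep <- sapply(grid, function(tau_0) {
    Y_adj <- Y - tau_0 * D            # under H_0: tau = tau_0, Y_adj has no treatment effect
    ri_pvalue(Y_adj, D, K = K) > alpha
  })
  range(grid[keep])                   # endpoints of the non-rejected region (if non-empty)
}

# Hypothetical usage (my_Y and my_D are placeholder data vectors):
# ri_confint(Y = my_Y, D = my_D, grid = seq(-1, 1, by = 0.05))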

Conclusion

  • We’ve learned two frameworks for estimating and conducting inference for treatment effects in randomized experiments
  • Neyman
    • Define an estimand
      • the SATE: \(\tau_s = \frac{1}{N}\sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big)\)
      • the PATE: \(\tau_p = \mathbb{E}[Y_i(1) - Y_i(0)]\)
    • Choose an estimator
      • The difference-in-means
    • Estimator has known mean and variance under our design; asymptotically normal.
  • Fisher
    • Forget the estimation process, let’s just do an (exact) test.
    • Sharp null of no effect ( \(Y_i(1) = Y_i(0) \text{ } \forall i\) )
    • No large-sample approximations - inference justified by randomization alone

Next week

  • What can break your experiment?
    • Differential attrition
    • Non-compliance
    • Can we do partial identification/obtain bounds?
  • How do we generalize from experiments?
    • Dimensions of external validity
    • How to think about aggregating knowledge from many studies?