Week 3: Experiments - Part 2

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

February 2, 2026

Last week

  • Why randomized experiments work
    • Guarantee treatment \(D_i\) is independent of potential outcomes \(\{Y_i(1), Y_i(0)\}\)
  • Inference for treatment effects
    • Fisher: Can get exact p-values under the sharp null just knowing the distribution of treatment assignments.
    • Neyman: A conservative variance estimator for the difference-in-means + large-sample asymptotics
    \[\widehat{Var(\hat{\tau})} = \frac{s^2_t}{N_t} + \frac{s_c^2}{N_c}\]

This week

  • One post-experiment use of covariates
    • Balance checking
  • When should we not condition on covariates?
    • When they’re post-treatment!
    • Attrition/Non-compliance
    • Bounds when we have to condition
  • How do we generalize from a single experiment?
    • What are the dimensions of external validity?
    • What do we have to believe to transport a treatment effect?

Balance checking


Balance tests

  • One reason to use covariates even in completely randomized designs is to check whether the experiment actually did what it was supposed to do.
  • Under any randomization scheme that satisfies ignorability:

\[X_i {\perp \! \! \! \perp} D_i\]

\[E[X_i | D_i = 1] = E[X_i | D_i = 0]\]

  • If we correctly randomized treatment, then the expectation (and the distribution) of covariates should be the same in treatment and control.
  • But in any given sample, we’ll observe a difference just by chance – how do we know if this is a problem?
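As a quick sketch (simulated data, not from any study), even a perfectly executed randomization leaves a small, nonzero covariate difference in any finite sample:

```r
set.seed(1)
n <- 1000
x <- rnorm(n)                 # a pre-treatment covariate
d <- sample(rep(0:1, n / 2))  # complete randomization: exactly half treated

# Independence holds in expectation, but the sample difference is never exactly zero
bal_diff <- mean(x[d == 1]) - mean(x[d == 0])
bal_diff
```

Repeating this many times, `bal_diff` averages to zero - the question is how large a single draw has to be before we suspect a broken randomization.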

Against balance tests

  • One view in some experimentally-focused fields is that you should never waste time checking balance
  • Senn (1994) Statistics in Medicine
    1. Randomization guarantees balance in expectation, over all possible randomizations.
    2. Observing a particular imbalance in our sample doesn’t contradict (1).
  • What if we did screw up? What if the randomization software had a bug? What if there was some implementation issue?
    • Maybe a balance test helps here?
  • But how would we interpret a “failed” balance test?
    • A reason to go back and check the randomization process - if we think that it did actually work as intended, we may just have gotten unlucky.
    • Might want a stricter threshold than \(p < .05\) if we really believe treatment was randomized.

Against balance tests

  • Another, more nuanced, argument is that you shouldn’t use balance tests to decide whether or not to include covariates in the analysis.
  • Mutz, Pemantle and Pham (2019)
    • Sometimes, researchers run a lot of univariate balance tests in an experiment.
    • If a test for some covariate fails, they include that covariate in adjustment.
    • This process risks raising false-positive rates through researcher “degrees-of-freedom”
  • This is a correct argument, but it is less an argument against balance testing per se than against ad-hoc or data-dependent covariate choices.
  • When we talk about covariates in experiments next week, we’ll emphasize the importance of ex-ante choices.
    • Don’t decide on what to include in your analysis based on tests conducted after the experiment is run and using information from the experiment outcomes

Example: Broockman and Kalla (2023)

  • Broockman and Kalla (2023) “Consuming cross-cutting media causes learning and moderates attitudes: A field experiment with Fox News viewers”
    • Sample of 763 individuals in 695 households who regularly watch Fox News
    • Treatment: Incentivized to watch CNN instead of Fox News for one month (304 individuals)
    • Control: No incentive (continue normal viewing habits) (459 individuals)
  • Randomization was done at the household level, stratified by baseline characteristics
  • Let’s check balance on pre-treatment covariates!

Broockman and Kalla: Loading the data

library(haven)     # read_dta()
library(dplyr)     # data manipulation (%>%, group_by, summarize)
library(estimatr)  # lm_robust()
library(cobalt)    # balance tables

# Load the Broockman and Kalla data
bk <- read_dta("assets/data/primary_dataset_unstd.dta")

# Treatment variable: treat (1 = CNN incentive, 0 = control)
table(bk$treat)

  0   1 
459 304 
# Key pre-treatment covariates from baseline survey (t1_) and voter file (vf_)
baseline_covs <- c(
  # Demographics from voter file
  "vf_age",                    # Age
  # Baseline survey measures (t1_ = time 1, pre-treatment)
  "t1_pid7",                   # 7-point party ID (1 = Strong Dem to 7 = Strong Rep)
  "t1_ideo_self",              # Self-reported ideology
  "t1_therm_trump",            # Feeling thermometer: Trump
  "t1_therm_biden",            # Feeling thermometer: Biden
  "t1_therm_fox",              # Feeling thermometer: Fox News
  "t1_therm_cnn",              # Feeling thermometer: CNN
  "t1_trust_fox",              # Trust in Fox News
  "t1_trust_cnn",              # Trust in CNN
  # Pre-treatment TV viewership (from set-top box data)
  "pre_treat_fox_minutes",     # Minutes watching Fox pre-treatment
  "pre_treat_cnn_minutes"      # Minutes watching CNN pre-treatment
)

Broockman and Kalla: Aggregate to household level

# Randomization was at the household level, so aggregate covariates
bk_hh <- bk %>%
  group_by(hh_id) %>%
  summarize(
    treat = first(treat),
    # Take the mean of covariates within household
    across(all_of(baseline_covs), ~mean(.x, na.rm = TRUE))
  ) %>%
  ungroup()

# Check: 695 households
nrow(bk_hh)
[1] 695
# Check treatment assignments
table(bk_hh$treat)

  0   1 
417 278 

Broockman and Kalla: Balance table

# Use cobalt to create a balance table (at household level)
bal_tab <- bal.tab(
  x = bk_hh %>% select(all_of(baseline_covs)),   # Covariate data
  treat = bk_hh$treat,           # Treatment indicator
  binary = "std",                # Standardize binary variables
  continuous = "std",            # Standardize continuous variables
  s.d.denom = "pooled"           # Use pooled SD for standardization
)
print(bal_tab)

Broockman and Kalla: Balance table

Balance Measures
                         Type Diff.Un
vf_age                Contin.  0.0352
t1_pid7               Contin.  0.0367
t1_ideo_self          Contin.  0.0130
t1_ideo_self:<NA>      Binary  0.0000
t1_therm_trump        Contin.  0.1688
t1_therm_biden        Contin. -0.0855
t1_therm_fox          Contin. -0.0224
t1_therm_cnn          Contin. -0.0086
t1_trust_fox          Contin.  0.0421
t1_trust_cnn          Contin.  0.0755
pre_treat_fox_minutes Contin.  0.0480
pre_treat_cnn_minutes Contin. -0.1350

Sample sizes
    Control Treated
All     417     278

Broockman and Kalla: Love plot

Multiple testing

  • We could put a simple difference-in-means hypothesis test on each of these covariates.
    • But with enough covariates, we’d find a \(p < .05\) just by chance!
lm_robust(t1_therm_trump ~ treat, data=bk_hh)
             Estimate Std. Error   t value   Pr(>|t|)   CI Lower  CI Upper  DF
(Intercept) 81.758593   1.040406 78.583337 0.00000000 79.7158668 83.801319 693
treat        3.360112   1.519283  2.211643 0.02731688  0.3771622  6.343062 693
  • Is there a quick way to deal with multiple testing in balance checks?
    • Could correct for multiple comparisons…
    • …or construct a single test.
    • How? With randomization inference!
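To see why a single joint test is attractive, here is a minimal sketch (simulated data, not the Broockman and Kalla sample) of the family-wise false-positive rate when we run 20 separate univariate balance tests at \(p < .05\):

```r
set.seed(53706)
n <- 500; k <- 20; reps <- 500

# For each replication: randomize treatment, draw k covariates that are
# truly independent of treatment, and check whether ANY univariate t-test "fails"
any_reject <- replicate(reps, {
  d <- rbinom(n, 1, 0.5)
  X <- matrix(rnorm(n * k), n, k)
  pvals <- apply(X, 2, function(x) t.test(x[d == 1], x[d == 0])$p.value)
  any(pvals < 0.05)
})

fwer <- mean(any_reject)
fwer  # roughly 1 - .95^20, i.e. around 0.64
```

So with 20 covariates, a "failed" univariate balance test is the norm, not the exception, even when randomization worked perfectly.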

Permutation/Randomization Testing for Balance

  • Step 1: Run a regression predicting treatment using all the covariates. Store the F-statistic (or any statistic summarizing “goodness of fit”)
treat_reg <- lm_robust(treat ~ vf_age + t1_pid7 + t1_ideo_self + t1_therm_trump + t1_therm_biden +
                         t1_therm_fox + t1_therm_cnn + t1_trust_fox + t1_trust_cnn + pre_treat_fox_minutes +
                         pre_treat_cnn_minutes, data=bk_hh)
tstat_obs <- treat_reg$fstatistic[1]

Permutation/Randomization Testing for Balance

  • Step 2: Permute the treatment assignment based on the known assignment scheme
  • Step 3: Calculate the test statistic under alternative assignments
set.seed(53706)
iterations <- 10000
tstat_null <- rep(NA, iterations)
for (i in 1:iterations){
  bk_hh$treat_perm <- sample(bk_hh$treat)
  treat_reg_null <- lm_robust(treat_perm ~ vf_age + t1_pid7 + t1_ideo_self + t1_therm_trump + t1_therm_biden +
                         t1_therm_fox + t1_therm_cnn + t1_trust_fox + t1_trust_cnn + pre_treat_fox_minutes +
                         pre_treat_cnn_minutes, data=bk_hh)
  tstat_null[i] <- treat_reg_null$fstatistic[1]
}

Permutation/Randomization Testing for Balance

  • Step 4: Compare the observed statistic to the permuted null distribution
mean(tstat_null > tstat_obs)
[1] 0.0808

Guidelines for balance testing

  • Testing for balance to assess whether randomization occurred as intended: Good
    • Careful with multiple testing/false positives
    • \(p < .05\) is probably too lenient a threshold for concern, but you should probably be worried if \(p < 1 \times 10^{-6}\)
    • What to do if a balance check fails? Check your experiment!
  • Testing for balance to pick which covariates to adjust for: Bad
    • “Garden of forking paths”
    • You should choose covariates ex-ante (even if not blocking)
    • Pick covariates that predict \(Y\) - balance checks are the wrong criteria.

Post-treatment bias

Post-treatment covariates

  • When talking about covariates, we’ve emphasized that \(X_i\) must be pre-treatment
  • What happens when we condition on some post-treatment variable (call it \(M_i\))?
  • Intuition: \(M_i\) is post-treatment. It has potential outcomes: \(\{M_i(1), M_i(0)\}\)
    • But we can’t condition on the latent potential outcomes, we only condition on the observed \(M_i\)
    • This induces a form of endogenous “selection bias”
  • Many cases in political science
    • Experiments with non-compliance
    • Attention checks in survey experiments.
    • Administrative data (police interactions are only recorded if a stop occurs)
    • Attrition induced by treatment (e.g. court proceedings that settle)

Post-treatment bias

  • Let \(M_i\) denote the post-treatment covariate. Since it’s post-treatment, it has potential outcomes \(\{M_i(1), M_i(0)\}\) as though it were any other outcome.
  • By randomization

\[\{M_i(1), M_i(0)\} {\perp \! \! \! \perp} D_i\]

  • What happens if we take the difference-in-means conditional on \(M_i = 1\)?

\[E[Y_i | D_i = 1, M_i = 1] - E[Y_i | D_i = 0, M_i = 1]\]

  • By consistency:

\[E[Y_i(1) | D_i = 1, M_i(1) = 1] - E[Y_i(0) | D_i = 0, M_i(0) = 1]\]

Post-treatment bias

  • Ignorability gets us

\[E[Y_i(1) | M_i(1) = 1] - E[Y_i(0) | M_i(0) = 1]\]

  • Is this the ATE?
    • No! \(M_i(1) = 1\) and \(M_i(0) = 1\) define two different subsets of the sample
  • Under what assumptions would we get the ATE?
  • Either:
    1. No individual effect of treatment on \(M_i\): \(M_i(1) = M_i(0) \text{ } \forall i\)
    2. \(\{M_i(1), M_i(0)\} {\perp \! \! \! \perp} \{Y_i(1), Y_i(0)\}\)
  • Neither of these assumptions is guaranteed by an experiment since we don’t randomize \(M_i\)
  • Therefore, conditioning on a post-treatment quantity breaks the experiment – now it’s an observational study.

Principal strata

  • We can think of the combination of \(D_i\) and \(M_i\) as defining a “sub-group” of units - these are referred to as “principal strata”
    • They have different names in the literature - one common convention comes from the compliance literature
| Stratum         | \(M_i(1)\) | \(M_i(0)\) |
|-----------------|------------|------------|
| “Always-takers” | \(1\)      | \(1\)      |
| “Never-takers”  | \(0\)      | \(0\)      |
| “Compliers”     | \(1\)      | \(0\)      |
| “Defiers”       | \(0\)      | \(1\)      |
  • Units with \(M_i = 1\) could belong to any of three of these strata. Even observing \(D_i\) only narrows it down to two - we can’t observe the strata directly.
  • Strata aren’t necessarily independent of potential outcomes \(Y_i(d)\)!
    • (e.g.) Units that would never respond to a door-to-door canvasser likely have lower propensity to vote.

Example: Administrative data

  • Knox, Lowe and Mummolo (2020) consider the problem of estimating the effect of civilian race on police use of force.
    • Typically, past studies would use administrative data from police departments on stops
    • Compare police use of force among Black civilians who are stopped and white civilians who are stopped.
    • Problem: Stops are post-treatment!
  • Define \(D_i\) as the treatment (race of civilian), \(M_i\) is an indicator for whether a stop occurs, \(Y_i\) is severe use of force
  • The difference-in-means does not identify the treatment effect unless…
    • \(D_i\) has no effect on \(M_i\) (race of civilian doesn’t affect whether an officer makes a stop)
    • \(M_i(1), M_i(0)\) is independent of \(Y_i(1), Y_i(0)\) (civilian propensity to be stopped (net of race) is uncorrelated with propensity to use force)
  • Given substantive knowledge of this setting, both assumptions seem implausible.

The Survivor Average Causal Effect

  • In the attrition setting, we focus on the Survivor Average Causal Effect (SACE)
    • The ATE among those who would be observed irrespective of treatment status
    • In the Knox, Lowe and Mummolo (2020) setting: the ATE of civilian race on use of force among civilians who would be stopped irrespective of their race.
    \[\tau_{\text{SACE}} = E[Y_i(1) - Y_i(0) | M_i(1) = 1, M_i(0) = 1]\]
  • Recall that the difference-in-means only gets us

\[E[Y_i(1) | M_i(1) = 1] - E[Y_i(0) | M_i(0) = 1]\]

  • \(M_i(1) = 1\) is not the same subset of units as \(M_i(0) = 1\)
    • \(M_i(1) = 1\) includes both survivors and the “compliers” (or “helped by treatment”)
    • \(M_i(0) = 1\) includes both survivors and the “defiers” (or “hurt by treatment”)

Monotonicity

  • One approach to bounding from Lee (2009) relies on an additional monotonicity assumption

\[M_i(1) \ge M_i(0) \text{ } \forall i \quad \text{or} \quad M_i(1) \le M_i(0) \text{ } \forall i\]

  • Treatment affects selection in only one direction
    • Either treatment can only increase the probability of being observed (no “defiers”)
    • Or treatment can only decrease the probability of being observed (no “compliers”)
  • In the Knox, Lowe and Mummolo (2020) setting
    • Monotonicity: non-minorities who are stopped would also have been stopped had they been a minority.

Intuition for Lee bounds

  • Suppose treatment increases observation probability: \(M_i(1) \ge M_i(0)\)
    • More units observed in the treated group than control group
  • Monotonicity lets us pin down more of the principal strata.
  • Among observed treated units:
    • Some are “survivors” (would be observed even under control)
    • Some are “compliers” (observed only because they got treatment)
  • Among observed control units:
    • All are “survivors” (they’re observed despite being in control)

Intuition for Lee bounds

  • If we could figure out who the compliers are and remove them - we could get point identification.
    • We can’t - but we can identify the share of compliers.
    • And we can make worst-case assumptions about which units those are.
  • Additionally, the share of survivors is balanced between treated and control

    \[Pr(M_i(1) = M_i(0) = 1 | D_i = 0) = Pr(M_i(1) = M_i(0) = 1 | D_i = 1)\]

  • All observed units in the control group are survivors

    \[Pr(M_i = 1 | D_i = 0) = Pr(M_i(1) = M_i(0) = 1 | D_i = 0)\]

  • And observed units in the treated group are survivors + compliers

    \[Pr(M_i = 1 | D_i = 1) = Pr(M_i(1) = M_i(0) = 1 | D_i = 1) + Pr(M_i(1) = 1, M_i(0) = 0 | D_i = 1)\]

The trimming procedure

  • Let \(p_1 = Pr(M_i = 1 | D_i = 1)\) and \(p_0 = Pr(M_i = 1 | D_i = 0)\)
    • Under monotonicity with \(M_i(1) \ge M_i(0)\): \(p_1 \ge p_0\)
  • We can use the proportion of survivors identified in the control group to identify the share of “compliers” in the treated group by differencing
    • Treated group is survivors + compliers - we subtract the survivors using what we observe in control
  • So the fraction of treated-and-observed who are “compliers” is:

\[q = \frac{p_1 - p_0}{p_1}\]

  • The premise of bounds and partial identification is that we want to characterize the set of treatment effects consistent with the observed data.
    • When we have point identification, only one ATE is consistent with the observed data.
    • But sometimes a range of effects is consistent with what we see because the data only partially pins down the potential outcomes.
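As a quick check with hypothetical numbers: if \(80\%\) of treated units and \(60\%\) of control units are observed, then

\[q = \frac{p_1 - p_0}{p_1} = \frac{0.8 - 0.6}{0.8} = 0.25\]

so the bounds come from trimming the top (or bottom) \(25\%\) of the observed treated outcomes.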

The trimming procedure

  • Under the monotonicity assumption, we can apply a trimming procedure to obtain a “worst case” and a “best case” for the ATE
    • The identification problem comes from the fact that the treated group has some share of compliers - we identify that quantity \(q\)
    • Trimming the top \(q\) observations (in terms of the outcome) gives a lower bound on \(\tau_{\text{SACE}}\)
    • Trimming the bottom \(q\) observations (in terms of the outcome) gives an upper bound on \(\tau_{\text{SACE}}\)
  • Lower bound: Trim the top \(q\) proportion of treated outcomes

\[\tau^{LB} = \mathbb{E}[Y_i | D_i = 1, M_i = 1, Y_i \le y^{1-q}_1] - \mathbb{E}[Y_i | D_i = 0, M_i = 1]\]

  • Upper bound: Trim the bottom \(q\) proportion of treated outcomes

\[\tau^{UB} = E[Y_i | D_i = 1, M_i = 1, Y_i \ge y^{q}_1] - E[Y_i | D_i = 0, M_i = 1]\]

  • Where \(y^{q}_1\) is the \(q\)-th quantile of the observed treated outcome distribution

Visualizing the Lee Bounds

  • Let’s run a quick simulation to show how the bounds work.

  • Consider a case with \(0\) treatment effect, standard normal outcome, but selection associated with both treatment and outcome

    • \(Y > 0\) - \(70\%\) chance of being a survivor, \(30\%\) chance of being a “complier”/“helped by treatment”
    • \(Y < 0\) - \(30\%\) chance of being a survivor, \(70\%\) chance of being a “complier”/“helped by treatment”
  • Marginally, half of all observations are compliers and half are survivors.

  • Generate \(5000\) observations and look at their outcome distributions.

set.seed(53706)
N <- 5000
lee_df <- data.frame(Y = rnorm(n = N), D = rbinom(N, 1, .5)) %>%
  mutate(treatment = case_when(D == 1 ~ "Treated",
                               D == 0 ~ "Control"),
         survivor_prob = case_when(Y > 0 ~ .7,
                                   Y < 0 ~ .3),
         survivor = rbinom(N, 1, survivor_prob),
         # Treated units are always observed; control units only if "survivors"
         M = as.numeric(D == 0)*survivor + as.numeric(D == 1))

Visualizing the Lee Bounds

  • What do the observed outcome distributions look like?
  • It looks like there’s a negative effect?
    • There isn’t - this is selection!
    • We’re systematically filtering out the low-\(Y\) control outcomes, so the observed control mean is inflated relative to the treated mean.

Visualizing the Lee Bounds

  • What’s our estimated share of compliers in the treatment group?
q_complier <- (mean(lee_df$M[lee_df$D == 1]) - mean(lee_df$M[lee_df$D == 0]))/
  (mean(lee_df$M[lee_df$D == 1]))
q_complier
[1] 0.4968127
  • Consistent with our simulation, about half of the units in the treated group are compliers
  • Our bounds are going to be constructed by dropping either…
    • …the top half of the distribution (lower bound)…
    • …or the bottom half of the distribution (upper bound)
  • What quantiles are we cutting at?
# Outcome quantiles at the two trimming points (all treated units are observed)
quantile(lee_df$Y[lee_df$D == 1], c(q_complier, 1 - q_complier))
 49.68127%  50.31873% 
0.02370004 0.03848793 

Visualizing the Lee Bounds

  • What happens when we trim the bottom?
  • The upper bound is positive

Visualizing the Lee Bounds

  • What happens when we trim the top?
  • The lower bound is negative - the bounds contain the true effect of zero
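The full trimming procedure can be collected into a small function - a minimal sketch in base R (my implementation, assuming the monotonicity direction \(M_i(1) \ge M_i(0)\)), applied to data simulated from the same process as above:

```r
lee_bounds <- function(y, d, m) {
  # Shares observed in each arm; under monotonicity p1 >= p0
  p1 <- mean(m[d == 1])
  p0 <- mean(m[d == 0])
  q <- (p1 - p0) / p1                  # share of "compliers" among treated-and-observed
  y1 <- y[d == 1 & m == 1]             # observed treated outcomes
  y0_mean <- mean(y[d == 0 & m == 1])  # observed control outcomes (all survivors)
  lower <- mean(y1[y1 <= quantile(y1, 1 - q)]) - y0_mean  # trim the top q
  upper <- mean(y1[y1 >= quantile(y1, q)]) - y0_mean      # trim the bottom q
  c(lower = lower, upper = upper)
}

# Same DGP as the simulation above: true effect is zero, selection depends on Y
set.seed(53706)
N <- 5000
Y <- rnorm(N)
D <- rbinom(N, 1, .5)
survivor <- rbinom(N, 1, ifelse(Y > 0, .7, .3))
M <- ifelse(D == 1, 1, survivor)  # treated units always observed

bounds <- lee_bounds(Y, D, M)
bounds  # lower bound negative, upper bound positive: zero is inside
```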

Properties of Lee bounds

  • Sharp bounds: Cannot be tightened without additional assumptions
    • Any value within the bounds is consistent with the data and assumptions
  • Bounds collapse to a point when:
    • Treatment has no effect on selection (\(p_1 = p_0\), so \(q = 0\))
    • In this case, SACE = ATE (no selection problem!)
    • But the monotonicity assumption is key here!
  • Can construct confidence intervals for bounds using bootstrap or analytical standard errors
    • Note that the bounds reflect fundamental uncertainty in mapping from observed outcomes to potential outcomes
    • Further uncertainty is driven by the usual sources (random sampling/treatment assignment)

Effect Heterogeneity and External Validity

Heterogeneous treatment effects (HTE)

  • When targeting the average treatment effect, we (try to be) entirely agnostic about the variation in \(\tau_i\) between units in the sample

    \[\tau_{\text{ATE}} = \mathbb{E}[Y_i(1) - Y_i(0)]\]

  • But often we have predictions about average effects for different sub-groups in the sample.
    • Effects of treatment are rarely homogeneous: Republicans respond to cues from Trump differently than Democrats!
    • Can we target a different quantity of interest?
  • The Conditional Average Treatment Effect (CATE)

    \[\tau(x) = \underbrace{E[Y_i(1) | X_i = x]}_{\text{Mean P.O. under treatment among units with } X_i = x} - \underbrace{E[Y_i(0) | X_i = x]}_{\text{Mean P.O. under control among units with } X_i = x}\]

Estimating the CATE

  • In a completely randomized experiment, it’s straightforward to estimate the CATE
    • Just subset down to units with \(X_i = x\) and take the difference-in-means
    • Conventional inference using the Neyman variance (though be careful assuming asymptotic normality when sub-groups are small!)

\[\hat{\tau}(x) = \frac{1}{N_{t,x}}\sum_{i: X_i = x} Y_i D_i - \frac{1}{N_{c,x}}\sum_{i: X_i = x} Y_i (1 - D_i)\]

where \(N_{t, x}\) is the number of units with \(D_i = 1, X_i = x\) and \(N_{c, x}\) is the number of units with \(D_i = 0, X_i = x\)

  • Be careful in interpretation - CATEs assign no causal interpretation to \(X_i\)
    • (e.g.) A difference in treatment effects between Democrats and Republicans tells us nothing about the question “what if person \(i\) were a Democrat rather than a Republican?”
    • Many \(X_i\) are non-manipulable

Illustration: Gerber, Green and Larimer (2008)

  • Let’s return to the Gerber, Green and Larimer (2008) example
    • Does social pressure get people to vote more?
# Load the data
data <- read_dta('assets/data/ggr_2008_individual.dta')

# Aggregate to the household level
data_hh <- data %>% group_by(hh_id) %>% summarize(treatment = treatment[1], voted = mean(voted),
                                                   voted_p2004 = mean(p2004))

Illustration: Gerber, Green and Larimer (2008)

  • We might ask whether social pressure works more on the less habitual voters
    • Were voters who turned out in the 2004 primary affected differently by the Neighbors treatment?
# Estimated ATE of Neighbors (3) vs. Control (0)
# At least one member of the household voted in Primary 2004
ate_voters <- mean(data_hh$voted[data_hh$treatment == 3&data_hh$voted_p2004 > 0]) -
  mean(data_hh$voted[data_hh$treatment == 0&data_hh$voted_p2004 > 0])
ate_voters
[1] 0.09209231
# No member of the household voted in Primary 2004
ate_nonvoters <- mean(data_hh$voted[data_hh$treatment == 3&data_hh$voted_p2004 == 0]) -
  mean(data_hh$voted[data_hh$treatment == 0&data_hh$voted_p2004 == 0])
ate_nonvoters
[1] 0.07612996

Illustration: Gerber, Green and Larimer (2008)

  • Let’s compute the Neyman variance
# Estimate the sampling variance
# At least one member of the household voted in Primary 2004
var_ate_voters = var(data_hh$voted[data_hh$treatment == 3&data_hh$voted_p2004 > 0])/sum(data_hh$treatment == 3&data_hh$voted_p2004 > 0) +
  var(data_hh$voted[data_hh$treatment == 0&data_hh$voted_p2004 > 0])/sum(data_hh$treatment == 0&data_hh$voted_p2004 > 0)

# No member of the household voted in Primary 2004
var_ate_nonvoters = var(data_hh$voted[data_hh$treatment == 3&data_hh$voted_p2004 == 0])/sum(data_hh$treatment == 3&data_hh$voted_p2004 == 0) +
  var(data_hh$voted[data_hh$treatment == 0&data_hh$voted_p2004 == 0])/sum(data_hh$treatment == 0&data_hh$voted_p2004 == 0)

Illustration: Gerber, Green and Larimer (2008)

  • 95% asymptotic confidence intervals
# Confidence intervals (assuming asymptotic normality)
ate_95CI_voters = c(ate_voters - qnorm(.975)*sqrt(var_ate_voters),
  ate_voters + qnorm(.975)*sqrt(var_ate_voters))
# At least one member of the household voted in Primary 2004
ate_95CI_voters
[1] 0.08272279 0.10146183
ate_95CI_nonvoters = c(ate_nonvoters - qnorm(.975)*sqrt(var_ate_nonvoters),
  ate_nonvoters + qnorm(.975)*sqrt(var_ate_nonvoters))
# No member of the household voted in Primary 2004
ate_95CI_nonvoters
[1] 0.06678196 0.08547795

Illustration: Gerber, Green and Larimer (2008)

  • It looks like more regular voters were actually more affected by treatment than the less regular voters.
    • But is this just due to chance?
  • Suppose we’re interested in the difference in the population CATEs
    • We need to construct a hypothesis test explicitly for this difference!
    • The difference between significant and not significant is not necessarily significant!
  • Remember, the variance of the difference between our CATE estimators is larger than the variance of either one
    • The individual CATE estimates are less noisy than the difference between them.
  • For inference on the population difference in CATEs, remember that the variances are additive

    \[\widehat{Var}(\hat{\tau}_{\text{voter}} - \hat{\tau}_{\text{non-voter}}) = \frac{s_{t, \text{voter}}^2}{N_{t, \text{voter}}} + \frac{s_{c, \text{voter}}^2}{N_{c, \text{voter}}} + \frac{s_{t, \text{non-voter}}^2}{N_{t, \text{non-voter}}} + \frac{s_{c, \text{non-voter}}^2}{N_{c, \text{non-voter}}}\]

Illustration: Gerber, Green and Larimer (2008)

# Confidence interval for the **difference** in CATEs
diff_CATE_95CI = c(ate_voters - ate_nonvoters - qnorm(.975)*sqrt(var_ate_voters + var_ate_nonvoters),
  ate_voters - ate_nonvoters + qnorm(.975)*sqrt(var_ate_voters + var_ate_nonvoters))

ate_voters - ate_nonvoters
[1] 0.01596235
diff_CATE_95CI
[1] 0.002727064 0.029197645

Illustration: Gerber, Green and Larimer (2008)

  • Again, with a binary treatment/binary covariate, you can do this with OLS and an interaction term
lm_robust(voted ~ I(treatment==3)*I(voted_p2004 > 0), data=data_hh %>% filter(treatment == 3|treatment == 0))
                                               Estimate  Std. Error    t value
(Intercept)                                  0.25896339 0.001821941 142.135971
I(treatment == 3)TRUE                        0.07612996 0.004769472  15.961924
I(voted_p2004 > 0)TRUE                       0.08847971 0.002625902  33.694977
I(treatment == 3)TRUE:I(voted_p2004 > 0)TRUE 0.01596235 0.006752823   2.363805
                                                  Pr(>|t|)    CI Lower
(Intercept)                                   0.000000e+00 0.255392418
I(treatment == 3)TRUE                         2.696797e-57 0.066781869
I(voted_p2004 > 0)TRUE                       9.922548e-248 0.083332984
I(treatment == 3)TRUE:I(voted_p2004 > 0)TRUE  1.808993e-02 0.002726931
                                               CI Upper     DF
(Intercept)                                  0.26253437 119995
I(treatment == 3)TRUE                        0.08547805 119995
I(voted_p2004 > 0)TRUE                       0.09362643 119995
I(treatment == 3)TRUE:I(voted_p2004 > 0)TRUE 0.02919778 119995

Illustration: Gerber, Green and Larimer (2008)

  • Voters who voted in the 2004 primary were more affected by the social pressure mailer (we’ll reveal your voting history to your neighbors) than voters who did not.
  • Is this because…
    • …these voters are the types of people to be more susceptible to social pressure (they already do pro-social things like voting)?
    • …these voters received a different treatment content (the treatment told them they were voters)?
    • …other explanations?
  • Does this tell us anything about the effect of voting in the 2004 primary?
  • If we re-ran this experiment in a place that exhibits generally lower turnout, do we expect effects…
    • …to be larger?
    • …to be smaller?
    • …to be the same?

External validity

External validity

  • Two types of validity in experiments (Shadish, Cook and Campbell, 2002)
    • Internal validity - Does the study identify a causal parameter?
    • External validity - Does the study identify the causal parameter that we care about?
  • Experiments guarantee the first, but only theory can give us the latter.
  • Often also characterized in terms of generalization and transportability

    • Generalization: Does the sample average treatment effect generalize to the ATE in the population from which the sample was drawn?
    • Transportability: Does an effect from one population “transport” to another, different, population?
    • Personally, I don’t think this distinction is that important, but it’s sometimes made.

External validity

Findley, Kikuta and Denly (2021), “External Validity,” Annual Review of Political Science

Typology of External Validity

  • Egami and Hartman (2022) - My preferred take
    • Starting point: The experiment identifies the Sample Average Treatment Effect \(\tau_{\text{SATE}}\)
    • Goal: What do we need to assume to generalize to the Target-Population Average Treatment Effect: \(\tau_{\text{T-PATE}}\)
    • Combines generalization (sample to population) and transportability (source to target)
  • What do we need to assume in order to generalize? Four sources of variability between SATE and T-PATE
    • X-validity - Differences in characteristics of units
    • D-validity - Differences in characteristics of treatments
    • Y-validity - Differences in characteristics of outcomes
    • C-validity - Differences in the contexts

X-validity

  • Often samples and target populations differ in the types of units they contain
    • Convenience samples, demographic variation between study site and target site, etc…
  • What assumption do we need to make in order to address this threat to inference?
    • Treatment effects do not vary systematically with the characteristics \(X_i\) that differ between sample and target.
  • Two ways this could be satisfied
    1. Genuine random sampling assumptions (sample vs. population don’t systematically differ)
    2. Effect homogeneity assumption - the covariates that drive sample vs. target differences are not correlated with treatment effects
  • If we knew the relevant \(X_i\), we could also re-weight the sample to match the target distribution.
    • We’ll talk about IPW for causal inference later on - this is the same idea!
  • Significant cross-discipline and cross-sub-discipline variation in belief in effect homogeneity
    • Lab experimentalists tend towards belief in homogeneity (e.g. political psychology)
    • Field experimentalists tend towards belief in heterogeneity (e.g. “Metaketa” project)
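The re-weighting idea can be sketched with hypothetical numbers (all values illustrative, not from any study): suppose effects differ by a binary covariate and a convenience sample over-represents the low-effect group relative to the target population:

```r
# Hypothetical CATEs for X = 0 and X = 1 (illustrative numbers)
cate <- c(x0 = 0.02, x1 = 0.10)

# Covariate shares in the sample vs. the target population
sample_share <- c(x0 = 0.8, x1 = 0.2)  # convenience sample: mostly X = 0
target_share <- c(x0 = 0.4, x1 = 0.6)  # target population: mostly X = 1

sate  <- sum(cate * sample_share)  # what the experiment estimates: 0.036
tpate <- sum(cate * target_share)  # re-weighted toward the target: 0.068
c(SATE = sate, T_PATE = tpate)
```

This only works if the CATEs themselves transport, i.e. if effect heterogeneity is fully captured by the \(X_i\) we re-weight on.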

D-validity

  • Often the way we run our study and implement our treatment is not how it will be done in the “real world”
    • Realism vs. abstraction in our intervention
    • “Hawthorne effects”/observation
    • “Real world” effects are sometimes bundles of interventions
  • What assumption do we need to make in order to address this threat to inference?
    • Effect homogeneity across variations of treatment
    • Note that “random sampling” doesn’t really solve anything here since it’s a characteristic of the intervention being studied!
  • Examples: Brutger, R., Kertzer, J.D., Renshon, J., Tingley, D. and Weiss, C.M., 2023. Abstraction and detail in experimental design. American Journal of Political Science, 67(4), pp.979-995.
    • Debates in IR vignette experiments about how much abstraction to use (e.g. real event or hypothetical; China or “a country”)
    • Findings: No variation in effects across degrees of “hypotheticality,” but some heterogeneity based on additional contextual detail and actor identity

Y-validity

  • Sometimes we can’t actually measure the outcome we’re interested in evaluating
    • e.g. In medicine - sometimes need to use a surrogate endpoint \(Y^*\) when outcome of interest \(Y\) is costly to obtain.
    • In surveys - We observe a stated preference but are actually interested in revealed preferences and behavior
  • What assumption do we need to make in order to address this threat to inference?
    • It’s effect homogeneity again! Type of outcome doesn’t modify treatment effect
  • Here we can sometimes use evidence about the predictive power of the outcome we measure \(Y^*\) on the outcome of interest \(Y\)
    • “Surrogate endpoint” literature in medicine
    • But beware the surrogacy paradox!
    • Positive effect of \(D\) on \(Y^*\) and a positive association between \(Y^*\) and \(Y\) does not guarantee a positive effect of \(D\) on \(Y\)
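A stylized simulation of the surrogacy paradox (my construction, purely illustrative): a common factor \(U\) makes the surrogate and the outcome positively correlated, even though treatment moves them in opposite directions:

```r
set.seed(53706)
n <- 10000
u <- rnorm(n, sd = 2)    # shared determinant of surrogate and outcome
d <- rbinom(n, 1, .5)    # randomized treatment
ystar <- u + d           # D raises the surrogate Y* by 1...
y     <- u - d           # ...but lowers the true outcome Y by 1

effect_on_surrogate   <- mean(ystar[d == 1]) - mean(ystar[d == 0])  # positive
surrogate_outcome_cor <- cor(ystar, y)                              # positive
effect_on_outcome     <- mean(y[d == 1]) - mean(y[d == 0])          # negative
```

All three conditions of the paradox hold at once: \(D\) raises \(Y^*\), \(Y^*\) predicts \(Y\), yet \(D\) lowers \(Y\).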

C-validity

  • Treatment effects may depend on the context in which the study was conducted
    • Different cities
    • Different countries
    • Different times (Munger, 2023)
  • What assumption do we need to make in order to address this threat to inference?
    • Once again, it’s effect homogeneity by context!
    • C-validity is basically just X-validity with no overlap
    • So even if we knew the covariate distributions in both, we can’t reweight the sample to match the target population.
  • How do we figure out whether context matters?
    • Run many studies in many settings and compare the results!

Summary

  • Questions of external validity are questions of effect heterogeneity
    • Is the average effect of treatment the same for these units as it is for those units?
  • The degree of effect heterogeneity is a question of scientific knowledge
    • Highly discipline dependent! Some fields think we can generalize more than others.
  • Think about the dimensions across which sample and target population can vary
    • Units
    • Outcomes
    • Treatments
    • Settings
    • Mechanisms

Next week

  • Using covariates to improve precision in experiments
    • Why blocking/stratification can’t hurt you!
    • Covariate adjustment using linear regression
  • Agnostic approach to linear regression
    • Linear regression as the best linear predictor
    • What assumptions do we really need to justify OLS as an estimator
    • Why (properly specified) OLS regression is fine in a randomized experiment even if you get the model wrong!
  • Analysis of cluster-randomized experiments (time permitting!)