Week 9: Differences-in-Differences

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

March 16, 2026

\[ \require{cancel} \]

Last week

  • Identification under unobserved confounding
  • Instrumental variables - we can identify the local average treatment effect (LATE) with an instrument that…
    • …is ignorable/conditionally ignorable…
    • …and affects the outcome only through the treatment (exclusion restriction)…
    • …and has a monotonic effect on treatment.
  • Our simple IV estimator:
    • Reduced form (instrument’s effect on outcome) in the numerator
    • First stage (instrument’s effect on the treatment) in the denominator

This week

  • More strategies for identification under unobserved confounding
  • When we have repeated observations over time, can we use pre-treatment outcomes to help with inference?
    • Time 1: Some units treated, some units under control
    • Time 0: All units under control
  • What if the confounding in time 1 were unobserved…
    • … but the amount of confounding in time 1 is the same as in time 0?
  • Then we can use the pre-treatment (time 0) difference in the treated and control arms to de-bias the time 1 difference.
    • Difference-in-differences
  • Assumptions: no anticipation and parallel trends
    • Pre-treatment outcomes are unaffected by treatment exposure.
    • The trend in the average potential outcome in the treated group would have been the same as the trend in the control group absent treatment.
  • Generalizes to any setting where we believe there is confounding but can observe an outcome on which the true effect of treatment is known to be 0
    • “Negative Outcome Control”

Difference-in-differences

John Snow and Cholera

  • 1854: Large cholera outbreak near Broad Street in London.

    • Physician John Snow hypothesized that cholera was transmitted through the water
    • Contrary to popular belief that it was airborne (“miasma theory”)
  • Snow convinced the local authorities to remove the handle of the Broad Street pump

    • Cholera deaths declined
    • But was this causal?
  • Even Snow didn’t necessarily think so…

    There is no doubt that the mortality was much diminished, as I said before, by the flight of the population, which commenced soon after the outbreak; but the attacks had so far diminished before the use of the water was stopped, that it is impossible to decide whether the well still contained the cholera poison in an active state, or whether, from some cause, the water had become free from it. (Snow, “On the Mode of Communication of Cholera,” 1855)

John Snow and Cholera

  • The more interesting John Snow story was not the Broad Street pump, but another 1856 paper titled “Cholera and the water supply in the south districts of London in 1854”
  • Key insight: South London was served by two major water companies: Lambeth Company and Southwark and Vauxhall Company.
    • Lambeth switched to a less contaminated source between 1849 and 1853

    Between the epidemics of 1849 and that of 1853, one of the water companies supplying the south districts of London changed its source of supply from the middle of the town, near the foot of the Hungerford Suspension Bridge, to Thames Ditton, at a part of the river which is beyond the influence of the tide, and, therefore, out of reach of the sewage of the metropolis. (Snow, 1856)

John Snow and Cholera

  • Snow compared Lambeth (treated) districts with Southwark and Vauxhall (control) districts – less mortality in Lambeth.

    …Taking into account the population supplied respectively by each company, the mortality was, at this period of the epidemic, nearly eight times as great in that supplied by the Southwark and Vauxhall Company as in that supplied by the Lambeth Company. (Snow, 1856)

  • But this isn’t enough - what if Lambeth districts differed in unobserved ways from Southwark and Vauxhall districts?

  • So Snow also compared the observed mortality in 1853 to mortality in 1849, when both sets of districts used contaminated water.

    In the autumn of 1853 it was shown by Dr. Farr* that the districts partly supplied by this, the Lambeth Water Company, with improved water, suffered less than the districts supplied entirely by the Southwark and Vauxhall Company with the water from the river at Battersea Fields, although in 1849 they had suffered rather more than the latter districts (Snow, 1856).

John Snow and Cholera

  • This was one of the earliest “difference-in-differences” designs (see also Semmelweis’s work on antisepsis in 1861).
    • Not just a before-after comparison
    • Not just a cross-sectional comparison
  • Implicit assumption: If there were something different about Lambeth (aside from the treatment) it would have the same effect on the outcome in the pre-treatment (1849) period as it would in 1853.
    • An assumption on the counterfactuals: Had treatment not changed in Lambeth, the average trend (from 1849 to 1853) in Lambeth would have been the same as the trend in Southwark and Vauxhall

DiD with two periods

  • Two groups (treated/control); two time periods (0, 1).

    • \(D_i = 1\): treated in time \(1\); \(D_i = 0\): control in time \(1\)
    • All units under control in time \(0\)
  • Two outcomes observed

    • \(Y_{i1}\): outcome in period \(1\)
    • \(Y_{i0}\): outcome in period \(0\)
  • Potential outcomes + consistency assumption.

    \[Y_{i1}(d) = Y_{i1} \text{ if } D_i = d\]

    \[Y_{i0}(d) = Y_{i0} \text{ if } D_i = d\]

Causal Estimand

  • Causal estimand: Average Treatment Effect on the Treated (ATT) in time \(1\)

    \[\tau_{\text{ATT}} = E[Y_{i1}(1) | D_i = 1] - E[Y_{i1}(0) | D_i = 1]\]

  • The first part we can get directly from the data (observed outcome among the treated group)

    \[\tau_{\text{ATT}} = E[Y_{i1} | D_i = 1] - E[Y_{i1}(0) | D_i = 1]\]

  • Second part we don’t observe directly and need some additional assumptions.

    • But now we won’t assume ignorability of treatment: \(Y_{i1}(0) \cancel{{\perp \! \! \! \perp}} D_i\)

Identifying assumptions

  • No anticipation

    • The pre-treatment outcome is unaffected by receipt of treatment.

      \[Y_{i0}(1) = Y_{i0}(0) = Y_{i0}\]

  • Parallel trends

    • In the absence of treatment, the treated group’s average outcome trend would be the same as the control group’s

      \[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i0}(0) | D_i = 1] \right\}}_{\text{Average counterfactual trend among treated}} = \underbrace{\left\{E[Y_{i1}(0)| D_i = 0] - E[Y_{i0}(0)| D_i = 0]\right\}}_{\text{Average trend among controls}}\]

    • Equivalently

      \[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias at time 1}} = \underbrace{\left\{E[Y_{i0}(0) | D_i = 1] - E[Y_{i0}(0)| D_i = 0]\right\}}_{\text{Selection bias at time 0}}\]

Identification

  • Remember the selection bias formula for the ATT:

    \[\tau_{\text{ATT}} = \underbrace{\left\{E[Y_{i1} | D_i = 1] - E[Y_{i1} | D_i = 0]\right\}}_{\text{Difference-in-means in time 1}} - \underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias}}\]

  • We can observe \(E[Y_{i1}(0)| D_i = 0]\), but can’t observe \(E[Y_{i1}(0)| D_i = 1]\)

  • Under parallel trends:

    \[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias}} = \underbrace{\left\{E[Y_{i0}(0) | D_i = 1] - E[Y_{i0}(0)| D_i = 0]\right\}}_{\text{Selection bias at time 0}}\]

  • And under no anticipation:

    \[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias}} = \underbrace{\left\{E[Y_{i0} | D_i = 1] - E[Y_{i0}| D_i = 0]\right\}}_{\text{Observed difference at time 0}}\]
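
  • Putting the pieces together (a short worked step): substituting the last display into the selection bias formula for the ATT replaces the unobserved selection bias term with the observed difference at time 0, so the ATT is identified by a difference-in-differences of observed means.

    \[\tau_{\text{ATT}} = \underbrace{\left\{E[Y_{i1} | D_i = 1] - E[Y_{i1} | D_i = 0]\right\}}_{\text{Difference-in-means in time 1}} - \underbrace{\left\{E[Y_{i0} | D_i = 1] - E[Y_{i0}| D_i = 0]\right\}}_{\text{Difference-in-means in time 0}}\]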

Difference-in-differences

Estimation

  • With repeated observations at the unit level, we can use a simple regression of the differenced outcomes on the treatment indicator

    \[Y_{i1} - Y_{i0} = \alpha + \tau D_i + \epsilon_i\]

  • Each row in the data is a single unit with outcomes in two time periods.

    • Straightforward asymptotic inference with Neyman SEs
  • Prefer this approach when possible

    • Parallel trends is equivalent to a (mean) ignorability assumption w.r.t. the difference in outcomes from time 0 to time 1
    • Easier to work with for conditional parallel trends (e.g. adjust for time-invariant covariates w/ AIPW) (Sant’Anna and Zhao, 2020)
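
As a minimal sketch of the differenced regression, the toy data below (six hypothetical units with made-up outcomes, not the Card and Krueger data) show that the coefficient on \(D_i\) matches the difference-in-differences of the four group means. Base R's lm() stands in for lm_robust() here only to keep the example dependency-free; the variable names are illustrative assumptions.

```r
# Hypothetical toy data: 3 control units (D = 0) and 3 treated units (D = 1)
toy <- data.frame(
  D  = c(0, 0, 0, 1, 1, 1),
  Y0 = c(10, 12, 11, 15, 17, 16),  # outcome at time 0 (everyone under control)
  Y1 = c(12, 13, 14, 20, 21, 22)   # outcome at time 1 (treated units now treated)
)

# Regress the differenced outcome on the treatment indicator
toy$dY <- toy$Y1 - toy$Y0
fd_fit <- lm(dY ~ D, data = toy)
tau_fd <- unname(coef(fd_fit)["D"])

# Difference-in-differences computed directly from the four group means
did_means <- (mean(toy$Y1[toy$D == 1]) - mean(toy$Y1[toy$D == 0])) -
  (mean(toy$Y0[toy$D == 1]) - mean(toy$Y0[toy$D == 0]))

print(c(first_difference = tau_fd, did_of_means = did_means))
```

Both quantities equal 3 in this toy example: the time 1 gap between groups is 8 and the time 0 gap is 5.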

Two-way fixed effects

  • Suppose our dataset is organized where each row is a unit/time period - \(it\).

    • Let \(D_{it}\) denote whether a unit is treated at time \(t\)
  • We can recover our 2x2 DiD estimator (in the simple two-group, two-period setting) using a “two-way” fixed effects regression

    • Unique parameter for each unit or treatment timing group
    • Unique parameter for each time period
  • Commonly written as:

    \[Y_{it} = \alpha_i + \delta_{t} + \tau D_{it} + \epsilon_{it}\]

  • Equivalently (with timing group FEs instead of unit FEs)

    \[Y_{it} = \alpha + \beta D_i + \delta_{t} + \tau D_{it} + \epsilon_{it}\]

Two-way fixed effects

  • Expectations:
    • \(E[Y_{i0} | D_i = 0] = \alpha + \delta_0\)
    • \(E[Y_{i1} | D_i = 0] = \alpha + \delta_1\)
    • \(E[Y_{i0} | D_i = 1] = \alpha + \beta + \delta_0\)
    • \(E[Y_{i1} | D_i = 1] = \alpha + \beta + \delta_1 + \tau\)
  • Differences
    • \(E[Y_{i1} | D_i = 1] - E[Y_{i0} | D_i = 1] = \delta_1 - \delta_0 + \tau\)
    • \(E[Y_{i1} | D_i = 0] - E[Y_{i0} | D_i = 0] = \delta_1 - \delta_0\)
  • Difference-in-differences
    • \(\{E[Y_{i1} | D_i = 1] - E[Y_{i0} | D_i = 1]\} - \{E[Y_{i1} | D_i = 0] - E[Y_{i0} | D_i = 0]\} = \tau\)
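
The algebra above can be checked numerically. The sketch below uses a hypothetical six-unit panel (made-up numbers) in long format and fits both the unit fixed effects version and the group fixed effects version with base R's lm(); under these assumptions both coefficients on the treatment indicator coincide with the 2x2 DiD.

```r
# Hypothetical wide data: 3 control units and 3 treated units, two periods
wide <- data.frame(
  unit = 1:6,
  D    = c(0, 0, 0, 1, 1, 1),
  Y0   = c(10, 12, 11, 15, 17, 16),
  Y1   = c(12, 13, 14, 20, 21, 22)
)

# Reshape to long format: one row per unit-period
long <- data.frame(
  unit = rep(wide$unit, times = 2),
  D    = rep(wide$D, times = 2),
  time = rep(c(0, 1), each = nrow(wide)),
  Y    = c(wide$Y0, wide$Y1)
)
# D_it: treated group observed in the post period
long$treat <- long$D * long$time

# Unit fixed effects + time fixed effects
twfe_unit <- lm(Y ~ treat + factor(time) + factor(unit), data = long)
# Group (D_i) fixed effect + time fixed effects
twfe_group <- lm(Y ~ treat + D + factor(time), data = long)

tau_unit  <- unname(coef(twfe_unit)["treat"])
tau_group <- unname(coef(twfe_group)["treat"])
print(c(unit_fe = tau_unit, group_fe = tau_group))
```

Both versions return 3 here, the same value as the difference in mean changes between the two groups (5 for treated, 2 for control).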

Two-way fixed effects

  • Does not generalize neatly to many time periods w/ variation in treatment timing.
    • Need to also assume a constant, instantaneous treatment effect \(\tau\)
    • We’ll spend a lot more time on this next week!
  • Need to account for dependence in observations between time periods w/in same unit
    • “Cluster-robust” SEs or block bootstrap
    • We’ll discuss this more later!
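
One way to respect within-unit dependence is a block bootstrap that resamples whole units and keeps all of each unit's time periods together. The sketch below is a rough illustration on simulated data; the data-generating process, the helper did_from_long(), and the 500 replications are all assumptions made for this example.

```r
set.seed(813)

# Simulated two-period panel (hypothetical): 20 control and 20 treated units
n_units <- 40
wide <- data.frame(unit = 1:n_units, D = rep(c(0, 1), each = n_units / 2))
wide$Y0 <- rnorm(n_units) + wide$D                        # baseline selection bias
wide$Y1 <- wide$Y0 + 1 + 2 * wide$D + rnorm(n_units)      # common trend 1, effect 2
long <- data.frame(
  unit = rep(wide$unit, times = 2),
  D    = rep(wide$D, times = 2),
  time = rep(c(0, 1), each = n_units),
  Y    = c(wide$Y0, wide$Y1)
)

# 2x2 DiD from the four group-period means of a long data frame
did_from_long <- function(df) {
  m <- function(d, t) mean(df$Y[df$D == d & df$time == t])
  (m(1, 1) - m(0, 1)) - (m(1, 0) - m(0, 0))
}

niter <- 500
boot_est <- numeric(niter)
for (b in seq_len(niter)) {
  # Resample unit ids with replacement, then pull every row for each drawn unit
  drawn <- sample(wide$unit, n_units, replace = TRUE)
  rows <- unlist(lapply(drawn, function(u) which(long$unit == u)))
  boot_est[b] <- did_from_long(long[rows, ])
}

# Percentile 95% interval
boot_ci <- quantile(boot_est, c(0.025, 0.975))
print(boot_ci)
```

Resampling rows of the long data independently would instead treat the two observations on a unit as unrelated, which is the dependence problem the bullet above warns about.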

Example: Card and Krueger (1994, AER)

  • Does increasing the minimum wage reduce employment?
    • Classical theoretical models suggest yes…
    • But empirical evidence is hard to come by - no one has (yet) randomized the minimum wage.
  • Card and Krueger use a policy change in New Jersey relative to Pennsylvania
  • In 1992, NJ raised its minimum wage from 4.25 dollars per hour to 5.05 dollars per hour
    • PA stayed at 4.25 dollars per hour
  • Surveyed 410 fast food restaurants before and after the change was put into place
    • Compared change in employment before/after in NJ with change before/after in PA.
  • Key assumption - Had NJ not implemented the minimum wage increase, the average trend in NJ fast food restaurant employment would have been the same as the average trend in PA fast food restaurant employment

Example: Card and Krueger (1994, AER)

library(tidyverse)  # read_csv, pivot_longer, dplyr verbs
library(estimatr)   # lm_robust, tidy

# Load the data for Card and Krueger (1994)
minwage <- read_csv("assets/data/minwage.csv")
# Index of observations
minwage$unit <- 1:nrow(minwage)

# Change in full-time employment
minwage$CHG_EMPFT <- minwage$EMPFT2 - minwage$EMPFT

# Regress change on treatment (STATE = 1 for NJ)
diff <- lm_robust(CHG_EMPFT ~ STATE, data=minwage, se_type = "HC2")
tidy(diff)
         term estimate std.error statistic p.value conf.low conf.high  df
1 (Intercept)    -2.49      1.65     -1.52  0.1306   -5.728     0.743 356
2       STATE     2.93      1.73      1.69  0.0915   -0.475     6.329 356
    outcome
1 CHG_EMPFT
2 CHG_EMPFT

Example: Card and Krueger (1994, AER)

# Equivalence of TWFE in 2x2 case
minwage_long <- minwage %>% pivot_longer(cols = starts_with("EMPFT"),
                names_to = "time_str", names_prefix = "EMPFT", values_to = "EMPFT")

# Recode time variable
minwage_long$time <- NA
minwage_long$time[minwage_long$time_str == ""] <- 0
minwage_long$time[minwage_long$time_str == "2"] <- 1

# Make the treatment variable
minwage_long$treat <- as.integer(minwage_long$STATE == 1 & minwage_long$time == 1)

# TWFE
twfe_reg <- lm_robust(EMPFT ~ treat + as.factor(time) + as.factor(unit),
                      data=minwage_long, clusters=unit, se_type = "CR2")
tidy(twfe_reg) %>% filter(term == "treat")
   term estimate std.error statistic p.value conf.low conf.high   df outcome
1 treat     2.93      1.73      1.69  0.0938   -0.505      6.36 98.7   EMPFT

Example: Card and Krueger (1994, AER)

  • Suppose we thought that parallel trends in Card and Krueger (1994) only held conditional on the type of fast food restaurant
# 1=Burger King; 2=KFC; 3=Roy Rogers; 4=Wendy's
minwage %>% group_by(STATE) %>% summarize(mean(CHAIN == 1), mean(CHAIN==2), mean(CHAIN == 3), mean(CHAIN ==4))
# A tibble: 2 × 5
  STATE `mean(CHAIN == 1)` `mean(CHAIN == 2)` `mean(CHAIN == 3)`
  <dbl>              <dbl>              <dbl>              <dbl>
1     0              0.463              0.149              0.224
2     1              0.405              0.223              0.251
# ℹ 1 more variable: `mean(CHAIN == 4)` <dbl>
# Fit a propensity score model
weight_model <- glm(STATE ~ as.factor(CHAIN), data=minwage, family=binomial(link="logit"))

# Predict weights
minwage$e <- predict(weight_model, type="response")
minwage$did_wt <- (1/mean(minwage$STATE)) * ((minwage$STATE -minwage$e)/(1-minwage$e))

Example: Card and Krueger (1994, AER)

# Point estimate
mean(minwage$CHG_EMPFT*minwage$did_wt)
[1] 2.46
# Flip the sign of the (negative) control-unit weights so every weight is positive, as weighted OLS requires
minwage$did_wt_reg <- minwage$did_wt*minwage$STATE - minwage$did_wt*(1-minwage$STATE)
lm_robust(CHG_EMPFT ~ STATE, data=minwage, weights=did_wt_reg)
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    -2.02       1.49   -1.36    0.174   -4.946    0.898 356
STATE           2.46       1.58    1.56    0.120   -0.647    5.564 356
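
For reference, the weights built above appear to correspond to an inverse propensity weighted DiD estimand of the form usually attributed to Abadie (2005). Here \(e(X_i) = P(D_i = 1 | X_i)\) denotes the propensity score (estimated above from the chain indicators); this notation is introduced for illustration and the display is a sketch of the target quantity rather than a derivation.

    \[\tau_{\text{ATT}} = E\left[\frac{1}{P(D_i = 1)} \times \frac{D_i - e(X_i)}{1 - e(X_i)} \times (Y_{i1} - Y_{i0})\right]\]

Treated units receive weight \(1/P(D_i = 1)\), while control units receive the negative weight \(-e(X_i)/\{(1 - e(X_i))P(D_i = 1)\}\), which is why the sign has to be flipped before passing the weights to a weighted regression.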

Example: Card and Krueger (1994, AER)

# Bootstrap
set.seed(60637)
niter <- 1000
boot_est <- rep(NA, niter)
for(i in 1:niter){
  boot_minwage <- minwage[sample(1:nrow(minwage), nrow(minwage), replace=TRUE),]
  # Fit a propensity score model
  weight_model_boot <- glm(STATE ~ as.factor(CHAIN), data=boot_minwage, family=binomial(link="logit"))

  # Predict weights
  boot_minwage$e <- predict(weight_model_boot, type="response")
  boot_minwage$did_wt <- (1/mean(boot_minwage$STATE)) * ((boot_minwage$STATE - boot_minwage$e)/(1-boot_minwage$e))

  # Point est
  boot_est[i] <- mean(boot_minwage$CHG_EMPFT*boot_minwage$did_wt)
}
#Bootstrap 95% CI
quantile(boot_est, c(.025, .975))
  2.5%  97.5% 
-0.707  5.464 

Example: Card and Krueger (1994, AER)

  • Sant’Anna and Zhao (2020) doubly-robust estimator
# This package requires using the *long* data
library(DRDID) 
# Notably this doesn't add much with the fully saturated propensity score model - we're really just stratifying by chain
dr_minwage <- drdid(yname="EMPFT",
                    tname="time",
                    idname="unit",
                    dname="STATE",
                    xformla = ~as.factor(CHAIN),
                    data=minwage_long,
                    estMethod = "trad")
summary(dr_minwage)
 Call:
drdid(yname = "EMPFT", tname = "time", idname = "unit", dname = "STATE", 
    xformla = ~as.factor(CHAIN), data = minwage_long, estMethod = "trad")
------------------------------------------------------------------
 Locally efficient DR DID estimator for the ATT:
 
   ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
  2.4585     1.5383     1.5981      0.11     -0.5567     5.4736  
------------------------------------------------------------------
 Estimator based on panel data.
 Outcome regression est. method: OLS.
 Propensity score est. method: maximum likelihood.
 Analytical standard error.
------------------------------------------------------------------
 See Sant'Anna and Zhao (2020) for details.

Difference-in-differences with many time periods

Classic 2x2 DiD

  • Difference-in-differences estimators can be understood in terms of their component 2 \(\times\) 2 comparisons
    • …between two treatment timing groups \(g\) and \(g^\prime\)

    • …and between two time periods \(t\) and \(t^\prime\)

      \[\underbrace{\bigg[\bar{Y}_{g,t} - \bar{Y}_{g^\prime, t} \bigg]}_{\text{cross-sectional difference at time } t} - \underbrace{\bigg[\bar{Y}_{g,t^\prime} - \bar{Y}_{g^\prime, t^\prime} \bigg]}_{\text{cross-sectional difference at time } t^\prime}\]

  • In the classic 2x2 case, we have…
    • time period \(t\) as the period where treatment is assigned to some units and \(t^\prime\) as the “pre-treatment” period.
    • timing group \(g\) as the group that receives treatment (\(D_i = 1\)) and timing group \(g^\prime\) as the units that are always under control (\(D_i = 0\))
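
The generic 2x2 comparison can be written as a small helper. The function name did_2x2() and the column names G, time and Y are hypothetical choices for this sketch, and the data are made up.

```r
# Generic 2x2 DiD between timing groups g and g_prime, and periods t and t_prime
did_2x2 <- function(data, g, g_prime, t, t_prime) {
  cell_mean <- function(group, period) {
    mean(data$Y[data$G == group & data$time == period])
  }
  # Cross-sectional difference at t minus cross-sectional difference at t_prime
  (cell_mean(g, t) - cell_mean(g_prime, t)) -
    (cell_mean(g, t_prime) - cell_mean(g_prime, t_prime))
}

# Hypothetical data: two units per group, two periods
toy <- data.frame(
  G    = rep(c("treated", "control"), each = 4),
  time = rep(c(0, 0, 1, 1), times = 2),
  Y    = c(4, 6, 9, 11, 1, 3, 2, 4)
)

est <- did_2x2(toy, g = "treated", g_prime = "control", t = 1, t_prime = 0)
print(est)
```

In the classic case shown here the cell means are 10 and 3 at time 1 and 5 and 2 at time 0, so the estimate is (10 - 3) - (5 - 2) = 4.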

Classic 2x2 DiD

DiD with 2 periods, 2 groups

Many time periods, two timing groups

  • Suppose that instead of having two time periods, we now have \(T\) time periods.
    • We’ll now characterize our treatment groups based on when they start treatment.
    • Let \(G_i\) denote the time period when unit \(i\) initiates treatment
  • We’ll stick with the two timing group case for now
    • Treated units start treatment at some time \(g^*: 1 < g^* \le T\)
    • Control units start treatment at some time after \(T\) (we’ll use \(\infty\))
  • Define potential outcomes in terms of assignment to a timing group
    • \(Y_{it}(g) = Y_{it}\) for units with \(G_i = g\)
    • We’ll denote \(Y_{it}(\infty)\) as the “control” potential outcome

Many time periods, two timing groups

  • Estimand: The group-time average treatment effect on the treated
    • The average difference at time \(t\), among units that started treatment at time \(g\), between their outcome under that timing and the outcome they would have had if they had never started treatment.

      \[\text{ATT}_{g}(t) = \mathbb{E}[Y_{it}(g) - Y_{it}(\infty) | G_i = g]\]

  • Other target estimands are averages of the group-time ATTs
    • e.g. the average of all post-treatment group-time ATTs

DiD with no staggered adoption

DiD with 4 periods, 2 groups

DiD with no staggered adoption

DiD with 4 periods, 2 groups - relative treatment time

Identifying assumptions

  • No anticipation: Pre-treatment outcomes are unaffected by future treatment status.

    \[Y_{it}(g) = Y_{it}(\infty) \ \forall\ t < g\]

  • (General) Parallel trends: For any two time periods \(t\) and \(t^\prime\) and two timing groups \(g\) and \(g^\prime\)

    \[E[Y_{it}(\infty) - Y_{it'}(\infty) | G_i = g] = E[Y_{it}(\infty) - Y_{it'}(\infty) | G_i = g^\prime ]\]

  • Generalizes our prior assumptions to all time periods and across all treatment timing groups

Two-way fixed effects estimators

  • In the case with two treatment timing groups and many time periods, we have two natural quantities of interest

    • The \(ATT_{g^*}(t)\) for each time period \(t \ge g^*\)
    • The average of all post-treatment group-time ATTs \(ATT_{g^*} = \frac{1}{T - g^* + 1}\sum_{t = g^*}^T ATT_{g^*}(t)\)
  • There are two common ways to estimate these effects with two-way fixed effects (TWFE) regressions

  • The Static TWFE

    \[Y_{it} = \alpha_i + \delta_{t} + \tau D_{it} + \epsilon_{it}\]

    where \(D_{it}\) is an indicator for whether unit \(i\) is under treatment at time \(t\)

  • In the two timing-group case we’re looking at here, \(D_{it} = \mathbf{1}(G_i = g^*) \times \mathbf{1}(t \ge g^*)\)

Two-way fixed effects estimators

  • The Dynamic TWFE

    \[Y_{it} = \alpha_i + \delta_{t} + \sum_{l \neq -1} \tau_l D^{(l)}_{it} + \epsilon_{it}\]

    where \(D_{it}^{(l)}\) is a dummy indicator for observation \(i\) being \(l\) periods from treatment initiation at time \(t\)

  • In the two timing-group case, this regression looks like:

    \[Y_{it} = \alpha_i + \delta_{t} + \sum_{\substack{l = -(g^* - 1) \\ l \neq -1}}^{T - g^*} \tau_l \times \mathbf{1}(G_i = g^*)\times\mathbf{1}(t = g^* + l) + \epsilon_{it}\]

  • Note that we need to omit at least one relative treatment time indicator!

    • Otherwise we have perfect collinearity with the TWFE

Identification

  • Do these regressions identify the target parameters we’re looking for in the two timing group case?

  • Static TWFE: Yes

    • \(\hat{\tau}\) is an average over the 2x2 diff-in-diffs between each post-treatment and each pre-treatment period.
    • Equivalent to collapsing the data into a 2x2 DiD by averaging over all pre- and post-treatment outcomes for treated/control
    • Identifies the average of post-treatment ATTs \(ATT_{g^*}\)
  • Dynamic TWFE: Yes!

    • \(\hat{\tau}_l\) is the 2x2 DiD between relative treatment time \(l\) and the held-out “baseline” period (period \(-1\))
    • Each coefficient identifies \(ATT_{g^*}(g^* + l)\)
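
Both claims can be checked on a small balanced panel. The sketch below assumes a hypothetical panel of four units and four periods with made-up outcomes and \(g^* = 3\), and compares the TWFE coefficients from base R's lm() with the corresponding hand-computed 2x2 DiDs.

```r
# Hypothetical balanced panel: 4 units, 4 periods, treatment starts at g_star = 3
g_star <- 3
panel <- expand.grid(unit = 1:4, time = 1:4)      # unit varies fastest
panel$G <- ifelse(panel$unit >= 3, g_star, Inf)   # units 3-4 treated, 1-2 never treated
# Made-up outcomes: one row of four values (units 1 to 4) per time period
panel$Y <- c(1.0, 2.1, 3.2, 4.0,
             1.5, 2.4, 3.9, 4.6,
             2.2, 2.8, 6.1, 7.3,
             2.4, 3.5, 7.0, 7.9)

treated <- panel$G == g_star
panel$D <- as.integer(treated & panel$time >= g_star)

# Mean outcome for a group over a set of periods
cell <- function(is_treated, periods) {
  mean(panel$Y[treated == is_treated & panel$time %in% periods])
}

# Static TWFE vs. the collapsed pre/post 2x2 DiD
static_fit <- lm(Y ~ D + factor(unit) + factor(time), data = panel)
tau_static <- unname(coef(static_fit)["D"])
collapsed_did <- (cell(TRUE, 3:4) - cell(FALSE, 3:4)) -
  (cell(TRUE, 1:2) - cell(FALSE, 1:2))

# Dynamic TWFE: relative-time indicators for l = -2, 0, 1 (l = -1 is held out)
panel$rel_m2 <- as.integer(treated & panel$time == g_star - 2)
panel$rel_0  <- as.integer(treated & panel$time == g_star)
panel$rel_1  <- as.integer(treated & panel$time == g_star + 1)
dynamic_fit <- lm(Y ~ rel_m2 + rel_0 + rel_1 + factor(unit) + factor(time),
                  data = panel)
tau_0 <- unname(coef(dynamic_fit)["rel_0"])
# 2x2 DiD between period g_star and the baseline period g_star - 1
did_0 <- (cell(TRUE, 3) - cell(FALSE, 3)) - (cell(TRUE, 2) - cell(FALSE, 2))

print(c(static = tau_static, collapsed = collapsed_did, dynamic_0 = tau_0, did_0 = did_0))
```

The equalities rely on the panel being balanced with a single treatment date; they are not expected to hold under staggered adoption.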

TWFE as an average over 2x2s

Static TWFE 2x2s

TWFE as an average over 2x2s

Dynamic TWFE - relative time 1

Example: Ferwerda (2021)

Ferwerda, Jeremy. “Immigration, voting rights, and redistribution: Evidence from local governments in Europe.” The Journal of Politics 83.1 (2021): 321-339.

  • Does expanding voting rights in municipal elections to non-citizen foreign residents increase social expenditures?
    • Setting: Swiss municipalities in the 2000s
  • In Switzerland, voting rights vary by canton
    • Vaud, Fribourg and Geneva implemented foreign voting rights in 2003, 2004 and 2005 respectively.
  • We’re going to focus on Vaud (earliest adopter) as the treated group and Geneva (latest adopter) as the control group
    • Actual paper uses a different analysis than DiD/TWFE and has covariates, but we’ll do a simple DiD to illustrate

Example: Ferwerda (2021)

# Pre-processing (haven for read_dta; fixest for feols, etable and sunab)
library(haven)
library(fixest)
voting <- read_dta("assets/data/Swiss_master.dta", encoding='latin1')
voting_complete <- voting %>% filter(year >= 2000, year <= 2005) %>% filter(!is.na(log_net_welfare_head))
# Keep only units with complete panels (all 6 years)
voting_complete <- voting_complete %>% group_by(bfsnr) %>% filter(n() == 6) %>% ungroup()
voting_complete <-suppressWarnings(voting_complete %>% 
                          mutate(first_year = min(year[vf == 1]), .by=bfsnr))

Example: Ferwerda (2021)

Unique Treatment Histories in Ferwerda (2021) - Dropping all units with years missing from 2000-2005

Example: Ferwerda (2021)

  • Let’s fit the simple static TWFE on the Vaud and Geneva cases
## Filter down to the two treatment timing groups
voting_vaud_geneva <- voting_complete %>% filter(cantonid %in% c(22, 25))
## Estimate with fixest::feols
static_twfe <- feols(log_net_welfare_head ~ vf | bfsnr + year, data=voting_vaud_geneva, cluster="bfsnr")
etable(static_twfe)
                         static_twfe
Dependent Var.: log_net_welfare_head
                                    
vf                  -0.0927 (0.0599)
Fixed-Effects:  --------------------
bfsnr                            Yes
year                             Yes
_______________ ____________________
S.E.: Clustered            by: bfsnr
Observations                   2,028
R2                           0.95818
Within R2                    0.00791
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Verify this is equivalent to the simple 2x2 DiD
# Canton 22 is the treated group (Vaud); canton 25 is the control group (Geneva)
with(voting_vaud_geneva,
  mean(log_net_welfare_head[year >  2003 & cantonid == 22]) -
  mean(log_net_welfare_head[year >  2003 & cantonid == 25]) -
  mean(log_net_welfare_head[year <= 2003 & cantonid == 22]) +
  mean(log_net_welfare_head[year <= 2003 & cantonid == 25]))
[1] -0.0927

Example: Ferwerda (2021)

  • How about the dynamic TWFE regression?
    • In the non-staggered case, you can just interact an indicator for the treated unit with the time FEs.
    • We’ll just use the sunab() syntax here to match next week’s lecture
dynamic_twfe <- feols(log_net_welfare_head ~ sunab(first_year, year, ref.p = -1) | bfsnr + year,
                         data = voting_vaud_geneva, cluster = "bfsnr")
etable(dynamic_twfe)
                        dynamic_twfe
Dependent Var.: log_net_welfare_head
                                    
year = -4         0.2409*** (0.0563)
year = -3          0.1709** (0.0569)
year = -2            0.0635 (0.0441)
year = 0             0.0378 (0.0337)
year = 1             0.0144 (0.0566)
Fixed-Effects:  --------------------
bfsnr                            Yes
year                             Yes
_______________ ____________________
S.E.: Clustered            by: bfsnr
Observations                   2,028
R2                           0.95920
Within R2                    0.03213
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
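
A rough way to visualize these coefficients is an event-study plot. The sketch below hard-codes the estimates and clustered standard errors from the etable output above and draws them with base R graphics. The 95% intervals use a normal approximation, which is an assumption of this sketch and may differ slightly from the intervals fixest would report.

```r
# Estimates and clustered SEs copied from the dynamic TWFE table above;
# relative time -1 is the held-out baseline, fixed at 0 by construction
es <- data.frame(
  rel_time  = c(-4, -3, -2, -1, 0, 1),
  estimate  = c(0.2409, 0.1709, 0.0635, 0, 0.0378, 0.0144),
  std_error = c(0.0563, 0.0569, 0.0441, 0, 0.0337, 0.0566)
)

# Normal-approximation 95% confidence intervals (assumption of this sketch)
z <- qnorm(0.975)
es$lower <- es$estimate - z * es$std_error
es$upper <- es$estimate + z * es$std_error

plot(es$rel_time, es$estimate, pch = 19,
     ylim = range(es$lower, es$upper),
     xlab = "Years relative to treatment",
     ylab = "Estimate (log net welfare per head)")
segments(es$rel_time, es$lower, es$rel_time, es$upper)  # interval bars
abline(h = 0, lty = 2)     # zero-effect reference line
abline(v = -0.5, lty = 3)  # treatment begins between -1 and 0
```

The intervals at relative times -4 and -3 exclude zero, which is the kind of pre-treatment pattern that raises doubts about parallel trends.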

“Event study” plots

  • What we want!

“Event study” plots

  • What is likely very concerning!

Example: Ferwerda (2021)

  • Let’s estimate the linear time trend specification in feols
lin_trend <- feols(log_net_welfare_head ~ sunab(first_year, year, ref.p = c(-1:-4)) +
                     as.factor(cantonid)*year| bfsnr + year,
                         data = voting_vaud_geneva, cluster = "bfsnr")
etable(lin_trend)
                                        lin_trend
Dependent Var.:              log_net_welfare_head
                                                 
year = 0                       0.1266*** (0.0373)
year = 1                        0.1862** (0.0581)
as.factor(cantonid)25 x year   0.0830*** (0.0178)
Fixed-Effects:               --------------------
bfsnr                                         Yes
year                                          Yes
____________________________ ____________________
S.E.: Clustered                         by: bfsnr
Observations                                2,028
R2                                        0.95919
Within R2                                 0.03189
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
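
Reading off the feols call above, the specification being fit appears to be the following. The symbol \(\gamma\) is notation introduced here for illustration: it is a differential linear time trend between the two timing groups. All pre-treatment relative-time indicators are omitted (ref.p = c(-1:-4)), so the pre-period data are used to estimate \(\gamma\) rather than placebo coefficients.

    \[Y_{it} = \alpha_i + \delta_{t} + \gamma \times \mathbf{1}(G_i = g^*) \times t + \sum_{l = 0}^{T - g^*} \tau_l \times \mathbf{1}(G_i = g^*)\times\mathbf{1}(t = g^* + l) + \epsilon_{it}\]

The code attaches the trend to the control canton (cantonid 25) instead of the treated one, which is the same model up to the sign of \(\gamma\).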

Example: Ferwerda (2021)

  • Do we get different results if we specify different parametric forms?
quad_trend <- feols(log_net_welfare_head ~ sunab(first_year, year, ref.p = c(-1:-4)) + 
                      as.factor(cantonid)*year + as.factor(cantonid)*I(year^2) | bfsnr + year,
                         data = voting_vaud_geneva, cluster = "bfsnr")
etable(quad_trend)
                                            quad_trend
Dependent Var.:                   log_net_welfare_head
                                                      
year = 0                               0.1183 (0.0798)
year = 1                               0.1679 (0.1528)
as.factor(cantonid)25 x year             6.725 (61.70)
as.factor(cantonid)25 x I(year^2)     -0.0017 (0.0154)
Fixed-Effects:                    --------------------
bfsnr                                              Yes
year                                               Yes
_________________________________ ____________________
S.E.: Clustered                              by: bfsnr
Observations                                     2,028
R2                                             0.95919
Within R2                                      0.03190
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Ferwerda (2021)

  • Let’s hold out some periods to use as placebos

Example: Ferwerda (2021)

  • What if we fit a quadratic time trend instead?

Conclusion

  • Most differences-in-differences designs have many pre- and post-treatment periods
    • Can estimate effects for many post-treatment periods - plot trajectory of the effect over time
    • Can estimate effects for pre-treatment periods - placebo/pre-trends tests
  • When we have no staggering in treatment adoption (two timing groups), TWFE estimators are equivalent to averages over valid 2x2 differences-in-differences
    • Static - Single coefficient on treatment
    • Dynamic - Coefficients for every relative-treatment-time
    • Don’t forget to be clear about what your baseline (held-out) period is when using the dynamic specification!

Next week

  • What happens when we have staggered adoption of treatment?
    • TWFE estimators are biased unless we make strong constant effects assumptions - “forbidden comparisons”/“negative weighting problem”
    • Static and Dynamic TWFE require different constant effects assumptions!
  • Alternatives: “New DiD”
    • First-differences approaches: construct the DiD for each group-time ATT and aggregate (Callaway/Sant’Anna estimator)
    • Regression imputation: Fit TWFE to controls and impute on the treateds (Borusyak/Jaravel/Spiess estimator)
    • Properly saturate the TWFE (Sun and Abraham/Wooldridge estimators)
    • Many equivalencies between these!
  • Partial identification by bounding the parallel trends violations - Rambachan/Roth approach