Week 9: Differences-in-Differences

PS 813 - Causal Inference

Anton Strezhnev

strezhnev@wisc.edu

University of Wisconsin-Madison

March 16, 2026

\[ \require{cancel} \]

Last week

Identification under unobserved confounding
Instrumental variables - we can identify the local average treatment effect (LATE) with an instrument that…
- …is ignorable/conditionally ignorable…
- …and affects the outcome only through the treatment (exclusion restriction)…
- …and has a monotonic effect on treatment.
Our simple IV estimator:
- Reduced form (instrument’s effect on outcome) in the numerator
- First stage (instrument’s effect on the treatment) in the denominator
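Written out for a binary instrument, this ratio is the usual Wald estimator. The symbol \(Z_i\) for the instrument is an assumption introduced here for illustration, since the recap above does not name it:

\[\hat{\tau}_{\text{IV}} = \frac{\hat{E}[Y_i | Z_i = 1] - \hat{E}[Y_i | Z_i = 0]}{\hat{E}[D_i | Z_i = 1] - \hat{E}[D_i | Z_i = 0]}\]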

This week

More strategies for identification under unobserved confounding
When we have repeated observations over time, can we use pre-treatment outcomes to help with inference?
- Time 1: Some units treated, some units under control
- Time 0: All units under control
What if the confounding in time 1 were unobserved…
- … but the amount of confounding in time 1 is the same as in time 0?
Then we can use the pre-treatment (time 0) difference in the treated and control arms to de-bias the time 1 difference.
- Difference-in-differences
Assumptions: no anticipation and parallel trends
- Pre-treatment outcomes are unaffected by treatment exposure.
- The trend in the average potential outcome in the treated group would have been the same as the trend in the control group absent treatment.
Generalizes to any setting where we believe there is confounding but where the true effect of treatment is known to be 0
- “Negative Outcome Control”

Difference-in-differences

John Snow and Cholera

1854: Large cholera outbreak near Broad Street in London.
- Physician John Snow hypothesized that cholera was transmitted through the water
- Contrary to popular belief that it was airborne (“miasma theory”)
Snow convinced the local authorities to remove the handle of the Broad Street pump
- Cholera deaths declined
- But was this causal?
Even Snow didn’t necessarily think so…

There is no doubt that the mortality was much diminished, as I said before, by the flight of the population, which commenced soon after the outbreak; but the attacks had so far diminished before the use of the water was stopped, that it is impossible to decide whether the well still contained the cholera poison in an active state, or whether, from some cause, the water had become free from it. (Snow, “On the Mode of Communication of Cholera,” 1855)

John Snow and Cholera

The more interesting John Snow story was not the Broad Street pump, but another 1856 paper titled “Cholera and the water supply in the south districts of London in 1854”
Key insight: South London was served by two major water companies: Lambeth Company and Southwark and Vauxhall Company.
- Lambeth switched to a less contaminated source between 1849 and 1853
Between the epidemics of 1849 and that of 1853, one of the water companies supplying the south districts of London changed its source of supply from the middle of the town, near the foot of the Hungerford Suspension Bridge, to Thames Ditton, at a part of the river which is beyond the influence of the tide, and, therefore, out of reach of the sewage of the metropolis. (Snow, 1856)

John Snow and Cholera

Snow compared Lambeth (treated) districts with Southwark and Vauxhall (control) districts – less mortality in Lambeth.

…Taking into account the population supplied respectively by each company, the mortality was, at this period of the epidemic, nearly eight times as great in that supplied by the Southwark and Vauxhall Company as in that supplied by the Lambeth Company. (Snow, 1856)
But this isn’t enough - what if Lambeth districts differed in unobserved ways from Southwark and Vauxhall districts?
So Snow also compared the observed mortality in 1853 to mortality in 1849, when both sets of districts used contaminated water.

In the autumn of 1853 it was shown by Dr. Farr* that the districts partly supplied by this, the Lambeth Water Company, with improved water, suffered less than the districts supplied entirely by the Southwark and Vauxhall Company with the water from the river at Battersea Fields, although in 1849 they had suffered rather more than the latter districts (Snow, 1856).

John Snow and Cholera

This was one of the earliest “difference-in-differences” designs (see also Semmelweis’s work on antisepsis in 1861).
- Not just a before-after comparison
- Not just a cross-sectional comparison
Implicit assumption: If there were something different about Lambeth (aside from the treatment) it would have the same effect on the outcome in the pre-treatment (1849) period as it would in 1853.
- An assumption on the counterfactuals: Had treatment not changed in Lambeth, the average trend (from 1849 to 1853) in Lambeth would have been the same as the trend in Southwark and Vauxhall

DiD with two periods

Two groups (treated/control); two time periods (0, 1).
- \(D_i = 1\): treated in time \(1\); \(D_i = 0\): control in time \(1\)
- All units under control in time \(0\)
Two outcomes observed
- \(Y_{i1}\): outcome in period \(1\)
- \(Y_{i0}\): outcome in period \(0\)
Potential outcomes + consistency assumption.

\[Y_{i1}(d) = Y_{i1} \text{ if } D_i = d\]

\[Y_{i0}(d) = Y_{i0} \text{ if } D_i = d\]

Causal Estimand

Causal estimand: Average Treatment Effect on the Treated (ATT) in time \(1\)

\[\tau_{\text{ATT}} = E[Y_{i1}(1) | D_i = 1] - E[Y_{i1}(0) | D_i = 1]\]
The first part we can get directly from the data (observed outcome among the treated group)

\[\tau_{\text{ATT}} = E[Y_{i1} | D_i = 1] - E[Y_{i1}(0) | D_i = 1]\]
Second part we don’t observe directly and need some additional assumptions.
- But now we won’t assume ignorability of treatment: \(Y_{i1}(0) \cancel{{\perp \! \! \! \perp}} D_i\)

Identifying assumptions

No anticipation
- The pre-treatment outcome is unaffected by receipt of treatment.
  
  \[Y_{i0}(1) = Y_{i0}(0) = Y_{i0}\]
Parallel trends
- In the absence of treatment, the treated group’s average outcome trend would be the same as the control group’s
  
  \[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i0}(0) | D_i = 1] \right\}}_{\text{Average counterfactual trend among treated}} = \underbrace{\left\{E[Y_{i1}(0)| D_i = 0] - E[Y_{i0}(0)| D_i = 0]\right\}}_{\text{Average trend among controls}}\]
- Equivalently
  
  \[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias at time 1}} = \underbrace{\left\{E[Y_{i0}(0) | D_i = 1] - E[Y_{i0}(0)| D_i = 0]\right\}}_{\text{Selection bias at time 0}}\]
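To see why the two statements are equivalent, start from the trend version and move one term across the equality (add \(E[Y_{i0}(0) | D_i = 1]\) to both sides and subtract \(E[Y_{i1}(0) | D_i = 0]\) from both sides). Nothing beyond the quantities already defined above is used:

\[\begin{aligned}
E[Y_{i1}(0) | D_i = 1] - E[Y_{i0}(0) | D_i = 1] &= E[Y_{i1}(0)| D_i = 0] - E[Y_{i0}(0)| D_i = 0] \\
\iff \quad E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0] &= E[Y_{i0}(0) | D_i = 1] - E[Y_{i0}(0)| D_i = 0]
\end{aligned}\]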

Identification

Remember the selection bias formula for the ATT:

\[\tau_{\text{ATT}} = \underbrace{\left\{E[Y_{i1} | D_i = 1] - E[Y_{i1} | D_i = 0]\right\}}_{\text{Difference-in-means in time 1}} - \underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias}}\]
We can observe \(E[Y_{i1}(0)| D_i = 0]\), but can’t observe \(E[Y_{i1}(0)| D_i = 1]\)
Under parallel trends:

\[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias}} = \underbrace{\left\{E[Y_{i0}(0) | D_i = 1] - E[Y_{i0}(0)| D_i = 0]\right\}}_{\text{Selection bias at time 0}}\]
And under no anticipation:

\[\underbrace{\left\{E[Y_{i1}(0) | D_i = 1] - E[Y_{i1}(0)| D_i = 0]\right\}}_{\text{Selection bias}} = \underbrace{\left\{E[Y_{i0} | D_i = 1] - E[Y_{i0}| D_i = 0]\right\}}_{\text{Observed difference at time 0}}\]

Parallel trends

Substituting back in yields an expression for the ATT in terms of the difference in observed differences

\[\tau_{\text{ATT}} = \underbrace{\left\{E[Y_{i1} | D_i = 1] - E[Y_{i1} | D_i = 0]\right\}}_{\text{Difference-in-means in time 1}} - \underbrace{\left\{E[Y_{i0} | D_i = 1] - E[Y_{i0} | D_i = 0]\right\}}_{\text{Difference-in-means at time 0}}\]
Or equivalently

\[\tau_{\text{ATT}} = \underbrace{\left\{E[Y_{i1} - Y_{i0} | D_i = 1]\right\}}_{\text{Average change in the treated group}} - \underbrace{\left\{E[Y_{i1} - Y_{i0} | D_i = 0]\right\}}_{\text{Average change in the control group}}\]
We can estimate each of these four expectations with the sample means.
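A minimal simulation sketch of the four-means estimator. The data-generating process, sample size, and effect size are illustrative assumptions and are not taken from any study: an unobserved confounder `U` shifts both periods equally for the treated group, so the time-1 comparison is biased while the difference-in-differences is not.

```r
# Simulated two-period panel (all values are illustrative assumptions)
set.seed(813)
n <- 5000
D <- rbinom(n, 1, 0.4)                 # treated-group indicator
U <- rnorm(n, mean = 2 * D)            # unobserved confounder, higher among the treated
true_att <- 1.5
Y0 <- 1 + U + rnorm(n)                 # period 0: everyone under control
Y1 <- 2 + U + true_att * D + rnorm(n)  # period 1: common trend of +1, plus the effect

# Naive cross-sectional comparison at time 1 (contaminated by selection on U)
naive <- mean(Y1[D == 1]) - mean(Y1[D == 0])

# Difference-in-differences from the four sample means
did <- (mean(Y1[D == 1]) - mean(Y1[D == 0])) -
       (mean(Y0[D == 1]) - mean(Y0[D == 0]))

round(c(naive = naive, did = did, truth = true_att), 2)
```

Because `U` enters both periods identically, parallel trends holds by construction here; the sketch says nothing about whether it holds in real data.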

Difference-in-differences

Estimation

With repeated observations at the unit level, we can use a simple regression of the differenced outcomes on the treatment indicator

\[Y_{i1} - Y_{i0} = \alpha + \tau D_i + \epsilon_i\]
Each row in the data is a single unit with outcomes in two time periods.
- Straightforward asymptotic inference with Neyman SEs
Prefer this approach when possible
- Equivalent to a (mean) ignorability assumption w.r.t. the difference in outcomes from time 0 to time 1
- Easier to work with for conditional parallel trends (e.g. adjust for time-invariant covariates w/ AIPW) (Sant’Anna and Zhao, 2020)
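As a sketch of the first-difference regression, the base R code below uses simulated data (all numbers are illustrative assumptions) to check that the coefficient on the treatment indicator matches the difference in average changes, and computes a Neyman-style standard error by hand.

```r
# Simulated wide-format data: one row per unit (illustrative values only)
set.seed(813)
n <- 200
D <- rbinom(n, 1, 0.5)
Y0 <- rnorm(n, mean = D)             # baseline differs by group (selection)
Y1 <- Y0 + 0.5 + 1 * D + rnorm(n)    # common trend 0.5, effect 1 for the treated

dY <- Y1 - Y0

# Regression of the differenced outcome on the treatment indicator
fd <- lm(dY ~ D)

# Same quantity from the group means of the change
did_means <- mean(dY[D == 1]) - mean(dY[D == 0])

# Neyman variance estimator for a difference in means
neyman_se <- sqrt(var(dY[D == 1]) / sum(D == 1) + var(dY[D == 0]) / sum(D == 0))

c(regression = unname(coef(fd)["D"]), means = did_means, neyman_se = neyman_se)
```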

Two-way fixed effects

Suppose our dataset is organized where each row is a unit/time period - \(it\).
- Let \(D_{it}\) denote whether a unit is treated at time \(t\)
We can recover our 2x2 DiD estimator (in the simple two-group, two-period setting) using a “two-way” fixed effects regression
- Unique parameter for each unit or treatment timing group
- Unique parameter for each time period
Commonly written as:

\[Y_{it} = \alpha_i + \delta_{t} + \tau D_{it} + \epsilon_{it}\]
Equivalently (with timing group FEs instead of unit FEs)

\[Y_{it} = \alpha + \beta D_i + \delta_{t} + \tau D_{it} + \epsilon_{it}\]

Two-way fixed effects

Expectations:
- \(E[Y_{i0} | D_i = 0] = \alpha + \delta_0\)
- \(E[Y_{i1} | D_i = 0] = \alpha + \delta_1\)
- \(E[Y_{i0} | D_i = 1] = \alpha + \beta + \delta_0\)
- \(E[Y_{i1} | D_i = 1] = \alpha + \beta + \delta_1 + \tau\)
Differences
- \(E[Y_{i1} | D_i = 1] - E[Y_{i0} | D_i = 1] = \delta_1 - \delta_0 + \tau\)
- \(E[Y_{i1} | D_i = 0] - E[Y_{i0} | D_i = 0] = \delta_1 - \delta_0\)
Difference-in-differences
- \(\{E[Y_{i1} | D_i = 1] - E[Y_{i0} | D_i = 1]\} - \{E[Y_{i1} | D_i = 0] - E[Y_{i0} | D_i = 0]\} = \tau\)
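The algebra above can be checked numerically. This sketch builds a long-format version of a simulated two-period panel (values are illustrative assumptions) and fits both the group-FE and unit-FE specifications with base R `lm`; in a balanced 2x2 panel the coefficient on \(D_{it}\) should coincide with the difference-in-differences of the four cell means.

```r
# Simulated two-period panel (illustrative values only)
set.seed(813)
n <- 200
D <- rbinom(n, 1, 0.5)
Y0 <- rnorm(n, mean = D)
Y1 <- Y0 + 0.5 + 1 * D + rnorm(n)

# Long format: one row per unit-period
long <- data.frame(unit = rep(1:n, times = 2),
                   time = rep(c(0, 1), each = n),
                   D_i  = rep(D, times = 2),
                   Y    = c(Y0, Y1))
long$D_it <- long$D_i * long$time     # treated only in period 1

# Timing-group FE version and unit FE version
group_fe <- lm(Y ~ D_i + factor(time) + D_it, data = long)
unit_fe  <- lm(Y ~ factor(unit) + factor(time) + D_it, data = long)

# 2x2 DiD from the four cell means
did_means <- (mean(Y1[D == 1]) - mean(Y0[D == 1])) -
             (mean(Y1[D == 0]) - mean(Y0[D == 0]))

c(group_fe = unname(coef(group_fe)["D_it"]),
  unit_fe  = unname(coef(unit_fe)["D_it"]),
  means    = did_means)
```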

Two-way fixed effects

Does not generalize neatly to many time periods w/ variation in treatment timing.
- Need to also assume a constant, instantaneous treatment effect \(\tau\)
- We’ll spend a lot more time on this next week!
Need to account for dependence in observations between time periods w/in same unit
- “Cluster-robust” SEs or block bootstrap
- We’ll discuss this more later!
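One way to respect within-unit dependence is a block bootstrap that resamples whole units, keeping both of a unit's rows together. The sketch below uses simulated data and a hypothetical helper `did_coef()`; the number of bootstrap draws and the percentile interval are illustrative choices, not a recommendation.

```r
# Simulated long-format two-period panel (illustrative values only)
set.seed(813)
n <- 200
D <- rbinom(n, 1, 0.5)
Y0 <- rnorm(n, mean = D)
Y1 <- Y0 + 0.5 + 1 * D + rnorm(n)
long <- data.frame(unit = rep(1:n, times = 2),
                   time = rep(c(0, 1), each = n),
                   D_i  = rep(D, times = 2),
                   Y    = c(Y0, Y1))
long$D_it <- long$D_i * long$time

# Hypothetical helper: DiD coefficient from the group-FE regression
did_coef <- function(d) {
  unname(coef(lm(Y ~ D_i + factor(time) + D_it, data = d))["D_it"])
}

# Row indices for each unit, so a sampled unit brings all of its periods along
rows_by_unit <- split(seq_len(nrow(long)), long$unit)

boot_est <- replicate(500, {
  ids <- sample(names(rows_by_unit), replace = TRUE)   # resample units, not rows
  did_coef(long[unlist(rows_by_unit[ids]), ])
})

point_est <- did_coef(long)
boot_se   <- sd(boot_est)
boot_ci   <- quantile(boot_est, c(0.025, 0.975))
```

Resampling individual unit-period rows instead would treat the two observations on a unit as independent, which is the dependence problem the bullet above warns about.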

Example: Card and Krueger (1994, AER)

Does increasing the minimum wage reduce employment?
- Classical theoretical models suggest yes…
- But empirical evidence is hard to come by - no one has (yet) randomized the minimum wage.
Card and Krueger use a policy change in New Jersey relative to Pennsylvania
In 1992, NJ raised its minimum wage from 4.25 dollars per hour to 5.05 dollars per hour
- PA stayed at 4.25 dollars per hour
Surveyed 410 fast food restaurants before and after the change was put into place
- Compared change in employment before/after in NJ with change before/after in PA.
Key assumption - Had NJ not implemented the minimum wage increase, the average trend in NJ fast food restaurant employment would have been the same as the average trend in PA fast food restaurant employment

Example: Card and Krueger (1994, AER)

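# Packages assumed by this code: tidyverse (read_csv, pipes, pivot_longer)
# and estimatr (lm_robust, tidy)
library(tidyverse)
library(estimatr)

# Load the data for Card and Krueger (1994)
minwage <- read_csv("assets/data/minwage.csv")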
# Index of observations
minwage$unit <- 1:nrow(minwage)

# Change in full-time employment
minwage$CHG_EMPFT <- minwage$EMPFT2 - minwage$EMPFT

# Regress change on treatment (STATE = 1 for NJ)
diff <- lm_robust(CHG_EMPFT ~ STATE, data=minwage, se_type = "HC2")
tidy(diff)

         term estimate std.error statistic p.value conf.low conf.high  df
1 (Intercept)    -2.49      1.65     -1.52  0.1306   -5.728     0.743 356
2       STATE     2.93      1.73      1.69  0.0915   -0.475     6.329 356
    outcome
1 CHG_EMPFT
2 CHG_EMPFT

Example: Card and Krueger (1994, AER)

# Equivalence of TWFE in 2x2 case
minwage_long <- minwage %>% pivot_longer(cols = starts_with("EMPFT"),
                names_to = "time_str", names_prefix = "EMPFT", values_to = "EMPFT")

# Recode time variable
minwage_long$time <- NA
minwage_long$time[minwage_long$time_str == ""] <- 0
minwage_long$time[minwage_long$time_str == "2"] <- 1

# Make the treatment variable
minwage_long$treat <- as.integer(minwage_long$STATE == 1 & minwage_long$time == 1)

# TWFE
twfe_reg <- lm_robust(EMPFT ~ treat + as.factor(time) + as.factor(unit),
                      data=minwage_long, clusters=unit, se_type = "CR2")
tidy(twfe_reg) %>% filter(term == "treat")

   term estimate std.error statistic p.value conf.low conf.high   df outcome
1 treat     2.93      1.73      1.69  0.0938   -0.505      6.36 98.7   EMPFT

Conditional parallel trends

What if parallel trends holds only conditional on a set of pre-treatment covariates?

\[\underbrace{\left\{E[Y_{i1}(0) - Y_{i0}(0) | D_i = 1, X_i = x] \right\}}_{\text{Average counterfactual trend among treated}} = \underbrace{\left\{E[Y_{i1}(0) - Y_{i0}(0)| D_i = 0, X_i = x]\right\}}_{\text{Average trend among controls}}\]
We can include the covariates in the regression or TWFE
- But beware of unit-constant covariates in TWFE - they get soaked up by the unit fixed effects
- And be careful with time-varying covariates - we don’t want to control for consequences of treatment.

Conditional parallel trends

Can we adjust without strong assumptions on the outcome model?
- Abadie (2005) shows that an IPTW estimator can identify the ATT under conditional parallel trends
- Intuition: Treated units get a constant weight. Control units are reweighted to match the covariate distribution among the treateds.
  
  \[E[Y_{i1}(1) - Y_{i1}(0)| D_i = 1] = E\left[\frac{(Y_{i1} - Y_{i0})}{P(D_i = 1)} \times \frac{D_i - P(D_i = 1 | X_i)}{1 - P(D_i = 1 | X_i)}\right]\]
- Sant’Anna and Zhao (2020) extend this to an AIPW/doubly-robust estimator with an outcome model for \(Y_{i1} - Y_{i0}\)
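One way to read the weights in the IPTW expression is to split the second factor by treatment status. Writing \(e(X_i) = P(D_i = 1 | X_i)\) (shorthand introduced here for illustration):

\[\frac{D_i - e(X_i)}{1 - e(X_i)} = D_i - (1 - D_i)\frac{e(X_i)}{1 - e(X_i)}\]

Substituting this into the expectation gives

\[E[Y_{i1}(1) - Y_{i1}(0)| D_i = 1] = E[Y_{i1} - Y_{i0} | D_i = 1] - \frac{1}{P(D_i = 1)} E\left[(1 - D_i)\frac{e(X_i)}{1 - e(X_i)}(Y_{i1} - Y_{i0})\right]\]

so each treated unit contributes with the constant weight \(1/P(D_i = 1)\), while control units enter with odds weights \(e(X_i)/(1 - e(X_i))\) that up-weight controls whose covariates resemble the treated group.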

Example: Card and Krueger (1994, AER)

Suppose we thought that parallel trends in Card and Krueger (1994) only held conditional on the type of fast food restaurant

# 1=Burger King; 2=KFC; 3=Roy Rogers; 4=Wendy's
minwage %>% group_by(STATE) %>% summarize(mean(CHAIN == 1), mean(CHAIN==2), mean(CHAIN == 3), mean(CHAIN ==4))

# A tibble: 2 × 5
  STATE `mean(CHAIN == 1)` `mean(CHAIN == 2)` `mean(CHAIN == 3)`
  <dbl>              <dbl>              <dbl>              <dbl>
1     0              0.463              0.149              0.224
2     1              0.405              0.223              0.251
# ℹ 1 more variable: `mean(CHAIN == 4)` <dbl>

# Fit a propensity score model
weight_model <- glm(STATE ~ as.factor(CHAIN), data=minwage, family=binomial(link="logit"))

# Predict weights
minwage$e <- predict(weight_model, type="response")
minwage$did_wt <- (1/mean(minwage$STATE)) * ((minwage$STATE -minwage$e)/(1-minwage$e))

Example: Card and Krueger (1994, AER)

# Point estimate
mean(minwage$CHG_EMPFT*minwage$did_wt)

[1] 2.46

# Slight fix of the weights to make this work in OLS
minwage$did_wt_reg <- minwage$did_wt*minwage$STATE - minwage$did_wt*(1-minwage$STATE)
lm_robust(CHG_EMPFT ~ STATE, data=minwage, weights=did_wt_reg)

            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    -2.02       1.49   -1.36    0.174   -4.946    0.898 356
STATE           2.46       1.58    1.56    0.120   -0.647    5.564 356

Example: Card and Krueger (1994, AER)

# Bootstrap
set.seed(60637)
niter <- 1000
boot_est <- rep(NA, niter)
for(i in 1:niter){
  boot_minwage <- minwage[sample(1:nrow(minwage), nrow(minwage), replace=TRUE),]
  # Fit a propensity score model
  weight_model_boot <- glm(STATE ~ as.factor(CHAIN), data=boot_minwage, family=binomial(link="logit"))

  # Predict weights
  boot_minwage$e <- predict(weight_model_boot, type="response")
  boot_minwage$did_wt <- (1/mean(boot_minwage$STATE)) * ((boot_minwage$STATE - boot_minwage$e)/(1-boot_minwage$e))

  # Point est
  boot_est[i] <- mean(boot_minwage$CHG_EMPFT*boot_minwage$did_wt)
}
#Bootstrap 95% CI
quantile(boot_est, c(.025, .975))

  2.5%  97.5% 
-0.707  5.464

Example: Card and Krueger (1994, AER)

Sant’Anna and Zhao (2020) doubly-robust estimator

# This package requires using the *long* data
library(DRDID) 
# Notably this doesn't add much with the fully saturated propensity score model - we're really just stratifying by chain
dr_minwage <- drdid(yname="EMPFT",
                    tname="time",
                    idname="unit",
                    dname="STATE",
                    xformla = ~as.factor(CHAIN),
                    data=minwage_long,
                    estMethod = "trad")
summary(dr_minwage)

 Call:
drdid(yname = "EMPFT", tname = "time", idname = "unit", dname = "STATE", 
    xformla = ~as.factor(CHAIN), data = minwage_long, estMethod = "trad")
------------------------------------------------------------------
 Locally efficient DR DID estimator for the ATT:
 
   ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
  2.4585     1.5383     1.5981      0.11     -0.5567     5.4736  
------------------------------------------------------------------
 Estimator based on panel data.
 Outcome regression est. method: OLS.
 Propensity score est. method: maximum likelihood.
 Analytical standard error.
------------------------------------------------------------------
 See Sant'Anna and Zhao (2020) for details.

Difference-in-differences with many time periods

Classic 2x2 DiD

Difference-in-differences estimators can be understood in terms of their component 2 \(\times\) 2 comparisons
- …between two treatment timing groups \(g\) and \(g^\prime\)
- …and between two time periods \(t\) and \(t^\prime\)
  
  \[\underbrace{\bigg[\bar{Y}_{g,t} - \bar{Y}_{g^\prime, t} \bigg]}_{\text{cross-sectional difference at time } t} - \underbrace{\bigg[\bar{Y}_{g,t^\prime} - \bar{Y}_{g^\prime, t^\prime} \bigg]}_{\text{cross-sectional difference at time } t^\prime}\]
In the classic 2x2 case, we have…
- time period \(t\) as the period where treatment is assigned to some units and \(t^\prime\) as the “pre-treatment” period.
- timing group \(g\) as the group that receives treatment (\(D_i = 1\)) and timing group \(g^\prime\) as the units that are always under control (\(D_i = 0\))

Classic 2x2 DiD

DiD with 2 periods, 2 groups

Many time periods, two timing groups

Suppose that instead of having two time periods, we now have \(T\) time periods.
- We’ll now characterize our treatment groups based on when they start treatment.
- Let \(G_i\) denote the time period when unit \(i\) initiates treatment
We’ll stick with the two timing group case for now
- Treated units start treatment at some time \(g^*: 1 < g^* \le T\)
- Control units start treatment at some time after \(T\) (we’ll use \(\infty\))
Define potential outcomes in terms of assignment to a timing group
- \(Y_{it}(g) = Y_{it}\) for units with \(G_i = g\)
- We’ll denote \(Y_{it}(\infty)\) as the “control” potential outcome

Many time periods, two timing groups

Estimand: The group-time average treatment effect on the treated
- The average difference at time \(t\) between the outcome for units that started treatment at time \(g\) and what would have happened had those units never started treatment.
  
  \[\text{ATT}_{g}(t) = \mathbb{E}[Y_{it}(g) - Y_{it}(\infty) | G_i = g]\]
Other target estimands are averages of the group-time ATTs
- e.g. the average of all post-treatment group-time ATTs

DiD with no staggered adoption

DiD with 4 periods, 2 groups

DiD with no staggered adoption

DiD with 4 periods, 2 groups - relative treatment time

Identifying assumptions

No anticipation: Pre-treatment outcomes are unaffected by future treatment status.

\[Y_{it}(g) = Y_{it}(\infty) \ \forall\ t < g\]
(General) Parallel trends: For any two time periods \(t\) and \(t^\prime\) and two timing groups \(g\) and \(g^\prime\)

\[E[Y_{it}(\infty) - Y_{it'}(\infty) | G_i = g] = E[Y_{it}(\infty) - Y_{it'}(\infty) | G_i = g^\prime ]\]
Generalizes our prior assumptions to all time periods and across all treatment timing groups

Two-way fixed effects estimators

In the case with two treatment timing groups and many time periods, we have two natural quantities of interest
- The \(ATT_{g^*}(t)\) for each time period \(t \ge g^*\)
- The average of all post-treatment group-time ATTs \(ATT_{g^*} = \frac{1}{T - g^* + 1}\sum_{t = g^*}^T ATT_{g^*}(t)\)
There are two common ways to estimate these effects with two-way fixed effects (TWFE) regressions
The Static TWFE

\[Y_{it} = \alpha_i + \delta_{t} + \tau D_{it} + \epsilon_{it}\]

where \(D_{it}\) is an indicator for whether unit \(i\) is under treatment at time \(t\)
In the two timing-group case we’re looking at here, \(D_{it} = \mathbf{1}(G_i = g^*) \times \mathbf{1}(t \ge g^*)\)

Two-way fixed effects estimators

The Dynamic TWFE

\[Y_{it} = \alpha_i + \delta_{t} + \sum_{l \neq -1} \tau_l D^{(l)}_{it} + \epsilon_{it}\]

where \(D_{it}^{(l)}\) is a dummy indicator for observation \(i\) being \(l\) periods from treatment initiation at time \(t\)
In the two timing-group case, this regression looks like:

\[Y_{it} = \alpha_i + \delta_{t} + \sum_{\substack{l = -(g^* - 1) \\ l \neq -1}}^{T - g^*} \tau_l \times \mathbf{1}(G_i = g^*)\times\mathbf{1}(t = g^* + l) + \epsilon_{it}\]
Note that we need to omit at least one relative treatment time indicator!
- Otherwise we have perfect collinearity with the TWFE

Identification

Do these regressions identify the target parameters we’re looking for in the two timing group case?
Static TWFE: Yes
- \(\hat{\tau}\) is an average over the 2x2 diff-in-diffs between each post-treatment and each pre-treatment period.
- Equivalent to collapsing the data into a 2x2 DiD by averaging over all pre- and post-treatment outcomes for treated/control
- Identifies the average of post-treatment ATTs \(ATT_{g^*}\)
Dynamic TWFE: Yes!
- \(\hat{\tau}_l\) is the 2x2 DiD between relative treatment time \(l\) and the held-out “baseline” period (period \(-1\))
- Each coefficient identifies \(ATT_{g^*}(g^* + l)\)
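Both equivalences can be checked on a simulated balanced panel with two timing groups. Everything in this sketch (panel size, adoption time `g_star`, effect sizes, variable names) is an illustrative assumption; it uses base R `lm` with explicit unit and time dummies rather than a dedicated fixed-effects package.

```r
# Simulated balanced panel: 100 units, 6 periods, treated group adopts at period 4
set.seed(813)
n <- 100; n_periods <- 6; g_star <- 4
panel <- expand.grid(unit = 1:n, time = 1:n_periods)
panel$treated <- as.integer(panel$unit <= 40)
panel$rel  <- panel$time - g_star                      # relative treatment time
panel$D_it <- panel$treated * (panel$time >= g_star)   # static treatment indicator
att <- c(0.5, 1, 1.5)                                  # effects at l = 0, 1, 2
unit_fx <- rnorm(n)
effect <- ifelse(panel$D_it == 1, att[pmax(panel$rel, 0) + 1], 0)
panel$Y <- unit_fx[panel$unit] + 0.3 * panel$time + effect +
  rnorm(nrow(panel), sd = 0.5)

# Static TWFE vs. the collapsed 2x2 (all post vs. all pre periods)
static <- lm(Y ~ D_it + factor(unit) + factor(time), data = panel)
m <- function(tr, post) {
  mean(panel$Y[panel$treated == tr & (panel$time >= g_star) == post])
}
collapsed <- (m(1, TRUE) - m(1, FALSE)) - (m(0, TRUE) - m(0, FALSE))

# Dynamic TWFE: relative-time dummies for treated units, baseline l = -1
panel$rel_f <- relevel(factor(ifelse(panel$treated == 1, panel$rel, -1)), ref = "-1")
dynamic <- lm(Y ~ rel_f + factor(unit) + factor(time), data = panel)

# 2x2 DiD between period g_star + 1 (l = 1) and the baseline period g_star - 1
cell <- function(tr, t) mean(panel$Y[panel$treated == tr & panel$time == t])
did_l1 <- (cell(1, 5) - cell(0, 5)) - (cell(1, 3) - cell(0, 3))

c(static = unname(coef(static)["D_it"]), collapsed = collapsed,
  dynamic_l1 = unname(coef(dynamic)["rel_f1"]), did_l1 = did_l1)
```

Control units are assigned to the reference level of `rel_f` so that they contribute only through the unit and time dummies.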

TWFE as an average over 2x2s

Static TWFE 2x2s

TWFE as an average over 2x2s

Dynamic TWFE - relative time 1

Example: Ferwerda (2021)

Ferwerda, Jeremy. “Immigration, voting rights, and redistribution: Evidence from local governments in Europe.” The Journal of Politics 83.1 (2021): 321-339.

Does expanding voting rights in municipal elections to non-citizen foreign residents increase social expenditures?
- Setting: Swiss municipalities in the 2000s
In Switzerland, voting rights vary by canton
- Vaud, Fribourg and Geneva implemented foreign voting rights in 2003, 2004 and 2005 respectively.
We’re going to focus on Vaud (earliest adopter) as the treated group and Geneva (latest adopter) as the control group
- Actual paper uses a different analysis than DiD/TWFE and has covariates, but we’ll do a simple DiD to illustrate

Example: Ferwerda (2021)

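# Packages assumed by this code: tidyverse (dplyr verbs), haven (read_dta),
# and fixest (feols, etable, sunab)
library(tidyverse)
library(haven)
library(fixest)

# Pre-processing
voting <- read_dta("assets/data/Swiss_master.dta", encoding='latin1')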
voting_complete <- voting %>% filter(year >= 2000, year <= 2005) %>% filter(!is.na(log_net_welfare_head))
# Keep only units with complete panels (all 6 years)
voting_complete <- voting_complete %>% group_by(bfsnr) %>% filter(n() == 6) %>% ungroup()
voting_complete <-suppressWarnings(voting_complete %>% 
                          mutate(first_year = min(year[vf == 1]), .by=bfsnr))

Example: Ferwerda (2021)

Unique Treatment Histories in Ferwerda (2021) - Dropping all units with years missing from 2000-2005

Example: Ferwerda (2021)

Let’s fit the simple static TWFE on the Vaud and Geneva cases

## Filter down to the two treatment timing groups
voting_vaud_geneva <- voting_complete %>% filter(cantonid %in% c(22, 25))
## Estimate with fixest::feols
static_twfe <- feols(log_net_welfare_head ~ vf | bfsnr + year, data=voting_vaud_geneva, cluster="bfsnr")
etable(static_twfe)

                         static_twfe
Dependent Var.: log_net_welfare_head
                                    
vf                  -0.0927 (0.0599)
Fixed-Effects:  --------------------
bfsnr                            Yes
year                             Yes
_______________ ____________________
S.E.: Clustered            by: bfsnr
Observations                   2,028
R2                           0.95818
Within R2                    0.00791
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Verify this is equivalent to the simple 2x2 DiD

mean(voting_vaud_geneva$log_net_welfare_head[voting_vaud_geneva$year>2003&voting_vaud_geneva$cantonid == 22]) -
mean(voting_vaud_geneva$log_net_welfare_head[voting_vaud_geneva$year> 2003&voting_vaud_geneva$cantonid == 25]) -
mean(voting_vaud_geneva$log_net_welfare_head[voting_vaud_geneva$year<=2003&voting_vaud_geneva$cantonid == 22]) +
mean(voting_vaud_geneva$log_net_welfare_head[voting_vaud_geneva$year<= 2003&voting_vaud_geneva$cantonid == 25])

[1] -0.0927

Example: Ferwerda (2021)

How about the dynamic TWFE regression?
- In the non-staggered case, you can just interact an indicator for the treated unit with the time FEs.
- We’ll just use the sunab() syntax here to match next week’s lecture

dynamic_twfe <- feols(log_net_welfare_head ~ sunab(first_year, year, ref.p = -1) | bfsnr + year,
                         data = voting_vaud_geneva, cluster = "bfsnr")
etable(dynamic_twfe)

                        dynamic_twfe
Dependent Var.: log_net_welfare_head
                                    
year = -4         0.2409*** (0.0563)
year = -3          0.1709** (0.0569)
year = -2            0.0635 (0.0441)
year = 0             0.0378 (0.0337)
year = 1             0.0144 (0.0566)
Fixed-Effects:  --------------------
bfsnr                            Yes
year                             Yes
_______________ ____________________
S.E.: Clustered            by: bfsnr
Observations                   2,028
R2                           0.95920
Within R2                    0.03213
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

“Pre-trends” tests

We might be concerned that our parallel trends assumption is violated.
- We can’t test the parallel trends assumption we care about (between pre-treatment \(t\) and a post-treatment \(t^\prime\))
But if we think parallel trends holds generally across all time periods, there are observable implications
- Consider \(t\) and \(t^\prime\) that are both pre-treatment
What is the observed difference-in-differences between those two periods and our two treatment timing groups going to be equal to?

\[E[Y_{it} - Y_{it^\prime} | G_i = g^*] - E[Y_{it} - Y_{it^\prime} | G_i = \infty]\]

“Pre-trends” tests

By consistency

\[E[Y_{it}(g^*) - Y_{it^\prime}(g^*) | G_i = g^*] - E[Y_{it}(\infty) - Y_{it^\prime}(\infty) | G_i = \infty]\]
By no anticipation

\[E[Y_{it}(\infty) - Y_{it^\prime}(\infty) | G_i = g^*] - E[Y_{it}(\infty) - Y_{it^\prime}(\infty) | G_i = \infty]\]
And by the definition of parallel trends, we know this equals zero

“Pre-trends” tests

Placebo/Pre-trends test
- It is common to plot the combined treatment effect estimates and pre-treatment placebos in a single event study plot
- All point estimates denote differences-in-differences relative to the held-out baseline (typically the period just before treatment)
- Under no anticipation + parallel trends between all periods, the placebo pre-treatment DiD should be statistically indistinguishable from zero.
But be careful
- “Good” pre-trends is no guarantee that parallel trends holds!
- In low powered studies, we may fail to reject the null even when there’s a sizeable parallel trends violation.
- Conversely, sometimes we might find significant pre-trends but the sizes suggest the actual violation is negligible.
- We’ll discuss methods for partial identification and sensitivity analysis next week.

“Event study” plots

What we want!

“Event study” plots

What is likely very concerning!

Parametric time trends

Suppose we believe that parallel trends is violated but we’re willing to assume that the violation takes on a particular parametric form
Parallel trends-in-trends (Mora and Reggio, 2019; Egami and Yamauchi, 2023)

\[\underbrace{\mathbb{E}[Y_{it}(\infty) - Y_{it^\prime}(\infty) | G_i = g]}_{\text{Counterfactual trend in the treated group}} - \underbrace{\mathbb{E}[Y_{it}(\infty) - Y_{it^\prime}(\infty) | G_i = g^\prime]}_{\text{Counterfactual trend in the control group}} = \Delta_g(t - t^\prime)\]
There is a violation of parallel trends but it is equal to a constant (\(\Delta_g\)) scaled by the time gap \(t - t^\prime\)

Parametric time trends

It’s straightforward to see that the difference-in-differences will not identify the ATT
- For a post-treatment \(t\) and pre-treatment \(t^\prime\)
  
  \[\mathbb{E}[\hat{\tau}_{\text{DD}}] = \underbrace{ATT_{g^*}(t)}_{\text{Treatment effect}} + \underbrace{\Delta_g(t - t^\prime)}_{\text{divergence in parallel trends}}\]
How do we solve this? Add another difference
- \(\Delta_g\) can be estimated from a difference-in-differences comparison entirely in the pre-treatment period
- Rescale this (by the time gap) and subtract from the main DiD.
Triple-differences (in time)
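A small worked example with noise-free group means makes the arithmetic concrete. The numbers (a divergence of 0.2 per period and an effect of 1) and the helper `did()` are illustrative assumptions: the ordinary DiD is off by the divergence times the time gap, and subtracting the rescaled pre-period DiD removes it.

```r
# Noise-free group means over 5 periods; treatment starts at period 4
times  <- 1:5
g_star <- 4
delta  <- 0.2    # per-period divergence in untreated trends (assumed)
att    <- 1      # constant treatment effect (assumed)

control <- 1 + 0.5 * times
treated <- 3 + 0.5 * times + delta * times + att * (times >= g_star)

# DiD between periods t and s (hypothetical helper)
did <- function(t, s) (treated[t] - treated[s]) - (control[t] - control[s])

dd_main <- did(4, 3)   # post vs. pre: effect plus divergence over a gap of 1
dd_pre  <- did(3, 2)   # both periods pre-treatment: divergence only

# Triple difference: rescale the pre-period DiD by the ratio of time gaps
ddd_4 <- did(4, 3) - ((4 - 3) / (3 - 2)) * dd_pre
ddd_5 <- did(5, 3) - ((5 - 3) / (3 - 2)) * dd_pre

c(dd_main = dd_main, dd_pre = dd_pre, ddd_4 = ddd_4, ddd_5 = ddd_5)
```

Here `dd_main` is 1.2 (the effect of 1 plus 0.2 of divergence), while both triple differences return the effect of 1.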

Group-specific linear trends

This approach is connected to the TWFE regression specification with group-specific linear trends

\[Y_{it} = \alpha_i + \delta_{t} + \beta \times \mathbf{1}(G_i = g^*) \times t + \sum_{l = 0}^{T - g^*} \tau_l D_{it}^{(l)} + \epsilon_{it}\]
With linear time trends, \(\tau_l\) correspond to averages over triple-differences-in-time
But be careful - you need to at a minimum include all post-treatment indicators even with only two timing groups.
- Otherwise we get “invalid” triple-differences (using two post-treatment periods to estimate the time trend)
- Similar to the “forbidden comparisons” problem in staggered adoption which we’ll talk about next week.
You can also still estimate pre-trends, but you now have to hold out at least two pre-treatment periods (omitting their indicators) so that the linear trend can be estimated!
- At least three for a quadratic (quadruple differences), four for a cubic, etc…
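A sketch of the group-specific linear trend specification on simulated data, using base R `lm`. The panel dimensions, the divergence `delta`, the effect `att`, and the variable names are all illustrative assumptions. Every post-treatment period gets its own indicator, and all pre-treatment periods are held out so they can pin down the trend.

```r
# Simulated panel where untreated trends diverge linearly between groups
set.seed(813)
n <- 100; n_periods <- 6; g_star <- 4
panel <- expand.grid(unit = 1:n, time = 1:n_periods)
panel$treated <- as.integer(panel$unit <= 40)
panel$rel <- panel$time - g_star
att <- 1; delta <- 0.3
unit_fx <- rnorm(n)
panel$Y <- unit_fx[panel$unit] + 0.5 * panel$time +
  delta * panel$treated * panel$time +          # parallel trends violation
  att * panel$treated * (panel$rel >= 0) +      # treatment effect
  rnorm(nrow(panel), sd = 0.05)

# One indicator per post-treatment period (treated units only); the rest is "pre"
is_post <- panel$treated == 1 & panel$rel >= 0
panel$post_f <- factor(ifelse(is_post, as.character(panel$rel), "pre"),
                       levels = c("pre", "0", "1", "2"))

# Without the trend, the post-period coefficients absorb the divergence
no_trend <- lm(Y ~ post_f + factor(unit) + factor(time), data = panel)

# With a treated-group linear trend estimated from the pre-treatment periods
with_trend <- lm(Y ~ post_f + treated:time + factor(unit) + factor(time),
                 data = panel)

round(rbind(no_trend   = coef(no_trend)[c("post_f0", "post_f1", "post_f2")],
            with_trend = coef(with_trend)[c("post_f0", "post_f1", "post_f2")]), 2)
```

The trend-adjusted estimates are only as good as the linear extrapolation; in this simulation the violation is linear by construction.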

Example: Ferwerda (2021)

Let’s estimate the linear time trend specification in feols

lin_trend <- feols(log_net_welfare_head ~ sunab(first_year, year, ref.p = c(-1:-4)) +
                     as.factor(cantonid)*year| bfsnr + year,
                         data = voting_vaud_geneva, cluster = "bfsnr")
etable(lin_trend)

                                        lin_trend
Dependent Var.:              log_net_welfare_head
                                                 
year = 0                       0.1266*** (0.0373)
year = 1                        0.1862** (0.0581)
as.factor(cantonid)25 x year   0.0830*** (0.0178)
Fixed-Effects:               --------------------
bfsnr                                         Yes
year                                          Yes
____________________________ ____________________
S.E.: Clustered                         by: bfsnr
Observations                                2,028
R2                                        0.95919
Within R2                                 0.03189
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Ferwerda (2021)

Do we get different results if we specify different parametric forms?

quad_trend <- feols(log_net_welfare_head ~ sunab(first_year, year, ref.p = c(-1:-4)) + 
                      as.factor(cantonid)*year + as.factor(cantonid)*I(year^2) | bfsnr + year,
                         data = voting_vaud_geneva, cluster = "bfsnr")
etable(quad_trend)

                                            quad_trend
Dependent Var.:                   log_net_welfare_head
                                                      
year = 0                               0.1183 (0.0798)
year = 1                               0.1679 (0.1528)
as.factor(cantonid)25 x year             6.725 (61.70)
as.factor(cantonid)25 x I(year^2)     -0.0017 (0.0154)
Fixed-Effects:                    --------------------
bfsnr                                              Yes
year                                               Yes
_________________________________ ____________________
S.E.: Clustered                              by: bfsnr
Observations                                     2,028
R2                                             0.95919
Within R2                                      0.03190
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Ferwerda (2021)

Let’s hold out some periods to use as placebos

Example: Ferwerda (2021)

What if we fit a quadratic time trend instead?

Parametric time trends

Be careful about simply throwing a linear or quadratic time trend into your TWFE regressions
- You’re still making strong assumptions about how to extrapolate the pre-treatment trend into the post-treatment period
- Different functional form choices can lead to different results (esp. with higher-order polynomials)
If you are using a linear time trend, make sure you’re using a dynamic TWFE estimator and include indicators for each post-treatment period
- Otherwise, you’re using comparisons between post-treatment periods to learn about the time trend - implicit constant effects assumption
- Treatment effect heterogeneity and a parametric time trend are observationally equivalent post-treatment.
Ideally, you’ll have some theory to motivate the form of the parallel trends violation
- We’ll talk about another triple-differences strategy that uses an unexposed cross-sectional unit next week.

Conclusion

Most differences-in-differences designs have many pre- and post-treatment periods
- Can estimate effects for many post-treatment periods - plot trajectory of the effect over time
- Can estimate effects for pre-treatment periods - placebo/pre-trends tests
When we have no staggering in treatment adoption (two timing groups), TWFE estimators are equivalent to averages over valid 2x2 differences-in-differences
- Static - Single coefficient on treatment
- Dynamic - Coefficients for every relative-treatment-time
- Don’t forget to be clear about what your baseline (held-out) period is when using the dynamic specification!

Next week

What happens when we have staggered adoption of treatment?
- TWFE estimators are biased unless we make strong constant effects assumptions - “forbidden comparisons”/“negative weighting problem”
- Static and Dynamic TWFE require different constant effects assumptions!
Alternatives: “New DiD”
- First-differences approaches: construct the DiD for each group-time ATT and aggregate (Callaway/Sant’Anna estimator)
- Regression imputation: Fit TWFE to controls and impute on the treateds (Borusyak/Jaravel/Spiess estimator)
- Properly saturate the TWFE (Sun and Abraham/Wooldridge estimators)
- Many equivalencies between these!
Partial identification by bounding the parallel trends violations - Rambachan/Roth approach