Week 10: Modern Differences-in-Differences

PS 813 - Causal Inference

Anton Strezhnev

strezhnev@wisc.edu

University of Wisconsin-Madison

March 23, 2026

Last week

Differences-in-differences with two groups
- 2x2 comparison: treated and control groups at post-treatment time \(t\) vs. treated and control at some pre-treatment \(t^\prime\)
Two approaches to estimation
- First-differences regression
- Two-way fixed effects regression (static and “dynamic”)
Pre-trends placebos
- Differences-in-differences in two pre-treatment periods.
- Not proof that parallel trends hold but a useful diagnostic
- Be careful of power here - underpowered tests may fail to reject even under sizeable violations!

Review: Event study plots

This week

Differences-in-differences under varying treatment adoption
- Staggered adoption - Treatment is initiated by units at different time periods
- Treatment reversal - Units that adopt treatment can later switch to control
Problems
- Conventional estimation strategies (esp. TWFE) require additional assumptions - constant treatment effects
- Not plausible in many settings (e.g. persistent effects over time)
Alternative estimators
- Construct an estimator using only the “valid” 2x2 DiDs (no additional effect homogeneity assumptions)
- Re-fit the TWFE regression or construct the correct “first-differences” regression.

Difference-in-differences with staggered adoption

Running Example: Paglayan (2019, AJPS)

Paglayan, Agustina S. “Public‐sector unions and the size of government.” American Journal of Political Science 63, no. 1 (2019): 21-36.

Paglayan (2019) examines whether the implementation of mandatory collective bargaining in some states affects state expenditures on education
- Cross-sectionally, states with collective bargaining laws spend more on education
- But is this just selection?
Design: Look at the roll-out of collective bargaining laws over time in U.S. states
- Eliminate the baseline differences in spending across states using a DiD approach.

library(panelView)
library(haven)   # read_dta()
library(dplyr)   # %>% and filter()
union <- read_dta("assets/data/Paglayan Dataset.dta")
union <- union %>%
  filter(!is.na(studteachratio), State != "DC", State != "WI", year > 1959, year < 1997)
table(union$YearCBrequired)


1965 1966 1968 1969 1970 1971 1972 1973 1974 1975 1976 1978 1984 1987 
  74   74   74  148  222  111   37   37   74   74  111   37   74   37

Staggered adoption

Collective bargaining laws were rolled out in a staggered fashion

DiD with staggered adoption

We’ll extend our previous set-up with \(T\) time periods to also allow for varying treatment times
- Let \(G_i\) denote the time period when unit \(i\) initiates treatment
\(G_i \in \mathcal{G} = \{2, 3, \dotsc, T, \infty\}\), where \(\mathcal{G}\) denotes the set of possible treatment timing groups
- We’ll use \(G_i = \infty\) to denote the never-treated units
- \(N_{g}\) units in each timing group \(g\)
Define potential outcomes in terms of assignment to a timing group
- \(Y_{it}(g) = Y_{it}\) for units with \(G_i = g\)
- We’ll denote \(Y_{it}(\infty)\) as the “control” potential outcome

DiD with staggered adoption

DiD with 4 periods, 3 timing groups

DiD with staggered adoption

Building block: “Group-Time” ATT (Callaway and Sant’anna)

\[\text{ATT}_{g}(t) = E[Y_{it}(g) - Y_{it}(\infty) | G_i = g]\]
Compares group \(g\)’s outcome at time \(t\) to what would have happened to group \(g\) at time \(t\) had it never received treatment
Assumption: No reverse causality/anticipation

\[Y_{it}(g) = Y_{it}(\infty) \forall t < g\]
The potential outcomes among the “not-yet-treated” at time \(t\) are the same as the “never-treated” at time \(t\)
Assumption: “General” parallel trends. For all \(t \neq t^{\prime}\) and \(g \neq g^{\prime}\)

\[E[Y_{it}(\infty) - Y_{it^{\prime}}(\infty)| G_i = g] - E[Y_{it}(\infty) - Y_{it^{\prime}}(\infty)| G_i = g^{\prime}] = 0\]
Can weaken this
- Only assume parallel trends with respect to the ever-treated units (\(g \neq \infty\))

DiD with staggered adoption

Let \(\bar{Y}_{g, t}\) denote the average outcome at time \(t\) for timing group \(g\)

\[\bar{Y}_{g, t} = \frac{1}{N_g} \sum_{i: G_i = g} Y_{it}\]
Any group-time ATT can be consistently estimated by the 2x2 difference-in-difference with any \(g^\prime > t\) and \(t^\prime < g\)

\[\widehat{ATT}_g(t) = \bar{Y}_{g, t} - \bar{Y}_{g^\prime, t} - \bar{Y}_{g, t^\prime} + \bar{Y}_{g^\prime, t^\prime}\]
We can average over the DiDs with all not-yet-treated units at time \(t\)

\[\widehat{ATT}_g(t) = \bar{Y}_{g, t} - \sum_{g^\prime > t} \frac{N_{g^\prime}}{N^{(t)}}\bar{Y}_{g^\prime, t} - \bar{Y}_{g, t^\prime} + \sum_{g^\prime > t} \frac{N_{g^\prime}}{N^{(t)}}\bar{Y}_{g^\prime, t^\prime}\]

where \(N^{(t)} = \sum_{g^\prime > t} N_{g^\prime}\)
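
A minimal base R sketch of the averaged 2x2 estimator above, run on a simulated noise-free panel. The data frame `panel`, its column names, and the helper `att_gt_2x2()` are hypothetical, and the simulated effect \(t - g + 1\) is an assumption chosen so that the true group-time ATTs are known. Pooling all not-yet-treated units into one control mean reproduces the \(N_{g^\prime}/N^{(t)}\) weights.

```r
# Simulated panel: T = 4 periods, timing groups 2, 3 and Inf (never treated).
# All names (panel, att_gt_2x2) are illustrative, not from any package.
panel <- expand.grid(id = 1:6, t = 1:4)
panel$g <- c(2, 2, 3, 3, Inf, Inf)[panel$id]
# Y(infinity) = unit effect + common time trend, so parallel trends holds exactly
panel$y0 <- panel$id + 0.5 * panel$t
# Assumed treatment effect: grows by 1 for each period under treatment
panel$y <- panel$y0 + ifelse(panel$t >= panel$g, panel$t - panel$g + 1, 0)

att_gt_2x2 <- function(df, g, t, t_base = g - 1) {
  stopifnot(t >= g, t_base < g)     # post-treatment period, pre-treatment baseline
  treated <- df$g == g
  control <- df$g > t               # pooled not-yet-treated units at time t
  d_treated <- mean(df$y[treated & df$t == t]) - mean(df$y[treated & df$t == t_base])
  d_control <- mean(df$y[control & df$t == t]) - mean(df$y[control & df$t == t_base])
  d_treated - d_control
}

att_22 <- att_gt_2x2(panel, g = 2, t = 2)
att_23 <- att_gt_2x2(panel, g = 2, t = 3)
att_34 <- att_gt_2x2(panel, g = 3, t = 4)
c(att_22, att_23, att_34)
```

Because the simulation has no noise, each estimate should match the assumed effect \(t - g + 1\) for its group and period.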

Visualizing group effects

DiD for group 3, time 3

Visualizing group effects

DiD for group 3, time 4

Visualizing group effects

DiD for group 4, time 4

Visualizing group effects

Aggregating group-time ATTs

It is unlikely that we can estimate each particular group-time ATT with much precision
- And we’re probably more interested in some overall effect of treatment or an average of group-time ATTs.
How should we aggregate?
- Defining a single “post-treatment” average is tricky - multiple ways to aggregate
Averaging uniformly within unit first - non-uniform weights on time.

\[ATT = \frac{1}{N}\sum_{g = 2}^T \frac{N_{g}}{T-g+1}\sum_{t = g}^T ATT_{g}(t)\]
Averaging uniformly within time first - non-uniform weights on units

\[ATT = \frac{1}{T-1}\sum_{t=2}^{T} \sum_{g=2}^{t} \frac{N_g}{\sum_{g^\prime=2}^{t} N_{g^\prime}} ATT_{g}(t)\]
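
To see that the two aggregation schemes can give different answers, here is a small base R sketch. The group sizes `N_g` and the effect function `att()` are made-up inputs for illustration; the first scheme averages each group's ATTs over its own post-treatment periods, which assumes \(N\) counts only ever-treated units.

```r
# Hypothetical inputs: T = 4, timing groups 2, 3, 4 with assumed sizes N_g
# and assumed group-time ATTs of the form ATT_g(t) = t - g + 1.
T_max <- 4
groups <- 2:4
N_g <- c(`2` = 30, `3` = 20, `4` = 10)
att <- function(g, t) t - g + 1

# (1) Average within each group over its post-treatment periods,
#     then average across groups weighting by group size
att_unit_first <- sum(sapply(groups, function(g) {
  N_g[[as.character(g)]] * mean(att(g, g:T_max))
})) / sum(N_g)

# (2) Average across the treated groups within each period (by group size),
#     then average uniformly over periods 2..T
att_time_first <- mean(sapply(2:T_max, function(t) {
  treated <- groups[groups <= t]
  w <- N_g[as.character(treated)]
  sum(w * att(treated, t)) / sum(w)
}))

c(unit_first = att_unit_first, time_first = att_time_first)
```

With these inputs the two summaries differ (5/3 versus 74/45), even though both are averages of the same group-time ATTs.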

Aggregating group-time ATTs

“Relative-treatment-time” effects - average of all group-time ATTs where \(t = g + q\)

\[RTT(q) = \sum_{g=\max(2,\, 1-q)}^{T - \max(0,\, q)} \frac{N_g}{\sum_{g^\prime=\max(2,\, 1-q)}^{T-\max(0,\, q)} N_{g^\prime}} ATT_g(g + q) \quad \text{for } q = -(T-1), \dotsc, T-2\]
In staggered adoption, the relative-treatment time estimates for the earlier pre- and later post- periods will involve different sets of units!
- Composition changes might mask parallel trends violations!
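
The composition issue can be made concrete with a short base R sketch. The group sizes and the effect function are assumptions for illustration, and `rtt()` and `contributing()` are hypothetical helpers that mirror the summation limits in the formula above.

```r
# Hypothetical inputs: T = 4, timing groups 2, 3, 4
T_max <- 4
groups <- 2:4
N_g <- c(`2` = 30, `3` = 20, `4` = 10)
# Assumed effects: zero before treatment, t - g + 1 afterwards
att <- function(g, t) ifelse(t >= g, t - g + 1, 0)

# Timing groups whose period g + q falls inside the panel 1..T
contributing <- function(q) groups[groups + q >= 1 & groups + q <= T_max]

rtt <- function(q) {
  ok <- contributing(q)
  w <- N_g[as.character(ok)]
  sum(w * att(ok, ok + q)) / sum(w)
}

contributing(0)    # all three groups
contributing(2)    # only the earliest adopter
contributing(-3)   # only the latest adopter
sapply(c(-3, 0, 2), rtt)
```

The far-right post-treatment estimate comes only from the earliest adopters and the far-left placebo only from the latest adopters, so differences across relative times can reflect who is in the average rather than dynamics.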

Two-way FE w/ staggered adoption

Does the static specification identify an average of group-time ATTs?

\[Y_{it} = \alpha_i + \delta_{t} + \tau D_{it} + \epsilon_{it}\]
No! (Goodman-Bacon, 2021)
- \(\hat{\tau}\) is a weighted average over 2x2 differences-in-differences.
- Only some of those are valid under our identification assumptions
- Others require an additional constant effects assumption in order to identify an ATT.
Intuition:
- The TWFE estimator under staggered adoption incorporates 2x2 DiD terms where the “baseline” period is in the future (where both units are under treatment)
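
A tiny noise-free example in base R illustrates the problem. The three-unit panel and the growing effect \(t - g + 1\) are assumptions made for illustration; parallel trends and no anticipation hold by construction because \(Y_{it}(\infty) = 0\) everywhere.

```r
# Three units (timing groups 2, 3, never treated), T = 3 periods, no noise
d <- expand.grid(id = 1:3, t = 1:3)
d$g <- c(2, 3, Inf)[d$id]
d$treat <- as.numeric(d$t >= d$g)
# Assumed effect ATT_g(t) = t - g + 1; untreated potential outcome is zero
d$tau <- ifelse(d$treat == 1, d$t - d$g + 1, 0)
d$y <- d$tau

# Static TWFE via unit and period dummies
twfe <- lm(y ~ treat + factor(id) + factor(t), data = d)
tau_twfe <- unname(coef(twfe)["treat"])

# Average ATT over all treated unit-periods
tau_true <- mean(d$tau[d$treat == 1])

c(twfe = tau_twfe, true = tau_true)
```

Here the static coefficient is 1 while the average effect among treated observations is 4/3: the growing effect of the early adopter is partly differenced away when it serves as a "control" for the later adopter.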

Two-way FE w/ staggered adoption

Good 2x2

Two-way FE w/ staggered adoption

Bad 2x2

Two-way FE w/ staggered adoption

Consider the “forbidden comparison” with
- \(t^\prime > t\) (baseline period is in the “future”)
- \(g \le t < t^\prime\) (timing group of interest is already treated in both periods)
- \(t < g^\prime \le t^\prime\) (comparison group is untreated at \(t\) but treated by the baseline period)
The 2x2 difference-in-difference is:

\[\bar{Y}_{g, t} - \bar{Y}_{g^\prime, t} - \bar{Y}_{g, t^\prime} + \bar{Y}_{g^\prime, t^\prime}\]
Under parallel trends, this identifies

\[ATT_{g}(t) - ATT_{g^\prime}(t) - ATT_{g}(t^\prime) + ATT_{g^\prime}(t^\prime)\]
Under no anticipation, only \(ATT_{g^\prime}(t)\) is zero.

Two-way FE w/ staggered adoption

We need to make an effect homogeneity assumption for this 2x2 to identify a single treatment effect.

\[ATT_{g}(t) - ATT_{g}(t^\prime) + ATT_{g^\prime}(t^\prime)\]
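One option is to assume homogeneity within a timing group over time
- \(ATT_{g}(t) = ATT_{g}(t^\prime)\), so the 2x2 identifies \(ATT_{g^\prime}(t^\prime)\)
A second option is to assume homogeneity across timing groups within a calendar period
- \(ATT_{g}(t^\prime) = ATT_{g^\prime}(t^\prime)\), so the 2x2 identifies \(ATT_{g}(t)\)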
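
A worked numerical case of the forbidden comparison, as a base R sketch. The effect pattern \(ATT_g(t) = t - g + 1\) and the choice \(g = 2\), \(g^\prime = 3\), \(t = 2\), \(t^\prime = 3\) are assumptions for illustration; neither homogeneity condition holds for this pattern.

```r
# Group means under the assumed effect ATT_g(t) = t - g + 1 with Y(infinity) = 0
ybar <- function(g, t) ifelse(t >= g, t - g + 1, 0)

g <- 2; g_prime <- 3   # early adopter and later adopter
t <- 2; t_prime <- 3   # the "baseline" t' lies in the future, when both groups are treated

forbidden <- ybar(g, t) - ybar(g_prime, t) - ybar(g, t_prime) + ybar(g_prime, t_prime)
forbidden   # ATT_2(2) - ATT_2(3) + ATT_3(3) = 1 - 2 + 1
```

Every group-time ATT in this example is positive, yet the forbidden 2x2 returns exactly zero.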

Two-way FE w/ staggered adoption

What about the dynamic specification?

\[Y_{it} = \alpha_i + \delta_{t} + \sum_{l \neq -1} \tau_l D^{(l)}_{it} + \epsilon_{it}\]
Also no! (Sun and Abraham, 2021)
- Each \(\tau_l\) incorporates ATTs from other relative times \(l^\prime \neq l\) (“contamination bias”)
- The dynamic specification is only valid if we believe each relative-time treatment effect is constant across timing groups.
Li and Strezhnev (2024) develop some more of the intuition
- Each \(\hat{\tau}_l\) incorporates 2x2 DiDs with units that are also already treated but at different times
- The bias due to the “contaminating” treatment effects gets differenced out by estimates of those relative time effects with different units

Dynamic TWFE

Dynamic specification - good 2x2s

Dynamic TWFE

Dynamic specification - contaminated 2x2s

Dynamic TWFE

Dynamic specification - (some) de-contamination

Example: Paglayan (2019, AJPS)

Let’s fit the dynamic specification on the Paglayan (2019) dataset.

# Keep all observations
union_rep2 <- union
union_rep2 <- union_rep2 %>% mutate(yearFromCB = year - YearCBrequired)

# Make the never-treateds "infinity" (ensure that their dummy will be dropped as well)
union_rep2$yearFromCB[is.na(union_rep2$YearCBrequired)] <- Inf

# Make the dummy variables using the factor syntax - make -1 the reference period
union_rep2$yearFromCBFactor <- relevel(as.factor(union_rep2$yearFromCB), ref="-1")

Example: Paglayan (2019, AJPS)

Sun and Abraham (2021) Estimator

One solution to the problem of contamination bias is to estimate the relative treatment time effects separately for each unique treatment timing group
- Control group is the never treated
Easy to implement in a single TWFE regression with interactions between cohort-indicators and relative-treatment time indicators.

\[Y_{it} = \alpha_i + \delta_{t} + \sum_{g= 2}^T\sum_{l \neq -1} \tau_l^{(g)} (D^{(l)}_{it} \times \mathbf{1}(G_i = g)) + \epsilon_{it}\]
Aggregate into average effects for each relative-treatment time by averaging over the cohorts with that particular relative treatment time.
Implemented in the fixest R package using the sunab() syntax

Example: Paglayan (2019, AJPS)

Implementing the Sun and Abraham (2021) estimator

library(fixest)  # feols() and sunab()

### Per-pupil expenditure
# Code never-treated (NA) as Inf so sunab treats them as the control cohort
union_rep2_sa <- union_rep2
union_rep2_sa$YearCBrequired[is.na(union_rep2_sa$YearCBrequired)] <- Inf
dyn_reg_rep2_sa <- feols(lnppexpend ~ sunab(YearCBrequired, year, ref.p=-1) | year + State,
                            data=union_rep2_sa, cluster="State")

Example: Paglayan (2019, AJPS)

Callaway and Sant’anna (2021) Estimator

Rather than correcting the TWFE regression, directly estimate each group-time ATT via a simple 2x2 DiD
- Compare each treated cohort \(g\) at time \(t\) to a not-yet-treated (or never-treated) control group
- Use the period just before treatment (\(g - 1\)) as the baseline
  
  \[\widehat{ATT}_g(t) = \bar{Y}_{g, t} - \bar{Y}_{g, g-1} - \left(\sum_{g^\prime > t} \frac{N_{g^\prime}}{N^{(t)}}\bar{Y}_{g^\prime, t} - \sum_{g^\prime > t} \frac{N_{g^\prime}}{N^{(t)}}\bar{Y}_{g^\prime, g-1}\right)\]
- where \(N^{(t)} = \sum_{g^\prime > t} N_{g^\prime}\) is the total number of not-yet-treated units at time \(t\)
Aggregate group-time ATTs into summary measures (overall ATT, calendar-time ATTs, relative-time ATTs, etc…)
- Can add IPW + outcome model to allow for conditional parallel trends to hold
- Easy to adjust for both time-varying and time-invariant confounders of parallel trends (Caetano and Callaway, 2026)
Implemented in the did R package

Example: Paglayan (2019, AJPS)

Implementing the CS estimator in the did R package

library(did)

dyn_reg_cs_1 <- att_gt(yname = "lnppexpend", tname="year", idname="Stateid",
                       gname="YearCBrequired", xformla = ~1,
                       base_period = "universal", control_group = "notyettreated",
                       cband = F, data=union_rep2)

dyn_reg_cs_1_agg <- aggte(dyn_reg_cs_1, type="dynamic")

Example: Paglayan (2019, AJPS)

What happens if we implement CS with only the never-treated as the controls?
- Equivalent to Sun and Abraham w/ reference period -1!

library(did)
# did expects never-treated units to be coded with gname = 0
union_rep2$YearCBrequired[is.na(union_rep2$YearCBrequired)] <- 0
dyn_reg_cs_1 <- att_gt(yname = "lnppexpend", tname="year", idname="Stateid",
                       gname="YearCBrequired", xformla = ~1,
                       base_period = "universal", control_group = "nevertreated",
                       cband = F,
                        allow_unbalanced_panel = T, data=union_rep2)

dyn_reg_cs_1_agg <- aggte(dyn_reg_cs_1, type="dynamic", na.rm = T)

Example: Paglayan (2019, AJPS)

Regression Imputation Estimator

Alternative approach: impute the missing counterfactual \(Y_{it}(\infty)\) for treated observations
- Proposed by Borusyak, Jaravel, and Spiess (2024); Gardner (2022); Liu, Wang, and Xu (2024)
Step 1: Fit unit and time fixed effects using only the untreated observations (control units + pre-treatment periods of treated units)

\[Y_{it} = \alpha_i + \delta_t + \epsilon_{it}\]
Step 2: For each treated observation (unit and time period), predict the counterfactual outcome

\[\widehat{Y_{it}}(\infty) = \hat{\alpha_i} + \hat{\delta_t}\]
Step 3: For each treated unit and time period, impute the treatment effect

\[\hat{\tau}_{it} = Y_{it} - \widehat{Y_{it}}(\infty)\]
Step 4: Aggregate the \(\hat{\tau}_{it}\) into relevant quantities (e.g. relative time effects, overall average, etc…)
Implemented in fect (among other packages)
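
The four steps can be sketched in base R with `lm()`. This is only an illustration on a simulated noise-free panel, not how `fect` is implemented; the data frame `panel`, its columns, and the effect \(t - g + 1\) are assumptions.

```r
# Simulated panel (all names hypothetical): timing groups 2, 3 and never treated
panel <- expand.grid(id = 1:6, t = 1:4)
panel$g <- c(2, 2, 3, 3, Inf, Inf)[panel$id]
panel$treat <- panel$t >= panel$g
panel$y0 <- panel$id + 0.5 * panel$t                  # unit effect + time effect
panel$y <- panel$y0 + ifelse(panel$treat, panel$t - panel$g + 1, 0)
panel$id_f <- factor(panel$id)
panel$t_f <- factor(panel$t)

# Step 1: fit unit and time fixed effects on untreated observations only
fit0 <- lm(y ~ id_f + t_f, data = panel[!panel$treat, ])

# Step 2: predict Y(infinity) for the treated observations
treated <- panel[panel$treat, ]
treated$y0_hat <- predict(fit0, newdata = treated)

# Step 3: impute observation-level treatment effects
treated$tau_hat <- treated$y - treated$y0_hat

# Step 4: aggregate, here by time since treatment onset
treated$rel_time <- treated$t - treated$g
rtt_hat <- tapply(treated$tau_hat, treated$rel_time, mean)
rtt_hat
```

Step 1 only works if every unit and every period appears at least once among the untreated observations, which is why each treated unit needs at least one pre-treatment period and each period needs at least one untreated unit.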

Regression Imputation Estimator

The regression imputation estimator for a given group-time ATT \(ATT_{g}(t)\) can be expressed as an average over 2x2 differences-in-differences
- Using all the pre-treatment periods as baselines (\(t^\prime < g\))
- And using all units that adopt treatment after \(g\) (\(g^\prime > g\))
This creates a kind of sequential structure to the imputations for each post-treatment period
- \(ATT_{g}(g)\) is a simple average over DiDs…
- …but \(ATT_{g}(g+1)\) incorporates the estimates from \(ATT_{g+1}(g+1)\)…
- …and so on…
Units already under treatment are used as “controls”, with their model-imputed counterfactual standing in for the unobserved \(Y_{it}(\infty)\) in that period.

Regression Imputation Estimator

Regression Imputation Group 3, Time 3

Regression Imputation Estimator

Regression Imputation Group 4, Time 4

Regression Imputation Estimator

Regression Imputation Group 3, Time 4

Example: Paglayan (2019, AJPS)

Implementing the regression imputation estimator using fect

library(fect)
union_fect <- union
union_fect$CBrequired_SY <- as.numeric(union_fect$CBrequired_SY)
fect_fit <- fect(lnppexpend ~ CBrequired_SY, data = union_fect,
                 index = c("State", "year"), method = "fe",
                 force = "two-way", se = TRUE)

Example: Paglayan (2019, AJPS)

New DiD is just old DiD

All of the “heterogeneity-robust” difference-in-differences methods work in basically the same way
- Estimate the group-time ATTs using only the 2x2 comparisons that are valid w/o additional homogeneity assumptions
There are basically two classes of estimators
- Either “fix the TWFE regression” (Sun and Abraham, regression imputation, ‘extended’ TWFE)
- Or construct the “first-differences” regression (Callaway/Sant’anna)
Implementations of these methods will differ on other options, but these are researcher choices
- Which units are used as “controls”
- Which periods act as the baseline
- How to estimate standard errors (some form of asymptotic cluster SEs or bootstrap)

Interpreting event study plots w/ “new” DiD estimators

One downside of abandoning the traditional “event study plot” is that there is no clear convention for how to construct the plot of the treatment effects and the pre-treatment placebos using the new DiD estimators.
- Common software implementations don’t generate figures that have the same interpretation as the dynamic TWFE (Roth, 2026)
Sun and Abraham (2021) is the most direct equivalent to the original event study plot
- All treatment effects and placebos are estimated from the same held out common baseline.
- But remember that the composition of each relative treatment time varies
  - Late adopters don’t contribute to many of the treatment effects, early adopters don’t contribute to many of the placebos.

Callaway/Sant’anna - Perils of the “varying” baseline

One option in the Callaway-Sant’anna package is to compute placebos using a varying baseline.
- This generates event study plots that are asymmetric
- Post-treatment estimates all use \(-1\) as the baseline time period
- But the pre-treatment estimates are always relative to adjacent periods
  - “Short” DiDs only, not “long”
Visually there is a kink even in the absence of any treatment effect!
- Always use the universal baseline option!

dyn_reg_cs_1 <- att_gt(yname = "lnppexpend", tname="year", idname="Stateid",
                       gname="YearCBrequired", xformla = ~1,
                       base_period = "varying", control_group = "notyettreated",
                       cband = F, data=union_rep2)

dyn_reg_cs_1_agg <- aggte(dyn_reg_cs_1, type="dynamic")
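
The kink produced by the varying baseline can be reproduced without any data. The base R sketch below assumes no treatment effect at all and a linear parallel-trends violation of `slope` per period; the vector `gap` (treated-minus-control difference in mean outcomes by event time) is a made-up input.

```r
# Assumed: zero treatment effect, differential linear trend of 0.5 per period
slope <- 0.5
e <- -5:3
gap <- slope * e          # treated-minus-control gap at each event time
names(gap) <- e

event_time <- -4:3

# Universal baseline: every estimate is a "long" difference relative to e = -1
universal <- gap[as.character(event_time)] - gap["-1"]

# Varying baseline: pre-treatment estimates are "short" differences relative
# to the adjacent period e - 1; post-treatment estimates still use e = -1
base <- ifelse(event_time < 0, event_time - 1, -1)
varying <- gap[as.character(event_time)] - gap[as.character(base)]

rbind(universal, varying)
```

Under the universal baseline the estimates fall on a straight line through zero at \(e = -1\), which makes the trend violation visible. Under the varying baseline the pre-treatment estimates are flat at `slope` and the post-treatment estimates climb, which looks like an effect starting at treatment onset.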

Example of varying baseline (Paglayan, 2019)

Regression Imputation

Regression imputation estimators exhibit an even more troubling problem
- How do you construct a placebo test if all of the control observations are being used to estimate the treatment effects?
Trilemma - You have to give up one of…
1. Not using the same observations twice (imputing for units in the imputation regression)
2. Using the same baselines as the treatment effect estimates
3. Imputing for all of the pre-treatment periods
What you give up depends on how you plan to use the pre-treatment placebo estimates.
- But…you should definitely avoid imputing from the same regression as you used for the post-treatment observations (the default in fect)

Example with the `fect` defaults

Regression Imputation

Li and Strezhnev (2026) show that the in-sample imputation approach suffers from two biases if you care about the magnitudes of the pre-treatment coefficients
- Attenuation bias - Some of the component DiD comparisons are zero by construction since they re-use the same unit or period twice
- Contamination bias - Under staggered adoption, placebo estimates for periods further away from treatment incorporate estimates for periods closer to treatment
Possible solution
- “Double-leave-one-out” approach - Estimate the placebo effects using separate models for each treatment timing group and time period.
- Leave out all units that adopt treatment prior to the cohort of interest and the time period being imputed-for.

In-sample imputation bias

We can write the regression imputation estimator for a pre-treatment period as

\[\begin{align*} \widehat{ATT}_{g}(t) &= \frac{1}{(g-1)N^{(t)}} \bigg[\sum_{g^\prime \ge g} \sum_{t^\prime = 1}^{g-1} N_{g^\prime}(\bar{Y}_{g,t} - \bar{Y}_{g,t^\prime} - \bar{Y}_{g^\prime, t} + \bar{Y}_{g^\prime, t^\prime}) + \\ &\hskip4em \sum_{g^\prime = t+1}^{g-1} \sum_{t^\prime = 1}^{g^\prime-1} N_{g^\prime}(\bar{Y}_{g,t} - \bar{Y}_{g,t^\prime} - \bar{Y}_{g^\prime, t} + \bar{Y}_{g^\prime, t^\prime}) + \\ &\hskip4em \sum_{g^\prime = t+1}^{g-1} \sum_{t^\prime = g^\prime}^{g-1} N_{g^\prime}(\bar{Y}_{g,t} - \bar{Y}_{g,t^\prime} - \bar{Y}_{g^\prime, t} + \hat{Y}_{g^\prime, t^\prime})\bigg] \end{align*}\]

where \(N^{(t)} = \sum_{g^\prime = t + 1}^G N_{g^\prime}\)

In-sample imputation bias

2x2 comparisons in in-sample TWFE imputation of placebo for group 3, time 1

In-sample imputation: Non-staggered case

Consider the simplest case: all treated units start treatment at time \(g^*\). The expression for the in-sample imputation estimator simplifies to:

\[\widehat{\text{ATT}}_{\text{is}}(t) = \frac{1}{(g^*-1)N}\sum_{t^\prime=1}^{g^*-1}\bigg\{ N_{g^*}(\bar{Y}_{g^*, t} - \bar{Y}_{g^*,t} - \bar{Y}_{g^*, t^\prime} + \bar{Y}_{g^*, t^\prime}) + N_{\infty} (\bar{Y}_{g^*, t} - \bar{Y}_{\infty,t} - \bar{Y}_{g^*, t^\prime} + \bar{Y}_{\infty, t^\prime})\bigg\}\]
It’s straightforward to see that the first part of that sum is always zero (since we are comparing \(g^*\) to itself)
- Additionally, any term where \(t^\prime = t\) is also zero.
Combined, the in-sample imputation estimator (compared to the “leave-one-out” estimator) exhibits an attenuation towards zero

\[\widehat{\text{ATT}}_{\text{is}}(t) = \bigg(\frac{N_{\infty}}{N}\times\frac{g^*-2}{g^*-1}\bigg)\times\widehat{\text{ATT}}_{\text{loo}}(t)\]

where \(N_{\infty}\) is the number of never-treated units and \(N = N_{g^*} + N_{\infty}\)
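
The attenuation factor can be checked numerically. The base R sketch below simulates an arbitrary non-staggered panel (all names, sizes, and the outcome model are assumptions), computes a pre-treatment placebo by in-sample imputation and by leaving the target observations out of the fit, and compares the two using the never-treated share of units times \((g^*-2)/(g^*-1)\).

```r
set.seed(1)
g_star <- 5; T_max <- 8            # all treated units adopt at g* = 5
n_treated <- 4; n_never <- 6
panel <- expand.grid(id = 1:(n_treated + n_never), t = 1:T_max)
panel$treated_unit <- panel$id <= n_treated
# Arbitrary outcomes with a differential trend for the treated units
panel$y <- rnorm(nrow(panel)) + 0.3 * panel$t * panel$treated_unit
panel$id_f <- factor(panel$id)
panel$t_f <- factor(panel$t)

# Untreated observations: never-treated units plus pre-periods of treated units
untreated <- panel[!(panel$treated_unit & panel$t >= g_star), ]
placebo_t <- 2
target <- untreated$treated_unit & untreated$t == placebo_t

# In-sample: the target observations also enter the imputation regression
fit_is <- lm(y ~ id_f + t_f, data = untreated)
att_is <- mean(untreated$y[target] - predict(fit_is, newdata = untreated[target, ]))

# Leave-one-out: drop the target observations before fitting
fit_loo <- lm(y ~ id_f + t_f, data = untreated[!target, ])
att_loo <- mean(untreated$y[target] - predict(fit_loo, newdata = untreated[target, ]))

shrink <- (n_never / (n_treated + n_never)) * ((g_star - 2) / (g_star - 1))
c(in_sample = att_is, shrunk_loo = shrink * att_loo, factor = shrink)
```

With 6 of 10 units never treated and four pre-treatment periods, the in-sample placebo is 0.45 times the leave-one-out placebo, so a plot of in-sample placebos understates the size of any pre-trend by more than half.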

Application: Gazmararian (2025)

Gazmararian, Alexander F. “Sources of partisan change: Evidence from the shale gas shock in American coal country.” The Journal of Politics 87, no. 2 (2025): 601-615.

Examines whether the 2008 shale gas shock (fracking boom) shifted Republican presidential vote share in U.S. coal counties
- Gas substitutes for coal, accelerating coal’s decline
- Voters dependent on coal shift toward Republicans who promised looser environmental regulation
- Treatment: coal county (\(\geq\) 1% coal employment pre-2008) \(\times\) post-2008
- County-level analysis, presidential elections 1976-2020
Original analysis uses fect with in-sample imputation (the default)
- Pre-trends are statistically distinguishable from zero
- But appear to be small - an equivalence testing approach concludes that the size of the violation is “negligible”

Equivalence testing

Conventional null hypothesis tests assume a null of no difference and reject in favor of an alternative that a difference exists

\[H_0: \tau = 0; H_a: \tau \neq 0\]
But with placebo tests, this is kind of begging the question
- You are assuming what you set out to demonstrate (that a parallel trends violation is negligible)
- Under low power settings, you might fail to reject under large violations!

Equivalence testing

Equivalence testing inverts this - assume a null that the true difference lies outside some “equivalence region” and an alternative that it lies inside it

\[H_0: |\tau| \ge \epsilon; H_a: |\tau| < \epsilon\]
Researcher selects the \(\epsilon\) - what size of an effect is considered “negligible”
- Existing recommendations mostly come from the literature on balance checking (Hartman and Hidalgo, 2018).
- For assessing pre-trends, most intuitive to just benchmark against observed effects.
Easy implementation - Two One-Sided Test (TOST) approach
- Reject at the 0.05 level (in favor of a negligible difference) if the 90% confidence interval is entirely within the “equivalence region” \((-\epsilon, \epsilon)\)
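
A minimal base R sketch of the TOST decision rule, assuming the placebo estimate is approximately normal with a known standard error. The helper `tost_equivalent()` and the example numbers are hypothetical.

```r
# Hypothetical helper: declare equivalence when the (1 - 2 * alpha) confidence
# interval lies strictly inside (-epsilon, epsilon)
tost_equivalent <- function(estimate, se, epsilon, alpha = 0.05) {
  z <- qnorm(1 - alpha)                    # one-sided critical value
  ci <- estimate + c(-1, 1) * z * se       # 90% interval when alpha = 0.05
  list(ci = ci, equivalent = ci[1] > -epsilon && ci[2] < epsilon)
}

# Precisely estimated small placebo: interval sits inside the region
precise <- tost_equivalent(estimate = 0.01, se = 0.01, epsilon = 0.05)
# Same point estimate, noisy: cannot rule out a violation as large as epsilon
noisy <- tost_equivalent(estimate = 0.01, se = 0.05, epsilon = 0.05)

c(precise = precise$equivalent, noisy = noisy$equivalent)
```

The second case shows the contrast with a conventional test: the noisy placebo is not significantly different from zero, but it also fails to establish that the violation is negligible.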

Replication: In-sample vs. leave-one-out

Replication: Fixed baseline estimates

HonestDiD: Partial identification

An alternative approach to incorporating the actual pre-treatment placebos is to use them to bound the potential violation of parallel trends in the post-treatment period.
- Rambachan and Roth (2023) propose a partial identification approach based on such a user specified bound
- Develop an approach for constructing confidence sets under this bound
In our conventional event-study regression, we get estimates of…
- \(\beta_{l}^{\text{post}} = \tau_{l}^{\text{post}} + \delta_l^{\text{post}}\) - the treatment effects - a combination of the “true” effect and the parallel trends violation
- \(\beta_{l}^{\text{pre}} = \delta_l^{\text{pre}}\) - the pre-treatment “placebos” - these are only capturing the parallel trends violation
We don’t want to assume \(\delta = 0\).
- Instead, we’ll relax that assumption by assuming \(\delta \in \Delta\), some user-specified set of restrictions

HonestDiD: Partial identification

One popular approach recommended in Rambachan and Roth (2023) is to bound the relative magnitudes

\[\Delta^{\text{RM}}(\bar{M}) = \bigg\{\delta: \forall l \ge 0, |\delta_{l+1} - \delta_{l}| \le \bar{M} \times \max_{s < 0} |\delta_{s - 1} - \delta_{s}|\bigg\}\]
The per-period change in the violation of parallel trends in the post-treatment period is no more than \(\bar{M}\) times the largest observed per-period violation in the pre-treatment period
- \(\bar{M} = 0\): Our parallel trends assumption holds
- \(\bar{M} = 1\): The per-period change in the violation after treatment is no larger than the largest per-period change observed before treatment.
Construct a hypothesis test under the system of moment inequalities implied by the bound
- Invert to obtain a confidence set
- Sensitivity analysis - vary \(\bar{M}\) and see at what level the results “break”
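
The logic of the relative-magnitudes bound can be sketched in base R by plugging in point estimates. This is a deliberately naive illustration, not the HonestDiD procedure: it ignores sampling uncertainty in both the pre- and post-treatment coefficients, it assumes the step from the reference period \(-1\) to period \(0\) is bounded like every later step, and the coefficient values and the helper `rm_bounds()` are made up.

```r
# Hypothetical event-study point estimates; reference period -1 is normalized to zero
beta_pre  <- c(`-4` = 0.03, `-3` = -0.01, `-2` = 0.02, `-1` = 0)
beta_post <- c(`0` = 0.10, `1` = 0.15, `2` = 0.20)

rm_bounds <- function(beta_pre, beta_post, M_bar) {
  # Largest per-period change in the pre-treatment coefficients
  max_pre_step <- max(abs(diff(beta_pre)))
  # After l + 1 steps from the reference period the violation can have
  # accumulated to at most (l + 1) * M_bar * max_pre_step
  max_violation <- seq_along(beta_post) * M_bar * max_pre_step
  cbind(lower = beta_post - max_violation, upper = beta_post + max_violation)
}

bounds_m0 <- rm_bounds(beta_pre, beta_post, M_bar = 0)
bounds_m1 <- rm_bounds(beta_pre, beta_post, M_bar = 1)
bounds_m1
```

Setting `M_bar = 0` returns the point estimates, and the bounds widen with both `M_bar` and the distance from the reference period. A sensitivity analysis repeats this over a grid of `M_bar` values; the actual method additionally accounts for estimation error when forming confidence sets.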

HonestDiD: TWFE event study (Gazmararian, 2025)

Summary

Diff-in-diff with two periods, two treatment groups
- Parallel trends assumption - sensitive to transformations of the outcome (e.g. logs vs. levels)
- Straightforward non-parametric estimator - can weight + use regression if parallel trends holds conditionally
Diff-in-diff with multiple periods, two treatment groups
- No changes! Everything is fine - TWFE still gives you a difference-in-difference under no effect homogeneity assumptions
- Use pre-trends tests as a diagnostic, but be careful (w/ small samples, low power to detect violations)
Diff-in-diff with staggered adoption
- Can break down into a bunch of 2x2 DiDs for different treatment groups and time periods
- But beware of naive 2-way fixed effects, both static and dynamic specifications - “bad” 2x2s if effects aren’t constant

Next week

Grab-bag of other topics on “time-series causal inference”
- Diff-in-diff is not the only design when you have time
- In fact, it’s arguably not even a “time-series” method
When outcome in the past affects future treatment, DiD assumptions are violated!
- We need a different design - back to selection-on-observables conditional on lagged outcome and treatment
- Identification under “sequential ignorability” - lagged DV/IV regressions.
“Synthetic control” designs - reweight controls to match treated units on lagged outcomes.
Additional bit of discussion on what “fixed effects” accomplish and how to think about them
- Fixed effects are a form of de-meaning - equivalent to controlling for averages of the outcome (the “Mundlak” device)
