Week 12: Regression Discontinuity Designs

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

April 13, 2026

Last several weeks

  • Techniques for addressing unobserved confounding
  • Instrumental variables
    • Find a randomized (natural) experiment that affects your treatment and doesn’t affect the outcome any other way.
  • Difference-in-differences
    • Find an outcome where we know treatment has no effect (e.g. before treatment starts)
    • Use the observed difference on that outcome to de-bias the comparison between treated and control.
  • Synthetic control
    • Adjust for lagged Y to indirectly adjust for the latent factor
    • Bias goes to zero as number of pre-treatment periods gets large

This week

  • Another strategy for identification when there exists unobserved confounding
  • Regression Discontinuity Designs: What if the treatment is assigned via a “cut-off” rule?
    • All units below the cut-off remain under control
    • All units above the cut-off get treated
  • If there’s an observed discontinuity in the outcome, that might be evidence of a causal effect if…
    • The conditional expectations of the potential outcomes are truly continuous at the cut-off
  • Often described as “as-good-as-random” assignment near the cut-off.

Regression Discontinuity Designs

  • Three components:
    • “Score”/“running”/“forcing” variable: \(X_i\)
    • Cut-off: \(c\)
    • Treatment is determined by \(X_i\) and \(c\)
  • \(X_i\) is predictive of the potential outcomes - it’s a confounder.
    • But all we’ll need for identification is the smoothness of the CEFs of the potential outcomes around \(c\)
  • Presence of an unexpected “jump” around \(c\) is attributed to the causal effect of treatment.
  • Examples from education:
    • Test score thresholds for allocating scholarships (Thistlethwaite and Campbell, 1960)
    • Class size thresholds for splitting classes (Angrist and Lavy, 1999)
    • GPA thresholds for majors (Bleemer and Mehta, 2022)
  • Examples from political science
    • Close elections! (Lee, 2008; Broockman, 2009; Gerber and Hopkins, 2011)

Regression Discontinuity Designs

  • Setup:
    • Treatment: \(D_i \in \{0, 1\}\)
    • Potential outcomes: \(Y_i(1), Y_i(0)\)
    • Observed outcomes (Consistency): \(Y_i = Y_i(1)D_i + Y_i(0)(1- D_i)\)
    • Score/running variable \(X_i\)
    • Threshold \(c\)

Regression Discontinuity Designs

  • Sharp RD: Treatment assignment is a deterministic function of the running variable \(X_i\) and the cut-off \(c\)

\[D_i = \mathbb{1}(X_i \ge c) \text{ for all } i\]

  • Close (2 party) FPTP elections:

    • Candidates receiving above 50 percent (2-party) of the vote get elected
    • Candidates below 50 percent of the (2-party) vote do not.
  • Sharp RD: Treatment is deterministic

    \[\mathbb{P}(D_i = 1 | X_i \ge c) = 1\] \[\mathbb{P}(D_i = 1 | X_i < c) = 0\]

Regression Discontinuity Designs

  • Assumption: Continuity of the CEFs (around \(c\)):

    • We assume that \(\mathbb{E}[Y_i(0) | X_i = x]\) and \(\mathbb{E}[Y_i(1) | X_i = x]\) are continuous in \(x\)
  • Implications:

    • The CEF of \(Y_i(0)\) at \(X_i = c\) is equal to the limit of the CEF from below

      \[\mathbb{E}[Y_i(0) | X_i = c] = \lim_{x \to c^{-}} \mathbb{E}[Y_i(0) | X_i = x]\]

    • All units below the discontinuity take control.

      \[\mathbb{E}[Y_i(0) | X_i = c] = \lim_{x \to c^{-}} \mathbb{E}[Y_i(0) | D_i = 0, X_i = x]\]

    • Then, by consistency

      \[\mathbb{E}[Y_i(0) | X_i = c] = \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x]\]

Regression Discontinuity Designs

  • The same holds for the CEF of \(Y_i(1)\), but taking the limit from above

    \[\mathbb{E}[Y_i(1) | X_i = c] = \lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x]\]

  • We can therefore identify the treatment effect at the threshold using the difference in the one-sided limits

    \[\begin{align*}\tau_{\text{SRD}} &= \mathbb{E}[Y_i(1) - Y_i(0) | X_i = c]\\ &= \mathbb{E}[Y_i(1) | X_i = c] - \mathbb{E}[Y_i(0) | X_i = c]\\ &= \lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x] \end{align*}\]

  • Intuition - We use the data below and above the cut-off to extrapolate to the cut-off. The difference in extrapolations is our estimate of the average treatment effect at the cut-off.

    • All we need is continuity in the CEFs

Visualizing the sharp RD

Extrapolation

  • How does RD compare to other identification strategies?
  • Implicitly we have a selection-on-observables assumption: \(D_i\) is perfectly determined by \(X_i\)
    • Conditional on \(X_i\) it’s independent of the potential outcomes
  • But unlike selection-on-observables, we have no overlap/positivity
    • \(\mathbb{P}(D_i = 1 | X_i < c) = 0\)
  • RD relies on extrapolation from the observed treated/control observations to a common value of \(X_i\) - the cut-off or threshold \(c\).
    • Extrapolation can be very sensitive to model specification - works best when there are many observations near \(c\)

Interpreting the RD Estimand

  • Like IV, RD identifies a local average treatment effect: the average effect for units at the cut-off, \(X_i = c\)
  • What if we’re not interested in the effect at the discontinuity but the effect for the sample as a whole?
    • External validity challenge - how much effect heterogeneity is there?

Violations of continuity

  • What could cause the potential outcomes to be discontinuous around \(c\)?
  • Bunching/Sorting
    • Suppose individuals knew the cut-off and could manipulate their \(X_i\) to get (or avoid) treatment
    • Another selection-into-treatment problem.
    • Can diagnose by looking at the histogram of observations around the discontinuity.
  • Other “treatments”
    • Sometimes other factors will be “assigned” by a discontinuity along with the treatment
    • Common with geographical RDs - a lot of things change across a border!

Example of “bunching”

Gelber et al. (2021) “Misperceptions of the Social Security Earnings Test and the Actuarial Adjustment: Implications for Labor Force Participation and Earnings”

Estimation: Local Regression

Estimation challenges

  • With infinite data, we can get arbitrarily close to the true ATE at the discontinuity
    • More and more observations very close to \(c\)
  • But with actual datasets, we might have very few observations near \(c\).
    • Need to use observations that are further away and fit a model to extrapolate to the discontinuity.
  • Bias-variance trade-off:
    • Using observations that are very far from the discontinuity might increase bias (especially if our assumptions on the CEF are wrong) but reduce variance.
    • Restricting ourselves to only “close” observations might reduce bias but increase the variance.
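
The trade-off above can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for illustration: the data are simulated (not the Lee data), the true jump is set to 0.08, and the estimator is a plain difference in means inside a window of half-width `h` rather than the local regression introduced later.

```r
# Simulated sharp RD; all numbers are illustrative assumptions.
set.seed(813)
tau <- 0.08  # true jump at the cut-off c = 0

# Draw one dataset and return the difference in means within |x| < h
diff_in_means <- function(h, n = 2000) {
  x <- runif(n, -1, 1)
  d <- as.integer(x >= 0)
  y <- 0.45 + 0.4 * x + tau * d + rnorm(n, sd = 0.1)
  keep <- abs(x) < h
  mean(y[keep & d == 1]) - mean(y[keep & d == 0])
}

bandwidths <- c(0.05, 0.2, 0.8)
# 500 simulated estimates per bandwidth (one column per bandwidth)
sims <- sapply(bandwidths, function(h) replicate(500, diff_in_means(h)))

results <- data.frame(
  h    = bandwidths,
  bias = colMeans(sims) - tau,   # average error relative to the true jump
  sd   = apply(sims, 2, sd)      # sampling variability of the estimate
)
print(results)
```

Because the simulated CEF slopes upward, wider windows pull in observations whose outcomes differ for reasons other than treatment, so the bias column should grow with `h` while the standard deviation shrinks.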

Binned scatterplots

  • In a regression-discontinuity design, always plot your data!
    • Raw scatterplots are hard to interpret – we want to start by trying to approximate the CEF without imposing any additional modeling assumptions
  • Binned scatterplots - Plot the average of \(Y_i\) within bins of \(X_i\)
    • Choice of binning method (equally spaced vs. quantile) and number of bins is a bias-variance trade-off
  • Do we see the conditional expectation changing smoothly before and after the cut-point? Is there a visible gap at the cutpoint?
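
A binned scatterplot can also be computed by hand, which makes the binning choices explicit. The sketch below uses simulated data (an assumption, not the Lee data) and equally spaced bins built so that the cut-off is a bin edge and no bin mixes treated and control units.

```r
# Simulated sharp RD with a jump of 0.08 at c = 0 (illustrative values).
set.seed(813)
n <- 3000
x <- runif(n, -0.5, 0.5)
d <- as.integer(x >= 0)
y <- 0.45 + 0.4 * x + 0.08 * d + rnorm(n, sd = 0.1)

# 20 equally spaced bins of width 0.05; 0 is exactly one of the edges
breaks <- (-10:10) / 20
bin <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

binned <- data.frame(
  x_mid  = (head(breaks, -1) + tail(breaks, -1)) / 2,  # bin mid-points
  y_mean = as.numeric(tapply(y, bin, mean)),           # average outcome per bin
  n_obs  = as.integer(table(bin))                      # observations per bin
)
print(binned)
```

Plotting `binned$y_mean` against `binned$x_mid` (for example with `plot()` followed by `abline(v = 0, lty = 2)`) gives the binned scatterplot; fewer, wider bins give a smoother but coarser picture of the CEF.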

Illustration: Incumbency Advantage

  • Our running example will be the Lee (2008) “close elections” dataset.
  • What is the size of the incumbency advantage in the U.S. House?
    • When Democrats barely win in time \(t\) does it have an effect on their vote share in \(t+1\)?
  • Variables
    • \(X_i\) - Democratic margin of victory in time \(t\)
    • \(Y_i\) - Democratic vote share in time \(t+1\)
    • \(D_i\) - Democratic victory in time \(t\) (margin \(\ge 0\))

Illustration: Incumbency Advantage

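library(tidyverse)  # read_csv(), %>% and ggplot()
library(estimatr)   # lm_robust()
house <- read_csv("assets/house.csv")
house$d <- as.integer(house$x >= 0)  # treatment indicator: Democratic win at time t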
house %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) + geom_point() + geom_vline(xintercept=0, lty=2) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Illustration: Incumbency Advantage

house %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=50, size=2, geom='point') +  geom_vline(xintercept=0, lty=2) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Illustration: Incumbency Advantage

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') +  geom_vline(xintercept=0, lty=2) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Local polynomial regression

  • Goal: Estimate \(\lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x]\) and \(\lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x]\)
  • Fit a model on the treated and control sides (respectively) and get the prediction at the cut-point.
  • What model? A local polynomial regression
    • To reduce the approximation error from our choice of polynomial use only units with \(X_i\) close to \(c\) (within some bandwidth \(h\))
    • Use the model to capture changes in \(\mathbb{E}[Y_i(d) | X_i = x]\) even near \(c\)

Local polynomial regression

  • Choose a polynomial order \(p\) and a kernel function \(K(\cdot)\)
    • Kernel captures how we should weight observations near the discontinuity vs. far
    • Lots of options: “triangular” is common (diminishing weight further from the discontinuity) but we’ll just use a uniform kernel for this example.
  • Choose a bandwidth \(h\)
    • Observations outside of the bandwidth receive a weight of \(0\). Observations inside the bandwidth get weight \(K\left(\frac{X_i - c}{h}\right)\)
    • With a uniform kernel, all observations get the same weight if they’re inside the bandwidth
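
As a small sketch of the weighting rule above, the two kernels can be written as functions of the scaled distance \(u = (X_i - c)/h\). The function names are illustrative, and the uniform kernel is left unnormalized since a constant rescaling of the weights does not change a weighted least squares fit.

```r
# Kernels as functions of the scaled distance u = (x - c) / h
uniform_kernel    <- function(u) as.numeric(abs(u) < 1)  # equal weight inside the bandwidth
triangular_kernel <- function(u) pmax(0, 1 - abs(u))     # weight declines linearly to 0 at |u| = 1

# Weight for each observation given a cut-off and a bandwidth
kernel_weights <- function(x, cutoff = 0, h, kernel = triangular_kernel) {
  kernel((x - cutoff) / h)
}

x <- c(-0.15, -0.10, -0.05, 0, 0.02, 0.05, 0.12)
w <- data.frame(
  x          = x,
  uniform    = kernel_weights(x, h = 0.1, kernel = uniform_kernel),
  triangular = kernel_weights(x, h = 0.1, kernel = triangular_kernel)
)
print(w)
```

Observations at or beyond the bandwidth get weight 0 under both kernels; inside it, the triangular kernel gives the most weight to observations at the cut-off.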

Local polynomial regression

  • Fit a regression among observations \(X_i \ge c\) of \(Y_i\) on the polynomial of \((X_i - c), (X_i - c)^2, \dotsc, (X_i - c)^p\), weighting each observation by its kernel weight.

    \[\widehat{\mathbb{E}}[Y_i | X_i = x] = \hat{\mu}_{+} + \hat{\mu}_{+,1}(x - c) + \dotsc + \hat{\mu}_{+,p}(x - c)^p \quad \text{for } x \ge c\]

  • Fit a regression among observations \(X_i < c\) of \(Y_i\) on the polynomial of \((X_i - c), (X_i - c)^2, \dotsc, (X_i - c)^p\), weighting each observation by its kernel weight.

    \[\widehat{\mathbb{E}}[Y_i | X_i = x] = \hat{\mu}_{-} + \hat{\mu}_{-,1}(x - c) + \dotsc + \hat{\mu}_{-,p}(x - c)^p \quad \text{for } x < c\]

  • Our RD estimate is the difference in intercepts from this regression: \(\hat{\tau}_{\text{SRD}} = \hat{\mu}_+ - \hat{\mu}_-\)

Local polynomial regression

  • We can do this all in a single regression with all polynomial terms interacted with the treatment variable
  • Let’s do a local linear fit with a bandwidth of \(.1\) and a uniform kernel
lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .1))
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4640    0.00861 53.9084 2.96e-323   0.4471   0.4809 1204
d             0.0604    0.01264  4.7776  1.99e-06   0.0356   0.0852 1204
I(x - 0)      0.6440    0.14181  4.5416  6.14e-06   0.3658   0.9223 1204
d:I(x - 0)    0.0104    0.20969  0.0498  9.60e-01  -0.4009   0.4218 1204
  • The coefficient on \(d\) is the estimated gap at the discontinuity.
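
To see why the single interacted regression gives the same answer as two separate fits, here is a minimal check on simulated data (an assumption, not the Lee data). It uses base `lm()` instead of `lm_robust()`, which changes the standard errors but not the point estimates.

```r
# Simulated sharp RD: jump of 0.08 at c = 0, different slopes on each side
set.seed(813)
n <- 5000
x <- runif(n, -1, 1)
d <- as.integer(x >= 0)
y <- 0.45 + 0.4 * x + 0.08 * d + 0.1 * d * x + rnorm(n, sd = 0.1)
sim <- data.frame(x, d, y)

h <- 0.25                              # bandwidth (uniform kernel)
local <- subset(sim, abs(x) < h)

# Approach 1: separate local linear fits above and below the cut-off
fit_above <- lm(y ~ x, data = subset(local, d == 1))
fit_below <- lm(y ~ x, data = subset(local, d == 0))
tau_two_fits <- unname(coef(fit_above)[1] - coef(fit_below)[1])

# Approach 2: one regression with the running variable interacted with d
fit_joint <- lm(y ~ d * x, data = local)
tau_joint <- unname(coef(fit_joint)["d"])

print(c(two_fits = tau_two_fits, interacted = tau_joint))
```

The two numbers agree because the fully interacted model lets the intercept and slope differ freely on each side, which is exactly what fitting the two sides separately does.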

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=as.numeric(y>.5), colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Pr(Democratic victory at time t+1)")

Local polynomial regression

  • How does changing the bandwidth affect the local linear estimate?
  • Bandwidth of \(0.05\):
lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .05))
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)   0.4691     0.0105 44.6878 5.87e-194   0.4485     0.49 606
d             0.0487     0.0160  3.0512  2.38e-03   0.0174     0.08 606
I(x - 0)      0.8990     0.3477  2.5857  9.95e-03   0.2162     1.58 606
d:I(x - 0)    0.0197     0.5086  0.0387  9.69e-01  -0.9792     1.02 606
  • Bandwidth of \(.2\):
lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .2))
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4535    0.00625  72.577 0.00e+00   0.4413   0.4658 2260
d             0.0778    0.00922   8.440 5.59e-17   0.0597   0.0959 2260
I(x - 0)      0.4007    0.05594   7.162 1.07e-12   0.2910   0.5103 2260
d:I(x - 0)    0.0749    0.08162   0.918 3.59e-01  -0.0851   0.2350 2260

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .05)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.05, lty=3) + geom_vline(xintercept=.05, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .2)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.2, lty=3) + geom_vline(xintercept=.2, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Kernel weights

  • What happens if we use a triangular kernel rather than a uniform kernel (at bw = .1)?
    • For the local linear regression?
house$kernelwt <- (1 - abs((house$x - 0)/.1))*as.numeric(abs(house$x) < .1)
# Unweighted
lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .1))
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4640    0.00861 53.9084 2.96e-323   0.4471   0.4809 1204
d             0.0604    0.01264  4.7776  1.99e-06   0.0356   0.0852 1204
I(x - 0)      0.6440    0.14181  4.5416  6.14e-06   0.3658   0.9223 1204
d:I(x - 0)    0.0104    0.20969  0.0498  9.60e-01  -0.4009   0.4218 1204
# Weighted
lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .1), weights=kernelwt)
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4631    0.00857  54.029 0.00e+00    0.446   0.4799 1204
d             0.0594    0.01294   4.590 4.90e-06    0.034   0.0848 1204
I(x - 0)      0.6169    0.16164   3.816 1.42e-04    0.300   0.9340 1204
d:I(x - 0)    0.0939    0.24947   0.376 7.07e-01   -0.396   0.5833 1204

Kernel weights

  • How about for the quadratic?
# Unweighted
lm_robust(y ~ d + d*I(x-0) + d*I((x-0)^2), data=house %>% filter(abs(x) < .1))
               Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)      0.4616     0.0113  40.842 2.03e-229   0.4395   0.4838 1202
d                0.0578     0.0172   3.366  7.88e-04   0.0241   0.0914 1202
I(x - 0)         0.5068     0.5422   0.935  3.50e-01  -0.5570   1.5706 1202
I((x - 0)^2)    -1.3592     5.4595  -0.249  8.03e-01 -12.0704   9.3519 1202
d:I(x - 0)       0.4427     0.8058   0.549  5.83e-01  -1.1382   2.0237 1202
d:I((x - 0)^2)  -1.5958     7.9289  -0.201  8.41e-01 -17.1518  13.9602 1202
# Weighted
lm_robust(y ~ d + d*I(x-0) + d*I((x-0)^2), data=house %>% filter(abs(x) < .1), weights=kernelwt)
               Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)      0.4599     0.0103  44.654 1.67e-257   0.4397   0.4801 1202
d                0.0637     0.0161   3.963  7.83e-05   0.0321   0.0952 1202
I(x - 0)         0.3750     0.5737   0.654  5.13e-01  -0.7505   1.5005 1202
I((x - 0)^2)    -2.9874     6.4512  -0.463  6.43e-01 -15.6442   9.6695 1202
d:I(x - 0)       0.2521     0.8541   0.295  7.68e-01  -1.4236   1.9278 1202
d:I((x - 0)^2)   4.0225     9.4273   0.427  6.70e-01 -14.4733  22.5183 1202

Kernel weights

  • Quadratic: Uniform kernel
house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x + I(x^2), data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") + xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Kernel weights

  • Quadratic: Triangular kernel
house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d), weight=kernelwt)) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x + I(x^2), data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") + xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Bandwidth selection

  • General criteria for how to choose \(h\)
    • Large \(h\): More bias, lower variance
    • Small \(h\): Less bias, higher variance
  • If the true CEF is linear, we can get away with choosing a larger \(h\) (since the model will be correct)
    • If the true CEF is non-linear, then our linear approximation will only be “good” for a small window.
    • How bad depends on the degree of non-linearity
  • An “optimal” bandwidth minimizes the Mean Square Error \((\text{Bias}^2 + \text{Variance})\)
  • Imbens and Kalyanaraman (2012) and Calonico et al. (2014) derive an approximation to the MSE to find an optimal solution.
    • Depends on three main factors: density of observations around \(c\), variance of \(Y\) around \(c\) and the curvature of the CEF around \(c\).
  • Confidence intervals using the MSE-optimal bandwidth will still under-cover (the MSE-optimal estimator is biased).
    • Solution: Use a smaller \(h\) than optimal or use a bias-correction (Cattaneo et al., 2014)
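
A stylized sketch of the trade-off behind the MSE-optimal bandwidth, for a local polynomial of order \(p\) with \(n\) observations. The symbols \(B\) and \(V\) are notation introduced here for illustration, and kernel-specific constants are absorbed into them:

    \[\text{MSE}(h) \approx h^{2(p+1)} B^2 + \frac{V}{nh} \quad \Longrightarrow \quad h_{\text{MSE}} = \left(\frac{V}{2(p+1) B^2}\right)^{\frac{1}{2p+3}} n^{-\frac{1}{2p+3}}\]

Here \(B\) grows with the curvature of the CEF around \(c\), while \(V\) grows with the variance of \(Y\) around \(c\) and shrinks as the density of observations around \(c\) increases. For a local linear fit (\(p = 1\)) the optimal bandwidth shrinks at rate \(n^{-1/5}\). At that rate the bias and the standard deviation are of the same order, which is why conventional confidence intervals under-cover.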

Discussion: Higher-order polynomials

  • Chen, Ebenstein and Greenstone (2013: PNAS) attempt to estimate the effect of air pollution on life expectancy using data from China.
  • Design - Geographic RD using the Huai River - From 1950 to 1980 the Chinese government subsidized the use of coal for heating in cities north of the Huai River
    • The authors argue that this created an exogenous shock in air pollution at the boundary.
    • Use a regression-discontinuity approach to estimate the effect of exposure to air pollution on life expectancy.

Discussion: Higher-order polynomials

  • They estimate the effect using a global third-order polynomial of distance to the Huai river

Discussion: Higher-order polynomials

house %>% filter(abs(x) < 1) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x + I(x^2) + I(x^3)) +
  geom_vline(xintercept=0, lty=2)  + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

“Fuzzy” RD

Fuzzy Regression Discontinuity

  • Under a “fuzzy” RD, the probability of receiving treatment is no longer a jump from 0 to 1 at the cut-point.

  • However, we will still assume that there is a jump in the probability of receiving treatment at the discontinuity

  • Assumption - Discontinuity in the propensity score at \(c\)

    \[\lim_{x \to c^{+}} \mathbb{P}(D_i = 1 | X_i = x) \neq \lim_{x \to c^{-}} \mathbb{P}(D_i = 1 | X_i = x)\]

  • Often plausible in settings where the cutpoint acts as an encouragement to take treatment

    • e.g. Only individuals w/ incomes below \(c\) are eligible to apply for a program but they are not forced into it.

Fuzzy RD is IV

  • Being above the cut-off is an instrument

    • \(D_i(1)\): The treatment taken by unit \(i\) when \(X_i \ge c\)
    • \(D_i(0)\): The treatment taken by unit \(i\) when \(X_i < c\)
  • Under our original continuity assumption, applying the Sharp RD estimator to a Fuzzy RD setting recovers the intent to treat effect

    \[\lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x] = \mathbb{E}[(D_i(1) - D_i(0))(Y_i(1) - Y_i(0)) | X_i = c] = \tau_{\text{ITT}}\]

  • ITT is driven by two elements

    • \(Y_i(1) - Y_i(0)\): The actual treatment effect
    • \(D_i(1) - D_i(0)\): The effect of being above the discontinuity on taking treatment.

Fuzzy RD is IV

  • What do we need to recover a LATE?

    • Continuity in \(\mathbb{E}[Y_i(1) | X_i = x]\) and \(\mathbb{E}[Y_i(0) | X_i = x]\) around \(c\) is akin to exogeneity + exclusion restriction for the instrument
    • “Local randomization” assumptions also give a similar intuition.
  • Need one more assumption: monotonicity: \(D_i(1) \ge D_i(0)\)

    • A “no-defiers” assumption at the discontinuity.
    • No one who refuses the treatment when above the cut-off would take the treatment if they were below the cut-off.
    • No one who takes the treatment when below the cut-off would not take it if they were above the cut-off.
  • Under continuity and monotonicity, we get a familiar ratio estimator

    \[\frac{\lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x]}{\lim_{x \to c^{+}} \mathbb{E}[D_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[D_i | X_i = x]} = \mathbb{E}[Y_i(1) - Y_i(0)| X_i = c, D_i(1) > D_i(0)]\]

  • The ratio of ITT and first stage RD estimates recovers the local average treatment effect at the cut-point among compliers.
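
A simulated sketch of the ratio estimator follows. The compliance shares, effect sizes, and bandwidth are all assumptions chosen for illustration: compliers have an effect of 1 while always-takers and never-takers have an effect of 3, so the complier effect (1) differs from the average effect at the cut-off (2).

```r
# Simulated fuzzy RD with cut-off c = 0 (all values are illustrative)
set.seed(813)
n <- 50000
x <- runif(n, -1, 1)
z <- as.integer(x >= 0)   # indicator for being above the cut-off

# Compliance types, independent of x; no defiers, so monotonicity holds
type <- sample(c("always", "never", "complier"), n, replace = TRUE,
               prob = c(0.2, 0.3, 0.5))
d <- ifelse(type == "always", 1L, ifelse(type == "never", 0L, z))

# Heterogeneous effects: 1 for compliers, 3 for everyone else
effect <- ifelse(type == "complier", 1, 3)
y <- 0.5 * x + effect * d + rnorm(n, sd = 0.5)

h <- 0.5
local <- data.frame(x, z, d, y)[abs(x) < h, ]

# Local linear RD estimates of the jump in the outcome and in treatment take-up
reduced_form <- coef(lm(y ~ z * x, data = local))["z"]
first_stage  <- coef(lm(d ~ z * x, data = local))["z"]
late_hat <- unname(reduced_form / first_stage)

print(c(reduced_form = unname(reduced_form),
        first_stage  = unname(first_stage),
        late         = late_hat))
```

In this setup the first stage should be close to the complier share (0.5) and the ratio close to the complier effect (1), not the average effect of 2.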

Estimation

  • We can just take the ratio of the two regression coefficients from the reduced form and the first stage RDs
  • Or we can do this with a 2SLS approach. For a (local) linear model:
    • “First stage”

      \[D_i = \delta_0 + \rho \mathbb{1}(X_i \ge c) + \delta_1(X_i - c) + \delta_2(X_i - c)\mathbb{1}(X_i \ge c)\]

    • Second stage

      \[Y_i = \beta_0 + \tau \hat{D}_i + \beta_1(X_i - c) + \beta_2(X_i - c)\mathbb{1}(X_i \ge c)\]

  • Note that interactions with the cutpoint indicator still appear in the second-stage to allow for the model above and below the cutpoint to vary in coefficients.
  • All weak instrument problems still apply here! Need a strong first stage to do reliable inference.
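
The two estimation routes above give the same point estimate when both stages use the same running-variable terms. The sketch below checks this on simulated data with a constant treatment effect of 1 (an illustrative assumption), running 2SLS by hand with base `lm()`. The standard errors from a hand-rolled second stage are not valid, so in practice a dedicated IV routine should be used for inference.

```r
# Simulated fuzzy RD: take-up jumps from 0.2 to 0.7 at c = 0 (illustrative)
set.seed(813)
n <- 10000
x <- runif(n, -1, 1)
above <- as.integer(x >= 0)
d <- rbinom(n, 1, ifelse(above == 1, 0.7, 0.2))
y <- 0.3 * x + 1 * d + rnorm(n, sd = 0.5)   # constant treatment effect of 1
local <- data.frame(x, above, d, y)[abs(x) < 0.5, ]

# First stage: treatment on the cut-off indicator and running-variable terms
first <- lm(d ~ above + x + above:x, data = local)
local$d_hat <- fitted(first)

# Second stage: fitted treatment replaces D; same running-variable terms
second <- lm(y ~ d_hat + x + above:x, data = local)
tau_2sls <- unname(coef(second)["d_hat"])

# Ratio of the reduced-form jump to the first-stage jump
reduced <- lm(y ~ above + x + above:x, data = local)
tau_ratio <- unname(coef(reduced)["above"] / coef(first)["above"])

print(c(two_stage = tau_2sls, ratio = tau_ratio))
```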

Example: Majoring in Economics

  • Bleemer and Mehta (2022, AEJ:AE) look at the effects of majoring in economics as an undergraduate on early-career wages.
  • Design: Fuzzy RD using a GPA cut-off
    • UC Santa Cruz’s Economics department implemented a policy under which students with a GPA below 2.8 in Econ 1 and 2 could declare the major only “at the discretion of the department”

Example: Majoring in Economics

  • Some students below the threshold still majored, and not all students above the threshold did

Example: Majoring in Economics

  • The reduced form is a sharp RD

Diagnosing assumptions

Placebo checks

  • One consequence of the continuity assumption in RD is that we should also expect any pre-treatment covariates to be continuous around the cut-point as well.
  • Common to conduct “placebo” RDs across the discontinuity on covariates known to be unaffected by treatment.
    • Presence of a non-zero discontinuity would suggest “sorting” with different types of respondents more likely to be below vs. above
  • All the usual placebo test caveats apply
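
A placebo RD is just the usual local regression with a pre-treatment covariate in place of the outcome. The sketch below uses simulated data and a hypothetical helper `placebo_rd()`; one covariate is smooth through the cut-off and one has a jump of 2 built in to mimic sorting. It reports classical `lm()` standard errors for simplicity.

```r
# Simulated running variable and two pre-treatment covariates (illustrative)
set.seed(813)
n <- 5000
x <- runif(n, -1, 1)
d <- as.integer(x >= 0)
w_smooth <- 40 + 5 * x + rnorm(n, sd = 3)   # continuous at the cut-off
w_sorted <- w_smooth + 2 * d                # discontinuous: mimics sorting

# Local linear RD on a covariate w with a uniform kernel and bandwidth h
placebo_rd <- function(w, x, d, h = 0.25) {
  keep <- abs(x) < h
  fit <- lm(w ~ d * x, subset = keep)
  summary(fit)$coefficients["d", c("Estimate", "Std. Error", "Pr(>|t|)")]
}

res_smooth <- placebo_rd(w_smooth, x, d)
res_sorted <- placebo_rd(w_sorted, x, d)
print(rbind(smooth = res_smooth, sorted = res_sorted))
```

The estimated jump for the smooth covariate should be near zero, while the sorted covariate should show a clear discontinuity.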

Density tests

  • One intuitive approach to testing for “bunching” at the cut-point is to see whether the number of observations just below vs. just above is roughly the same.
    • Consistent with the “locally randomized experiment” interpretation of the RDD (Cattaneo, Titiunik and Vazquez-Bare, 2017).
  • Straightforward binomial hypothesis test: The share of observations above the cut-off within a certain window should be about 1/2.

Density tests

  • Example: Consider a window of \(.01\) for our close elections RD
house %>% filter(abs(x) <= .01) %>% group_by(d) %>% summarize(n())
# A tibble: 2 × 2
      d `n()`
  <int> <int>
1     0    50
2     1    56

Density tests

  • We have 56 elections above and 50 below. What’s the probability that we would see a split at least this uneven just by chance (under independent coin flips)?
binom.test(x=56, n = 106)

    Exact binomial test

data:  56 and 106
number of successes = 56, number of trials = 106, p-value = 0.6
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.429 0.626
sample estimates:
probability of success 
                 0.528 

Density tests

  • McCrary (2008) argues that if units are not able to manipulate their \(X_i\), then the density of \(X_i\) around the discontinuity should be continuous.
    • The histogram shouldn’t have any huge drop-off at \(c\)
  • Intuition
    • Construct a histogram of the running variable (with bins selected to not overlap at the discontinuity)
    • Smooth the histogram by fitting a local linear regression of the histogram heights on the bin mid-points
    • Test for the difference in the smoothed histogram near the discontinuity
  • More modern approach in Cattaneo, Jansson and Ma (2020) using a local polynomial density estimator
    • Implemented in rddensity
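
The three intuition steps above can be sketched in a few lines of base R. This is a simplified illustration on simulated data, not McCrary's test or the `rddensity` estimator: it uses a hypothetical helper `density_gap()`, reports only the estimated gap in the smoothed histogram, and produces no standard error or p-value.

```r
# Gap in the smoothed histogram of the running variable at the cut-off
density_gap <- function(x, cutoff = 0, h = 0.25, bin_width = 0.01) {
  # Step 1: bins that never straddle the cut-off
  k <- round(h / bin_width)
  breaks <- cutoff + (-k:k) * bin_width
  x_in <- x[x >= min(breaks) & x < max(breaks)]
  counts <- as.numeric(table(cut(x_in, breaks, right = FALSE)))
  mids <- head(breaks, -1) + bin_width / 2
  height <- counts / (length(x) * bin_width)   # histogram density heights
  # Step 2: local linear fit of heights on mid-points, separately by side
  above <- as.integer(mids > cutoff)
  fit <- lm(height ~ above * I(mids - cutoff))
  # Step 3: difference in the fitted heights at the cut-off
  unname(coef(fit)["above"])
}

set.seed(813)
x_clean <- runif(20000, -1, 1)   # no manipulation: smooth density

# Manipulation: half of the units just below the cut-off move just above it
x_manip <- x_clean
movers <- x_manip > -0.05 & x_manip < 0 & runif(length(x_manip)) < 0.5
x_manip[movers] <- -x_manip[movers]

gap_clean <- density_gap(x_clean)
gap_manip <- density_gap(x_manip)
print(c(clean = gap_clean, manipulated = gap_manip))
```

The gap for the unmanipulated running variable should be near zero, while the manipulated one shows excess mass just above the cut-off.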

Density tests

density_test <- rddensity::rddensity(house$x, c=0, bino=FALSE, massPoints=FALSE)
summary(density_test)

Manipulation testing using local polynomial density estimation.

Number of obs =       6558
Model =               unrestricted
Kernel =              triangular
BW method =           estimated
VCE method =          jackknife

c = 0                 Left of c           Right of c          
Number of obs         2740                3818                
Eff. Number of obs    1296                1365                
Order est. (p)        2                   2                   
Order bias (q)        3                   3                   
BW est. (h)           0.235               0.244               

Method                T                   P > |T|             
Robust                1.4138              0.1574              

Density tests

rdplot <- rddensity::rdplotdensity(density_test, house$x)

Overview

  • Regression discontinuity designs leverage a known treatment assignment process based on a “score” \(X_i\) and a cut-off \(c\).
    • Key assumption is continuity in the CEFs \(\mathbb{E}[Y_i(1) | X_i]\) and \(\mathbb{E}[Y_i(0) | X_i]\) around \(c\)
  • Be wary of sorting around the discontinuity
    • Placebo testing with pre-treatment covariates
    • Density tests
  • Estimation challenges are significant in RD
    • Why? Because we’re extrapolating to a discontinuity!
    • Results can be very sensitive to arbitrary modeling choices
  • Modern approaches to RD use local regressions (w/ some bandwidth around the cut-off) with lower-order polynomials (linear or quadratic).
    • Beware cubic polynomials and above! Poor boundary properties.
    • Typically assess sensitivity across lots of different bandwidth choices
  • Always plot your RDs!

Next Two Weeks

  • Four separate topics related to causal inference that we need to cover.
  • Mediation analysis
    • How can we decompose treatment effects into “direct” and “indirect” components?
    • What assumptions do we need to identify mediation pathways?
  • Sensitivity analysis for selection-on-observables
    • How bad does the confounding have to be to break our result?
    • How can we use this to defend the robustness of our design?
  • Inference with dependent treatment assignment
    • When should you use clustered standard errors?
    • When to use adjustments for spatial autocorrelation?
  • Causal inference with interference
    • What happens when we don’t have SUTVA?
    • Shift-share designs with a particular interference structure