Week 12: Regression Discontinuity Designs

PS 813 - Causal Inference

Anton Strezhnev

strezhnev@wisc.edu

University of Wisconsin-Madison

April 13, 2026

Last several weeks

Techniques for addressing unobserved confounding
Instrumental variables
- Find a randomized (natural) experiment that affects your treatment and doesn’t affect the outcome any other way.
Difference-in-differences
- Find an outcome where we know treatment has no effect (e.g. before treatment starts)
- Use the observed difference on that outcome to de-bias the comparison between treated and control.
Synthetic control
- Adjust for lagged Y to indirectly adjust for the latent factor
- Bias goes to zero as number of pre-treatment periods gets large

This week

Another strategy for identification when there exists unobserved confounding
Regression Discontinuity Designs: What if the treatment is assigned via a “cut-off” rule?
- All units below the cut-off remain under control
- All units above the cut-off get treated
If there’s an observed discontinuity in the outcome, that might be evidence of a causal effect if…
- The conditional expectations of the potential outcomes are truly continuous at the cut-off
Often described as “as-good-as-random” assignment near the cut-off.

Regression Discontinuity Designs

Three components:
- “Score”/“running”/“forcing” variable: \(X_i\)
- Cut-off: \(c\)
- Treatment is determined by \(X_i\) and \(c\)
\(X_i\) is predictive of the potential outcomes - it’s a confounder.
- But all we’ll need for identification is the smoothness of the CEFs of the potential outcomes around \(c\)
Presence of an unexpected “jump” around \(c\) is attributed to the causal effect of treatment.
Examples from education:
- Test score thresholds for allocating scholarships (Thistlethwaite and Campbell, 1960)
- Class size thresholds for splitting classes (Angrist and Lavy, 1999)
- GPA thresholds for majors (Bleemer and Mehta, 2022)
Examples from political science
- Close elections! (Lee, 2008; Broockman, 2009; Gerber and Hopkins, 2011)

Regression Discontinuity Designs

Setup:
- Treatment: \(D_i \in \{0, 1\}\)
- Potential outcomes: \(Y_i(1), Y_i(0)\)
- Observed outcomes (Consistency): \(Y_i = Y_i(1)D_i + Y_i(0)(1- D_i)\)
- Score/running variable \(X_i\)
- Threshold \(c\)

Regression Discontinuity Designs

Sharp RD: Treatment assignment is a deterministic function of the running variable \(X_i\) and the cut-off \(c\)

\[D_i = \mathbb{1}(X_i \ge c) \text{ for all } i\]

Close (2 party) FPTP elections:
- Candidates receiving above 50 percent (2-party) of the vote get elected
- Candidates below 50 percent of the (2-party) vote do not.
Sharp RD: Treatment is deterministic

\[\mathbb{P}(D_i = 1 | X_i \ge c) = 1\] \[\mathbb{P}(D_i = 1 | X_i < c) = 0\]

Regression Discontinuity Designs

Assumption: Continuity of the CEFs (around \(c\)):
- We assume that \(\mathbb{E}[Y_i(0) | X_i = x]\) and \(\mathbb{E}[Y_i(1) | X_i = x]\) are continuous in \(x\)
Implications:
- The CEF of \(Y_i(0)\) at \(X_i = c\) is equal to the limit of the CEF from the bottom
  
  \[\mathbb{E}[Y_i(0) | X_i = c] = \lim_{x \to c^{-}} \mathbb{E}[Y_i(0) | X_i = x]\]
- All units below the discontinuity take control.
  
  \[\mathbb{E}[Y_i(0) | X_i = c] = \lim_{x \to c^{-}} \mathbb{E}[Y_i(0) | D_i = 0, X_i = x]\]
- Then, by consistency
  
  \[\mathbb{E}[Y_i(0) | X_i = c] = \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x]\]

Regression Discontinuity Designs

The same holds for the CEF of \(Y_i(1)\) but taking the limit from the top

\[\mathbb{E}[Y_i(1) | X_i = c] = \lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x]\]
We can therefore identify the treatment effect at the threshold using the difference in the one-sided limits

\[\begin{align*}\tau_{\text{SRD}} &= \mathbb{E}[Y_i(1) - Y_i(0) | X_i = c]\\ &= \mathbb{E}[Y_i(1) | X_i = c] - \mathbb{E}[Y_i(0) | X_i = c]\\ &= \lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x] \end{align*}\]
Intuition - We use the data below and above the cut-off to extrapolate to the cut-off. The difference between the two extrapolations is our estimate of the ATE at the cut-off.
- All we need is continuity in the CEFs
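A minimal simulated sketch of this identification argument. The data-generating process, the effect size of 0.08 and every variable name below are illustrative assumptions, not the lecture's data. It approximates the two one-sided limits with simple means in shrinking windows around the cut-off.

```r
# Simulated sharp RD (all names and numbers are illustrative assumptions)
set.seed(813)
n  <- 20000
c0 <- 0                                   # cut-off
x  <- runif(n, -1, 1)                     # running variable
d  <- as.integer(x >= c0)                 # sharp assignment rule
y0 <- 0.5 + 0.4 * x + rnorm(n, sd = 0.1)  # potential outcome under control
y1 <- y0 + 0.08                           # potential outcome under treatment
y  <- ifelse(d == 1, y1, y0)              # consistency: we only observe one of them

# Difference in mean outcomes just above vs. just below the cut-off
gap_at <- function(h) {
  above <- mean(y[x >= c0 & x < c0 + h])
  below <- mean(y[x <  c0 & x >= c0 - h])
  above - below
}

windows <- c(0.5, 0.1, 0.02)
gaps    <- sapply(windows, gap_at)
print(data.frame(window = windows, gap = gaps))
```

In this simulation the expected gap is 0.08 + 0.4h, so it approaches the assumed effect only as the window shrinks. The leftover 0.4h term comes from the slope of the CEF, which is why estimation in practice fits a regression on each side rather than comparing raw means.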

Visualizing the sharp RD

Extrapolation

How does RD compare to other identification strategies?
Implicitly we have a selection-on-observables assumption: \(D_i\) is perfectly determined by \(X_i\)
- Conditional on \(X_i\) it’s independent of the potential outcomes
But unlike selection-on-observables, we have no overlap/positivity
- \(\mathbb{P}(D_i = 1 | X_i < c) = 0\)
RD relies on extrapolation from the observed treated/control observations to a common value of \(X_i\) - the cut-off or threshold \(c\).
- Extrapolation can be very sensitive to model specification - works best when there are many observations near \(c\)

Interpreting the RD Estimand

Like IV, RD identifies a local average treatment effect: the average effect for units with \(X_i = c\)
What if we’re not interested in the effect at the discontinuity but the effect for the sample as a whole?
- External validity challenge - how much effect heterogeneity is there?

Violations of continuity

What could cause the potential outcomes to be discontinuous around \(c\)?
Bunching/Sorting
- Suppose individuals knew the cut-off and could manipulate their \(X_i\) to get (or avoid) treatment
- Another selection-into-treatment problem.
- Can diagnose by looking at the histogram of observations around the discontinuity.
Other “treatments”
- Sometimes other factors will be “assigned” by a discontinuity along with the treatment
- Common with geographical RDs - a lot of things change across a border!

Example of “bunching”

Gelber et al. (2021) “Misperceptions of the Social Security Earnings Test and the Actuarial Adjustment: Implications for Labor Force Participation and Earnings”

Estimation: Local Regression

Estimation challenges

With infinite data, we can get arbitrarily close to the true ATE at the discontinuity
- More and more observations very close to \(c\)
But with actual datasets, we might have very few observations near \(c\).
- Need to use observations that are further away and fit a model to extrapolate to the discontinuity.
Bias-variance trade-off:
- Using observations that are very far from the discontinuity might increase bias (especially if our assumptions on the CEF are wrong) but reduce variance.
- Restricting ourselves to only “close” observations might reduce bias but increase the variance.
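The trade-off can be seen in a small Monte Carlo sketch. Everything here is an assumption made for illustration: the simulated CEF (which has curvature only on the treated side), the true jump of 0.08, and the helper name `rd_estimate`. A local linear fit is repeated at three window widths and the bias and spread of the estimates are compared.

```r
# Monte Carlo sketch of the bias-variance trade-off (simulated, illustrative only)
set.seed(813)
tau <- 0.08   # assumed true jump at the cut-off (0)

rd_estimate <- function(h, n = 1000) {
  x <- runif(n, -1, 1)
  d <- as.integer(x >= 0)
  # Curvature on the treated side makes a linear fit wrong far from the cut-off
  y <- 0.5 + 0.4 * x + tau * d + 1.0 * x^2 * d + rnorm(n, sd = 0.1)
  keep <- abs(x) < h                   # only use observations within h of the cut-off
  fit  <- lm(y ~ d * x, subset = keep) # separate intercept and slope on each side
  unname(coef(fit)["d"])               # estimated jump at x = 0
}

bandwidths <- c(0.05, 0.2, 1)
draws <- sapply(bandwidths, function(h) replicate(200, rd_estimate(h)))

results <- data.frame(
  h    = bandwidths,
  bias = colMeans(draws) - tau,
  sd   = apply(draws, 2, sd)
)
print(results)
```

Under these assumptions the widest window should show the largest absolute bias and the smallest standard deviation, and the narrowest window the reverse.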

Binned scatterplots

In a regression-discontinuity design, always plot your data!
- Raw scatterplots are hard to interpret – we want to start by trying to approximate the CEF without imposing any additional modeling assumptions
Binned scatterplots - Plot the average of \(Y_i\) within bins of \(X_i\)
- Choice of binning method (equally spaced vs. quantile) and number of bins is a bias-variance trade-off
Do we see the conditional expectation changing smoothly before and after the cut-point? Is there a visible gap at the cutpoint?
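As a sketch of what a binned scatterplot computes, the base R code below averages \(Y_i\) within equally spaced bins of \(X_i\). The data are simulated and the bin width of 0.05 is an arbitrary assumption. The bin edges are chosen so that no bin straddles the cut-off.

```r
# Binned means by hand (simulated data; all names and values are illustrative)
set.seed(813)
n <- 5000
x <- runif(n, -0.5, 0.5)
d <- as.integer(x >= 0)
y <- 0.45 + 0.5 * x + 0.07 * d + rnorm(n, sd = 0.1)

# 20 equally spaced bins with an edge exactly at the cut-off (0)
breaks <- seq(-0.5, 0.5, by = 0.05)
bin    <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)

binned <- data.frame(
  mid    = head(breaks, -1) + 0.025,          # bin mid-points
  mean_y = as.numeric(tapply(y, bin, mean)),  # average outcome in each bin
  n      = as.numeric(table(bin))             # observations per bin
)

plot(binned$mid, binned$mean_y, pch = 19,
     xlab = "Running variable (binned)", ylab = "Mean outcome")
abline(v = 0, lty = 2)
```

Fewer, wider bins give smoother but more biased averages; more, narrower bins give noisier ones.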

Illustration: Incumbency Advantage

Our running example will be the Lee (2008) “close elections” dataset.
What is the size of the incumbency advantage in the U.S. House?
- When Democrats barely win in time \(t\) does it have an effect on their vote share in \(t+1\)?
Variables
- \(X_i\) - Democratic margin of victory in time \(t\)
- \(Y_i\) - Democratic vote share in time \(t+1\)
- \(D_i\) - Victory in time \(t\) (margin \(\ge 0\))

Illustration: Incumbency Advantage

library(tidyverse)  # read_csv(), %>%, ggplot()
library(estimatr)   # lm_robust()
house <- read_csv("assets/house.csv")
house$d <- as.integer(house$x >= 0)
house %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) + geom_point() + geom_vline(xintercept=0, lty=2) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Illustration: Incumbency Advantage

house %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=50, size=2, geom='point') +  geom_vline(xintercept=0, lty=2) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Illustration: Incumbency Advantage

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') +  geom_vline(xintercept=0, lty=2) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Local polynomial regression

Goal: Estimate \(\lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x]\) and \(\lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x]\)
Fit a model on the treated and control sides (respectively) and get the prediction at the cut-point.
What model? A local polynomial regression
- To reduce the approximation error from our choice of polynomial use only units with \(X_i\) close to \(c\) (within some bandwidth \(h\))
- Use the model to capture changes in \(\mathbb{E}[Y_i(d) | X_i = x]\) even near \(c\)

Local polynomial regression

Choose a polynomial order \(p\) and a kernel function \(K(\cdot)\)
- Kernel captures how we should weight observations near the discontinuity vs. far
- Lots of options: “triangular” is common (diminishing weight further from the discontinuity) but we’ll just use a uniform kernel for this example.
Choose a bandwidth \(h\)
- Observations outside of the bandwidth receive a weight of \(0\). Observations inside the bandwidth get weight \(K\left(\frac{X_i - c}{h}\right)\)
- With a uniform kernel, all observations get the same weight if they’re inside the bandwidth
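A short sketch of how kernel weights could be computed. The function names are hypothetical helpers, not from any package, and the normalizing constant of the uniform kernel is irrelevant for weighted least squares. The strict inequality mirrors the `abs(x) < h` filter used in these slides.

```r
# Kernel weight helpers (hypothetical names, for illustration only)
uniform_kernel    <- function(u) 0.5 * as.numeric(abs(u) < 1)  # constant inside the bandwidth
triangular_kernel <- function(u) pmax(0, 1 - abs(u))           # declines linearly to 0 at |u| = 1

# Weight for each observation: K((x - c) / h)
kernel_weights <- function(x, c0, h, kernel = triangular_kernel) {
  kernel((x - c0) / h)
}

x_grid <- c(-0.15, -0.10, -0.05, 0, 0.05, 0.10, 0.15)
w <- data.frame(
  x          = x_grid,
  uniform    = kernel_weights(x_grid, c0 = 0, h = 0.1, kernel = uniform_kernel),
  triangular = kernel_weights(x_grid, c0 = 0, h = 0.1)
)
print(w)
```

Observations at or beyond the bandwidth get zero weight under both kernels; inside it, the triangular kernel puts the most weight on observations closest to the cut-off.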

Local polynomial regression

Fit a regression among observations \(X_i \ge c\) of \(Y_i\) on the polynomial of \((X_i - c), (X_i - c)^2, \dotsc, (X_i - c)^p\), weighting each observation by its kernel weight.

\[\widehat{\mathbb{E}}[Y_i | X_i = x] = \hat{\mu}_{+} + \hat{\mu}_{+,1}(x - c) + \dotsc + \hat{\mu}_{+,p}(x - c)^p \quad \text{for } x \ge c\]
Fit a regression among observations \(X_i < c\) of \(Y_i\) on the polynomial of \((X_i - c), (X_i - c)^2, \dotsc, (X_i - c)^p\), weighting each observation by its kernel weight.

\[\widehat{\mathbb{E}}[Y_i | X_i = x] = \hat{\mu}_{-} + \hat{\mu}_{-,1}(x - c) + \dotsc + \hat{\mu}_{-,p}(x - c)^p \quad \text{for } x < c\]
Our RD estimate is the difference in intercepts from this regression: \(\hat{\tau}_{\text{SRD}} = \hat{\mu}_+ - \hat{\mu}_-\)
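The two-regression recipe can be sketched on simulated data with base R's `lm()`. The data-generating process, the bandwidth of 0.1 and all object names are assumptions for illustration. The last lines check that a single regression with the running variable interacted with treatment gives the same jump.

```r
# Local linear RD by hand (simulated data; names and values are illustrative)
set.seed(813)
n  <- 10000
c0 <- 0
h  <- 0.1
x  <- runif(n, -0.5, 0.5)
d  <- as.integer(x >= c0)
y  <- 0.45 + 0.5 * x + 0.07 * d + rnorm(n, sd = 0.1)

sim   <- data.frame(y = y, d = d, xc = x - c0)   # xc: running variable centered at c0
sim$w <- pmax(0, 1 - abs(sim$xc) / h)            # triangular kernel weights
near  <- subset(sim, w > 0)                      # drop observations outside the bandwidth

# Separate weighted regressions above and below the cut-off
fit_above <- lm(y ~ xc, data = subset(near, d == 1), weights = w)
fit_below <- lm(y ~ xc, data = subset(near, d == 0), weights = w)
tau_hat   <- unname(coef(fit_above)[1] - coef(fit_below)[1])  # difference in intercepts

# Equivalent single regression with all terms interacted with treatment
fit_joint <- lm(y ~ d * xc, data = near, weights = w)
tau_joint <- unname(coef(fit_joint)["d"])

print(c(separate = tau_hat, joint = tau_joint))
```

The point estimates agree exactly; the standard errors from plain `lm()` are classical rather than heteroskedasticity-robust, so they are not a substitute for a robust variance estimator.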

Local polynomial regression

We can do this all in a single regression with all polynomial terms interacted with the treatment variable
Let’s do a local linear fit with a bandwidth of \(.1\) and a uniform kernel

lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .1))

            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4640    0.00861 53.9084 2.96e-323   0.4471   0.4809 1204
d             0.0604    0.01264  4.7776  1.99e-06   0.0356   0.0852 1204
I(x - 0)      0.6440    0.14181  4.5416  6.14e-06   0.3658   0.9223 1204
d:I(x - 0)    0.0104    0.20969  0.0498  9.60e-01  -0.4009   0.4218 1204

The coefficient on \(d\) is the estimated gap at the discontinuity.

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=as.numeric(y>.5), colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Pr(Democratic victory at time t+1)")

Local polynomial regression

How does changing the bandwidth affect the local linear estimate?
Bandwidth of \(0.05\):

lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .05))

            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)   0.4691     0.0105 44.6878 5.87e-194   0.4485     0.49 606
d             0.0487     0.0160  3.0512  2.38e-03   0.0174     0.08 606
I(x - 0)      0.8990     0.3477  2.5857  9.95e-03   0.2162     1.58 606
d:I(x - 0)    0.0197     0.5086  0.0387  9.69e-01  -0.9792     1.02 606

Bandwidth of \(.2\):

lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .2))

            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4535    0.00625  72.577 0.00e+00   0.4413   0.4658 2260
d             0.0778    0.00922   8.440 5.59e-17   0.0597   0.0959 2260
I(x - 0)      0.4007    0.05594   7.162 1.07e-12   0.2910   0.5103 2260
d:I(x - 0)    0.0749    0.08162   0.918 3.59e-01  -0.0851   0.2350 2260

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .05)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.05, lty=3) + geom_vline(xintercept=.05, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Local polynomial regression

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x, data=house %>% filter(abs(x) < .2)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.2, lty=3) + geom_vline(xintercept=.2, lty=3) + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Kernel weights

What happens if we use a triangular kernel rather than a uniform kernel (at bw = .1)?
- For the local linear regression?

house$kernelwt <- (1 - abs((house$x - 0)/.1))*as.numeric(abs(house$x) < .1)
# Unweighted
lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .1))

            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4640    0.00861 53.9084 2.96e-323   0.4471   0.4809 1204
d             0.0604    0.01264  4.7776  1.99e-06   0.0356   0.0852 1204
I(x - 0)      0.6440    0.14181  4.5416  6.14e-06   0.3658   0.9223 1204
d:I(x - 0)    0.0104    0.20969  0.0498  9.60e-01  -0.4009   0.4218 1204

# Weighted
lm_robust(y ~ d + d*I(x-0), data=house %>% filter(abs(x) < .1), weights=kernelwt)

            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   0.4631    0.00857  54.029 0.00e+00    0.446   0.4799 1204
d             0.0594    0.01294   4.590 4.90e-06    0.034   0.0848 1204
I(x - 0)      0.6169    0.16164   3.816 1.42e-04    0.300   0.9340 1204
d:I(x - 0)    0.0939    0.24947   0.376 7.07e-01   -0.396   0.5833 1204

Kernel weights

How about for the quadratic?

# Unweighted
lm_robust(y ~ d + d*I(x-0) + d*I((x-0)^2), data=house %>% filter(abs(x) < .1))

               Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)      0.4616     0.0113  40.842 2.03e-229   0.4395   0.4838 1202
d                0.0578     0.0172   3.366  7.88e-04   0.0241   0.0914 1202
I(x - 0)         0.5068     0.5422   0.935  3.50e-01  -0.5570   1.5706 1202
I((x - 0)^2)    -1.3592     5.4595  -0.249  8.03e-01 -12.0704   9.3519 1202
d:I(x - 0)       0.4427     0.8058   0.549  5.83e-01  -1.1382   2.0237 1202
d:I((x - 0)^2)  -1.5958     7.9289  -0.201  8.41e-01 -17.1518  13.9602 1202

# Weighted
lm_robust(y ~ d + d*I(x-0) + d*I((x-0)^2), data=house %>% filter(abs(x) < .1), weights=kernelwt)

               Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)      0.4599     0.0103  44.654 1.67e-257   0.4397   0.4801 1202
d                0.0637     0.0161   3.963  7.83e-05   0.0321   0.0952 1202
I(x - 0)         0.3750     0.5737   0.654  5.13e-01  -0.7505   1.5005 1202
I((x - 0)^2)    -2.9874     6.4512  -0.463  6.43e-01 -15.6442   9.6695 1202
d:I(x - 0)       0.2521     0.8541   0.295  7.68e-01  -1.4236   1.9278 1202
d:I((x - 0)^2)   4.0225     9.4273   0.427  6.70e-01 -14.4733  22.5183 1202

Kernel weights

Quadratic: Uniform kernel

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x + I(x^2), data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") + xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Kernel weights

Quadratic: Triangular kernel

house %>% filter(abs(x) < .2) %>% ggplot(aes(x=x, y=y, colour=as.factor(d), weight=kernelwt)) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x + I(x^2), data=house %>% filter(abs(x) < .1)) +
  geom_vline(xintercept=0, lty=2) + geom_vline(xintercept=-.1, lty=3) + geom_vline(xintercept=.1, lty=3) + theme_bw() + theme(legend.position="none") + xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

Bandwidth selection

General criteria for how to choose \(h\)
- Large \(h\): More bias, lower variance
- Small \(h\): Less bias, higher variance
If the true CEF is linear, we can get away with choosing a larger \(h\) (since the model will be correct)
- If the true CEF is non-linear, then our linear approximation will only be “good” for a small window.
- How bad depends on the degree of non-linearity
An “optimal” bandwidth minimizes the Mean Square Error \((\text{Bias}^2 + \text{Variance})\)
Imbens and Kalyanaraman (2012) and Calonico et al. (2014) - derive an approximation to the MSE to find an optimal solution.
- Depends on three main factors: density of observations around \(c\), variance of \(Y\) around \(c\) and the curvature of the CEF around \(c\).
Confidence intervals built using the MSE-optimal bandwidth will still under-cover (the MSE-optimal estimator is biased).
- Solution: Use a smaller \(h\) than optimal or use a bias-correction (Cattaneo et al., 2014)
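A stylized version of the MSE calculation, as a sketch rather than the exact expressions in those papers. Assume the leading bias of a local polynomial of order \(p\) is \(Bh^{p+1}\) and its leading variance is \(V/(nh)\), where \(B\) and \(V\) are placeholder constants standing in for the curvature of the CEF and for the density and outcome variance around \(c\). Then

\[\text{MSE}(h) \approx B^2 h^{2(p+1)} + \frac{V}{nh}\]

Setting the derivative with respect to \(h\) to zero gives

\[h_{\text{MSE}} = \left(\frac{V}{2(p+1)B^2}\right)^{\frac{1}{2p+3}} n^{-\frac{1}{2p+3}}\]

For a local linear fit (\(p = 1\)) the bandwidth shrinks at rate \(n^{-1/5}\). More curvature (larger \(B\)) calls for a smaller \(h\); noisier outcomes or sparser data near \(c\) (larger \(V\)) call for a larger one.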

Discussion: Higher-order polynomials

Chen, Ebenstein and Greenstone (2013, PNAS) attempt to estimate the effect of air pollution on life expectancy using data from China.
Design - Geographic RD using the Huai River - From 1950 to 1980 the Chinese government subsidized the use of coal for heating in cities north of the Huai River
- The authors argue that this created an exogenous shock in air pollution at the boundary.
- Use a regression-discontinuity approach to estimate the effect of exposure to air pollution on life expectancy.

Discussion: Higher-order polynomials

They estimate the effect using a global third-order polynomial of distance to the Huai river

Discussion: Higher-order polynomials

house %>% filter(abs(x) < 1) %>% ggplot(aes(x=x, y=y, colour=as.factor(d))) +
  stat_summary_bin(fun='mean', bins=20, size=2, geom='point') + geom_smooth(method="lm_robust", formula = y ~ x + I(x^2) + I(x^3)) +
  geom_vline(xintercept=0, lty=2)  + theme_bw() + theme(legend.position="none") +
  xlab("Democratic margin of victory at time t") + ylab("Democratic vote share at time t+1")

“Fuzzy” RD

Fuzzy Regression Discontinuity

Under a “fuzzy” RD, the probability of receiving treatment is no longer a jump from 0 to 1 at the cut-point.
However, we will still assume that there is a jump in the probability of receiving treatment at the discontinuity
Assumption - Discontinuity in the propensity score at \(c\)

\[\lim_{x \to c^{+}} \mathbb{P}(D_i = 1 | X_i = x) \neq \lim_{x \to c^{-}} \mathbb{P}(D_i = 1 | X_i = x)\]
Often plausible in settings where the cutpoint acts as an encouragement to take treatment
- e.g. Only individuals w/ incomes below \(c\) are eligible to apply for a program but they are not forced into it.

Fuzzy RD is IV

Being above the cut-off is an instrument
- \(D_i(1)\): The treatment taken by unit \(i\) when \(X_i \ge c\)
- \(D_i(0)\): The treatment taken by unit \(i\) when \(X_i < c\)
Under our original continuity assumption, applying the Sharp RD estimator to a Fuzzy RD setting recovers the intent to treat effect

\[\lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x] = \mathbb{E}[(D_i(1) - D_i(0))(Y_i(1) - Y_i(0)) | X_i = c] = \tau_{\text{ITT}}\]
ITT is driven by two elements
- \(Y_i(1) - Y_i(0)\): The actual treatment effect
- \(D_i(1) - D_i(0)\): The effect of being above the discontinuity on taking treatment.

Fuzzy RD is IV

What do we need to recover a LATE?
- Continuity of \(\mathbb{E}[Y_i(1) | X_i = x]\) and \(\mathbb{E}[Y_i(0) | X_i = x]\) around \(c\) is akin to exogeneity + the exclusion restriction for the instrument
- “Local randomization” assumptions also give a similar intuition.
Need one more assumption: monotonicity: \(D_i(1) \ge D_i(0)\)
- A “no-defiers” assumption at the discontinuity.
- No one who refuses the treatment when above the cut-off would take the treatment if they were below the cut-off.
- No one who takes the treatment when below the cut-off would not take it if they were above the cut-off.
Under continuity and monotonicity, we get a familiar ratio estimator

\[\frac{\lim_{x \to c^{+}} \mathbb{E}[Y_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[Y_i | X_i = x]}{\lim_{x \to c^{+}} \mathbb{E}[D_i | X_i = x] - \lim_{x \to c^{-}} \mathbb{E}[D_i | X_i = x]} = \mathbb{E}[Y_i(1) - Y_i(0)| X_i = c, D_i(1) > D_i(0)]\]
The ratio of ITT and first stage RD estimates recovers the local average treatment effect at the cut-point among compliers.
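A simulated sketch of the ratio estimator. The compliance shares, the effect sizes and all names are assumptions chosen for illustration. Compliers have an effect of 0.10 while always-takers have an effect of 0.30, so the ratio should recover the complier effect rather than an average over everyone.

```r
# Fuzzy RD as a ratio of two sharp-RD jumps (simulated, illustrative only)
set.seed(813)
n <- 50000
x <- runif(n, -1, 1)
z <- as.integer(x >= 0)   # indicator for being above the cut-off (the instrument)

# Compliance types; there are no defiers, so monotonicity holds by construction
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.5, 0.2, 0.3))
d <- ifelse(type == "always", 1L, ifelse(type == "never", 0L, z))

effect <- ifelse(type == "complier", 0.10, 0.30)   # heterogeneous treatment effects
y <- 0.4 + 0.3 * x + effect * d + rnorm(n, sd = 0.1)

near <- abs(x) < 0.2      # uniform kernel, bandwidth 0.2 (arbitrary choice)
reduced_form <- unname(coef(lm(y ~ z * x, subset = near))["z"])  # jump in the outcome
first_stage  <- unname(coef(lm(d ~ z * x, subset = near))["z"])  # jump in treatment uptake
late_hat     <- reduced_form / first_stage

print(c(reduced_form = reduced_form, first_stage = first_stage, late = late_hat))
```

By construction the first-stage jump is 0.5 (the complier share) and the reduced-form jump is 0.5 × 0.10, so the ratio targets 0.10.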

Estimation

We can just take the ratio of the two regression coefficients from the reduced form and the first stage RDs
Or we can do this with a 2SLS approach. For a (local) linear model:
- “First stage”
  
  \[D_i = \delta_0 + \rho \mathbb{1}(X_i \ge c) + \delta_1(X_i - c) + \delta_2(X_i - c)\mathbb{1}(X_i \ge c)\]
- Second stage
  
  \[Y_i = \beta_0 + \tau \hat{D}_i + \beta_1(X_i - c) + \beta_2(X_i - c)\mathbb{1}(X_i \ge c)\]
Note that the interaction with the cutpoint indicator still appears in the second stage so that the coefficients of the model can differ above and below the cutpoint.
All weak instrument problems still apply here! Need a strong first stage to do reliable inference.
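A sketch of the two stages done by hand on simulated data. The data-generating process and all names are illustrative assumptions. With one instrument and the same controls in both stages, the second-stage coefficient equals the ratio of the reduced-form and first-stage jumps.

```r
# Manual 2SLS for a local linear fuzzy RD (simulated, illustrative only)
set.seed(813)
n <- 20000
x <- runif(n, -1, 1)
z <- as.integer(x >= 0)                 # above-cut-off indicator
u <- runif(n)                           # unobserved factor driving uptake and outcome
d <- as.integer(u < 0.2 + 0.6 * z)      # uptake jumps from 20% to 80% at the cut-off
y <- 0.4 + 0.3 * x + 0.1 * d + 0.2 * u + rnorm(n, sd = 0.1)

near <- data.frame(x, z, d, y)[abs(x) < 0.2, ]   # uniform kernel, bandwidth 0.2

# First stage: treatment on the cut-off indicator and the local linear terms
first      <- lm(d ~ z * x, data = near)
near$d_hat <- fitted(first)

# Second stage: outcome on fitted treatment, keeping the same local linear terms
second   <- lm(y ~ d_hat + x + z:x, data = near)
tau_2sls <- unname(coef(second)["d_hat"])

# Ratio of reduced-form and first-stage jumps
reduced   <- lm(y ~ z * x, data = near)
tau_ratio <- unname(coef(reduced)["z"] / coef(first)["z"])

print(c(two_stage = tau_2sls, ratio = tau_ratio))
```

The standard errors printed by the second `lm()` are not valid because they ignore the estimation of the first stage; a dedicated IV routine such as `estimatr::iv_robust()` would be the usual choice for inference.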

Example: Majoring in Economics

Bleemer and Mehta (2022, AEJ:AE) look at the effects of majoring in economics as an undergraduate on early-career wages.
Design: Fuzzy RD using a GPA cut-off
- UC Santa Cruz’s Economics department implemented a policy under which students with a GPA below 2.8 in Econ 1 and 2 could declare the major only “at the discretion of the department”

Example: Majoring in Economics

Some students below the threshold still majored, and some students above the threshold did not

Example: Majoring in Economics

The reduced form is a sharp RD

Diagnosing assumptions

Placebo checks

One consequence of the continuity assumption in RD is that we should also expect any pre-treatment covariates to be continuous around the cut-point as well.
Common to conduct “placebo” RDs across the discontinuity on covariates known to be unaffected by treatment.
- Presence of a non-zero discontinuity would suggest “sorting” with different types of respondents more likely to be below vs. above
All the usual placebo test caveats apply
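A placebo RD can be sketched by swapping the outcome for a pre-treatment covariate in the same local linear regression. The data are simulated and the covariate name `prior_share` is hypothetical. Because the covariate is generated to be smooth through the cut-off, the estimated jump should be close to zero.

```r
# Placebo RD on a pre-treatment covariate (simulated, illustrative only)
set.seed(813)
n <- 10000
x <- runif(n, -0.5, 0.5)
d <- as.integer(x >= 0)

# Hypothetical pre-treatment covariate: related to x but with no jump at 0
prior_share <- 0.5 + 0.6 * x + rnorm(n, sd = 0.1)

near        <- abs(x) < 0.1   # same style of bandwidth as the main analysis
placebo_fit <- lm(prior_share ~ d * x, subset = near)
placebo     <- summary(placebo_fit)$coefficients["d", ]

print(round(placebo, 4))
```

A large, statistically distinguishable jump in a covariate like this would be a warning sign of sorting; the classical standard errors from `lm()` are used here only to keep the sketch dependency-free.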

Density tests

One intuitive approach to testing for “bunching” at the cut-point is to see whether the number of observations just below vs. just above is roughly the same.
- Consistent with the “locally randomized experiment” interpretation of the RDD (Cattaneo, Titiunik and Vazquez-Bare, 2017).
Straightforward binomial hypothesis test: Within a given window, the share of observations above the cut-off should be about 1/2.

Density tests

Example: Consider a window of \(.01\) for our close elections RD

house %>% filter(abs(x) <= .01) %>% group_by(d) %>% summarize(n())

# A tibble: 2 × 2
      d `n()`
  <int> <int>
1     0    50
2     1    56

Density tests

We have 56 elections above and 50 below. What’s the probability that we would see a split at least this uneven just by chance (under independent fair coin flips)?

binom.test(x=56, n = 106)


    Exact binomial test

data:  56 and 106
number of successes = 56, number of trials = 106, p-value = 0.6
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.429 0.626
sample estimates:
probability of success 
                 0.528

Density tests

McCrary (2008) argues that if units are not able to manipulate their \(X_i\), then the density of \(X_i\) around the discontinuity should be continuous.
- The histogram shouldn’t have any huge drop-off at \(c\)
Intuition
- Construct a histogram of the running variable (with bins selected to not overlap at the discontinuity)
- Smooth the histogram by fitting a local linear regression of the histogram heights on the bin mid-points
- Test for the difference in the smoothed histogram near the discontinuity
More modern approach in Cattaneo, Jansson and Ma (2020) using a local polynomial density estimator
- Implemented in rddensity
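The three steps of the intuition can be sketched in base R. This is a rough illustration on simulated data with an arbitrary bin width and bandwidth, not McCrary's procedure as published: it produces no standard errors and makes no data-driven tuning choices.

```r
# Rough sketch of the histogram-smoothing intuition (simulated, illustrative only)
set.seed(813)
x <- runif(20000, -1, 1)   # no manipulation: the density is flat through 0

# Step 1: histogram with a bin edge exactly at the cut-off
breaks  <- seq(-1, 1, length.out = 101)   # 100 bins of width 0.02
hst     <- hist(x, breaks = breaks, plot = FALSE)
heights <- hst$density
mids    <- hst$mids

# Step 2: local linear fit of bin heights on bin mid-points, separately by side
h         <- 0.3
left      <- mids < 0 & mids > -h
right     <- mids > 0 & mids <  h
fit_left  <- lm(heights ~ mids, subset = left)
fit_right <- lm(heights ~ mids, subset = right)

# Step 3: compare the smoothed densities extrapolated to the cut-off
f_left   <- unname(coef(fit_left)[1])    # estimated density just below 0
f_right  <- unname(coef(fit_right)[1])   # estimated density just above 0
log_diff <- log(f_right) - log(f_left)   # near 0 when there is no bunching

print(c(f_left = f_left, f_right = f_right, log_diff = log_diff))
```

With bunching just above the cut-off, `f_right` would exceed `f_left` and the log difference would move away from zero.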

Density tests

density_test <- rddensity::rddensity(house$x, c=0, bino=FALSE, massPoints=FALSE)
summary(density_test)


Manipulation testing using local polynomial density estimation.

Number of obs =       6558
Model =               unrestricted
Kernel =              triangular
BW method =           estimated
VCE method =          jackknife

c = 0                 Left of c           Right of c          
Number of obs         2740                3818                
Eff. Number of obs    1296                1365                
Order est. (p)        2                   2                   
Order bias (q)        3                   3                   
BW est. (h)           0.235               0.244               

Method                T                   P > |T|             
Robust                1.4138              0.1574

Density tests

rdplot <- rddensity::rdplotdensity(density_test, house$x)

Overview

Regression discontinuity designs leverage a known treatment assignment process based on a “score” \(X_i\) and a cut-off \(c\).
- Key assumption is continuity in the CEFs \(\mathbb{E}[Y_i(1) | X_i]\) and \(\mathbb{E}[Y_i(0) | X_i]\) around \(c\)
Be wary of sorting around the discontinuity
- Placebo testing with pre-treatment covariates
- Density tests
Estimation challenges are significant in RD
- Why? Because we’re extrapolating to a discontinuity!
- Results can be very sensitive to arbitrary modeling choices
Modern approaches to RD use local regressions (w/ some bandwidth around the cut-off) with lower-order polynomials (linear or quadratic).
- Beware cubic polynomials and above! Poor boundary properties.
- Typically assess sensitivity across lots of different bandwidth choices
Always plot your RDs!

Next Two Weeks

Four separate topics related to causal inference that we need to cover.
Mediation analysis
- How can we decompose treatment effects into “direct” and “indirect” components?
- What assumptions do we need to identify mediation pathways?
Sensitivity analysis for selection-on-observables
- How bad does the confounding have to be to break our result?
- How can we use this to defend the robustness of our design?
Inference with dependent treatment assignment
- When should you use clustered standard errors?
- When to use adjustments for spatial autocorrelation?
Causal inference with interference
- What happens when we don’t have SUTVA?
- Shift-share designs with a particular interference structure