library(tidyverse)
library(here)
library(janitor)
library(jtools)
library(gtsummary)
library(gt)
library(MatchIt) # matching
library(cobalt) # balance + love plots
library(fixest) # fast fixed effects
library(scales) # plotting
Assignment instructions
Working with classmates to troubleshoot code and concepts is encouraged. If you collaborate, list collaborators at the top of your submission.
All written responses must be written independently (in your own words).
Keep your work readable: Use clear headings and label plot elements thoughtfully.
Assignment submission (YOUR NAME): ______________________________________
Introduction
In this assignment you will do a kind of political weather forecasting, except the “storms” we care about are electoral swings that might follow local wind turbine development.
In Stokes (2015), the idea is that a policy with diffuse benefits (cleaner electricity) can create concentrated local costs (turbines nearby), and those local opponents may “send a signal” at the ballot box (i.e., NIMBYISM). Your job is to use two statistical tools:
- Matching: Can we create a more apples-to-apples comparison between precincts that did vs. did not end up near turbine proposals?
- Fixed effects + Difference-in-Differences: Can we use repeated elections to estimate how within-precinct changes in turbine exposure relate to changes in incumbent vote share?
Learning goal: Replicate the matching and fixed effects analyses from the study:
Stokes (2015): “Electoral Backlash against Climate Policy: A Natural Experiment on Retrospective Voting and Local Resistance to Public Policy.”
- Study: Stokes (2015) - Article
- Data source: Dataverse-Stokes2015
Setup: Load libraries
- Load libraries (+ install if needed)
Part 1: Study Background
1A. Dive into the details of the study design and evaluation plan
Goal: Get familiar with the study setting, environmental issue, and policy under evaluation.
1A.Q1 Summarize the environmental policy issue, the outcome of interest, and the intervention being evaluated. Be sure to include a brief description of each of the following key elements of the study (unit of analysis, outcome, treatment, comparison group):
Response: _________________________
1A.Q2 Why might turbine proposals be correlated with baseline political preferences or rural areas? Provide two plausible mechanisms, and explain why each creates confounding.
Response: _________________________
1B. Break down the causal inference strategy and identify threats to identification:
1B.Q1 What is the key identifying assumption for a fixed effects / difference-in-differences design? Explain how this assumption, when satisfied, supports a causal interpretation of the estimate:
Response: _________________________
1B.Q2 What is the reason for using a fixed effects approach from a causal inference perspective? Summarize it in the context of the study (in your own words).
Response: _________________________
1B.Q3 What part of the SUTVA assumption is most likely violated in the context of this study design (and why)?
Response: _________________________
1B.Q4 Why does spillover matter when estimating an unbiased treatment effect?
Response: _________________________
1B.Q5 How do the authors assess the risk of spillovers, and what analytic choice do they make to attempt to mitigate the risk that spillover biases the causal estimate?
Response: _________________________
Part 2: Matching
We will start by evaluating the 2007 survey (cross-sectional) data. Treatment is defined by whether a precinct is near a turbine proposal (within 3 km).
Goal: Match precincts using pre-treatment covariates and then estimate the effect of proposed wind turbines on incumbent vote share.
2A. Load data for matching
- Read in the data file stokes15_survey2007.csv
- Code precinct_id and district_id as factors
- Take a look at the data
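A minimal sketch of this step (the "data" subfolder is an assumption; adjust the path to wherever you saved the file):

```r
# Sketch: load the 2007 survey data (folder name is an assumption)
match_data <- read_csv(here("data", "stokes15_survey2007.csv")) %>%
  mutate(
    precinct_id = as.factor(precinct_id),
    district_id = as.factor(district_id)
  )
glimpse(match_data) # take a look at the data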
match_data <-
2A.Q1 Intuition check: Why match? Explain the rationale for using this method.
Response: _________________________
2B. Check imbalance (before matching)
- Create a covariate balance table comparing treated and control precincts
- Treatment indicator: proposed_turbine_3km
- Include pre-treatment covariates: log_home_val_07, p_uni_degree, log_median_inc, log_pop_denc
- Use the tbl_summary() function from the {gtsummary} package.
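One way the table could be built (a sketch; the select() call just keeps the treatment indicator and the four covariates):

```r
# Sketch: covariate balance table, stratified by treatment status
match_data %>%
  select(proposed_turbine_3km, log_home_val_07, p_uni_degree,
         log_median_inc, log_pop_denc) %>%
  tbl_summary(by = proposed_turbine_3km)
```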
match_data %>%
2B.Q1 Summarize the table output: Which covariates look balanced/imbalanced?
Response: _________________________
2B.Q2 Describe in your own words why these covariates might be expected to confound the treatment estimate:
Response (2-4 sentences): _________________________
2B.Q3 Intuition check: What type of data do you need to conduct a matching analysis?
Response: _________________________
Conduct matching estimation using the {MatchIt} package:
Learning goals:
- Approximate the Mahalanobis matching method used in Stokes (2015)
- Implement another common matching approach called
propensity score matching
2C. Mahalanobis nearest-neighbor matching
- Conduct Mahalanobis matching
- Use nearest-neighbor match without replacement using Mahalanobis distance
- Use 1-to-1 matching (match one control unit to each treatment unit)
- Extract the matched data using
match.data()
set.seed(2412026)
match_model <- matchit(
# Treatment_indicator ~ Pre_treatment_covariates
data = match_data,
method = "nearest", # Nearest neighbor matching
distance = "mahalanobis", # Mahalanobis distance
ratio = 1, # Match one control unit to one treatment unit (1:1 matching)
replace = FALSE # Control observations are not replaced
)
# Extract matched data
matched_data <- match.data(match_model)
summary(match_model)
2C.Q1 Using the summary() output: Which covariate had the largest and smallest Std. Mean Diff. before matching? Next, compare the largest/smallest Std. Mean Diff. after matching.
Response: _________________________
2D. Create a “love plot” using love.plot() ❤️
- Plot mean differences for data before & after matching across all pre-treatment covariates
- This is a quick way to evaluate how effective matching was at achieving balance.
- Make a love plot of standardized mean differences (SMDs) before vs after matching.
- Include a threshold line at 0.1.
- In the love plot, display mean.diffs
new_names <- data.frame(
old = c("log_home_val_07", "p_uni_degree", "log_median_inc", "log_pop_denc"),
new = c("Home Value (log)", "Percent University Degree",
"Median Income (log)", "Population Density (log)"))
# Love plot
love.plot(match_model, stats = "mean.diffs",
thresholds = c(m = 0.1),
var.names = new_names)
2D.Q1 Interpret the love plot in your own words:
Response: _________________________
Propensity score matching
2E. Propensity Score Matching (PSM)
- Estimate 1:1 nearest-neighbor Propensity Score Matching
- Same code as above, except change distance = "logit"
set.seed(2412026)
propensity_scores <-
Create a table displaying covariate balance using cobalt::bal.tab()
Use bal.tab() to report balance before and after matching.
bal.tab(propensity_scores,
var.names = new_names)
2E.Q1 Compare Mahalanobis vs. propensity score matching: Which method did a better job of achieving balance?
Response: _________________________
2F. Estimate an effect in the matched sample
Using the matched data (Mahalanobis method), estimate the effect of treatment on the change in incumbent vote share (change_liberal).
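A sketch of the estimation step. The weights column comes from match.data(); with 1:1 matching without replacement all weights equal 1, but including them is good practice:

```r
# Sketch: treatment effect on change in incumbent vote share, matched sample
reg_match <- lm(change_liberal ~ proposed_turbine_3km,
                data = matched_data, weights = weights)
summ(reg_match, model.fit = FALSE)
```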
reg_match <-
# summ(, model.fit = FALSE)
2F.Q1 Have you identified a causal estimate using this approach? Why or why not?
Response: _________________________
2F.Q2 When using a matching method, what is the main threat to causal identification?
Response: _________________________
2F.Q3 Describe why the treatment estimate represents the Average Treatment effect on the Treated (ATT), and explain why this is the case relative to estimating the Average Treatment Effect (ATE).
Response: _________________________
Part 3: Panel Data, Fixed Effects, and Difference-in-Differences
Data source: Dataverse-Stokes2015
3A: Read in the panel data + code variables precinct_id and year as factors
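A sketch of the loading step; the filename stokes15_panel.csv is hypothetical, so substitute the actual panel file from your Dataverse download:

```r
# Sketch: load the panel data (filename is hypothetical -- use your Dataverse file)
panel_data <- read_csv(here("data", "stokes15_panel.csv")) %>%
  mutate(
    precinct_id = as.factor(precinct_id),
    year = as.factor(year)
  )
```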
panel_data <-
# HINT: Try running `tabyl(panel_data$year)`. Review the article to make sense of the row numbers (n).
3A.Q1: Why are there 18,558 rows in panel_data?
Response: _________________________
# How many years are included in the panel?
# How many precincts are there?
3A.Q2: How many unique precincts are ever treated (i.e., proposed & operational)?
Response: _________________________
panel_data %>%
group_by(precinct_id) %>%
summarise(
ever_proposed = any(proposed_turbine == 1, na.rm = TRUE),
ever_operational = any(operational_turbine == 1, na.rm = TRUE),
.groups = "drop") %>%
summarise(
n_ever_proposed = sum(ever_proposed),
n_ever_operational = sum(ever_operational))
3B. Plot and evaluate parallel trends: Replicate Figure 2 (Stokes, 2015)
- Create indicators for whether each precinct is ever treated by 2011 (treat_p, treat_o; separate indicators for proposals and operational turbines).
- Plot mean incumbent vote share by year for treated vs. control precincts (with 95% CIs).
- Facet by turbine type (proposed & operational)
Step 1: Prepare data
trends_data <- panel_data %>%
group_by(precinct_id) %>%
mutate(
treat_p = as.integer(any(proposed_turbine == 1, na.rm = TRUE)), # ever proposed (in any year)
treat_o = as.integer(any(operational_turbine == 1, na.rm = TRUE))) %>% # ever operational (in any year)
ungroup() %>%
pivot_longer(c(treat_p, treat_o),
names_to = "turbine_type", values_to = "treat") %>%
mutate(
turbine_type = factor(turbine_type,
levels = c("treat_p", "treat_o"),
labels = c("Proposed turbines", "Operational turbines")),
status = if_else(treat == 1, "Treated", "Control"),
year = factor(year))
Step 2: Create trends plot
pd <- position_dodge(width = 0.15)
trends_data %>%
group_by(turbine_type, status, year) %>%
summarise(
mean = mean(perc_lib, na.rm = TRUE),
n = sum(!is.na(perc_lib)),
se = sd(perc_lib, na.rm = TRUE) / sqrt(n),
ci = qt(.975, df = pmax(n - 1, 1)) * se,
.groups = "drop") %>%
ggplot(aes(year, mean, color = status, group = status)) +
geom_line(position = pd, linewidth = 1.2) +
geom_point(position = pd, size = 2.6) +
geom_errorbar(
aes(ymin = mean - ci, ymax = mean + ci),
position = pd, width = .12, linewidth = .7, color = "black") +
facet_wrap(~ turbine_type, nrow = 1) +
scale_color_manual(values = c(Control = "#0072B2", Treated = "#B22222")) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
coord_cartesian(ylim = c(.20, .57)) +
labs(
title = "Figure 2. Trends in the Governing Party’s Vote Share",
x = "Election Year",
y = "Liberal Party Vote Share",
color = NULL) +
theme_minimal(base_size = 14) +
theme(
panel.grid.minor = element_blank(),
legend.position = "bottom",
strip.text = element_text(face = "bold"))
3B.Q1: Write a short paragraph assessing the parallel trends assumption for each outcome.
Response (4-6 sentences): _________________________
Estimating Fixed Effects Models (DiD) for proposals
\[ Y_{it} = \alpha_0 + \beta \cdot \text{proposed\_turbine}_{it} + \gamma_i + \delta_t + \varepsilon_{it} \]
- \(Y_{it}\) is the vote share for the Liberal Party in precinct i in time t
- \(\beta\) is the treatment effect of a turbine being proposed within a precinct
- \(\gamma_i\) is the precinct fixed effect
- \(\delta_t\) is the year fixed effect
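For reference, the same model can be estimated on the full panel with {fixest} (loaded above), which absorbs the fixed effects rather than estimating dummy coefficients. A sketch; clustering standard errors by precinct is an assumption about the error structure:

```r
# Sketch: two-way fixed effects DiD with fixest (fixed effects absorbed)
model_feols <- feols(perc_lib ~ proposed_turbine | precinct_id + year,
                     data = panel_data,
                     cluster = ~precinct_id)
summary(model_feols)
```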
Example 1: Randomly sample 40 precincts
- To illustrate the “dummy variable method” of estimating fixed effects using the general lm() function, we are going to randomly sample 40 precincts (20 “treated” precincts with proposed turbines).
- If we attempted this approach with the full sample, estimating all 6,185 (n − 1) precinct-level coefficients would be impractical (it would take a long time).
set.seed(40002026)
precinct_frame <- panel_data %>%
group_by(precinct_id) %>%
summarise(
proposed_turbine_any = as.integer(any(proposed_turbine == 1, na.rm = TRUE)),
.groups = "drop"
)
ids_40 <- precinct_frame %>%
group_by(proposed_turbine_any) %>%
slice_sample(n = 20) %>%
ungroup() %>%
select(precinct_id)
sample_40_precincts <- panel_data %>%
semi_join(ids_40, by = "precinct_id")
3C: Estimate a fixed effects model using lm(), with fixed effects added for precinct and year, using the sample of 40 precincts just created.
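A sketch of the dummy-variable version on the 40-precinct sample; because precinct_id and year are factors, lm() expands them into dummies automatically:

```r
# Sketch: fixed effects via dummy variables on the small sample
model1_ff <- lm(perc_lib ~ proposed_turbine + precinct_id + year,
                data = sample_40_precincts)
summ(model1_ff, model.fit = FALSE, digits = 3)
```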
model1_ff <-
# summ(*** , model.fit = FALSE)
# summ(*** , model.fit = FALSE, [. . .])
3C.Q1: Intuition check: Is the signal-to-noise ratio for the treatment estimate greater than 2-to-1?
Response: _________________________
HINT: Add the argument digits = 3 to the summ() function above.
3C.Q2: Re-run the summ() function using the heteroscedasticity-robust standard error adjustment (robust = TRUE). Did the standard error (S.E.) estimates change? Explain why.
Response: _________________________
3C.Q3: Compare results of the model above to the findings from the fixed effects analysis in the Stokes (2015) study. Why might the results be similar or different?
Response: _________________________
3C.Q4: In your own words, explain why it is advantageous from a causal inference perspective to include year and precinct fixed effects. Explain how between-level and within-level variance is relevant to the problem of omitted variable bias (OVB).
Response (2-4 sentences): _________________________