Working with partners to troubleshoot code and concepts is encouraged! If you work with a partner, please list their name next to yours at the top of your assignment so Annie and I can easily see who collaborated.
All written responses must be written independently (in your own words).
Please follow the question prompts carefully and include only the information each question asks in your submitted responses.
Submit both your knitted document and the associated
RMarkdown
or Quarto
file.
Your knitted presentation should meet the quality you’d submit to research colleagues or feel confident sharing publicly. Refer to the rubric for details about presentation standards.
Assignment submission (YOUR NAME): ______________________________________
library(tidyverse)
library(here)
library(janitor)
library(estimatr)
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
library(interactions)
Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 11/17/2019.
You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.
Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.
We will consider the MPA sites the treatment
group and
use regression methods to explore whether protecting these reefs really
makes a difference compared to non-MPA sites (our control group). In
this assignment, we’ll think deeply about which causal inference
assumptions hold up under the research design and identify where they
fall short.
Let’s break it down step by step and see what the data reveals! 📊
Step 1: Anticipating potential sources of selection bias
a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is centris paribus or whether selection bias is likely (be specific!).
Step 2: Read & wrangle data
a. Read in the raw data. Name the data.frame
(df
) rawdata
b. Use the function clean_names()
from
the janitor
package
# HINT: check for coding of missing values (`na = "-99999"`)
rawdata <-
c. Create a new df
named
tidyata
. Using the variable site
(reef
location) create a new variable reef
as a
factor
and add the following labels in the order listed
(i.e., re-order the levels
):
"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista", "Naples"
tidydata <-
Create new df
named spiny_counts
d. Create a new variable counts
to
allow for an analysis of lobster counts where the unit-level of
observation is the total number of observed lobsters per
site
, year
and transect
.
mean_size
from the variable
size_mm
counts
should have values which are
integers (whole numbers).na
)!e. Create a new variable mpa
with
levels MPA
and non_MPA
. For our regression
analysis create a numerical variable treat
where MPA sites
are coded 1
and non_MPA sites are coded 0
#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation.
#HINT(e): Use `case_when()` to create the 3 new variable columns
spiny_counts <-
NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!
Step 3: Explore & visualize data
a. Take a look at the data! Get familiar with the
data in each df
format (tidydata
,
spiny_counts
)
b. We will focus on the variables
count
, year
, site
, and
treat
(mpa
) to model lobster abundance. Create
the following 4 plots using a different method each time from the 6
options provided. Add a layer (geom
) to each of the plots
including informative descriptive statistics (you choose; e.g., mean,
median, SD, quartiles, range). Make sure each plot dimension is clearly
labeled (e.g., axes, groups).
Create plots displaying the distribution of lobster counts:
Create a plot of lobster size :
# plot 1: ....
spiny_counts %>%
ggplot()
c. Compare means of the outcome by treatment group.
Using the tbl_summary()
function from the package gt_summary
# USE: gt_summary::tbl_summary()
Step 4: OLS regression- building intuition
a. Start with a simple OLS estimator of lobster
counts regressed on treatment. Use the function summ()
from
the jtools
package to print the OLS output
b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.
# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)
m1_ols <-
summ(m1_ols, model.fit = FALSE)
c. Check the model assumptions using the
check_model
function from the performance
package
d. Explain the results of the 4 diagnostic plots. Why are we getting this result?
check_model(m1_ols, check = "qq" )
check_model(m1_ols, check = "normality")
check_model(m1_ols, check = "homogeneity")
check_model(m1_ols, check = "pp_check")
Step 5: Fitting GLMs
a. Estimate a Poisson regression model using the
glm()
function
b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.
c. Explain the statistical concept of dispersion and overdispersion in the context of this model.
d. Compare results with previous model, explain change in the significance of the treatment effect
#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor
#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`
m2_pois <-
e. Check the model assumptions. Explain results.
f. Conduct tests for over-dispersion & zero-inflation. Explain results.
check_model(m2_pois)
check_overdispersion(m2_pois)
check_zeroinflation(m2_pois)
g. Fit a negative binomial model using the function
glm.nb() from the package MASS
and check model
diagnostics
h. In 1-2 sentences explain rationale for fitting this GLM model.
i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.
# NOTE: The `glm.nb()` function does not require a `family` argument
m3_nb <-
check_overdispersion(m3_nb)
check_zeroinflation(m3_nb)
check_predictions(m3_nb)
check_model(m3_nb)
Step 6: Compare models
a. Use the export_summ()
function from
the jtools
package to look at the three regression models
you fit side-by-side.
c. Write a short paragraph comparing the results. Is
the treatment effect robust
or stable across the model
specifications.
export_summs(# ADD MODELS
model.names = c("OLS","Poisson", "NB"),
statistics = "none")
Step 7: Building intuition - fixed effects
a. Create new df
with the
year
variable converted to a factor
b. Run the following negative binomial model using
glm.nb()
year
(i.e., dummy
coefficients)treat
& year
(treat*year
)c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)
d. Explain why the main effect for treatment is negative? *Does this result make sense?
ff_counts <- spiny_counts %>%
mutate(year=as_factor(year))
m5_fixedeffs <- glm.nb(
counts ~
treat +
year +
treat*year,
data = ff_counts)
summ(m5_fixedeffs, model.fit = FALSE)
e. Look at the model predictions: Use the
interact_plot()
function from package
interactions
to plot mean predictions by year and treatment
status.
f. Re-evaluate your responses (c) and (b) above.
interact_plot(m5_fixedeffs, pred = year, modx = treat,
outcome.scale = "link") # NOTE: y-axis on log-scale
# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts
g. Using ggplot()
create a plot in same
style as the previous interaction plot
, but displaying the
original scale of the outcome variable (lobster counts). This type of
plot is commonly used to show how the treatment effect changes across
discrete time points (i.e., panel data).
The plot should have… - year
on the x-axis -
counts
on the y-axis - mpa
as the grouping
variable
# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor
plot_counts <-
# plot_counts %>% ggplot() ...
Step 8: Reconsider causal identification assumptions
Discuss whether you think spillover effects
are
likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)
Explain why spillover is an issue for the identification of causal effects
How does spillover relate to impact in this research setting?
Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:
Use the recent lobster abundance data with observations collected up until 2024 (
lobster_sbchannel_24.csv
) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.