
Map of Treatment Assignment Zones in Nansana, Uganda
Open-source data & applied study:
This class exercise uses open-access data from the following study (Buntaine et al., 2024):
Buntaine, M. T., Komakech, P., & Shen, S. V. (2024). Social competition drives collective action to reduce informal waste burning in Uganda. Proceedings of the National Academy of Sciences, 121(23). https://doi.org/10.1073/pnas.2319712121
Goals of this exercise
Start with a simple OLS model
Demonstrate the rationale for estimating progressively complex models
Illustrate a causal inference approach (i.e., identify robust estimators)
Conclusion: What is the take-home message of the Lab-1 exercise? . . .
Estimating complex models is a relatively easy part of our work as data scientists…
- Making informed specification decisions & communicating results is the hard part (i.e., the science side of
data science)!
Let’s get started!
Load packages and data:
library(tidyverse) # Keeping things tidy
library(janitor) # Housekeeping
library(here) # Location, location, location
#install.packages("jtools")
library(jtools) # Pretty regression output
library(gt) # Tables
library(gtsummary) # Table for checking balance
#install.packages("performance")
library(performance) # Check model diagnostics
#install.packages("see") # Visualization engine for performance package
#install.packages("huxtable") # Package required for export_summs
# Read in waste pile count data
counts_gpx <- read_csv(here::here("data", "waste_pile_counts_gpx.csv")) %>%
rename("waste_piles" = "total",
"rain_0_48hrs" = "rf_0_to_48_hours",
"rain_49_168hrs" = "rf_49_to_168_hours") %>%
filter(survey %in% c(
"post_treatment_1", "post_treatment_2", "post_treatment_3",
"post_treatment_4", "post_treatment_5"))

# Select a single post-treatment period
post_treat_subset <- counts_gpx %>%
filter(survey == "post_treatment_4")

head(post_treat_subset)

# A tibble: 6 × 8
  zoneid survey    waste_piles rain_0_48hrs rain_49_168hrs pre_count_avg pair_id
   <dbl> <chr>           <dbl>        <dbl>          <dbl>         <dbl>   <dbl>
1   1011 post_tre…          97         0              0.4           97      2021
2   1012 post_tre…          12         0              0.4           12.5    2022
3   1013 post_tre…          58         0              0.4           87.5    2022
4   1014 post_tre…          88         0              0.4          110.     2021
5   1015 post_tre…          53         0.04           0.36         110.     2017
6   1016 post_tre…          18         0              0.4           23      2017
# ℹ 1 more variable: treat <dbl>

Now that we've got our data loaded in, let's take a look at what these variables mean.
| Label | Description |
|---|---|
| zoneid | Zone or neighborhood ID: The observational unit (\(N=44\)) |
| waste_piles | The outcome variable (\(Y\)): The number of waste pile burns recorded (range: 5–125) |
| treat | The treatment assignment variable (0 = Control Group; 1 = Treatment Group) |
| rain_0_48hrs | The rainfall estimated between 0 and 48 hours prior to the start of the audit, derived from NASA MERRA-2 corrected precipitation estimates (mm) |
| rain_49_168hrs | The rainfall estimated between 49 and 168 hours prior to the start of the audit, derived from NASA MERRA-2 corrected precipitation estimates (mm) |
| pre_count_avg | The average number of burnt piles in the pre-treatment audits, spread to zone level |
| pair_id | The unique ID assigned to each pair of zones for assignment to treatment |
Model 1: Simple OLS (Ordinary Least Squares) estimator
OLS estimator (specification):
\[waste\ piles =\beta_0 + \beta_1(treat) + \epsilon\]
m1_ols <- lm(
waste_piles ~ treat,
data = post_treat_subset)
summ(m1_ols, model.fit = FALSE)

| Observations | 44 |
| Dependent variable | waste_piles |
| Type | OLS linear regression |
| Est. | S.E. | t val. | p | |
|---|---|---|---|---|
| (Intercept) | 77.91 | 4.16 | 18.73 | 0.00 |
| treat | -24.77 | 5.88 | -4.21 | 0.00 |
| Standard errors: OLS |
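With a single binary regressor and no controls, the OLS slope is simply the difference in group means. As a quick sanity check (a sketch, assuming post_treat_subset is loaded as above):

```r
# Difference in mean waste piles between treatment and control zones;
# this should reproduce the treat coefficient (-24.77)
with(post_treat_subset,
     mean(waste_piles[treat == 1]) - mean(waste_piles[treat == 0]))
```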
Are we making reasonable assumptions?
Remember, questioning (and testing) assumptions is the heart of causal inference! We want to make sure our assumptions are reasonable. To do so, let's take a quick look at our outcome variable, waste_piles, and then check the normality of our residuals.
post_treat_subset %>%
ggplot(aes(x = waste_piles)) +
geom_histogram(binwidth = 5, fill = "blue", color = "white", alpha = 0.7) +
labs(title = "Histogram of Waste Piles",
x = "Waste Piles (counts)",
y = "Count") +
theme_minimal()
While our data isn’t a perfect bell curve, it also isn’t severely skewed. Let’s do another assumption check (normality of residuals) to help us decide if OLS is the right model for us.
To do so, we will make a QQ plot displaying residuals (y-axis) compared to the normal distribution (x-axis).
check_model(m1_ols, check = "qq" )
Hmm, it doesn't seem like the OLS assumption of normally distributed residuals is a good fit for our data. Onto the next model!
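Alongside the visual QQ check, a formal test can corroborate what we see (a sketch using base R's Shapiro–Wilk test; a small p-value suggests the residuals depart from normality):

```r
# Shapiro-Wilk test of normality on the Model 1 residuals
shapiro.test(residuals(m1_ols))
```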
Since we ruled OLS out, talk with your neighbor about another potential model to try. What kind of variable is our outcome variable? How might that inform what model we try next?
We can “relax” our assumptions by using a Generalized Linear model! Specifically, we will relax the assumption that the residuals are normally distributed.
When we relax an assumption and make our model more flexible, the statistics and interpretation become more complex!
As you discussed with your partner, our dependent variable, waste_piles, is a count variable, not a continuous variable. Therefore, it's best we pick a model that explicitly accounts for count outcomes. In walks the Poisson regression model!
Model 2: Poisson Generalized Linear Regression Model
We are going to assume Y follows a Poisson distribution over non-negative integers (counts!). To do so, we need to think about the assumptions Poisson regression makes. This model assumes that the variance equals the mean (equidispersion). Does our data match the proposed theoretical distribution? Let's check!
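Before comparing distributions visually, a quick numerical check: if the Poisson assumption held, the sample variance would sit close to the sample mean (a sketch; exact values depend on the loaded data):

```r
# Compare the sample mean and variance of the outcome;
# a variance far above the mean hints at overdispersion
mean(post_treat_subset$waste_piles)
var(post_treat_subset$waste_piles)
```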
lambda_hat <- mean(post_treat_subset$waste_piles)
poisson_curve <- tibble(
x = seq(0, max(post_treat_subset$waste_piles), by = 1), # Range of x values
density = dpois(x, lambda = lambda_hat) # Theoretical Poisson density at each x
)
ggplot(post_treat_subset, aes(x = waste_piles)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5, color = "white", fill = "blue", alpha = 0.7) +
geom_line(data = poisson_curve, aes(x = x, y = density), color = "red", linewidth = 1) + # Poisson curve
geom_density(color = "green", linewidth = 1, adjust = 1.5) + # KDE for a smoother curve
labs(
title = "Plot of Empirical (Green) v. Theoretical (Red) Distributions",
x = "Waste Pile Counts",
y = "Density"
) +
theme_minimal()
m2_poisson <- glm(waste_piles ~ treat,
family = poisson(link = "log"),
data = post_treat_subset)
summ(m2_poisson, model.fit = FALSE)

| Observations | 44 |
| Dependent variable | waste_piles |
| Type | Generalized linear model |
| Family | poisson |
| Link | log |
| Est. | S.E. | z val. | p | |
|---|---|---|---|---|
| (Intercept) | 4.36 | 0.02 | 180.32 | 0.00 |
| treat | -0.38 | 0.04 | -10.09 | 0.00 |
| Standard errors: MLE |
Just like OLS, we have some assumptions we need to check!
We first need to check for overdispersion. This is when our variance exceeds the mean.
check_overdispersion(m2_poisson)

# Overdispersion test
dispersion ratio = 5.873
Pearson's Chi-Squared = 246.680
p-value = < 0.001
If our dispersion ratio were equal to 1, our variance would equal our mean. Since our dispersion ratio is 5.873, far greater than 1, our data show significant overdispersion. Poisson isn't the model for us!
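The dispersion ratio reported above is just the Pearson chi-squared statistic divided by the residual degrees of freedom, so we can reproduce it by hand (a sketch):

```r
# Manual dispersion ratio: Pearson chi-squared / residual df
pearson_resid <- residuals(m2_poisson, type = "pearson")
sum(pearson_resid^2) / df.residual(m2_poisson) # 246.680 / 42 ≈ 5.873
```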
Since we ruled OLS and Poisson Regression out, talk with your neighbor about another potential model to try.
Model 3: Log-OLS Regression
Log-linear regression is an alternative way to model count outcomes using regular OLS. The only difference is that we transform our outcome variable onto the log scale.
m3_log <- lm(
log(waste_piles) ~ treat,
data = post_treat_subset)
summ(m3_log, model.fit = FALSE)

| Observations | 44 |
| Dependent variable | log(waste_piles) |
| Type | OLS linear regression |
| Est. | S.E. | t val. | p | |
|---|---|---|---|---|
| (Intercept) | 4.31 | 0.08 | 51.97 | 0.00 |
| treat | -0.42 | 0.12 | -3.57 | 0.00 |
| Standard errors: OLS |
We can compare our observed data to simulated data based on the model we just created (m3_log).
check_predictions(m3_log, verbose = FALSE)

Warning: Minimum value of original data is not included in the
replicated data.
Model may not capture the variation of the data.
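check_predictions() works by simulating replicated outcomes from the fitted model and comparing them to the observed data. The same idea can be sketched with base R's simulate() (note the simulations are on the log scale, so we exponentiate to compare counts):

```r
# Simulate 100 replicated outcome vectors from m3_log (log scale)
sims <- simulate(m3_log, nsim = 100)

# Compare observed and replicated ranges on the count scale
range(post_treat_subset$waste_piles)
range(exp(as.matrix(sims)))
```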

Model Recap
Let’s take a look at the three models we have created thus far and compare their results.
export_summs(m1_ols, m2_poisson, m3_log,
model.names = c("OLS","Poisson GLM","Log(Outcome) OLS"),
statistics = "none")

| | OLS | Poisson GLM | Log(Outcome) OLS |
|---|---|---|---|
| (Intercept) | 77.91 *** | 4.36 *** | 4.31 *** |
| (4.16) | (0.02) | (0.08) | |
| treat | -24.77 *** | -0.38 *** | -0.42 *** |
| (5.88) | (0.04) | (0.12) | |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||
Let’s do some quick math to compare treatment effect estimates
\[\% \Delta Y = \left( e^{\beta_1} - 1 \right) \times 100\]
# Calculate percent change in waste piles in each model:
m1_change = (-24.77/77.91)*100 ## % change OLS = -31.8
m2_change = (exp(-.38) - 1)*100 ## % change POIS = -31.6
m3_change = (exp(-0.42) - 1)*100 ## % change LOG-OLS = -34.3

Interpret the treatment estimate for the log OLS.
The treatment estimate for log-OLS can be interpreted as follows:
The model estimates that the treatment group had roughly a 34% reduction in waste piles relative to the control group.
Let’s also take a look at the covariate balance table across treatment conditions.
post_treat_subset %>%
select(treat, waste_piles, pre_count_avg) %>%
tbl_summary(
by = treat,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
modify_header(label ~ "**Variable**") %>%
modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment**")

| Variable | 0, N = 22¹ | 1, N = 22¹ |
|---|---|---|
| waste_piles | 78 (21) | 53 (18) |
| pre_count_avg | 94 (18) | 84 (26) |

¹ Mean (SD)
What potential issues do you notice with the pre_count_avg variable?
This takes us to our next model - controlling for pre_count_avg!
Model 4: Log-OLS Regression with a control variable
m4_control <- lm(
log(waste_piles) ~
treat +
log(pre_count_avg), ## Add a control
data = post_treat_subset)
summ(m4_control, model.fit = FALSE)

| Observations | 44 |
| Dependent variable | log(waste_piles) |
| Type | OLS linear regression |
| Est. | S.E. | t val. | p | |
|---|---|---|---|---|
| (Intercept) | 0.47 | 0.35 | 1.34 | 0.19 |
| treat | -0.27 | 0.06 | -4.44 | 0.00 |
| log(pre_count_avg) | 0.85 | 0.08 | 11.09 | 0.00 |
| Standard errors: OLS |
Let’s compare our two log OLS Regression Models.
export_summs(m3_log, m4_control,
model.names = c("Log OLS Regression","Log OLS Regression with a control variable"),
statistics = "none")

| | Log OLS Regression | Log OLS Regression with a control variable |
|---|---|---|
| (Intercept) | 4.31 *** | 0.47 |
| (0.08) | (0.35) | |
| treat | -0.42 *** | -0.27 *** |
| (0.12) | (0.06) | |
| log(pre_count_avg) | 0.85 *** | |
| (0.08) | ||
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
We have now controlled for pre_count_avg, but let’s think about what other variables we could control for to improve our model. Other variables in our data include:
| Label | Description |
|---|---|
| rain_0_48hrs | The rainfall estimated between 0 and 48 hours prior to the start of the audit, derived from NASA MERRA-2 corrected precipitation estimates (mm) |
| rain_49_168hrs | The rainfall estimated between 49 and 168 hours prior to the start of the audit, derived from NASA MERRA-2 corrected precipitation estimates (mm) |
The environment doesn't sit still during our experiment, so let's control for it! We should account for events that might also affect trash burning during the treatment period (like rainfall).
Let's start by looking at a covariate balance table across treatment conditions for average rain events.
post_treat_subset %>%
select(treat, waste_piles, rain_0_48hrs, rain_49_168hrs) %>%
tbl_summary(
by = treat,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
modify_header(label ~ "**Variable**") %>%
modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment**")

| Variable | 0, N = 22¹ | 1, N = 22¹ |
|---|---|---|
| waste_piles | 78 (21) | 53 (18) |
| rain_0_48hrs | 0.18 (0.15) | 0.20 (0.16) |
| rain_49_168hrs | 0.17 (0.17) | 0.17 (0.17) |

¹ Mean (SD)
m5_control2 <- lm(
log(waste_piles) ~
treat +
log(pre_count_avg) +
rain_0_48hrs + rain_49_168hrs,
data = post_treat_subset)
summ(m5_control2,
model.fit = FALSE,
robust = "HC2") # Heteroskedasticity robust standard error adj.

| Observations | 44 |
| Dependent variable | log(waste_piles) |
| Type | OLS linear regression |
| Est. | S.E. | t val. | p | |
|---|---|---|---|---|
| (Intercept) | 0.43 | 0.45 | 0.95 | 0.35 |
| treat | -0.27 | 0.06 | -4.29 | 0.00 |
| log(pre_count_avg) | 0.86 | 0.07 | 12.51 | 0.00 |
| rain_0_48hrs | -0.02 | 0.85 | -0.03 | 0.98 |
| rain_49_168hrs | 0.07 | 0.79 | 0.09 | 0.93 |
| Standard errors: Robust, type = HC2 |
Now that we have two models with control variables, let's look at our results side by side.
export_summs(m4_control, m5_control2, robust = "HC2",
model.names = c("M4 - Pre-treat","M5 - Add Rain"),
statistics = "none")

| | M4 - Pre-treat | M5 - Add Rain |
|---|---|---|
| (Intercept) | 0.47 | 0.43 |
| (0.33) | (0.45) | |
| treat | -0.27 *** | -0.27 *** |
| (0.06) | (0.06) | |
| log(pre_count_avg) | 0.85 *** | 0.86 *** |
| (0.07) | (0.07) | |
| rain_0_48hrs | -0.02 | |
| (0.85) | ||
| rain_49_168hrs | 0.07 | |
| (0.79) | ||
| Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
We have another issue which may bias our inference: the neighborhoods were paired together. We have to account for this paired data feature, known as clustered data. Clustered data violates the regression assumption that observations are independent and identically distributed (i.i.d.). In other words, the data has within-group dependencies: paired neighborhoods will likely be more similar to each other than to un-paired neighborhoods.
Let's account for these paired neighborhoods by adding a standard error adjustment based on our clusters. We can make this adjustment with the model we just created (m5_control2).
summ(m5_control2,
robust = "HC2",
cluster = "pair_id", # Add SE adj. based on pair-clusters
model.fit = FALSE)

| Observations | 44 |
| Dependent variable | log(waste_piles) |
| Type | OLS linear regression |
| Est. | S.E. | t val. | p | |
|---|---|---|---|---|
| (Intercept) | 0.43 | 0.49 | 0.88 | 0.39 |
| treat | -0.27 | 0.07 | -3.63 | 0.00 |
| log(pre_count_avg) | 0.86 | 0.07 | 12.44 | 0.00 |
| rain_0_48hrs | -0.02 | 0.97 | -0.03 | 0.98 |
| rain_49_168hrs | 0.07 | 0.94 | 0.07 | 0.94 |
| Standard errors: Cluster-robust, type = HC2 |
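Under the hood, this cluster adjustment is a cluster-robust sandwich variance estimator. An equivalent adjustment can be computed directly (a sketch, assuming the sandwich and lmtest packages are installed; exact numbers may differ slightly depending on the small-sample correction used):

```r
library(sandwich) # Cluster-robust variance estimators
library(lmtest)   # coeftest() for re-testing coefficients

# Cluster-robust (HC2) covariance matrix, clustered on pair_id
vc_cluster <- vcovCL(m5_control2, cluster = ~ pair_id, type = "HC2")
coeftest(m5_control2, vcov = vc_cluster)
```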
What are the key takeaways from this summary? Discuss with your neighbor!