Figure: Map of Treatment Assignment Zones in Nansana, Uganda

* SOURCE: Figure from Buntaine et al., 2024 (S5; Supplemental Materials)


Open source data & applied study:

Open-access data from the following study (Buntaine et al., 2024) were used for this class exercise:

Buntaine, M. T., Komakech, P., & Shen, S. V. (2024). Social competition drives collective action to reduce informal waste burning in Uganda. Proceedings of the National Academy of Sciences, 121(23). https://doi.org/10.1073/pnas.2319712121


Outline - What are the goals of this exercise?

1. Start with a simple OLS model

2. Demonstrate the rationale for estimating progressively complex models

3. Illustrate a causal inference approach (i.e., identify robust estimators)

4. Conclusion: What is the take-home message of the Lab-1 exercise? . . .

Estimating complex models is a relatively easy part of our work as data scientists...

* Making informed specification decisions & communicating results is the hard part
(i.e., the science side of `data science`)!  

In this tutorial I use this text style to highlight:

  • Statistical jargon - E.g., Ordinary Least Squares (OLS)
  • Variable names - E.g., predictor_x
  • In-text R code - E.g., here::here()

Let’s get started!

library(tidyverse)   # Keeping things tidy (https://www.tidyverse.org/packages/)
library(janitor)     # Housekeeping (https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html)
library(here)        # Location, location, location (https://here.r-lib.org/)
library(jtools)      # Pretty regression output (https://jtools.jacob-long.com/)
library(gt)          # Tables (https://gt.rstudio.com/)
library(gtsummary)   # Table for checking balance (https://www.danieldsjoberg.com/gtsummary/)
library(performance) # Check model diagnostics (https://easystats.github.io/performance/index.html)

Getting to know our data example - analysis variables

variable_descriptions <- tribble(
  ~"Label", ~"Description",  
 #----------|-------------|,
  "zoneid" , "Zone or neighborhood ID: The observational unit (N=44)",   
  "waste_piles" , "The outcome variable (Y): The number of waste pile burns recorded (Range; 5, 125)",  
  "treat" , "The treatment assignment variable (0 = Control Group; 1 = Treatment Group)"
 )

gt(variable_descriptions) %>% 
    tab_header(
        title = "Focal Variables - Evaluating Treatment Effects") %>%
    tab_style(
      style = cell_text(weight = "bold"),
      locations = cells_column_labels()) 
Focal Variables - Evaluating Treatment Effects
Label          Description
zoneid         Zone or neighborhood ID: The observational unit (N=44)
waste_piles    The outcome variable (Y): The number of waste pile burns recorded (Range: 5 to 125)
treat          The treatment assignment variable (0 = Control Group; 1 = Treatment Group)

Read in the Nansana experiment data (Buntaine et al., 2024)

#waste pile count data:
counts_gpx <- read_csv(here("data", "waste_pile_counts_gpx.csv")) %>% 
    rename("waste_piles" = "total",
           "rain_0_48hrs" = "rf_0_to_48_hours",
           "rain_49_168hrs" = "rf_49_to_168_hours") %>% 
    filter(survey %in% c(
      "post_treatment_1","post_treatment_2","post_treatment_3",
      "post_treatment_4","post_treatment_5"))

# keep a single post-treatment period (time point 4) for this exercise:
post_treat_subset <- counts_gpx %>% 
   filter(survey == "post_treatment_4")

Model 1: Simple OLS estimator

Review of regression: Ordinary Least Squares (OLS)


OLS estimator (specification):

\[\text{waste\_piles} = \beta_0 + \beta_1(\text{treat}) + \epsilon\]

m1_ols <- lm(
    waste_piles ~ treat,
    data = post_treat_subset)

summ(m1_ols, model.fit = FALSE)
Observations 44
Dependent variable waste_piles
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 77.91 4.16 18.73 0.00
treat -24.77 5.88 -4.21 0.00
Standard errors: OLS

Are we making reasonable assumptions?

Let's take a quick look at our outcome variable `waste_piles`!
post_treat_subset %>%
  ggplot(aes(x = waste_piles)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "white", alpha = 0.7) +
  labs(title = "Histogram of Waste Piles",
    x = "Waste Piles (counts)",
    y = "Count") +
    theme_minimal()

Check model assumption: normality of residuals

Make a QQ plot comparing the quantiles of the residuals (y-axis) to the quantiles of the normal distribution (x-axis)

check_model(m1_ols,  check = "qq" )
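As a supplementary numeric check (my addition, not part of the original lab), a Shapiro-Wilk test on the residuals complements the QQ plot:

# Supplementary check: Shapiro-Wilk test of the OLS residuals;
# a small p-value suggests departure from normality.
shapiro.test(residuals(m1_ols))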

Key takeaways:


Relax, it's time to generalize!


Model 2: Poisson Generalized Linear Regression Model
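For reference, here is the Model 2 specification written in the same style as the OLS equation above. With a log link, the coefficients act on the log of the expected count:

\[\log\left(E[\text{waste\_piles}]\right) = \beta_0 + \beta_1(\text{treat})\]

First, let's compare the empirical distribution of the counts to the Poisson distribution implied by the sample mean.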

lambda_hat <- mean(post_treat_subset$waste_piles)  # Poisson rate estimate (sample mean)

poisson_curve <- tibble(
  x = seq(0, max(post_treat_subset$waste_piles), by = 1),  # Range of x values
  density = dpois(x, lambda = lambda_hat)                  # Theoretical Poisson density at each x
)

ggplot(post_treat_subset, aes(x = waste_piles)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, color = "white", fill = "blue", alpha = 0.7) + 
  geom_line(data = poisson_curve, aes(x = x, y = density), color = "red", linewidth = 1) +  # Poisson curve
  geom_density(color = "green", linewidth = 1, adjust = 1.5) +  # KDE for a smoother curve
  labs(
    title = "Plot of Empirical (Green) v. Theoretical (Red) Distributions",
    x = "Waste Pile Counts",
    y = "Density"
  ) +
  theme_minimal()

OK, let's try estimating a Poisson model using glm()

m2_poisson <- glm(waste_piles ~ treat,
                  family = poisson(link = "log"),
                  data = post_treat_subset)

summ(m2_poisson, model.fit = FALSE)
Observations 44
Dependent variable waste_piles
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 4.36 0.02 180.32 0.00
treat -0.38 0.04 -10.09 0.00
Standard errors: MLE

Check overdispersion assumption

check_overdispersion(m2_poisson)
## # Overdispersion test
## 
##        dispersion ratio =   5.873
##   Pearson's Chi-Squared = 246.680
##                 p-value = < 0.001
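For intuition, the dispersion ratio reported above can be reproduced by hand from the Pearson residuals (my own supplementary sketch; it assumes the standard definition of the ratio as the Pearson chi-squared statistic divided by the residual degrees of freedom):

# Supplementary sketch: dispersion ratio computed by hand
pearson_chisq <- sum(residuals(m2_poisson, type = "pearson")^2)
pearson_chisq / df.residual(m2_poisson)   # ~5.9, well above 1 => overdispersion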

Key takeaways:

Intuition check: The new estimates are on the log scale, so they are not directly comparable to the OLS estimates (see the sketch below).
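One quick way to get back to the count scale (a supplementary sketch of mine, not in the original lab) is to exponentiate the Poisson coefficients, which turns them into multiplicative rate ratios:

# Supplementary sketch: Poisson coefficients as rate ratios on the count scale
exp(coef(m2_poisson))   # treat ~ 0.68, i.e., roughly a 32% lower expected count in treated zones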


Model 3: Simple is best! The econometrician's trick


m3_log <- lm(
    log(waste_piles) ~ treat,
    data = post_treat_subset)

summ(m3_log, model.fit = FALSE)
Observations 44
Dependent variable log(waste_piles)
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 4.31 0.08 51.97 0.00
treat -0.42 0.12 -3.57 0.00
Standard errors: OLS

Compare observed data to simulated data based on the fitted model (m3_log)

check_predictions(m3_log, verbose = FALSE)
## Warning: Minimum value of original data is not included in the
##   replicated data.
##   Model may not capture the variation of the data.
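To see what that warning is getting at, here is a supplementary sketch (my addition, not part of the original lab) that simulates outcomes from the fitted log-OLS model and compares their range to the observed counts:

# Supplementary sketch: compare observed counts to counts implied by model simulations
set.seed(42)
sims <- simulate(m3_log, nsim = 100)     # Replicated outcomes on the log scale
range(post_treat_subset$waste_piles)     # Observed counts
round(range(exp(as.matrix(sims))))       # Counts implied by the simulations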

Key takeaways:


Comparing model output - Let’s take a look at the models we’ve estimated side-by-side

export_summs(m1_ols, m2_poisson, m3_log,
             model.names = c("OLS","Poisson GLM","Log(Outcome) OLS"),
             statistics = "none")
                  OLS            Poisson GLM    Log(Outcome) OLS
(Intercept)       77.91 ***      4.36 ***       4.31 ***
                  (4.16)         (0.02)         (0.08)
treat             -24.77 ***     -0.38 ***      -0.42 ***
                  (5.88)         (0.04)         (0.12)
*** p < 0.001; ** p < 0.01; * p < 0.05.

Let’s do some quick math to compare treatment effect estimates

\[\% \Delta Y = \left( e^{\beta_1} - 1 \right) \times 100\]

# Calculate percent change in waste piles in each model:

m1_change = (-24.77/77.91)*100   ## % change OLS = -31.8
m2_change = (exp(-.38) - 1)*100  ## % change POIS = -31.6
m3_change = (exp(-0.42) - 1)*100 ## % change LOG-OLS = -34.3

Analysis check-in:

So far, the log-OLS regression model is the best-fitting model we have specified.

The treatment estimate from the log-OLS model can be interpreted as follows:

The model estimates that the treatment group had roughly a 34% reduction in waste piles relative to the control group.


Let’s take a look at covariate balance across treatment conditions

Specifically, `pre_count_avg` (pre-treatment waste pile counts)
post_treat_subset %>% 
    select(treat, waste_piles, pre_count_avg) %>% 
    tbl_summary(
        by = treat,
        statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% 
    modify_header(label ~ "**Variable**") %>%
    modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment**") 
Variable           Treatment
                   0 (N = 22)¹      1 (N = 22)¹
waste_piles        78 (21)          53 (18)
pre_count_avg      94 (18)          84 (26)
¹ Mean (SD)
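A simple two-sample comparison can complement the balance table (this check is my own addition, not part of the original lab):

# Supplementary check: t-test of pre-treatment counts across treatment arms
t.test(pre_count_avg ~ treat, data = post_treat_subset)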

Model 4: Moving towards causal inference


m4_control <- lm(
    log(waste_piles) ~ 
        treat +
        log(pre_count_avg), ## Add a control
    data = post_treat_subset)

summ(m4_control, model.fit = FALSE)
Observations 44
Dependent variable log(waste_piles)
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 0.47 0.35 1.34 0.19
treat -0.27 0.06 -4.44 0.00
log(pre_count_avg) 0.85 0.08 11.09 0.00
Standard errors: OLS

Model 5: Does the environment stand still during our experiment?


No! We should account for events that might also affect trash burning during the treatment period

I.e., anything else that might affect the number of trash piles burned (besides our treatment)

Look at balance across treatment conditions for average rain events

post_treat_subset %>% 
    select(treat, waste_piles, rain_0_48hrs, rain_49_168hrs) %>% 
    tbl_summary(
        by = treat,
        statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% 
    modify_header(label ~ "**Variable**") %>% 
    modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment**") 
Variable           Treatment
                   0 (N = 22)¹      1 (N = 22)¹
waste_piles        78 (21)          53 (18)
rain_0_48hrs       0.18 (0.15)      0.20 (0.16)
rain_49_168hrs     0.17 (0.17)      0.17 (0.17)
¹ Mean (SD)

Add controls for rain events during the competition period

m5_control2 <- lm(
    log(waste_piles) ~ 
        treat +
        log(pre_count_avg) +
        rain_0_48hrs + rain_49_168hrs,
    data = post_treat_subset)

summ(m5_control2,
     model.fit = FALSE,
     robust = "HC2")  # Heteroskedasticity robust standard error adj.
Observations 44
Dependent variable log(waste_piles)
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 0.43 0.45 0.95 0.35
treat -0.27 0.06 -4.29 0.00
log(pre_count_avg) 0.86 0.07 12.51 0.00
rain_0_48hrs -0.02 0.85 -0.03 0.98
rain_49_168hrs 0.07 0.79 0.09 0.93
Standard errors: Robust, type = HC2
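If you want to see where these robust standard errors come from, the same HC2 adjustment can be reproduced outside of jtools with the sandwich and lmtest packages (a supplementary sketch of mine; it assumes those packages are installed):

# Supplementary sketch: reproduce the HC2 robust standard errors with sandwich/lmtest
library(sandwich)
library(lmtest)
coeftest(m5_control2, vcov = vcovHC(m5_control2, type = "HC2"))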

Comparing models after adding controls

export_summs(m4_control, m5_control2, robust = "HC2",
             model.names = c("M4 - Pre-treat","M5 - Add Rain"),
             statistics = "none")
                      M4 - Pre-treat    M5 - Add Rain
(Intercept)           0.47              0.43
                      (0.33)            (0.45)
treat                 -0.27 ***         -0.27 ***
                      (0.06)            (0.06)
log(pre_count_avg)    0.85 ***          0.86 ***
                      (0.07)            (0.07)
rain_0_48hrs                            -0.02
                                        (0.85)
rain_49_168hrs                          0.07
                                        (0.79)
Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.

Model 6: Are neighborhoods independent?


m6_clust <- lm(
    log(waste_piles) ~ 
        treat +
        log(pre_count_avg) +
        rain_0_48hrs + rain_49_168hrs,
    data = post_treat_subset)

summ(m6_clust,
     robust = "HC2",
     cluster = "pair_id", # Add SE adj. based on pair-clusters
     model.fit = FALSE)
Observations 44
Dependent variable log(waste_piles)
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 0.43 0.49 0.88 0.39
treat -0.27 0.07 -3.63 0.00
log(pre_count_avg) 0.86 0.07 12.44 0.00
rain_0_48hrs -0.02 0.97 -0.03 0.98
rain_49_168hrs 0.07 0.94 0.07 0.94
Standard errors: Cluster-robust, type = HC2
# m6_change = (exp(-0.27) - 1)*100 ## % change = -23.7
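As an alternative (my own supplementary sketch, not part of the original lab), the pair-clustered standard errors can be built directly into the estimation call with the estimatr package; this assumes a pair_id column is present in post_treat_subset, and the CR2 cluster-robust estimator may differ slightly from the jtools adjustment above:

# Supplementary sketch: same model with cluster-robust SEs built into the fit
library(estimatr)
m6_alt <- lm_robust(
    log(waste_piles) ~ treat + log(pre_count_avg) +
        rain_0_48hrs + rain_49_168hrs,
    data = post_treat_subset,
    clusters = pair_id,     # Pair-matched randomization clusters
    se_type = "CR2")
summary(m6_alt)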

Key takeaways:


Figure 2 adapted from Buntaine et al. (2024) - Replication of T+6 treatment estimate

Source (Buntaine et al., 2024)