Lab 4 outline

Randomly split data into 2 equal parts (calibration & validation samples)
Introduction to MplusAutomation with iterators
Dealing with large data

____________________________________

Getting started - following the routine…

Create an R-Project
Install packages (\(\color{red}{\text{ONLY IF NEEDED}}\))
Load packages

R-Project instructions:

click “NEW PROJECT” (upper right corner of window)
choose option “NEW DIRECTORY”
choose location of project (on desktop OR in a designated class folder)

Within R-studio under the files pane (bottom right):

click “New Folder” and name folder “data”
click “New Folder” and name folder “efa_mplus”
click “New Folder” and name folder “figures”

____________________________________

New packages this week:

{janitor}
{haven}

IF package rhdf5 does not load then run:

if (!requireNamespace("BiocManager", quietly = TRUE)) 
  install.packages("BiocManager")

BiocManager::install("rhdf5")

____________________________________

Lab 4 - Begin

____________________________________

DATA SOURCE: This lab exercise utilizes the NCES public-use dataset: Education Longitudinal Study of 2002 (Lauff & Ingels, 2014) \(\color{blue}{\text{See website: nces.ed.gov}}\)

____________________________________

loading packages…

library(janitor)
library(tidyverse)
library(haven)
library(MplusAutomation)
library(rhdf5)
library(here)
library(corrplot)

read in the raw dataset

lab_data <- read_spss(here("data", "els_sub1_spss.sav"))

create a subset of the dataset called “school_trouble”

school_trouble <- lab_data %>% 
  select(41:55)

make a new codebook from the “school_trouble” subset

sjPlot::view_df(school_trouble)

write a CSV datafile

write_csv(school_trouble, here("data", "school_trouble_data.csv"))

read the unlabeled data back into R

trouble_data <- read_csv(here("data", "school_trouble_data.csv"))

check items to see if reverse coding is needed

cor_matrix <- cor(trouble_data, use = "pairwise.complete.obs")

corrplot(cor_matrix, method="circle",
         type = "upper")

____________________________________

Randomly split a sample into 2 equal parts

____________________________________

find the size of half of original sample.

The “floor()” function helps with rounding

smp_size <- floor(0.50 * nrow(trouble_data))

set the seed to make your partition reproducible

set.seed(123)

the function “sample()” will pick at random the values of the specified number

calibrate_smp <- sample(seq_len(nrow(trouble_data)), size = smp_size)

create two samples called “calibrate” & “validate”

calibrate <- trouble_data[calibrate_smp, ]
validate <- trouble_data[-calibrate_smp, ]

Let’s run an EFA with the “calibrate” sample

m_efa_1  <- mplusObject(
  TITLE = "School Trouble EFA - LAB 4 DEMO", 
  VARIABLE = 
    "usevar = BYS22A-BYS24G;", 
  
  ANALYSIS = 
    "type = efa 1 5;   
     estimator = mlr;  
     parallel=50; ! run parallel analysis",
  
  MODEL = "" ,
  
  PLOT = "type = plot3;",
  OUTPUT = "sampstat;",
  
  usevariables = colnames(calibrate), 
  rdata = calibrate)

m_efa_1_fit <- mplusModeler(m_efa_1, 
                            dataout=here("efa_mplus", "lab4_efa1_trouble.dat"),
                            modelout=here("efa_mplus", "lab4_efa1_trouble.inp"),
                            check=TRUE, run = TRUE, hashfilename = FALSE)

Plot Parallel Analysis & Eigenvalues

read into R an Mplus output file

efa_summary <- readModels(here("efa_mplus", "lab4_efa1_trouble.out"))

extract relavent data & prepare dataframe for plot

x <- list(EFA=efa_summary[["gh5"]][["efa"]][["eigenvalues"]], 
          Parallel=efa_summary[["gh5"]][["efa"]][["parallel_average"]])

plot_data <- as_data_frame(x)
plot_data <- cbind(Factor = paste0(1:nrow(plot_data)), plot_data)

plot_data <- plot_data %>% 
  mutate(Factor = fct_inorder(Factor))

pivot the dataframe to “long” format

plot_data_long <- plot_data %>% 
  pivot_longer(EFA:Parallel,               # The columns I'm gathering together
               names_to = "Analysis",      # new column name for existing names
               values_to = "Eigenvalues")  # new column name to store values

plot using ggplot

plot_data_long %>% 
  ggplot(aes(y=Eigenvalues,
             x=Factor,
             group=Analysis,
             color=Analysis)) +
  geom_point() + 
  geom_line() + 
  theme_minimal()

save figure to the designated folder

ggsave(here("figures", "eigenvalue_elbow_rplot.png"), dpi=300, height=5, width=7, units="in")

____________________________________

Introduction to MplusAutomation with iterators

____________________________________

Alternate way to run an EFA with the “calibrate” sample

m_efa  <- lapply(1:5, function(k) {
  m_efa2  <- mplusObject(
    TITLE = "School Trouble EFA - LAB 4 DEMO", 
    VARIABLE = 
      "usevar = BYS22A-BYS24G;", 
    
    ANALYSIS = 
      paste("type=efa", k, k), 
    
    MODEL = "" ,
    
    PLOT = "type = plot3;",
    OUTPUT = "sampstat;",
    
    usevariables = colnames(calibrate), 
    rdata = calibrate)
  
  m_efa_2_fit <- mplusModeler(m_efa2, 
                              dataout=sprintf(here("efa_mplus2", "efa_trouble.dat"), k),
                              modelout=sprintf(here("efa_mplus2", "efa_%d_trouble.inp"), k),
                              check=TRUE, run = TRUE, hashfilename = FALSE)
})

____________________________________

Cleaning & subsetting large datasets

____________________________________

reading SPSS files is a lot slower than reading CSV formatted files

hsls_raw <- read_spss(here("data", "hsls_16_student_sub_v1.sav"))

make all column names “lower_snake_case” style

hsls_tidy <- hsls_raw %>% 
  clean_names()

select usig the starts_with() function

hsls_x1 <- hsls_tidy %>% 
  select(starts_with("x1")) # columns with first 2 characters "x1"

select using the end_with() function

hsls_not_sex <- hsls_tidy %>% 
  select(!ends_with("sex")) # columns that do NOT end with "sex"

select using the end_with() function

hsls_science <- hsls_tidy %>% 
  select(contains("sci")) # columns that contain characters "sci"

hsls_math <- hsls_tidy %>% 
  select(contains(c("mth" , "math"))) # columns that contain "mth" or "math"

combine different select() arguements

hsls_math_sci <- hsls_tidy %>% 
  select(contains(c("mth" , "math", "sci"))) %>%
  select(!starts_with("x1")) %>% 
  select(!ends_with("sex"))

remove characters from the variable names that are greater than 8 characters

names(hsls_math_sci) = str_sub(names(hsls_math_sci), 1, 8)

check if culumn names are unique

test.unique <- function(df) { ## function to identify unique columns
  
  length1 <- length(colnames(df))
  length2 <- length(unique(colnames(df))) 
  if (length1 - length2 > 0 ) {
    
  print(paste("There are", length1 - length2, " duplicates", sep=" ")) 
    }
} 

test.unique(hsls_math_sci)

locate duplicates (this will find the column of the first duplicate)

anyDuplicated(colnames(hsls_math_sci))

names(hsls_math_sci)

Other functions to consider from the “stringr” package (part of tidyverse):

str_remove()
str_replace() # replace one string pattern with another
str_match()
str_pad() # to remove spaces
str_count()
str_detect()
str_dup()
str_extract_all()

____________________________________

End of lab 4 exercise

____________________________________

References

Hallquist, M. N., & Wiley, J. F. (2018). MplusAutomation: An R Package for Facilitating Large-Scale Latent Variable Analyses in Mplus. Structural equation modeling: a multidisciplinary journal, 25(4), 621-638.

Horst, A. (2020). Course & Workshop Materials. GitHub Repositories, https://https://allisonhorst.github.io/

Muthén, L.K. and Muthén, B.O. (1998-2017). Mplus User’s Guide. Eighth Edition. Los Angeles, CA: Muthén & Muthén

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Lab 4 - Random splits, Iterators, & Handling large data

Adam Garber

Factor Analysis ED 216B - Instructor: Karen Nylund-Gibson

March 21, 2020

Lab 4 outline

____________________________________

Getting started - following the routine…

R-Project instructions:

____________________________________

IF package rhdf5 does not load then run:

____________________________________

Lab 4 - Begin

____________________________________

____________________________________

loading packages…

read in the raw dataset

create a subset of the dataset called “school_trouble”

make a new codebook from the “school_trouble” subset

write a CSV datafile

read the unlabeled data back into R

check items to see if reverse coding is needed

____________________________________

Randomly split a sample into 2 equal parts

____________________________________

find the size of half of original sample.

The “floor()” function helps with rounding

set the seed to make your partition reproducible

the function “sample()” will pick at random the values of the specified number

create two samples called “calibrate” & “validate”

Let’s run an EFA with the “calibrate” sample

Plot Parallel Analysis & Eigenvalues

read into R an Mplus output file

extract relavent data & prepare dataframe for plot

pivot the dataframe to “long” format

plot using ggplot

save figure to the designated folder

____________________________________

Introduction to MplusAutomation with iterators

____________________________________

Alternate way to run an EFA with the “calibrate” sample

____________________________________

Cleaning & subsetting large datasets

____________________________________

reading SPSS files is a lot slower than reading CSV formatted files

make all column names “lower_snake_case” style

select usig the starts_with() function

select using the end_with() function

select using the end_with() function

combine different select() arguements

remove characters from the variable names that are greater than 8 characters

check if culumn names are unique

locate duplicates (this will find the column of the first duplicate)

Other functions to consider from the “stringr” package (part of tidyverse):

____________________________________

End of lab 4 exercise

____________________________________

References