University of California, Santa Barbara
Lab preparation
Creating a version-controlled R-Project with Github
Download repository here: https://github.com/garberadamc/SEM-Lab10
On the Github repository webpage:
fork
your ownbranch
of the lab repository- copy the repository web URL address from the
clone or download
menu
Within R-Studio:
- click “NEW PROJECT”
- choose option
Version Control
- choose option
Git
- paste the repository web URL path copied from the
clone or download
menu on Github page - choose location of the R-Project
Data source: Longitudinal Study of American Youth, Science Attitudes
\(\color{blue}{\text{See documentation about the LSAY here.}}\)
Load packages
library(tidyverse)
library(glue)
library(MplusAutomation)
library(rhdf5)
library(here)
library(janitor)
library(gt)
library(DT)
library(plotly)
library(gg3D)
library(gganimate)
library(viridis)
library(hrbrthemes)
Exploring observed response patterns
Load data
lsay_data <- read_csv(here("data", "lca_lsay_sci.csv"), #
na = c("9999", "9999.00")) %>% #
clean_names() %>% #
dplyr::select(1:5, Enjoy = ab39m, Useful = ab39t, #
Logical = ab39u, Job = ab39w, Adult = ab39x) #
Use {DT::datatable()
} to take a look at the data
Figure. Path diagram of science attitude indicators.
Save response frequencies for the 4 class model with response is _____.dat
.
patterns <- mplusObject(
TITLE = "Step1 - 3step LSAY - Lab9",
VARIABLE =
"categorical = Enjoy-Adult;
usevar = Enjoy-Adult;
classes = c(4);",
ANALYSIS =
"estimator = mlr;
type = mixture;
starts = 500 100;",
SAVEDATA =
"File=3step_savedata.dat;
Save=cprob;
Missflag= 999;
!!!!!!!! Code to save response frequency data !!!!!!!!
response is resp_patterns.dat;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!",
OUTPUT = "sampstat residual patterns tech10 tech11 tech14",
PLOT =
"type = plot3;
series = Enjoy-Adult(*);",
usevariables = colnames(lsay_data),
rdata = lsay_data)
patterns_fit <- mplusModeler(patterns,
dataout=here("resp_pattrn", "exp_pattern_LSAY.dat"),
modelout=here("resp_pattrn", "exp_pattern_LSAY.inp") ,
check=TRUE, run = TRUE, hashfilename = FALSE)
Read in observed response pattern data
patterns <- read_table2(here("resp_pattrn", "resp_patterns.dat"), #
col_names=FALSE, na = "*") #
colnames(patterns) <- c("Frequency", "ENJOY", "USEFUL", "LOGICAL", "JOB", "ADULT", #
"CPROB1", "CPROB2", "CPROB3", "CPROB4", "C_MODAL") #
Order responses by highest frequency
loop_cond <- lapply(1:4, function(k) {
order_cond <- patterns %>%
filter(C_MODAL == k) %>%
arrange(desc(Frequency)) %>%
head(5)
})
table_data1 <- bind_rows(loop_cond) %>%
as.data.frame()
table_data2 <- rbind(order_highest[1:5,], table_data1)
Use {gt}
to make a nicely formatted table
table_data2 %>%
gt() %>%
tab_header(
title = md("**Observed response patterns, estimated frequencies, estimated posterior
class probabilities, and modal class assignment.**")) %>%
tab_source_note(
source_note = md("Data Source: **Longitudinal Study of American Youth.**")) %>%
cols_label(
ENJOY = "Enjoy",
USEFUL = "Useful",
LOGICAL = "Logical",
JOB = "Job",
ADULT = "Adult",
CPROB1 = html("P<sub>k=1"),
CPROB2 = html("P<sub>k=2"),
CPROB3 = html("P<sub>k=3"),
CPROB4 = html("P<sub>k=4"),
C_MODAL = md("*k*")) %>%
tab_row_group(
group = "Unconditional response patterns ordered by highest frequency",
rows = 1:5) %>%
tab_row_group(
group = "k=1 conditional response pattern ordered by highest frequency",
rows = 6:10) %>%
tab_row_group(
group = "k=2 conditional response pattern ordered by highest frequency",
rows = 11:15) %>%
tab_row_group(
group = "k=3 conditional response pattern ordered by highest frequency",
rows = 16:20) %>%
tab_row_group(
group = "k=4 conditional response pattern ordered by highest frequency",
rows = 21:25) %>%
row_group_order(
groups = c("Unconditional response patterns ordered by highest frequency",
"k=1 conditional response pattern ordered by highest frequency",
"k=2 conditional response pattern ordered by highest frequency",
"k=3 conditional response pattern ordered by highest frequency",
"k=4 conditional response pattern ordered by highest frequency")) %>%
tab_options(column_labels.font.weight = "bold")
Observed response patterns, estimated frequencies, estimated posterior class probabilities, and modal class assignment. | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Frequency | Enjoy | Useful | Logical | Job | Adult | Pk=1 | Pk=2 | Pk=3 | Pk=4 | k |
Unconditional response patterns ordered by highest frequency | ||||||||||
558 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.117 | 0.000 | 0.883 | 4 |
529 | 1 | 1 | 1 | 1 | 1 | 0.957 | 0.000 | 0.043 | 0.000 | 1 |
313 | 1 | 0 | 0 | 0 | 0 | 0.000 | 0.307 | 0.000 | 0.693 | 4 |
135 | 1 | 0 | 1 | 0 | 0 | 0.002 | 0.977 | 0.000 | 0.021 | 2 |
94 | 1 | 1 | 1 | 0 | 1 | 0.687 | 0.000 | 0.313 | 0.000 | 1 |
k=1 conditional response pattern ordered by highest frequency | ||||||||||
529 | 1 | 1 | 1 | 1 | 1 | 0.957 | 0.000 | 0.043 | 0.000 | 1 |
94 | 1 | 1 | 1 | 0 | 1 | 0.687 | 0.000 | 0.313 | 0.000 | 1 |
78 | 0 | 1 | 1 | 1 | 1 | 0.859 | 0.000 | 0.141 | 0.000 | 1 |
62 | 1 | 1 | 0 | 1 | 1 | 0.580 | 0.000 | 0.420 | 0.000 | 1 |
55 | 1 | 1 | 1 | 1 | 0 | 0.650 | 0.350 | 0.000 | 0.000 | 1 |
k=2 conditional response pattern ordered by highest frequency | ||||||||||
135 | 1 | 0 | 1 | 0 | 0 | 0.002 | 0.977 | 0.000 | 0.021 | 2 |
88 | 0 | 0 | 1 | 0 | 0 | 0.000 | 0.934 | 0.000 | 0.066 | 2 |
74 | 1 | 1 | 1 | 0 | 0 | 0.063 | 0.937 | 0.000 | 0.000 | 2 |
47 | 1 | 1 | 0 | 0 | 0 | 0.006 | 0.994 | 0.000 | 0.000 | 2 |
44 | 1 | 0 | 0 | 1 | 0 | 0.004 | 0.643 | 0.000 | 0.353 | 2 |
k=3 conditional response pattern ordered by highest frequency | ||||||||||
91 | 1 | 0 | 0 | 0 | 1 | 0.003 | 0.000 | 0.937 | 0.060 | 3 |
88 | 1 | 0 | 1 | 1 | 1 | 0.337 | 0.000 | 0.663 | 0.000 | 3 |
76 | 1 | 0 | 1 | 0 | 1 | 0.048 | 0.000 | 0.951 | 0.001 | 3 |
70 | 1 | 0 | 0 | 1 | 1 | 0.031 | 0.000 | 0.964 | 0.006 | 3 |
53 | 0 | 0 | 0 | 0 | 1 | 0.001 | 0.000 | 0.763 | 0.236 | 3 |
k=4 conditional response pattern ordered by highest frequency | ||||||||||
558 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.117 | 0.000 | 0.883 | 4 |
313 | 1 | 0 | 0 | 0 | 0 | 0.000 | 0.307 | 0.000 | 0.693 | 4 |
53 | 0 | 0 | 0 | 1 | 0 | 0.000 | 0.353 | 0.000 | 0.647 | 4 |
11 | 0 | 0 | NA | 0 | 0 | 0.000 | 0.231 | 0.000 | 0.769 | 4 |
9 | 0 | NA | 0 | 0 | 0 | 0.000 | 0.170 | 0.000 | 0.829 | 4 |
Data Source: Longitudinal Study of American Youth. |
Visualizing observed response patterns
Order rows by modal assignment (K)
order_modal <- patterns %>%
arrange(desc(C_MODAL)) %>%
rownames_to_column() %>%
rename('pat_num' = "rowname") %>%
drop_na(ENJOY:ADULT)
Prepare plot data
p1_long <- order_modal %>%
dplyr::select(pat_num:ADULT, C_MODAL) %>%
pivot_longer(`ENJOY`:`ADULT`, # The columns I'm gathering together
names_to = "var", # new column name for existing names
values_to = "value") %>% # new column name to store values
mutate(obs = rep(1:32, each =5)) %>%
mutate(Class = factor(C_MODAL)) %>%
mutate(var = ordered(var,
levels = c("ENJOY","USEFUL","LOGICAL","JOB","ADULT"))) %>%
select(-pat_num, -C_MODAL)
# must first run LCA enumeration (code is out of sequential order)
out_c4 <- readModels(here("enum_mplus"), filefilter = "c4", quiet = TRUE)
# extract posterior probabilities
probs_c4 <- as.data.frame(
out_c4[["gh5"]][["means_and_variances_data"]]
[["estimated_probs"]][["values"]]
[seq(2, 10, 2),])
rownames(probs_c4) <- c("ENJOY","USEFUL","LOGICAL","JOB","ADULT")
long_c4 <- probs_c4 %>% rownames_to_column() %>%
rename('var' = "rowname") %>%
pivot_longer(`V1`:`V4`, # The columns I'm gathering together
names_to = "c", # new column name for existing names
values_to = "value") %>% # new column name to store values
mutate(Class = rep(1:4,5)) %>%
arrange(Class) %>%
mutate(obs = rep(33:36,each=5)) %>%
mutate(Frequency = rep(c(829,782,619,833),each=5)) %>%
mutate(var = ordered(var,
levels = c("ENJOY","USEFUL","LOGICAL","JOB","ADULT"))) %>%
select(6,1,3,5,4)
p2_long <- rbind(p1_long, long_c4) %>%
mutate(Class = as.numeric(Class))
Visualize observed response patterns with {plotly
}
gg <- ggplot(p2_long, aes(x=var, y=value, color = Class, size=Frequency)) +
geom_line(aes(as.numeric(var), frame = obs)) +
scale_color_viridis() + labs(x="Indicator", y= "Probability")
ggplotly(gg) %>% animation_opts(frame = 1000, transition = 0) %>%
animation_slider(currentvalue =
list(prefix = "Pattern ", font = list(color="red")))
Make a 3D plot with packages {ggplot2
}, {gg3D
}, and {gganimate
}.
theta= 170 # change perspective (tilt)
phi=40 # change perspective (rotation)
resp3d <- ggplot(p1_long, aes(x=as.numeric(var),
y=as.numeric(value),
z = as.numeric(obs)),
alpha = .8) +
axes_3D(theta=theta, phi=phi) +
stat_3D(theta=theta, phi=phi, geom="path",
aes(colour = Class, size = Frequency), alpha = .8) +
scale_color_manual(values=c("#FDE725FF", "#DE7065FF", "#238A8DFF", "#482677FF")) +
theme_void() +
annotate("text", x = -.3, y = 0.05, label = "Indicators ") +
annotate("text", x = .35, y = -.4, label = "Probability") +
annotate("text", x = .25, y = .42, label = "Pattern") +
annotate("text", x = .2, y = 0, label = "0.0") +
annotate("text", x = .34, y = -.33, label = "1.0") +
annotate("text", x = -.05, y = 0, angle = 6,
label = "Enjoy - Useful - Logical - Job - Adult") +
transition_states(obs, transition_length=1, state_length=5) +
shadow_mark(alpha = .1,) +
labs(title = "Observed response pattern = {closest_state}")
animate(resp3d, fps = 2)
Comparing model fit
Learning objective: Generate a comprehensive model fit summary table.
Information criteria: model is endorsed by lowest value:
BIC
: \[ =-2*LL+Npar*LN(N) \]aBIC
: \[-2*LL+Npar*LN((N+2)/24)\]CIAC
: \[-2*LL+Npar*(LN(N)+1))\]AWE
: \[ -2*LL+2*Npar*(LN(N)+1.5) \]
Comparing models:
VLMR
: Vuong-Lo-Mendell-Rubin LRT (TECH11)BLRT
: Bootstrap LRT (TECH14)BF
: Bayes FactorcmP(k)
: Correct Model Probability
Run a quick enumeration
lca_k1_6 <- lapply(1:6, function(k) {
lca_enum <- mplusObject(
TITLE = glue("Class {k}"),
VARIABLE = glue(
"categorical = Enjoy-Adult;
usevar = Enjoy-Adult;
classes = c({k}); "),
ANALYSIS =
"estimator = mlr;
type = mixture;
starts = 200 50;
processors = 10;",
OUTPUT = "sampstat residual tech11 tech14;",
PLOT =
"type = plot3;
series = Enjoy-Adult(*);",
usevariables = colnames(lsay_data),
rdata = lsay_data)
lca_enum_fit <- mplusModeler(lca_enum,
dataout=glue(here("enum_mplus", "c_lca_lsay_Lab10.dat")),
modelout=glue(here("enum_mplus", "c{k}_lca_lsay_Lab10.inp")) ,
check=TRUE, run = TRUE, hashfilename = FALSE)
})
Create model fit summary table
Extract data and calculate indices derived from the Log Likelihood
## <simpleError in file(file, "rt"): cannot open the connection>
n_size <- all_output[["c1_lca_lsay_Lab10.out"]][["summaries"]][["Observations"]]
enum_extract <- LatexSummaryTable(all_output, #
keepCols=c("Title","Parameters", "LL", "BIC", #
"aBIC", "BLRT_PValue", "T11_VLMR_PValue"), #
sortBy = "Title") #
allFit <- enum_extract %>%
mutate(aBIC = -2*LL+Parameters*log((n_size+2)/24)) %>%
mutate(CIAC = -2*LL+Parameters*(log(n_size)+1)) %>%
mutate(AWE = -2*LL+2*Parameters*(log(n_size)+1.5)) %>%
mutate(SIC = -.5*BIC) %>%
mutate(expSIC = exp(SIC - max(SIC))) %>%
mutate(expSUM = sum(expSIC)) %>%
mutate(BF = exp(SIC-lead(SIC))) %>%
mutate(cmPk = expSIC/expSUM) %>%
select(1:5,8:9,7,6,13,14)
Format table with package {gt
}
allFit %>%
gt() %>%
tab_header(
title = md("**Model Fit Summary Table**")) %>%
tab_source_note(
source_note = md("Data Source: **Longitudinal Study of American Youth.**")) %>%
cols_label(
Title = "Classes",
Parameters = md("*NPar*"),
LL = md("*LL*"),
T11_VLMR_PValue = html("VLMR"),
BLRT_PValue = html("BLRT"),
BF = html("Bayes<br>Factor"),
cmPk = html("cmP<sub>k")) %>%
tab_options(column_labels.font.weight = "bold") %>%
fmt_number(10:11,decimals = 2,
drop_trailing_zeros=TRUE,
suffixing = TRUE) %>%
fmt_number(2:9,decimals = 2)
Model Fit Summary Table | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Classes | NPar | LL | BIC | aBIC | CIAC | AWE | VLMR | BLRT | Bayes Factor |
cmPk |
Class 1 | 5.00 | −10,250.60 | 20,541.34 | 20,525.45 | 20,546.34 | 20,596.47 | NA | NA | 0 | 0 |
Class 2 | 11.00 | −8,785.32 | 17,658.92 | 17,623.97 | 17,669.93 | 17,780.22 | 0.00 | 0.00 | 0 | 0 |
Class 3 | 17.00 | −8,693.57 | 17,523.59 | 17,469.57 | 17,540.59 | 17,711.04 | 0.00 | 0.00 | 0 | 0 |
Class 4 | 23.00 | −8,664.09 | 17,512.79 | 17,439.71 | 17,535.79 | 17,766.40 | 0.00 | 0.00 | 5.22B | 0.5 |
Class 5 | 29.00 | −8,662.39 | 17,557.54 | 17,465.40 | 17,586.54 | 17,877.31 | 0.67 | 1.00 | 12.32B | 0 |
Class 6 | 35.00 | −8,661.54 | 17,604.01 | 17,492.80 | 17,639.01 | 17,989.94 | 0.75 | 1.00 | 0 | 0 |
Step1 - 3step LSAY - Lab9 | 23.00 | −8,664.09 | 17,512.79 | 17,439.71 | 17,535.79 | 17,766.40 | 0.00 | 0.00 | NA | 0.5 |
Data Source: Longitudinal Study of American Youth. |
References
Drew A. Linzer, Jeffrey B. Lewis (2011). poLCA: An R Package for Polytomous Variable Latent Class Analysis. Journal of Statistical Software, 42(10), 1-29. URL http://www.jstatsoft.org/v42/i10/.
Hallquist, M. N., & Wiley, J. F. (2018). MplusAutomation: An R Package for Facilitating Large-Scale Latent Variable Analyses in Mplus. Structural equation modeling: a multidisciplinary journal, 25(4), 621-638.
Miller, J. D., Hoffer, T., Suchner, R., Brown, K., & Nelson, C. (1992). LSAY codebook. Northern Illinois University.
Muthén, B. O., Muthén, L. K., & Asparouhov, T. (2017). Regression and mediation analysis using Mplus. Los Angeles, CA: Muthén & Muthén.
Muthén, L.K. and Muthén, B.O. (1998-2017). Mplus User’s Guide. Eighth Edition. Los Angeles, CA: Muthén & Muthén
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686