Latent Variable Modelling in brms
I’ve been absent from this blog over the past few years. It seems I’ve been busy settling myself into my PhD! This is the first blog of 2020, and hopefully the first of many new statistical posts disseminating the various tidbits I’ve been learning during my PhD.
This blog post will cover modelling latent variables in the brms package in R. Latent variables are an important statistical concept in psychology, and I’ve found myself using them a lot in my PhD so far. But our options for statistically modelling latent variables, especially within Bayesian frameworks, are booming. Here’s a quick foray into this area.
What are latent variables?
Often, we are interested in studying variables which we cannot directly measure. For example, we may be interested in extraversion, a facet of personality that describes how outgoing and social someone is. There is no instrument or scientific measure that we can use to directly measure extraversion, at least not in the same way that we can directly measure distance in metres.
To get around this, psychologists often ask people batteries of validated questions, such as “I enjoy parties” and “I am outgoing and sociable”, that indirectly measure extraversion. The assumption is that responses to these items reflect an underlying, unobserved variable called a latent variable. This latent variable can be estimated in its own right, or used to predict other variables.
Using lavaan to measure latent variables
library(lavaan)
library(tidyverse)
The lavaan package in R is often used to measure latent variables. We will also use tidyverse to wrangle our data.
Let’s first create some simulated data to demonstrate this. We’ll create an observed variable y that is normally distributed around 0 with a standard deviation of 2.
y <- rnorm(n = 100, mean = 0, sd = 2)
Now we will create a latent variable x that is correlated with y. To make this more concrete, imagine that x is extraversion, and y is something that is predicted by extraversion, like the ability to make people laugh at parties.
x <- rnorm(n = 100, mean = y, sd = 2)
We simulated latent x and observed y to be positively correlated.
cor(x, y) %>% round(2)
## [1] 0.62
Unfortunately, since x is a latent variable, we can never know this correlation for certain in the real world. Instead, we need to use various measures of this latent variable, which we will call z1, z2, and z3, to infer x. We add some noise around these manifest variables, as none of them are perfect measures of x.
z1 <- rnorm(n = 100, mean = x, sd = 0.5)
z2 <- rnorm(n = 100, mean = x, sd = 0.5)
z3 <- rnorm(n = 100, mean = x, sd = 0.5)
Let’s put these together into one data frame.
d <- data.frame(y = y, x = x, z1 = z1, z2 = z2, z3 = z3)
rm(x, y, z1, z2, z3)
Now we will use lavaan to measure this latent variable.
fit <- cfa('X =~ z1 + z2 + z3', data = d)
summary(fit)
## lavaan 0.6-5 ended normally after 58 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 6
##
## Number of observations 100
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## X =~
## z1 1.000
## z2 0.987 0.032 31.284 0.000
## z3 1.029 0.031 32.700 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .z1 0.315 0.063 4.972 0.000
## .z2 0.255 0.057 4.493 0.000
## .z3 0.221 0.057 3.873 0.000
## X 5.844 0.871 6.710 0.000
How well have we estimated the latent variable?
d$x.pred <- predict(fit)
cor(d$x, d$x.pred) %>% as.numeric() %>% round(2)
## [1] 0.99
We estimated it pretty well. Can we also retrieve the correlation between x and y?
fit <- cfa('X =~ z1 + z2 + z3; y ~ X', data = d)
standardizedSolution(fit)[4,4] %>% round(2)
## [1] 0.61
This is essentially the same as the correlation above. This is a trivial example, because we actually know what x is in our data, but if we did not, this latent variable approach would be the most valid way forward.
Here, just taking the mean of our continuous manifest variables gets us pretty close to the latent variable.
cor(d$x, apply(d[,c("z1","z2","z3")], 1, mean)) %>% round(2)
## [1] 0.99
But if we were using Likert scales, which we often do in psychology, taking the mean wouldn’t be valid.
Using brms
We can, however, take a Bayesian approach to this problem. The R package brms fits Bayesian regression models using Stan. Proper support for latent variables is currently in the works for brms, but I thought I’d have a go here anyway.
In confirmatory factor analysis, we fix one of the factor loadings to 1 to give the latent variable a scale. We cannot fix parameters like this in brms yet, but we can set a stupidly narrow prior centred on 1, such that no amount of data could feasibly shift it. Then we use the missing data syntax mi() in brms to estimate the latent variable.
library(brms)
# add empty latent variable to data frame
d$X <- as.numeric(NA)
# mi(X) tells brms that it is missing
bf1 <- bf(z1 ~ 0 + mi(X)) # factor loading 1 (constant)
bf2 <- bf(z2 ~ 0 + mi(X)) # factor loading 2
bf3 <- bf(z3 ~ 0 + mi(X)) # factor loading 3
bf4 <- bf(X | mi() ~ 0)
fit2 <- brm(bf1 + bf2 + bf3 + bf4 + set_rescor(FALSE), data = d,
prior = c(prior(normal(1, 0.000001), coef = miX, resp = z1),
prior(normal(1, 1), coef = miX, resp = z2),
prior(normal(1, 1), coef = miX, resp = z3)),
iter = 4000, warmup = 2000, chains = 2, cores = 2,
control = list(adapt_delta = 0.99, max_treedepth = 15))
save(fit2, file = "fit2.rda")
This takes a while to fit, but it gets there eventually. Do we have similar parameters to those above?
summary(fit2)
## Family: MV(gaussian, gaussian, gaussian, gaussian)
## Links: mu = identity; sigma = identity
## mu = identity; sigma = identity
## mu = identity; sigma = identity
## mu = identity; sigma = identity
## Formula: z1 ~ 0 + mi(X)
## z2 ~ 0 + mi(X)
## z3 ~ 0 + mi(X)
## X | mi() ~ 0
## Data: d (Number of observations: 100)
## Samples: 2 chains, each with iter = 4000; warmup = 2000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## z1_miX 1.00 0.00 1.00 1.00 1.00 3687 2299
## z2_miX 0.99 0.03 0.93 1.06 1.00 1073 1625
## z3_miX 1.03 0.03 0.97 1.10 1.00 1082 1654
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_z1 0.58 0.06 0.47 0.71 1.00 1530 1973
## sigma_z2 0.51 0.06 0.39 0.63 1.00 1126 1264
## sigma_z3 0.50 0.07 0.37 0.63 1.00 919 1331
## sigma_X 2.44 0.19 2.10 2.85 1.00 3949 2711
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Yes, the loadings are the same. The brms output gives us residual standard deviations instead of variances; squaring these puts them on the same scale as lavaan’s variance estimates.
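As a quick check (this snippet isn’t in the original output), we can square the posterior medians of the estimated standard deviations and compare them with lavaan’s variance estimates above (0.315, 0.255, 0.221, and 5.844):
post_sigma <- posterior_samples(fit2, pars = "^sigma")  # sigma_z1, sigma_z2, sigma_z3, sigma_X
round(apply(post_sigma, 2, median)^2, 2)                # squared SDs, comparable to lavaan's variances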
brms also gives us posterior distributions for predicted factor scores. How similar are these to the ones lavaan gave us?
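The original plotting code isn’t shown here, but here is a sketch of how such a comparison could be drawn. It assumes that brms stores each person’s imputed latent value as a Ymi_X[i] parameter (the naming convention for mi() responses in this version of brms), and it uses ggridges for the posterior densities.
# A sketch of the comparison plot (not the original plotting code)
library(ggridges)

posterior_samples(fit2, pars = "^Ymi_X") %>%            # posterior draws of the imputed latent values
  pivot_longer(everything(), names_to = "person", values_to = "X") %>%
  mutate(person = readr::parse_number(person)) %>%
  filter(person <= 20) %>%                               # first 20 people, to keep the plot readable
  ggplot(aes(x = X, y = factor(person))) +
  geom_density_ridges(rel_min_height = 0.01) +           # posterior density of each person's factor score
  geom_point(data = tibble(person = 1:20, x.pred = as.numeric(d$x.pred[1:20])),
             aes(x = x.pred, y = factor(person))) +      # lavaan's predicted factor scores as dots
  labs(x = "Latent variable X", y = "Person")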
In the plot above, the dots are the predicted latent variable values from our lavaan model, and the distributions are the posterior densities for each individual’s Bayesian predicted factor score. They are very similar.
Finally, we can fit a structural equation model to see whether this Bayesian latent variable predicts y in the same way. We just add another formula to the brm() call in which y is predicted by the estimated latent variable, mi(X).
# add empty latent variable
d$X <- as.numeric(NA)
bf1 <- bf(z1 ~ 0 + mi(X)) # factor loading 1 (constant)
bf2 <- bf(z2 ~ 0 + mi(X)) # factor loading 2
bf3 <- bf(z3 ~ 0 + mi(X)) # factor loading 3
bf4 <- bf(X | mi() ~ 0) # latent variable
bf5 <- bf(y ~ 0 + mi(X)) # added regression
fit3 <- brm(bf1 + bf2 + bf3 + bf4 + bf5 + set_rescor(FALSE), data = d,
prior = c(prior(normal(1, 0.000001), coef = miX, resp = z1),
prior(normal(1, 1), coef = miX, resp = z2),
prior(normal(1, 1), coef = miX, resp = z3),
prior(normal(0, 1), coef = miX, resp = y)),
iter = 4000, warmup = 2000, chains = 2, cores = 2,
control = list(adapt_delta = 0.99, max_treedepth = 15))
save(fit3, file = "fit3.rda")
The unstandardised parameter for the relationship between latent x and observed y in the lavaan model was:
parameterEstimates(fit)[4,4] %>% round(2)
## [1] 0.47
And in the brms model:
post <- posterior_samples(fit3)
median(post$bsp_y_miX) %>% round(2)
## [1] 0.47
Here’s the posterior distribution for that regression parameter. It’s reliably above zero, suggesting a positive effect of latent x on observed y.
hist(post$bsp_y_miX)
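To put a number on “reliably above zero” (a quick check, not in the original post), we can compute the proportion of posterior draws that fall above zero:
# proportion of posterior draws for the regression parameter that are positive
mean(post$bsp_y_miX > 0) %>% round(2)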
This shows that brms can handle confirmatory factor analysis and structural equation modelling with full Bayesian inference in Stan. Unfortunately, brms offers no good way of measuring model fit (except for comparing to other models), while lavaan offers a variety of model fit indices. For example:
fitMeasures(fit)['rmsea'] %>% round(2)
## rmsea
## 0.07
fitMeasures(fit)['srmr'] %>% round(2)
## srmr
## 0.01
These fit measures indicate good model fit. brms does offer the pp_check() function though, which visualises model fit. For example, let’s see how well the model predicts our outcome variable y.
pp_check(fit3, resp = "y", nsamples = 100)
It does a pretty good job, though more data would make this a tighter fit.
The latent variable capabilities of brms are very exciting, and I look forward to their implementation in future versions of the package!
Session Info
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows >= 8 x64 (build 9200)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_New Zealand.1252
## [2] LC_CTYPE=English_New Zealand.1252
## [3] LC_MONETARY=English_New Zealand.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_New Zealand.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggridges_0.5.1 brms_2.10.0 Rcpp_1.0.2 lavaan_0.6-5
## [5] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
## [9] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
## [13] tidyverse_1.2.1 rmarkdown_1.15
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-140 matrixStats_0.55.0 xts_0.11-2
## [4] lubridate_1.7.4 threejs_0.3.1 httr_1.4.1
## [7] rstan_2.19.2 tools_3.6.1 backports_1.1.5
## [10] R6_2.4.0 DT_0.9 lazyeval_0.2.2
## [13] colorspace_1.4-1 withr_2.1.2 prettyunits_1.0.2
## [16] processx_3.4.1 tidyselect_0.2.5 gridExtra_2.3
## [19] mnormt_1.5-5 Brobdingnag_1.2-6 compiler_3.6.1
## [22] cli_1.1.0 rvest_0.3.4 shinyjs_1.0
## [25] xml2_1.2.2 labeling_0.3 colourpicker_1.0
## [28] scales_1.0.0 dygraphs_1.1.1.6 callr_3.3.2
## [31] StanHeaders_2.19.0 digest_0.6.21 pbivnorm_0.6.0
## [34] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.3.6
## [37] htmlwidgets_1.3 rlang_0.4.0 readxl_1.3.1
## [40] rstudioapi_0.10 shiny_1.3.2 generics_0.0.2
## [43] zoo_1.8-6 jsonlite_1.6 crosstalk_1.0.0
## [46] gtools_3.8.1 inline_0.3.15 magrittr_1.5
## [49] loo_2.1.0 bayesplot_1.7.0 Matrix_1.2-17
## [52] munsell_0.5.0 abind_1.4-5 lifecycle_0.1.0
## [55] stringi_1.4.3 yaml_2.2.0 MASS_7.3-51.4
## [58] pkgbuild_1.0.5 plyr_1.8.4 grid_3.6.1
## [61] parallel_3.6.1 promises_1.0.1 crayon_1.3.4
## [64] miniUI_0.1.1.1 lattice_0.20-38 haven_2.1.1
## [67] hms_0.5.1 ps_1.3.0 zeallot_0.1.0
## [70] knitr_1.25 pillar_1.4.2 igraph_1.2.4.1
## [73] markdown_1.1 shinystan_2.5.0 codetools_0.2-16
## [76] reshape2_1.4.3 stats4_3.6.1 rstantools_2.0.0
## [79] glue_1.3.1 evaluate_0.14 modelr_0.1.5
## [82] vctrs_0.2.0 httpuv_1.5.2 cellranger_1.1.0
## [85] gtable_0.3.0 assertthat_0.2.1 xfun_0.9
## [88] mime_0.7 xtable_1.8-4 broom_0.5.2
## [91] coda_0.19-3 later_0.8.0 rsconnect_0.8.15
## [94] shinythemes_1.1.2 ellipsis_0.3.0 bridgesampling_0.7-2