Fit Bayesian dynamic coevolutionary model in Stan

Fit Bayesian dynamic coevolutionary models in Stan for full Bayesian inference. The model allows users to assess causal directionality (X -> Y vs. Y -> X) and contingencies (X, then Y) in evolution. Several data types are supported, including binary, ordinal, count, continuous, and positive real variables. The model can additionally account for missing data, repeated observations, and controls for spatial proximity. Model fit can be assessed and compared with posterior predictive checks and cross-validation.

coev_fit(
  data,
  variables,
  id,
  tree,
  effects_mat = NULL,
  complete_cases = FALSE,
  dist_mat = NULL,
  dist_cov = "exp_quad",
  measurement_error = NULL,
  prior = NULL,
  scale = TRUE,
  estimate_correlated_drift = TRUE,
  estimate_residual = TRUE,
  log_lik = FALSE,
  prior_only = FALSE,
  adapt_delta = 0.95,
  ...
)

Arguments

data: An object of class data.frame (or one that can be coerced to that class) containing data of all variables used in the model.
variables: A named list identifying variables that should coevolve in the model and their associated response distributions as character strings (e.g. list(var1 = "bernoulli_logit", var2 = "ordered_logistic")). Must identify at least two variables. Variable names must refer to valid column names in data. Currently, the only supported response distributions are bernoulli_logit, ordered_logistic, poisson_softplus, negative_binomial_softplus, normal, and gamma_log.
id: A character of length one identifying the variable in the data that links rows to tips on the phylogeny. Must refer to a valid column name in the data. The id column must exactly match the tip labels in the phylogeny.
tree: A phylogenetic tree object of class phylo or multiPhylo. The tree(s) must be rooted and must include positive non-zero branch lengths. All trees in multiPhylo objects must have the same number of internal nodes and branches.
effects_mat: (optional) A boolean matrix with row and column names exactly matching the variables declared for the model. If not specified, all cross-lagged effects will be estimated in the model. If specified, the model will only estimate cross-lagged effects where cells in the matrix are TRUE and will ignore cross-lagged effects where cells in the matrix are FALSE. In the matrix, columns represent predictor variables and rows represent outcome variables. All autoregressive effects (e.g., X -> X) must be TRUE in the matrix.
complete_cases: (optional) Logical. If FALSE (default), all missing values are imputed by the model. If TRUE, taxa with missing data are excluded.
dist_mat: (optional) A distance matrix with row and column names exactly matching the tip labels in the phylogeny. If specified, the model will additionally control for spatial location by including a separate Gaussian Process over locations for every coevolving variable in the model.
dist_cov: A string specifying the covariance kernel used for Gaussian Processes over locations. Currently supported are "exp_quad" (exponentiated-quadratic kernel; default), "exponential" (exponential kernel), and "matern32" (Matern 3/2 kernel).
measurement_error: (optional) A named list of coevolving variables and their associated columns in the dataset containing standard errors. Only valid for normally-distributed variables. For example, if we declare variables = list(x = "normal", y = "normal"), then we could set measurement_error = list(x = "x_std_err") to tell the function to include measurement error on x using standard errors from the x_std_err column of the dataset.
prior: (optional) A named list of priors for the model. If not specified, the model uses default priors (see help(coev_fit)). Alternatively, the user can specify a named list of priors. The list must contain non-duplicated entries for any of the following parameters: the autoregressive effects (A_diag), the cross effects (A_offdiag), the Cholesky factor for the drift matrix (L_R), the drift std. dev. parameters (Q_sigma), the continuous time intercepts (b), the ancestral states for the traits (eta_anc), the cutpoints for ordinal variables (c), the overdispersion parameters for negative binomial variables (phi), the shape parameters for gamma variables (shape), the sigma parameters for Gaussian Processes over locations (sigma_dist), the rho parameters for Gaussian Processes over locations (rho_dist), the residual standard deviations when there are repeated observations (sigma_residual), and the Cholesky factor for the residual correlations when there are repeated observations (L_residual). These must be entered with valid prior strings, e.g. list(A_offdiag = "normal(0, 2)"). Invalid prior strings will throw an error when the function internally checks the syntax of resulting Stan code.
scale: Logical. If TRUE (default), variables following the normal and gamma_log response distributions are scaled before fitting the model. Continuous variables following the normal distribution are standardised (e.g., mean centered and divided by their standard deviation) and positive real variables following the gamma_log distribution are divided by the mean value without centering. This approach is recommended when using default priors to improve efficiency and ensure accurate inferences. If FALSE, variables are left unscaled for model fitting. In this case, users should take care to set sensible priors on variables.
estimate_correlated_drift: Logical. If TRUE (default), the model estimates the off-diagonals for the \(Q\) drift matrix (i.e., correlated drift). If FALSE, the off-diagonals for the \(Q\) drift matrix are set to zero.
estimate_residual: Logical. If TRUE (default), the model estimates residual standard deviations and residual correlations when there are repeated observations for taxa. If FALSE, residual standard deviations and residual correlations are not estimated. The latter may be preferable in cases where repeated observations are sparse (e.g., only some taxa have only few repeated observations). This argument only applies when repeated observations are present in the data.
log_lik: Logical. Set to FALSE by default. If TRUE, the model returns the pointwise log likelihood, which can be used to calculate WAIC and LOO.
prior_only: Logical. If FALSE (default), the model is fitted to the data and returns a posterior distribution. If TRUE, the model samples from the prior only, ignoring the likelihood.
adapt_delta: Argument for cmdstanr::sample(). Default is 0.95.
...: Additional arguments for cmdstanr::sample().

Value

Fitted model of class coevfit

Details

The coev_fit function generates the Stan code for the model, generates the data list, and then compiles and fits the model using the cmdstanr package.

Fit a Bayesian dynamic coevolutionary model in Stan. A general overview is provided in the vignette vignette("coevolve")

Available response distributions

Variables that should coevolve in the model are declared in the variables argument, using a named list with associated response distributions. For example: list(x = "bernoulli_logit", y = "ordered_logistic"). Currently, the only supported response distributions are bernoulli_logit, ordered_logistic, poisson_softplus, negative_binomial_softplus, normal, and gamma_log.

Default prior distributions

If priors are not explicitly declared by the user, default priors are used. Default priors were chosen to be weakly regularising, which improves model fitting and conservatism in parameter estimates. Default priors for Stan parameters are as follows:

A_diag (autoregressive effects) = std_normal()
A_offdiag (cross effects) = std_normal()
L_R (Cholesky factor for drift matrix) = lkj_corr_cholesky(4)
Q_sigma (drift std. dev. parameters) = std_normal()
b (continuous time intercepts) = std_normal()
eta_anc (trait ancestral states) = std_normal()
c (ordinal cutpoints) = normal(0, 2)
shape (shape parameters for gamma distributions) = gamma(0.01, 0.01)
sigma_dist (sigma for Gaussian process over locations) = exponential(1)
rho_dist (rho for Gaussian process over locations) = exponential(5)
sigma_residual (residual standard deviations) = exponential(1)
L_residual (Cholesky factor for residual correlations) = lkj_corr_cholesky(4)

The default prior for phi (the overdispersion parameter for the negative-binomial distribution) is scaled automatically based on the variance of the data.

We recommend that users assess the suitability of these default priors by fitting the model with prior_only = TRUE and then plotting prior predictive checks for all variables using the coev_plot_predictive_check() function.

Handling missing data

In order to retain the most information, the coev_fit() function automatically imputes all missing values. To turn off this behaviour and exclude taxa with missing data, set complete_cases = FALSE.

Dealing with repeated observations

If taxa appear in the dataset multiple times (i.e., there are repeated observations), the model will automatically estimate residual standard deviations and residual correlations that capture the within-taxa variation that is not due to the coevolutionary process. To turn off this behaviour, set estimate_residual = FALSE.

Incorporating measurement error

If any normally-distributed coevolving variables are measured with error, the user can pass standard errors for one or more of these variables into the model with the measurement_error argument.

Controlling for spatial location

If users declare a distance matrix with the dist_mat argument, the model will include several Gaussian processes (one per coevolving variable) over spatial locations.

References

Ringen, E., Martin, J. S., & Jaeggi, A. (2021). Novel phylogenetic methods reveal that resource-use intensification drives the evolution of "complex" societies. EcoEvoRXiv. doi:10.32942/osf.io/wfp95

Sheehan, O., Watts, J., Gray, R. D., Bulbulia, J., Claessens, S., Ringen, E. J., & Atkinson, Q. D. (2023). Coevolution of religious and political authority in Austronesian societies. Nature Human Behaviour, 7(1), 38-45. 10.1038/s41562-022-01471-y

Author

Scott Claessens scott.claessens@gmail.com, Erik Ringen erikjacob.ringen@uzh.ch

Examples

if (FALSE) { # \dontrun{
# fit dynamic coevolutionary model
fit <- coev_fit(
  data = authority$data,
  variables = list(
    political_authority = "ordered_logistic",
    religious_authority = "ordered_logistic"
  ),
  id = "language",
  tree = authority$phylogeny,
  # additional arguments for cmdstanr::sample()
  chains = 4,
  parallel_chains = 4,
  seed = 1
  )

# print model summary
summary(fit)
} # }