Make the Stan code for the Bayesian dynamic coevolutionary model. Stan code is generated, checked for syntactical errors, and then returned as a character string.

coev_make_stancode(
  data,
  variables,
  id,
  tree,
  effects_mat = NULL,
  complete_cases = FALSE,
  dist_mat = NULL,
  dist_cov = "exp_quad",
  measurement_error = NULL,
  prior = NULL,
  scale = TRUE,
  estimate_correlated_drift = TRUE,
  estimate_residual = TRUE,
  log_lik = FALSE,
  prior_only = FALSE
)

Arguments

data

An object of class data.frame (or one that can be coerced to that class) containing data of all variables used in the model.

variables

A named list identifying variables that should coevolve in the model and their associated response distributions as character strings (e.g. list(var1 = "bernoulli_logit", var2 = "ordered_logistic")). Must identify at least two variables. Variable names must refer to valid column names in data. Currently, the only supported response distributions are bernoulli_logit, ordered_logistic, poisson_softplus, negative_binomial_softplus, normal, and gamma_log.

id

A character of length one identifying the variable in the data that links rows to tips on the phylogeny. Must refer to a valid column name in the data. The id column must exactly match the tip labels in the phylogeny.

tree

A phylogenetic tree object of class phylo or multiPhylo. The tree(s) must be rooted and must include positive non-zero branch lengths. All trees in multiPhylo objects must have the same number of internal nodes and branches.

effects_mat

(optional) A boolean matrix with row and column names exactly matching the variables declared for the model. If not specified, all cross-lagged effects will be estimated in the model. If specified, the model will only estimate cross-lagged effects where cells in the matrix are TRUE and will ignore cross-lagged effects where cells in the matrix are FALSE. In the matrix, columns represent predictor variables and rows represent outcome variables. All autoregressive effects (e.g., X -> X) must be TRUE in the matrix.

complete_cases

(optional) Logical. If FALSE (default), all missing values are imputed by the model. If TRUE, taxa with missing data are excluded.

dist_mat

(optional) A distance matrix with row and column names exactly matching the tip labels in the phylogeny. If specified, the model will additionally control for spatial location by including a separate Gaussian Process over locations for every coevolving variable in the model.

dist_cov

A string specifying the covariance kernel used for Gaussian Processes over locations. Currently supported are "exp_quad" (exponentiated-quadratic kernel; default), "exponential" (exponential kernel), and "matern32" (Matern 3/2 kernel).

measurement_error

(optional) A named list of coevolving variables and their associated columns in the dataset containing standard errors. Only valid for normally-distributed variables. For example, if we declare variables = list(x = "normal", y = "normal"), then we could set measurement_error = list(x = "x_std_err") to tell the function to include measurement error on x using standard errors from the x_std_err column of the dataset.

prior

(optional) A named list of priors for the model. If not specified, the model uses default priors (see help(coev_fit)). Alternatively, the user can specify a named list of priors. The list must contain non-duplicated entries for any of the following parameters: the autoregressive effects (A_diag), the cross effects (A_offdiag), the Cholesky factor for the drift matrix (L_R), the drift std. dev. parameters (Q_sigma), the continuous time intercepts (b), the ancestral states for the traits (eta_anc), the cutpoints for ordinal variables (c), the overdispersion parameters for negative binomial variables (phi), the shape parameters for gamma variables (shape), the sigma parameters for Gaussian Processes over locations (sigma_dist), the rho parameters for Gaussian Processes over locations (rho_dist), the residual standard deviations when there are repeated observations (sigma_residual), and the Cholesky factor for the residual correlations when there are repeated observations (L_residual). These must be entered with valid prior strings, e.g. list(A_offdiag = "normal(0, 2)"). Invalid prior strings will throw an error when the function internally checks the syntax of resulting Stan code.

scale

Logical. If TRUE (default), variables following the normal and gamma_log response distributions are scaled before fitting the model. Continuous variables following the normal distribution are standardised (e.g., mean centered and divided by their standard deviation) and positive real variables following the gamma_log distribution are divided by the mean value without centering. This approach is recommended when using default priors to improve efficiency and ensure accurate inferences. If FALSE, variables are left unscaled for model fitting. In this case, users should take care to set sensible priors on variables.

estimate_correlated_drift

Logical. If TRUE (default), the model estimates the off-diagonals for the $$Q$$ drift matrix (i.e., correlated drift). If FALSE, the off-diagonals for the $$Q$$ drift matrix are set to zero.

estimate_residual

Logical. If TRUE (default), the model estimates residual standard deviations and residual correlations when there are repeated observations for taxa. If FALSE, residual standard deviations and residual correlations are not estimated. The latter may be preferable in cases where repeated observations are sparse (e.g., only some taxa have only few repeated observations). This argument only applies when repeated observations are present in the data.

log_lik

Logical. Set to FALSE by default. If TRUE, the model returns the pointwise log likelihood, which can be used to calculate WAIC and LOO.

prior_only

Logical. If FALSE (default), the model is fitted to the data and returns a posterior distribution. If TRUE, the model samples from the prior only, ignoring the likelihood.

Value

A character string containing the Stan code to fit the dynamic coevolutionary model

Details

For further details, see help(coev_fit)

References

Ringen, E., Martin, J. S., & Jaeggi, A. (2021). Novel phylogenetic methods reveal that resource-use intensification drives the evolution of "complex" societies. EcoEvoRXiv. doi:10.32942/osf.io/wfp95

Sheehan, O., Watts, J., Gray, R. D., Bulbulia, J., Claessens, S., Ringen, E. J., & Atkinson, Q. D. (2023). Coevolution of religious and political authority in Austronesian societies. Nature Human Behaviour, 7(1), 38-45. 10.1038/s41562-022-01471-y

Author

Scott Claessens scott.claessens@gmail.com, Erik Ringen erikjacob.ringen@uzh.ch

Examples

# make stan code
stan_code <- coev_make_stancode(
  data = authority$data,
  variables = list(
    political_authority = "ordered_logistic",
    religious_authority = "ordered_logistic"
  ),
  id = "language",
  tree = authority$phylogeny
  )

# print Stan code
cat(stan_code)
#> // Generated with coevolve 0.1.0.9002
#> functions {
#>   // Charles Driver's solver for the asymptotic Q matrix
#>   matrix ksolve (matrix A, matrix Q) {
#>     int d = rows(A);
#>     int d2 = (d * d - d) %/% 2;
#>     matrix [d + d2, d + d2] O;
#>     vector [d + d2] triQ;
#>     matrix[d,d] AQ;
#>     int z = 0;         // z is row of output
#>     for (j in 1:d) {   // for column reference of solution vector
#>       for (i in 1:j) { // and row reference...
#>         if (j >= i) {  // if i and j denote a covariance parameter
#>           int y = 0;   // start new output row
#>           z += 1;      // shift current output row down
#>           for (ci in 1:d) {   // for columns and
#>             for (ri in 1:d) { // rows of solution
#>               if (ci >= ri) { // when in upper tri (inc diag)
#>                 y += 1;       // move to next column of output
#>                 if (i == j) { // if output row is a diag element
#>                   if (ri == i) O[z, y] = 2 * A[ri, ci];
#>                   if (ci == i) O[z, y] = 2 * A[ci, ri];
#>                 }
#>                 if (i != j) { // if output row is not a diag element
#>                   //if column matches row, sum both A diags
#>                   if (y == z) O[z, y] = A[ri, ri] + A[ci, ci];
#>                   if (y != z) { // otherwise...
#>                     // if solution element is related to output row...
#>                     if (ci == ri) { // if solution element is variance
#>                       // if variance of solution corresponds to row
#>                       if (ci == i) O[z, y] = A[j, ci];
#>                       // if variance of solution corresponds to col
#>                       if (ci == j) O[z, y] = A[i, ci];
#>                     }
#>                     //if solution element is a related covariance
#>                     if (ci != ri && (ri == i || ri == j || ci == i || ci == j )) {
#>                       // for row 1,2 / 2,1 of output,
#>                       // if solution row ri 1 (match)
#>                       // and column ci 3, we need A[2,3]
#>                       if (ri == i) O[z, y] = A[j, ci];
#>                       if (ri == j) O[z, y] = A[i, ci];
#>                       if (ci == i) O[z, y] = A[j, ri];
#>                       if (ci == j) O[z, y] = A[i, ri];
#>                     }
#>                   }
#>                 }
#>                 if (is_nan(O[z, y])) O[z, y] = 0;
#>               }
#>             }
#>           }
#>         }
#>       }
#>     }
#>     z = 0; // get upper tri of Q
#>     for (j in 1:d) {
#>       for (i in 1:j) {
#>         z += 1;
#>         triQ[z] = Q[i, j];
#>       }
#>     }
#>     triQ = -O \ triQ; // get upper tri of asymQ
#>     z = 0; // put upper tri of asymQ into matrix
#>     for (j in 1:d) {
#>       for (i in 1:j) {
#>         z += 1;
#>         AQ[i, j] = triQ[z];
#>         if (i != j) AQ[j, i] = triQ[z];
#>       }
#>     }
#>     return AQ;
#>   }
#>   
#>   // return number of matches of y in vector x
#>   int num_matches(vector x, real y) {
#>     int n = 0;
#>     for (i in 1:rows(x))
#>       if (x[i] == y)
#>         n += 1;
#>     return n;
#>   }
#>   
#>   // return indices in vector x where x == y
#>   array[] int which_equal(vector x, real y) {
#>     array [num_matches(x, y)] int match_positions;
#>     int pos = 1;
#>     for (i in 1:rows(x)) {
#>       if (x[i] == y) {
#>         match_positions[pos] = i;
#>         pos += 1;
#>       }
#>     }
#>     return match_positions;
#>   }
#> }
#> data{
#>   int<lower=1> N_tips; // number of tips
#>   int<lower=1> N_tree; // number of trees
#>   int<lower=1> N_obs; // number of observations
#>   int<lower=2> J; // number of response traits
#>   int<lower=1> N_seg; // total number of segments in the trees
#>   array[N_tree, N_seg] int<lower=1> node_seq; // index of tree nodes
#>   array[N_tree, N_seg] int<lower=0> parent; // index of parent nodes
#>   array[N_tree, N_seg] real ts; // time since parent
#>   array[N_tree, N_seg] int<lower=0,upper=1> tip; // segment ends in tip
#>   array[J,J] int<lower=0,upper=1> effects_mat; // effects matrix
#>   int<lower=2> num_effects; // number of effects being estimated
#>   matrix[N_obs,J] y; // observed data
#>   matrix[N_obs,J] miss; // are data points missing?
#>   array[N_obs] int<lower=1> tip_id; // group index between 1 and N_tips
#>   int<lower=0,upper=1> prior_only; // should likelihood be ignored?
#> }
#> transformed data{
#>   vector[to_int(N_obs - sum(col(miss, 1)))] obs1; // observed data for variable 1
#>   vector[to_int(N_obs - sum(col(miss, 2)))] obs2; // observed data for variable 2
#>   obs1 = col(y, 1)[which_equal(col(miss, 1), 0)];
#>   obs2 = col(y, 2)[which_equal(col(miss, 2), 0)];
#> }
#> parameters{
#>   vector<upper=0>[J] A_diag; // autoregressive terms of A
#>   vector[num_effects - J] A_offdiag; // cross-lagged terms of A
#>   cholesky_factor_corr[J] L_R; // lower-tri choleksy decomp corr mat
#>   vector<lower=0>[J] Q_sigma; // std deviation parameters of the Q mat
#>   vector[J] b; // SDE intercepts
#>   array[N_tree] vector[J] eta_anc; // ancestral states
#>   array[N_tree, N_seg - 1] vector[J] z_drift; // stochastic drift
#>   array[N_tree] matrix[N_tips, J] terminal_drift; // drift for the tips
#>   ordered[3] c1; // cut points for variable 1
#>   ordered[3] c2; // cut points for variable 2
#> }
#> transformed parameters{
#>   array[N_tree, N_seg] vector[J] eta;
#>   matrix[J,J] A = diag_matrix(A_diag); // selection matrix
#>   matrix[J,J] Q = diag_matrix(Q_sigma) * (L_R * L_R') * diag_matrix(Q_sigma); // drift matrix
#>   matrix[J,J] Q_inf; // asymptotic covariance matrix
#>   array[N_tree, N_seg] matrix[J,J] VCV_tips; // vcov matrix for drift
#>   array[N_tree,N_tips] vector[J] tdrift; // terminal drift
#>   // fill off diagonal of A matrix
#>   {
#>     int ticker = 1;
#>     for (i in 1:J) {
#>       for (j in 1:J) {
#>         if (i != j) {
#>           if (effects_mat[i,j] == 1) {
#>             A[i,j] = A_offdiag[ticker];
#>             ticker += 1;
#>           } else if (effects_mat[i,j] == 0) {
#>             A[i,j] = 0;
#>           }
#>         }
#>       }
#>     }
#>   }
#>   // calculate asymptotic covariance
#>   Q_inf = ksolve(A, Q);
#>   // loop over phylogenetic trees
#>   for (t in 1:N_tree) {
#>     // setting ancestral states and placeholders
#>     eta[t, node_seq[t, 1]] = eta_anc[t];
#>     VCV_tips[t, node_seq[t, 1]] = diag_matrix(rep_vector(-99, J));
#>     for (i in 2:N_seg) {
#>       matrix[J,J] A_delta; // amount of deterministic change
#>       matrix[J,J] VCV; // vcov matrix of stochastic change
#>       vector[J] drift_seg; // accumulated drift over the segment
#>       A_delta = matrix_exp(A * ts[t, i]);
#>       VCV = Q_inf - quad_form_sym(Q_inf, A_delta');
#>       drift_seg = cholesky_decompose(VCV) * z_drift[t, i-1];
#>       // if not a tip, add the drift parameter
#>       if (tip[t, i] == 0) {
#>         eta[t, node_seq[t, i]] = to_vector(
#>           A_delta * eta[t, parent[t, i]] + ((A \ add_diag(A_delta, -1)) * b) + drift_seg
#>         );
#>         VCV_tips[t, node_seq[t, i]] = diag_matrix(rep_vector(-99, J));
#>       }
#>       // if is a tip, omit, we'll deal with it in the model block;
#>       else {
#>         eta[t, node_seq[t, i]] = to_vector(
#>           A_delta * eta[t, parent[t, i]] + ((A \ add_diag(A_delta, -1)) * b)
#>         );
#>         VCV_tips[t, node_seq[t, i]] = VCV;
#>       }
#>     }
#>     for (i in 1:N_tips) {
#>       tdrift[t,i] = cholesky_decompose(VCV_tips[t,i]) * to_vector(terminal_drift[t][i,]);
#>     }
#>   }
#> }
#> model{
#>   b ~ std_normal();
#>   for (t in 1:N_tree) {
#>     eta_anc[t] ~ std_normal();
#>     for (i in 1:(N_seg - 1)) z_drift[t, i] ~ std_normal();
#>     to_vector(terminal_drift[t]) ~ std_normal();
#>   }
#>   A_offdiag ~ std_normal();
#>   A_diag ~ std_normal();
#>   L_R ~ lkj_corr_cholesky(4);
#>   Q_sigma ~ std_normal();
#>   c1 ~ normal(0, 2);
#>   c2 ~ normal(0, 2);
#>   if (!prior_only) {
#>     for (i in 1:N_obs) {
#>       vector[N_tree] lp = rep_vector(0.0, N_tree);
#>       for (t in 1:N_tree) {
#>         if (miss[i,1] == 0) lp[t] += ordered_logistic_lpmf(to_int(y[i,1]) | eta[t,tip_id[i]][1] + tdrift[t,tip_id[i]][1], c1);
#>         if (miss[i,2] == 0) lp[t] += ordered_logistic_lpmf(to_int(y[i,2]) | eta[t,tip_id[i]][2] + tdrift[t,tip_id[i]][2], c2);
#>       }
#>       target += log_sum_exp(lp);
#>     }
#>   }
#> }
#> generated quantities{
#>   array[N_tree,N_obs,J] real yrep; // predictive checks
#>   matrix[J,J] cor_R; // correlated drift
#>   cor_R = multiply_lower_tri_self_transpose(L_R);
#>   {
#>     array[N_tree,N_tips] vector[J] terminal_drift_rep;
#>     for (i in 1:N_tips) {
#>       for (t in 1:N_tree) {
#>         for (j in 1:J) terminal_drift_rep[t,i][j] = normal_rng(0, 1);
#>         terminal_drift_rep[t,i] = cholesky_decompose(VCV_tips[t, i]) * terminal_drift_rep[t,i];
#>       }
#>     }
#>     for (i in 1:N_obs) {
#>       for (t in 1:N_tree) {
#>         yrep[t,i,1] = ordered_logistic_rng(eta[t,tip_id[i]][1] + terminal_drift_rep[t,tip_id[i]][1], c1);
#>         yrep[t,i,2] = ordered_logistic_rng(eta[t,tip_id[i]][2] + terminal_drift_rep[t,tip_id[i]][2], c2);
#>       }
#>     }
#>   }
#> }