R/coev_make_stancode.R
coev_make_stancode.Rd
Make the Stan code for the Bayesian dynamic coevolutionary model. Stan code is generated, checked for syntactical errors, and then returned as a character string.
coev_make_stancode(
data,
variables,
id,
tree,
effects_mat = NULL,
complete_cases = FALSE,
dist_mat = NULL,
dist_cov = "exp_quad",
measurement_error = NULL,
prior = NULL,
scale = TRUE,
estimate_correlated_drift = TRUE,
estimate_residual = TRUE,
log_lik = FALSE,
prior_only = FALSE
)
An object of class data.frame
(or one that can be coerced
to that class) containing data of all variables used in the model.
A named list identifying variables that should coevolve in
the model and their associated response distributions as character strings
(e.g. list(var1 = "bernoulli_logit", var2 = "ordered_logistic")
).
Must identify at least two variables. Variable names must refer to valid
column names in data. Currently, the only supported response distributions
are bernoulli_logit
, ordered_logistic
,
poisson_softplus
, negative_binomial_softplus
, normal
,
and gamma_log
.
A character of length one identifying the variable in the data that links rows to tips on the phylogeny. Must refer to a valid column name in the data. The id column must exactly match the tip labels in the phylogeny.
A phylogenetic tree object of class phylo
or
multiPhylo
. The tree(s) must be rooted and must include positive
non-zero branch lengths. All trees in multiPhylo
objects must have
the same number of internal nodes and branches.
(optional) A boolean matrix with row and column names exactly matching the variables declared for the model. If not specified, all cross-lagged effects will be estimated in the model. If specified, the model will only estimate cross-lagged effects where cells in the matrix are TRUE and will ignore cross-lagged effects where cells in the matrix are FALSE. In the matrix, columns represent predictor variables and rows represent outcome variables. All autoregressive effects (e.g., X -> X) must be TRUE in the matrix.
(optional) Logical. If FALSE
(default), all
missing values are imputed by the model. If TRUE
, taxa with missing
data are excluded.
(optional) A distance matrix with row and column names exactly matching the tip labels in the phylogeny. If specified, the model will additionally control for spatial location by including a separate Gaussian Process over locations for every coevolving variable in the model.
A string specifying the covariance kernel used for Gaussian
Processes over locations. Currently supported are "exp_quad"
(exponentiated-quadratic kernel; default), "exponential"
(exponential kernel), and "matern32"
(Matern 3/2 kernel).
(optional) A named list of coevolving variables and
their associated columns in the dataset containing standard errors. Only
valid for normally-distributed variables. For example, if we declare
variables = list(x = "normal", y = "normal")
, then we could set
measurement_error = list(x = "x_std_err")
to tell the function to
include measurement error on x
using standard errors from the
x_std_err
column of the dataset.
(optional) A named list of priors for the model. If not
specified, the model uses default priors (see help(coev_fit)
).
Alternatively, the user can specify a named list of priors. The list must
contain non-duplicated entries for any of the following parameters: the
autoregressive effects (A_diag
), the cross effects
(A_offdiag
), the Cholesky factor for the drift matrix (L_R
),
the drift std. dev. parameters (Q_sigma
), the continuous time
intercepts (b
), the ancestral states for the traits
(eta_anc
), the cutpoints for ordinal variables (c
), the
overdispersion parameters for negative binomial variables (phi
),
the shape parameters for gamma variables (shape
), the sigma
parameters for Gaussian Processes over locations (sigma_dist
), the
rho parameters for Gaussian Processes over locations (rho_dist
), the
residual standard deviations when there are repeated observations
(sigma_residual
), and the Cholesky factor for the residual
correlations when there are repeated observations (L_residual
).
These must be entered with valid prior strings, e.g.
list(A_offdiag = "normal(0, 2)")
. Invalid prior strings will throw
an error when the function internally checks the syntax of resulting Stan
code.
Logical. If TRUE
(default), variables following the
normal
and gamma_log
response distributions are scaled before
fitting the model. Continuous variables following the normal
distribution are standardised (e.g., mean centered and divided by their
standard deviation) and positive real variables following the
gamma_log
distribution are divided by the mean value without
centering. This approach is recommended when using default priors to
improve efficiency and ensure accurate inferences. If FALSE
,
variables are left unscaled for model fitting. In this case, users should
take care to set sensible priors on variables.
Logical. If TRUE
(default), the model
estimates the off-diagonals for the $$Q$$ drift matrix (i.e., correlated
drift). If FALSE
, the off-diagonals for the $$Q$$ drift matrix
are set to zero.
Logical. If TRUE
(default), the model
estimates residual standard deviations and residual correlations when there
are repeated observations for taxa. If FALSE
, residual standard
deviations and residual correlations are not estimated. The latter may be
preferable in cases where repeated observations are sparse (e.g., only some
taxa have only few repeated observations). This argument only applies when
repeated observations are present in the data.
Logical. Set to FALSE
by default. If TRUE
, the
model returns the pointwise log likelihood, which can be used to calculate
WAIC and LOO.
Logical. If FALSE
(default), the model is fitted to
the data and returns a posterior distribution. If TRUE
, the model
samples from the prior only, ignoring the likelihood.
A character string containing the Stan code to fit the dynamic coevolutionary model
For further details, see help(coev_fit)
Ringen, E., Martin, J. S., & Jaeggi, A. (2021). Novel phylogenetic methods
reveal that resource-use intensification drives the evolution of "complex"
societies. EcoEvoRXiv. doi:10.32942/osf.io/wfp95
Sheehan, O., Watts, J., Gray, R. D., Bulbulia, J., Claessens, S., Ringen,
E. J., & Atkinson, Q. D. (2023). Coevolution of religious and political
authority in Austronesian societies. Nature Human Behaviour,
7(1), 38-45. 10.1038/s41562-022-01471-y
# make stan code
stan_code <- coev_make_stancode(
data = authority$data,
variables = list(
political_authority = "ordered_logistic",
religious_authority = "ordered_logistic"
),
id = "language",
tree = authority$phylogeny
)
# print Stan code
cat(stan_code)
#> // Generated with coevolve 0.1.0.9002
#> functions {
#> // Charles Driver's solver for the asymptotic Q matrix
#> matrix ksolve (matrix A, matrix Q) {
#> int d = rows(A);
#> int d2 = (d * d - d) %/% 2;
#> matrix [d + d2, d + d2] O;
#> vector [d + d2] triQ;
#> matrix[d,d] AQ;
#> int z = 0; // z is row of output
#> for (j in 1:d) { // for column reference of solution vector
#> for (i in 1:j) { // and row reference...
#> if (j >= i) { // if i and j denote a covariance parameter
#> int y = 0; // start new output row
#> z += 1; // shift current output row down
#> for (ci in 1:d) { // for columns and
#> for (ri in 1:d) { // rows of solution
#> if (ci >= ri) { // when in upper tri (inc diag)
#> y += 1; // move to next column of output
#> if (i == j) { // if output row is a diag element
#> if (ri == i) O[z, y] = 2 * A[ri, ci];
#> if (ci == i) O[z, y] = 2 * A[ci, ri];
#> }
#> if (i != j) { // if output row is not a diag element
#> //if column matches row, sum both A diags
#> if (y == z) O[z, y] = A[ri, ri] + A[ci, ci];
#> if (y != z) { // otherwise...
#> // if solution element is related to output row...
#> if (ci == ri) { // if solution element is variance
#> // if variance of solution corresponds to row
#> if (ci == i) O[z, y] = A[j, ci];
#> // if variance of solution corresponds to col
#> if (ci == j) O[z, y] = A[i, ci];
#> }
#> //if solution element is a related covariance
#> if (ci != ri && (ri == i || ri == j || ci == i || ci == j )) {
#> // for row 1,2 / 2,1 of output,
#> // if solution row ri 1 (match)
#> // and column ci 3, we need A[2,3]
#> if (ri == i) O[z, y] = A[j, ci];
#> if (ri == j) O[z, y] = A[i, ci];
#> if (ci == i) O[z, y] = A[j, ri];
#> if (ci == j) O[z, y] = A[i, ri];
#> }
#> }
#> }
#> if (is_nan(O[z, y])) O[z, y] = 0;
#> }
#> }
#> }
#> }
#> }
#> }
#> z = 0; // get upper tri of Q
#> for (j in 1:d) {
#> for (i in 1:j) {
#> z += 1;
#> triQ[z] = Q[i, j];
#> }
#> }
#> triQ = -O \ triQ; // get upper tri of asymQ
#> z = 0; // put upper tri of asymQ into matrix
#> for (j in 1:d) {
#> for (i in 1:j) {
#> z += 1;
#> AQ[i, j] = triQ[z];
#> if (i != j) AQ[j, i] = triQ[z];
#> }
#> }
#> return AQ;
#> }
#>
#> // return number of matches of y in vector x
#> int num_matches(vector x, real y) {
#> int n = 0;
#> for (i in 1:rows(x))
#> if (x[i] == y)
#> n += 1;
#> return n;
#> }
#>
#> // return indices in vector x where x == y
#> array[] int which_equal(vector x, real y) {
#> array [num_matches(x, y)] int match_positions;
#> int pos = 1;
#> for (i in 1:rows(x)) {
#> if (x[i] == y) {
#> match_positions[pos] = i;
#> pos += 1;
#> }
#> }
#> return match_positions;
#> }
#> }
#> data{
#> int<lower=1> N_tips; // number of tips
#> int<lower=1> N_tree; // number of trees
#> int<lower=1> N_obs; // number of observations
#> int<lower=2> J; // number of response traits
#> int<lower=1> N_seg; // total number of segments in the trees
#> array[N_tree, N_seg] int<lower=1> node_seq; // index of tree nodes
#> array[N_tree, N_seg] int<lower=0> parent; // index of parent nodes
#> array[N_tree, N_seg] real ts; // time since parent
#> array[N_tree, N_seg] int<lower=0,upper=1> tip; // segment ends in tip
#> array[J,J] int<lower=0,upper=1> effects_mat; // effects matrix
#> int<lower=2> num_effects; // number of effects being estimated
#> matrix[N_obs,J] y; // observed data
#> matrix[N_obs,J] miss; // are data points missing?
#> array[N_obs] int<lower=1> tip_id; // group index between 1 and N_tips
#> int<lower=0,upper=1> prior_only; // should likelihood be ignored?
#> }
#> transformed data{
#> vector[to_int(N_obs - sum(col(miss, 1)))] obs1; // observed data for variable 1
#> vector[to_int(N_obs - sum(col(miss, 2)))] obs2; // observed data for variable 2
#> obs1 = col(y, 1)[which_equal(col(miss, 1), 0)];
#> obs2 = col(y, 2)[which_equal(col(miss, 2), 0)];
#> }
#> parameters{
#> vector<upper=0>[J] A_diag; // autoregressive terms of A
#> vector[num_effects - J] A_offdiag; // cross-lagged terms of A
#> cholesky_factor_corr[J] L_R; // lower-tri choleksy decomp corr mat
#> vector<lower=0>[J] Q_sigma; // std deviation parameters of the Q mat
#> vector[J] b; // SDE intercepts
#> array[N_tree] vector[J] eta_anc; // ancestral states
#> array[N_tree, N_seg - 1] vector[J] z_drift; // stochastic drift
#> array[N_tree] matrix[N_tips, J] terminal_drift; // drift for the tips
#> ordered[3] c1; // cut points for variable 1
#> ordered[3] c2; // cut points for variable 2
#> }
#> transformed parameters{
#> array[N_tree, N_seg] vector[J] eta;
#> matrix[J,J] A = diag_matrix(A_diag); // selection matrix
#> matrix[J,J] Q = diag_matrix(Q_sigma) * (L_R * L_R') * diag_matrix(Q_sigma); // drift matrix
#> matrix[J,J] Q_inf; // asymptotic covariance matrix
#> array[N_tree, N_seg] matrix[J,J] VCV_tips; // vcov matrix for drift
#> array[N_tree,N_tips] vector[J] tdrift; // terminal drift
#> // fill off diagonal of A matrix
#> {
#> int ticker = 1;
#> for (i in 1:J) {
#> for (j in 1:J) {
#> if (i != j) {
#> if (effects_mat[i,j] == 1) {
#> A[i,j] = A_offdiag[ticker];
#> ticker += 1;
#> } else if (effects_mat[i,j] == 0) {
#> A[i,j] = 0;
#> }
#> }
#> }
#> }
#> }
#> // calculate asymptotic covariance
#> Q_inf = ksolve(A, Q);
#> // loop over phylogenetic trees
#> for (t in 1:N_tree) {
#> // setting ancestral states and placeholders
#> eta[t, node_seq[t, 1]] = eta_anc[t];
#> VCV_tips[t, node_seq[t, 1]] = diag_matrix(rep_vector(-99, J));
#> for (i in 2:N_seg) {
#> matrix[J,J] A_delta; // amount of deterministic change
#> matrix[J,J] VCV; // vcov matrix of stochastic change
#> vector[J] drift_seg; // accumulated drift over the segment
#> A_delta = matrix_exp(A * ts[t, i]);
#> VCV = Q_inf - quad_form_sym(Q_inf, A_delta');
#> drift_seg = cholesky_decompose(VCV) * z_drift[t, i-1];
#> // if not a tip, add the drift parameter
#> if (tip[t, i] == 0) {
#> eta[t, node_seq[t, i]] = to_vector(
#> A_delta * eta[t, parent[t, i]] + ((A \ add_diag(A_delta, -1)) * b) + drift_seg
#> );
#> VCV_tips[t, node_seq[t, i]] = diag_matrix(rep_vector(-99, J));
#> }
#> // if is a tip, omit, we'll deal with it in the model block;
#> else {
#> eta[t, node_seq[t, i]] = to_vector(
#> A_delta * eta[t, parent[t, i]] + ((A \ add_diag(A_delta, -1)) * b)
#> );
#> VCV_tips[t, node_seq[t, i]] = VCV;
#> }
#> }
#> for (i in 1:N_tips) {
#> tdrift[t,i] = cholesky_decompose(VCV_tips[t,i]) * to_vector(terminal_drift[t][i,]);
#> }
#> }
#> }
#> model{
#> b ~ std_normal();
#> for (t in 1:N_tree) {
#> eta_anc[t] ~ std_normal();
#> for (i in 1:(N_seg - 1)) z_drift[t, i] ~ std_normal();
#> to_vector(terminal_drift[t]) ~ std_normal();
#> }
#> A_offdiag ~ std_normal();
#> A_diag ~ std_normal();
#> L_R ~ lkj_corr_cholesky(4);
#> Q_sigma ~ std_normal();
#> c1 ~ normal(0, 2);
#> c2 ~ normal(0, 2);
#> if (!prior_only) {
#> for (i in 1:N_obs) {
#> vector[N_tree] lp = rep_vector(0.0, N_tree);
#> for (t in 1:N_tree) {
#> if (miss[i,1] == 0) lp[t] += ordered_logistic_lpmf(to_int(y[i,1]) | eta[t,tip_id[i]][1] + tdrift[t,tip_id[i]][1], c1);
#> if (miss[i,2] == 0) lp[t] += ordered_logistic_lpmf(to_int(y[i,2]) | eta[t,tip_id[i]][2] + tdrift[t,tip_id[i]][2], c2);
#> }
#> target += log_sum_exp(lp);
#> }
#> }
#> }
#> generated quantities{
#> array[N_tree,N_obs,J] real yrep; // predictive checks
#> matrix[J,J] cor_R; // correlated drift
#> cor_R = multiply_lower_tri_self_transpose(L_R);
#> {
#> array[N_tree,N_tips] vector[J] terminal_drift_rep;
#> for (i in 1:N_tips) {
#> for (t in 1:N_tree) {
#> for (j in 1:J) terminal_drift_rep[t,i][j] = normal_rng(0, 1);
#> terminal_drift_rep[t,i] = cholesky_decompose(VCV_tips[t, i]) * terminal_drift_rep[t,i];
#> }
#> }
#> for (i in 1:N_obs) {
#> for (t in 1:N_tree) {
#> yrep[t,i,1] = ordered_logistic_rng(eta[t,tip_id[i]][1] + terminal_drift_rep[t,tip_id[i]][1], c1);
#> yrep[t,i,2] = ordered_logistic_rng(eta[t,tip_id[i]][2] + terminal_drift_rep[t,tip_id[i]][2], c2);
#> }
#> }
#> }
#> }