Map trait evolution on a time-calibrated phylogeny

Map trait evolution on a time-calibrated phylogeny in several steps:

Step 1: Fit evolutionary models to trait data using Maximum Likelihood.
Step 2: Select the best fitting model comparing AICc.
Step 3: Infer ancestral characters estimates (ACE) at nodes.
Step 4: Run stochastic mapping simulations to generate evolutionary histories compatible with the best model and inferred ACE. (Only for categorical and biogeographic data)
Step 5: Infer ancestral states along branches.
- For continuous traits: use interpolation to produce a contMap.
- For categorical and biogeographic data: compute posterior frequencies of each state/range to produce a densityMap for each state/range.

Usage

prepare_trait_data(
  tip_data,
  trait_data_type,
  phylo,
  seed = NULL,
  evolutionary_models = NULL,
  Q_matrix = NULL,
  BioGeoBEARS_directory_path = "./BioGeoBEARS_directory/",
  keep_BioGeoBEARS_files = TRUE,
  prefix_for_files = NULL,
  nb_cores = 1,
  max_range_size = 2,
  split_multi_area_ranges = FALSE,
  ...,
  res = 100,
  nb_simulations = 1000,
  color_scale = NULL,
  colors_per_levels = NULL,
  plot_map = TRUE,
  plot_overlay = TRUE,
  add_ACE_pies = TRUE,
  PDF_file_path = NULL,
  return_ace = TRUE,
  return_BSM = FALSE,
  return_simmaps = FALSE,
  return_best_model_fit = FALSE,
  return_model_selection_df = FALSE,
  verbose = TRUE
)

Arguments

tip_data

Named numerical or character string vector of trait values/states/ranges at tips. Names should be ordered as the tip labels in the phylogeny found in phylo$tip.label. For biogeographic data, ranges should follow the coding scheme of BioGeoBEARS with a unique CAPITAL letter per unique areas (ex: A, B), combined to form multi-area ranges (Ex: AB). Alternatively, you can provide tip_data as a matrix or data.frame of binary presence/absence in each area (coded as unique CAPITAL letter). In this case, columns are unique areas, rows are taxa, and values are integer (0/1) signaling absence or presence of the taxa in the area.

trait_data_type

Character string. Type of trait data. Either: "continuous", "categorical" or "biogeographic".

phylo

Time-calibrated phylogeny. Object of class "phylo" as defined in ape. Tip labels (phylo$tip.label) should match names in tip_data.

seed

Integer. Set the seed to ensure reproducibility. Default is NULL (a random seed is used).

evolutionary_models

(Vector of) character string(s). To provide the set of evolutionary models to fit on the data.

Models available for continuous data are detailed in geiger::fitContinuous(). Default is "BM".
Models available for categorical data are detailed in geiger::fitDiscrete(). Default is "ARD".
Models for biogeographic data are fit with R package BioGeoBEARS using BioGeoBEARS::bears_optim_run(). Default is "DEC".
See list in "Details" section.

Q_matrix

Custom Q-matrix for categorical data representing transition classes between states. Transitions with similar integers are estimated with a shared rate parameter. Transitions with 0 represent rates that are fixed to zero (i.e., impossible transitions). Diagonal must be populated with NA. row.names(Q_matrix) and col.names(Q_matrix) are the states. Provide "matrix" among the model listed in 'evolutionary_models' to use the custom Q-matrix for modeling. Only for categorical data.

BioGeoBEARS_directory_path

Character string. The path to the directory used to store input/output files generated for/by BioGeoBEARS during biogeographic historical inferences. Use '/' to separate directory and sub-directories. It must end with '/'. Default is "./BioGeoBEARS_directory/". Only for biogeographic data.

keep_BioGeoBEARS_files

Logical. Whether the BioGeoBEARS_directory and its content should be kept after the run. Default = TRUE. Only for biogeographic data.

prefix_for_files

Character string. Prefix to add to all BioGeoBEARS files stored in the BioGeoBEARS_directory_path if keep_BioGeoBEARS_files = TRUE. Files will be exported such as 'prefix_*' with an underscore separating the prefix and the file name. Default is NULL (no prefix is added). Only for biogeographic data.

nb_cores

Interger. Number of cores to use for parallel computation during BioGeoBEARS runs. Default = 1. Only for biogeographic data.

max_range_size

Integer. Maximum number of unique areas encompassed by multi-range areas. Default = 2. Only for biogeographic data.

split_multi_area_ranges

Logical. Whether to split multi-area ranges across unique areas when mapping ranges. Ex: For range EW, posterior probabilities will be split equally between Eastern Palearctic (E) and Western Palearctic (W). Default = FALSE. Only for biogeographic data.

...

Additional arguments to be passed down to the functions used to fit models (See evolutionary_models) and produce simmaps with phytools::make.simmap() or BioGeoBEARS::runBSM().

res

Integer. Define the number of time steps used to interpolate/estimate trait value/state/range in contMap/densityMaps. Default = 100.

nb_simulations

Integer. Define the number of simulations generated for stochastic mapping. Default = 1000. Only for "categorical" and "biogeographic" data.

color_scale

Vector of character string. List of colors to use to build the color scale with grDevices::colorRampPalette() showing the evolution of a continuous trait on the contMap. From lowest values to highest values. Only for continuous data. Default = NULL will use the rainbow() color palette.

colors_per_levels

Named character string. To set the colors to use to map each state/range posterior probabilities. Names = states/ranges; values = colors. If NULL (default), the rainbow() color scale will be used for categorical trait; the BioGeoBEARS::get_colors_for_states_list_0based() will be used for biogeographic ranges. If no color is provided for multi-area ranges, they will be interpolated based on the colors provided for unique areas. Only for categorical and biogeographic data.

plot_map

Logical. Whether to plot or not the phylogeny with mapped trait evolution. Default = TRUE.

plot_overlay

Logical. If TRUE (default), plot a unique densityMap with overlapping states/ranges using transparency. If FALSE, plot a densityMap per state/range. Only for "categorical" and "biogeographic" data.

add_ACE_pies

Logical. Whether to add pies of posterior probabilities of states/ranges at internal nodes on the mapped phylogeny. Default = TRUE. Only for categorical and biogeographic data.

PDF_file_path

Character string. If provided, the plot will be saved in a PDF file following the path provided here. The path must end with ".pdf".

return_ace

Logical. Whether the named vector of ancestral characters estimates (ACE) at internal nodes should be returned in the output. Default = TRUE.

return_BSM

Logical. (Only for Biogeographic data) Whether the summary tables of anagenetic and cladogenetic events generated during the Biogeographic Stochastic Mapping (BSM) process should be returned in the output. Default = FALSE.

return_simmaps

Logical. Whether the evolutionary histories simulated during stochastic mapping (i.e., simmaps) should be returned in the output. Default = TRUE. Only for "categorical" and "biogeographic" data.

return_best_model_fit

Logical. Whether to include the output of the best fitting model in the function output. Default = FALSE.

return_model_selection_df

Logical. Whether to include the data.frame summarizing model comparisons used to select the best fitting model should be returned in the output. Default = FALSE.

verbose

Logical. Should progression be displayed? A message will be printed for every steps in the process. Default is TRUE.

Value

The function returns a list with at least two elements.

$contMap (For "continuous" data) Object of class "contMap", typically generated with phytools::contMap(), that contains a phylogenetic tree and associated continuous trait mapping.
$densityMaps (For "categorical" and "biogeographic" data) List of objects of class "densityMap, typically generated with phytools::densityMap(), that contains a phylogenetic tree and associated mapping of probability to harbor a given state/range along branches. The list contains one "densityMap per state/range found in the tip_data.
$densityMaps_all_ranges (For "biogeographic" data only, if split_multi_area_ranges = TRUE) Same as $densityMaps, but for all ranges including the multi-areas ranges (e.g., AB) while $densityMaps will display posterior probabilities for unique areas only (e.g., A and B), with multi-areas ranges split across the unique areas they encompass.
$trait_data_type Character string. Record the type of trait data. Either: "continuous", "categorical" or "biogeographic".

If return_ace = TRUE,

$ace For continuous traits: Named vector that record the ancestral characters estimates (ACE) at internal nodes. For categorical and biogeographic data: Matrix that record the posterior probabilities of ancestral states/ranges (characters) estimates (ACE) at internal nodes. Rows are internal nodes. Columns are states/ranges. Values are posterior probabilities of each state per node.
$ace_all_ranges For biogeographic data, if split_multi_area_ranges = TRUE: Named vector that record the ancestral characters estimates (ACE) at internal nodes, but including all ranges observed in the simmaps, including the multi-areas ranges (e.g., AB), while $ace will display posterior probabilities for unique areas only (e.g., A and B), with multi-areas ranges split across the unique areas they encompass.

If return_BSM = TRUE, (Only for biogeographic data)

$BSM_output List of two lists that contains summary information of cladogenetic ($RES_caldo_events_tables) and anagenetic ($RES_ana_events_tables) events recording across the N simulations of biogeographic histories performed during Biogeographic Stochastic Mapping (BSM). Each element of the list is a data.frame recording events occurring during one simulation.

If return_simmaps = TRUE, (Only for categorical and biogeographic data)

$simmaps List that contains as many objects of class "simmap" that nb_simulations were requested. Each simmap object is a phylogeny with one simulated geographic history (i.e., transitions in geographic ranges) mapped along branches.

If return_best_model_fit = TRUE,

$best_model_fit List that provides the output of the best fitting model.

If model_selection_df = TRUE,

$model_selection_df Data.frame that summarizes model comparisons used to select the best fitting model.

For biogeographic data, the function also produce input and output files associated with BioGeoBEARS and stored in the directory specified in BioGeoBEARS_directory_path. The directory and its content are kept if keep_BioGeoBEARS_files = TRUE

Details

Map trait evolution on a time-calibrated phylogeny in several steps:

Step 1: Models are fit using Maximum Likelihood approach:

For "continuous" data models are fit with geiger::fitContinuous(): "BM", "OU", "EB", "rate_trend", "lambda", "kappa", "delta". Default is "BM".
For "categorical" data models are fit with geiger::fitDiscrete(): "ER", "SYM", "ARD", "meristic", "matrix". Default is "ARD".
For "biogeographic" data models are fit with R package BioGeoBEARS: "BAYAREALIKE", "DIVALIKE", "DEC", "BAYAREALIKE+J", "DIVALIKE+J", "DEC+J". Default is "DEC".

Step 2: Best model is identified among the list of evolutionary_models by comparing the corrected AIC (AICc) and selecting the model with lowest AICc.

Step 3: For continuous traits: Ancestral characters estimates (ACE) are inferred with phytools::fastAnc on a tree with modified branch lengths scaled to reflect the evolutionary rates estimated from the best model using phytools::rescale().

Step 4: Stochastic Mapping.

For categorical and biogeographic data, stochastic mapping simulations are performed to generate evolutionary histories compatible with the best model and inferred ACE. Node states/ranges are drawn from the scaled marginal likelihoods of ACE, and states/ranges shifts along branches are simulated according to the transition matrix Q estimated from the best fitting model.

Step 5: Infer ancestral states along branches.

For continuous traits: ancestral trait values along branches are interpolated with phytools::contMap(). This provides quick estimates of trait value at any point in time, but it does not provide accurate ML estimates in case of models that are time or trait-value dependent (such as "EB" or "OU") as the interpolation used to built the contMap is assuming a constant rate along each branch. However, ancestral trait values at nodes remain accurate
For categorical and biogeographic data: compute posterior frequencies of each state/range among the simulated evolutionary histories (simmaps) to produce a densityMap for each state/range that reflects the changes along branches in probability of harboring a given state/range.

Note on macroevolutionary models of trait evolution

This function provides an easy solution to map trait evolution on a time-calibrated phylogeny and obtain the contMap/densityMaps objects needed to run the deepSTRAPP workflow (run_deepSTRAPP_for_focal_time, run_deepSTRAPP_over_time). However, it does not explore the most complex options for trait evolution. You may need to explore more complex models to capture the dynamics of trait evolution. such as trait-dependent multi-rate models (phytools::brownie.lite(), OUwie::OUwie), Bayesian MCMC implementations allowing a thorough exploration of location and number of regime shifts (Ex: BayesTraits, RevBayes), or RRphylo for a penalized phylogenetic ridge regression approach that allows regime shifts across all branches.

Note on macroevolutionary models of biogeographic history

This function provides an easy solution to infer ancestral geographic ranges using the BioGeoBEARS framework. It allows to directly compare model fits across 6 models: DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J. It uses a step-wise approach by using MLE estimates of previous runs as staring parameter values when increasing complexity (adding +J parameters). However, it does not explore the most complex options for historical biogeography. You may need to explore more complex models to capture the dynamics of range evolution such as time-stratification with adjacency matrices, dispersal multipliers (+W), distance-based dispersal probabilities (+X), or other features. See for instance, http://phylo.wikidot.com/biogeobears).

References

For macroevolutionary models in geiger: Pennell, M. W., Eastman, J. M., Slater, G. J., Brown, J. W., Uyeda, J. C., FitzJohn, R. G., ... & Harmon, L. J. (2014). geiger v2. 0: an expanded suite of methods for fitting macroevolutionary models to phylogenetic trees. Bioinformatics, 30(15), 2216-2218.. https://doi.org/10.1093/bioinformatics/btu181.

For BioGeoBEARS: Matzke, Nicholas J. (2018). BioGeoBEARS: BioGeography with Bayesian (and likelihood) Evolutionary Analysis with R Scripts. version 1.1.1, published on GitHub on November 6, 2018. DOI: http://dx.doi.org/10.5281/zenodo.1478250. Website: http://phylo.wikidot.com/biogeobears.

Author

Maël Doré