
Map trait evolution on a time-calibrated phylogeny
Source:R/prepare_trait_data.R
prepare_trait_data.RdMap trait evolution on a time-calibrated phylogeny in several steps:
Step 1: Fit evolutionary models to trait data using Maximum Likelihood.
Step 2: Select the best fitting model comparing AICc.
Step 3: Infer ancestral characters estimates (ACE) at nodes.
Step 4: Run stochastic mapping simulations to generate evolutionary histories compatible with the best model and inferred ACE. (Only for categorical and biogeographic data)
Step 5: Infer ancestral states along branches.
For continuous traits: use interpolation to produce a
contMap.For categorical and biogeographic data: compute posterior frequencies of each state/range to produce a
densityMapfor each state/range.
Usage
prepare_trait_data(
tip_data,
trait_data_type,
phylo,
seed = NULL,
evolutionary_models = NULL,
Q_matrix = NULL,
BioGeoBEARS_directory_path = "./BioGeoBEARS_directory/",
keep_BioGeoBEARS_files = TRUE,
prefix_for_files = NULL,
nb_cores = 1,
max_range_size = 2,
split_multi_area_ranges = FALSE,
...,
res = 100,
nb_simulations = 1000,
color_scale = NULL,
colors_per_levels = NULL,
plot_map = TRUE,
plot_overlay = TRUE,
add_ACE_pies = TRUE,
PDF_file_path = NULL,
return_ace = TRUE,
return_BSM = FALSE,
return_simmaps = FALSE,
return_best_model_fit = FALSE,
return_model_selection_df = FALSE,
verbose = TRUE
)Arguments
- tip_data
Named numerical or character string vector of trait values/states/ranges at tips. Names should be ordered as the tip labels in the phylogeny found in
phylo$tip.label. For biogeographic data, ranges should follow the coding scheme of BioGeoBEARS with a unique CAPITAL letter per unique areas (ex: A, B), combined to form multi-area ranges (Ex: AB). Alternatively, you can provide tip_data as a matrix or data.frame of binary presence/absence in each area (coded as unique CAPITAL letter). In this case, columns are unique areas, rows are taxa, and values are integer (0/1) signaling absence or presence of the taxa in the area.- trait_data_type
Character string. Type of trait data. Either: "continuous", "categorical" or "biogeographic".
- phylo
Time-calibrated phylogeny. Object of class
"phylo"as defined in ape. Tip labels (phylo$tip.label) should match names intip_data.- seed
Integer. Set the seed to ensure reproducibility. Default is
NULL(a random seed is used).- evolutionary_models
(Vector of) character string(s). To provide the set of evolutionary models to fit on the data.
Models available for continuous data are detailed in
geiger::fitContinuous(). Default is"BM".Models available for categorical data are detailed in
geiger::fitDiscrete(). Default is"ARD".Models for biogeographic data are fit with R package
BioGeoBEARSusingBioGeoBEARS::bears_optim_run(). Default is"DEC".See list in "Details" section.
- Q_matrix
Custom Q-matrix for categorical data representing transition classes between states. Transitions with similar integers are estimated with a shared rate parameter. Transitions with
0represent rates that are fixed to zero (i.e., impossible transitions). Diagonal must be populated withNA.row.names(Q_matrix)andcol.names(Q_matrix)are the states. Provide"matrix"among the model listed in 'evolutionary_models' to use the custom Q-matrix for modeling. Only for categorical data.- BioGeoBEARS_directory_path
Character string. The path to the directory used to store input/output files generated for/by BioGeoBEARS during biogeographic historical inferences. Use '/' to separate directory and sub-directories. It must end with '/'. Default is
"./BioGeoBEARS_directory/". Only for biogeographic data.- keep_BioGeoBEARS_files
Logical. Whether the
BioGeoBEARS_directoryand its content should be kept after the run. Default =TRUE. Only for biogeographic data.- prefix_for_files
Character string. Prefix to add to all BioGeoBEARS files stored in the
BioGeoBEARS_directory_pathifkeep_BioGeoBEARS_files = TRUE. Files will be exported such as 'prefix_*' with an underscore separating the prefix and the file name. Default isNULL(no prefix is added). Only for biogeographic data.- nb_cores
Interger. Number of cores to use for parallel computation during BioGeoBEARS runs. Default =
1. Only for biogeographic data.- max_range_size
Integer. Maximum number of unique areas encompassed by multi-range areas. Default =
2. Only for biogeographic data.- split_multi_area_ranges
Logical. Whether to split multi-area ranges across unique areas when mapping ranges. Ex: For range EW, posterior probabilities will be split equally between Eastern Palearctic (E) and Western Palearctic (W). Default =
FALSE. Only for biogeographic data.- ...
Additional arguments to be passed down to the functions used to fit models (See
evolutionary_models) and produce simmaps withphytools::make.simmap()orBioGeoBEARS::runBSM().- res
Integer. Define the number of time steps used to interpolate/estimate trait value/state/range in
contMap/densityMaps. Default =100.- nb_simulations
Integer. Define the number of simulations generated for stochastic mapping. Default =
1000. Only for "categorical" and "biogeographic" data.- color_scale
Vector of character string. List of colors to use to build the color scale with
grDevices::colorRampPalette()showing the evolution of a continuous trait on thecontMap. From lowest values to highest values. Only for continuous data. Default =NULLwill use the rainbow() color palette.- colors_per_levels
Named character string. To set the colors to use to map each state/range posterior probabilities. Names = states/ranges; values = colors. If
NULL(default), therainbow()color scale will be used for categorical trait; theBioGeoBEARS::get_colors_for_states_list_0based()will be used for biogeographic ranges. If no color is provided for multi-area ranges, they will be interpolated based on the colors provided for unique areas. Only for categorical and biogeographic data.- plot_map
Logical. Whether to plot or not the phylogeny with mapped trait evolution. Default =
TRUE.- plot_overlay
Logical. If
TRUE(default), plot a uniquedensityMapwith overlapping states/ranges using transparency. IfFALSE, plot adensityMapper state/range. Only for "categorical" and "biogeographic" data.- add_ACE_pies
Logical. Whether to add pies of posterior probabilities of states/ranges at internal nodes on the mapped phylogeny. Default =
TRUE. Only for categorical and biogeographic data.- PDF_file_path
Character string. If provided, the plot will be saved in a PDF file following the path provided here. The path must end with ".pdf".
- return_ace
Logical. Whether the named vector of ancestral characters estimates (ACE) at internal nodes should be returned in the output. Default =
TRUE.- return_BSM
Logical. (Only for Biogeographic data) Whether the summary tables of anagenetic and cladogenetic events generated during the Biogeographic Stochastic Mapping (BSM) process should be returned in the output. Default =
FALSE.- return_simmaps
Logical. Whether the evolutionary histories simulated during stochastic mapping (i.e.,
simmaps) should be returned in the output. Default =TRUE. Only for "categorical" and "biogeographic" data.- return_best_model_fit
Logical. Whether to include the output of the best fitting model in the function output. Default =
FALSE.- return_model_selection_df
Logical. Whether to include the data.frame summarizing model comparisons used to select the best fitting model should be returned in the output. Default =
FALSE.- verbose
Logical. Should progression be displayed? A message will be printed for every steps in the process. Default is
TRUE.
Value
The function returns a list with at least two elements.
$contMap(For "continuous" data) Object of class"contMap", typically generated withphytools::contMap(), that contains a phylogenetic tree and associated continuous trait mapping.$densityMaps(For "categorical" and "biogeographic" data) List of objects of class"densityMap, typically generated withphytools::densityMap(), that contains a phylogenetic tree and associated mapping of probability to harbor a given state/range along branches. The list contains one"densityMapper state/range found in thetip_data.$densityMaps_all_ranges(For "biogeographic" data only, ifsplit_multi_area_ranges = TRUE) Same as$densityMaps, but for all ranges including the multi-areas ranges (e.g., AB) while$densityMapswill display posterior probabilities for unique areas only (e.g., A and B), with multi-areas ranges split across the unique areas they encompass.$trait_data_typeCharacter string. Record the type of trait data. Either: "continuous", "categorical" or "biogeographic".
If return_ace = TRUE,
$aceFor continuous traits: Named vector that record the ancestral characters estimates (ACE) at internal nodes. For categorical and biogeographic data: Matrix that record the posterior probabilities of ancestral states/ranges (characters) estimates (ACE) at internal nodes. Rows are internal nodes. Columns are states/ranges. Values are posterior probabilities of each state per node.$ace_all_rangesFor biogeographic data, ifsplit_multi_area_ranges = TRUE: Named vector that record the ancestral characters estimates (ACE) at internal nodes, but including all ranges observed in the simmaps, including the multi-areas ranges (e.g., AB), while$acewill display posterior probabilities for unique areas only (e.g., A and B), with multi-areas ranges split across the unique areas they encompass.
If return_BSM = TRUE, (Only for biogeographic data)
$BSM_outputList of two lists that contains summary information of cladogenetic ($RES_caldo_events_tables) and anagenetic ($RES_ana_events_tables) events recording across the N simulations of biogeographic histories performed during Biogeographic Stochastic Mapping (BSM). Each element of the list is a data.frame recording events occurring during one simulation.
If return_simmaps = TRUE, (Only for categorical and biogeographic data)
$simmapsList that contains as many objects of class"simmap"thatnb_simulationswere requested. Each simmap object is a phylogeny with one simulated geographic history (i.e., transitions in geographic ranges) mapped along branches.
If return_best_model_fit = TRUE,
$best_model_fitList that provides the output of the best fitting model.
If model_selection_df = TRUE,
$model_selection_dfData.frame that summarizes model comparisons used to select the best fitting model.
For biogeographic data, the function also produce input and output files associated with BioGeoBEARS and stored in the directory
specified in BioGeoBEARS_directory_path. The directory and its content are kept if keep_BioGeoBEARS_files = TRUE
Details
Map trait evolution on a time-calibrated phylogeny in several steps:
Step 1: Models are fit using Maximum Likelihood approach:
For "continuous" data models are fit with
geiger::fitContinuous(): "BM", "OU", "EB", "rate_trend", "lambda", "kappa", "delta". Default is"BM".For "categorical" data models are fit with
geiger::fitDiscrete(): "ER", "SYM", "ARD", "meristic", "matrix". Default is"ARD".For "biogeographic" data models are fit with R package
BioGeoBEARS: "BAYAREALIKE", "DIVALIKE", "DEC", "BAYAREALIKE+J", "DIVALIKE+J", "DEC+J". Default is"DEC".
Step 2: Best model is identified among the list of evolutionary_models by comparing the corrected AIC (AICc)
and selecting the model with lowest AICc.
Step 3: For continuous traits: Ancestral characters estimates (ACE) are inferred with phytools::fastAnc on a tree
with modified branch lengths scaled to reflect the evolutionary rates estimated from the best model using phytools::rescale().
Step 4: Stochastic Mapping.
For categorical and biogeographic data, stochastic mapping simulations are performed to generate evolutionary histories compatible with the best model and inferred ACE. Node states/ranges are drawn from the scaled marginal likelihoods of ACE, and states/ranges shifts along branches are simulated according to the transition matrix Q estimated from the best fitting model.
Step 5: Infer ancestral states along branches.
For continuous traits: ancestral trait values along branches are interpolated with
phytools::contMap(). This provides quick estimates of trait value at any point in time, but it does not provide accurate ML estimates in case of models that are time or trait-value dependent (such as "EB" or "OU") as the interpolation used to built the contMap is assuming a constant rate along each branch. However, ancestral trait values at nodes remain accurateFor categorical and biogeographic data: compute posterior frequencies of each state/range among the simulated evolutionary histories (
simmaps) to produce adensityMapfor each state/range that reflects the changes along branches in probability of harboring a given state/range.
Note on macroevolutionary models of trait evolution
This function provides an easy solution to map trait evolution on a time-calibrated phylogeny
and obtain the contMap/densityMaps objects needed to run the deepSTRAPP workflow (run_deepSTRAPP_for_focal_time, run_deepSTRAPP_over_time).
However, it does not explore the most complex options for trait evolution. You may need to explore more complex models to capture the dynamics of trait evolution.
such as trait-dependent multi-rate models (phytools::brownie.lite(), OUwie::OUwie), Bayesian MCMC implementations allowing a thorough exploration
of location and number of regime shifts (Ex: BayesTraits, RevBayes), or RRphylo for a penalized phylogenetic ridge regression approach that allows regime shifts across all branches.
Note on macroevolutionary models of biogeographic history
This function provides an easy solution to infer ancestral geographic ranges using the BioGeoBEARS framework. It allows to directly compare model fits across 6 models: DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J. It uses a step-wise approach by using MLE estimates of previous runs as staring parameter values when increasing complexity (adding +J parameters). However, it does not explore the most complex options for historical biogeography. You may need to explore more complex models to capture the dynamics of range evolution such as time-stratification with adjacency matrices, dispersal multipliers (+W), distance-based dispersal probabilities (+X), or other features. See for instance, http://phylo.wikidot.com/biogeobears).
References
For macroevolutionary models in geiger: Pennell, M. W., Eastman, J. M., Slater, G. J., Brown, J. W., Uyeda, J. C., FitzJohn, R. G., ... & Harmon, L. J. (2014). geiger v2. 0: an expanded suite of methods for fitting macroevolutionary models to phylogenetic trees. Bioinformatics, 30(15), 2216-2218.. https://doi.org/10.1093/bioinformatics/btu181.
For BioGeoBEARS: Matzke, Nicholas J. (2018). BioGeoBEARS: BioGeography with Bayesian (and likelihood) Evolutionary Analysis with R Scripts. version 1.1.1, published on GitHub on November 6, 2018. DOI: http://dx.doi.org/10.5281/zenodo.1478250. Website: http://phylo.wikidot.com/biogeobears.
See also
geiger::fitContinuous() geiger::fitDiscrete() phytools::contMap() phytools::densityMap()
For a guided tutorial, see the associated vignettes:
For continuous trait data:
vignette("model_continuous_trait_evolution", package = "deepSTRAPP")For categorical trait data:
vignette("model_categorical_trait_evolution", package = "deepSTRAPP")For biogeographic range data:
vignette("model_biogeographic_range_evolution", package = "deepSTRAPP")