Title: | Pre- And Post-Processing in Bayesian Evolutionary Analyses |
---|---|
Description: | Functions are provided for prior specification in divergence time estimation using fossils as well as other kinds of data. It provides tools for interacting with the input and output of Bayesian platforms in evolutionary biology such as 'BEAST2', 'MrBayes', 'RevBayes', or 'MCMCTree'. It Implements a simple measure similarity between probability density functions for comparing prior and posterior Bayesian densities, as well as code for calculating the combination of distributions using conflation of Hill (2008). Functions for estimating the origination time in collections of distributions using the x-intercept (e.g., Draper and Smith, 1998) and stratigraphic intervals (Marshall 2010) are also available. Hill, T. 2008. "Conflations of probability distributions". Transactions of the American Mathematical Society, 363:3351-3372. <doi:10.48550/arXiv.0808.1808>, Draper, N. R. and Smith, H. 1998. "Applied Regression Analysis". 1--706. Wiley Interscience, New York. <DOI:10.1002/9781118625590>, Marshall, C. R. 2010. "Using confidence intervals to quantify the uncertainty in the end-points of stratigraphic ranges". Quantitative Methods in Paleobiology, 291--316. <DOI:10.1017/S1089332600001911>. |
Authors: | Gustavo A. Ballen [aut, cre], Sandra Reinales [aut] |
Maintainer: | Gustavo A. Ballen <[email protected]> |
License: | file LICENSE |
Version: | 1.5.0 |
Built: | 2024-11-13 05:14:40 UTC |
Source: | https://github.com/gaballench/tbea |
A dataset containing point estimates and uncertainty intervals of divergence times for clade pairs east and west of the Andes, compiled by Ballen (2020).
data(andes)
data(andes)
A data frame with three columns:
Estimated age (in Ma) from a given rock sample
Standard deviation of the age estimate
Sample code as in Table 3.2
Ballen, Gustavo A. 2020. Fossil freshwater fishes and the biogeography of northern South America. 2020. PhD thesis, Museu de Zoologia, Universidade de São Paulo, São Paulo, 2020. doi:10.11606/T.38.2020.tde-06052020-181631.
c_truncauchy: Estimate the c parameter for the truncated cauchy L distribution to be used in MCMCTree
c_truncauchy(tl, tr, p = 0.1, pr = 0.975, al = 0.025, output = "par")
c_truncauchy(tl, tr, p = 0.1, pr = 0.975, al = 0.025, output = "par")
tl |
minimum age. |
tr |
maximum age |
p |
constant p involved in Cauchy parameters location and scale. Set to 0.1 by default. It determines how close the mode of the distribution is to the tl min age. |
pr |
percentile to the right of the distribution (0.975 by default) |
al |
alpha to the right of the minimum on x. Set it to zero if a hard minimum is desired, otherwise the random variable can take values below t_L with probability al. Set to 0.025 by default. |
output |
Whether to return just the parameters or all of the optimisation output. Defaults to "par". Leave it blank "" or with different text in order to return all of the optimisation output. |
We solve for c while fixing p=0.1 so that the mode of the distribution is closer to the t_L and then we calculate c so that t_R is at the desired max age. note that ar and al are NOT complements, thus both can be 0.025. Optimisation proceeds by fixing p in t_L(1-p) and then using numerical optimisation to find c in c*t_L.
Either the parameter optimisation value as a numeric vector of length one (when output="par") or the complete optimisation output as a list (otherwise)
Gustavo A. Ballen
testValues.tr <- c(4.93, 12.12, 24.43, 49.20) # the values below should be approx. c = 0.2, 0.5, 1, 2 # according to the paml documentation for (i in testValues.tr) { print(c_truncauchy(tl=1, tr=i, p=0.1, pr=0.975, al=0.025)) }
testValues.tr <- c(4.93, 12.12, 24.43, 49.20) # the values below should be approx. c = 0.2, 0.5, 1, 2 # according to the paml documentation for (i in testValues.tr) { print(c_truncauchy(tl=1, tr=i, p=0.1, pr=0.975, al=0.025)) }
concatNexus: Function for concatenation of nexus matrices both morphological and molecular
concatNexus( matrices = NULL, pattern, path, filename, morpho = FALSE, morphoFilename = NULL, sumFilename )
concatNexus( matrices = NULL, pattern, path, filename, morpho = FALSE, morphoFilename = NULL, sumFilename )
matrices |
A vector of type 'character' with paths to the nexus alignments or their file names. If |
pattern |
A vector of type 'character' and length one containing the text pattern to identify the alignments of interest. It would be tipically be some suffix and/or file extension (see examples). |
path |
A vector of type 'character' and length one pointing to the directory where the matrices are located. It is used in combination with |
filename |
A vector of type 'character' and length one with the file name (or path and file name) for the concatenated output matrix. |
morpho |
A vector of type 'logical' and length one indicating whether a morphological matrix is included in the concatenation. |
morphoFilename |
A vector of type 'character' and length one with the file name or path to the morphological nexus matrix. Needed if |
sumFilename |
A vector of type 'character' and length one with the file name or path to the summary information of partition start and end positions. Useful for specifying concatenated analyses in MrBayes where each partition in the matrix might have its own substitution model. |
This function will concatenate matrices in nexus format (mandatory) and write to the disk the output and summary information on the partitions. It requires that the input matrices all share the same taxa in the same positions.
This function writes to the disk two files, one with the concatenated matrix and one with the summary information on partition positions in the complete matrix.
Gustavo A. Ballen
# Concatenate all the matrices in a given path, # ending with the pattern 'aligned.nex', including a morphological matrix # also defined with a pattern ## Not run: path <- "sequences" pattern <- "aligned.nex$" concatNexus(matrices = NULL, pattern = pattern, filename = paste(path, "concatenatedMolmorph.nexus", sep = "/"), path = path, morpho = TRUE, morphoFilename = paste(path, grep(pattern = "morfologia", x = dir(path, pattern), value = TRUE), sep = "/"), sumFilename = "partitions.txt") ## End(Not run) # Concatenate arbitrary matrices in the working directory, # including a morphological matrix, return a concatenated file in the same dir ## Not run: concatNexus(matrices = c("coi.nex", "rag1.nex", "cytb.nex", "morphology.nex"), filename = "concatenatedMolmorph.nexus", morpho = TRUE, morphoFilename = "morphology.nex", sumFilename = "partitions.txt") ## End(Not run)
# Concatenate all the matrices in a given path, # ending with the pattern 'aligned.nex', including a morphological matrix # also defined with a pattern ## Not run: path <- "sequences" pattern <- "aligned.nex$" concatNexus(matrices = NULL, pattern = pattern, filename = paste(path, "concatenatedMolmorph.nexus", sep = "/"), path = path, morpho = TRUE, morphoFilename = paste(path, grep(pattern = "morfologia", x = dir(path, pattern), value = TRUE), sep = "/"), sumFilename = "partitions.txt") ## End(Not run) # Concatenate arbitrary matrices in the working directory, # including a morphological matrix, return a concatenated file in the same dir ## Not run: concatNexus(matrices = c("coi.nex", "rag1.nex", "cytb.nex", "morphology.nex"), filename = "concatenatedMolmorph.nexus", morpho = TRUE, morphoFilename = "morphology.nex", sumFilename = "partitions.txt") ## End(Not run)
conflate: Calculate the conflation of multiple distributions pdfs, plot = TRUE, from, to, n, add = FALSE
conflate(pdfs, plot = TRUE, from, to, n, add = FALSE)
conflate(pdfs, plot = TRUE, from, to, n, add = FALSE)
pdfs |
A vector of calls to density_fun for defininf each individual distribution. |
plot |
Whether to plot using curve |
from , to , n
|
The appropriate values from and to which to calculate the conflation, and a number of points n. These are the same used by the function curve but are still necessary even if no plot is required. |
add |
Whether to add the curve to an existing plot. |
Produces either a plot or a data frame with the x and y values for the conflated PDF. It uses as input a vector of densities constructed with density_fun, and further parameters to be pased to curve if no plot is desired these are still used for returning a data frame with the x and y values from evaluation the conflated PDF on the sequence of x values determined by a number n of equidistant points between from and to.
A tree of class phylo with summary branch lengths in tree$edge.length.
conflated_normals <- conflate(c("density_fun(x, 'dnorm', mean=0, sd=1)", "density_fun(x, 'dnorm', mean=3, sd=1)"), from=-4, to=4, n=101, plot=FALSE) plot(conflated_normals)
conflated_normals <- conflate(c("density_fun(x, 'dnorm', mean=0, sd=1)", "density_fun(x, 'dnorm', mean=3, sd=1)"), from=-4, to=4, n=101, plot=FALSE) plot(conflated_normals)
crossplot: Plot the median and HPD interval bars for pairs of distribution
crossplot( log1Path, log2Path, skip.char = "#", pattern = NULL, idx.cols = NULL, bar.lty, bar.lwd, identity.lty, identity.lwd, extra.space = 0.5, ... )
crossplot( log1Path, log2Path, skip.char = "#", pattern = NULL, idx.cols = NULL, bar.lty, bar.lwd, identity.lty, identity.lwd, extra.space = 0.5, ... )
log1Path |
character vector of length 1. Path to the first log file. |
log2Path |
character vector of length 1. Path to the second log file. |
skip.char |
character vector of length 1, with '#' as default value. Which symbol is used as a comment. This will allow to ignore lines which start with the symbol when reading data. |
pattern |
character vector of length 1. the pattern for subsetting the columns containing the data to be plotted. |
idx.cols |
either an integer vector with the position of the columns to pick, or a character vector with the column names to pick. Defaults to NULL. |
bar.lty |
The line type to be used as error bars. |
bar.lwd |
As above but the width |
identity.lty |
The line type to be used in the identity y = x line |
identity.lwd |
As above but the width. |
extra.space |
numeric vector of length 1. How much space to be allowed in both xlim and ylim depending on the smallest value in highest density intervals plus or minus extra.space. A value of 0.5 units on the dimension of interest is used by default. |
... |
Optional arguments to be passed to 'plot' such as 'main', 'xlab', 'ylab', 'pch' and 'cex'. |
The function produces a crossplot, which is a scatterplot where we are comparing two distributions associated to each point by means of the medians as the points, and the highest density intervals as bars around the point. For instance, x may represent the prior of a set of parameters while y represents the posterior. Error bars on the x axis then are highest density intervals from the prior, and those on the y axis represent the interval for the posterior.
This function can also be used to compare two independent runs for (visual) convergence: If they are sampling the same posterior distribution, then they should fall on the identity y=x line.
This function returns nothing, it plots to the graphical device.
Gustavo A. Ballen
## Not run: crossplot(log1Path="log1.tsv", log2Path="log2.tsv", skip.char="#", pattern="par", cols=NULL, bar.lty=1, bar.lwd=1, identity.lty=2, identity.lwd=1, extra.space=0.5, main="My plot", xlab="log 1 (prior)", ylab="log 2 (posterior)", pch=19) ## End(Not run)
## Not run: crossplot(log1Path="log1.tsv", log2Path="log2.tsv", skip.char="#", pattern="par", cols=NULL, bar.lty=1, bar.lwd=1, identity.lty=2, identity.lwd=1, extra.space=0.5, main="My plot", xlab="log 1 (prior)", ylab="log 2 (posterior)", pch=19) ## End(Not run)
density_fun: A way to represent distributions to be conflated
density_fun(x, dist, ...)
density_fun(x, dist, ...)
x |
a symbol which needs to be present in order to allow passing values towards the distruibution generator. It should be just x, without quotation marks (see examples). |
dist |
A character (using single quotes!) with the name of the distribution to use. |
... |
Parameters to be passed on to dist. See examples |
Produces a definition of each individual distribution to be conflated and provides the symbol for non-standard evaluation (x). Single quotes in dist are mandatory in order to avoid issues when calling expressions under the hood. See the documentation for each individual distribution to call their parameters adequately. Argument names and values should be used.
A call from the elements of the distribution to be used.
c("density_fun(x, 'dnorm', mean=0, sd=1)", "density_fun(x, 'dnorm', mean=-1, sd=1)", "density_fun(x, 'dnorm', mean=1, sd=1)")
c("density_fun(x, 'dnorm', mean=0, sd=1)", "density_fun(x, 'dnorm', mean=-1, sd=1)", "density_fun(x, 'dnorm', mean=1, sd=1)")
fasta2nexus (deprecated): Function for converting molecular alignments from fasta to nexus format
fasta2nexus(...)
fasta2nexus(...)
... |
A placeholder for any argument that used to be in this function. |
This function was deprecated from tbea v1.0.0 onwards. For using this function please install tbea v0.5.0.
This function returns an error because it has been deprecated.
Gustavo A. Ballen
Function for estimation of probability density function parameters through quadratic optimization
findParams(q, p, output = "complete", pdfunction, params, initVals = NULL)
findParams(q, p, output = "complete", pdfunction, params, initVals = NULL)
q |
A numeric vector of observed quantiles, might come from a HPD from a previous study (along with a median), or from other sources of prior information. See Details. |
p |
A numeric vector of percentiles. |
output |
One of two possible values: |
pdfunction |
A character vector (of length one) with the name of the PDF function of interest. Technically this argument supports any PDF function of the form pDIST (e.g., |
params |
A character vector with the name of the parameter(s) to optimize in the probability density function. These should match the parameter names of the respective PDF function, e.g., |
initVals |
A numeric vector with default value |
This function comes handy whenever we have some values of uncertainty, (e.g., confidence intervals, HPDs, biostratigraphic age constrains) and want to express it in the form of a probability density function of the form . As we have some values (the quantiles) already and their corresponding percentiles, all we need is a way to approximate the parameters
that produce the same combination of quantiles for the given percentiles under a given PDF. This is carried out through optimization of a quadratic error function. This is accomplished through the function
optim
. For instance, if the estimated age of a fossil is Lutetian, in the Eocene (41.2 to 47.8 Ma), and we want to model such uncertainty through a normal distribution, we could assume that these age boundaries are the quantiles for percentiles 0.025 and 0.975 respectively, and add a thir pair with the midpoint corresponding to the percentile 0.5. This is all the information needed in order to estimate the parameters mean
and sd
in the functiono pnorm
.
Either a list with the complete output of convergence, squared errors and parameter values, or just a vector of parameter values. Depends on the value of output
.
Warnings may be triggered by the function optim
since the optimization is a heuristic process, whenever a given iteration results in an invalid value for a given combination of parameters, the optim
function tries another combination of values but inform the user about the problem through a warning. In general these can be safely disregarded.
Main code by Gustavo A. Ballen with important contributions in expression call structure and vectorized design by Klaus Schliep ([email protected]).
# Find the best parameters for a standard normal density that fit the observed quantiles # -1.644854, 0, and 1.644854, providing full output for the calculations in the form of # a list findParams(q = c(-1.959964, 0.000000, 1.959964), p = c(0.025, 0.50, 0.975), output = "complete", pdfunction = "pnorm", params = c("mean", "sd")) # Given that we have prior on the age of a fossil to be 1 - 10 Ma and that we want to # model it with a lognormal distribution, fin the parameters of the PDF that best reflect # the uncertainty in question (i.e., the parameters for which the observed quantiles are # 1, 5.5, and 10, assuming that we want the midpoint to reflect the mean of the PDF. findParams(q = c(1, 5.5, 10), p = c(0.025, 0.50, 0.975), output = "complete", pdfunction = "plnorm", params = c("meanlog", "sdlog"))
# Find the best parameters for a standard normal density that fit the observed quantiles # -1.644854, 0, and 1.644854, providing full output for the calculations in the form of # a list findParams(q = c(-1.959964, 0.000000, 1.959964), p = c(0.025, 0.50, 0.975), output = "complete", pdfunction = "pnorm", params = c("mean", "sd")) # Given that we have prior on the age of a fossil to be 1 - 10 Ma and that we want to # model it with a lognormal distribution, fin the parameters of the PDF that best reflect # the uncertainty in question (i.e., the parameters for which the observed quantiles are # 1, 5.5, and 10, assuming that we want the midpoint to reflect the mean of the PDF. findParams(q = c(1, 5.5, 10), p = c(0.025, 0.50, 0.975), output = "complete", pdfunction = "plnorm", params = c("meanlog", "sdlog"))
A dataset containing geochronology data from several samples along the stratigraphic column of the Honda and Huila groups in the Tatacoa Desert area. The dataset was compiled from the Table 3.2 in Flynn et al. (1997).
data(laventa)
data(laventa)
A data frame with 87 rows and 7 variables:
Estimated age (in Ma) from a given rock sample
Standard deviation of the age estimate
Sample code as in Table 3.2
Stratigraphic unit in either the Honda Group or the Huila Group
Position in the stratigraphic column, in meters
The mineral used for dating the sample
Comments from footnotes in the original table
Flynn, J.J., Guerrero, J. & Swisher III, C.C. (1997) Geochronology of the Honda Group. In: R. F. Kay, R. H. Madden, R. L. Cifelli, and J. J. Flynn (Eds), Vertebrate Paleontology in the Neotropics: the Miocene Fauna of La Venta, Colombia. Smithsonian Institution Press, pp. 44–60.
Constructing a curve for the user-specified lognormal prior using Beast2 parameters
lognormalBeast( M, S, meanInRealSpace = TRUE, offset = 0, from = NULL, to = NULL, by = 0.05 )
lognormalBeast( M, S, meanInRealSpace = TRUE, offset = 0, from = NULL, to = NULL, by = 0.05 )
M |
Mean of the lognormal density in Beast2. |
S |
Standard deviation of the lognormal density in Beast2. |
meanInRealSpace |
Whether to plot the mean on the real- or log-space (i.e., apply log(M) before plotting). Please see under details. |
offset |
Hard lower bound. |
from , to , by
|
Starting and ending point to calculate considering the offset as zero. That is, from will affect produce a starting point of (offset + from) and an ending point of (offset + to). By sets the step size of the sequence from 'from' to 'to' each 'by' steps. |
This function creates a matrix of x,y values given parameters of a lognormal density as specified in the program Beast2. It's main purpose is for plotting but other uses such as similarity quantification are available. Please note that the value of mean depends on whether we expect it to be in real or log space. Please refer to Heath (2015) for more info: Heath, T. A. (2015). Divergence Time Estimation using BEAST v2.
A matrix of two columns consisting of the x and y values of the lognormal density.
# Generate a matrix for the lognormal density with mean 1 and standard deviation 1, with mean # in real space, and spanning values in x from 0 to 10 lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, from = 0, to = 10) # The same as above but with an offset of 10, that is, the curve starts at 10 as if it was 0 # to values will start in (offset + from) and finish in (offset + to) lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, offset = 10, from = 0, to = 10)
# Generate a matrix for the lognormal density with mean 1 and standard deviation 1, with mean # in real space, and spanning values in x from 0 to 10 lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, from = 0, to = 10) # The same as above but with an offset of 10, that is, the curve starts at 10 as if it was 0 # to values will start in (offset + from) and finish in (offset + to) lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, offset = 10, from = 0, to = 10)
Calculate the Intersection Between Two Densities
measureSimil( d1, d2, splits = 500, rawData = c(TRUE, TRUE), plot = TRUE, x_limit = "auto", colors = c("red", "blue", "gray"), ... )
measureSimil( d1, d2, splits = 500, rawData = c(TRUE, TRUE), plot = TRUE, x_limit = "auto", colors = c("red", "blue", "gray"), ... )
d1 , d2
|
Either two vectors of empirical (i.e., MCMC-produced) values OR a |
splits |
A numerical argument controling the number of subdivisions of the intersection area for numerical integration |
rawData |
Are d1 and/or d2 raw data for which a density should be calculated? A vector of length two containing logical values indicating whenther any of the arguments d1 or d2 are raw data or whether the user is inputing already calculated densities (e.g., the output from the density, curve, or dDIST functions, or any two-dimension object with x and y values) |
plot |
Should a plot be produced? |
x_limit |
Whether to define the xlim form the min-max of the combined density x-values |
colors |
A vector of three colors, namely, color of the |
... |
Further arguments to pass to the graphical functions such as |
Similarity is measured as the overlapping portion between two densities. It has a value between 0 and 1. The values of the vector rawData determine the behavior of the function and therefore attention must be paid to their consistence with the nature of arguments d1 and d2. Despite the function was designed in order to allow to quantify similarity between the posterior and the prior, this can be used to quantify any overlap between two given densities and for any other purpose.
A numeric vector with the value of the intersection between two densities. As a side effect, a plot is produced to an active (or new) graphical device.
# Set seed and colors to use in plots in the order: Prior, posterior, and intersection set.seed(1985) colors <- c("red", "blue", "lightgray") # Similarity between two identical distributions below <- measureSimil(d1 = rnorm(1000000, mean = 0, 1), d2 = rnorm(1000000, mean = 0, 1), main = "Comp. similarity", colors = colors) legend(x = "topright", legend = round(below, digits = 2)) # Similarity in two distributions partially overlapping below <- measureSimil(d1 = rnorm(1000000, mean = 3, 1), d2 = rnorm(1000000, mean = 0, 1), main = "Partial similarity", colors = colors) legend(x = "topright", legend = round(below, digits = 2)) # Similarity in two completely-different distributions below <- measureSimil(d1 = rnorm(1000000, mean = 8, 1), d2 = rnorm(1000000, mean = 0, 1), main = "Comp. dissimilarity", colors = colors) legend(x = "topright", legend = round(below, digits = 2)) # Don't plot, just return the intersection measureSimil(d1 = rnorm(1000000, mean = 3, 1), d2 = rnorm(1000000, mean = 0, 1), plot = FALSE)
# Set seed and colors to use in plots in the order: Prior, posterior, and intersection set.seed(1985) colors <- c("red", "blue", "lightgray") # Similarity between two identical distributions below <- measureSimil(d1 = rnorm(1000000, mean = 0, 1), d2 = rnorm(1000000, mean = 0, 1), main = "Comp. similarity", colors = colors) legend(x = "topright", legend = round(below, digits = 2)) # Similarity in two distributions partially overlapping below <- measureSimil(d1 = rnorm(1000000, mean = 3, 1), d2 = rnorm(1000000, mean = 0, 1), main = "Partial similarity", colors = colors) legend(x = "topright", legend = round(below, digits = 2)) # Similarity in two completely-different distributions below <- measureSimil(d1 = rnorm(1000000, mean = 8, 1), d2 = rnorm(1000000, mean = 0, 1), main = "Comp. dissimilarity", colors = colors) legend(x = "topright", legend = round(below, digits = 2)) # Don't plot, just return the intersection measureSimil(d1 = rnorm(1000000, mean = 3, 1), d2 = rnorm(1000000, mean = 0, 1), plot = FALSE)
Reduced chi-square test or mean square weighted deviation (mswd) test
mswd.test(age, sd)
mswd.test(age, sd)
age |
A vector of age radiometric age estimates |
sd |
A vector of the standard deviation corresponding to each element in |
From Ludwig (2003:646): "By convention, probabilities of fit greater than 0.05 are generally considered as arguably satisfying the mathematical assumptions of an isochron, while lower probabilities are generally taken as indicating the presence of “geological” scatter, and hence a significant possibility of bias in the isochron age.". The null hypothesis is that the isochron conditions hold.
A numeric vector of length one with the p-value corresponding to the test.
data(laventa) # Do the age estimates for the boundaries of the Honda Group (i.e., samples at meters 56.4 # and 675.0) conform to the isochron hypothesis? hondaIndex <- which(laventa$elevation == 56.4 | laventa$elevation == 675.0) mswd.test(age = laventa$age[hondaIndex], sd = laventa$one_sigma[hondaIndex]) # The p-value is smaller than the nominal alpha of 0.05, so we can reject the null # hypothesis of isochron conditions # Do the age estimates for the samples JG-R 88-2 and JG-R 89-2 conform to the isochron hypothesis? twoLevelsIndex <- which(laventa$sample == "JG-R 89-2" | laventa$sample == "JG-R 88-2") dataset <- laventa[twoLevelsIndex, ] # Remove the values 21 and 23 because of their abnormally large standard deviations mswd.test(age = dataset$age[c(-21, -23)], sd = dataset$one_sigma[c(-21, -23)]) # The p-value is larger than the nominal alpha of 0.05, so we can # not reject the null hypothesis of isochron conditions
data(laventa) # Do the age estimates for the boundaries of the Honda Group (i.e., samples at meters 56.4 # and 675.0) conform to the isochron hypothesis? hondaIndex <- which(laventa$elevation == 56.4 | laventa$elevation == 675.0) mswd.test(age = laventa$age[hondaIndex], sd = laventa$one_sigma[hondaIndex]) # The p-value is smaller than the nominal alpha of 0.05, so we can reject the null # hypothesis of isochron conditions # Do the age estimates for the samples JG-R 88-2 and JG-R 89-2 conform to the isochron hypothesis? twoLevelsIndex <- which(laventa$sample == "JG-R 89-2" | laventa$sample == "JG-R 88-2") dataset <- laventa[twoLevelsIndex, ] # Remove the values 21 and 23 because of their abnormally large standard deviations mswd.test(age = dataset$age[c(-21, -23)], sd = dataset$one_sigma[c(-21, -23)]) # The p-value is larger than the nominal alpha of 0.05, so we can # not reject the null hypothesis of isochron conditions
stratCI: Estimate the confidence intervals of endpoints in stratigraphic intervals
stratCI(times, method, nparams, C, endpoint, confidence, quantile)
stratCI(times, method, nparams, C, endpoint, confidence, quantile)
times |
a vector of occurrences in time for which we want to estimate the endpoints of the stratigraphic interval |
method |
a character describing which method to use, either 'Strauss-Sadler89' or 'Marshall94'. |
nparams |
A character indicating whether to estimate one or two parameters, possible values are 'one.par' and 'two.par'. |
C |
numeric indicating the confidence level, e.g. 0.95 |
endpoint |
used only for nparams = 'one.par'. Possible values are 'first' and 'last'. |
confidence |
the confidence interval level, usually 0.95. |
quantile |
the desired confidence level for a quantile representing the brackets around the confidence interval. |
For method='Strauss-Sadler89' we need to provide ‘nparams', 'C', and 'endpoint'. For method=’Marshall94' we need to provide 'confidence' and 'quantile'.
A named vector when using the 'Strauss-Sadler89' method, or an unnamed vector when using the 'Marshall94' method.
Gustavo A. Ballen.
data(andes) andes <- andes$ages # remove missing data andes <- andes[complete.cases(andes)] # remove outliers andes <- sort(andes[which(andes < 10)]) stratCI(andes, method="Strauss-Sadler89", nparams="one.par", C=0.95, endpoint="first") stratCI(andes, method="Strauss-Sadler89", nparams="one.par", C=0.95, endpoint="last") stratCI(andes, method="Strauss-Sadler89", nparams="two.par", C=0.95) stratCI(andes, method="Marshall94", confidence = 0.95, quantile = 0.8) stratCI(andes, method="Marshall94", confidence = 0.95, quantile = 0.95)
data(andes) andes <- andes$ages # remove missing data andes <- andes[complete.cases(andes)] # remove outliers andes <- sort(andes[which(andes < 10)]) stratCI(andes, method="Strauss-Sadler89", nparams="one.par", C=0.95, endpoint="first") stratCI(andes, method="Strauss-Sadler89", nparams="one.par", C=0.95, endpoint="last") stratCI(andes, method="Strauss-Sadler89", nparams="two.par", C=0.95) stratCI(andes, method="Marshall94", confidence = 0.95, quantile = 0.8) stratCI(andes, method="Marshall94", confidence = 0.95, quantile = 0.95)
summaryBrlen: Summarise branch lengths on trees with identical topology
summaryBrlen(mphy, method)
summaryBrlen(mphy, method)
mphy |
An list of objects of class multiPhylo. If a single object in this argument is of class multiPhylo, it is first enclosed in a list. |
method |
A character with the function name for the summary to be applied |
This function can be used on the output of topofreq from the $trees element in order to summarise the branch length on each topology set so that we have a single tree summarising both topology and branch lengths. Useful for depicting posterior tree density. Alternatively, it can be used with a single element provided that it is first enclosed in a list
A tree of class phylo with summary branch lengths in tree$edge.length.
set.seed(1) library(ape) trl <- ape::rmtree(10, 4) tpf <- topoFreq(unroot(trl), output="trees") sumtrees <- summaryBrlen(tpf$trees, method = "median") oldpar <- par(no.readonly = TRUE) par(mfrow=c(1,3)) plot(sumtrees[[1]], type="unrooted", show.node.label=FALSE, cex=1.5) plot(sumtrees[[2]], type="unrooted", show.node.label=FALSE, cex=1.5) plot(sumtrees[[3]], type="unrooted", show.node.label=FALSE, cex=1.5) par(oldpar)
set.seed(1) library(ape) trl <- ape::rmtree(10, 4) tpf <- topoFreq(unroot(trl), output="trees") sumtrees <- summaryBrlen(tpf$trees, method = "median") oldpar <- par(no.readonly = TRUE) par(mfrow=c(1,3)) plot(sumtrees[[1]], type="unrooted", show.node.label=FALSE, cex=1.5) plot(sumtrees[[2]], type="unrooted", show.node.label=FALSE, cex=1.5) plot(sumtrees[[3]], type="unrooted", show.node.label=FALSE, cex=1.5) par(oldpar)
table2nexus: Read a data matrix in delimited format and convert into a data matrix in nexus format
table2nexus( path, datatype = c("standard", "dna", "rna", "protein"), header = FALSE, sep = ",", con = stdout() )
table2nexus( path, datatype = c("standard", "dna", "rna", "protein"), header = FALSE, sep = ",", con = stdout() )
path |
a character vector of length 1 with the path to the table file. |
datatype |
a character vector of length 1 with the desired datatype. Possible values are STANDARD, DNA, RNA, or PROTEIN. Multicharacter types such as continuous or nucleotide are not supported. |
header |
a logical vector of length 1 indicating whether the table file has a header. Defaults to FALSE. |
sep |
a character vector of length 1 telling the kind of separator in the table file. Defaults to comma ",". |
con |
the connection to which the matrix should be returned. Defaults to stdout(), that is, return the text to the console. If writing to a file, then this should be the path to the output file. |
This function will concatenate matrices in nexus format (mandatory) and write to the disk the output and summary information on the partitions. It requires that the input matrices all share the same taxa in the same positions.
This function writes to the connected required a matrix in nexus format for a morphological dataset (that is, datatype=standard).
Gustavo A. Ballen
## Not run: # this will return the matrix to the console rather than to a file table2nexus(path="morpho.csv", datatype="standard", header=FALSE, sep=",") ## End(Not run)
## Not run: # this will return the matrix to the console rather than to a file table2nexus(path="morpho.csv", datatype="standard", header=FALSE, sep=",") ## End(Not run)
tnt2newick: Function for converting from TNT tree format to newick parenthetical format
tnt2newick( file, output = NULL, string = NULL, return = FALSE, subsetting = FALSE, name.sep = NULL )
tnt2newick( file, output = NULL, string = NULL, return = FALSE, subsetting = FALSE, name.sep = NULL )
file |
A vector of type 'character' with the path to the original TNT tree file. |
output |
A vector of type 'character' with the path to output files to contain the tree in newick format. |
string |
A vector of type 'character' which can be either an object in memory or a string for interactive transformation, in TNT format. Use file in case your tree(s) are stored in a file instead. |
return |
A 'logical' expression indicating whether to write the newick tree(s) to a file in 'output' (if FALSE, the default), or whether to return to the screen (if TRUE), potentially to be stored in a vector via the '<-' operator. |
subsetting |
A vector of type 'logical' indicating whether subsetting (i.e., chopping at once the first and last line of the TNT tree file) should be done. Otherwise, explicit text replacements removing such lines are used. The default is false because it does not play well with multi-tree TNT files. Only use subsetting = TRUE if you are sure that there is only one tree in the file with the commands, tread and proc as first and last lines. |
name.sep |
A vector of length 2 and type 'character' for carrying out separator conversion in the names of terminals. For instance, if the terminals have names composed of two words separated by an underscore (_) and you want them to be separated by space ( ) then use name.sep = c("_", " "). This does not support regular expressions. |
This function has been tested for cases where only one tree is in the original tnt tree file. Please be careful with files containing multiple trees.
This function writes to the disk a text file containing the tree converted to newick format. Alternatively, it returns the output to the screen or writes it to an object in memory thanks to the argument 'string'.
Gustavo A. Ballen
## Not run: tnt2newick(file = "someTrees.tre", return = TRUE) ## End(Not run)
## Not run: tnt2newick(file = "someTrees.tre", return = TRUE) ## End(Not run)
Frequency of topologies in a tree sample
topoFreq(mphy, output = "index", maxtrees = 10000)
topoFreq(mphy, output = "index", maxtrees = 10000)
mphy |
An object of class multiPhylo |
output |
A character indicating whether the tree indices or the actual trees should be returned. Defaults to "index" |
maxtrees |
A numeric indicating whether to warn about having more trees than the arbitrary threshold |
This function can be used e.w. with a posterior sample of trees from a Bayesian analysis where we want to explore the distribution of topologies in the posterior of trees. This way we can assess topological uncertainty in a more meaningful way than using a majority-rule consensus.
The use of 'maxtrees' is actually a convenience for keeping in mind that large amounts of trees can cause memory issues. This can end up in situations which are difficult to debug but that from personal experience have come from exactly that: More trees than memory can fit or which can be processed for calculating similarity. This number will _not_ break the function call but will return a warning. Try to avoid modifying its default value unless you are sure it will not cause any issues under your computing conditions (e.g. when lots of trees are being processed but also large RAM is available).
A list with an element containing the the different tree clusters (as multiPhylo) and the absolute, cumulative, and relative frequencies of each topology in the tree sample.
# tests set.seed(1) library(ape) trl <- ape::rmtree(10, 4) tpf <- topoFreq(ape::unroot(trl), output="trees")
# tests set.seed(1) library(ape) trl <- ape::rmtree(10, 4) tpf <- topoFreq(ape::unroot(trl), output="trees")
xintercept: Estimate the x-intercept of an empirical cdf
xintercept(x, method, alpha = 0.05, p = c(0.025, 0.975), R = 1000, robust)
xintercept(x, method, alpha = 0.05, p = c(0.025, 0.975), R = 1000, robust)
x |
A vector of type numeric with time data points. |
method |
Either "Draper-Smith" or "Bootstrap". The function will fail otherwise. |
alpha |
A vector of length one and type numeric with the nominal alpha value for the Draper-Smith method, defaults to 0.05. |
p |
A vector of length two and type numeric with the two-tail probability values for the CI. Defaults to 0.025 and 0.975. |
R |
The number of iterations to be used in the Bootstrap method. |
robust |
Logical value indicating whether to use robust regression using 'Rfit::rfit' ('robust = TRUE') or ordinary least squares 'lm' ('robust = FALSE'). |
This function will take a vector of time points, calculate the empirical cumulative density function, and regress its values in order to infer the x-intercept and its confidence interval. For plotting purposes, it will also return the x-y empirical cumulative density values.
A named list with three elements: 'param', the value of x_hat; 'ci', the lower and upper values of the confidence interval on x; 'ecdfxy', the x and y points for the empirical cumulative density curve.
Gustavo A. Ballen.
data(andes) ages <- andes$ages ages <- ages[complete.cases(ages)] # remove NAs ages <- ages[which(ages < 10)] # remove outliers # Draper-Smith, OLS draperSmithNormalX0 <- xintercept(x = ages, method = "Draper-Smith", alpha = 0.05, robust = FALSE) # Draper-Smith, Robust fit draperSmithRobustX0 <- xintercept(x = ages, method = "Draper-Smith", alpha = 0.05, robust = TRUE) # Bootstrap, OLS bootstrapNormalX0 <- xintercept(x = ages, method = "Bootstrap", p = c(0.025, 0.975), robust = FALSE) # Bootstrap, Robust fit bootstrapRobustX0 <- xintercept(x = ages, method = "Bootstrap", p = c(0.025, 0.975), robust = TRUE) # plot the estimations hist(ages, probability = TRUE, col = rgb(red = 0, green = 0, blue = 1, alpha = 0.3), xlim = c(0, 10), main = "CDF-based on confidence intervals", xlab = "Age (Ma)") # plot the lines for the estimator of Draper and Smith using lm arrows(x0 = draperSmithNormalX0$ci["upper"], y0 = 0.025, x1 = draperSmithNormalX0$ci["lower"], y1 = 0.025, code = 3, angle = 90, length = 0.1, lwd = 3, col = "darkblue") # plot the lines for the estimator of Draper and Smith using rfit arrows(x0 = draperSmithRobustX0$ci["upper"], y0 = 0.05, x1 = draperSmithRobustX0$ci["lower"], y1 = 0.05, code = 3, angle = 90, length = 0.1, lwd = 3, col = "darkgreen") # plot the lines for the estimator based on bootstrap arrows(x0 = bootstrapRobustX0$ci["upper"], y0 = 0.075, x1 = bootstrapRobustX0$ci["lower"], y1 = 0.075, code = 3, angle = 90, length = 0.1, lwd = 3, col = "darkred") # plot a legend legend(x = "topright", legend = c("Draper and Smith with lm", "Draper and Smith with rfit", "Bootstrap on x0"), col = c("darkblue", "darkgreen", "darkred"), lty = 1, lwd = 3)
data(andes) ages <- andes$ages ages <- ages[complete.cases(ages)] # remove NAs ages <- ages[which(ages < 10)] # remove outliers # Draper-Smith, OLS draperSmithNormalX0 <- xintercept(x = ages, method = "Draper-Smith", alpha = 0.05, robust = FALSE) # Draper-Smith, Robust fit draperSmithRobustX0 <- xintercept(x = ages, method = "Draper-Smith", alpha = 0.05, robust = TRUE) # Bootstrap, OLS bootstrapNormalX0 <- xintercept(x = ages, method = "Bootstrap", p = c(0.025, 0.975), robust = FALSE) # Bootstrap, Robust fit bootstrapRobustX0 <- xintercept(x = ages, method = "Bootstrap", p = c(0.025, 0.975), robust = TRUE) # plot the estimations hist(ages, probability = TRUE, col = rgb(red = 0, green = 0, blue = 1, alpha = 0.3), xlim = c(0, 10), main = "CDF-based on confidence intervals", xlab = "Age (Ma)") # plot the lines for the estimator of Draper and Smith using lm arrows(x0 = draperSmithNormalX0$ci["upper"], y0 = 0.025, x1 = draperSmithNormalX0$ci["lower"], y1 = 0.025, code = 3, angle = 90, length = 0.1, lwd = 3, col = "darkblue") # plot the lines for the estimator of Draper and Smith using rfit arrows(x0 = draperSmithRobustX0$ci["upper"], y0 = 0.05, x1 = draperSmithRobustX0$ci["lower"], y1 = 0.05, code = 3, angle = 90, length = 0.1, lwd = 3, col = "darkgreen") # plot the lines for the estimator based on bootstrap arrows(x0 = bootstrapRobustX0$ci["upper"], y0 = 0.075, x1 = bootstrapRobustX0$ci["lower"], y1 = 0.075, code = 3, angle = 90, length = 0.1, lwd = 3, col = "darkred") # plot a legend legend(x = "topright", legend = c("Draper and Smith with lm", "Draper and Smith with rfit", "Bootstrap on x0"), col = c("darkblue", "darkgreen", "darkred"), lty = 1, lwd = 3)