We provide functions for identifying the core community phylogeny in any microbiome, drawing phylogenetic Venn diagrams, calculating the core Faithâ s PD for a set of communities, and calculating the core UniFrac
distance between two sets of communities. All functions rely on construction of a core community phylogeny, which is a phylogeny where branches are defined based on their presence in multiple samples from a single type of habitat. Our package provides two options for constructing the core community phylogeny, a tip-based approach, where the core community phylogeny is identified based on incidence of leaf nodes and a branch-based approach, where the core community phylogeny is identified based on incidence of individual branches. We suggest use of the microViz
package.
We develop a novel matrix factorization tool named scINSIGHT
to jointly analyze multiple single-cell gene expression samples from biologically heterogeneous sources, such as different disease phases, treatment groups, or developmental stages. Given multiple gene expression samples from different biological conditions, scINSIGHT
simultaneously identifies common and condition-specific gene modules and quantify their expression levels in each sample in a lower-dimensional space. With the factorized results, the inferred expression levels and memberships of common gene modules can be used to cluster cells and detect cell identities, and the condition-specific gene modules can help compare functional differences in transcriptomes from distinct conditions. Please also see Qian K, Fu SW, Li HW, Li WV (2022) <doi:10.1186/s13059-022-02649-3>.
LINCS L1000 is a high-throughput technology that allows the gene expression measurement in a large number of assays. However, to fit the measurements of ~1000 genes in the ~500 color channels of LINCS L1000, every two landmark genes are designed to share a single channel. Thus, a deconvolution step is required to infer the expression values of each gene. Any errors in this step can be propagated adversely to the downstream analyses. We present a LINCS L1000 data peak calling R package l1kdeconv based on a new outlier detection method and an aggregate Gaussian mixture model. Upon the remove of outliers and the borrowing information among similar samples, l1kdeconv shows more stable and better performance than methods commonly used in LINCS L1000 data deconvolution.
Traditional and spatial capture-mark-recapture analysis with multiple non-invasive marks. The models implemented in multimark combine encounter history data arising from two different non-invasive "marks", such as images of left-sided and right-sided pelage patterns of bilaterally asymmetrical species, to estimate abundance and related demographic parameters while accounting for imperfect detection. Bayesian models are specified using simple formulae and fitted using Markov chain Monte Carlo. Addressing deficiencies in currently available software, multimark also provides a user-friendly interface for performing Bayesian multimodel inference using non-spatial or spatial capture-recapture data consisting of a single conventional mark or multiple non-invasive marks. See McClintock
(2015) <doi:10.1002/ece3.1676> and Maronde et al. (2020) <doi:10.1002/ece3.6990>.
This is a computational package designed to identify the most sensitive interactions within a network which must be estimated most accurately in order to produce qualitatively robust predictions to a press perturbation. This is accomplished by enumerating the number of sign switches (and their magnitude) in the net effects matrix when an edge experiences uncertainty. The package produces data and visualizations when uncertainty is associated to one or more edges in the network and according to a variety of distributions. The software requires the network to be described by a system of differential equations but only requires as input a numerical Jacobian matrix evaluated at an equilibrium point. This package is based on Koslicki, D., & Novak, M. (2017) <doi:10.1007/s00285-017-1163-0>.
Analysis of forest population structure and quantitative dynamics is the research and evaluation of the composition, distribution, age structure and changes in quantity over time of various populations in the forest. By deeply understanding these characteristics of forest populations, scientific basis can be provided for the management, protection and sustainable utilization of forest resources. This R package conducts a systematic analysis of forest population structure and quantitative dynamics through analyzing age structure, compiling life tables, population quantitative dynamic change indices and time series models, in order to provide support for forest population protection and sustainable management. References: Zhang Y, Wang J, Wang X, et al(2024)<doi:10.3390/plants13070946>. Yuan G, Guo Q, Xie N, et al(2023)<doi:10.1007/s11629-022-7429-z>.
This package provides functions for the computation of F-, f- and D-statistics (e.g., Fst, hierarchical F-statistics, Patterson's F2, F3, F3*, F4 and D parameters) in population genomics studies from allele count or Pool-Seq read count data and for the fitting, building and visualization of admixture graphs. The package also includes several utilities to manipulate Pool-Seq data stored in standard format (e.g., such as vcf files or rsync files generated by the the PoPoolation
software) and perform conversion to alternative format (as used in the BayPass
and SelEstim
software). As of version 2.0, the package also includes utilities to manipulate standard allele count data (e.g., stored in TreeMix
, BayPass
and SelEstim
format).
The biomarker data set by Vermeulen et al. (2009) <doi:10.1016/S1470-2045(09)70154-8> is provided. The data source, however, is by Ruijter et al. (2013) <doi:10.1016/j.ymeth.2012.08.011>. The original data set may be downloaded from <https://medischebiologie.nl/wp-content/uploads/2019/02/qpcrdatamethods.zip>. This data set is for a real-time quantitative polymerase chain reaction (PCR) experiment that comprises the raw fluorescence data of 24,576 amplification curves. This data set comprises 59 genes of interest and 5 reference genes. Each gene was assessed on 366 neuroblastoma complementary DNA (cDNA
) samples and on 18 standard dilution series samples (10-fold 5-point dilution series x 3 replicates + no template controls (NTC) x 3 replicates).
Analyze and compare conversations using various similarity measures including topic, lexical, semantic, structural, stylistic, sentiment, participant, and timing similarities. Supports both pairwise conversation comparisons and analysis of multiple dyads. Methods are based on established research: Topic modeling: Blei et al. (2003) <doi:10.1162/jmlr.2003.3.4-5.993>; Landauer et al. (1998) <doi:10.1080/01638539809545028>; Lexical similarity: Jaccard (1912) <doi:10.1111/j.1469-8137.1912.tb05611.x>; Semantic similarity: Salton & Buckley (1988) <doi:10.1016/0306-4573(88)90021-0>; Mikolov et al. (2013) <doi:10.48550/arXiv.1301.3781>
; Pennington et al. (2014) <doi:10.3115/v1/D14-1162>; Structural and stylistic analysis: Graesser et al. (2004) <doi:10.1075/target.21131.ryu>; Sentiment analysis: Rinker (2019) <https://github.com/trinker/sentimentr>.
This package contains functions carrying out adaptive procedures using mixed scaling approach to establish bioequivalence for in-vitro permeation test (IVPT) data. Currently, the package provides procedures based on parallel replicate design and balanced data, according to the U.S. Food and Drug Administration's "Draft Guidance on Acyclovir" <https://www.accessdata.fda.gov/drugsatfda_docs/psg/Acyclovir_topical%20cream_RLD%2021478_RV12-16.pdf>. Potvin et al. (2008) <doi:10.1002/pst.294> provides the basis for our adaptive design (see Method B). For a comprehensive overview of the method, refer to Lim et al. (2023) <doi:10.1002/pst.2333>. This package reflects the views of the authors and should not be construed to represent the views or policies of the U.S. Food and Drug Administration.
Compute a cyclist's Eddington number, including efficiently computing cumulative E over a vector. A cyclist's Eddington number <https://en.wikipedia.org/wiki/Arthur_Eddington#Eddington_number_for_cycling> is the maximum number satisfying the condition such that a cyclist has ridden E miles or greater on E distinct days. The algorithm in this package is an improvement over the conventional approach because both summary statistics and cumulative statistics can be computed in linear time, since it does not require initial sorting of the data. These functions may also be used for computing h-indices for authors, a metric described by Hirsch (2005) <doi:10.1073/pnas.0507655102>. Both are specific applications of computing the side length of a Durfee square <https://en.wikipedia.org/wiki/Durfee_square>.
This package provides a wrapper around the LIBLINEAR C/C++ library for machine learning (available at <https://www.csie.ntu.edu.tw/~cjlin/liblinear/>). LIBLINEAR is a simple library for solving large-scale regularized linear classification and regression. It currently supports L2-regularized classification (such as logistic regression, L2-loss linear SVM and L1-loss linear SVM) as well as L1-regularized classification (such as L2-loss linear SVM and logistic regression) and L2-regularized support vector regression (with L1- or L2-loss). The main features of LiblineaR
include multi-class classification (one-vs-the rest, and Crammer & Singer method), cross validation for model selection, probability estimates (logistic regression only) or weights for unbalanced data. The estimation of the models is particularly fast as compared to other libraries.
Fits Bayesian time-course models for model-based network meta-analysis (MBNMA) that allows inclusion of multiple time-points from studies. Repeated measures over time are accounted for within studies by applying different time-course functions, following the method of Pedder et al. (2019) <doi:10.1002/jrsm.1351>. The method allows synthesis of studies with multiple follow-up measurements that can account for time-course for a single or multiple treatment comparisons. Several general time-course functions are provided; others may be added by the user. Various characteristics can be flexibly added to the models, such as correlation between time points and shared class effects. The consistency of direct and indirect evidence in the network can be assessed using unrelated mean effects models and/or by node-splitting.
Statistical Analyses and Pooling after Multiple Imputation. A large variety of repeated statistical analysis can be performed and finally pooled. Statistical analysis that are available are, among others, Levene's test, Odds and Risk Ratios, One sample proportions, difference between proportions and linear and logistic regression models. Functions can also be used in combination with the Pipe operator. More and more statistical analyses and pooling functions will be added over time. Heymans (2007) <doi:10.1186/1471-2288-7-33>. Eekhout (2017) <doi:10.1186/s12874-017-0404-7>. Wiel (2009) <doi:10.1093/biostatistics/kxp011>. Marshall (2009) <doi:10.1186/1471-2288-9-57>. Sidi (2021) <doi:10.1080/00031305.2021.1898468>. Lott (2018) <doi:10.1080/00031305.2018.1473796>. Grund (2021) <doi:10.31234/osf.io/d459g>.
Full dynamic system to describe and forecast the spread and the severity of a developing pandemic, based on available data. These data are number of infections, hospitalizations, deaths and recoveries notified each day. The system consists of three transitions, infection-infection, infection-hospital and hospital-death/recovery. The intensities of these transitions are dynamic and estimated using non-parametric local linear estimators. The package can be used to provide forecasts and survival indicators such as the median time spent in hospital and the probability that a patient who has been in hospital for a number of days can leave it alive. Methods are described in Gámiz, Mammen, Martà nez-Miranda, and Nielsen (2024) <doi:10.48550/arXiv.2308.09918>
and <doi:10.48550/arXiv.2308.09919>
.
This package implements methods for obtaining kernel density estimates subject to a variety of shape constraints (unimodality, bimodality, symmetry, tail monotonicity, bounds, and constraints on the number of inflection points). Enforcing constraints can eliminate unwanted waves or kinks in the estimate, which improves its subjective appearance and can also improve statistical performance. The main function scdensity()
is very similar to the density()
function in stats', allowing shape-restricted estimates to be obtained with little effort. The methods implemented in this package are described in Wolters and Braun (2017) <doi:10.1080/03610918.2017.1288247>, Wolters (2012) <doi:10.18637/jss.v047.i06>, and Hall and Huang (2002) <https://www3.stat.sinica.edu.tw/statistica/j12n4/j12n41/j12n41.htm>. See the scdensity()
help for for full citations.
Several generalized / directional Fixed Sequence Multiple Testing Procedures (FSMTPs) are developed for testing a sequence of pre-ordered hypotheses while controlling the FWER, FDR and Directional Error (mdFWER
). All three FWER controlling generalized FSMTPs are designed under arbitrary dependence, which allow any number of acceptances. Two FDR controlling generalized FSMTPs are respectively designed under arbitrary dependence and independence, which allow more but a given number of acceptances. Two mdFWER
controlling directional FSMTPs are respectively designed under arbitrary dependence and independence, which can also make directional decisions based on the signs of the test statistics. The main functions for each proposed generalized / directional FSMTPs are designed to calculate adjusted p-values and critical values, respectively. For users convenience, the functions also provide the output option for printing decision rules.
This package provides the facility to perform the chi-square and G-square test of independence, calculates the retrospective power of the traditional chi-square test, compute permutation and Monte Carlo p-value, and provides measures of association for tables of any size such as Phi, Phi corrected, odds ratio with 95 percent CI and p-value, Yule Q and Y, adjusted contingency coefficient, Cramer's V, V corrected, V standardised, bias-corrected V, W, Cohen's w, Goodman-Kruskal's lambda, and tau. It also calculates standardised, moment-corrected standardised, and adjusted standardised residuals, and their significance, as well as the Quetelet Index, IJ association factor, and adjusted standardised counts. It also computes the chi-square-maximising version of the input table. Different outputs are returned in nicely formatted tables.
Vector autoregressive (VAR) model is a fundamental and effective approach for multivariate time series analysis. Shrinkage estimation methods can be applied to high-dimensional VAR models with dimensionality greater than the number of observations, contrary to the standard ordinary least squares method. This package is an integrative package delivering nonparametric, parametric, and semiparametric methods in a unified and consistent manner, such as the multivariate ridge regression in Golub, Heath, and Wahba (1979) <doi:10.2307/1268518>, a James-Stein type nonparametric shrinkage method in Opgen-Rhein and Strimmer (2007) <doi:10.1186/1471-2105-8-S2-S3>, and Bayesian estimation methods using noninformative and informative priors in Lee, Choi, and S.-H. Kim (2016) <doi:10.1016/j.csda.2016.03.007> and Ni and Sun (2005) <doi:10.1198/073500104000000622>.
The successor to the AlphaSim
software for breeding program simulation [Faux et al. (2016) <doi:10.3835/plantgenome2016.02.0013>]. Used for stochastic simulations of breeding programs to the level of DNA sequence for every individual. Contained is a wide range of functions for modeling common tasks in a breeding program, such as selection and crossing. These functions allow for constructing simulations of highly complex plant and animal breeding programs via scripting in the R software environment. Such simulations can be used to evaluate overall breeding program performance and conduct research into breeding program design, such as implementation of genomic selection. Included is the Markovian Coalescent Simulator ('MaCS
') for fast simulation of biallelic sequences according to a population demographic history [Chen et al. (2009) <doi:10.1101/gr.083634.108>].
This package provides access to word predictability estimates using large language models (LLMs) based on transformer architectures via integration with the Hugging Face ecosystem <https://huggingface.co/>. The package interfaces with pre-trained neural networks and supports both causal/auto-regressive LLMs (e.g., GPT-2') and masked/bidirectional LLMs (e.g., BERT') to compute the probability of words, phrases, or tokens given their linguistic context. For details on GPT-2 and causal models, see Radford et al. (2019) <https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf>, for details on BERT and masked models, see Devlin et al. (2019) <doi:10.48550/arXiv.1810.04805>
. By enabling a straightforward estimation of word predictability, the package facilitates research in psycholinguistics, computational linguistics, and natural language processing (NLP).
This package provides methods for statistical analysis of count data and quantal data. For the analysis of count data an implementation of the Closure Principle Computational Approach Test ("CPCAT") is provided (Lehmann, R et al. (2016) <doi:10.1007/s00477-015-1079-4>), as well as an implementation of a "Dunnett GLM" approach using a Quasi-Poisson regression (Hothorn, L, Kluxen, F (2020) <doi:10.1101/2020.01.15.907881>). For the analysis of quantal data an implementation of the Closure Principle Fisherâ Freemanâ Halton test ("CPFISH") is provided (Lehmann, R et al. (2018) <doi:10.1007/s00477-017-1392-1>). P-values and no/lowest observed (adverse) effect concentration values are calculated. All implemented methods include further functions to evaluate the power and the minimum detectable difference using a bootstrapping approach.
Analysis of complex plant root system architectures (RSA) using the output files created by Data Analysis of Root Tracings (DART), an open-access software dedicated to the study of plant root architecture and development across time series (Le Bot et al (2010) "DART: a software to analyse root system architecture and development from captured images", Plant and Soil, <DOI:10.1007/s11104-009-0005-2>), and RSA data encoded with the Root System Markup Language (RSML) (Lobet et al (2015) "Root System Markup Language: toward a unified root architecture description language", Plant Physiology, <DOI:10.1104/pp.114.253625>). More information can be found in Delory et al (2016) "archiDART
: an R package for the automated computation of plant root architectural traits", Plant and Soil, <DOI:10.1007/s11104-015-2673-4>.
The primary goal of phase I clinical trials is to find the maximum tolerated dose (MTD). To reach this objective, we introduce a new design for phase I clinical trials, the posterior predictive (PoP
) design. The PoP
design is an innovative model-assisted design that is as simply as the conventional algorithmic designs as its decision rules can be pre-tabulated prior to the onset of trial, but is of more flexibility of selecting diverse target toxicity rates and cohort sizes. The PoP
design has desirable properties, such as coherence and consistency. Moreover, the PoP
design provides better empirical performance than the BOIN and Keyboard design with respect to high average probabilities of choosing the MTD and slightly lower risk of treating patients at subtherapeutic or overly toxic doses.