This package provides a lightweight, reproducible workflow for selecting and executing an appropriate statistical analysis in one-way or two-way experimental designs. The package automatically checks for data normality, conducts parametric (ANOVA) or non-parametric (Kruskal-Wallis) tests, performs post-hoc comparisons with Compact Letter Displays (CLD), and generates publication-ready boxplots, faceted plots, and heatmaps. It is designed for researchers seeking fast, automated statistical summaries and visualization. It is based on established statistical methods including Shapiro and Wilk (1965) <doi:10.2307/2333709>, Kruskal and Wallis (1952) <doi:10.1080/01621459.1952.10483441>, Tukey (1949) <doi:10.2307/3001913>, Fisher (1925) <ISBN:0050021702>, and Wickham (2016) <ISBN:978-3-319-24277-4>.
Innovative Trend Analysis is a graphical method to examine the trends in time series data. The Sequential Mann-Kendall test uses the intersection of prograde and retrograde series to indicate the possible change point in time series data. Distribution-free cumulative sum charts indicate the location and significance of the change point in time series. Zekai, S. (2011). <doi:10.1061/(ASCE)HE.1943-5584.0000556>. Grayson, R. B. et al. (1996). Hydrological Recipes: Estimation Techniques in Australian Hydrology. Cooperative Research Centre for Catchment Hydrology, Australia, p. 125. Sneyers, S. (1990). On the statistical analysis of series of observations. Technical note no 5 143, WMO No 725 415. Secretariat of the World Meteorological Organization, Geneva, 192 pp.
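The prograde and retrograde series of the Sequential Mann-Kendall test can be sketched as follows. This is a minimal illustration in Python using the commonly cited textbook formulation (rank counts standardized by their null mean and variance), not the package's own code; the function names sequential_mk and prograde_retrograde are hypothetical.

```python
import math


def sequential_mk(series):
    """Standardized sequential Mann-Kendall statistic u(t_k) for k = 1..n."""
    u = []
    t = 0  # running count of (earlier, later) pairs with earlier < later
    for k, x_k in enumerate(series, start=1):
        t += sum(1 for x_j in series[:k - 1] if x_j < x_k)
        mean = k * (k - 1) / 4.0
        var = k * (k - 1) * (2 * k + 5) / 72.0
        # The variance is zero for k = 1, so the statistic is set to 0 there.
        u.append((t - mean) / math.sqrt(var) if var > 0 else 0.0)
    return u


def prograde_retrograde(series):
    """Return the prograde curve and the time-aligned retrograde curve."""
    series = list(series)
    prograde = sequential_mk(series)
    # Retrograde: same statistic on the reversed series, sign-flipped,
    # then reversed again so that index i refers to the same time step.
    retrograde = [-v for v in sequential_mk(series[::-1])][::-1]
    return prograde, retrograde
```

A candidate change point is then read off where the two curves intersect; whether it is significant is judged against the usual normal critical values.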
The package provides functionality that can be useful for the analysis of the high-density tiling microarray data (such as from Affymetrix genechips) or for measuring the transcript abundance and the architecture. The main functionalities of the package are:
the class segmentation for representing partitionings of a linear series of data;
the function segment for fitting piecewise constant models using a dynamic programming algorithm that is both fast and exact;
the function confint for calculating confidence intervals using the strucchange package;
the function plotAlongChrom for generating pretty plots;
the function normalizeByReference for probe-sequence dependent response adjustment from a (set of) reference hybridizations.
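Fitting a piecewise constant model exactly by dynamic programming can be sketched as below. This is an illustrative Python version under the assumption of a least-squares cost and a fixed number of segments; the name segment_piecewise_constant and its return convention are hypothetical and do not mirror the package's segment function.

```python
def segment_piecewise_constant(y, n_segments):
    """Exact least-squares fit of a piecewise constant model.

    Returns (total_cost, change_points), where change_points holds the
    start index of every segment after the first.
    """
    n = len(y)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(y)")

    # Prefix sums give the squared-error cost of any interval in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(y):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        """Sum of squared deviations from the mean over y[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[s][j]: minimal cost of splitting y[:j] into s segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for s in range(1, n_segments + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                candidate = best[s - 1][i] + cost(i, j)
                if candidate < best[s][j]:
                    best[s][j] = candidate
                    back[s][j] = i

    # Follow the back-pointers to recover the segment starts.
    change_points = []
    j = n
    for s in range(n_segments, 1, -1):
        j = back[s][j]
        change_points.append(j)
    return best[n_segments][n], change_points[::-1]
```

For example, segment_piecewise_constant([0, 0, 0, 5, 5, 5], 2) places the single change point at index 3. The triple loop costs O(S n^2) for S segments, which is the price of an exact rather than greedy solution.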
This package provides several Bayesian survival models for spatial/non-spatial survival data: proportional hazards (PH), accelerated failure time (AFT), proportional odds (PO), and accelerated hazards (AH), a super model that includes PH, AFT, PO and AH as special cases, Bayesian nonparametric nonproportional hazards (LDDPM), generalized accelerated failure time (GAFT), and spatially smoothed Polya tree density estimation. The spatial dependence is modeled via frailties under PH, AFT, PO, AH and GAFT, and via copulas under LDDPM and PH. Model choice is carried out via the logarithm of the pseudo marginal likelihood (LPML), the deviance information criterion (DIC), and the Watanabe-Akaike information criterion (WAIC). See Zhou, Hanson and Zhang (2020) <doi:10.18637/jss.v092.i09>.
Statistical performance measures used in the econometric literature to evaluate conditional covariance/correlation matrix estimates (MSE, MAE, Euclidean distance, Frobenius distance, Stein distance, asymmetric loss function, eigenvalue loss function and the loss function defined in Eq. (4.6) of Engle et al. (2016) <doi:10.2139/ssrn.2814555>). Additionally, compute Eq. (3.1) and (4.2) of Li et al. (2016) <doi:10.1080/07350015.2015.1092975> to compare the factor loading matrix. The statistical performance measures implemented have been previously used in, for instance, Laurent et al. (2012) <doi:10.1002/jae.1248>, Amendola et al. (2015) <doi:10.1002/for.2322> and Becker et al. (2015) <doi:10.1016/j.ijforecast.2013.11.007>.
Biologically relevant, yet mathematically sound constraints are used to compute the propensity and thence infer the dominant direction of reactions of a generic biochemical network. The reactions must be unique and their number must exceed that of the reactants, i.e., reactions >= reactants + 2. ReDirection computes the null space of a user-defined stoichiometry matrix. The spanning non-zero and unique reaction vectors (RVs) are combinatorially summed to generate one or more subspaces recursively. Every reaction is represented as a sequence of identical components across all RVs of a particular subspace. The terms are evaluated using biologically relevant bounds, linear maps, tests of convergence, descriptive statistics, and vector norms, and are then classified into forward, reverse, and equivalent subsets. Since these are mutually exclusive, the probability of occurrence is binary (all, 1; none, 0). The combined propensity of a reaction is the p1-norm of the sub-propensities, i.e., the sum of the products of the probability and the maximum numeric value of a subset (least upper bound, greatest lower bound). This value, if strictly positive, is the probable rate constant and is used to infer the dominant direction and annotate a reaction as "Forward (f)", "Reverse (b)" or "Equivalent (e)". The inherent computational complexity (NP-hard) per iteration suggests that a suitable value for the number of reactions is around 20. Three functions comprise ReDirection: check_matrix() and reaction_vector(), which are internal, and calculate_reaction_vector(), which is external.
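The first step, computing the null space of a stoichiometry matrix, can be illustrated with exact rational arithmetic. The Python sketch below is not the package's implementation; the function name null_space is hypothetical, and the orientation of the matrix (rows as reactants, columns as reactions) is an assumption made for the example.

```python
from fractions import Fraction


def null_space(matrix):
    """Basis of the null space of `matrix` (a list of rows), computed exactly."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0])
    pivots = []
    r = 0
    # Gauss-Jordan elimination to reduced row echelon form.
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        pivot_value = rows[r][c]
        rows[r] = [v / pivot_value for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    # Each free column yields one basis vector.
    basis = []
    for f in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[f] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            vec[p] = -rows[row_idx][f]
        basis.append(vec)
    return basis


# Assumed toy network: 2 reactants (rows) and 4 reactions (columns),
# which satisfies reactions >= reactants + 2.
S = [[1, -1, 0, -1],
     [0, 1, -1, 0]]
basis = null_space(S)
```

Every vector in basis satisfies S v = 0, so it describes a combination of reaction rates that leaves all reactant amounts unchanged.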
This package provides tools for the identification of unique multilocus genotypes when both genotyping error and missing data may be present; it is targeted for use with large datasets and databases containing multiple samples of each individual (a common situation in conservation genetics, particularly in non-invasive wildlife sampling applications). Functions explicitly incorporate missing data and can tolerate allele mismatches created by genotyping error. If you use this package, please cite the original publication in Molecular Ecology Resources (Galpern et al., 2012), the details for which can be generated using citation('allelematch'). For a complete vignette, please access via the Data S1 Supplementary documentation and tutorials (PDF) located at <doi:10.1111/j.1755-0998.2012.03137.x>.
Fuzzy forests, a new algorithm based on random forests, is designed to reduce the bias seen in random forest feature selection caused by the presence of correlated features. Fuzzy forests uses recursive feature elimination random forests to select features from separate blocks of correlated features where the correlation within each block of features is high and the correlation between blocks of features is low. One final random forest is fit using the surviving features. This package fits random forests using the randomForest package and allows for easy use of WGCNA to split features into distinct blocks. See D. Conn, Ngun, T., C. Ramirez, and G. Li (2019) <doi:10.18637/jss.v091.i09> for further details.
This package provides a collection of functions that calculate the log likelihood (support) for a range of statistical tests. Where possible the likelihood function and likelihood interval for the observed data are displayed. The evidential approach used here is based on the book "Likelihood" by A.W.F. Edwards (1992, ISBN-13 : 978-0801844430), "Statistical Evidence" by R. Royall (1997, ISBN-13 : 978-0412044113), S.N. Goodman & R. Royall (2011) <doi:10.2105/AJPH.78.12.1568>, "Understanding Psychology as a Science" by Z. Dienes (2008, ISBN-13 : 978-0230542310), S. Glover & P. Dixon <doi:10.3758/BF03196706> and others. This package accompanies "Evidence-Based Statistics" by P. Cahusac (2020, ISBN-13 : 978-1119549802) <doi:10.1002/9781119549833>.
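As an illustration of the evidential approach, the support (natural log likelihood ratio) for one binomial proportion over another can be computed directly. The Python sketch below is a generic example and not the package's API; the function names are hypothetical, and the default drop of 2 log units for the likelihood interval is an assumption following a common convention.

```python
import math


def binomial_support(successes, trials, p1, p0):
    """Support (log likelihood ratio) for p1 versus p0 given binomial data."""
    failures = trials - successes
    return (successes * math.log(p1 / p0)
            + failures * math.log((1 - p1) / (1 - p0)))


def likelihood_interval(successes, trials, drop=2.0, grid=10000):
    """Range of p whose support relative to the MLE is at least -drop.

    Assumes 0 < successes < trials so that the MLE lies strictly in (0, 1).
    """
    mle = successes / trials
    inside = [i / grid for i in range(1, grid)
              if binomial_support(successes, trials, i / grid, mle) >= -drop]
    return min(inside), max(inside)


# 7 successes in 10 trials: support for p = 0.7 over p = 0.5.
s = binomial_support(7, 10, 0.7, 0.5)
low, high = likelihood_interval(7, 10)
```

The binomial coefficient cancels in the ratio, which is why it does not appear in the support.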
Unequal granularity of cell type annotation makes it difficult to compare scRNA-seq datasets at scale. Leveraging the ontology system for defining cell type hierarchy, scOntoMatch aims to align cell type annotations to make them comparable across studies. The alignment involves two core steps: the first trims the cell type tree within each dataset so that no cell type has descendants, and the second maps cell type labels across studies by direct matching and by mapping descendants to ancestors. Various functions for plotting cell type trees and manipulating ontology terms are also provided. The compendium of datasets with curated ontology labels in the Single Cell Expression Atlas hosted at EBI is a good source of input for this package.
This package provides methods (<doi:10.7717/peerj.11534>) for calibrating and predicting shifts in allele frequencies through redundancy analysis (vegan::rda()) and generalized additive models (mgcv::gam()). Visualization functions for predicted changes in allele frequencies include shift.dot.ggplot(), shift.pie.ggplot(), shift.moon.ggplot(), shift.waffle.ggplot() and shift.surf.ggplot(); these take input data sets prepared by helper functions for each visualization method. Examples in the documentation show how to prepare animated climate change graphics through a time series with the gganimate package. Function amova.rda() shows how Analysis of Molecular Variance can be directly conducted with the results from redundancy analysis.
Various algorithms for segmentation of 2D and 3D images, such as computed tomography and satellite remote sensing. This package implements Bayesian image analysis using the hidden Potts model with external field prior of Moores et al. (2015) <doi:10.1016/j.csda.2014.12.001>. Latent labels are sampled using chequerboard updating or Swendsen-Wang. Algorithms for the smoothing parameter include pseudolikelihood, path sampling, the exchange algorithm, approximate Bayesian computation (ABC-MCMC and ABC-SMC), and the parametric functional approximate Bayesian (PFAB) algorithm. Refer to <doi:10.1007/978-3-030-42553-1_6> for an overview and also to <doi:10.1007/s11222-014-9525-6> and <doi:10.1214/18-BA1130> for further details of specific algorithms.
Spatial stratified heterogeneity (SSH) refers to the situation in which observations within strata are more similar than observations between strata; a model with global parameters would be confounded if the input data exhibit SSH. Note that "spatial" here can be either geospatial or a space in the mathematical sense. The geographical detector is a novel tool to investigate SSH: (1) measure and find SSH of a variable Y; (2) test the power of determinant X of a dependent variable Y according to the consistency between their spatial distributions; and (3) investigate the interaction between two explanatory variables X1 and X2 to a dependent variable Y (Wang et al 2014 <doi:10.1080/13658810802443457>, Wang, Zhang, and Fu 2016 <doi:10.1016/j.ecolind.2016.02.052>).
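The degree of SSH is commonly summarized by the q-statistic, q = 1 - (sum over strata of N_h * var_h) / (N * var), where var_h and var are the within-stratum and overall population variances. The Python sketch below illustrates that formula only; the function name q_statistic is hypothetical and the code is not taken from the package.

```python
def q_statistic(values, strata):
    """q-statistic of `values` with respect to the stratum labels `strata`."""
    groups = {}
    for value, label in zip(values, strata):
        groups.setdefault(label, []).append(value)

    def sum_of_squares(xs):
        """N times the population variance of xs."""
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs)

    total = sum_of_squares(values)
    if total == 0:
        raise ValueError("values have zero variance; q is undefined")
    within = sum(sum_of_squares(group) for group in groups.values())
    return 1.0 - within / total


# Strata that separate Y perfectly give q = 1; uninformative strata give q = 0.
q_perfect = q_statistic([1, 1, 5, 5], ["a", "a", "b", "b"])
q_none = q_statistic([1, 5, 1, 5], ["a", "a", "b", "b"])
```

Values of q close to 1 indicate that the stratification explains most of the variance of Y, while values close to 0 indicate that it explains almost none.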
This package provides a collection of data processing, visualization, and export functions to support soil survey operations. Many of the functions build on the `SoilProfileCollection` S4 class provided by the aqp package, extending baseline visualization to more elaborate depictions in the context of spatial and taxonomic data. While this package is primarily developed by and for the USDA-NRCS, in support of the National Cooperative Soil Survey, the authors strive for generalization sufficient to support any soil survey operation. Many of the included functions are used by the SoilWeb suite of websites and mobile applications. These functions are provided here, with additional documentation, to enable others to replicate high quality versions of these figures for their own purposes.
This is a versatile R package designed for comprehensive feature selection in bulk RNAseq datasets. Its key innovation lies in the seamless integration of the Python scikit-learn (<https://scikit-learn.org/stable/index.html>) machine learning framework with R-based bioinformatics tools. GeneSelectR performs robust Machine Learning-driven (ML) feature selection while leveraging Gene Ontology (GO) enrichment analysis as described by Thomas PD et al. (2022) <doi:10.1002/pro.4218>, using clusterProfiler (Wu et al., 2021) <doi:10.1016/j.xinn.2021.100141> and semantic similarity analysis powered by simplifyEnrichment (Gu, Huebschmann, 2021) <doi:10.1016/j.gpb.2022.04.008>. This combination of methodologies optimizes computational and biological insights for analyzing complex RNAseq datasets.
This package provides a systematic biology tool developed to prioritize cancer subtype-specific drugs by integrating genetic perturbation, drug action, biological pathway, and cancer subtype. The capabilities of this tool include inferring patient-specific subpathway activity profiles in the context of gene expression profiles with subtype labels, calculating differentially expressed subpathways based on cultured human cells treated with drugs in the cMap (connectivity map) database, prioritizing cancer subtype-specific drugs according to a drug-disease reverse association score based on subpathways, and visualizing the results (Castelo (2013) <doi:10.1186/1471-2105-14-7>; Han et al (2019) <doi:10.1093/bioinformatics/btz894>; Lamb and Justin (2006) <doi:10.1126/science.1132939>). Please cite using <doi:10.1093/bioinformatics/btab011>.
This package provides a simulation model and accompanying functions that support assessing silvicultural concepts on the forest estate level with a focus on the CO2 uptake by wood growth and CO2 emissions by forest operations. For achieving this, a virtual forest estate area is split into the areas covered by typical phases of the silvicultural concept of interest. Given initial area shares of these phases, the dynamics of these areas is simulated. The typical carbon stocks and flows which are known for all phases are attributed post-hoc to the areas and upscaled to the estate level. CO2 emissions by forest operations are estimated based on the amounts and dimensions of the harvested timber. Probabilities of damage events are taken into account.
This package provides a multi-core R package that allows for the statistical modeling of multi-group multivariate mixed data using Gaussian graphical models. Combining the Gaussian copula framework with the fused graphical lasso penalty, the heteromixgm package can handle a wide variety of datasets found in various sciences. The package also includes an option to perform model selection using the AIC, BIC and EBIC information criteria, a function that plots partial correlation graphs based on the selected precision matrices, a function that simulates mixed heterogeneous data for exploratory or simulation purposes, and one multi-group multivariate mixed agricultural dataset pertaining to maize yields. The package implements the methodological developments found in Hermes et al. (2024) <doi:10.1080/10618600.2023.2289545>.
An R implementation of methods employed in the field of pedometrics, the soil science discipline dedicated to studying the spatial, temporal, and spatio-temporal variation of soil using statistical and computational methods. The methods found here include the calibration of linear regression models using covariate selection strategies, computation of summary validation statistics for predictions, generation of summary plots, evaluation of the local quality of a geostatistical model of uncertainty, and so on. Other functions simply extend the functionalities of or facilitate the usage of functions from other packages that are commonly used for the analysis of soil data. Formerly available versions of suggested packages no longer available from CRAN can be obtained from the CRAN archive <https://cran.r-project.org/src/contrib/Archive/>.
Computation of sparse eigenvectors of a matrix (aka sparse PCA) with running time 2-3 orders of magnitude lower than existing methods and better final performance in terms of recovery of sparsity pattern and estimation of numerical values. Can handle covariance matrices as well as data matrices with real or complex-valued entries. Different levels of sparsity can be specified for each individual ordered eigenvector and the method is robust in parameter selection. See the vignette for detailed documentation and a comparison, with several illustrative examples. The package is based on the paper: K. Benidis, Y. Sun, P. Babu, and D. P. Palomar (2016). "Orthogonal Sparse PCA and Covariance Estimation via Procrustes Reformulation," IEEE Transactions on Signal Processing <doi:10.1109/TSP.2016.2605073>.
Increasingly powerful techniques for high-throughput sequencing open the possibility to comprehensively characterize microbial communities, including rare species. However, a still unresolved issue is the substantial error rate in the experimental process generating these sequences. To overcome this limitation we propose an approach in which each sample is split and the same amplification and sequencing protocol is applied to both halves. By comparing the results of both parts, this procedure should make it possible to distinguish likely PCR and sequencing artifacts from true rare species. The AmpliconDuo package, where amplicon duo from here on refers to the two amplicon data sets of a split sample, is intended to help interpret the obtained read frequency distribution across split samples, and to filter the false positive reads.
This package provides a Bayesian framework to estimate the degrees of freedom of the Student's t-distribution. Markov Chain Monte Carlo sampling routines are developed as in <doi:10.3390/axioms11090462> to sample from the posterior distribution of the degrees of freedom. A random walk Metropolis algorithm is used for sampling when Jeffreys and Gamma priors are placed on the degrees of freedom. In addition, the Metropolis-adjusted Langevin algorithm is used for sampling under the Jeffreys prior specification. The Log-normal prior over the degrees of freedom is posed as a viable choice with comparable performance in simulations and real-data application, against other prior choices, where an Elliptical Slice Sampler is used to sample from the concerned posterior.
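A random walk Metropolis sampler for the degrees of freedom under a Gamma prior can be sketched as follows. This Python example is a generic illustration, not the package's routine: the function names, the Gamma hyperparameters (shape 2, rate 0.1), the Gaussian proposal on the original scale, and the toy data are all assumptions, and the data are treated as standardized (location 0, scale 1).

```python
import math
import random


def t_loglik(nu, data):
    """Log likelihood of standardized Student's t data with nu degrees of freedom."""
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return sum(const - (nu + 1) / 2 * math.log1p(x * x / nu) for x in data)


def log_posterior(nu, data, shape=2.0, rate=0.1):
    """Unnormalized log posterior under an assumed Gamma(shape, rate) prior."""
    if nu <= 0:
        return -math.inf
    return t_loglik(nu, data) + (shape - 1) * math.log(nu) - rate * nu


def rw_metropolis(data, n_iter=2000, start=5.0, step=1.0, seed=0):
    """Random walk Metropolis chain for nu; returns (chain, acceptance rate)."""
    rng = random.Random(seed)
    nu, lp = start, log_posterior(start, data)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = nu + rng.gauss(0.0, step)
        lp_new = log_posterior(proposal, data)
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            nu, lp = proposal, lp_new
            accepted += 1
        chain.append(nu)
    return chain, accepted / n_iter


# Toy data, made up for illustration only.
toy_data = [-2.1, -0.4, 0.0, 0.3, 0.9, 1.5, -1.2, 3.8, 0.2, -0.7]
chain, acceptance_rate = rw_metropolis(toy_data)
```

Proposals at or below zero have zero posterior density and are therefore always rejected, which keeps the chain on the positive half-line without a change of variables.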
Meta-package for statistical and machine learning with a unified interface for model fitting, prediction, performance assessment, and presentation of results. Approaches for model fitting and prediction of numerical, categorical, or censored time-to-event outcomes include traditional regression models, regularization methods, tree-based methods, support vector machines, neural networks, ensembles, data preprocessing, filtering, and model tuning and selection. Performance metrics are provided for model assessment and can be estimated with independent test sets, split sampling, cross-validation, or bootstrap resampling. Resample estimation can be executed in parallel for faster processing and nested in cases of model tuning and selection. Modeling results can be summarized with descriptive statistics; calibration curves; variable importance; partial dependence plots; confusion matrices; and ROC, lift, and other performance curves.
This package selects the most relevant features (genes) in high-dimensional binary classification problems. The discriminative features are identified by analyzing the overlap between the expression values across both classes. The package includes functions for measuring the proportional overlapping score for each gene while avoiding the effect of outliers. The overlap measure used is the one defined in the "Proportional Overlapping Score (POS)" technique for feature selection. A gene mask, which represents a gene's classification power, can also be produced for each gene (feature). The size of the selected gene set can be set by the user. The minimum set of genes that correctly classifies the maximum number of the given tissue samples (observations) can also be produced.