Increasingly powerful techniques for high-throughput sequencing open the possibility to comprehensively characterize microbial communities, including rare species. However, a still unresolved issue is the substantial error rate of the experimental process that generates these sequences. To overcome this limitation we propose an approach in which each sample is split and the same amplification and sequencing protocol is applied to both halves. Comparing the results of the two halves should make it possible to detect likely PCR and sequencing artifacts as well as true rare species. The AmpliconDuo package, where "amplicon duo" from here on refers to the two amplicon data sets of a split sample, is intended to help interpret the read frequency distribution observed across split samples and to filter out false-positive reads.
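As an illustration of the split-sample idea (not the package's own interface), here is a minimal base-R sketch that compares read counts of each sequence variant across the two halves with Fisher's exact test; all object names are hypothetical.

    # Toy read counts for a few sequence variants in the two halves of one split sample
    counts <- data.frame(
      variant = c("OTU1", "OTU2", "OTU3"),
      half_a  = c(120, 35, 0),
      half_b  = c(110, 0, 42)
    )
    total_a <- sum(counts$half_a)
    total_b <- sum(counts$half_b)

    # Fisher's exact test per variant: does its relative frequency differ between halves?
    counts$p_value <- apply(counts[, c("half_a", "half_b")], 1, function(x) {
      fisher.test(matrix(c(x[1], total_a - x[1],
                           x[2], total_b - x[2]), nrow = 2))$p.value
    })

    # Variants observed in only one half are candidate PCR/sequencing artifacts
    counts$discordant <- counts$half_a == 0 | counts$half_b == 0
    counts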
This package provides a Bayesian framework for estimating the degrees of freedom of the Student's t-distribution. Markov Chain Monte Carlo sampling routines, developed as in <doi:10.3390/axioms11090462>, sample from the posterior distribution of the degrees of freedom. A random-walk Metropolis algorithm is used for sampling when Jeffreys or Gamma priors are placed on the degrees of freedom, and the Metropolis-adjusted Langevin algorithm is additionally available under the Jeffreys prior specification. The log-normal prior on the degrees of freedom is proposed as a viable choice, with performance in simulations and a real-data application comparable to the other prior choices; an elliptical slice sampler is used to sample from the corresponding posterior.
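As a minimal, self-contained sketch of the random-walk Metropolis sampler mentioned above (written in base R under a Gamma prior; it is not the package's own interface, and all names are illustrative):

    set.seed(1)
    y <- rt(200, df = 5)                              # simulated data, true nu = 5

    # Unnormalized log-posterior for nu under a Gamma(shape, rate) prior
    log_post <- function(nu, y, shape = 2, rate = 0.1) {
      if (nu <= 0) return(-Inf)
      sum(dt(y, df = nu, log = TRUE)) + dgamma(nu, shape, rate, log = TRUE)
    }

    n_iter <- 5000
    nu <- numeric(n_iter)
    nu[1] <- 2
    for (i in 2:n_iter) {
      prop <- nu[i - 1] + rnorm(1, sd = 0.5)          # symmetric random-walk proposal
      log_alpha <- log_post(prop, y) - log_post(nu[i - 1], y)
      nu[i] <- if (log(runif(1)) < log_alpha) prop else nu[i - 1]
    }
    mean(nu[-(1:1000)])                               # posterior mean after burn-in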
Meta-package for statistical and machine learning with a unified interface for model fitting, prediction, performance assessment, and presentation of results. Approaches for model fitting and prediction of numerical, categorical, or censored time-to-event outcomes include traditional regression models, regularization methods, tree-based methods, support vector machines, neural networks, ensembles, data preprocessing, filtering, and model tuning and selection. Performance metrics are provided for model assessment and can be estimated with independent test sets, split sampling, cross-validation, or bootstrap resampling. Resample estimation can be executed in parallel for faster processing and nested in cases of model tuning and selection. Modeling results can be summarized with descriptive statistics; calibration curves; variable importance; partial dependence plots; confusion matrices; and ROC, lift, and other performance curves.
This package provides functions for selecting the most relevant features (genes) in high-dimensional binary classification problems. The discriminative features are identified by analyzing the overlap between the expression values across both classes. The package includes functions for measuring the proportional overlapping score for each gene while avoiding the effect of outliers. The overlap measure used is the one defined in the "Proportional Overlapping Score (POS)" feature selection technique. A gene mask representing a gene's classification power can also be produced for each gene (feature). The size of the selected gene set can be set by the user. The minimum set of genes that correctly classifies the maximum number of the given tissue samples (observations) can also be produced.
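A simplified base-R illustration of scoring genes by how much their expression ranges overlap across the two classes (a toy version of the idea, not the exact POS definition used by the package):

    set.seed(2)
    expr  <- rbind(geneA = c(rnorm(5, 0), rnorm(5, 3)),    # well separated classes
                   geneB = c(rnorm(5, 0), rnorm(5, 0.2)))  # heavily overlapping classes
    class <- rep(c("control", "case"), each = 5)

    overlap_score <- apply(expr, 1, function(x) {
      r1 <- range(x[class == "control"])
      r2 <- range(x[class == "case"])
      ov <- max(0, min(r1[2], r2[2]) - max(r1[1], r2[1]))  # length of the overlapping interval
      ov / (max(r1[2], r2[2]) - min(r1[1], r2[1]))         # proportion of the total span
    })
    sort(overlap_score)   # smaller scores suggest more discriminative genes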
This package provides tools for using null models to analyse ecological networks (e.g. food webs, flower-visitation networks, seed-dispersal networks) and detect resource preferences or non-random interactions among network nodes. Tools are provided to run null models, test for and plot preferences, plot and analyse bipartite networks, and export null model results in a form compatible with other network analysis packages. The underlying null model was developed by Agusti et al. (2003) Molecular Ecology <doi:10.1046/j.1365-294X.2003.02014.x> and the full application to ecological networks by Vaughan et al. (2018) econullnetr: an R package using null models to analyse the structure of ecological networks and identify resource selection. Methods in Ecology & Evolution, <doi:10.1111/2041-210X.12907>.
General purpose toolbox for simulating quantum versions of game theoretic models (Flitney and Abbott 2002) <arXiv:quant-ph/0208069>. Quantum (Nielsen and Chuang 2010, ISBN:978-1-107-00217-3) versions of the following models are covered: the Penny Flip Game (David A. Meyer 1998) <arXiv:quant-ph/9804010>, Prisoner's Dilemma (J. Orlin Grabbe 2005) <arXiv:quant-ph/0506219>, Two Person Duel (Flitney and Abbott 2004) <arXiv:quant-ph/0305058>, Battle of the Sexes (Nawaz and Toor 2004) <arXiv:quant-ph/0110096>, Hawk and Dove Game (Nawaz and Toor 2010) <arXiv:quant-ph/0108075>, Newcomb's Paradox (Piotrowski and Sladkowski 2002) <arXiv:quant-ph/0202074>, and the Monty Hall Problem (Flitney and Abbott 2002) <arXiv:quant-ph/0109035>.
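For intuition about the quantum penny flip model listed above, a standalone base-R sketch using explicit 2x2 matrices rather than the package's own functions: the quantum player applies a Hadamard gate before and after the classical player's move, so the coin ends up heads with probability 1 either way.

    H  <- matrix(c(1, 1, 1, -1), nrow = 2) / sqrt(2)  # Hadamard gate
    X  <- matrix(c(0, 1, 1, 0), nrow = 2)             # bit flip ("turn the coin over")
    I2 <- diag(2)                                     # do nothing
    heads <- c(1, 0)                                  # state |0>, i.e. heads up

    # Q plays H, P plays X or I2, Q plays H again
    for (P_move in list(flip = X, stay = I2)) {
      final <- H %*% P_move %*% H %*% heads
      print(round(abs(final)^2, 10))                  # probabilities of heads/tails
    }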
Visualization of next generation sequencing (NGS) data is essential for interpreting high-throughput genomics experiment results. GenomicPlot
facilitates plotting of NGS data in various formats (bam, bed, wig and bigwig); both coverage and enrichment over input can be computed and displayed with respect to genomic features (such as UTR, CDS, enhancer) and user-defined genomic loci or regions. Statistical tests on signal intensity within user-defined regions of interest can be performed and represented as boxplots or bar graphs. Parallel processing is used to speed up computation on multicore platforms. In addition to genomic plots, which are suitable for displaying coverage of genomic DNA (such as ChIPseq data), metagenomic (without introns) plots can also be made for RNAseq or CLIPseq data.
This package provides comprehensive cytokine profiling analysis through quality control using biologically meaningful cutoffs on raw cytokine measurements and by testing for distributional symmetry to recommend appropriate transformations. Offers exploratory data analysis with summary statistics, enhanced boxplots, and barplots, along with univariate and multivariate analytical capabilities for in-depth cytokine profiling such as Principal Component Analysis based on Andrzej Maćkiewicz and Waldemar Ratajczak (1993) <doi:10.1016/0098-3004(93)90090-R>, Sparse Partial Least Squares Discriminant Analysis based on Lê Cao K-A, Boitard S, and Besse P (2011) <doi:10.1186/1471-2105-12-253>, Random Forest based on Breiman, L. (2001) <doi:10.1023/A:1010933404324>, and Extreme Gradient Boosting based on Tianqi Chen and Carlos Guestrin (2016) <doi:10.1145/2939672.2939785>.
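For the exploratory multivariate step, a minimal base-R sketch of Principal Component Analysis on a toy cytokine matrix (the package's own quality-control and plotting layers are omitted; object names are illustrative):

    set.seed(6)
    cyto <- matrix(rlnorm(20 * 6), nrow = 20,
                   dimnames = list(NULL, paste0("cytokine", 1:6)))  # toy raw measurements
    cyto_log <- log2(cyto)                     # a symmetry-improving transformation
    pca <- prcomp(cyto_log, center = TRUE, scale. = TRUE)
    summary(pca)$importance[, 1:3]             # variance explained by the first components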
Also abbreviates to "CCSeq". Finds clusters of colocalized sequences in .bed annotation files up to a specified cut-off distance. Two sequences are colocalized if they are within the cut-off distance of each other, and clusters are sets of sequences where each sequence is colocalized to at least one other sequence in the cluster. For a set of .bed annotation tables provided in a list along with a cut-off distance, the program will output a file containing the locations of each cluster. Annotated .bed files are from the pwmscan application at <https://ccg.epfl.ch/pwmtools/pwmscan.php>. Personal machines might crash or take excessively long depending on the number of annotated sequences in each file and whether chromsearch()
or gensearch()
is used.
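The clustering rule described above can be pictured with a short base-R sketch that chains sequences into clusters whenever neighbouring start positions lie within the cut-off distance (a simplification of the package's interval-based logic, independent of chromsearch() and gensearch()):

    starts <- c(100, 180, 900, 5000, 5100, 5190)   # toy start positions on one chromosome
    cutoff <- 200

    s <- sort(starts)
    # open a new cluster whenever the gap to the previous sequence exceeds the cut-off
    cluster_id <- cumsum(c(TRUE, diff(s) > cutoff))

    # keep only genuine clusters, i.e. sets of at least two colocalized sequences
    Filter(function(x) length(x) >= 2, split(s, cluster_id))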
This package provides a regression framework for response variables which are continuous self-rating scales such as the Visual Analog Scale (VAS) used in pain assessment, or the Linear Analog Self-Assessment (LASA) scales in quality of life studies. These scales measure subjects' perception of an intangible quantity, and cannot be handled as ratio variables because of their inherent non-linearity. We treat them as ordinal variables, measured on a continuous scale. A function (the g function) connects the scale with an underlying continuous latent variable. The link function is the inverse of the CDF of the assumed underlying distribution of the latent variable. A variety of link functions are currently implemented. Such models are described in Manuguerra et al (2020) <doi:10.18637/jss.v096.i08>.
This package provides a shiny app-based GUI wrapper for ggplot with built-in statistical analysis. Import data from a file and use dropdown menus and checkboxes to specify the plotting variables, graph type, and look of your plots. Once created, plots can be saved independently or stored in a report that can be saved as a pdf. If new data are added to the file, the report can be refreshed to include the new data. Statistical tests can be selected and added to the graphs. Analysis of flow cytometry data is especially well integrated with plotGrouper: count data can be transformed to return the absolute number of cells in a sample (this feature requires inclusion of the number of beads per sample and information about any dilution performed).
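The bead-based conversion to absolute cell numbers follows the usual counting-bead arithmetic; a hedged base-R sketch with hypothetical numbers (the package's own transformation may differ in details such as how dilution is recorded):

    cell_events <- 12000   # gated cell events acquired
    bead_events <- 3000    # bead events acquired
    beads_added <- 50000   # counting beads spiked into the sample
    dilution    <- 2       # sample diluted 1:2 before acquisition

    absolute_cells <- cell_events / bead_events * beads_added * dilution
    absolute_cells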
Includes functions for the construction of matched samples that are balanced and representative by design. Among others, these functions can be used for matching in observational studies with treated and control units, with cases and controls, in related settings with instrumental variables, and in discontinuity designs. Also, they can be used for the design of randomized experiments, for example, for matching before randomization. By default, designmatch uses the highs optimization solver, but its performance is greatly enhanced by the Gurobi optimization solver and its associated R interface. For their installation, please follow the instructions at <https://www.gurobi.com/documentation/quickstart.html> and <https://www.gurobi.com/documentation/7.0/refman/r_api_overview.html>. We have also included directions in the gurobi_installation file in the inst folder.
High performance trainers for parameterizing and clustering weighted data. The Gaussian mixture (GM) module includes the conventional EM (expectation maximization) trainer, the component-wise EM trainer, and the minimum-message-length EM trainer of Figueiredo and Jain (2002) <doi:10.1109/34.990138>. These trainers accept additional constraints on mixture weights, on covariance eigen ratios, and on which mixture components are subject to update. The K-means (KM) module offers clustering with the options of (i) deterministic and stochastic K-means++ initializations, (ii) upper bounds on cluster weights (sizes), (iii) Minkowski distances, (iv) cosine dissimilarity, and (v) dense and sparse representation of data input. The package improves on typical implementations of GM and KM algorithms in various aspects. It is carefully crafted in multithreaded C++ for modeling large data for industry use.
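As a concrete illustration of one of the listed options, a base-R sketch of the stochastic K-means++ seeding rule (each new centre is drawn with probability proportional to the squared distance to the nearest centre chosen so far); this is a generic sketch, not the package's multithreaded C++ implementation:

    kmeanspp_init <- function(X, k) {
      n <- nrow(X)
      centers <- integer(k)
      centers[1] <- sample.int(n, 1)                    # first centre uniformly at random
      d2 <- rowSums((X - X[rep(centers[1], n), , drop = FALSE])^2)
      for (j in 2:k) {
        centers[j] <- sample.int(n, 1, prob = d2)       # prob. proportional to squared distance
        d2 <- pmin(d2, rowSums((X - X[rep(centers[j], n), , drop = FALSE])^2))
      }
      X[centers, , drop = FALSE]
    }

    set.seed(3)
    X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 5), ncol = 2))
    kmeans(X, centers = kmeanspp_init(X, 2))            # hand the seeds to stats::kmeans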
Interlinearized glossed texts (IGT) are used in descriptive linguistics for representing a morphological analysis of a text through a morpheme-by-morpheme gloss. InterlineaR
provides a set of functions that target several popular formats of IGT ('SIL Toolbox', 'EMELD XML') and turn an IGT into a set of data frames following a relational model (the tables represent the different linguistic units: texts, sentences, words, morphemes). The same pieces of software ('SIL FLEX', 'SIL Toolbox') typically produce dictionaries of the morphemes used in the glosses. InterlineaR provides a function for turning the LIFT XML dictionary format into a set of data frames following a relational model in order to represent the dictionary entries, the sense(s) attached to the entries, the example(s) attached to senses, etc.
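A sketch of the kind of relational layout this describes, using plain data frames; the column names are hypothetical and the package's actual tables may differ:

    texts     <- data.frame(text_id = 1, title = "Frog story")
    sentences <- data.frame(sentence_id = 1:2, text_id = 1,
                            translation = c("The frog left.", "The boy searched."))
    words     <- data.frame(word_id = 1:3, sentence_id = c(1, 1, 2),
                            form = c("frog", "left", "searched"))
    morphemes <- data.frame(morpheme_id = 1:4, word_id = c(1, 2, 3, 3),
                            form  = c("frog", "left", "search", "-ed"),
                            gloss = c("frog", "leave.PST", "search", "PST"))

    # The foreign keys make it easy to reassemble a morpheme-by-morpheme gloss per word
    merge(words, morphemes, by = "word_id")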
This package provides an efficient method to recover the missing block of an approximately low-rank matrix. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design [Cai T, Cai TT, Zhang A (2016) <doi:10.1080/01621459.2015.1021005>]. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. The main function in our package, smc.FUN()
, is for recovery of the missing block A22 of an approximately low-rank matrix A given the other blocks A11, A12, A21.
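To make the block notation concrete, a small base-R sketch of the structured-missingness setting (observed blocks A11, A12, A21 and a missing block A22); the call to smc.FUN() itself is omitted here because its argument names are not shown above:

    set.seed(4)
    r <- 2                                              # approximate rank
    U <- matrix(rnorm(60 * r), 60, r)
    V <- matrix(rnorm(40 * r), 40, r)
    A <- U %*% t(V) + 0.01 * matrix(rnorm(60 * 40), 60, 40)   # approximately low-rank matrix

    obs_rows <- 1:45; obs_cols <- 1:30                  # rows/columns that were measured
    A11 <- A[obs_rows,  obs_cols]                       # observed
    A12 <- A[obs_rows, -obs_cols]                       # observed
    A21 <- A[-obs_rows, obs_cols]                       # observed
    A22 <- A[-obs_rows, -obs_cols]                      # missing by design; the recovery target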
This package provides a framework to infer causality on binary data using techniques in frequent pattern mining and estimation statistics. Given a set S of individual binary vectors, where x(i) denotes the realized value of binary variable i in a vector x from S, the framework infers empirical causal relations among the binary variables i, j from S in the form of a causal graph G=(V,E), where V is a set of nodes representing binary variables and there is an edge from i to j in E if variable i causes j. The framework determines dependency among variables as well as analyzing confounding factors before deciding whether i causes j. The methodology behind this package is published in Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2023) <doi:10.1016/j.heliyon.2023.e15947>.
Fits single-species (univariate) and multi-species (multivariate) non-spatial and spatial abundance models in a Bayesian framework using Markov Chain Monte Carlo (MCMC). Spatial models are fit using Nearest Neighbor Gaussian Processes (NNGPs). Details on NNGP models are given in Datta, Banerjee, Finley, and Gelfand (2016) <doi:10.1080/01621459.2015.1044091> and Finley, Datta, and Banerjee (2022) <doi:10.18637/jss.v103.i05>. Fits single-species and multi-species spatial and non-spatial versions of generalized linear mixed models (Gaussian, Poisson, Negative Binomial), N-mixture models (Royle 2004 <doi:10.1111/j.0006-341X.2004.00142.x>) and hierarchical distance sampling models (Royle, Dawson, Bates (2004) <doi:10.1890/03-3127>). Multi-species spatial models are fit using a spatial factor modeling approach with NNGPs for computational efficiency.
In gene therapy, stem cells are modified using viral vectors to deliver the therapeutic transgene and replace functional properties, since the genetic modification is stable and inherited by all cell progeny. The retrieval and mapping of the sequences flanking the virus-host DNA junctions allows the identification of insertion sites (IS), essential for monitoring the evolution of genetically modified cells in vivo. A comprehensive toolkit for the analysis of IS is required to foster clonal tracking studies and to support the assessment of safety and long-term efficacy in vivo. This package is aimed at (1) supporting automation of the IS workflow, (2) performing basic and advanced analyses for IS tracking (clonal abundance, clonal expansions, statistics for insertional mutagenesis, etc.), and (3) providing basic biological insights into transduced stem cells in vivo.
Implementation of the bootstrapping approach for the estimation of clustering stability and its application to estimating the number of clusters, as introduced by Yu et al. (2016) <doi:10.1142/9789814749411_0007>. Also implements the non-parametric bootstrap approach to assessing the stability of module detection in a graph, the extension for the selection of a parameter set that defines a graph from data in a way that optimizes stability, and the corresponding visualization functions, as introduced by Tian et al. (2021) <doi:10.1002/sam.11495>. Implements an out-of-bag stability estimation function and a k-select Smin-based k-selection function as introduced by Liu et al. (2022) <doi:10.1002/sam.11593>, as well as an ensemble clustering method based on k-means, spectral, and hierarchical clustering.
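A self-contained base-R sketch of the bootstrap idea behind these stability estimates: re-cluster bootstrap resamples, map every original point to its nearest bootstrap centre, and measure agreement with the original clustering (here via a simple Rand index). This is a simplification of the cited methods, not the package's implementation.

    set.seed(5)
    X <- rbind(matrix(rnorm(60, 0), ncol = 2), matrix(rnorm(60, 4), ncol = 2))
    k <- 2
    base_fit <- kmeans(X, k, nstart = 10)

    rand_index <- function(a, b) {        # fraction of point pairs grouped consistently
      same_a <- outer(a, a, "=="); same_b <- outer(b, b, "==")
      ut <- upper.tri(same_a)
      mean(same_a[ut] == same_b[ut])
    }

    stability <- replicate(100, {
      idx  <- sample(nrow(X), replace = TRUE)
      fit  <- kmeans(X[idx, ], k, nstart = 10)
      labs <- apply(X, 1, function(p) which.min(colSums((t(fit$centers) - p)^2)))
      rand_index(labs, base_fit$cluster)
    })
    mean(stability)                       # values close to 1 indicate a stable clustering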
Pharmacokinetics is the study of drug absorption, distribution, metabolism, and excretion. A pharmacokinetic model describes how the drug concentration changes as the drug moves through the different compartments of the body. For pharmacokinetic modeling and analysis, it is essential to understand the basic pharmacokinetic parameters. All parameters are considered, but only some of them are used in a given model; therefore, we need to convert the estimated parameters into the other parameters after fitting a specific pharmacokinetic model. This package was developed to help with this conversion. For more detailed explanations of pharmacokinetic parameters, see Gabrielsson and Weiner (2007), ISBN-10: 9197651001; Benet and Zia-Amirhosseini (1995) <DOI:10.1177/019262339502300203>; Mould and Upton (2012) <DOI:10.1038/psp.2012.4>; Mould and Upton (2013) <DOI:10.1038/psp.2013.14>.
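A few of the standard one-compartment conversions this alludes to, written out in base R (a sketch only, not the package's functions):

    Vd   <- 40      # volume of distribution (L)
    ke   <- 0.17    # elimination rate constant (1/h)
    dose <- 500     # IV bolus dose (mg)

    CL     <- ke * Vd        # clearance (L/h)
    t_half <- log(2) / ke    # elimination half-life (h)
    AUC    <- dose / CL      # area under the curve for an IV bolus (mg*h/L)

    c(CL = CL, t_half = t_half, AUC = AUC)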
This package performs repeated nested cross-validation for Cox Proportional Hazards, Cox Lasso, Survival Random Forest, and their ensemble. It returns the internally validated concordance index, time-dependent area under the curve, Brier score, and calibration slope, along with statistical tests of whether the non-linear ensemble outperforms the baseline Cox model. In this way, it helps researchers quantify the gain from using a more complex survival model, or justify its redundancy. Equally, it shows the performance value of the non-linear and interaction terms, and may highlight the need for further feature transformation. Further details can be found in Shamsutdinova, Stamate, Roberts, & Stahl (2022) "Combining Cox Model and Tree-Based Algorithms to Boost Performance and Preserve Interpretability for Health Outcomes" <doi:10.1007/978-3-031-08337-2_15>, where the method is described as Ensemble 1.
hiAnnotator
contains a set of functions that allow users to annotate a GRanges object with a custom set of annotations. The basic philosophy of this package is to take two GRanges objects (query & subject) with a common set of seqnames (i.e. chromosomes) and return associated annotation per seqname for rows of the query matching rows of the subject (i.e. genes or CpG islands). The package comes with three types of annotation functions, which calculate whether a position from the query is within a feature, near a feature, or how many features fall within defined window sizes. Moreover, each function is equipped with a parallel backend that utilizes the foreach package. In addition, the package is equipped with wrapper functions which find the appropriate columns needed to make a GRanges object from a common data frame.
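The query/subject logic can be pictured directly with GenomicRanges primitives (findOverlaps(), nearest(), countOverlaps()); this is a generic sketch rather than hiAnnotator's own wrapper functions, and the coordinates are made up:

    library(GenomicRanges)

    query   <- GRanges("chr1", IRanges(start = c(100, 5000), width = 1))   # e.g. sites of interest
    subject <- GRanges("chr1", IRanges(start = c(50, 200, 4800),
                                       end   = c(150, 300, 5200)))         # e.g. genes

    findOverlaps(query, subject)          # which query positions fall within a feature
    nearest(query, subject)               # index of the nearest feature for each query position
    countOverlaps(query + 1000, subject)  # feature counts in a +/- 1 kb window around each position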
This package provides robust and efficient methods for estimating causal effects in a target population using a multi-source dataset, including those of Dahabreh et al. (2019) <doi:10.1111/biom.13716>, Robertson et al. (2021) <doi:10.48550/arXiv.2104.05905>, and Wang et al. (2024) <doi:10.48550/arXiv.2402.02684>. The multi-source data can be a collection of trials, observational studies, or a combination of both, which have the same data structure (outcome, treatment, and covariates). The target population can be based on an internal dataset or an external dataset where only covariate information is available. The causal estimands available are average treatment effects and subgroup treatment effects. See Wang et al. (2025) <doi:10.1017/rsm.2025.5> for a detailed guide on using the package.
Computation of dendrometric and structural parameters from forest inventory data. The objective is to provide a user-friendly R package for researchers, ecologists, foresters, statisticians, loggers, and other persons who deal with forest inventory data. Useful conversions of angles from degrees to radians, conversion from angle to slope (in percent) and their reciprocals, as well as principal angle determination, are also included. Position and dispersion parameters usually found in forest studies are implemented. The package contains the Fibonacci series, its extensions, and computation of the Golden Number. Useful references are Arcadius Y. J. Akossou, Soufianou Arzouma, Eloi Y. Attakpa, Noël H. Fonton and Kouami Kokou (2013) <doi:10.3390/d5010099> and W. Bonou, R. Glele Kakaï, A.E. Assogbadjo, H.N. Fonton, B. Sinsin (2009) <doi:10.1016/j.foreco.2009.05.032>.
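The angle utilities mentioned above amount to simple trigonometric identities; a base-R sketch with hypothetical helper names:

    deg2rad <- function(deg) deg * pi / 180
    rad2deg <- function(rad) rad * 180 / pi
    angle_to_slope <- function(deg) tan(deg2rad(deg)) * 100   # slope in percent
    slope_to_angle <- function(pct) rad2deg(atan(pct / 100))

    angle_to_slope(45)    # a 45 degree angle is a 100 percent slope
    slope_to_angle(100)   # and back again: 45 degrees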