This package provides tools for the calculation of common biodiversity indices from count data. Additionally, it incorporates bootstrapping techniques to generate multiple samples, facilitating the estimation of confidence intervals around these indices. Furthermore, the package allows for the exploration of how variation in these indices changes with differing numbers of sites, making it a useful tool with which to begin an ecological analysis. Methods are based on the following references: Chao et al. (2014) <doi:10.1890/13-0133.1>, Chao and Colwell (2022) <doi:10.1002/9781119902911.ch2>, Hsieh, Ma,` and Chao (2016) <doi:10.1111/2041-210X.12613>.
This package provides a collection of functions for outlier detection in functional data analysis. Methods implemented include directional outlyingness by Dai and Genton (2019) <doi:10.1016/j.csda.2018.03.017>, MS-plot by Dai and Genton (2018) <doi:10.1080/10618600.2018.1473781>, total variation depth and modified shape similarity index by Huang and Sun (2019) <doi:10.1080/00401706.2019.1574241>, and sequential transformations by Dai et al. (2020) <doi:10.1016/j.csda.2020.106960 among others. Additional outlier detection tools and depths for functional data like functional boxplot, (modified) band depth etc., are also available.
This package provides a suite of tools for specifying and examining experimental designs related to choice response time models (e.g., the Diffusion Decision Model). This package allows users to define how experimental factors influence one or more model parameters using R-style formula syntax, while also checking the logical consistency of these associations. Additionally, it integrates with the ggdmc package, which employs Differential Evolution Markov Chain Monte Carlo (DE-MCMC) sampling to optimise model parameters. For further details on the model-building approach, see Heathcote, Lin, Reynolds, Strickland, Gretton, and Matzke (2019) <doi:10.3758/s13428-018-1067-y>.
Two distinct but related statistical approaches to the problem of identifying the combinations of medication error characteristics that are more likely to result in harm are implemented in this package: 1) a Bayesian hierarchical model with optimal Bayesian ranking on the log odds of harm, and 2) an empirical Bayes model that estimates the ratio of the observed count of harm to the count that would be expected if error characteristics and harm were independent. In addition, for the Bayesian hierarchical model, the package provides functions to assess the sensitivity of results to different specifications of the random effects distributions.
General implementation of core function from phase-type theory. PhaseTypeR can be used to model continuous and discrete phase-type distributions, both univariate and multivariate. The package includes functions for outputting the mean and (co)variance of phase-type distributions; their density, probability and quantile functions; functions for random draws; functions for reward-transformation; and functions for plotting the distributions as networks. For more information on these functions please refer to Bladt and Nielsen (2017, ISBN: 978-1-4939-8377-3) and Campillo Navarro (2019) <https://orbit.dtu.dk/en/publications/order-statistics-and-multivariate-discrete-phase-type-distributio>.
This package provides a spatial population can be generated based on spatially varying regression model under the assumption that observations are collected from a uniform two-dimensional grid consist of (m * m) lattice points with unit distance between any two neighbouring points. For method details see Chao, Liu., Chuanhua, Wei. and Yunan, Su. (2018).<DOI:10.1080/10485252.2018.1499907>. This spatially generated data can be used to test different issues related to the statistical analysis of spatial data. This generated spatial data can be utilized in geographically weighted regression analysis for studying the spatially varying relationships among the variables.
Social risks are increasingly becoming a critical component of health care research. One of the most common ways to identify social needs is by using ICD-10-CM "Z-codes." This package identifies social risks using varying taxonomies of ICD-10-CM Z-codes from administrative health care data. The conceptual taxonomies come from: Centers for Medicare and Medicaid Services (2021) <https://www.cms.gov/files/document/zcodes-infographic.pdf>, Reidhead (2018) <https://web.mhanet.com/>, A Arons, S DeSilvey, C Fichtenberg, L Gottlieb (2018) <https://sirenetwork.ucsf.edu/tools-resources/resources/compendium-medical-terminology-codes-social-risk-factors>.
The CNVMetrics package calculates similarity metrics to facilitate copy number variant comparison among samples and/or methods. Similarity metrics can be employed to compare CNV profiles of genetically unrelated samples as well as those with a common genetic background. Some metrics are based on the shared amplified/deleted regions while other metrics rely on the level of amplification/deletion. The data type used as input is a plain text file containing the genomic position of the copy number variations, as well as the status and/or the log2 ratio values. Finally, a visualization tool is provided to explore resulting metrics.
Peptide Set Test (PepSetTest) is a peptide-centric strategy to infer differentially expressed proteins in LC-MS/MS proteomics data. This test detects coordinated changes in the expression of peptides originating from the same protein and compares these changes against the rest of the peptidome. Compared to traditional aggregation-based approaches, the peptide set test demonstrates improved statistical power, yet controlling the Type I error rate correctly in most cases. This test can be valuable for discovering novel biomarkers and prioritizing drug targets, especially when the direct application of statistical analysis to protein data fails to provide substantial insights.
The MsFeature package defines functionality for Mass Spectrometry features. This includes functions to group (LC-MS) features based on some of their properties, such as retention time (coeluting features), or correlation of signals across samples. This package hence can be used to group features, and its results can be used as an input for the QFeatures package which allows aggregating abundance levels of features within each group. This package defines concepts and functions for base and common data types, implementations for more specific data types are expected to be implemented in the respective packages (such as e.g. xcms).
Prognostic Enrichment is a strategy of enriching a clinical trial for testing an intervention intended to prevent or delay an unwanted clinical event. A prognostically enriched trial enrolls only patients who are more likely to experience the unwanted clinical event than the broader patient population (R. Temple (2010) <doi:10.1038/clpt.2010.233>). By testing the intervention in an enriched study population, the trial may be adequately powered with a smaller sample size, which can have both practical and ethical advantages. This package provides tools to evaluate biomarkers for prognostic enrichment of clinical trials with survival/time-to-event outcomes.
The function takes a DNA sequence, a start point, an end point in the sequence, dot size and dot color and draws a fractal image of the sequence. The fractal starts in the center of the canvas. The image is drawn by moving base by base along the sequence and dropping a midpoint between the actual point and the corner designated by the actual base. For more details see Jeffrey (1990) <doi:10.1093/nar/18.8.2163>, Hill, Schisler, and Singh (1992) <doi:10.1007/BF00178602>, and Löchel and Heider (2021) <doi:10.1016/j.csbj.2021.11.008>.
We provide three distance metrics for measuring the separation between two clusters in high-dimensional spaces. The first metric is the centroid distance, which calculates the Euclidean distance between the centers of the two groups. The second is a ridge Mahalanobis distance, which incorporates a ridge correction constant, alpha, to ensure that the covariance matrix is invertible. The third metric is the maximal data piling distance, which computes the orthogonal distance between the affine spaces spanned by each class. These three distances are asymptotically interconnected and are applicable in tasks such as discrimination, clustering, and outlier detection in high-dimensional settings.
Power and sample size calculations for a variety of study designs and outcomes. Methods include t tests, ANOVA (including tests for interactions, simple effects and contrasts), proportions, categorical data (chi-square tests and proportional odds), linear, logistic and Poisson regression, alternative and coprimary endpoints, power for confidence intervals, correlation coefficient tests, cluster randomized trials, individually randomized group treatment trials, multisite trials, treatment-by-covariate interaction effects and nonparametric tests of location. Utilities are provided for computing various effect sizes. Companion package to the book "Power and Sample Size in R", Crespi (2025, ISBN:9781138591622). Further resources available at <https://powerandsamplesize.org/>.
This package provides implementation of the "Topic SCORE" algorithm that is proposed by Tracy Ke and Minzhe Wang. The singular value decomposition step is optimized through the usage of svds() function in RSpectra package, on a dgRMatrix sparse matrix. Also provides a column-wise error measure in the word-topic matrix A, and an algorithm for recovering the topic-document matrix W given A and D based on quadratic programming. The details about the techniques are explained in the paper "A new SVD approach to optimal topic estimation" by Tracy Ke and Minzhe Wang (2017) <arXiv:1704.07016>.
The basecallQC package provides tools to work with Illumina bcl2Fastq (versions >= 2.1.7) software.Prior to basecalling and demultiplexing using the bcl2Fastq software, basecallQC functions allow the user to update Illumina sample sheets from versions <= 1.8.9 to >= 2.1.7 standards, clean sample sheets of common problems such as invalid sample names and IDs, create read and index basemasks and the bcl2Fastq command. Following the generation of basecalled and demultiplexed data, the basecallQC packages allows the user to generate HTML tables, plots and a self contained report of summary metrics from Illumina XML output files.
This package provides functionality for users who are learning R or the techniques of data analysis. Written as a collection of wrapper functions, the DTwrapper package facilitates many core operations of data processing. This is achieved with relatively few requirements about the order of the processing steps or knowledge of specialized syntax. DTwrappers creates coding results along with translations to data.table's code. This enables users to benefit from the speed and efficiency of data.table's calculations. Furthermore, the package also provides the translated code for educational purposes so that users can review working examples of coding syntax and calculations.
An implementation of the methodology described in Petersen and Mueller (2016) <doi:10.1214/15-AOS1363> for the functional data analysis of samples of density functions. Densities are first transformed to their corresponding log quantile densities, followed by ordinary Functional Principal Components Analysis (FPCA). Transformation modes of variation yield improved interpretation of the variability in the data as compared to FPCA on the densities themselves. The standard fraction of variance explained (FVE) criterion commonly used for functional data is adapted to the transformation setting, also allowing for an alternative quantification of variability for density data through the Wasserstein metric of optimal transport.
Likelihood evaluations for stationary Gaussian time series are typically obtained via the Durbin-Levinson algorithm, which scales as O(n^2) in the number of time series observations. This package provides a "superfast" O(n log^2 n) algorithm written in C++, crossing over with Durbin-Levinson around n = 300. Efficient implementations of the score and Hessian functions are also provided, leading to superfast versions of inference algorithms such as Newton-Raphson and Hamiltonian Monte Carlo. The C++ code provides a Toeplitz matrix class packaged as a header-only library, to simplify low-level usage in other packages and outside of R.
Generates/modifies RNA-seq data for use in simulations. We provide a suite of functions that will add a known amount of signal to a real RNA-seq dataset. The advantage of using this approach over simulating under a theoretical distribution is that common/annoying aspects of the data are more preserved, giving a more realistic evaluation of your method. The main functions are select_counts(), thin_diff(), thin_lib(), thin_gene(), thin_2group(), thin_all(), and effective_cor(). See Gerard (2020) <doi:10.1186/s12859-020-3450-9> for details on the implemented methods.
An implementation of the representation-dependent gene level operations of grammar-based genetic programming with genes which are derivation trees of a context-free grammar: Initialization of a gene with a complete random derivation tree, decoding of a derivation tree. Crossover is implemented by exchanging subtrees. Depth-bounds for the minimal and the maximal depth of the roots of the subtrees exchanged by crossover can be set. Mutation is implemented by replacing a subtree by a random subtree. The depth of the random subtree and the insertion node are configurable. For details, see Geyer-Schulz (1997, ISBN:978-3-7908-0830-X).
This package is built to perform GWAS analysis using Bayesian techniques. Currently, GWAS.BAYES has functionality for the implementation of BICOSS (Williams, J., Ferreira, M. A., and Ji, T. (2022). BICOSS: Bayesian iterative conditional stochastic search for GWAS. BMC Bioinformatics), BGWAS (Williams, J., Xu, S., Ferreira, M. A.. (2023) "BGWAS: Bayesian variable selection in linear mixed models with nonlocal priors for genome-wide association studies." BMC Bioinformatics), and GINA. All methods currently are for the analysis of Gaussian phenotypes The research related to this package was supported in part by National Science Foundation awards DMS 1853549, DMS 1853556, and DMS 2054173.
KnowYourCG (KYCG) is a supervised learning framework designed for the functional analysis of DNA methylation data. Unlike existing tools that focus on genes or genomic intervals, KnowYourCG directly targets CpG dinucleotides, featuring automated supervised screenings of diverse biological and technical influences, including sequence motifs, transcription factor binding, histone modifications, replication timing, cell-type-specific methylation, and trait-epigenome associations. KnowYourCG addresses the challenges of data sparsity in various methylation datasets, including low-pass Nanopore sequencing, single-cell DNA methylomes, 5-hydroxymethylation profiles, spatial DNA methylation maps, and array-based datasets for epigenome-wide association studies and epigenetic clocks.
This is an R package for interfacing with the BIOM format. This package includes basic tools for reading biom-format files, accessing and subsetting data tables from a biom object (which is more complex than a single table), as well as limited support for writing a biom-object back to a biom-format file. The design of this API is intended to match the Python API and other tools included with the biom-format project, but with a decidedly "R flavor" that should be familiar to R users. This includes S4 classes and methods, as well as extensions of common core functions/methods.