Supports systematic scrutiny, modification, and integration of data. The function status() counts rows that have missing values in grouping columns (returned by na()), have non-unique combinations of grouping columns (returned by dup()), or are not locally sorted (returned by unsorted()). Functions enumerate() and itemize() give sorted unique combinations of columns, with or without occurrence counts, respectively. Function ignore() drops columns in x that are present in y, and informative() drops columns in x that are entirely NA; constant() returns values that are constant, given a key. Data with defined unique combinations of grouping values behave more predictably during merge operations.
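This description matches the wrangle package; the sketch below is an assumption rather than verbatim package documentation, and it supposes the dplyr convention of declaring key columns with group_by():

```r
library(dplyr)
library(wrangle)  # assumed package name, inferred from the function list

x <- data.frame(id = c(1, 1, 2, 2, NA), time = c(1, 1, 2, 1, 3))
x <- group_by(x, id, time)  # declare the grouping (key) columns

status(x)     # row counts for the three checks below
na(x)         # rows with NA in a grouping column
dup(x)        # rows with non-unique grouping combinations
unsorted(x)   # rows that are not locally sorted
enumerate(x)  # sorted unique key combinations, with counts
itemize(x)    # sorted unique key combinations, without counts
```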
We provide tools to estimate two prediction accuracy metrics for risk scores: the average positive predictive value (AP) and the well-known AUC (the area under the receiver operating characteristic curve). The outcome of interest is either binary or a censored event time. Note that for censored event times, the estimated AP and AUC are time-dependent and computed for pre-specified time interval(s). A function that compares the APs of two risk scores/markers is also included. Optional outputs include positive predictive values and true positive fractions at specified marker cut-off values, and a plot of the time-dependent AP versus time (available for event time data).
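Since the description does not name the estimation functions, here is a base-R illustration of the two metrics for the binary-outcome case, taking AP to be the area under the PPV (precision) versus TPF (recall) curve; the package's exact definitions and its censored-data handling may differ:

```r
set.seed(1)
score <- rnorm(200)                     # risk score / marker
y     <- rbinom(200, 1, plogis(score))  # binary outcome

# AUC via the rank (Wilcoxon) statistic
r   <- rank(score)
n1  <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# AP as mean precision at each case when units are ranked by decreasing score
ord  <- order(score, decreasing = TRUE)
prec <- cumsum(y[ord]) / seq_along(ord)
ap   <- mean(prec[y[ord] == 1])
c(AUC = auc, AP = ap)
```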
This package provides a specialized tool for assessing contextual bandit algorithms, particularly those aimed at handling overdispersed and zero-inflated count data. It offers a simulated testing environment that includes various models, namely Poisson, Overdispersed Poisson, Zero-inflated Poisson, and Zero-inflated Overdispersed Poisson. The package can execute five specific algorithms: Linear Thompson sampling with log transformation on the outcome, Thompson sampling Poisson, Thompson sampling Negative Binomial, Thompson sampling Zero-inflated Poisson, and Thompson sampling Zero-inflated Negative Binomial. Additionally, it can generate regret plots to evaluate the performance of contextual bandit algorithms. This package is based on the algorithms by Liu et al. (2023) <arXiv:2311.14359>.
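To illustrate the general mechanism, here is a toy Thompson sampler for a two-armed, context-free Poisson bandit with conjugate Gamma posteriors, plus the kind of cumulative-regret curve the package plots; the package's algorithms are contextual and additionally handle overdispersion and zero inflation:

```r
set.seed(42)
true_rates <- c(1.0, 2.5)          # unknown Poisson means of the two arms
alpha <- c(1, 1); beta <- c(1, 1)  # Gamma(alpha, beta) posterior parameters
regret <- numeric(500)

for (t in 1:500) {
  draws <- rgamma(2, shape = alpha, rate = beta)  # sample one rate per arm
  a <- which.max(draws)                           # play the sampled best arm
  r <- rpois(1, true_rates[a])                    # observe a count reward
  alpha[a] <- alpha[a] + r                        # conjugate Gamma update
  beta[a]  <- beta[a] + 1
  regret[t] <- max(true_rates) - true_rates[a]
}
plot(cumsum(regret), type = "l", xlab = "round", ylab = "cumulative regret")
```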
Doubly censored data, as described in Chang and Yang (1987) <doi:10.1214/aos/1176350608>, are commonly seen in many fields. We use an EM algorithm to compute the non-parametric MLE (NPMLE) of the cumulative probability function/survival function and the two censoring distributions. One can also specify a constraint F(T)=C, in which case the constrained NPMLE and the -2 log empirical likelihood ratio for this constraint are returned. This can be used to test hypotheses about the constraint and, by inverting the test, to find confidence intervals for a probability or quantile via the empirical likelihood ratio theorem. Influence functions of the estimated F may also be calculated, but the current implementation may be slow.
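The test-inversion step can be sketched as follows, where neg2logELR() is a hypothetical wrapper returning the -2 log empirical likelihood ratio for the constraint F(T) = C at a fixed time T (not a function exported by this package):

```r
# Invert the EL ratio test: the 95% CI is {C : -2 log ELR(C) <= qchisq(0.95, 1)}.
# C_hat is the unconstrained NPMLE of F(T); the brackets assume the ratio
# exceeds the critical value near 0 and 1.
ci_from_elr <- function(neg2logELR, C_hat, level = 0.95) {
  crit <- qchisq(level, df = 1)
  f <- function(C) neg2logELR(C) - crit
  c(lower = uniroot(f, c(1e-6, C_hat))$root,
    upper = uniroot(f, c(C_hat, 1 - 1e-6))$root)
}
```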
Computations of Fisher's z-tests concerning different kinds of correlation differences. The diffpwr family entails approaches to estimating statistical power via Monte Carlo simulations. Note that the Pearson correlation coefficient is sensitive not only to linear association but also to a host of statistical issues, such as univariate and bivariate outliers, range restrictions, and heteroscedasticity (e.g., Duncan & Layard, 1973 <doi:10.1093/BIOMET/60.3.551>; Wilcox, 2013 <doi:10.1016/C2010-0-67044-1>). Thus, every power analysis requires specific statistical prerequisites to be fulfilled and can be invalid if they do not hold. To this end, the bootcor family provides bootstrap confidence intervals for the incorporated correlation difference tests.
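For reference, the classical Fisher z-test for two correlations from independent samples reduces to a few lines of base R; this is a sketch of the underlying statistic, not this package's interface:

```r
fisher_z_test <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1); z2 <- atanh(r2)         # Fisher z-transform
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # SE of the difference z1 - z2
  z  <- (z1 - z2) / se
  c(z = z, p = 2 * pnorm(-abs(z)))         # two-sided p-value
}
fisher_z_test(r1 = 0.50, n1 = 100, r2 = 0.30, n2 = 120)
```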
Simulation and estimation of Exponential Random Graph Models (ERGMs) for small networks using exact statistics, as shown in Vega Yon et al. (2020) <doi:10.1016/j.socnet.2020.07.005>. Unlike the ergm package, ergmito circumvents the Markov-Chain Maximum Likelihood Estimator (MC-MLE) and instead uses the exact Maximum Likelihood Estimator (MLE) to fit ERGMs for small networks. Because exhaustive enumeration is computationally feasible for small networks, this R package takes advantage of it to calculate likelihood functions, and other relevant functions, directly, meaning that in many cases both estimation and simulation of ERGMs for small networks can be faster and more accurate than simulation-based algorithms.
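A toy version of the exact-likelihood idea (not ergmito's API): for an undirected network on 3 nodes there are only 2^3 = 8 possible graphs, so for an edge-count ERGM the normalizing constant can be enumerated exactly and the MLE found by direct optimization:

```r
# Exact log-likelihood of an edge-count ERGM on 3 nodes (3 dyads):
# kappa(theta) = sum over edge counts k of choose(3, k) * exp(theta * k)
loglik_edges <- function(theta, observed_edges, n_dyads = 3) {
  k <- 0:n_dyads
  kappa <- sum(choose(n_dyads, k) * exp(theta * k))
  theta * observed_edges - log(kappa)
}
# MLE for a network with 2 of 3 possible edges: log(2) = logit(2/3)
optimize(loglik_edges, c(-5, 5), observed_edges = 2, maximum = TRUE)$maximum
```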
This package provides a set of simplified functions for creating funnel plots for proportion data. It supports user-defined benchmarks, confidence limits, and estimation methods (i.e., exact or approximate) based on Spiegelhalter (2005) <doi:10.1002/sim.1970>. Additional routines for returning scored unit-level data according to a set of specifications are also implemented for convenience. Specifically, both a categorical and a continuous score variable are added to the sample data frame, identifying which observations are deemed extreme or in control. Such variables are typically useful as stratifications or covariates in further exploratory analyses. Lastly, the plotting routine returns a base funnel plot ('ggplot2'), which can be further tailored.
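The approximate limits have a simple closed form; here is a base-R sketch of the normal-approximation funnel around a benchmark p0 (the package's own routines add exact limits and 'ggplot2' output):

```r
p0 <- 0.10                    # benchmark proportion
n  <- seq(20, 1000, by = 10)  # unit denominators
z  <- qnorm(0.975)            # ~95% control limits
upper <- p0 + z * sqrt(p0 * (1 - p0) / n)
lower <- pmax(0, p0 - z * sqrt(p0 * (1 - p0) / n))
plot(n, upper, type = "l", ylim = c(0, 0.25),
     xlab = "denominator (n)", ylab = "proportion")
lines(n, lower)
abline(h = p0, lty = 2)  # benchmark line
```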
Generalized competing event model based on the Cox PH model and the Fine-Gray model. The package is designed to develop optimized risk-stratification methods for competing risks data, as described in: 1. Carmona R, Gulaya S, Murphy JD, Rose BS, Wu J, Noticewala S, McHale MT, Yashar CM, Vaida F, and Mell LK (2014) <doi:10.1016/j.ijrobp.2014.03.047>. 2. Carmona R, Zakeri K, Green G, Hwang L, Gulaya S, Xu B, Verma R, Williamson CW, Triplett DP, Rose BS, Shen H, Vaida F, Murphy JD, and Mell LK (2016) <doi:10.1200/JCO.2015.65.0739>. 3. Lunn, Mary, and Don McNeil (1995) <doi:10.2307/2532940>.
This package provides a tool for Hierarchical Climate Regionalization applicable to any correlation-based clustering. It adds several features and a new clustering method (called regional linkage) to hierarchical clustering in R (the 'hclust' function in the 'stats' library): data regridding, coarsening of spatial resolution, geographic masking, contiguity-constrained clustering, data filtering by mean and/or variance thresholds, data preprocessing (detrending, standardization, and PCA), a faster correlation function with preliminary big-data support, different clustering methods, hybrid hierarchical clustering, multivariate clustering (MVC), cluster validation, visualization of regionalization results, and export of the region map and mean time series into a NetCDF-4 file. The technical details are described in Badr et al. (2015) <doi:10.1007/s12145-015-0221-7>.
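The core operation in miniature, using only base R (the package's regional-linkage method, spatial preprocessing, and contiguity constraints are not reproduced here): hierarchical clustering of grid cells on a correlation-based dissimilarity between their time series:

```r
set.seed(7)
x <- matrix(rnorm(50 * 120), nrow = 50)  # 50 grid cells x 120 time steps
d <- as.dist(1 - cor(t(x)))              # dissimilarity = 1 - correlation
hc <- hclust(d, method = "average")
regions <- cutree(hc, k = 4)             # assign cells to 4 regions
table(regions)
```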
This package provides access to the Idea Data Center (IDC) application for conducting nonresponse bias analysis (NRBA). The IDC NRBA app is an interactive, browser-based Shiny application that can be used to analyze survey data with respect to response rates, representativeness, and nonresponse bias. This app provides a user-friendly interface to statistical methods implemented by the nrba package. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the statistical methods implemented in the application.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the premier technology for profiling genome-wide localization of chromatin-binding proteins, including transcription factors and histones with various modifications. This package provides a robust method for normalizing ChIP-seq signals across individual samples or groups of samples. It also designs a self-contained system of statistical models for calling differential ChIP-seq signals between two or more biological conditions as well as for calling hypervariable ChIP-seq signals across samples. Refer to Tu et al. (2021) <doi:10.1101/gr.262675.120> and Chen et al. (2022) <doi:10.1186/s13059-022-02627-9> for associated statistical details.
Read depth data from genotyping-by-sequencing (GBS) or restriction site-associated DNA sequencing (RAD-seq) are imported and used to make Bayesian probability estimates of genotypes in polyploids or diploids. The genotype probabilities, posterior mean genotypes, or most probable genotypes can then be exported for downstream analysis. polyRAD is described by Clark et al. (2019) <doi:10.1534/g3.118.200913>, and the Hind/He statistic for marker filtering is described by Clark et al. (2022) <doi:10.1186/s12859-022-04635-9>. A variant calling pipeline for highly duplicated genomes is also included and is described by Clark et al. (2020, Version 1) <doi:10.1101/2020.01.11.902890>.
Evaluates moments of ratios (and products) of quadratic forms in normal variables, specifically using recursive algorithms developed by Bao and Kan (2013) <doi:10.1016/j.jmva.2013.03.002> and Hillier et al. (2014) <doi:10.1017/S0266466613000364>. Also provides distribution, quantile, and probability density functions of simple ratios of quadratic forms in normal variables with several algorithms. Originally developed as a supplement to Watanabe (2023) <doi:10.1007/s00285-023-01930-8> for evaluating average evolvability measures in evolutionary quantitative genetics, but can be used for a broader class of statistics. Generating functions for these moments are also closely related to the top-order zonal and invariant polynomials of matrix arguments.
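As a point of comparison for the exact recursive algorithms, the simplest such moment can be approximated by plain Monte Carlo (a sketch, not the package's interface):

```r
# Monte Carlo estimate of E[(x'Ax) / (x'Bx)] with x ~ N(0, I_p)
set.seed(1)
p <- 4
A <- crossprod(matrix(rnorm(p * p), p))            # symmetric PSD
B <- crossprod(matrix(rnorm(p * p), p)) + diag(p)  # positive definite
ratio <- replicate(1e5, {
  x <- rnorm(p)
  drop(t(x) %*% A %*% x) / drop(t(x) %*% B %*% x)
})
mean(ratio)
```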
Fits time trend models for routine disease surveillance tasks and returns probability distributions for a variety of quantities of interest, including age-standardized rates, period and cumulative percent change, and measures of health inequality. The models are appropriate for count data such as disease incidence and mortality data, employing a Poisson or binomial likelihood and the first-difference (random-walk) prior for unknown risk. Optionally, a covariance matrix can be added to model multiple, correlated time series. Inference is completed using Markov chain Monte Carlo via the Stan modeling language. References: Donegan, Hughes, and Lee (2022) <doi:10.2196/34589>; Stan Development Team (2021) <https://mc-stan.org>; Theil (1972, ISBN:0-444-10378-3).
This package provides functions that solve initial value problems of a system of first-order ordinary differential equations (ODE), of partial differential equations (PDE), of differential algebraic equations (DAE), and of delay differential equations. The functions provide an interface to the FORTRAN functions lsoda, lsodar, lsode, and lsodes of the ODEPACK collection, to the FORTRAN functions dvode and daspk, and to a C implementation of solvers of the Runge-Kutta family with fixed or variable time steps. The package contains routines designed for solving ODEs resulting from 1-D, 2-D and 3-D partial differential equations that have been converted to ODEs by numerical differencing.
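This description matches the deSolve package; a minimal example with its ode() interface, solving the logistic growth equation with the lsoda-based solver:

```r
library(deSolve)

# dy/dt = r * y * (1 - y / K)
logistic <- function(t, y, parms) {
  with(as.list(parms), list(r * y * (1 - y / K)))
}
out <- ode(y = c(y = 0.1), times = seq(0, 10, by = 0.1),
           func = logistic, parms = c(r = 1, K = 1), method = "lsoda")
head(out)
```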
This package implements a recent method proposed by Yi and Chen (2023) <doi:10.1177/09622802221146308> for estimating average treatment effects from noisy data containing both measurement error and spurious variables. The package AteMeVs contains a set of functions that provide a step-by-step estimation procedure, including correction of measurement error effects, variable selection for building the model used to estimate the propensity scores, and estimation of the average treatment effects. The functions offer multiple options for users, including different ways to correct for measurement error effects, distinct choices of penalty functions for variable selection, and various regression models to characterize propensity scores.
Assigns standardized diagnoses using the Banff Classification (Category 1 to 6 diagnoses, including Acute and Chronic active T-cell mediated rejection as well as Active, Chronic active, and Chronic antibody mediated rejection). The main function takes a minimal dataset containing biopsy information in a specific format (described by a data dictionary), verifies its content and format (based on the data dictionary), assigns diagnoses, and creates a summary report. The package is based on the reference guide to the Banff classification of renal allograft pathology: Roufosse C, Simmonds N, Clahsen-van Groningen M, et al. (2018) <doi:10.1097/TP.0000000000002366>. The full description of the Banff classification is available at <https://banfffoundation.org/>.
Multiple comparison techniques are typically applied following an F test from an ANOVA to decide which means are significantly different from one another. As an alternative to traditional methods, cluster analysis can be performed to group the means of different treatments into non-overlapping clusters. Treatments in different groups are considered statistically different. Several approaches have been proposed, with varying clustering methods and cut-off criteria. This package implements cluster-based multiple comparisons tests and also provides a visual representation in the form of a dendrogram. Di Rienzo, J. A., Guzman, A. W., & Casanoves, F. (2002) <https://www.jstor.org/stable/1400690>. Bautista, M. G., Smith, D. W., & Steiner, R. L. (1997) <doi:10.2307/1400402>.
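The general idea can be sketched with base R's clustering tools (the package implements the specific cut-off criteria from the cited papers, which this toy replaces with an arbitrary height):

```r
yield <- c(A = 10.2, B = 10.5, C = 12.9, D = 13.1, E = 16.0)  # treatment means
hc <- hclust(dist(yield), method = "average")
plot(hc)             # dendrogram of treatment means
cutree(hc, h = 1.5)  # non-overlapping groups; h = 1.5 is illustrative only
```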
This package implements a flexible, versatile, and computationally tractable model for density regression based on a single-weights dependent Dirichlet process mixture of normal distributions model for univariate continuous responses. The model assumes an additive structure for the mean of each mixture component and the effects of continuous covariates are captured through smooth nonlinear functions. The key components of our modelling approach are penalised B-splines and their bivariate tensor product extension. The proposed method can also easily deal with parametric effects of categorical covariates, linear effects of continuous covariates, interactions between categorical and/or continuous covariates, varying coefficient terms, and random effects. Please see Rodriguez-Alvarez, Inacio et al. (2025) for more details.
Nonparametric unfolding item response theory (IRT) model for dichotomous data (see W.H. Van Schuur (1984), Structure in Political Beliefs: A New Model for Stochastic Unfolding with Application to European Party Activists, and W.J. Post (1992), Nonparametric Unfolding Models: A Latent Structure Approach). The package implements MUDFOLD (Multiple UniDimensional unFOLDing), an iterative item selection algorithm that constructs unfolding scales from dichotomous preferential-choice data without explicitly assuming a parametric form of the item response functions. Scale diagnostics from Post (1992) and estimates of the person locations proposed by Johnson (2006) and Van Schuur (1984) are also available. This model can be seen as the unfolding variant of Mokken's (1971) scaling method.
This package provides SHAP explanations of machine learning models. In applied machine learning, there is a strong belief that we need to strike a balance between interpretability and accuracy. However, in the field of interpretable machine learning, there are more and more new ideas for explaining black-box models. One of the best-known methods for local explanations is SHapley Additive exPlanations (SHAP), introduced by Lundberg, S., et al. (2017) <arXiv:1705.07874>. The SHAP method is used to calculate the influence of variables on a particular observation. This method is based on Shapley values, a technique used in game theory. The R package shapper is a port of the Python library 'shap'.
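A sketch of the typical workflow, assuming the Python 'shap' library is installed and reachable from R, and using the titanic_imputed data shipped with DALEX; exact argument names may differ across versions:

```r
library(DALEX)
library(shapper)  # requires the Python shap library via reticulate

model <- glm(survived ~ ., data = titanic_imputed, family = "binomial")
explainer <- DALEX::explain(model,
                            data = titanic_imputed[, -8],
                            y = titanic_imputed$survived)
ive <- individual_variable_effect(explainer,
                                  new_observation = titanic_imputed[1, -8])
plot(ive)  # SHAP contributions for the first observation
```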
This package provides a RangedSummarizedExperiment object of read counts in genes for a time course RNA-Seq experiment of fission yeast (Schizosaccharomyces pombe) in response to oxidative stress (1M sorbitol treatment) at 0, 15, 30, 60, 120 and 180 mins. The samples are further divided between a wild-type group and a group with deletion of atf21. The read count matrix was prepared and provided by the author of the study: Leong HS, Dawson K, Wirth C, Li Y, Connolly Y, Smith DL, Wilkinson CR, Miller CJ. "A global non-coding RNA system modulates fission yeast protein levels in response to stress". Nat Commun 2014 May 23;5:3947. PMID: 24853205. GEO: GSE56761.
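This description matches the Bioconductor fission data package; loading and inspecting the object (assuming that package name) looks like:

```r
library(fission)
library(SummarizedExperiment)

data("fission")
fission               # a RangedSummarizedExperiment of read counts
head(assay(fission))  # count matrix: genes x samples
colData(fission)      # sample metadata (strain, time point, replicate)
```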
This package implements species distribution modeling and ecological niche modeling, including: bias correction, spatial cross-validation, model evaluation, raster interpolation, biotic "velocity" (speed and direction of movement of a "mass" represented by a raster), interpolation across a time series of rasters, and use of spatially imprecise records. The heart of the package is a set of "training" functions which automatically optimize model complexity based on the number of available occurrences. The supported algorithms include MaxEnt, MaxNet, boosted regression trees/gradient boosting machines, generalized additive models, generalized linear models, natural splines, and random forests. To enhance interoperability with other modeling packages, no new classes are created. The package works with PROJ6 geodetic objects and coordinate reference systems.
This package provides the ability to create color palettes from image files. It offers control over the type of color palette to derive from an image (qualitative, sequential, or divergent) and other palette properties. Quantiles of an image's color distribution can be trimmed. Near-black or near-white colors can be trimmed in RGB color space, independent of trimming brightness or saturation distributions in HSV color space. Creating sequential palettes also offers control over the order of HSV color dimensions to sort by. This package differs from other related packages like RImagePalette in its approach to quantizing and extracting colors from images to assemble color palettes and in the level of user control over palette construction.
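The underlying quantization step can be sketched in base R: cluster pixel colors with k-means in RGB space and use the cluster centers as the palette. Here img is assumed to be an H x W x 3 array with values in [0, 1], e.g. from png::readPNG(); the package's own trimming and sorting logic is more involved:

```r
make_palette <- function(img, n = 8) {
  px <- matrix(img, ncol = 3)  # one row per pixel: R, G, B
  km <- kmeans(px, centers = n, nstart = 5)
  rgb(km$centers[, 1], km$centers[, 2], km$centers[, 3])  # hex colors
}
```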