RStudio recently introduced the option to define addins and assign shortcuts to them. This package contains addins for a few of the most frequently used functions in a data scientist's (at least my) daily work, such as str(), example(), plot(), head(), view() and Desc(). Most of these addins take the current selection in the editor window, send the corresponding command to the console and execute it immediately. Assigning shortcuts to these addins will save you quite a few keystrokes.
This package provides implementations of some of the most important outlier detection algorithms. Includes a tutorial mode option that shows a description of each algorithm and provides a step-by-step execution explanation of how it identifies outliers from the given data with the specified input parameters. References include the works of Azzedine Boukerche, Lining Zheng, and Omar Alfandi (2020) <doi:10.1145/3381028>, Abir Smiti (2020) <doi:10.1016/j.cosrev.2020.100306>, and Xiaogang Su, Chih-Ling Tsai (2011) <doi:10.1002/widm.19>.
This library is a collection of pseudo-random number generators.
While Common Lisp does provide a RANDOM function, it does not allow the user to pass an explicit SEED, nor to portably exchange the random state between implementations. This can be a headache in cases like games, where a controlled seeding process can be very useful.
For both curiosity and convenience, this library offers multiple algorithms to generate random numbers, as well as a bunch of generally useful methods to produce desired ranges.
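To illustrate what explicit seeding and range helpers look like, here is a minimal Python sketch. It is not taken from the library: the class name `XorShift32` and its methods are hypothetical, and the generator is Marsaglia's 32-bit xorshift with shifts 13/17/5, chosen only because it is short.

```python
class XorShift32:
    """Tiny 32-bit xorshift generator (shifts 13/17/5), for illustration only."""

    MASK = 0xFFFFFFFF

    def __init__(self, seed: int) -> None:
        # An explicit seed fully determines the sequence.
        # A zero state would stay zero forever, so it is replaced by 1.
        self.state = (seed & self.MASK) or 1

    def next_u32(self) -> int:
        """Advance the state and return the next 32-bit unsigned integer."""
        x = self.state
        x ^= (x << 13) & self.MASK
        x ^= x >> 17
        x ^= (x << 5) & self.MASK
        self.state = x
        return x

    def random_float(self) -> float:
        """Float in the half-open range [0, 1)."""
        return self.next_u32() / 2**32

    def random_int(self, low: int, high: int) -> int:
        """Integer in [low, high]; the modulo introduces a slight bias."""
        return low + self.next_u32() % (high - low + 1)
```

Two generators created with the same seed produce the same sequence on any platform, which is the property a game needs for replays or synchronized state.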
This package provides the heuristics miner algorithm for process discovery as proposed by Weijters et al. (2011) <doi:10.1109/CIDM.2011.5949453>. The algorithm builds a causal net from an event log created with the bupaR package. Event logs are sets of ordered sequences of events, for which bupaR provides the S3 class eventlog(). The discovered causal nets can be visualised as htmlwidgets, and they can be annotated with the occurrence frequency or the processing and waiting time of process activities.
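The core quantity behind a heuristics miner is the dependency measure between two activities, computed from directly-follows counts. The Python sketch below shows that measure in its commonly cited form; it does not use the package or bupaR, and the function names and the list-of-lists trace format are assumptions made for the example.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def dependency(counts, a, b):
    """Dependency measure a => b, a value strictly between -1 and 1.

    Values near 1 suggest that b causally follows a; values near 0 suggest
    the two activities occur in either order (for example in parallel).
    """
    ab = counts[(a, b)]
    ba = counts[(b, a)]
    if a == b:
        # Self-loops use a separate formula.
        return ab / (ab + 1)
    return (ab - ba) / (ab + ba + 1)


# Three hypothetical traces of a toy process.
traces = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"]]
counts = directly_follows(traces)
```

A causal net is then obtained by keeping only the pairs whose dependency value exceeds a chosen threshold.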
Sample size requirements calculation using three different Bayesian criteria in the context of designing an experiment to estimate a normal mean or the difference between two normal means. Functions for calculation of required sample sizes for the Average Length Criterion, the Average Coverage Criterion and the Worst Outcome Criterion in the context of normal means are provided. Functions for both the fully Bayesian and the mixed Bayesian/likelihood approaches are provided. For reference see Joseph L. and Bélisle P. (1997) <https://www.jstor.org/stable/2988525>.
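As a worked illustration, consider the simplest textbook case: a normal mean with known precision and a conjugate normal prior worth `prior_n0` observations. This case is an assumption chosen for the example, not a description of the package's functions. Here the posterior variance does not depend on the data, so the interval length is the same for every data set and the three criteria lead to the same sample size. The function name and arguments below are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_known_precision(length, precision, prior_n0, level=0.95):
    """Smallest n whose posterior credible interval is no longer than `length`.

    Model (assumed): x_i ~ N(mu, 1/precision), mu ~ N(mu0, 1/(prior_n0 * precision)).
    The posterior precision is (prior_n0 + n) * precision, so the interval
    length is 2 * z / sqrt((prior_n0 + n) * precision).
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n = 4 * z * z / (precision * length**2) - prior_n0
    return max(0, math.ceil(n))


# Hypothetical inputs: standard deviation 2 (precision 0.25), target length 0.2,
# prior information equivalent to 10 observations.
n_required = sample_size_known_precision(length=0.2, precision=0.25, prior_n0=10)
```

When the precision is unknown, the interval length depends on the observed data and the criteria no longer coincide, which is where averaging over, or taking the worst case of, the possible data sets comes in.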
This package provides a set of tools and methods for making and manipulating transcript centric annotations. With these tools the user can easily download the genomic locations of the transcripts, exons and cds of a given organism, from either the UCSC Genome Browser or a BioMart database (more sources will be supported in the future). This information is then stored in a local database that keeps track of the relationship between transcripts, exons, cds and genes. Flexible methods are provided for extracting the desired features in a convenient format.
Efficient simulation-based power and sample size calculations are supported for a broad class of late-stage clinical trials. The following modules are included in the package: Adaptive designs with data-driven sample size or event count re-estimation, Adaptive designs with data-driven treatment selection, Adaptive designs with data-driven population selection, Optimal selection of a futility stopping rule, Event prediction in event-driven trials, Adaptive trials with response-adaptive randomization (experimental module), Traditional trials with multiple objectives (experimental module), and Traditional trials with cluster-randomized designs (experimental module).
Tools for managing and exploring parameter estimation results derived from Maximum Likelihood Estimation (MLE) using the likelihood package. It provides functions for organizing, visualizing, and summarizing MLE outcomes, streamlining statistical analysis workflows. By improving interpretation and facilitating model evaluation, it helps users gain deeper insights into parameter estimation and model fitting, making MLE result exploration more efficient and accessible. See Goffe et al. (1994) <doi:10.1016/0304-4076(94)90038-8> for details on MLE, and Canham and Uriarte (2006) <doi:10.1890/04-0657> for an application of MLE using the likelihood package.
The multispatial convergent cross mapping algorithm can be used as a test for causal associations between pairs of processes represented by time series. This is a combination of convergent cross mapping (CCM), described in Sugihara et al., 2012, Science, 338, 496-500, and dew-drop regression, described in Hsieh et al., 2008, American Naturalist, 171, 71-80. The algorithm allows CCM to be implemented on data that are not from a single long time series. Instead, data can come from many short time series, which are stitched together using bootstrapping.
This package implements methods for inference on potential waning of vaccine efficacy and for estimation of vaccine efficacy at a user-specified time after vaccination based on data from a randomized, double-blind, placebo-controlled vaccine trial in which participants may be unblinded and placebo subjects may be crossed over to the study vaccine. The methods also allow for variant stratification and for adjustment for possible confounding via inverse probability weighting through specification of models for the trial entry process, unblinding mechanisms, and the probability an unblinded placebo participant accepts study vaccine.
The aggregateBioVar package contains tools to summarize single cell gene expression profiles at the level of subject for single cell RNA-seq data collected from more than one subject (e.g. biological samples or technical replicates). A SingleCellExperiment object is taken as input and converted to a list of SummarizedExperiment objects, where each list element corresponds to an assigned cell type. The SummarizedExperiment objects contain aggregate gene-by-subject count matrices and inter-subject column metadata for individual subjects that can be processed using downstream bulk RNA-seq tools.
This package provides generic data structures and algorithms for use with forest mensuration data in a consistent framework. The functions and objects included are a collection of broadly applicable tools. More specialized applications should be implemented in separate packages that build on this foundation. Documentation about ForestElementsR is provided by three vignettes included in this package. For an introduction to the field of forest mensuration, refer to the textbooks by Kershaw et al. (2017) <doi:10.1002/9781118902028>, and van Laar and Akca (2007) <doi:10.1007/978-1-4020-5991-9>.
Programs for detecting and cleaning outliers in single time series and in time series from homogeneous and heterogeneous databases using an Orthogonal Greedy Algorithm (OGA) for saturated linear regression models. The programs implement the procedures presented in the paper entitled "Efficient Outlier Detection for Large Time Series Databases" by Pedro Galeano, Daniel Peña and Ruey S. Tsay (2025), working paper, Universidad Carlos III de Madrid. Version 1.0.1 contains some improvements to the algorithm, so the results may vary slightly compared to those obtained with version 0.0.1.
This package provides methods for analysis of compositional data including robust methods (<doi:10.1007/978-3-319-96422-5>), imputation of missing values (<doi:10.1016/j.csda.2009.11.023>), methods to replace rounded zeros (<doi:10.1080/02664763.2017.1410524>, <doi:10.1016/j.chemolab.2016.04.011>, <doi:10.1016/j.csda.2012.02.012>), count zeros (<doi:10.1177/1471082X14535524>), methods to deal with essential zeros (<doi:10.1080/02664763.2016.1182135>), (robust) outlier detection for compositional data, (robust) principal component analysis for compositional data, (robust) factor analysis for compositional data, (robust) discriminant analysis for compositional data (Fisher rule), robust regression with compositional predictors, functional data analysis (<doi:10.1016/j.csda.2015.07.007>) and p-splines (<doi:10.1016/j.csda.2015.07.007>), contingency (<doi:10.1080/03610926.2013.824980>) and compositional tables (<doi:10.1111/sjos.12326>, <doi:10.1111/sjos.12223>, <doi:10.1080/02664763.2013.856871>) and (robust) Anderson-Darling normality tests for compositional data as well as popular log-ratio transformations (addLR, cenLR, isomLR, and their inverse transformations). In addition, visualisation and diagnostic tools are implemented as well as high and low-level plot functions for the ternary diagram.
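For readers unfamiliar with log-ratio transformations, the following Python sketch shows the centred log-ratio and its inverse for a single composition. It is written independently of the package and does not call its functions; the names `clr` and `clr_inverse` are chosen for the example.

```python
import math


def clr(parts):
    """Centred log-ratio: log of each part minus the mean of the logs.

    Equivalent to the log of each part divided by the geometric mean.
    All parts must be strictly positive.
    """
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def clr_inverse(coords):
    """Map centred log-ratio coordinates back to a composition summing to 1."""
    exps = [math.exp(c) for c in coords]
    total = sum(exps)
    return [e / total for e in exps]


composition = [0.2, 0.3, 0.5]
coords = clr(composition)          # coordinates sum to zero
recovered = clr_inverse(coords)    # back to the original proportions
```

The transformed coordinates always sum to zero, which is why zeros in the data need to be replaced or imputed before any log-ratio method is applied.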
This package provides implementations of functions that can be used to test multivariate integration routines. The package covers six different integration domains (unit hypercube, unit ball, unit sphere, standard simplex, non-negative real numbers and R^n). For each domain several functions with different properties (smooth, non-differentiable, ...) are available. The functions are available in all dimensions n >= 1. For each function the exact value of the integral is known and implemented to allow testing the accuracy of multivariate integration routines. Details on the available test functions can be found on the development website.
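The idea of pairing a test integrand with its exact integral can be sketched in a few lines of Python. The integrand below (the sum of squared coordinates on the unit hypercube, with exact integral n/3) and the midpoint rule are assumptions picked for the example; they are not claimed to be among the package's test functions.

```python
import itertools


def sum_of_squares(point):
    """Smooth test integrand f(x) = x_1^2 + ... + x_n^2."""
    return sum(v * v for v in point)


def exact_sum_of_squares(dim):
    """Exact integral of sum_of_squares over the unit hypercube [0, 1]^dim."""
    return dim / 3.0


def midpoint_rule(f, dim, cells_per_axis):
    """Integrate f over [0, 1]^dim using a tensor-product midpoint rule."""
    h = 1.0 / cells_per_axis
    midpoints = [(i + 0.5) * h for i in range(cells_per_axis)]
    total = 0.0
    for point in itertools.product(midpoints, repeat=dim):
        total += f(point)
    return total * h**dim


# Compare the routine under test against the known exact value.
approx = midpoint_rule(sum_of_squares, dim=3, cells_per_axis=20)
error = abs(approx - exact_sum_of_squares(3))
```

Because the exact value is available, the error can be tracked as the grid is refined or the dimension grows, which is how an integration routine is benchmarked.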
In population management, data come at more or less regular intervals over time in sampling batches (bouts) and decisions should be made with the minimum number of samples and as quickly as possible. This package provides tools to implement, produce charts with stop lines, summarize results and assess sequential analyses that test hypotheses about population sizes. Two approaches are included: the sequential test of Bayesian posterior probabilities (Rincon, D.F. et al. 2025 <doi:10.1111/2041-210X.70053>), and the sequential probability ratio test (Wald, A. 1945 <http://www.jstor.org/stable/2235829>).
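The sequential probability ratio test can be sketched as follows. This is a simplified Python illustration for Bernoulli observations (for example, presence or absence in a sample unit), not the package's implementation for population sizes; the function name and its arguments are hypothetical. The stop lines use Wald's approximate thresholds, log((1 - beta) / alpha) and log(beta / (1 - alpha)).

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test of H0: p = p0 against H1: p = p1.

    Returns (decision, number of observations used). The cumulative
    log-likelihood ratio is compared with the two stop lines after every
    observation, so sampling ends as soon as the evidence is sufficient.
    """
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", len(observations)
```

With `p0=0.2` and `p1=0.6`, a run of positive samples crosses the upper stop line after three observations, while a run of negative samples needs five to cross the lower one.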
Tests for block-diagonal structure in symmetric matrices (e.g. correlation matrices) under the null hypothesis of exchangeable off-diagonal elements. As described in Segal et al. (2019), these tests can be useful for construct validation either by themselves or as a complement to confirmatory factor analysis. Monte Carlo methods are used to approximate the permutation p-value with Hubert's Gamma (Hubert, 1976) and a t-statistic. This package also implements the chi-squared statistic described by Steiger (1980). Please see Segal, et al. (2019) <doi:10.1007/s11336-018-9647-4> for more information.
The package uses collectbox to define variants of common box-related macros which read their content as a real box and not as a macro argument. This enables the use of verbatim or other special material as part of this content. The provided macros have the same names as the original versions but start with an upper-case letter instead. The long-form macros, like \Makebox, can also be used as environments, but not the short-form macros, like \Mbox. However, normally the long form uses the short form anyway when no optional arguments are used.
Discrete event simulation using both R and C++ (Karlsson et al. 2016; <doi:10.1109/eScience.2016.7870915>). The C++ code is adapted from the SSIM library <https://www.inf.usi.ch/carzaniga/ssim/>, allowing for event-oriented simulation. The code includes a SummaryReport class for reporting events and costs by age and other covariates. The C++ code is available as a static library for linking to other packages. A priority queue implementation is given in C++ together with an S3 closure and a reference class implementation. Finally, some tools are provided for cost-effectiveness analysis.
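The role of the priority queue in event-oriented simulation can be shown with a small Python sketch. It is not the package's R or C++ interface: the `Simulation` class, its methods and the event names are hypothetical, and the queue is Python's standard `heapq`.

```python
import heapq
import itertools


class Simulation:
    """Minimal event-oriented simulation driven by a priority queue."""

    def __init__(self):
        self.now = 0.0
        self.log = []                      # (time, event name) in execution order
        self._queue = []                   # heap ordered by event time
        self._counter = itertools.count()  # tie-breaker for simultaneous events

    def schedule(self, delay, name):
        """Schedule an event `delay` time units after the current time."""
        heapq.heappush(self._queue, (self.now + delay, next(self._counter), name))

    def run(self, handler, until):
        """Pop events in time order and pass each one to `handler`."""
        while self._queue and self._queue[0][0] <= until:
            time, _, name = heapq.heappop(self._queue)
            self.now = time
            self.log.append((time, name))
            handler(self, name)


def handler(sim, name):
    # Handling one event may schedule further events.
    if name == "onset":
        sim.schedule(2.5, "death")


sim = Simulation()
sim.schedule(5.0, "onset")
sim.schedule(1.0, "screen")
sim.run(handler, until=100.0)
```

After the run, `sim.log` holds the events in time order regardless of the order in which they were scheduled; a report by age or other covariates would be accumulated inside the handler.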
Takes the outputs of a caret confusion matrix and allows for the quick conversion of these list items to lists. The intended usage is to allow the tool to work with the outputs of machine learning classification models. This tool works with classification problems for binary and multi-classification problems and allows for the record level conversion of the confusion matrix outputs. This is useful, as it allows quick conversion of these objects for storage in database systems and to track ML model performance over time. Traditionally, this approach has been used for highlighting model representation and feature slippage.
The normal process of creating clinical study slides is that one statistician manually types in the numbers from the outputs and a second statistician double-checks the typed-in numbers. This process is time consuming, resource intensive, and error prone. Automatic slide generation is a solution to address these issues. It reduces the amount of work and the required time when creating slides, and reduces the risk of errors from manually typing or copying numbers from the output to slides. It also helps users to avoid unnecessary stress when creating a large number of slide decks in a short time window.
Machine learning algorithms for the case where the predictor variables are compositional data and the response variable is either continuous or categorical. Specifically, the Boruta variable selection algorithm, random forest, support vector machines and projection pursuit regression are included. Relevant papers include: Tsagris M.T., Preston S. and Wood A.T.A. (2011). "A data-based power transformation for compositional data". Fourth International Workshop on Compositional Data Analysis. <doi:10.48550/arXiv.1106.1451> and Alenazi, A. (2023). "A review of compositional data analysis and recent advances". Communications in Statistics--Theory and Methods, 52(16): 5535--5567. <doi:10.1080/03610926.2021.2014890>.
Convert between the different biological IDs of rice (Oryza sativa). Rice has more than one form of gene ID for its genome. The two main gene ID systems are RAP (The Rice Annotation Project, <https://rapdb.dna.affrc.go.jp/>) and MSU (The Rice Genome Annotation Project, <http://rice.plantbiology.msu.edu/>). All RAP rice gene IDs are of the form Os##g#######, as explained on the website <https://rapdb.dna.affrc.go.jp/>. All MSU rice gene IDs are of the form LOC_Os##g#####, as explained on the website <http://rice.plantbiology.msu.edu/analyses_nomenclature.shtml>. All SYMBOL rice gene IDs are the unique names used by NCBI (National Center for Biotechnology Information, <https://www.ncbi.nlm.nih.gov/>). TRANSCRIPTID, the transcript ID of rice, is of the form Os##t#######. Researchers often need to convert between these IDs, for example from RAP to SYMBOL to search for gene functions on NCBI. Many websites can convert RAP to MSU or MSU to RAP, such as ID Converter <https://rapdb.dna.affrc.go.jp/tools/converter>, but converting large numbers of IDs on these websites is difficult. The package can convert between the three IDs (RAP, MSU and SYMBOL) regardless of how many IDs are supplied.
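The ID forms quoted above translate directly into regular expressions. The Python sketch below only recognizes which form an identifier has; it does not perform any conversion and is unrelated to the package's code. The patterns follow the stated forms literally (each `#` is one digit), the example IDs are made up to fit those forms, and real identifiers may carry suffixes that these patterns do not cover.

```python
import re

# One pattern per ID form described in the text; each '#' is a single digit.
ID_PATTERNS = {
    "RAP": re.compile(r"Os\d{2}g\d{7}"),
    "MSU": re.compile(r"LOC_Os\d{2}g\d{5}"),
    "TRANSCRIPTID": re.compile(r"Os\d{2}t\d{7}"),
}


def classify_rice_id(identifier):
    """Return the name of the matching ID form, or None if nothing matches."""
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(identifier):
            return kind
    return None


# Hypothetical identifiers constructed to fit the stated forms.
examples = ["Os01g0100100", "LOC_Os01g01010", "Os01t0100100", "not-an-id"]
kinds = [classify_rice_id(x) for x in examples]
```

A check like this is a useful first step before a batch conversion, since it separates malformed input from IDs that simply have no counterpart in the other system.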
Perform fast and memory efficient time-weighted averaging of values measured over intervals into new arbitrary intervals. This package is useful in the context of data measured or represented as constant values over intervals on a one-dimensional discrete axis (e.g. time-integrated averages of a curve over defined periods). This package was written specifically to deal with air pollution data recorded or predicted as averages over sampling periods. Data in this format often needs to be shifted to non-aligned periods or averaged up to periods of longer duration (e.g. averaging data measured over sequential non-overlapping periods to calendar years).
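The underlying calculation weights each source value by the length of its overlap with the target interval. The Python sketch below shows that idea with a plain double loop; it makes no attempt at the speed or memory efficiency the package aims for. The function name, the tuple layout and the use of half-open intervals `[start, end)` are assumptions made for the example.

```python
def interval_average(source, targets):
    """Time-weighted average of interval values over new intervals.

    source:  list of (start, end, value), non-overlapping, value constant on [start, end)
    targets: list of (start, end) intervals to average into
    Returns one average per target, or None where no source data overlaps it.
    """
    results = []
    for t_start, t_end in targets:
        weighted_sum = 0.0
        covered = 0.0
        for s_start, s_end, value in source:
            # Length of the intersection of the source and target intervals.
            overlap = min(t_end, s_end) - max(t_start, s_start)
            if overlap > 0:
                weighted_sum += overlap * value
                covered += overlap
        results.append(weighted_sum / covered if covered else None)
    return results


# Two hypothetical 10-day sampling periods averaged into shifted and longer periods.
source = [(0, 10, 2.0), (10, 20, 4.0)]
averages = interval_average(source, [(5, 15), (0, 20), (20, 30)])
```

The first target straddles both sampling periods equally, so its average is the midpoint of the two values; the last target has no overlapping data and yields `None`.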