Combines predictive analytics and experimental design to optimize results. The package selects a training population calibrated to the test data in high-dimensional prediction problems, and assumes that the explanatory variables are observed for all individuals. Once a "good" training set is identified, the response variable needs to be obtained only for this set in order to build a model for predicting the response in the test set. The algorithms in the package can be adapted to solve other subset selection problems.
This package provides functions that automate accessing, downloading and exploring Soil Moisture and Ocean Salinity (SMOS) Level 4 (L4) data developed by the Barcelona Expert Center (BEC). In particular, it includes functions to search for, acquire, extract, and plot BEC-SMOS L4 soil moisture data downscaled to ~1 km spatial resolution. Note that SMOS is one of the Earth Explorer Opportunity missions of the European Space Agency (ESA). More information about SMOS products can be found at <https://earth.esa.int/eogateway/missions/smos/data>.
The smurf package contains the implementation of the Sparse Multi-type Regularized Feature (SMuRF) modeling algorithm to fit generalized linear models (GLMs) with multiple types of predictors via regularized maximum likelihood. In addition to the fitting procedure, the following functionality is available:
Selection of the regularization tuning parameter lambda using three different approaches: in-sample, out-of-sample or using cross-validation.
S3 methods to handle the fitted object including visualization of the coefficients and a model summary.
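The cross-validation approach to choosing lambda can be illustrated with a minimal Python sketch. It is not the smurf implementation: a one-predictor ridge regression without intercept stands in for the SMuRF penalties, and the names ridge_slope, cv_error and select_lambda are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for one predictor without intercept.

    Assumption: ridge is used here only as a stand-in penalty.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared out-of-fold prediction error for a given lambda."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        # Every k-th observation, starting at `fold`, is held out.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        beta = ridge_slope(train_x, train_y, lam)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in test_idx)
    return total / n


def select_lambda(xs, ys, grid, k=5):
    """Return the lambda from `grid` with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(xs, ys, lam, k))
```

In-sample selection would score each lambda on the data used for fitting, and out-of-sample selection on a single held-out set; cross-validation averages over several such splits.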
dplyr is the next iteration of plyr. It is focused on tools for working with data frames. It has three main goals: 1) identify the most important data manipulation tools needed for data analysis and make them easy to use in R; 2) provide fast performance for in-memory data by writing key pieces of code in C++; 3) use the same code interface to work with data no matter where it is stored, whether in a data frame, a data table or database.
Iteratively Adjusted Surrogate Variable Analysis (IA-SVA) is a statistical framework to uncover hidden sources of variation even when these sources are correlated. IA-SVA provides a flexible methodology to i) identify a hidden factor for unwanted heterogeneity while adjusting for all known factors; ii) test the significance of the putative hidden factor for explaining the unmodeled variation in the data; and iii), if significant, use the estimated factor as an additional known factor in the next iteration to uncover further hidden factors.
This package provides a comprehensive set of functions designed for multivariate mean monitoring using the Critical-to-X Control Chart. These functions enable the determination of optimal control limits based on a specified in-control Average Run Length (ARL), the calculation of out-of-control ARL for a given control limit, and post-signal analysis to identify the specific variable responsible for a detected shift in the mean. This suite of tools provides robust support for precise and effective process monitoring and analysis.
This package implements the Fixed Effect Jackknife Instrumental Variables ('FEJIV') estimator of Chao, Swanson, and Woutersen (2023) <doi:10.1016/j.jeconom.2022.12.011>, allowing consistent IV estimation with many (possibly weak) instruments, cluster fixed effects, heteroskedastic errors, and many exogenous covariates. The estimator is recommended by Słoczyński (2024) <doi:10.48550/arXiv.2011.06695> as an alternative to two-stage least squares when estimating the interacted specification of Angrist and Imbens (1995) <doi:10.1080/01621459.1995.10476535>.
Kernel regularized least squares, also known as kernel ridge regression, is a flexible machine learning method. This package implements this method by providing a smooth term for use with mgcv and uses random sketching to facilitate scalable estimation on large datasets. It provides additional functions for calculating marginal effects after estimation and for use with ensembles ('SuperLearning'), double/debiased machine learning ('DoubleML'), and robust/clustered standard errors ('sandwich'). Chang and Goplerud (2024) <doi:10.1017/pan.2023.27> provide further details.
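The core of kernel ridge regression can be sketched in a few lines of Python. This is an illustration only, not the package's code: it assumes a Gaussian kernel, solves the full n-by-n system directly, and omits the random sketching and the mgcv integration described above. All function names are hypothetical.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def krls_fit(X, y, lam=0.1, bandwidth=1.0):
    """Return dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], bandwidth) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        K[i][i] += lam  # ridge penalty on the diagonal
    return solve(K, y)


def krls_predict(X_train, alpha, x_new, bandwidth=1.0):
    """Predict f(x_new) = sum_i alpha_i * k(x_i, x_new)."""
    return sum(a * gaussian_kernel(xi, x_new, bandwidth)
               for a, xi in zip(alpha, X_train))
```

The direct solve costs on the order of n cubed operations, which is the bottleneck that sketching methods are meant to relieve on large datasets.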
Generates efficient balanced non-aliased multi-level k-circulant supersaturated designs by interchanging the elements of the generator vector. Attempts to generate a supersaturated design whose chi-square efficiency exceeds the user-specified efficiency level (mef). Displays the progress of generating an efficient multi-level k-circulant design through a progress bar. A progress of 100% means that one full round of interchange has been completed. More than one full round (typically 4-5 rounds) of interchange may be required for larger designs.
This package performs multivariate nonparametric regression/classification by the method of sieves (using orthogonal bases). The method is suitable for moderately high-dimensional features (dimension < 100). The l1-penalized sieve estimator, a nonparametric generalization of the Lasso, is adaptive to the feature dimension with provable theoretical guarantees. We also include a nonparametric stochastic gradient descent estimator, Sieve-SGD, for online or large-scale batch problems. Details of the methods can be found in: <arXiv:2206.02994>, <arXiv:2104.00846>, <arXiv:2310.12140>.
Parsing (R)Markdown files with numerous regular expressions can be fraught with peril, but it does not have to be this way. Converting (R)Markdown files to XML using the commonmark package allows in-memory editing of markdown elements via XPath through the extensible R6 class called yarn. These modified XML representations can be written to (R)Markdown documents via an xslt stylesheet which implements an extended version of GitHub-flavoured markdown so that you can tinker to your heart's content.
This package provides a framework for statistical analysis in content analysis. In addition to a pipeline for preprocessing text corpora and linking to the latent Dirichlet allocation from the lda package, plots are offered for the descriptive analysis of text corpora and topic models. An implementation of Chang's intruder words and intruder topics is also provided. Sample data for the vignette is included in the toscaData package, which is available on GitHub: <https://github.com/Docma-TU/toscaData>.
Define and use graphical elements of corporate design manuals in R. The unikn package provides color functions (by defining dedicated colors and color palettes, and commands for finding, changing, viewing, and using them) and styled text elements (e.g., for marking, underlining, or plotting colored titles). The pre-defined range of colors and text decoration functions is based on the corporate design of the University of Konstanz <https://www.uni-konstanz.de/>, but can be adapted and extended for other purposes or institutions.
The xtdml package implements partially linear panel regression (PLPR) models with high-dimensional confounding variables and an exogenous treatment variable within the double machine learning framework. The package is used to estimate the structural parameter (treatment effect) in static panel data models with fixed effects using the approaches established in Clarke and Polselli (2025) <doi:10.1093/ectj/utaf011>. xtdml is built on the object-oriented package DoubleML (Bach et al., 2024) <doi:10.18637/jss.v108.i03> using the mlr3 ecosystem.
RStudio is an integrated development environment (IDE) for the R programming language. Some of its features include: Customizable workbench with all of the tools required to work with R in one place (console, source, plots, workspace, help, history, etc.); syntax highlighting editor with code completion; execute code directly from the source editor (line, selection, or file); full support for authoring Sweave and TeX documents. RStudio can also be run as a server, enabling multiple users to access the RStudio IDE using a web browser.
The objective of AGDEX is to evaluate whether the results of a pair of two-group differential expression analysis comparisons show a level of agreement that is greater than expected if the group labels for each two-group comparison are randomly assigned. The agreement is evaluated for the entire transcriptome and (optionally) for a collection of pre-defined gene-sets. Additionally, the procedure performs permutation-based differential expression and meta analysis at both gene and gene-set levels of the data from each experiment.
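The idea of testing agreement against randomly assigned group labels can be sketched as follows in Python. This is a generic illustration rather than the AGDEX procedure: the cosine of the two per-gene difference vectors is assumed as the agreement statistic, only the labels of experiment A are permuted for brevity, and all function names are hypothetical.

```python
import math
import random


def mean_difference(samples, labels):
    """Per-gene mean of group 1 minus mean of group 0.

    samples: list of per-sample expression vectors; labels: list of 0/1.
    """
    n_genes = len(samples[0])
    diffs = []
    for g in range(n_genes):
        g1 = [s[g] for s, lab in zip(samples, labels) if lab == 1]
        g0 = [s[g] for s, lab in zip(samples, labels) if lab == 0]
        diffs.append(sum(g1) / len(g1) - sum(g0) / len(g0))
    return diffs


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def agreement_pvalue(samples_a, labels_a, samples_b, labels_b,
                     n_perm=999, seed=0):
    """One-sided permutation p-value for agreement of two comparisons."""
    rng = random.Random(seed)
    diff_b = mean_difference(samples_b, labels_b)
    observed = cosine(mean_difference(samples_a, labels_a), diff_b)
    hits = 0
    shuffled = list(labels_a)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of group labels
        permuted = cosine(mean_difference(samples_a, shuffled), diff_b)
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

Shuffling preserves the group sizes, so each permuted comparison has the same design as the observed one.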
This package provides a normalization and copy number variation calling procedure for whole exome DNA sequencing data. CODEX relies on the availability of multiple samples processed using the same sequencing pipeline for normalization, and does not require matched controls. The normalization model in CODEX includes terms that specifically remove biases due to GC content, exon length and targeting and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based recursive segmentation procedure that explicitly models the count-based exome sequencing data.
The cmgnd package implements the constrained mixture of generalized normal distributions model, a flexible statistical framework for modelling univariate data exhibiting non-normal features such as skewness, multi-modality, and heavy tails. By imposing constraints on model parameters, cmgnd reduces estimation complexity while maintaining high descriptive power, offering an efficient solution in the presence of distributional irregularities. For more details see Duttilo and Gattone (2025) <doi:10.1007/s00180-025-01638-x> and Duttilo et al. (2025) <doi:10.48550/arXiv.2506.03285>.
Interactive tools to explore topographic-like data sets. Such data sets take the form of a matrix in which the rows and columns provide location/frequency information, and the matrix elements contain altitude/response information. Such data is found in cartography, 2D spectroscopy and chemometrics. The functions in this package create interactive web pages showing the contoured data, possibly with slices from the original matrix parallel to each dimension. The interactive behavior is created using the D3.js JavaScript library by Mike Bostock.
Implementation of Energy Trees, a statistical model to perform classification and regression with structured and mixed-type data. The model has a similar structure to Conditional Trees, but brings in Energy Statistics to test independence between variables that are possibly structured and of different nature. Currently, the package covers functions and graphs as structured covariates. It builds upon partykit to provide functionalities for fitting, printing, plotting, and predicting with Energy Trees. Energy Trees are described in Giubilei et al. (2022) <arXiv:2207.04430>.
This package implements readers and writers for file formats associated with genetics data. Reading and writing Plink BED/BIM/FAM and GCTA binary GRM formats is fully supported, including lightning-fast BED reader and writer implementations. Other functions are readr wrappers that are more constrained, user-friendly, and efficient for these particular applications; they handle Plink and Eigenstrat tables (FAM, BIM, IND, and SNP files). There are also make functions for FAM and BIM tables with default values to go with simulated genotype data.
This package provides functions for the analysis of occupational and environmental data with non-detects. Maximum likelihood (ML) methods for censored log-normal data and non-parametric methods based on the product limit estimate (PLE) for left-censored data are used to calculate all of the statistics recommended by the American Industrial Hygiene Association (AIHA) for the complete data case. Functions for the analysis of complete samples using exact methods are also provided for the lognormal model. Revised from 2007-11-05 survfit~1.
XBSeq is a novel algorithm for testing RNA-seq differential expression (DE), in which the statistical model is based on the assumption that observed signals are the convolution of true expression signals and sequencing noise. The mapped reads in non-exonic regions are treated as sequencing noise, which follows a Poisson distribution. Given the measurable observed signal and background noise from RNA-seq data, the true expression signals, assumed to be governed by a negative binomial distribution, can be delineated, enabling accurate detection of differentially expressed genes.
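The consequence of the convolution assumption can be sketched with a method-of-moments calculation in Python. This is not necessarily the estimator XBSeq uses. If the observed count X is the sum of an independent signal S and Poisson noise B, then the mean and variance of S are those of X minus the Poisson rate. The function names are hypothetical.

```python
from statistics import mean, variance


def deconvolve_moments(observed, background):
    """Estimate mean and variance of the true signal S from X = S + B.

    Assumes S and B are independent and B is Poisson, so that
    E[S] = E[X] - rate and Var[S] = Var[X] - rate.
    """
    rate = float(mean(background))  # Poisson: mean equals variance
    signal_mean = float(mean(observed)) - rate
    signal_var = float(variance(observed)) - rate  # sample variance
    return signal_mean, signal_var


def negative_binomial_size(signal_mean, signal_var):
    """Size parameter r from Var = mean + mean**2 / r.

    Returns None when the signal is not overdispersed, in which case
    a negative binomial fit by moments is not defined.
    """
    excess = signal_var - signal_mean
    if signal_mean <= 0 or excess <= 0:
        return None
    return signal_mean ** 2 / excess
```

With few replicates the moment estimates are noisy, so a practical method would typically stabilize them across genes.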
The AnVIL is a cloud computing resource developed in part by the National Human Genome Research Institute. The AnVIL package provides end-user and developer functionality. For end users, AnVIL provides fast binary package installation, utilities for working with Terra/AnVIL table and data resources, and convenient functions for file movement to and from Google cloud storage. For developers, AnVIL provides programmatic access to the Terra, Leonardo, Rawls, Dockstore, and Gen3 RESTful programming interfaces, including helper functions to transform JSON responses to formats more amenable to manipulation in R.