Kernel regularized least squares, also known as kernel ridge regression, is a flexible machine learning method. This package implements this method by providing a smooth term for use with mgcv and uses random sketching to facilitate scalable estimation on large datasets. It provides additional functions for calculating marginal effects after estimation and for use with ensembles ('SuperLearning'), double/debiased machine learning ('DoubleML'), and robust/clustered standard errors ('sandwich'). Chang and Goplerud (2024) <doi:10.1017/pan.2023.27> provide further details.
Allows for the computation of mSHAP values on two-part models as proposed by Matthews, S. and Hartman, B. (2021) <arXiv:2106.08990>. Also contains functions for simple plotting of the results (or any SHAP values). For information about the TreeSHAP algorithm that mSHAP builds on, see Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., Lee, S.I. (2020) <doi:10.1038/s42256-019-0138-9>.
Generates efficient balanced non-aliased multi-level k-circulant supersaturated designs by interchanging the elements of the generator vector. Attempts to generate a supersaturated design whose chi-square efficiency exceeds the user-specified efficiency level (mef). Displays the progress of generating an efficient multi-level k-circulant design through a progress bar. A progress of 100% means that one full round of interchange has been completed. More than one full round (typically 4-5 rounds) of interchange may be required for larger designs.
This package performs multivariate nonparametric regression/classification by the method of sieves (using orthogonal bases). The method is suitable for moderately high-dimensional features (dimension < 100). The l1-penalized sieve estimator, a nonparametric generalization of Lasso, is adaptive to the feature dimension with provable theoretical guarantees. We also include a nonparametric stochastic gradient descent estimator, Sieve-SGD, for online or large-scale batch problems. Details of the methods can be found in: <arXiv:2206.02994> <arXiv:2104.00846> <arXiv:2310.12140>.
Parsing (R)Markdown files with numerous regular expressions can be fraught with peril, but it does not have to be this way. Converting (R)Markdown files to XML using the commonmark package allows in-memory editing of markdown elements via XPath through the extensible R6 class called 'yarn'. These modified XML representations can be written to (R)Markdown documents via an xslt stylesheet which implements an extended version of 'GitHub'-flavoured markdown so that you can tinker to your heart's content.
This package provides a framework for statistical analysis in content analysis. In addition to a pipeline for preprocessing text corpora and linking to the latent Dirichlet allocation from the lda package, plots are offered for the descriptive analysis of text corpora and topic models. In addition, an implementation of Chang's intruder words and intruder topics is provided. Sample data for the vignette is included in the toscaData package, which is available on GitHub: <https://github.com/Docma-TU/toscaData>.
Define and use graphical elements of corporate design manuals in R. The unikn package provides color functions (by defining dedicated colors and color palettes, and commands for finding, changing, viewing, and using them) and styled text elements (e.g., for marking, underlining, or plotting colored titles). The pre-defined range of colors and text decoration functions is based on the corporate design of the University of Konstanz <https://www.uni-konstanz.de/>, but can be adapted and extended for other purposes or institutions.
Iteratively Adjusted Surrogate Variable Analysis (IA-SVA) is a statistical framework to uncover hidden sources of variation even when these sources are correlated. IA-SVA provides a flexible methodology to i) identify a hidden factor for unwanted heterogeneity while adjusting for all known factors; ii) test the significance of the putative hidden factor for explaining the unmodeled variation in the data; and iii), if significant, use the estimated factor as an additional known factor in the next iteration to uncover further hidden factors.
This package wires together large collections of single-cell RNA-seq datasets, which allows for both the identification of recurrent cell clusters and the propagation of information between datasets in multi-sample or atlas-scale collections. Conos focuses on the uniform mapping of homologous cell types across heterogeneous sample collections. For instance, users could investigate a collection of dozens of peripheral blood samples from cancer patients combined with dozens of controls, which perhaps includes samples of a related tissue such as lymph nodes.
The ggbio package extends and specializes the grammar of graphics for biological data. The graphics are designed to answer common scientific questions, in particular those often asked of high throughput genomics data. All core Bioconductor data structures are supported, where appropriate. The package supports detailed views of particular genomic regions, as well as genome-wide overviews. Supported overviews include ideograms and grand linear views. High-level plots include sequence fragment length, edge-linked interval to data view, mismatch pileup, and several splicing summaries.
SGSeq is a package for analyzing splice events from RNA-seq data. Input data are RNA-seq reads mapped to a reference genome in BAM format. Genes are represented as a splice graph, which can be obtained from existing annotation or predicted from the mapped sequence reads. Splice events are identified from the graph and are quantified locally using structurally compatible reads at the start or end of each splice variant. The software includes functions for splice event prediction, quantification, visualization and interpretation.
Implementation of Energy Trees, a statistical model to perform classification and regression with structured and mixed-type data. The model has a similar structure to Conditional Trees, but brings in Energy Statistics to test independence between variables that are possibly structured and of different nature. Currently, the package covers functions and graphs as structured covariates. It builds upon partykit to provide functionalities for fitting, printing, plotting, and predicting with Energy Trees. Energy Trees are described in Giubilei et al. (2022) <arXiv:2207.04430>.
Interactive tools to explore topographic-like data sets. Such data sets take the form of a matrix in which the rows and columns provide location/frequency information, and the matrix elements contain altitude/response information. Such data is found in cartography, 2D spectroscopy and chemometrics. The functions in this package create interactive web pages showing the contoured data, possibly with slices from the original matrix parallel to each dimension. The interactive behavior is created using the D3.js JavaScript library by Mike Bostock.
This package implements a Bayesian-like approach to the high-dimensional sparse linear regression problem based on an empirical or data-dependent prior distribution, which can be used for estimation/inference on the model parameters, variable selection, and prediction of a future response. The method was first presented in Martin, Ryan and Mess, Raymond and Walker, Stephen G (2017) <doi:10.3150/15-BEJ797>. More details focused on the prediction problem are given in Martin, Ryan and Tang, Yiqi (2019) <arXiv:1903.00961>.
This package implements readers and writers for file formats associated with genetics data. Reading and writing Plink BED/BIM/FAM and GCTA binary GRM formats is fully supported, including lightning-fast BED reader and writer implementations. Other functions are readr wrappers that are more constrained, user-friendly, and efficient for these particular applications; these handle Plink and Eigenstrat tables (FAM, BIM, IND, and SNP files). There are also make functions for FAM and BIM tables with default values to go with simulated genotype data.
This package provides functions for the analysis of occupational and environmental data with non-detects. Maximum likelihood (ML) methods for censored log-normal data and non-parametric methods based on the product limit estimate (PLE) for left-censored data are used to calculate all of the statistics recommended by the American Industrial Hygiene Association (AIHA) for the complete data case. Functions for the analysis of complete samples using exact methods are also provided for the lognormal model. Revised from 2007-11-05 'survfit~1'.
This package provides a normalization and copy number variation calling procedure for whole exome DNA sequencing data. CODEX relies on the availability of multiple samples processed using the same sequencing pipeline for normalization, and does not require matched controls. The normalization model in CODEX includes terms that specifically remove biases due to GC content, exon length and targeting and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based recursive segmentation procedure that explicitly models the count-based exome sequencing data.
dplyr is the next iteration of plyr. It is focused on tools for working with data frames. It has three main goals: 1) identify the most important data manipulation tools needed for data analysis and make them easy to use in R; 2) provide fast performance for in-memory data by writing key pieces of code in C++; 3) use the same code interface to work with data no matter where it is stored, whether in a data frame, a data table or database.
The smurf package contains the implementation of the Sparse Multi-type Regularized Feature (SMuRF) modeling algorithm to fit generalized linear models (GLMs) with multiple types of predictors via regularized maximum likelihood. In addition to the fitting procedure, the following functionality is available:
Selection of the regularization tuning parameter lambda using three different approaches: in-sample, out-of-sample or using cross-validation.
S3 methods to handle the fitted object including visualization of the coefficients and a model summary.
Designed for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. Contains functions for multiple and pairwise sequence alignment, model construction and parameter optimization, file import/export, implementation of the forward, backward and Viterbi algorithms for conditional sequence probabilities, tree-based sequence weighting, and sequence simulation. Features a wide variety of potential applications including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification. Based on the models and algorithms described in Durbin et al (1998, ISBN: 9780521629713).
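As an illustration of the Viterbi algorithm named above, the following is a minimal Python sketch for a plain (non-profile) hidden Markov model, computed in log space. It is not the package's implementation or interface: the function name `viterbi`, the dictionary-based model layout, and the toy fair/loaded die model (states "F" and "L" with made-up probabilities) are all assumptions chosen for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log-probability.

    Hypothetical helper for illustration: probabilities are given as nested
    dictionaries and must be strictly positive, since logs are taken.
    """
    # Initialisation: log-probability of starting in each state and
    # emitting the first observation.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}]
    back = [{}]  # back[t][s] = best predecessor of state s at time t

    # Recursion: extend the best path into each state, one observation at a time.
    for obs in observations[1:]:
        prev = score[-1]
        current, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            current[s] = (prev[best_prev]
                          + math.log(trans_p[best_prev][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = best_prev
        score.append(current)
        back.append(pointers)

    # Termination and traceback from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[-1][last]


# Toy model (assumed values): a fair die "F" and a loaded die "L" that
# rolls a six half of the time.
states = ["F", "L"]
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

path_sixes, logp_sixes = viterbi([6, 6, 6], states, start_p, trans_p, emit_p)
path_mixed, _ = viterbi([1, 2, 3], states, start_p, trans_p, emit_p)
```

Working in log space avoids numerical underflow on long sequences, which is why the sketch sums logarithms rather than multiplying probabilities.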
This package provides functions for planning and conducting a clinical trial with adaptive sample size determination. Maximal statistical efficiency will be exploited even when dramatic or multiple adaptations are made. Such a trial consists of adaptive determination of sample size at an interim analysis and implementation of a frequentist statistical test at the interim and final analyses with a prefixed significance level. The required assumptions for the stage-wise test statistics are independent and stationary increments and normality. Predetermination of the adaptation rule is not required.
Distances on dual-weighted directed graphs using priority-queue shortest paths (Padgham (2019) <doi:10.32866/6945>). Weighted directed graphs have weights from A to B which may differ from those from B to A. Dual-weighted directed graphs have two sets of such weights. A canonical example is a street network to be used for routing in which routes are calculated by weighting distances according to the type of way and mode of transport, yet lengths of routes must be calculated from direct distances.
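The dual-weighting idea above can be sketched in a few lines of Python: the route is chosen with a priority queue using one set of edge weights, while the reported length is accumulated from a second set of direct distances. This is an illustrative sketch only, not the package's implementation or interface; the function name `dual_weighted_shortest_path`, the edge tuple layout, and the toy network are assumptions.

```python
import heapq
import itertools


def dual_weighted_shortest_path(edges, source, target):
    """Route by weight, report length by direct distance.

    `edges` is an iterable of (from, to, weight, distance) tuples describing
    a directed graph; weights must be non-negative. Returns a tuple
    (total_weight, total_distance, path), or None if target is unreachable.
    """
    graph = {}
    for u, v, weight, distance in edges:
        graph.setdefault(u, []).append((v, weight, distance))

    best = {source: 0.0}          # lowest routing weight found per node
    counter = itertools.count()   # tie-breaker so the heap never compares paths
    queue = [(0.0, next(counter), 0.0, source, [source])]

    while queue:
        cost, _, dist, node, path = heapq.heappop(queue)
        if node == target:
            return cost, dist, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for nxt, weight, distance in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(
                    queue,
                    (new_cost, next(counter), dist + distance, nxt, path + [nxt]),
                )
    return None


# Toy street network (assumed values): the direct edge A -> C is short but
# heavily weighted (say, a busy road), so routing prefers the detour via B.
edges = [
    ("A", "B", 1.0, 100.0),
    ("B", "C", 1.0, 100.0),
    ("A", "C", 5.0, 120.0),
    ("C", "A", 2.0, 120.0),  # the reverse direction may carry a different weight
]
result = dual_weighted_shortest_path(edges, "A", "C")
```

In this toy network the chosen route is A, B, C with a routing weight of 2.0 and a direct length of 200.0, even though the direct edge is only 120.0 long.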
Computes the power and sample size (PASS) required to test for the difference in the mean function between two groups under a repeatedly measured longitudinal or sparse functional design. See the manuscript by Koner and Luo (2023) <https://salilkoner.github.io/assets/PASS_manuscript.pdf> for details of the PASS formula and computational details. The details of the testing procedure for univariate and multivariate response are presented in Wang (2021) <doi:10.1214/21-EJS1802> and Koner and Luo (2023) <arXiv:2302.05612> respectively.
This package implements methods developed by Ding, Feller, and Miratrix (2016) <doi:10.1111/rssb.12124> <arXiv:1412.5000>, and Ding, Feller, and Miratrix (2018) <doi:10.1080/01621459.2017.1407322> <arXiv:1605.06566> for testing whether there is unexplained variation in treatment effects across observations, and for characterizing the extent of the explained and unexplained variation in treatment effects. The package includes wrapper functions implementing the proposed methods, as well as helper functions for analyzing and visualizing the results of the test.