Facilitates the use of data mining algorithms in classification and regression (including time series forecasting) tasks by presenting a short and coherent set of functions. Versions: 1.4.9 / 1.4.8 improved help, several warning and error code fixes (more stable version, all examples run correctly); 1.4.7 - improved Importance function and examples, minor error fixes; 1.4.6 / 1.4.5 / 1.4.4 new automated machine learning (AutoML
) and ensembles, via improved fit()
, mining()
and mparheuristic()
functions, and new categorical preprocessing, via improved delevels()
function; 1.4.3 new metrics (e.g., macro precision, explained variance), new "lssvm" model and improved mparheuristic()
function; 1.4.2 new "NMAE" metric, "xgboost" and "cv.glmnet" models (16 classification and 18 regression models); 1.4.1 new tutorial and more robust version; 1.4 - new classification and regression models, with a total of 14 classification and 15 regression methods, including: Decision Trees, Neural Networks, Support Vector Machines, Random Forests, Bagging and Boosting; 1.3 and 1.3.1 - new classification and regression metrics; 1.2 - new input importance methods via improved Importance()
function; 1.0 - first version.
Interface package for sala', the spatial network analysis library from the depthmapX
software application. The R parts of the code are based on the rdepthmap package. Allows for the analysis of urban and building-scale networks and provides metrics and methods usually found within the Space Syntax domain. Methods in this package are described by K. Al-Sayed, A. Turner, B. Hillier, S. Iida and A. Penn (2014) "Space Syntax methodology", and also by A. Turner (2004) <https://discovery.ucl.ac.uk/id/eprint/2651> "Depthmap 4: a researcher's handbook".
This package provides a tool for analyzing conjoint experiments using Bayesian Additive Regression Trees ('BART'), a machine learning method developed by Chipman, George and McCulloch
(2010) <doi:10.1214/09-AOAS285>. This tool focuses specifically on estimating, identifying, and visualizing the heterogeneity within marginal component effects, at the observation- and individual-level. It uses a variable importance measure ('VIMP') with delete-d jackknife variance estimation, following Ishwaran and Lu (2019) <doi:10.1002/sim.7803>, to obtain bias-corrected estimates of which variables drive heterogeneity in the predicted individual-level effects.
Tool for the development of multi-linear QSPR/QSAR models (Quantitative structure-property/activity relationship). Theses models are used in chemistry, biology and pharmacy to find a relationship between the structure of a molecule and its property (such as activity, toxicology but also physical properties). The various functions of this package allows: selection of descriptors based of variances, intercorrelation and user expertise; selection of the best multi-linear regression in terms of correlation and robustness; methods of internal validation (Leave-One-Out, Leave-Many-Out, Y-scrambling) and external using test sets.
This package provides tools for motif analysis in multi-level networks. Multi-level networks combine multiple networks in one, e.g. social-ecological networks. Motifs are small configurations of nodes and edges (subgraphs) occurring in networks. motifr can visualize multi-level networks, count multi-level network motifs and compare motif occurrences to baseline models. It also identifies contributions of existing or potential edges to motifs to find critical or missing edges. The package is in many parts an R wrapper for the excellent SESMotifAnalyser
Python package written by Tim Seppelt.
This package provides a flexible framework for power analysis using Monte Carlo simulation for settings in which considerations of the correlations between predictors are important. Users can set up a data generative model that preserves dependence structures among predictors given existing data (continuous, binary, or ordinal). Users can also generate power curves to assess the trade-offs between sample size, effect size, and power of a design. This package includes several statistical models common in environmental mixtures studies. For more details and tutorials, see Nguyen et al. (2022) <arXiv:2209.08036>
.
Extends multiverse package (Sarma A., Kale A., Moon M., Taback N., Chevalier F., Hullman J., Kay M., 2021) <doi:10.31219/osf.io/yfbwm>, which allows users perform to create explorable multiverse analysis in R. This extension provides an additional level of abstraction to the multiverse package with the aim of creating user friendly syntax to researchers, educators, and students in statistics. The mverse syntax is designed to allow piping and takes hints from the tidyverse grammar. The package allows users to define and inspect multiverse analysis using familiar syntax in R.
Parallel Constraint Satisfaction (PCS) models are an increasingly common class of models in Psychology, with applications to reading and word recognition (McClelland
& Rumelhart, 1981), judgment and decision making (Glöckner & Betsch, 2008; Glöckner, Hilbig, & Jekel, 2014), and several other fields (e.g. Read, Vanman, & Miller, 1997). In each of these fields, they provide a quantitative model of psychological phenomena, with precise predictions regarding choice probabilities, decision times, and often the degree of confidence. This package provides the necessary functions to create and simulate basic Parallel Constraint Satisfaction networks within R.
This package provides a comprehensive, user-friendly package for label-free proteomics data analysis and machine learning-based modeling. Data generated from MaxQuant
can be easily used to conduct differential expression analysis, build predictive models with top protein candidates, and assess model performance. promor includes a suite of tools for quality control, visualization, missing data imputation (Lazar et. al. (2016) <doi:10.1021/acs.jproteome.5b00981>), differential expression analysis (Ritchie et. al. (2015) <doi:10.1093/nar/gkv007>), and machine learning-based modeling (Kuhn (2008) <doi:10.18637/jss.v028.i05>).
The semiparametric accelerated failure time (AFT) model is an attractive alternative to the Cox proportional hazards model. This package provides a suite of functions for fitting one popular estimator of the semiparametric AFT model, the regularized Gehan estimator. Specifically, we provide functions for cross-validation, prediction, coefficient extraction, and visualizing both trace plots and cross-validation curves. For further details, please see Suder, P. M. and Molstad, A. J., (2022+) Scalable algorithms for semiparametric accelerated failure time models in high dimensions, to appear in Statistics in Medicine <doi:10.1002/sim.9264>.
Probability mass (d), distribution (p), quantile (q), and random number generating (r and rt) functions for the time-varying right-truncated geometric (tvgeom) distribution. Also provided are functions to calculate the first and second central moments of the distribution. The tvgeom distribution is similar to the geometric distribution, but the probability of success is allowed to vary at each time step, and there are a limited number of trials. This distribution is essentially a Markov chain, and it is useful for modeling Markov chain systems with a set number of time steps.
Detection of rare aberrant splicing events in transcriptome profiles. Read count ratio expectations are modeled by an autoencoder to control for confounding factors in the data. Given these expectations, the ratios are assumed to follow a beta-binomial distribution with a junction specific dispersion. Outlier events are then identified as read-count ratios that deviate significantly from this distribution. FRASER is able to detect alternative splicing, but also intron retention. The package aims to support diagnostics in the field of rare diseases where RNA-seq is performed to identify aberrant splicing defects.
For studying recurrent disease and death with competing risks, comparisons based on the well-known cumulative incidence function can be confounded by different prevalence rates of the competing events. Alternatively, comparisons of the conditional distribution of the survival time given the failure event type are more relevant for investigating the prognosis of different patterns of recurrence disease. This package implements a nonparametric estimator for the conditional cumulative incidence function and a nonparametric conditional bivariate cumulative incidence function for the bivariate gap times proposed in Huang et al. (2016) <doi:10.1111/biom.12494>.
API Client for the Climate Hazards Center CHIRPS and CHIRTS'. The CHIRPS data is a quasi-global (50°S â 50°N) high-resolution (0.05 arc-degrees) rainfall data set, which incorporates satellite imagery and in-situ station data to create gridded rainfall time series for trend analysis and seasonal drought monitoring. CHIRTS is a quasi-global (60°S â 70°N), high-resolution data set of daily maximum and minimum temperatures. For more details on CHIRPS and CHIRTS data please visit its official home page <https://www.chc.ucsb.edu/data>.
DataSHIELD
is an infrastructure and series of R packages that enables the remote and non-disclosive analysis of sensitive research data. This package is the DataSHIELD
interface implementation for Opal', which is the data integration application for biobanks by OBiBa
'. Participant data, once collected from any data source, must be integrated and stored in a central data repository under a uniform model. Opal is such a central repository. It can import, process, validate, query, analyze, report, and export data. Opal is the reference implementation of the DataSHIELD
infrastructure.
Fast and flexible Kalman filtering and smoothing implementation utilizing sequential processing, designed for efficient parameter estimation through maximum likelihood estimation. Sequential processing is a univariate treatment of a multivariate series of observations and can benefit from computational efficiency over traditional Kalman filtering when independence is assumed in the variance of the disturbances of the measurement equation. Sequential processing is described in the textbook of Durbin and Koopman (2001, ISBN:978-0-19-964117-8). FKF.SP was built upon the existing FKF package and is, in general, a faster Kalman filter/smoother.
This package implements the Generalized Method of Wavelet Moments with Exogenous Inputs estimator (GMWMX) presented in Voirol, L., Xu, H., Zhang, Y., Insolia, L., Molinari, R. and Guerrier, S. (2024) <doi:10.48550/arXiv.2409.05160>
. The GMWMX estimator allows to estimate functional and stochastic parameters of linear models with correlated residuals in presence of missing data. The gmwmx2 package provides functions to load and plot Global Navigation Satellite System (GNSS) data from the Nevada Geodetic Laboratory and functions to estimate linear model model with correlated residuals in presence of missing data.
This package provides statistical methods to check if a parametric family of conditional density functions fits to some given dataset of covariates and response variables. Different test statistics can be used to determine the goodness-of-fit of the assumed model, see Andrews (1997) <doi:10.2307/2171880>, Bierens & Wang (2012) <doi:10.1017/S0266466611000168>, Dikta & Scheer (2021) <doi:10.1007/978-3-030-73480-0> and Kremling & Dikta (2024) <doi:10.48550/arXiv.2409.20262>
. As proposed in these papers, the corresponding p-values are approximated using a parametric bootstrap method.
Optimized for handling complex datasets in environmental and ecological research, this package offers functionality that is not fully met by general-purpose packages. It provides two key functions, summarize_data()
', which summarizes datasets, and plot_means()
', which creates plots with error bars. The plot_means()
function incorporates error bars by default, allowing quick visualization of uncertainties, crucial in ecological studies. It also streamlines workflows for grouped datasets (e.g., by species or treatment), making it particularly user-friendly and reducing the complexity and time required for data summarization and visualization.
R interface to PRIMME <https://www.cs.wm.edu/~andreas/software/>, a C library for computing a few eigenvalues and their corresponding eigenvectors of a real symmetric or complex Hermitian matrix, or generalized Hermitian eigenproblem. It can also compute singular values and vectors of a square or rectangular matrix. PRIMME finds largest, smallest, or interior singular/eigenvalues and can use preconditioning to accelerate convergence. General description of the methods are provided in the papers Stathopoulos (2010, <doi:10.1145/1731022.1731031>) and Wu (2017, <doi:10.1137/16M1082214>). See citation("PRIMME") for details.
The R implementation of TIGER. TIGER integrates random forest algorithm into an innovative ensemble learning architecture. Benefiting from this advanced architecture, TIGER is resilient to outliers, free from model tuning and less likely to be affected by specific hyperparameters. TIGER supports targeted and untargeted metabolomics data and is competent to perform both intra- and inter-batch technical variation removal. TIGER can also be used for cross-kit adjustment to ensure data obtained from different analytical assays can be effectively combined and compared. Reference: Han S. et al. (2022) <doi:10.1093/bib/bbab535>.
This package provides classes and functions for quality control, filtering, normalization and differential expression analysis of pre-processed `RNA-seq` data. Data can be imported from `SummarizedExperiment`
as well as `matrix` objects and can be annotated from `BioMart`
. Filtering for genes without too low expression or containing required annotations, as well as filtering for samples with sufficient correlation to other samples or total number of reads is supported. The standard normalization methods including cpm, rpkm and tpm can be used, and DESeq2` as well as voom differential expression analyses are available.
This software is meant to be used for classification of images of cell-based assays for neuronal surface autoantibody detection or similar techniques. It takes imaging files as input and creates a composite score from these, that for example can be used to classify samples as negative or positive for a certain antibody-specificity. The reason for its name is that I during its creation have thought about the individual picture as an archielago where we with different filters control the water level as well as ground characteristica, thereby finding islands of interest.
BASiCS is an integrated Bayesian hierarchical model to perform statistical analyses of single-cell RNA sequencing datasets in the context of supervised experiments (where the groups of cells of interest are known a priori. BASiCS performs built-in data normalisation (global scaling) and technical noise quantification (based on spike-in genes). BASiCS provides an intuitive detection criterion for highly (or lowly) variable genes within a single group of cells. Additionally, BASiCS can compare gene expression patterns between two or more pre-specified groups of cells.