This k-means algorithm clusters data with missing values and, as a by-product, completes the data set. The implementation can handle missing values in multiple variables and is computationally efficient, since it iteratively uses the current cluster assignment to define a plausible distribution for missing value imputation. Weights are used to shrink early random draws for missing values (i.e., draws based on the cluster assignments after only a few iterations) towards the global mean of each feature. This shrinkage slowly fades out after a fixed number of iterations to reflect the increasing credibility of the cluster assignments. See the vignette for details.
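A minimal base-R sketch of the imputation step described above (illustrative names and simplifications, not the package's own code): a missing entry is replaced by a weighted mix of the feature's global mean and a random draw centred on the current cluster's mean, with the weight on the global mean fading out as the iterations progress.

    # Illustrative sketch of one shrinkage-weighted imputation pass (assumed names).
    # x: data matrix; miss: logical matrix of originally missing entries;
    # cluster: current cluster assignment; iter: current iteration number.
    impute_pass <- function(x, miss, cluster, iter, fade_after = 10) {
      w <- max(0, 1 - iter / fade_after)                # weight on the global mean
      for (j in seq_len(ncol(x))) {
        mu_glob <- mean(x[!miss[, j], j])               # global mean of observed values
        for (i in which(miss[, j])) {
          in_k <- cluster == cluster[i] & !miss[, j]    # observed members of i's cluster
          mu_k <- mean(x[in_k, j])
          sd_k <- if (sum(in_k) > 1) sd(x[in_k, j]) else 0
          draw <- rnorm(1, mu_k, sd_k)                  # plausible draw given the cluster
          x[i, j] <- w * mu_glob + (1 - w) * draw       # shrink early draws to global mean
        }
      }
      x
    }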
This package provides several functions to simplify using the glmnet package: a) converting data frames into matrices ready for glmnet; b) imputing missing variables multiple times; c) fitting and applying prediction models straightforwardly; d) assigning observations to folds in a balanced way; e) cross-validating the models; f) selecting the most representative model across imputations and folds; and g) assessing the relevance of the model regressors; as described in several publications: Solanes et al. (2022) <doi:10.1038/s41537-022-00309-w>, Palau et al. (2023) <doi:10.1016/j.rpsm.2023.01.001>, and Sobregrau et al. (2024) <doi:10.1016/j.jpsychores.2024.111656>.
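For context, a sketch of the plain glmnet workflow that such helpers streamline (my_predictors and my_outcome are placeholder objects; the naive fold assignment below is exactly what balanced fold assignment improves on):

    # Plain glmnet workflow (illustrative data names, not this package's functions).
    library(glmnet)
    x <- model.matrix(~ . - 1, data = my_predictors)    # data frame -> numeric matrix
    foldid <- sample(rep(1:10, length.out = nrow(x)))   # naive, unbalanced fold assignment
    fit <- cv.glmnet(x, my_outcome, family = "binomial", foldid = foldid)
    pred <- predict(fit, newx = x, s = "lambda.min", type = "response")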
This package provides a flexible and easy-to-use interface for the soil-vegetation-atmosphere transport (SVAT) model LWF-BROOK90, written in Fortran. The model simulates daily transpiration, interception, soil and snow evaporation, streamflow, and soil water fluxes through a soil profile covered with vegetation, as described in Hammel & Kennel (2001, ISBN:978-3-933506-16-0) and Federer et al. (2003) <doi:10.1175/1525-7541(2003)004%3C1276:SOAETS%3E2.0.CO;2>. A set of high-level functions for model setup, execution, and parallelization provides easy access to plot-level SVAT simulations, as well as multi-run and large-scale applications.
The landmark approach allows survival predictions to be updated dynamically as new measurements from an individual are recorded. The idea is to set predefined time points, known as "landmark times", and to fit a model at each landmark time using only the individuals still in the risk set. This package allows the longitudinal data to be modelled using either last observation carried forward or linear mixed effects modelling. There is also the option to model competing risks, either through cause-specific Cox regression or Fine-Gray regression. To find out more about the methods in this package, please see <https://isobelbarrott.github.io/Landmarking/articles/Landmarking>.
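A generic sketch of a single landmark model fitted with the survival package (illustrative data set and column names; this is not the package's interface): at landmark time s, only individuals still event-free are kept, their last covariate value before s is carried forward, and a Cox model is fitted on the remaining follow-up.

    # Landmark-model sketch (assumed objects: dat with time, status, age, marker_locf).
    library(survival)
    s <- 2                                           # landmark time
    risk_set <- subset(dat, time > s)                # individuals still at risk at s
    # 'marker_locf' holds the last biomarker value observed before s (LOCF)
    fit <- coxph(Surv(time - s, status) ~ marker_locf + age, data = risk_set)
    summary(fit)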
Programs to find the sample size or power of studies using the Sequential Parallel Comparison Design (SPCD) and programs to analyze such studies. This is a clinical trial design where patients initially on placebo who did not respond are re-randomized between placebo and active drug in a second phase, and the results of the two phases are pooled. The method of analyzing binary data with this design is described in Fava, Evins, Dorer and Schoenfeld (2003) <doi:10.1159/000069738>, and the method of analyzing continuous data is described in Chen, Yang, Hung and Wang (2011) <doi:10.1016/j.cct.2011.04.006>.
This package provides tools to assess the association between two spatial processes. Currently, several methodologies are implemented: a modified t-test to perform hypothesis testing about the independence between the processes, a suitable nonparametric correlation coefficient, the codispersion coefficient, and an F test for assessing the multiple correlation between one spatial process and several others. Functions for image processing and computing the spatial association between images are also provided. Functions contained in the package are intended to accompany Vallejos, R., Osorio, F., Bevilacqua, M. (2020). Spatial Relationships Between Two Georeferenced Variables: With Applications in R. Springer, Cham <doi:10.1007/978-3-030-56681-4>.
This package provides a conditional independence test that can be applied both to univariate and multivariate random variables. The test is based on a weighted form of the sample covariance of the residuals after a nonlinear regression on the conditioning variables. Details are described in Scheidegger, Hoerrmann and Buehlmann (2021) "The Weighted Generalised Covariance Measure" <arXiv:2111.04361>. The test is a generalisation of the Generalised Covariance Measure (GCM) implemented in the R package GeneralisedCovarianceMeasure by Jonas Peters and Rajen D. Shah based on Shah and Peters (2020) "The Hardness of Conditional Independence Testing and the Generalised Covariance Measure" <arXiv:1804.07203>.
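To make the construction concrete, here is a sketch of the unweighted GCM statistic that the weighted test generalises, using linear fits in place of the nonlinear regressions (x, y, and z are illustrative vectors, not this package's API):

    # Unweighted GCM statistic sketch (illustrative, univariate case).
    rx <- resid(lm(x ~ z))                    # residuals of X given Z
    ry <- resid(lm(y ~ z))                    # residuals of Y given Z
    R  <- rx * ry                             # products of residuals
    Tn <- sqrt(length(R)) * mean(R) / sd(R)   # approx. standard normal under H0
    pval <- 2 * pnorm(-abs(Tn))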
The main aim is to further facilitate the creation of exercises based on the package exams by Grün, B., and Zeileis, A. (2009) <doi:10.18637/jss.v029.i10>. Creating effective student exercises involves challenges such as creating appropriate data sets and ensuring access to intermediate values for accurate explanation of solutions. The functionality includes the generation of univariate and bivariate data including simple time series, functions for theoretical distributions and their approximation, statistical and mathematical calculations for tasks in basic statistics courses as well as general tasks such as string manipulation, LaTeX/HTML formatting and the editing of XML task files for Moodle.
Functions and support for extracting medication and dosing information from free-text clinical notes. Medication entities that the basic medExtractR implementation can extract include drug name, strength, dose amount, dose, frequency, intake time, dose change, and time of last dose. The basic medExtractR is outlined in Weeks, Beck, McNeer, Williams, Bejan, Denny, and Choi (2020) <doi:10.1093/jamia/ocz207>. The extended medExtractR_tapering implementation is intended to extract dosing information for tapering schedules, which are far more complex. The tapering extension allows for the extraction of additional entities including dispense amount, refills, dose schedule, time keyword, transition, and preposition.
This package provides a collection of routines for finding reference limits using, where appropriate, QQ methodology. All use a data vector X of cases from the reference population. The default is to get the central 95% reference range of the population, namely the 2.5th and 97.5th percentiles, with optional adjustment of the range. Along with the reference limits, we want confidence intervals, which for historical reasons are typically at 90% confidence. A full analysis therefore provides six numbers: the upper and lower reference limits, and a confidence interval for each. For application details, see Hawkins and Esquivel (2024) <doi:10.1093/jalm/jfad109>.
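A base-R illustration of the six numbers involved (not this package's routines): simple nonparametric limits with bootstrap 90% confidence intervals, assuming X is the reference-population data vector.

    # Reference limits and bootstrap confidence intervals (illustrative only).
    limits <- quantile(X, c(0.025, 0.975))                       # lower / upper reference limits
    boot   <- replicate(2000, quantile(sample(X, replace = TRUE), c(0.025, 0.975)))
    ci_lo  <- quantile(boot[1, ], c(0.05, 0.95))                 # 90% CI for the lower limit
    ci_up  <- quantile(boot[2, ], c(0.05, 0.95))                 # 90% CI for the upper limit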
We consider the problem where we observe k vectors (possibly of different lengths), each representing an independent multinomial random vector. For a given function that takes in the concatenated vector of multinomial probabilities and outputs a real number, the package provides a Monte Carlo procedure for estimating an exact p-value and confidence interval. The resulting inference is valid even in small samples, when the parameter is on the boundary, and when the function is not differentiable at the parameter value, all situations where asymptotic methods and the bootstrap would fail. For more details see Sachs, Fay, and Gabriel (2025) <doi:10.48550/arXiv.2406.19141>.
Get z-scores, percentiles, absolute values, and percent of predicted relative to a reference cohort. Functionality requires installing the data packages adiposerefdata and musclerefdata. For more information on the underlying research, please visit our website, which also includes a graphical interface. The models and underlying data are described in Marquardt JP et al. (planned publication 2025; reserved doi 10.1097/RLI.0000000000001104), "Subcutaneous and Visceral adipose tissue Reference Values from Framingham Heart Study Thoracic and Abdominal CT", *Investigative Radiology*, and Tonnesen PE et al. (2023), "Muscle Reference Values from Thoracic and Abdominal CT for Sarcopenia Assessment [column] The Framingham Heart Study", *Investigative Radiology*, <doi:10.1097/RLI.0000000000001012>.
Efficiently implements the Graphical Lasso algorithm, utilizing the Armadillo C++ library for rapid computation. This algorithm introduces an L1 penalty to derive sparse inverse covariance matrices from observations of multivariate normal distributions. Features include the generation of random and structured sparse covariance matrices, beneficial for simulations, statistical method testing, and educational purposes in graphical modeling. A unique function for regularization parameter selection based on predefined sparsity levels is also offered, catering to users with specific sparsity requirements in their models. The methodology for sparse inverse covariance estimation implemented in this package is based on the work of Friedman, Hastie, and Tibshirani (2008) <doi:10.1093/biostatistics/kxm045>.
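For orientation, the penalised problem being solved can be illustrated with the well-established glasso package from CRAN (a different implementation; object names here are illustrative and this is not the API of the package described above): maximise log det(Theta) - tr(S Theta) - rho * ||Theta||_1 over positive-definite Theta.

    # Graphical lasso illustration using the CRAN glasso package (not this package).
    library(glasso)
    S   <- cov(X)                    # sample covariance of multivariate normal data X
    fit <- glasso(S, rho = 0.1)      # L1 penalty rho controls the sparsity level
    Theta_hat <- fit$wi              # estimated sparse inverse covariance matrix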
Automate the explanatory analysis of machine learning predictive models. Generate advanced interactive model explanations in the form of a serverless HTML site with only one line of code. This tool is model-agnostic and therefore compatible with most black-box predictive models and frameworks. The main function computes various (instance- and model-level) explanations and produces a customisable dashboard, which consists of multiple panels for plots with their short descriptions. It is possible to easily save the dashboard and share it with others. modelStudio facilitates the process of Interactive Explanatory Model Analysis introduced in Baniecki et al. (2023) <doi:10.1007/s10618-023-00924-w>.
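An illustrative call (model, X, and y are placeholder objects): the dashboard is typically built in one line from a DALEX explainer.

    # Build the interactive dashboard from a DALEX explainer (illustrative objects).
    library(DALEX)
    library(modelStudio)
    explainer <- explain(model, data = X, y = y, label = "my model")
    modelStudio(explainer)           # generates the serverless HTML dashboard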
The SparseArray package is an infrastructure package that provides an array-like container for efficient in-memory representation of multidimensional sparse data in R. The package defines the SparseArray virtual class and two concrete subclasses: COO_SparseArray and SVT_SparseArray. Each subclass uses its own internal representation of the nonzero multidimensional data, the "COO layout" and the "SVT layout", respectively. SVT_SparseArray objects mimic as much as possible the behavior of ordinary matrix and array objects in base R. In particular, they support most of the "standard matrix and array API" defined in base R and in the matrixStats package from CRAN.
This package provides a comprehensive framework for batch effect diagnostics, harmonization, and post-harmonization downstream analysis. Features include interactive visualization tools, robust statistical tests, and a range of harmonization techniques. Additionally, ComBatFamQC enables the creation of life-span age trend plots with estimated age-adjusted centiles and facilitates the generation of covariate-corrected residuals for analytical purposes. Methods for harmonization are based on approaches described in Johnson et al. (2007) <doi:10.1093/biostatistics/kxj037>, Beer et al. (2020) <doi:10.1016/j.neuroimage.2020.117129>, Pomponio et al. (2020) <doi:10.1016/j.neuroimage.2019.116450>, and Chen et al. (2021) <doi:10.1002/hbm.25688>.
Calculates the probabilities of k successes given n trials of a binomial random variable with non-negative correlation across trials. The function takes as inputs scalar values for the level of correlation or association between trials, the success probability, and the number of trials, plus an optional input specifying the number of bits of precision used in the calculation and an optional input specifying whether the calculation follows the approach of Witt (2014) <doi:10.1080/03610926.2012.725148> or of Kuk (2004) <doi:10.1046/j.1467-9876.2003.05369.x>. The output is a (trials+1)-dimensional vector containing the likelihoods of 0, 1, ..., trials successes.
The stepwise variable selection procedure (with iterations between the forward and backward steps) can be used to obtain the best candidate final regression model in regression analysis. All the relevant covariates are put on the variable list to be selected. The significance levels for entry (SLE) and for stay (SLS) are usually set to 0.15 (or larger) to be conservative. Then, with the aid of substantive knowledge, the best candidate final regression model is identified manually by dropping the covariates with p-values > 0.05 one at a time until all regression coefficients are significantly different from 0 at the chosen alpha level of 0.05.
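A base-R sketch of that workflow, with step() standing in for the package's SLE/SLS-based selection (dat and outcome are illustrative; step() uses AIC rather than significance levels):

    # Liberal automatic selection first, then manual pruning by p-value.
    full <- glm(outcome ~ ., data = dat, family = binomial)
    cand <- step(full, direction = "both", trace = FALSE)   # iterate forward/backward steps
    summary(cand)$coefficients   # then drop covariates with p > 0.05 one at a time and refit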
Implementations of the multiple testing procedures for discrete tests described in the paper Döhler, Durand and Roquain (2018), "New FDR bounds for discrete and heterogeneous tests" <doi:10.1214/18-EJS1441>. The main procedures of the paper (HSU and HSD), their adaptive counterparts (AHSU and AHSD), and the HBR variant are available. They are coded to take as input either the results of a test procedure from the package DiscreteTests, or a set of observed p-values and their discrete support under their nulls. A shortcut function to obtain such p-values and supports is also provided, along with a wrapper allowing discrete procedures to be applied directly to data.
This package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models, which is particularly effective for sparse data and binary outcomes. The test provides an improved alternative to the traditional Hosmer-Lemeshow test by using a modified Pearson chi-square statistic with data-dependent grouping. It is based on Farrington's (1996) theoretical framework but simplified for practical implementation with binary data. Includes functions for both the original Farrington test (for grouped data) and the new Ebrahim-Farrington test (for binary data with automatic grouping). For more details see Hosmer (1980) <doi:10.1080/03610928008827941> and Farrington (1996) <doi:10.1111/j.2517-6161.1996.tb02086.x>.
Pooling estimates reported in meta-analyses (literature-based, LB) and estimates based on individual participant data (IPD) is not straightforward, as the details of the LB nonlinear function estimate are usually not reported. This package pools the nonlinear IPD dose-response estimates, based on a natural cubic spline from lm or glm, with the pointwise LB estimates and their estimated variances. Details will be presented in Härkänen, Tapanainen, Sares-Jäske, Männistö, Kaartinen and Paalanen (2025), "Novel pooling method for nonlinear cohort analysis and meta-analysis estimates: Predicting health outcomes based on climate-friendly diets" (under revision) <https://journals.lww.com/epidem/pages/default.aspx>.
It provides functions to generate a correlation matrix from a genetic dataset and to use this matrix to predict the phenotype of an individual from the phenotypes of the remaining individuals through kriging. Kriging is a geostatistical method for optimal prediction, or best unbiased linear prediction. It consists of predicting the value of a variable at an unobserved location as a weighted sum of the variable at observed locations. Intuitively, it works like linear regression in reverse: instead of computing the correlation (univariate regression coefficients are simply scaled correlations) between a dependent variable Y and independent variables X, it uses the known correlation between X and Y to predict Y.
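A linear-algebra sketch of that kriging predictor (object names are illustrative, not this package's functions): with K_oo the genetic correlations among observed individuals, k_no the correlations between the new individual and the observed ones, and y the centred observed phenotypes, the prediction is a weighted sum of the observed phenotypes.

    # Kriging / BLUP prediction of one new phenotype (illustrative objects).
    y_new_hat <- drop(k_no %*% solve(K_oo) %*% y)   # weights come from the correlation matrix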
It estimates the parameters of a partially linear censored regression model via maximum penalized likelihood through the ECME algorithm. The model belongs to the semiparametric class, including a parametric and a nonparametric component. The error term is assumed to follow a scale mixture of normal (SMN) distribution, a family that includes well-known heavy-tailed distributions such as the Student-t, among others. To examine the performance of the fitted model, case-deletion and local influence techniques are provided to show its robustness against outlying and influential observations. This work is based on Ferreira, C. S., & Paula, G. A. (2017) <doi:10.1080/02664763.2016.1267124>, but considering the SMN family.
This package simulates regulation of ceRNA (competing endogenous RNA) expression levels after an expression level change in one or more miRNAs/mRNAs. The methodology adopted by the package can incorporate any ceRNA (circRNA, lincRNA, etc.) into the miRNA:target interaction network. The package basically distributes miRNA expression over the available ceRNAs, where each ceRNA attracts miRNAs in proportion to its amount. The package can also utilize multiple parameters that modify the miRNA effect on its target (seed type, binding energy, binding location, etc.). The functions handle the given dataset as a graph object, and the processes progress via edge and node variables.