Randomly reassigns the group identifications to one of the variables of the database, say Treatment, and randomly reassigns the observation numbers of the dataset. Reorders the observations according to these new numbers. Centers each group of Treatment at the grand mean in order to further mask the treatment. An unmasking function is provided so that the user can identify the potential outliers in terms of their original values when blinding is no longer needed. It is suggested that a forward search procedure be performed on the masked data. Details of some forward search functions may be found in <https://CRAN.R-project.org/package=forsearch>.
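The masking steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: the function names `mask` and `unmask`, the shape of the returned key, and the reading of "reassigns the group identifications" as a random recoding of the group labels are all hypothetical.

```python
import random


def mask(values, groups, seed=0):
    """Blind a data set (hypothetical sketch, not the package's API).

    Centers each group at the grand mean, recodes group labels at random,
    assigns random observation numbers, and reorders by those numbers.
    Returns the masked rows and a key that allows later unmasking.
    """
    rng = random.Random(seed)
    n = len(values)
    grand_mean = sum(values) / n

    # Center each group at the grand mean so location no longer reveals treatment.
    group_means = {}
    for g in set(groups):
        members = [v for v, gg in zip(values, groups) if gg == g]
        group_means[g] = sum(members) / len(members)
    centered = [v - group_means[g] + grand_mean for v, g in zip(values, groups)]

    # Randomly recode the group labels as integers 1..k.
    labels = sorted(set(groups))
    codes = list(range(1, len(labels) + 1))
    rng.shuffle(codes)
    label_code = dict(zip(labels, codes))

    # Randomly reassign observation numbers; new_ids[i] is the masked
    # number given to original observation i.
    new_ids = list(range(1, n + 1))
    rng.shuffle(new_ids)

    # Reorder the observations according to their new numbers.
    masked = sorted((new_ids[i], label_code[groups[i]], centered[i]) for i in range(n))
    key = {"ids": {new_ids[i]: i for i in range(n)}}
    return masked, key


def unmask(masked_ids, key, values, groups):
    """Map masked observation numbers back to (original index, group, value)."""
    return [
        (key["ids"][m], groups[key["ids"][m]], values[key["ids"][m]])
        for m in masked_ids
    ]
```

Keeping the key separate from the masked rows is the point of the design: an analyst can flag potential outliers by masked number, and only whoever holds the key can translate them back to the original observations.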
Generates a list, with a size defined by the user, containing the main scientific references and the frequency distribution of authors and journals in the list obtained. The database is a dataframe with academic production metadata made available by bibliographic collections such as Scopus, Web of Science, etc. The temporal evolution of scientific production on a given topic is presented and ordered lists of articles are constructed by number of citations and of authors and journals by level of productivity. Massimo Aria, Corrado Cuccurullo. (2017) <doi:10.1016/j.joi.2017.08.007>. Caibo Zhou, Wenyan Song. (2021) <doi:10.1016/j.jclepro.2021.126943>.
This k-means algorithm is able to cluster data with missing values and as a by-product completes the data set. The implementation can deal with missing values in multiple variables and is computationally efficient since it iteratively uses the current cluster assignment to define a plausible distribution for missing value imputation. Weights are used to shrink early random draws for missing values (i.e., draws based on the cluster assignments after few iterations) towards the global mean of each feature. This shrinkage slowly fades out after a fixed number of iterations to reflect the increasing credibility of cluster assignments. See the vignette for details.
This package provides several functions to simplify using the glmnet package: a) converting data frames into matrices ready for glmnet; b) imputing missing variables multiple times; c) fitting and applying prediction models straightforwardly; d) assigning observations to folds in a balanced way; e) cross-validating the models; f) selecting the most representative model across imputations and folds; and g) getting the relevance of the model regressors; as described in several publications: Solanes et al. (2022) <doi:10.1038/s41537-022-00309-w>, Palau et al. (2023) <doi:10.1016/j.rpsm.2023.01.001>, Sobregrau et al. (2024) <doi:10.1016/j.jpsychores.2024.111656>.
The landmark approach allows survival predictions to be updated dynamically as new measurements from an individual are recorded. The idea is to set predefined time points, known as "landmark times", and form a model at each landmark time using only the individuals in the risk set. This package allows the longitudinal data to be modelled either using the last observation carried forward or linear mixed effects modelling. There is also the option to model competing risks, either through cause-specific Cox regression or Fine-Gray regression. To find out more about the methods in this package, please see <https://isobelbarrott.github.io/Landmarking/articles/Landmarking>.
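The construction of a landmark data set can be sketched as follows. This is a simplified illustration, not the package's interface: the function name `landmark_dataset` and the dictionary fields (`id`, `event_time`, `status`, `measurements`) are hypothetical, and only the last-observation-carried-forward option is shown.

```python
def landmark_dataset(subjects, landmark_time):
    """Build the data used to fit a model at one landmark time (hypothetical sketch).

    subjects: list of dicts with keys 'id', 'event_time', 'status' and
    'measurements', where 'measurements' is a list of (time, value) pairs.
    """
    rows = []
    for s in subjects:
        # Risk set: keep only individuals still event-free and uncensored
        # at the landmark time.
        if s["event_time"] <= landmark_time:
            continue
        # Use only measurements recorded up to the landmark time.
        past = [(t, v) for t, v in s["measurements"] if t <= landmark_time]
        if not past:
            continue  # nothing to carry forward for this individual
        # Last observation carried forward: take the most recent value.
        _, last_value = max(past, key=lambda pair: pair[0])
        rows.append(
            {
                "id": s["id"],
                "locf_value": last_value,
                "time_from_landmark": s["event_time"] - landmark_time,
                "status": s["status"],
            }
        )
    return rows
```

Repeating this for each predefined landmark time yields one data set per landmark, each of which would then be passed to a survival model.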
This package provides a flexible and easy-to-use interface for the soil vegetation atmosphere transport (SVAT) model LWF-BROOK90, written in Fortran. The model simulates daily transpiration, interception, soil and snow evaporation, streamflow and soil water fluxes through a soil profile covered with vegetation, as described in Hammel & Kennel (2001, ISBN:978-3-933506-16-0) and Federer et al. (2003) <doi:10.1175/1525-7541(2003)004%3C1276:SOAETS%3E2.0.CO;2>. A set of high-level functions for model set up, execution and parallelization provides easy access to plot-level SVAT simulations, as well as multi-run and large-scale applications.
This package provides tools to assess the association between two spatial processes. Currently, several methodologies are implemented: A modified t-test to perform hypothesis testing about the independence between the processes, a suitable nonparametric correlation coefficient, the codispersion coefficient, and an F test for assessing the multiple correlation between one spatial process and several others. Functions for image processing and computing the spatial association between images are also provided. Functions contained in the package are intended to accompany Vallejos, R., Osorio, F., Bevilacqua, M. (2020). Spatial Relationships Between Two Georeferenced Variables: With Applications in R. Springer, Cham <doi:10.1007/978-3-030-56681-4>.
Programs to find the sample size or power of studies using the Sequential Parallel Comparison Design (SPCD) and programs to analyze such studies. This is a clinical trial design where patients initially on placebo who did not respond are re-randomized between placebo and active drug in a second phase and the results of the two phases are pooled. The method of analyzing binary data with this design is described in Fava, Evins, Dorer and Schoenfeld (2003) <doi:10.1159/000069738>, and the method of analyzing continuous data is described in Chen, Yang, Hung and Wang (2011) <doi:10.1016/j.cct.2011.04.006>.
This package provides a conditional independence test that can be applied both to univariate and multivariate random variables. The test is based on a weighted form of the sample covariance of the residuals after a nonlinear regression on the conditioning variables. Details are described in Scheidegger, Hoerrmann and Buehlmann (2021) "The Weighted Generalised Covariance Measure" <arXiv:2111.04361>. The test is a generalisation of the Generalised Covariance Measure (GCM) implemented in the R package GeneralisedCovarianceMeasure by Jonas Peters and Rajen D. Shah based on Shah and Peters (2020) "The Hardness of Conditional Independence Testing and the Generalised Covariance Measure" <arXiv:1804.07203>.
The main aim is to further facilitate the creation of exercises based on the package exams by Grün, B., and Zeileis, A. (2009) <doi:10.18637/jss.v029.i10>. Creating effective student exercises involves challenges such as creating appropriate data sets and ensuring access to intermediate values for accurate explanation of solutions. The functionality includes the generation of univariate and bivariate data including simple time series, functions for theoretical distributions and their approximation, statistical and mathematical calculations for tasks in basic statistics courses as well as general tasks such as string manipulation, LaTeX/HTML formatting and the editing of XML task files for Moodle.
Functions and support for medication and dosing information extraction from free-text clinical notes. Medication entities for the basic medExtractR implementation that can be extracted include drug name, strength, dose amount, dose, frequency, intake time, dose change, and time of last dose. The basic medExtractR is outlined in Weeks, Beck, McNeer, Williams, Bejan, Denny, Choi (2020) <doi:10.1093/jamia/ocz207>. The extended medExtractR_tapering implementation is intended to extract dosing information for tapering schedules, which are far more complex. The tapering extension allows for the extraction of additional entities including dispense amount, refills, dose schedule, time keyword, transition, and preposition.
We consider the problem where we observe k vectors (possibly of different lengths), each representing an independent multinomial random vector. For a given function that takes in the concatenated vector of multinomial probabilities and outputs a real number, this is a Monte Carlo estimation procedure of an exact p-value and confidence interval. The resulting inference is valid even in small samples, when the parameter is on the boundary, and when the function is not differentiable at the parameter value, all situations where asymptotic methods and the bootstrap would fail. For more details see Sachs, Fay, and Gabriel (2025) <doi:10.48550/arXiv.2406.19141>.
Get z-scores, percentiles, absolute values, and percent of predicted of a reference cohort. Functionality requires installing the data packages adiposerefdata and musclerefdata. For more information on the underlying research, please visit our website which also includes a graphical interface. The models and underlying data are described in Marquardt JP et al. (planned publication 2025; reserved doi 10.1097/RLI.0000000000001104), "Subcutaneous and Visceral adipose tissue Reference Values from Framingham Heart Study Thoracic and Abdominal CT", *Investigative Radiology* and Tonnesen PE et al. (2023), "Muscle Reference Values from Thoracic and Abdominal CT for Sarcopenia Assessment: The Framingham Heart Study", *Investigative Radiology*, <doi:10.1097/RLI.0000000000001012>.
Efficiently implements the Graphical Lasso algorithm, utilizing the Armadillo C++ library for rapid computation. This algorithm introduces an L1 penalty to derive sparse inverse covariance matrices from observations of multivariate normal distributions. Features include the generation of random and structured sparse covariance matrices, beneficial for simulations, statistical method testing, and educational purposes in graphical modeling. A unique function for regularization parameter selection based on predefined sparsity levels is also offered, catering to users with specific sparsity requirements in their models. The methodology for sparse inverse covariance estimation implemented in this package is based on the work of Friedman, Hastie, and Tibshirani (2008) <doi:10.1093/biostatistics/kxm045>.
Automate the explanatory analysis of machine learning predictive models. Generate advanced interactive model explanations in the form of a serverless HTML site with only one line of code. This tool is model-agnostic, therefore compatible with most of the black-box predictive models and frameworks. The main function computes various (instance and model-level) explanations and produces a customisable dashboard, which consists of multiple panels for plots with their short descriptions. It is possible to easily save the dashboard and share it with others. modelStudio facilitates the process of Interactive Explanatory Model Analysis introduced in Baniecki et al. (2023) <doi:10.1007/s10618-023-00924-w>.
This package provides a comprehensive framework for batch effect diagnostics, harmonization, and post-harmonization downstream analysis. Features include interactive visualization tools, robust statistical tests, and a range of harmonization techniques. Additionally, ComBatFamQC enables the creation of life-span age trend plots with estimated age-adjusted centiles and facilitates the generation of covariate-corrected residuals for analytical purposes. Methods for harmonization are based on approaches described in Johnson et al. (2007) <doi:10.1093/biostatistics/kxj037>, Beer et al. (2020) <doi:10.1016/j.neuroimage.2020.117129>, Pomponio et al. (2020) <doi:10.1016/j.neuroimage.2019.116450>, and Chen et al. (2021) <doi:10.1002/hbm.25688>.
Calculates the probabilities of k successes given n trials of a binomial random variable with non-negative correlation across trials. The function takes as scalar inputs the level of correlation or association between trials, the success probability, and the number of trials, plus an optional input specifying the number of bits of precision used in the calculation and an optional input specifying whether the calculation approach to be used is from Witt (2014) <doi:10.1080/03610926.2012.725148> or from Kuk (2004) <doi:10.1046/j.1467-9876.2003.05369.x>. The output is a (trials+1)-dimensional vector containing the likelihoods of 0, 1, ..., trials successes.
The stepwise variable selection procedure (with iterations between the forward and backward steps) can be used to obtain the best candidate final regression model in regression analysis. All the relevant covariates are put on the variable list to be selected. The significance levels for entry (SLE) and for stay (SLS) are usually set to 0.15 (or larger) for being conservative. Then, with the aid of substantive knowledge, the best candidate final regression model is identified manually by dropping the covariates with p value > 0.05 one at a time until all regression coefficients are significantly different from 0 at the chosen alpha level of 0.05.
The SparseArray package is an infrastructure package that provides an array-like container for efficient in-memory representation of multidimensional sparse data in R. The package defines the SparseArray virtual class and two concrete subclasses: COO_SparseArray and SVT_SparseArray. Each subclass uses its own internal representation of the nonzero multidimensional data, the "COO layout" and the "SVT layout", respectively. SVT_SparseArray objects mimic as much as possible the behavior of ordinary matrix and array objects in base R. In particular, they support most of the "standard matrix and array API" defined in base R and in the matrixStats package from CRAN.
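For readers unfamiliar with coordinate (COO) storage, the general idea can be sketched as below. This is a generic two-dimensional illustration only; it is not the internal representation used by the package, and the function names `dense_to_coo` and `coo_to_dense` are hypothetical.

```python
def dense_to_coo(matrix):
    """Store only the nonzero entries of a list-of-lists matrix.

    Returns the dimensions, the (row, column) coordinates of the nonzero
    entries, and the corresponding values, in row-major order.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    coords, values = [], []
    for i, row in enumerate(matrix):
        for j, x in enumerate(row):
            if x != 0:
                coords.append((i, j))
                values.append(x)
    return {"dim": (n_rows, n_cols), "coords": coords, "values": values}


def coo_to_dense(coo):
    """Rebuild the dense matrix from its coordinate representation."""
    n_rows, n_cols = coo["dim"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (i, j), x in zip(coo["coords"], coo["values"]):
        dense[i][j] = x
    return dense
```

The memory used grows with the number of nonzero entries rather than with the full size of the array, which is what makes such containers attractive for sparse data.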
Implementations of the multiple testing procedures for discrete tests described in the paper Döhler, Durand and Roquain (2018) "New FDR bounds for discrete and heterogeneous tests" <doi:10.1214/18-EJS1441>. The main procedures of the paper (HSU and HSD), their adaptive counterparts (AHSU and AHSD), and the HBR variant are available and are coded to take as input the results of a test procedure from package DiscreteTests, or a set of observed p-values and their discrete support under their nulls. A shortcut function to obtain such p-values and supports is also provided, along with a wrapper that applies discrete procedures directly to data.
It provides functions to generate a correlation matrix from a genetic dataset and to use this matrix to predict the phenotype of an individual by using the phenotypes of the remaining individuals through kriging. Kriging is a geostatistical method for optimal prediction or best linear unbiased prediction. It consists of predicting the value of a variable at an unobserved location as a weighted sum of the variable at observed locations. Intuitively, it works as a reverse linear regression: instead of computing correlations (univariate regression coefficients are simply scaled correlations) between a dependent variable Y and independent variables X, it uses known correlations between X and Y to predict Y.
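The "weighted sum" idea can be made concrete with a small sketch of simple kriging with a known mean, where the weights solve the linear system formed by the correlation matrix of the observed individuals. This is an illustrative assumption about the form of the predictor, not the package's code; the function names `solve` and `krige` and their arguments are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def krige(corr_observed, corr_target, y_observed, mean):
    """Predict an unobserved value as a weighted sum of observed ones.

    corr_observed: correlation matrix among the observed individuals.
    corr_target:   correlations between the target and each observed individual.
    Returns the prediction and the kriging weights.
    """
    weights = solve(corr_observed, corr_target)
    prediction = mean + sum(w * (y - mean) for w, y in zip(weights, y_observed))
    return prediction, weights
```

If the target is perfectly correlated with one observed individual, the weights collapse onto that individual and the prediction equals its phenotype; if the target is uncorrelated with everyone, the prediction falls back to the mean.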
It estimates the parameters of a partially linear censored regression model via maximum penalized likelihood through the ECME algorithm. The model belongs to the semiparametric class, which includes a parametric and a nonparametric component. The error term considered belongs to the scale-mixture of normal (SMN) distribution, which includes well-known heavy-tailed distributions such as the Student-t distribution, among others. To examine the performance of the fitted model, case-deletion and local influence techniques are provided to show its robustness against outlying and influential observations. This work is based on Ferreira, C. S., & Paula, G. A. (2017) <doi:10.1080/02664763.2016.1267124> but considering the SMN family.
This package simulates regulations of ceRNA (Competing Endogenous) expression levels after an expression level change in one or more miRNA/mRNAs. The methodology adopted by the package has the potential to incorporate any ceRNA (circRNA, lincRNA, etc.) into the miRNA:target interaction network. The package basically distributes miRNA expression over available ceRNAs, where each ceRNA attracts miRNAs in proportion to its amount. However, the package can also utilize multiple parameters that modify the miRNA effect on its target (seed type, binding energy, binding location, etc.). The functions handle the given dataset as a graph object, and the processes progress via edge and node variables.
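The proportional distribution step can be illustrated with a short sketch. It covers only the basic case described above (no seed type, binding energy, or binding location modifiers), and the function name `distribute_mirna` and its arguments are hypothetical rather than part of the package.

```python
def distribute_mirna(mirna_amount, cerna_amounts):
    """Split one miRNA's expression over its targets in proportion to their amounts.

    cerna_amounts: dict mapping each ceRNA name to its expression level.
    Returns a dict mapping each ceRNA name to the miRNA amount it attracts.
    """
    total = sum(cerna_amounts.values())
    if total == 0:
        # No target is expressed, so nothing attracts the miRNA.
        return {name: 0.0 for name in cerna_amounts}
    return {
        name: mirna_amount * amount / total
        for name, amount in cerna_amounts.items()
    }
```

The competition effect follows directly: raising the expression of one ceRNA increases its share of the miRNA pool and therefore reduces the miRNA amount acting on every other target of the same miRNA.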
Database search is the most widely used approach for peptide and protein identification in mass spectrometry-based proteomics studies. Our previous study showed that sample-specific protein databases derived from RNA-Seq data can better approximate the real protein pools in the samples and thus improve protein identification. More importantly, single nucleotide variations, short insertions and deletions, and novel junctions identified from RNA-Seq data make the protein database more complete and sample-specific. Here, we report an R package customProDB that enables the easy generation of customized databases from RNA-Seq data for proteomics search. This work bridges genomics and proteomics studies and facilitates cross-omics data integration.