This package provides functions for data manipulation, imputing missing values in an approximate Bayesian framework, diagnostics of the models used to generate the imputations, confidence-building mechanisms to validate some of the assumptions of the imputation algorithm, and functions to analyze multiply imputed data sets with the appropriate degree of sampling uncertainty.
Estimation/multiple imputation programs for mixed categorical and continuous data.
The mia package implements tools for microbiome analysis based on the SummarizedExperiment, SingleCellExperiment and TreeSummarizedExperiment infrastructure. Data wrangling and analysis in the context of taxonomic data is the main scope. Additional functions for common tasks, such as community indices calculation and summarization, are also implemented.
This package provides utilities for estimation for the multivariate inverse Gaussian distribution of Minami (2003) <doi:10.1081/STA-120025379>, including random vector generation and explicit estimators of the location vector and scale matrix. The package implements kernel density estimators discussed in Belzile, Desgagnes, Genest and Ouimet (2024) <doi:10.48550/arXiv.2209.04757> for smoothing multivariate data on half-spaces.
Analyse, plot, and tabulate antimicrobial minimum inhibitory concentration (MIC) data. Validate the results of an MIC experiment by comparing observed MIC values to a gold standard assay, in line with standards from the International Organization for Standardization (2021) <https://www.iso.org/standard/79377.html>. Perform MIC prediction from whole genome sequence data stored in the Pathosystems Resource Integration Center (2013) <doi:10.1093/nar/gkt1099> database or locally.
This package finds optimal sets of genes that separate samples into two or more classes.
This package guesses the MIME type from a filename extension using the data derived from /etc/mime.types in UNIX-type systems.
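The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the package's own code: the function names `parse_mime_types` and `guess_type` and the sample table are hypothetical, and the sketch assumes the usual `mime.types` layout of one MIME type per line followed by its extensions.

```python
import os

# Hypothetical excerpt in the usual mime.types layout:
# "<type/subtype> <extension> [<extension> ...]", with "#" starting a comment.
SAMPLE_MIME_TYPES = """\
# type/subtype    extensions
text/html         html htm
image/png         png
application/pdf   pdf
"""


def parse_mime_types(text):
    """Build an extension -> MIME type table from mime.types-style text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mime, *extensions = line.split()
        for ext in extensions:
            # Keep the first type seen for an extension.
            table.setdefault(ext.lower(), mime)
    return table


def guess_type(filename, table, default=None):
    """Guess the MIME type from the final filename extension (case-insensitive)."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return table.get(ext, default)


table = parse_mime_types(SAMPLE_MIME_TYPES)
print(guess_type("index.HTM", table))
print(guess_type("README", table, default="application/octet-stream"))
```

Only the last extension is consulted, so a name such as `archive.tar.gz` is looked up under `gz`; a file without an extension falls back to the default.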
Implementation of the mid-n algorithms presented in Wellek S (2015) <DOI:10.1111/stan.12063> Statistica Neerlandica 69, 358-373 for exact sample size calculation for superiority trials with binary outcome.
Allows users to produce estimates and their mean squared error (MSE) for multivariate variables using a linear mixed model. The package follows the approach of Datta, Day and Basawa (1999) <doi:10.1016/S0378-3758(98)00147-5>.
This package provides a framework for multiple imputation of proteomics data, as proposed by Marie Chion, Christine Carapito and Frederic Bertrand (2021) <doi:10.1371/journal.pcbi.1010420>.
Optimization algorithms implemented in R, including conjugate gradient (CG), Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the limited memory BFGS (L-BFGS) methods. Most internal parameters can be set through the call interface. The solvers hold up quite well for higher-dimensional problems.
Specification and estimation of multinomial logit models. Large datasets and complex models are supported, with an intuitive syntax. Multinomial Logit Models, Mixed models, random coefficients and Hybrid Choice are all supported. For more information, see Molloy et al. (2021) <https://www.research-collection.ethz.ch/handle/20.500.11850/477416>.
Impute the covariance matrix of incomplete data so that factor analysis can be performed. Imputations are made using multiple imputation by Multivariate Imputation with Chained Equations (MICE) and combined with Rubin's rules. Parametric Fieller confidence intervals and nonparametric bootstrap confidence intervals can be obtained for the variance explained by different numbers of principal components. The method is described in Nassiri et al. (2018) <doi:10.3758/s13428-017-1013-4>.
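As a rough illustration of the pooling step mentioned above, Rubin's rules for a scalar quantity can be written as follows. This is a minimal sketch under stated assumptions, not the package's implementation: the function name `pool_rubin` is hypothetical, and the inputs are assumed to be one point estimate and one squared standard error per imputed data set.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates of a scalar with Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: their squared standard errors (within-imputation variances)
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)       # average of the m estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The factor `(1 + 1 / m)` inflates the between-imputation variance to account for using a finite number of imputations.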
This package performs maximum likelihood estimation for finite mixture models for families including Normal, Weibull, Gamma and Lognormal, using the EM algorithm together with the Newton-Raphson algorithm or the bisection method when necessary. It also conducts mixture model selection using information criteria or a bootstrap likelihood ratio test. The data used for mixture model fitting can be raw data or binned data. The model fitting process is accelerated by using the R package Rcpp.
Generalized low-rank models for mixed and incomplete data frames. The main function may be used for dimensionality reduction or imputation of numeric, binary and count data (simultaneously). Main effects such as column means, group effects, or effects of row-column side information (e.g. user/item attributes in recommendation systems) may also be modelled in addition to the low-rank model. See Geneviève Robin, Olga Klopp, Julie Josse, Éric Moulines, Robert Tibshirani (2018) <arXiv:1806.09734>.
Modified functions of the package pcalg and some additional functions to run the PC and the FCI (Fast Causal Inference) algorithm for constraint-based causal discovery in incomplete and multiply imputed datasets. Foraita R, Friemel J, Günther K, Behrens T, Bullerdiek J, Nimzyk R, Ahrens W, Didelez V (2020) <doi:10.1111/rssa.12565>; Andrews RM, Foraita R, Didelez V, Witte J (2021) <arXiv:2108.13395>; Witte J, Foraita R, Didelez V (2022) <doi:10.1002/sim.9535>.
DNA methylation contains information about the regulatory state of the cell. MIRA aggregates genome-scale DNA methylation data into a DNA methylation profile for a given region set with shared biological annotation. Using this profile, MIRA infers and scores the collective regulatory activity for the region set. MIRA facilitates regulatory analysis in situations where classical regulatory assays would be difficult and allows public sources of region sets to be leveraged for novel insight into the regulatory state of DNA methylation datasets.
Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm as described in http://doi.org/10.18637/jss.v045.i03. Each variable has its own imputation model. Built-in imputation models are provided for continuous data (predictive mean matching, normal), binary data (logistic regression), unordered categorical data (polytomous logistic regression) and ordered categorical data (proportional odds). MICE can also impute continuous two-level data (normal model, pan, second-level variables). Passive imputation can be used to maintain consistency between variables. Various diagnostic plots are available to inspect the quality of the imputations.
This is a package for the analysis of discrete response data using unidimensional and multidimensional item analysis models under the Item Response Theory paradigm (Chalmers (2012) <doi:10.18637/jss.v048.i06>). Exploratory and confirmatory item factor analysis models are estimated with quadrature (EM) or stochastic (MHRM) methods. Confirmatory bi-factor and two-tier models are available for modeling item testlets using dimension reduction EM algorithms, while multiple group analyses and mixed effects designs are included for detecting differential item, bundle, and test functioning, and for modeling item and person covariates. Finally, latent class models such as the DINA, DINO, multidimensional latent class, mixture IRT models, and zero-inflated response models are supported.
A multiple-instance data set consists of many independent subjects (called bags), each composed of several components (called instances). The outcomes of such a data set are binary or categorical responses, and only the subject-level outcomes can be observed. For example, in manufacturing processes, a subject is labeled as "defective" if at least one of its own components is defective, and otherwise is labeled as "non-defective". The milr package focuses on the predictive model for multiple-instance data sets with binary outcomes and performs maximum likelihood estimation with the Expectation-Maximization algorithm under the framework of logistic regression. Moreover, a LASSO penalty is attached to the likelihood function for simultaneous parameter estimation and variable selection.
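The "at least one defective component" rule can be made concrete with a short Python sketch. It is not milr's API: the function names are hypothetical, and the sketch assumes that each instance follows a logistic model and that instances within a bag are independent given their covariates, so the bag is positive with probability one minus the chance that every instance is negative.

```python
import math


def instance_prob(x, beta, intercept=0.0):
    """Logistic probability that a single instance is positive (e.g. defective)."""
    z = intercept + sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))


def bag_prob(instances, beta, intercept=0.0):
    """Probability that a bag is positive: at least one instance is positive.

    Assumes independent instances, so
    P(bag positive) = 1 - product over instances of (1 - p_instance).
    """
    prob_all_negative = 1.0
    for x in instances:
        prob_all_negative *= 1.0 - instance_prob(x, beta, intercept)
    return 1.0 - prob_all_negative


# Two instances, each with probability 0.5 of being positive.
print(bag_prob([[1.0], [2.0]], beta=[0.0]))
```

Because only the bag label is observed, the instance labels act as latent variables, which is why an EM-type algorithm is a natural fit for the likelihood.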
Single-cell RNA-sequencing (scRNA-seq) has made it possible to profile gene expression in tissues at high resolution. An important preprocessing step prior to performing downstream analyses is to identify and remove cells with poor or degraded sample quality using quality control (QC) metrics. Two widely used QC metrics to identify a ‘low-quality’ cell are (i) if the cell includes a high proportion of reads that map to mitochondrial DNA encoded genes (mtDNA) and (ii) if a small number of genes are detected. miQC is a data-driven QC metric that jointly models both the proportion of reads mapping to mtDNA and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset.
An increasing number of microbiome datasets have been generated and analyzed with the help of rapidly developing sequencing technologies. At present, analysis of taxonomic profiling data is mainly conducted using composition-based methods, which ignore interactions between community members. In addition, a lack of efficient ways to compare microbial interaction networks has limited the study of community dynamics. To better understand how community diversity is affected by complex interactions between its members, we developed mina (Microbial community dIversity and Network Analysis), a comprehensive framework for microbial community diversity analysis and network comparison. By defining and integrating network-derived community features, we greatly reduce the noise-to-signal ratio for diversity analyses. A bootstrap and permutation-based method was implemented to assess community network dissimilarities and extract discriminative features in a statistically principled way.
An implementation of a taxonomy of models of restricted diffusion in biological tissues parametrized by the tissue geometry (axis, diameter, density, etc.). This is primarily used in the context of diffusion magnetic resonance (MR) imaging to model the MR signal attenuation in the presence of diffusion gradients. The goal is to provide tools to simulate the MR signal attenuation predicted by these models under different experimental conditions. The package feeds a companion shiny app available at <https://midi-pastrami.apps.math.cnrs.fr> that serves as a graphical interface to the models and tools provided by the package. Models currently available are the ones in Neuman (1974) <doi:10.1063/1.1680931>, Van Gelderen et al. (1994) <doi:10.1006/jmrb.1994.1038>, Stanisz et al. (1997) <doi:10.1002/mrm.1910370115>, Soderman & Jonsson (1995) <doi:10.1006/jmra.1995.0014> and Callaghan (1995) <doi:10.1006/jmra.1995.1055>.
Count data is prevalent and informative, with widespread application in many fields such as social psychology, personality, and public health. Classical statistical methods for the analysis of count outcomes are commonly variants of the log-linear model, including Poisson regression and Negative Binomial regression. However, a typical problem with count data modeling is inflation, in the sense that the counts are evidently accumulated on some integers. Such inflation can distort the distribution of the observed counts, further biasing estimation and increasing error, making the classic methods infeasible. Traditional methods for selecting inflated values, based on histogram inspection, can easily miss true inflation points and are also computationally expensive. Therefore, we propose a multiple-inflated negative binomial model to handle count data modeling with multiple inflated values, achieving data-driven inflated value selection. The proposed approach also provides simultaneous identification of important regression predictors for the target count response. More details about the proposed method are described in Li, Y., Wu, M., Wu, M., & Ma, S. (2023) <arXiv:2309.15585>.