The straightforward filtering index (SFINX) identifies true positive protein interactions in a fast, user-friendly, and highly accurate way. It is not only useful for the filtering of affinity purification - mass spectrometry (AP-MS) data, but also for similar types of data resulting from other co-complex interactomics technologies, such as TAP-MS, Virotrap and BioID. SFINX can also be used via the website interface at <http://sfinx.ugent.be>.
Utilities for rapidly loading specified rows and/or columns of data from large tab-separated value (tsv) files (large: e.g. 1 GB file of 10000 x 10000 matrix). tsvio is an R wrapper to C code that creates an index file for the rows of the tsv file, and uses that index file to collect rows and/or columns from the tsv file without reading the whole file into memory.
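The row-index idea can be illustrated with a minimal Python sketch. This is not tsvio's C implementation or its index-file format: the function names build_row_index and read_rows are hypothetical, and the index is kept in memory as a list of byte offsets rather than written to an index file.

```python
import os
import tempfile


def build_row_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            offsets.append(position)
    return offsets


def read_rows(path, offsets, row_numbers, columns=None):
    """Fetch selected rows (0-based) by seeking, optionally keeping some columns."""
    rows = []
    with open(path, "rb") as handle:
        for row in row_numbers:
            # Jump straight to the requested row instead of scanning the file.
            handle.seek(offsets[row])
            fields = handle.readline().decode("utf-8").rstrip("\r\n").split("\t")
            if columns is not None:
                fields = [fields[c] for c in columns]
            rows.append(fields)
    return rows


def demo():
    """Write a tiny tsv file, index it, and pull two rows and two columns."""
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
        tmp.write("id\ta\tb\n")
        tmp.write("r1\t1\t2\n")
        tmp.write("r2\t3\t4\n")
        tmp.write("r3\t5\t6\n")
        path = tmp.name
    try:
        offsets = build_row_index(path)
        return read_rows(path, offsets, [3, 1], columns=[0, 2])
    finally:
        os.remove(path)
```

Building the index requires one pass over the file, but each later lookup only seeks to the stored offsets, so memory use depends on the rows requested rather than on the file size.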
Extends standard penalized regression (Lasso, Ridge, and Elastic-net) to allow feature-specific shrinkage based on external information with the goal of achieving a better prediction accuracy and variable selection. Examples of external information include the grouping of predictors, prior knowledge of biological importance, external p-values, function annotations, etc. The choice of multiple tuning parameters is done using an Empirical Bayes approach. A majorization-minimization algorithm is employed for implementation.
The development of high-throughput sequencing led to increased use of co-expression analysis to go beyond single feature (i.e. gene) focus. We propose GWENA (Gene Whole co-Expression Network Analysis), a tool designed to perform gene co-expression network analysis and explore the results in a single pipeline. It includes functional enrichment of modules of co-expressed genes, phenotypical association, topological analysis and comparison of network configurations between conditions.
This package enables the interpretation and analysis of results from a gene set enrichment analysis using network-based and text-mining approaches. Most enrichment analyses result in large lists of significant gene sets that are difficult to interpret. Tools in this package help build a similarity-based network of significant gene sets from a gene set enrichment analysis that can then be investigated for their biological function using text-mining approaches.
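As a rough illustration of a similarity-based gene set network, the Python sketch below links gene sets whose members overlap. The description does not say which similarity measure the package uses, so the Jaccard index, the threshold of 0.3, the function names, and the toy gene sets are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: shared genes divided by all genes in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(gene_sets, threshold=0.3):
    """Return edges (set_a, set_b, score) for pairs at or above the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        score = jaccard(gene_sets[name_a], gene_sets[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Toy input, invented for illustration only.
TOY_GENE_SETS = {
    "set_A": {"g1", "g2", "g3"},
    "set_B": {"g2", "g3", "g4"},
    "set_C": {"g9"},
}
```

With the toy input, only set_A and set_B share enough genes to be connected; set_C stays isolated.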
tidyr is a reframing of the reshape2 package designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis. It is designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.
SPAMS (SPArse Modeling Software) is an optimization toolbox for solving various sparse estimation problems. It includes tools for the following problems:
Dictionary learning and matrix factorization (NMF, sparse principal component analysis (PCA), ...)
Solving sparse decomposition problems with LARS, coordinate descent, OMP, SOMP, proximal methods
Solving structured sparse decomposition problems (l1/l2, l1/linf, sparse group lasso, tree-structured regularization, structured sparsity with overlapping groups,...).
Gene-environment (G×E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust penalization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a robust Bayesian variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects. An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
This package provides a collection of functions for structure learning of causal networks and estimation of joint causal effects from observational Gaussian data. Main algorithm consists of a Markov chain Monte Carlo scheme for posterior inference of causal structures, parameters and causal effects between variables. References: F. Castelletti and A. Mascaro (2021) <doi:10.1007/s10260-021-00579-1>, F. Castelletti and A. Mascaro (2022) <doi:10.48550/arXiv.2201.12003>.
Stan-based functions to estimate CAR-MM models. These models allow the estimation of Generalised Linear Models with CAR (conditional autoregressive) spatial random effects for spatially and temporally misaligned data, provided a suitable Multiple Membership matrix. The main references are Gramatica, Liverani and Congdon (2023) <doi:10.1214/23-BA1370>, Petrof, Neyens, Nuyts, Nackaerts, Nemery and Faes (2020) <doi:10.1002/sim.8697> and Gramatica, Congdon and Liverani <doi:10.1111/rssc.12480>.
Supplies higher-order coordinatized data specification and fluid transform operators that include pivot and anti-pivot as special cases. The methodology is described in Zumel, 2018, "Fluid data reshaping with cdata", <https://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html>, <DOI:10.5281/zenodo.1173299>. This package introduces the idea of explicit control table specification of data transforms. Works on in-memory data or on remote data using rquery and SQL database interfaces.
Most packages provide only the probability density function, cumulative distribution function, quantile function, and random number generation for probability distributions. The present package computes some important distributional properties, including the first four ordinary and central moments, Pearson's coefficient of skewness and kurtosis, the mean and variance, coefficient of variation, median, and quartile deviation at some parametric values of several well-known and extensively used probability distributions.
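To make these quantities concrete, the Python sketch below derives them for a single distribution from closed-form expressions. The exponential distribution is chosen here only for illustration; the function name exponential_properties is hypothetical, and the moment-based skewness and kurtosis coefficients used below are an assumption that may differ from the definitions the package applies.

```python
import math


def exponential_properties(rate):
    """Distributional properties of an exponential distribution with the given rate."""
    # Ordinary (raw) moments: E[X^k] = k! / rate^k for k = 1..4.
    m1, m2, m3, m4 = (math.factorial(k) / rate ** k for k in range(1, 5))

    # Central moments expressed through the raw moments.
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4

    # Quartiles from the quantile function Q(p) = -ln(1 - p) / rate.
    q1 = -math.log(1 - 0.25) / rate
    median = -math.log(1 - 0.50) / rate
    q3 = -math.log(1 - 0.75) / rate

    return {
        "mean": m1,
        "variance": mu2,
        "skewness": mu3 / mu2 ** 1.5,        # moment coefficient of skewness
        "kurtosis": mu4 / mu2 ** 2,          # non-excess kurtosis
        "coefficient_of_variation": math.sqrt(mu2) / m1,
        "median": median,
        "quartile_deviation": (q3 - q1) / 2,
    }
```

For any positive rate the skewness, kurtosis, and coefficient of variation of the exponential distribution do not depend on the rate, which makes the function easy to sanity-check.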
Interface to the python package dgpsi for Gaussian process, deep Gaussian process, and linked deep Gaussian process emulations of computer models and networks using stochastic imputation (SI). The implementations follow Ming & Guillas (2021) <doi:10.1137/20M1323771> and Ming, Williamson, & Guillas (2023) <doi:10.1080/00401706.2022.2124311> and Ming & Williamson (2023) <doi:10.48550/arXiv.2306.01212>. To get started with the package, see <https://mingdeyu.github.io/dgpsi-R/>.
R package to build and simulate deterministic discrete-time compartmental models that can be non-Markov. Length of stay in each compartment can be defined to follow a parametric distribution (d_exponential(), d_gamma(), d_weibull(), d_lognormal()) or a non-parametric distribution (nonparametric()). Other supported types of transition from one compartment to another include fixed transition (constant()), multinomial (multinomial()), and fixed transition probability (transprob()).
We offer an implementation of the series representation put forth in "A series representation for multidimensional Rayleigh distributions" by Wiegand and Nadarajah <DOI: 10.1002/dac.3510>. Furthermore we have implemented an integration approach proposed by Beaulieu et al. for 3 and 4-dimensional Rayleigh densities (Beaulieu, Zhang, "New simplest exact forms for the 3D and 4D multivariate Rayleigh PDFs with applications to antenna array geometrics", <DOI: 10.1109/TCOMM.2017.2709307>).
Designed to simplify geospatial data access from the Statistics Finland Web Feature Service API <https://geo.stat.fi/geoserver/index.html>, the geofi package offers researchers and analysts a set of tools to obtain and harmonize administrative spatial data for a wide range of applications, from urban planning to environmental research. The package contains annually updated time series of municipality key datasets that can be used for data aggregation and language translations.
Statistical testing procedures for detecting GxE (gene-environment) interactions. The main focus lies on GRSxE interaction tests that aim at detecting GxE interactions through GRS (genetic risk scores). Moreover, a novel testing procedure based on bagging and OOB (out-of-bag) predictions is implemented for incorporating all available observations at both GRS construction and GxE testing (Lau et al., 2023, <doi:10.1038/s41598-023-28172-4>).
This package provides functions and methods for: splitting large raster objects into smaller chunks, transferring images from a binary format into raster layers, transferring raster layers into an RData file, calculating the maximum gap (amount of consecutive missing values) of a numeric vector, and fitting harmonic regression models to periodic time series. The homoscedastic harmonic regression model is based on G. Roerink, M. Menenti and W. Verhoef (2000) <doi:10.1080/014311600209814>.
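The maximum gap computation mentioned above can be sketched in a few lines of Python. This is an illustrative version, not the package's own function: the name max_gap is hypothetical, and treating both None and NaN as missing is an assumption of this sketch.

```python
import math


def max_gap(values):
    """Length of the longest run of consecutive missing values (None or NaN)."""
    longest = current = 0
    for value in values:
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if missing:
            current += 1
            longest = max(longest, current)
        else:
            # A non-missing value ends the current run.
            current = 0
    return longest
```

For a series such as [1.0, None, NaN, 3.0, None], the longest run of missing values has length 2.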
Generation of synthetic data from a real dataset using the combination of rank normal inverse transformation with the calculation of correlation matrix <doi:10.1055/a-2048-7692>. Completely artificial data may be generated through the use of Generalized Lambda Distribution and Generalized Poisson Distribution <doi:10.1201/9781420038040>. Quantitative, binary, ordinal categorical, and survival data may be simulated. Functionalities are offered to generate synthetic data sets according to user's needs.
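The rank-based inverse normal transformation step can be sketched as follows in Python. This covers only the transformation, not the correlation-matrix step or the package's actual code; the function name rank_inverse_normal, the Blom offset of 0.375, and the averaging of tied ranks are assumptions made for this example.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.375):
    """Map values to normal scores via their ranks (ties get the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    # Convert each rank to a probability strictly inside (0, 1), then to a z-score.
    return [
        standard_normal.inv_cdf((rank - offset) / (n - 2 * offset + 1))
        for rank in ranks
    ]
```

The transformed values keep the ordering of the input but follow an approximately standard normal distribution, so a correlation matrix computed on them is less sensitive to skewed or heavy-tailed margins.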
Partial informational correlation (PIC) is used to identify the meaningful predictors to the response from a large set of potential predictors. Details of methodologies used in the package can be found in Sharma, A., Mehrotra, R. (2014). <doi:10.1002/2013WR013845>, Sharma, A., Mehrotra, R., Li, J., & Jha, S. (2016). <doi:10.1016/j.envsoft.2016.05.021>, and Mehrotra, R., & Sharma, A. (2006). <doi:10.1016/j.advwatres.2005.08.007>.
This package provides a comprehensive suite of tools for analyzing omics data. It includes functionalities for alpha diversity analysis, beta diversity analysis, differential abundance analysis, community assembly analysis, visualization of phylogenetic tree, and functional enrichment analysis. With a progressive approach, the package offers a range of analysis methods to explore and understand the complex communities. It is designed to support researchers and practitioners in conducting in-depth and professional omics data analysis.
Calculates, via simulation, power and appropriate stopping alpha boundaries (and/or futility bounds) for sequential analyses (i.e., group sequential design) as well as for multiple hypotheses (multiple tests included in an analysis), given any specified global error rate. This enables the sequential use of practically any significance test, as long as the underlying data can be simulated in advance to a reasonable approximation. Lukács (2022) <doi:10.21105/joss.04643>.
An automatic cell type detection and assignment algorithm for single cell RNA-Seq and Cytof/FACS data. SCINA is capable of assigning cell type identities to a pool of cells profiled by scRNA-Seq or Cytof/FACS data with prior knowledge of markers, such as genes and protein symbols that are highly or lowly expressed in each category. See Zhang Z, et al (2019) <doi:10.3390/genes10070531> for more details.
Implementation of the SING algorithm to extract joint and individual non-Gaussian components from two datasets. SING uses an objective function that maximizes the skewness and kurtosis of latent components with a penalty to enhance the similarity between subject scores. Unlike other existing methods, SING does not use PCA for dimension reduction, but rather uses non-Gaussianity, which can improve feature extraction. Benjamin B. Risk, Irina Gaynanova (2021) <doi:10.1214/21-AOAS1466>.