This package provides a comprehensive toolkit for calculating and visualizing Nitrogen Use Efficiency (NUE) indicators in agricultural research. The package implements 23 parameters categorized into fertilizer-based, plant-based, soil-based, isotope-based, ecology-based, and system-based indicators based on Congreves et al. (2021) <doi:10.3389/fpls.2021.637108>. Key features include vectorized calculations for paired-plot experimental designs, batch processing capabilities for handling large datasets, and built-in visualization tools using ggplot2. Designed to streamline the workflow from raw agronomic data to publication-ready metrics and plots.
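As a rough illustration of a paired-plot calculation, the sketch below computes agronomic efficiency, assuming its common definition as the yield gain of a fertilized plot over its unfertilized control per unit of N applied. The function name, units, and data are hypothetical and are not taken from the package.

```python
def agronomic_efficiency(yield_fertilized, yield_control, n_applied):
    """Yield gain per unit of N applied, one value per plot pair.

    Assumed definition: (fertilized yield - control yield) / N rate.
    """
    results = []
    for y_f, y_0, n in zip(yield_fertilized, yield_control, n_applied):
        if n <= 0:
            raise ValueError("N rate must be positive for a fertilized plot")
        results.append((y_f - y_0) / n)
    return results


# Hypothetical data: grain yield (kg/ha) and N rate (kg N/ha) for three pairs.
ae = agronomic_efficiency(
    [6000.0, 6500.0, 7000.0],
    [4000.0, 4000.0, 4000.0],
    [100.0, 150.0, 200.0],
)
print(ae)
```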
This package contains sixteen moisture sorption isotherm models, which evaluate the fitness of adsorption and desorption curves for further understanding of the relationship between moisture content and water activity. Fitness evaluation is conducted through parameter estimation and error analysis. Graphical representation, hysteresis area estimation, and isotherm classification are also included for the visualization of models and hysteresis. Classification uses the equation of Blahovec & Yanniotis (2009) <doi:10.1016/j.jfoodeng.2008.08.007>, which is based on the classification system introduced by Brunauer et al. (1940) <doi:10.1021/ja01864a025>.
This package provides a computational framework for analyzing mutations in immunoglobulin (Ig) sequences. Includes methods for Bayesian estimation of antigen-driven selection pressure, mutational load quantification, building of somatic hypermutation (SHM) models, and model-dependent distance calculations. Also includes empirically derived models of SHM for both mice and humans. Citations: Gupta and Vander Heiden, et al (2015) <doi:10.1093/bioinformatics/btv359>, Yaari, et al (2012) <doi:10.1093/nar/gks457>, Yaari, et al (2013) <doi:10.3389/fimmu.2013.00358>, Cui, et al (2016) <doi:10.4049/jimmunol.1502263>.
This package provides a mixture model for clustering individuals (or sampling groups) into stocks based on their genetic profile. Here, sampling groups are individuals that are sure to come from the same stock (e.g. breeding adults or larvae). The mixture (log-)likelihood is maximised using the EM-algorithm after finding good starting values via a K-means clustering of the genetic data. Details can be found in: Foster, S. D.; Feutry, P.; Grewe, P. M.; Berry, O.; Hui, F. K. C. & Davies (2020) <doi:10.1111/1755-0998.12920>.
The LSTM (Long Short-Term Memory) model is a Recurrent Neural Network (RNN) based architecture that is widely used for time series forecasting. Min-Max transformation is used for data preparation. The model uses one LSTM layer as a simple LSTM model and a Dense layer as the output layer. The model is then compiled with a loss function, an optimizer and metrics. This package is based on Keras and TensorFlow modules and the algorithm of Paul and Garai (2021) <doi:10.1007/s00500-021-06087-4>.
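The Min-Max preparation step can be sketched in plain Python as follows. This is a minimal illustration, not the package's code: the function names are hypothetical, and the lagged-window helper is an assumption about how a series is typically arranged into inputs and targets before being passed to an LSTM layer.

```python
def min_max_scale(series):
    """Rescale a series to [0, 1]; also return the bounds for inversion."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series is constant; Min-Max scaling is undefined")
    return [(x - lo) / (hi - lo) for x in series], lo, hi


def inverse_min_max(scaled, lo, hi):
    """Map scaled values (for example forecasts) back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


def make_windows(series, lag):
    """Assumed layout: each input holds `lag` past values, the target is the next one."""
    inputs, targets = [], []
    for i in range(len(series) - lag):
        inputs.append(series[i:i + lag])
        targets.append(series[i + lag])
    return inputs, targets


scaled, lo, hi = min_max_scale([10.0, 20.0, 30.0, 40.0])
inputs, targets = make_windows([1, 2, 3, 4, 5], lag=2)
```

Keeping the bounds returned by the scaling step matters because forecasts come out on the scaled range and have to be mapped back to the original units.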
This package provides a dedicated viral-explainer model tool designed to empower researchers in the field of HIV research, particularly in viral load and CD4 (Cluster of Differentiation 4) lymphocytes regression modeling. The design draws inspiration from the tidymodels framework for rigorous model building of Max Kuhn and Hadley Wickham (2020) <https://www.tidymodels.org>, and the DALEXtra tool for explainability by Przemyslaw Biecek (2020) <doi:10.48550/arXiv.2009.13248>. It aims to facilitate interpretable and reproducible research in biostatistics and computational biology for the benefit of understanding HIV dynamics.
New tools for the imputation of missing values in high-dimensional data are introduced using the non-parametric nearest neighbor methods. It includes weighted nearest neighbor imputation methods that use specific distances for selected variables. It includes an automatic procedure of cross validation and does not require prespecified values of the tuning parameters. It can be used to impute missing values in high-dimensional data when the sample size is smaller than the number of predictors. For more information see Faisal and Tutz (2017) <doi:10.1515/sagmb-2015-0098>.
Alternating Manifold Proximal Gradient Method for Sparse PCA uses the Alternating Manifold Proximal Gradient (AManPG) method to find sparse principal components from a data or covariance matrix. Provides a novel algorithm for solving the sparse principal component analysis problem which provides advantages over existing methods in terms of efficiency and convergence guarantees. Chen, S., Ma, S., Xue, L., & Zou, H. (2020) <doi:10.1287/ijoo.2019.0032>. Zou, H., Hastie, T., & Tibshirani, R. (2006) <doi:10.1198/106186006X113430>. Zou, H., & Xue, L. (2018) <doi:10.1109/JPROC.2018.2846588>.
This package provides functions for testing if the covariance structure of 2-dimensional data (e.g. samples of surfaces X_i = X_i(s,t)) is separable, i.e. if covariance(X) = C_1 x C_2. A complete description of the implemented tests can be found in the paper Aston, John A. D.; Pigoli, Davide; Tavakoli, Shahin. Tests for separability in nonparametric covariance operators of random surfaces. Ann. Statist. 45 (2017), no. 4, 1431--1461. <doi:10.1214/16-AOS1495> <https://projecteuclid.org/euclid.aos/1498636862> <arXiv:1505.02023>.
Tool collection for common and not so common data science use cases. This includes custom made algorithms for data management as well as value calculations that are hard to find elsewhere because of their specificity but that would be a waste to lose. Currently available functionality: find sub-graphs in an edge list data.frame, find mode or modes in a vector of values, extract (a) specific regular expression group(s), generate ISO time stamps that play well with file names, or generate URL parameter lists by expanding value combinations.
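Two of the listed utilities are simple enough to sketch in a few lines. The Python below only illustrates the ideas (finding all modes, and expanding value combinations into parameter lists); the function names and return shapes are hypothetical and do not mirror the package's interface.

```python
from collections import Counter
from itertools import product


def find_modes(values):
    """Return every value that occurs with the highest frequency."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    # Counter keeps first-seen order, so ties come back in input order.
    return [value for value, count in counts.items() if count == top]


def expand_url_params(options):
    """Build one parameter dict per combination of the supplied values."""
    keys = list(options)
    combos = product(*(options[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


modes = find_modes([1, 2, 2, 3, 3])
params = expand_url_params({"lang": ["en", "de"], "page": [1, 2]})
```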
This package implements the GAMbag, GAMrsm and GAMens ensemble classifiers for binary classification (De Bock et al., 2010) <doi:10.1016/j.csda.2009.12.013>. The ensembles implement Bagging (Breiman, 1996) <doi:10.1023/A:1010933404324>, the Random Subspace Method (Ho, 1998) <doi:10.1109/34.709601>, or both, and use Hastie and Tibshirani's (1990, ISBN:978-0412343902) generalized additive models (GAMs) as base classifiers. Once an ensemble classifier has been trained, it can be used for predictions on new data. A function for cross validation is also included.
Build a map of path-based geometry: a simple description of the number of parts in an object and their basic structure. Translation and restructuring operations for planar shapes and other hierarchical types require a data model with a record of the underlying relationships between elements. The gibble() function creates a geometry map, a simple record of the underlying structure in path-based hierarchical types. There are methods for the planar shape types in the sf and sp packages and for types in the trip and silicate packages.
Designed to carry out Mapper-based survival analysis with transcriptomics data. Mapper-based survival analysis is a modification of Progression Analysis of Disease (PAD) where survival data is taken into account in the filtering function. More details in: J. Fores-Martos, B. Suay-Garcia, R. Bosch-Romeu, M.C. Sanfeliu-Alonso, A. Falco, J. Climent, "Progression Analysis of Disease with Survival (PAD-S) by SurvMap identifies different prognostic subgroups of breast cancer in a large combined set of transcriptomics and methylation studies" <doi:10.1101/2022.09.08.507080>.
The Gene Ontology (GO) Consortium <https://geneontology.org/> organizes genes into hierarchical categories based on biological process (BP), molecular function (MF) and cellular component (CC, i.e., subcellular localization). Tools such as GoMiner (see Zeeberg, B.R., Feng, W., Wang, G. et al. (2003) <doi:10.1186/gb-2003-4-4-r28>) can leverage GO to perform ontological analysis of microarray and proteomics studies, typically generating a list of significant functional categories. To capture the benefit of all three ontologies, I developed HTGM3D, a three-dimensional version of GoMiner.
An implementation of classifier chains (CCs) for multi-label prediction. Users can employ an external package (e.g. randomForest, C50), or supply their own. The package can train a single set of CCs or train an ensemble of CCs, in parallel if running in a multi-core environment. New observations are classified using a Gibbs sampler since each unobserved label is conditioned on the others. The package includes methods for evaluating the predictions for accuracy and aggregating across iterations and models to produce binary or probabilistic classifications.
Three distinct methods are implemented for evaluating the sums of arbitrary negative binomial distributions. These methods are: Furman's exact probability mass function (Furman (2007) <doi:10.1016/j.spl.2006.06.007>), saddlepoint approximation, and a method of moments approximation. Functions are provided to calculate the density function, the distribution function and the quantile function of the convolutions in question given said evaluation methods. Functions for generating random deviates from negative binomial convolutions and for directly calculating the mean, variance, skewness, and excess kurtosis of said convolutions are also provided.
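The moments of such a convolution follow directly from the fact that cumulants of independent variables add. The sketch below assumes each component is parameterized by a size r and success probability p, counting failures before the r-th success; the function name is hypothetical and the code is not taken from the package.

```python
import math


def nb_convolution_moments(components):
    """Mean, variance, skewness and excess kurtosis of a sum of independent
    negative binomials given as (r, p) pairs (assumed parameterization:
    number of failures before the r-th success, success probability p)."""
    k1 = k2 = k3 = k4 = 0.0
    for r, p in components:
        if r <= 0 or not 0.0 < p < 1.0:
            raise ValueError("need r > 0 and 0 < p < 1")
        q = 1.0 - p
        # The first four cumulants of one component; they add across components.
        k1 += r * q / p
        k2 += r * q / p**2
        k3 += r * q * (2.0 - p) / p**3
        k4 += r * q * (p**2 - 6.0 * p + 6.0) / p**4
    return {
        "mean": k1,
        "variance": k2,
        "skewness": k3 / math.sqrt(k2) ** 3,
        "excess_kurtosis": k4 / k2**2,
    }


single = nb_convolution_moments([(2, 0.5)])
combined = nb_convolution_moments([(2, 0.5), (3, 0.25)])
```

With a single component the result reduces to the usual negative binomial moments, which gives a quick sanity check on the cumulant formulas.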
The semiparametric accelerated failure time (AFT) model is an attractive alternative to the Cox proportional hazards model. This package provides a suite of functions for fitting one popular rank-based estimator of the semiparametric AFT model, the regularized Gehan estimator. Specifically, we provide functions for cross-validation, prediction, coefficient extraction, and visualizing both trace plots and cross-validation curves. For further details, please see Suder, P. M. and Molstad, A. J., (2022) Scalable algorithms for semiparametric accelerated failure time models in high dimensions, Statistics in Medicine <doi:10.1002/sim.9264>.
This package provides tools for researchers to explicitly show that their results comply with rules for statistical disclosure control imposed by research data centers. These tools help in checking descriptive statistics and models and in calculating extreme values that are not individual data. Also included is a simple function to create log files. The methods used here are described in the "Guidelines for the checking of output based on microdata research" by Bond, Brandt, and de Wolf (2015) <https://cros.ec.europa.eu/system/files/2024-02/Output-checking-guidelines.pdf>.
Performance analysis workflow that combines the power of the R language (and the tidyverse realm) and many auxiliary tools to provide a consistent, flexible, extensible, fast, and versatile framework for the performance analysis of task-based applications that run on top of the StarPU runtime (with its MPI (Message Passing Interface) layer for multi-node support). Its goal is to provide a fruitful prototypical environment to conduct performance analysis hypothesis-checking for task-based applications that run on heterogeneous (multi-GPU, multi-core) multi-node HPC (High-performance computing) platforms.
An introduction to several novel predictive variable selection methods for random forest. They are based on various variable importance methods (i.e., averaged variable importance (AVI) and knowledge informed AVI (KIAVI and KIAVI2)) and predictive accuracy in stepwise algorithms. For details of the variable selection methods, please see: Li, J., Siwabessy, J., Huang, Z. and Nichol, S. (2019) <doi:10.3390/geosciences9040180>. Li, J., Alvarez, B., Siwabessy, J., Tran, M., Huang, Z., Przeslawski, R., Radke, L., Howard, F., Nichol, S. (2017) <doi:10.13140/RG.2.2.27686.22085>.
This package provides a nonparametric method to estimate Toeplitz covariance matrices from a sample of n independently and identically distributed p-dimensional vectors with mean zero. The data is preprocessed with the discrete cosine matrix and a variance stabilization transformation to obtain an approximate Gaussian regression setting for the log-spectral density function. Estimates of the spectral density function and the inverse of the covariance matrix are provided as well. Functions for simulating data and a protein data example are included. For details see Klockmann and Krivobokova (2023) <arXiv:2303.10018>.
BASiCS is an integrated Bayesian hierarchical model to perform statistical analyses of single-cell RNA sequencing datasets in the context of supervised experiments (where the groups of cells of interest are known a priori). BASiCS performs built-in data normalisation (global scaling) and technical noise quantification (based on spike-in genes). BASiCS provides an intuitive detection criterion for highly (or lowly) variable genes within a single group of cells. Additionally, BASiCS can compare gene expression patterns between two or more pre-specified groups of cells.
This package implements methods for batch correction and integration of scRNA-seq datasets, based on the Seurat anchor-based integration framework. In particular, STACAS is optimized for the integration of heterogeneous datasets with only limited overlap between cell sub-types (e.g. TIL sets of CD8 from tumor with CD8/CD4 T cells from lymph node), for which the default Seurat alignment methods would tend to over-correct biological differences. The 2.0 version of the package allows users to incorporate explicit information about cell types in order to assist the integration process.
Streaming JSON (ndjson) has one JSON record per line and many modern ndjson files contain large numbers of records. These constructs may not be columnar in nature, but it is often useful to read in these files and "flatten" the structure out to enable working with the data in an R data.frame-like context. Functions are provided that make it possible to read in plain ndjson files or compressed (gz) ndjson files and either validate the format of the records or create "flat" data.table structures from them.
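The flattening idea can be sketched with the Python standard library alone. This is only an illustration of the concept: the function names are hypothetical, the dotted column-naming scheme is an assumption, and every record is assumed to be a JSON object.

```python
import json


def flatten(record, prefix=""):
    """Collapse nested objects and arrays into one level of dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            # Array elements get their position appended to the key.
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{name}.{index}"))
                else:
                    flat[f"{name}.{index}"] = item
        else:
            flat[name] = value
    return flat


def read_ndjson(text):
    """Parse one JSON object per non-empty line and flatten each record."""
    return [flatten(json.loads(line)) for line in text.splitlines() if line.strip()]


sample = '{"id": 1, "user": {"name": "a", "tags": ["x", "y"]}}\n{"id": 2}\n'
rows = read_ndjson(sample)
```

Records with different shapes yield different key sets, so a tabular view would need to fill the missing columns with a placeholder value.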