Performance analysis workflow that combines the power of the R language (and the tidyverse realm) and many auxiliary tools to provide a consistent, flexible, extensible, fast, and versatile framework for the performance analysis of task-based applications that run on top of the StarPU runtime (with its MPI (Message Passing Interface) layer for multi-node support). Its goal is to provide a fruitful prototypical environment to conduct performance analysis hypothesis-checking for task-based applications that run on heterogeneous (multi-GPU, multi-core) multi-node HPC (High-performance computing) platforms.
Experience studies are used by actuaries to explore historical experience across blocks of business and to inform assumption setting activities. This package provides functions for preparing data, creating studies, visualizing results, and beginning assumption development. Experience study methods, including exposure calculations, are described in: Atkinson & McGarry (2016) "Experience Study Calculations" <https://www.soa.org/49378a/globalassets/assets/files/research/experience-study-calculations.pdf>. The limited fluctuation credibility method used by the exp_stats() function is described in: Herzog (1999, ISBN:1-56698-374-6) "Introduction to Credibility Theory".
Count transformation models featuring parameters interpretable as discrete hazard ratios, odds ratios, reverse-time discrete hazard ratios, or transformed expectations. An appropriate data transformation for a count outcome and regression coefficients are simultaneously estimated by maximising the exact discrete log-likelihood using the computational framework provided in package mlt', technical details are given in Siegfried & Hothorn (2020) <DOI:10.1111/2041-210X.13383>. The package also contains an experimental implementation of multivariate count transformation models with an application to multi-species distribution models <DOI:10.48550/arXiv.2201.13095>.
Collection of ancillary functions and utilities for Partial Linear Single Index Models for Environmental mixture analyses, which currently provides functions for scalar outcomes. The outputs of these functions include the single index function, single index coefficients, partial linear coefficients, mixture overall effect, exposure main and interaction effects, and differences of quartile effects. In the future, we will add functions for binary, ordinal, Poisson, survival, and longitudinal outcomes, as well as models for time-dependent exposures. See Wang et al (2020) <doi:10.1186/s12940-020-00644-4> for an overview.
This package provides an implementation of two-dimensional functional principal component analysis (FPCA), Marginal FPCA, and Product FPCA for repeated functional data. Marginal and Product FPCA implementations are done for both dense and sparsely observed functional data. References: Chen, K., Delicado, P., & Müller, H. G. (2017) <doi:10.1111/rssb.12160>. Chen, K., & Müller, H. G. (2012) <doi:10.1080/01621459.2012.734196>. Hall, P., Müller, H.G. and Wang, J.L. (2006) <doi:10.1214/009053606000000272>. Yao, F., Müller, H. G., & Wang, J. L. (2005) <doi:10.1198/016214504000001745>.
Create network-style visualizations of pairwise relationships using custom edge glyphs built on top of ggplot2'. The package supports both statistical and non-statistical data and allows users to represent directed relationships. This enables clear, publication-ready graphics for exploring and communicating relational structures in a wide range of domains. The method was first used in Abu-Akel et al. (2021) <doi:10.1371/journal.pone.0245100>. Code is released under the MIT License; included datasets are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).
Fast, model-agnostic implementation of different H-statistics introduced by Jerome H. Friedman and Bogdan E. Popescu (2008) <doi:10.1214/07-AOAS148>. These statistics quantify interaction strength per feature, feature pair, and feature triple. The package supports multi-output predictions and can account for case weights. In addition, several variants of the original statistics are provided. The shape of the interactions can be explored through partial dependence plots or individual conditional expectation plots. DALEX explainers, meta learners ('mlr3', tidymodels', caret') and most other models work out-of-the-box.
Rank-based tests for enrichment of KOG (euKaryotic Orthologous Groups) classes with up- or down-regulated genes based on a continuous measure. The meta-analysis is based on correlation of KOG delta-ranks across datasets (delta-rank is the difference between mean rank of genes belonging to a KOG class and mean rank of all other genes). With binary measure (1 or 0 to indicate significant and non-significant genes), one-tailed Fisher's exact test for over-representation of each KOG class among significant genes will be performed.
This package contains sixteen moisture sorption isotherm models, which evaluate the fitness of adsorption and desorption curves for further understanding of the relationship between moisture content and water activity. Fitness evaluation is conducted through parameter estimation and error analysis. Moreover, graphical representation, hysteresis area estimation, and isotherm classification through the equation of Blahovec & Yanniotis (2009) <doi:10.1016/j.jfoodeng.2008.08.007> which is based on the classification system introduced by Brunauer et. al. (1940) <doi:10.1021/ja01864a025> are also included for the visualization of models and hysteresis.
This package provides a computational framework for analyzing mutations in immunoglobulin (Ig) sequences. Includes methods for Bayesian estimation of antigen-driven selection pressure, mutational load quantification, building of somatic hypermutation (SHM) models, and model-dependent distance calculations. Also includes empirically derived models of SHM for both mice and humans. Citations: Gupta and Vander Heiden, et al (2015) <doi:10.1093/bioinformatics/btv359>, Yaari, et al (2012) <doi:10.1093/nar/gks457>, Yaari, et al (2013) <doi:10.3389/fimmu.2013.00358>, Cui, et al (2016) <doi:10.4049/jimmunol.1502263>.
This package provides a mixture model for clustering individuals (or sampling groups) into stocks based on their genetic profile. Here, sampling groups are individuals that are sure to come from the same stock (e.g. breeding adults or larvae). The mixture (log-)likelihood is maximised using the EM-algorithm after finding good starting values via a K-means clustering of the genetic data. Details can be found in: Foster, S. D.; Feutry, P.; Grewe, P. M.; Berry, O.; Hui, F. K. C. & Davies (2020) <doi:10.1111/1755-0998.12920>.
The LSTM (Long Short-Term Memory) model is a Recurrent Neural Network (RNN) based architecture that is widely used for time series forecasting. Min-Max transformation has been used for data preparation. Here, we have used one LSTM layer as a simple LSTM model and a Dense layer is used as the output layer. Then, compile the model using the loss function, optimizer and metrics. This package is based on Keras and TensorFlow modules and the algorithm of Paul and Garai (2021) <doi:10.1007/s00500-021-06087-4>.
This package provides a dedicated viral-explainer model tool designed to empower researchers in the field of HIV research, particularly in viral load and CD4 (Cluster of Differentiation 4) lymphocytes regression modeling. Drawing inspiration from the tidymodels framework for rigorous model building of Max Kuhn and Hadley Wickham (2020) <https://www.tidymodels.org>, and the DALEXtra tool for explainability by Przemyslaw Biecek (2020) <doi:10.48550/arXiv.2009.13248>. It aims to facilitate interpretable and reproducible research in biostatistics and computational biology for the benefit of understanding HIV dynamics.
New tools for the imputation of missing values in high-dimensional data are introduced using the non-parametric nearest neighbor methods. It includes weighted nearest neighbor imputation methods that use specific distances for selected variables. It includes an automatic procedure of cross validation and does not require prespecified values of the tuning parameters. It can be used to impute missing values in high-dimensional data when the sample size is smaller than the number of predictors. For more information see Faisal and Tutz (2017) <doi:10.1515/sagmb-2015-0098>.
Genetically modified organisms (GMOs) and cell lines are widely used models in all kinds of biological research. As part of characterising these models, DNA sequencing technology and bioinformatics analyses are used systematically to study their genomes. Therefore, large volumes of data are generated and various algorithms are applied to analyse this data, which introduces a challenge on representing all findings in an informative and concise manner. `gmoviz` provides users with an easy way to visualise and facilitate the explanation of complex genomic editing events on a larger, biologically-relevant scale.
VarCon is an R package which converts the positional information from the annotation of an single nucleotide variation (SNV) (either referring to the coding sequence or the reference genomic sequence). It retrieves the genomic reference sequence around the position of the single nucleotide variation. To asses, whether the SNV could potentially influence binding of splicing regulatory proteins VarCon calcualtes the HEXplorer score as an estimation. Besides, VarCon additionally reports splice site strengths of splice sites within the retrieved genomic sequence and any changes due to the SNV.
Alternating Manifold Proximal Gradient Method for Sparse PCA uses the Alternating Manifold Proximal Gradient (AManPG) method to find sparse principal components from a data or covariance matrix. Provides a novel algorithm for solving the sparse principal component analysis problem which provides advantages over existing methods in terms of efficiency and convergence guarantees. Chen, S., Ma, S., Xue, L., & Zou, H. (2020) <doi:10.1287/ijoo.2019.0032>. Zou, H., Hastie, T., & Tibshirani, R. (2006) <doi:10.1198/106186006X113430>. Zou, H., & Xue, L. (2018) <doi:10.1109/JPROC.2018.2846588>.
Flexible univariate count models based on renewal processes. The models may include covariates and can be specified with familiar formula syntax as in glm() and package flexsurv'. The methodology is described by Kharrat et all (2019) <doi:10.18637/jss.v090.i13> (included as vignette Countr_guide in the package). If the suggested package pscl is not available from CRAN, it can be installed with remotes::install_github("cran/pscl")'. It is no longer used by the functions in this package but is needed for some of the extended examples.
This package provides functions for testing if the covariance structure of 2-dimensional data (e.g. samples of surfaces X_i = X_i(s,t)) is separable, i.e. if covariance(X) = C_1 x C_2. A complete descriptions of the implemented tests can be found in the paper Aston, John A. D.; Pigoli, Davide; Tavakoli, Shahin. Tests for separability in nonparametric covariance operators of random surfaces. Ann. Statist. 45 (2017), no. 4, 1431--1461. <doi:10.1214/16-AOS1495> <https://projecteuclid.org/euclid.aos/1498636862> <arXiv:1505.02023>.
Tool collection for common and not so common data science use cases. This includes custom made algorithms for data management as well as value calculations that are hard to find elsewhere because of their specificity but would be a waste to get lost nonetheless. Currently available functionality: find sub-graphs in an edge list data.frame, find mode or modes in a vector of values, extract (a) specific regular expression group(s), generate ISO time stamps that play well with file names, or generate URL parameter lists by expanding value combinations.
Build a map of path-based geometry, this is a simple description of the number of parts in an object and their basic structure. Translation and restructuring operations for planar shapes and other hierarchical types require a data model with a record of the underlying relationships between elements. The gibble() function creates a geometry map, a simple record of the underlying structure in path-based hierarchical types. There are methods for the planar shape types in the sf and sp packages and for types in the trip and silicate packages.
Mapper-based survival analysis with transcriptomics data is designed to carry out. Mapper-based survival analysis is a modification of Progression Analysis of Disease (PAD) where survival data is taken into account in the filtering function. More details in: J. Fores-Martos, B. Suay-Garcia, R. Bosch-Romeu, M.C. Sanfeliu-Alonso, A. Falco, J. Climent, "Progression Analysis of Disease with Survival (PAD-S) by SurvMap identifies different prognostic subgroups of breast cancer in a large combined set of transcriptomics and methylation studies" <doi:10.1101/2022.09.08.507080>.
This package implements the GAMbag, GAMrsm and GAMens ensemble classifiers for binary classification (De Bock et al., 2010) <doi:10.1016/j.csda.2009.12.013>. The ensembles implement Bagging (Breiman, 1996) <doi:10.1023/A:1010933404324>, the Random Subspace Method (Ho, 1998) <doi:10.1109/34.709601> , or both, and use Hastie and Tibshirani's (1990, ISBN:978-0412343902) generalized additive models (GAMs) as base classifiers. Once an ensemble classifier has been trained, it can be used for predictions on new data. A function for cross validation is also included.
The Gene Ontology (GO) Consortium <https://geneontology.org/> organizes genes into hierarchical categories based on biological process (BP), molecular function (MF) and cellular component (CC, i.e., subcellular localization). Tools such as GoMiner (see Zeeberg, B.R., Feng, W., Wang, G. et al. (2003) <doi:10.1186/gb-2003-4-4-r28>) can leverage GO to perform ontological analysis of microarray and proteomics studies, typically generating a list of significant functional categories. To capture the benefit of all three ontologies, I developed HTGM3D', a three-dimensional version of GoMiner'.