Linnorm is an R package for the analysis of RNA-seq, scRNA-seq, ChIP-seq count data, or any other large-scale count data. It transforms such datasets for parametric tests. In addition to the transformation function (Linnorm), the following pipelines are implemented: library size/batch effect normalization (Linnorm.Norm); cell subpopulation analysis and visualization using t-SNE or PCA K-means clustering or hierarchical clustering (Linnorm.tSNE, Linnorm.PCA, Linnorm.HClust); differential expression analysis or differential peak detection using limma (Linnorm.limma); highly variable gene discovery and visualization (Linnorm.HVar); gene correlation network analysis and visualization (Linnorm.Cor); stable gene selection for scRNA-seq data, for users who lack or do not want to rely on spike-in genes (Linnorm.SGenes); and data imputation (Linnorm.DataImput).
Linnorm can work with raw counts, CPM, RPKM, FPKM and TPM. Additionally, the RnaXSim function is included for simulating RNA-seq data for the evaluation of DEG analysis methods.
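A minimal usage sketch of the two core functions named above; the toy count matrix is a placeholder for real genes-by-samples data:

    library(Linnorm)
    set.seed(1)
    # Toy genes-by-samples count matrix standing in for real (sc)RNA-seq data
    counts <- matrix(rpois(2000, lambda = 5), nrow = 200, ncol = 10)
    transformed <- Linnorm(counts)       # transformed values for parametric tests
    normalized  <- Linnorm.Norm(counts)  # library size/batch effect normalization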
PaIRKAT is a model framework for assessing statistical relationships between networks of metabolites (pathways) and an outcome of interest (phenotype). PaIRKAT queries the KEGG database to determine interactions between metabolites, from which network connectivity is constructed. This model framework improves testing power on high dimensional data by including graph topology in the kernel machine regression setting. Studies on high dimensional data can struggle to include the complex relationships between variables. The semi-parametric kernel machine regression model is a powerful tool for capturing these types of relationships: it provides a framework for testing for relationships between outcomes of interest and high dimensional data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known biological connections between high dimensional variables by representing them as edges of 'graphs' or 'networks'. It is common for nodes (e.g. metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power whether or not the graph information is included. We include a graph regularization or 'smoothing' approach for managing this issue.
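A generic sketch (not the package's own code) of the graph 'smoothing' idea: the feature matrix is regularized through the normalized graph Laplacian before entering the kernel, so disconnected nodes pass through essentially unchanged; the adjacency matrix A and tuning value tau are placeholders:

    # Smooth a samples-by-metabolites feature matrix Z over a metabolite graph
    smooth_features <- function(Z, A, tau = 1) {
      d <- pmax(rowSums(A), 1)                     # node degrees (guard zeros)
      Dinv <- diag(1 / sqrt(d))
      L <- diag(nrow(A)) - Dinv %*% A %*% Dinv     # normalized graph Laplacian
      Z %*% solve(diag(nrow(A)) + tau * L)         # regularized features
    }
    A <- rbind(c(0, 1, 0), c(1, 0, 0), c(0, 0, 0)) # third metabolite disconnected
    Z <- matrix(rnorm(15), nrow = 5)
    Zs <- smooth_features(Z, A)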
This package provides functions to produce accessible HTML slides, HTML, Word and PDF documents from input R markdown files. Accessible PDF files are produced only on a Windows operating system. One aspect of accessibility is providing a headings structure that is recognised by a screen reader, providing a navigational tool for a blind or partially-sighted person. A key aim is to produce documents of different formats easily from each of a collection of R markdown source files. Input R markdown files are rendered using the render() function from the rmarkdown package <https://cran.r-project.org/package=rmarkdown>. A zip file containing multiple output files can be produced from one function call. A user-supplied template Word document can be used to determine the formatting of an output Word document. Accessible PDF files are produced from Word documents using OfficeToPDF <https://github.com/cognidox/OfficeToPDF>. A convenience function, install_otp(), is provided to install this software. The option to print HTML output to (non-accessible) PDF files is also available.
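A sketch of the workflow using only the functions named above; the input file name is a placeholder and output options follow rmarkdown conventions:

    # after attaching the package:
    install_otp()    # one-off install of OfficeToPDF (Windows only)
    rmarkdown::render("report.Rmd", output_format = "word_document")
    # the Word output can then be converted to an accessible PDF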
Three games: proton, frequon and regression. Each one is a console-based data-crunching game for younger and older data scientists. Act as a data hacker and find Slawomir Pietraszko's credentials to the Proton server. In proton you have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basic knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. In frequon you will help to perform a statistical cryptanalytic attack on a corpus of ciphered messages. This time seven sub-tasks push the bar much higher. Do you accept the challenge? In regression you will test your modeling skills in a series of eight sub-tasks. Try only if ANOVA is your close friend. It's part of the Beta and Bit project. You will find more about the Beta and Bit project at <https://github.com/BetaAndBit/Charts>.
Developed to help researchers who need to model the kinetics of carbon dioxide (CO2) production in alcoholic fermentation of wines, beers and other fermented products. The following models are available for modeling the carbon dioxide production curve as a function of time: 5PL, Gompertz and 4PL. This package provides functions that, when applied, can: perform the modeling of the data obtained in the fermentation and return the coefficients; analyze the model fit and return different statistical metrics; and calculate the kinetic parameters: maximum production of carbon dioxide; maximum rate of production of carbon dioxide; moment at which the maximum fermentation rate occurs; duration of the latency phase for carbon dioxide production; and carbon dioxide produced until the maximum fermentation rate occurs. In addition, a function that generates graphs with the observed and predicted data from the models, isolated and combined, is available. Gava, A., Borsato, D., & Ficagna, E. (2020). "Effect of mixture of fining agents on the fermentation kinetics of base wine for sparkling wine production: Use of methodology for modeling". <doi:10.1016/j.lwt.2020.109660>.
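A generic base-R sketch (not this package's API) of fitting a Gompertz-type CO2 curve with nls(); the data are simulated placeholders and the parameterization (asymptote A, maximum rate mu, lag) follows the common modified Gompertz form:

    set.seed(1)
    d <- data.frame(time = seq(0, 30, by = 1))
    d$co2 <- 50 * exp(-exp(3 * exp(1) / 50 * (4 - d$time) + 1)) + rnorm(31, sd = 0.5)
    fit <- nls(co2 ~ A * exp(-exp(mu * exp(1) / A * (lag - time) + 1)),
               data = d, start = list(A = 45, mu = 2.5, lag = 3))
    coef(fit)  # A = max CO2, mu = max production rate, lag = latency phase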
This package provides a fast and flexible general-purpose implementation of Particle Swarm Optimization (PSO) and Differential Evolution (DE) for solving global minimization problems. It is designed to handle complex optimization tasks with nonlinear, non-differentiable, and multi-modal objective functions defined by users. There are five types of PSO variants: Particle Swarm Optimization (PSO, Eberhart & Kennedy, 1995) <doi:10.1109/MHS.1995.494215>, Quantum-behaved Particle Swarm Optimization (QPSO, Sun et al., 2004) <doi:10.1109/CEC.2004.1330875>, Locally convergent rotationally invariant particle swarm optimization (LcRiPSO, Bonyadi & Michalewicz, 2014) <doi:10.1007/s11721-014-0095-1>, Competitive Swarm Optimizer (CSO, Cheng & Jin, 2015) <doi:10.1109/TCYB.2014.2322602> and Double exponential particle swarm optimization (DExPSO, Stehlik et al., 2024) <doi:10.1016/j.asoc.2024.111913>. For the DE algorithm, six types in Storn, R. & Price, K. (1997) <doi:10.1023/A:1008202821328> are included: DE/rand/1, DE/rand/2, DE/best/1, DE/best/2, DE/rand-to-best/1 and DE/rand-to-best/2.
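A compact generic PSO sketch (not this package's implementation) that makes the velocity/position update concrete on a toy sphere function; the inertia and acceleration constants are standard textbook defaults:

    pso_min <- function(f, lower, upper, n = 30, iters = 200,
                        w = 0.7, c1 = 1.5, c2 = 1.5) {
      d <- length(lower)
      X <- sapply(seq_len(d), function(j) runif(n, lower[j], upper[j]))
      V <- matrix(0, n, d)
      P <- X; pval <- apply(X, 1, f)             # personal bests
      g <- P[which.min(pval), ]                  # global best
      for (k in seq_len(iters)) {
        V <- w * V + c1 * runif(n * d) * (P - X) +
             c2 * runif(n * d) * sweep(-X, 2, g, "+")
        X <- pmin(pmax(X + V, matrix(lower, n, d, byrow = TRUE)),
                  matrix(upper, n, d, byrow = TRUE))
        fx <- apply(X, 1, f)
        better <- fx < pval
        P[better, ] <- X[better, ]; pval[better] <- fx[better]
        g <- P[which.min(pval), ]
      }
      list(par = g, value = min(pval))
    }
    pso_min(function(x) sum(x^2), lower = c(-5, -5), upper = c(5, 5))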
The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning-based approach that automates identifying IDs on a large scale. The MantaID model's prediction accuracy was proven to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. MantaID supports the discovery and exploitation of ID patterns from large numbers of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for MantaID to improve applicability. To our knowledge, MantaID is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.
Offers a flexible and user-friendly interface for visualizing conditional effects from a broad range of regression models, including mixed-effects and generalized additive (mixed) models. Compatible model types include lm(), rlm(), glm(), glm.nb(), and gam() (from 'mgcv'); nonlinear models via nls(); and generalized least squares via gls(). Mixed-effects models with random intercepts and/or slopes can be fitted using lmer(), glmer(), glmer.nb(), glmmTMB(), or gam() (from 'mgcv', via smooth terms). Plots are rendered using base R graphics with extensive customization options. Approximate confidence intervals for nls() models are computed using the delta method. Robust standard errors for rlm() are computed using the sandwich estimator (Zeileis 2004) <doi:10.18637/jss.v011.i10>. Methods for generalized additive models follow Wood (2017) <doi:10.1201/9781315370279>. For linear mixed-effects models with 'lme4', see Bates et al. (2015) <doi:10.18637/jss.v067.i01>. For mixed models using 'glmmTMB', see Brooks et al. (2017) <doi:10.32614/RJ-2017-066>.
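A generic sketch of the delta-method interval mentioned above for an nls() fit (not the package's own code): the gradient of the fitted curve with respect to the parameters propagates vcov() into a prediction standard error; the model and evaluation point are placeholders:

    fit <- nls(mpg ~ a * exp(b * wt), data = mtcars,
               start = list(a = 50, b = -0.3))
    cf <- coef(fit); V <- vcov(fit)
    wt0 <- 3                                         # evaluation point
    pred <- cf[["a"]] * exp(cf[["b"]] * wt0)
    g <- c(exp(cf[["b"]] * wt0),                     # d pred / d a
           cf[["a"]] * wt0 * exp(cf[["b"]] * wt0))   # d pred / d b
    se <- sqrt(drop(t(g) %*% V %*% g))
    pred + c(-1.96, 1.96) * se                       # approximate 95% CI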
We provide a toolbox to fit a continuous-time fractionally integrated ARMA process (CARFIMA) to univariate and irregularly spaced time series data via both frequentist and Bayesian machinery. A general-order CARFIMA(p, H, q) model for p>q is specified in Tsai and Chan (2005) <doi:10.1111/j.1467-9868.2005.00522.x> and involves p+q+2 unknown model parameters, i.e., p AR parameters, q MA parameters, the Hurst parameter H, and the process uncertainty (standard deviation) sigma. The model can also account for heteroscedastic measurement errors if the measurement error standard deviations are known. The package produces their maximum likelihood estimates and asymptotic uncertainties using a global optimizer called the differential evolution algorithm. It also produces posterior samples of the model parameters via Metropolis-Hastings within a Gibbs sampler equipped with adaptive Markov chain Monte Carlo. These fitting procedures, however, may produce numerical errors if p>2. The toolbox also contains a function to simulate discrete time series data from a CARFIMA(p, H, q) process given the model parameters and observation times.
The remit of the European Clinical Trials Database (EudraCT <https://eudract.ema.europa.eu/>), or ClinicalTrials.gov <https://clinicaltrials.gov/>, is to provide open access to summaries of all registered clinical trial results, thus aiming to prevent non-reporting of negative results and to provide open access to results to inform future research. The amount of information required and the format of the results, however, impose a large extra workload at the end of studies on clinical trial units. In particular, the adverse-event-reporting component requires entering each unique combination of treatment group and safety event, and, for every such event, a further four pieces of information (body system, number of occurrences, number of subjects, number exposed) for non-serious events, plus an extra three pieces of data for serious adverse events (numbers of causally related events, deaths, causally related deaths). This package prepares the required statistics and formats them to the precise requirements for directly uploading an XML file into the web portal, with no further data entry by hand.
Fast implementations to compute the genetic covariance matrix, the Jaccard similarity matrix, the s-matrix (the weighted Jaccard similarity matrix), and the (classic or robust) genomic relationship matrix of a (dense or sparse) input matrix (see Hahn, Lutz, Hecker, Prokopenko, Cho, Silverman, Weiss, and Lange (2020) <doi:10.1002/gepi.22356>). Full support for sparse matrices from the R package 'Matrix'. Additionally, an implementation of the power method (von Mises iteration) to compute the largest eigenvector of a matrix is included, as well as a function to perform an automated full run of global and local correlations in population stratification data, a function to compute sliding windows, and a function to invert minor alleles and to select those variants/loci exceeding a minimal cutoff value. New functionality in locStra allows one to extract the k leading eigenvectors of the genetic covariance matrix, Jaccard similarity matrix, s-matrix, and genomic relationship matrix via fast PCA without actually computing the similarity matrices. The fast PCA to compute the k leading eigenvectors can now also be run directly from 'bed'+'bim'+'fam' files.
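A generic illustration (not the package's optimized routines) of the Jaccard similarity matrix for a sparse 0/1 genotype matrix, using 'Matrix' as described above; the toy matrix is a placeholder:

    library(Matrix)
    set.seed(1)
    G <- Matrix(rbinom(200, 1, 0.2), nrow = 20, sparse = TRUE)  # loci x samples
    inter <- as.matrix(crossprod(G))             # pairwise intersection counts
    ones  <- colSums(G)
    uni   <- outer(ones, ones, "+") - inter      # union counts
    J     <- inter / pmax(uni, 1)                # Jaccard similarity matrix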
This package provides tools for statistical analysis using the binscatter methods developed by Cattaneo, Crump, Farrell and Feng (2024a) <doi:10.48550/arXiv.1902.09608>, Cattaneo, Crump, Farrell and Feng (2024b) <https://nppackages.github.io/references/Cattaneo-Crump-Farrell-Feng_2024_NonlinearBinscatter.pdf> and Cattaneo, Crump, Farrell and Feng (2024c) <doi:10.48550/arXiv.1902.09615>. Binscatter provides a flexible way of describing the relationship between two variables based on partitioning/binning of the independent variable of interest. binsreg(), binsqreg() and binsglm() implement binscatter least squares regression, quantile regression and generalized linear regression, respectively, with particular focus on constructing binned scatter plots. They also implement robust (pointwise and uniform) inference for regression functions and derivatives thereof. binstest() implements hypothesis testing procedures for parametric functional forms of, and nonparametric shape restrictions on, the regression function. binspwc() implements hypothesis testing procedures for pairwise group comparison of binscatter estimators. binsregselect() implements data-driven procedures for selecting the number of bins for binscatter estimation. All the commands allow for covariate adjustment, smoothness restrictions and clustering.
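A minimal sketch of binsreg() on simulated placeholder data, assuming the default data-driven bin selection:

    library(binsreg)
    set.seed(42)
    x <- runif(1000)
    y <- sin(2 * pi * x) + rnorm(1000, sd = 0.5)
    est <- binsreg(y, x)   # least squares binscatter; plots the binned means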
Opens a shiny app which supports theoretical and computational analysis of block designs for symmetrical and mixed-level factorial experiments. The package includes tools to check whether a design has orthogonal factorial structure (OFS) with balance, and can find the orthogonality deviation value if it does not have OFS. It also includes a function to evaluate the efficiency factors of all factorial effects in two situations: if the design is verified to have OFS with balance, the efficiencies of all factorial effects are calculated using a specific analytical procedure; if the design is verified to be non-OFS but balanced, a new general method has been developed and is used to calculate the efficiencies, under the condition that the design is proper and equi-replicated. See Gupta, S.C. and Mukerjee, R. (1987), "A Calculus for Factorial Arrangements", Lecture Notes in Statistics, No. 59, Springer-Verlag, Berlin, New York, <doi:10.1007/978-1-4419-8730-3>. For ease of use, the shiny app handles input entry and input validation.
Gene sets are fundamental for gene enrichment analysis. The package 'geneset' enables querying gene sets from public databases including GO (Gene Ontology Consortium. (2004) <doi:10.1093/nar/gkh036>), KEGG (Minoru et al. (2000) <doi:10.1093/nar/28.1.27>), WikiPathway (Marvin et al. (2020) <doi:10.1093/nar/gkaa1024>), MsigDb (Arthur et al. (2015) <doi:10.1016/j.cels.2015.12.004>), Reactome (David et al. (2011) <doi:10.1093/nar/gkq1018>), MeSH (Ish et al. (2014) <doi:10.4103/0019-5413.139827>), DisGeNET (Janet et al. (2017) <doi:10.1093/nar/gkw943>), Disease Ontology (Lynn et al. (2011) <doi:10.1093/nar/gkr972>), Network of Cancer Genes (Dimitra et al. (2019) <doi:10.1186/s13059-018-1612-0>) and COVID-19 (Maxim et al. (2020) <doi:10.21203/rs.3.rs-28582/v1>). Gene sets are stored in a list object that provides the data frames 'geneset' and 'geneset_name'. The 'geneset' data frame has two columns, term ID and gene ID. The 'geneset_name' data frame has two columns, term ID and term description.
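The documented list structure can be explored as below, assuming gs is a gene set object returned by one of the package's query functions:

    head(gs$geneset)        # data frame: term ID, gene ID
    head(gs$geneset_name)   # data frame: term ID, term description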
Provides the cumulative distribution function (CDF), quantile, p-value, statistical power calculator and random number generator for a collection of group-testing procedures, including the Higher Criticism tests, the one-sided Kolmogorov-Smirnov tests, the one-sided Berk-Jones tests, the one-sided phi-divergence tests, etc. The input is a group of p-values. The null hypothesis is that they are i.i.d. Uniform(0,1). In the context of signal detection, the null hypothesis means no signals. In the context of goodness-of-fit testing, which contrasts a group of i.i.d. random variables with a given continuous distribution, the input p-values can be obtained by the CDF transformation; the null hypothesis means that these random variables follow the given distribution. For reference, see [1] Hong Zhang, Jiashun Jin and Zheyang Wu. "Distributions and power of optimal signal-detection statistics in finite case", IEEE Transactions on Signal Processing (2020) 68, 1021-1033; [2] Hong Zhang and Zheyang Wu. "The general goodness-of-fit tests for correlated data", Computational Statistics & Data Analysis (2022) 167, 107379.
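A generic illustration (not this package's API) of the Higher Criticism statistic on i.i.d. Uniform(0,1) p-values, to make the input convention concrete:

    set.seed(1)
    p <- sort(runif(100))                    # ordered input p-values
    n <- length(p)
    i <- seq_len(n)
    hc <- max(sqrt(n) * (i / n - p) / sqrt(p * (1 - p)))  # HC statistic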
Power analysis and sample size calculation for Welch and Hsu (Hedderich and Sachs (2018), ISBN:978-3-662-56657-2) t-tests, including Monte-Carlo simulations of empirical power and type-I error. Power and sample size calculation for Wilcoxon rank sum and signed rank tests via Monte-Carlo simulations. Power and sample size required for the evaluation of a diagnostic test(-system) (Flahault et al. (2005), <doi:10.1016/j.jclinepi.2004.12.009>; Dobbin and Simon (2007), <doi:10.1093/biostatistics/kxj036>) as well as for a single proportion (Fleiss et al. (2003), ISBN:978-0-471-52629-2; Piegorsch (2004), <doi:10.1016/j.csda.2003.10.002>; Thulin (2014), <doi:10.1214/14-ejs909>), comparing two negative binomial rates (Zhu and Lakkis (2014), <doi:10.1002/sim.5947>), ANCOVA (Shieh (2020), <doi:10.1007/s11336-019-09692-3>), reference ranges (Jennen-Steinmetz and Wellek (2005), <doi:10.1002/sim.2177>), multiple primary endpoints (Sozu et al. (2015), ISBN:978-3-319-22005-5), and AUC (Hanley and McNeil (1982), <doi:10.1148/radiology.143.1.7063747>).
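A generic base-R sketch of the kind of Monte-Carlo power simulation described above (not the package's own function), for a Welch t-test under unequal variances:

    set.seed(1)
    pow <- mean(replicate(2000, {
      x <- rnorm(25, mean = 0, sd = 1)
      y <- rnorm(25, mean = 0.6, sd = 1.5)
      t.test(x, y)$p.value < 0.05            # Welch is t.test()'s default
    }))
    pow                                      # empirical power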
Computes interdaily stability (IS), intradaily variability (IV) and the relative amplitude (RA) from actigraphy data as described in Blume et al. (2016) <doi:10.1016/j.mex.2016.05.006> and van Someren et al. (1999) <doi:10.3109/07420529908998724>. Additionally, it computes L5 (i.e. the 5 hours with the lowest average actigraphy amplitude) and M10 (the 10 hours with the highest average amplitude) as well as the respective start times. The flex versions will also compute the L-value for a user-defined number of minutes. IS describes the strength of coupling of a rhythm to supposedly stable zeitgebers. It varies between 0 (Gaussian noise) and 1 (perfect IS). IV describes the fragmentation of a rhythm, i.e. the frequency and extent of transitions between rest and activity. It is near 0 for a perfect sine wave, about 2 for Gaussian noise, and may be even higher when a definite ultradian period of about 2 hours is present. RA is the relative amplitude of a rhythm. Note that to obtain reliable results, actigraphy data should cover a reasonable number of days.
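A generic sketch of the IS computation (not the package's code): the variance of the average 24-h profile relative to the overall variance; the hourly activity data are simulated placeholders:

    set.seed(1)
    act <- data.frame(hour = rep(0:23, times = 7),  # one week of hourly data
                      x = pmax(rnorm(168, mean = rep(c(1, 8), each = 12)), 0))
    xbar   <- mean(act$x)
    hourly <- tapply(act$x, act$hour, mean)         # average 24-h profile
    IS <- (nrow(act) * sum((hourly - xbar)^2)) /
          (24 * sum((act$x - xbar)^2))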
Several tests for high dimensional generalized linear models have been proposed recently. This package implements a new test called adaptive sum of powered score (aSPU) for high dimensional generalized linear models, which is often more powerful than existing methods in a wide range of scenarios. Permutation-based versions of several existing methods are also implemented for research purposes. We recommend the aSPU test for real testing problems. You can learn more about the tests implemented in the package via the following papers: 1. Pan, W., Kim, J., Zhang, Y., Shen, X. and Wei, P. (2014) <DOI:10.1534/genetics.114.165035> A powerful and adaptive association test for rare variants, Genetics, 197(4). 2. Guo, B., and Chen, S. X. (2016) <DOI:10.1111/rssb.12152>. Tests for high dimensional generalized linear models. Journal of the Royal Statistical Society: Series B. 3. Goeman, J. J., Van Houwelingen, H. C., and Finos, L. (2011) <DOI:10.1093/biomet/asr016>. Testing against a high-dimensional alternative in the generalized linear model: asymptotic type I error control. Biometrika, 98(2).
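A generic sketch of the sum-of-powered-score (SPU) statistics behind aSPU (not the package's API): U is the score vector under the null, SPU(g) aggregates it with power g, and aSPU then takes the best power, with the minimum p-value recalibrated on the same permutations:

    set.seed(1)
    n <- 100
    X <- matrix(rnorm(n * 50), n)            # high dimensional covariates
    y <- rbinom(n, 1, 0.5)
    r <- y - mean(y)                         # score residuals under the null
    U <- drop(crossprod(X, r))               # score vector
    gam <- c(1, 2, 4, 8)                     # candidate powers
    T0 <- sapply(gam, function(g) sum(U^g))
    Tb <- t(replicate(1000, {
      Ub <- drop(crossprod(X, sample(r)))
      sapply(gam, function(g) sum(Ub^g))
    }))
    p_gam <- colMeans(sweep(abs(Tb), 2, abs(T0), ">="))  # per-power p-values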
Assists actuaries and other insurance modellers in pricing, reserving and capital modelling for non-life insurance and reinsurance. Provides functions that help model excess levels, capping and pure incurred-but-not-reported claims (pure IBNR). Includes the capped mean, exposure curves and increased limit factor (ILF) curves for LogNormal, Gamma, Pareto, Sliced LogNormal-Pareto and Sliced Gamma-Pareto distributions. Includes the mean, probability density function (pdf), cumulative distribution function (cdf) and inverse cumulative distribution function for the Sliced LogNormal-Pareto and Sliced Gamma-Pareto distributions. Includes calculation of pure IBNR exposure with LogNormal and Gamma distributions for the reporting delay. Includes three shiny tools: one to simulate insurance claims applying reinsurance structures, one to fit generalised linear models, and one to fit claims frequency or severity distributions. Methods used in the package refer to Free for All by Yiannis Parizas (2023) <https://www.theactuary.com/2023/03/02/free-all>; Escaping the triangle by Yiannis Parizas (2019) <https://www.theactuary.com/features/2019/06/2019/06/05/escaping-triangle>; Taken to excess by Yiannis Parizas (2019) <https://www.theactuary.com/features/2019/03/2019/03/06/taken-excess>.
Doubly robust estimation and inference of the log hazard ratio under the Cox marginal structural model with informative censoring. An augmented inverse probability weighted estimator that involves three working models: one for the conditional failure time T, one for the conditional censoring time C, and one for the propensity score. Both the models for T and C can depend on a binary treatment A and additional baseline covariates Z, while the propensity score model depends only on Z. With the help of cross-fitting techniques, it achieves the rate doubly robust property, which allows the use of most machine learning or non-parametric methods for all three working models, something not permitted in classic inverse probability weighting or doubly robust estimators. When the proportional hazards assumption is violated, CoxAIPW estimates a causal estimand that is a weighted average of the time-varying log hazard ratio. Reference: Luo, J. (2023). Statistical Robustness - Distributed Linear Regression, Informative Censoring, Causal Inference, and Non-Proportional Hazards [Unpublished doctoral dissertation]. University of California San Diego; Luo & Xu (2022) <doi:10.48550/arXiv.2206.02296>; Rava (2021) <https://escholarship.org/uc/item/8h1846gs>.
Set of functions to create clear graphics and run common statistical analyses for agricultural experiments (ANOVA with post-hoc tests such as Tukey HSD and Duncan's MRT, coefficient of variation, and simple power calculations), streamlining exploratory analysis and reporting. Functions build on ggplot2 and base stats and follow methods widely used in agronomy (field trials, plant breeding). Key references include Tukey (1949) <doi:10.2307/3001913>, Duncan (1955) <doi:10.2307/3001478>, Cohen (1988, ISBN:9781138892899); see also agricolae <https://CRAN.R-project.org/package=agricolae> and Wickham (2016, ISBN:9783319242750) for 'ggplot2'.
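A generic base-R illustration of the ANOVA plus Tukey HSD workflow described above (not this package's wrappers), using the built-in npk field trial data:

    fit <- aov(yield ~ block + N, data = npk)  # npk ships with base R
    summary(fit)
    TukeyHSD(fit, which = "N")                 # post-hoc pairwise comparisons
    sd(npk$yield) / mean(npk$yield) * 100      # coefficient of variation (%)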
Routines for estimating tree fiber (tracheid) length distributions in the standing tree based on increment core samples. Two types of data can be used with the package: increment core data measured by means of an optical fiber analyzer (OFA), such as the Kajaani Fiber Lab, or measured by microscopy. Increment core data analyzed by OFAs consist of the cell lengths of both cut and uncut fibers (tracheids) and fines (such as ray parenchyma cells), without it being possible to identify which cells are cut or whether they are fines or fibers. The microscopy-measured data consist of the observed lengths of the uncut fibers in the increment core. A censored version of a mixture of the fine and fiber length distributions is proposed to fit the OFA data, under distributional assumptions (Svensson et al., 2006) <doi:10.1111/j.1467-9469.2006.00501.x>. The package offers two choices for the assumed underlying density functions of the true lengths of those fibers (fines) that at least partially appear in the increment core: the generalized gamma and the log normal densities.
This package provides a hybrid modeling framework combining Support Vector Regression (SVR) with metaheuristic optimization algorithms, including the Archimedes Optimization Algorithm (AO) (Hashim et al. (2021) <doi:10.1007/s10489-020-01893-z>), Coot Bird Optimization (CBO) (Naruei & Keynia (2021) <doi:10.1016/j.eswa.2021.115352>), and their hybrid (AOCBO), as well as several others such as Harris Hawks Optimization (HHO) (Heidari et al. (2019) <doi:10.1016/j.future.2019.02.028>), Grey Wolf Optimizer (GWO) (Mirjalili et al. (2014) <doi:10.1016/j.advengsoft.2013.12.007>), Ant Lion Optimization (ALO) (Mirjalili (2015) <doi:10.1016/j.advengsoft.2015.01.010>), and Enhanced Harris Hawks Optimization with Coot Bird Optimization (EHHOCBO) (Cui et al. (2023) <doi:10.32604/cmes.2023.026019>). The package enables automatic tuning of the SVR hyperparameters (cost, gamma, and epsilon) to enhance prediction performance. Suitable for regression tasks in domains such as renewable energy forecasting and hourly data prediction. For more details about the implementation and parameter bounds see: Setiawan et al. (2021) <doi:10.1016/j.procs.2020.12.003> and Liu et al. (2018) <doi:10.1155/2018/6076475>.
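A generic sketch (using 'e1071', not this package's internals) of the cross-validated SVR objective that such metaheuristics tune over (cost, gamma, epsilon); any of the optimizers listed above would minimize svr_loss over its parameter bounds:

    library(e1071)
    svr_loss <- function(par, x, y) {
      m <- svm(x, y, cost = par[1], gamma = par[2], epsilon = par[3], cross = 5)
      sqrt(mean(m$MSE))                      # 5-fold cross-validated RMSE
    }
    x <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
    svr_loss(c(1, 0.1, 0.1), x, y)           # evaluate one candidate solution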
Fast and multi-threaded implementation of isolation forest (Liu, Ting, Zhou (2008) <doi:10.1109/ICDM.2008.17>), extended isolation forest (Hariri, Kind, Brunner (2018) <doi:10.48550/arXiv.1811.02141>), SCiForest (Liu, Ting, Zhou (2010) <doi:10.1007/978-3-642-15883-4_18>), fair-cut forest (Cortes (2021) <doi:10.48550/arXiv.2110.13402>), robust random-cut forest (Guha, Mishra, Roy, Schrijvers (2016) <http://proceedings.mlr.press/v48/guha16.html>), and customizable variations of them, for isolation-based outlier detection, clustered outlier detection, distance or similarity approximation (Cortes (2019) <doi:10.48550/arXiv.1910.12362>), isolation kernel calculation (Ting, Zhu, Zhou (2018) <doi:10.1145/3219819.3219990>), and imputation of missing values (Cortes (2019) <doi:10.48550/arXiv.1911.06646>), based on random or guided decision tree splitting, and providing different metrics for scoring anomalies based on isolation depth or density (Cortes (2021) <doi:10.48550/arXiv.2111.11639>). Provides simple heuristics for fitting the model to categorical columns and handling missing data, and offers options for varying between random and guided splits, and for using different splitting criteria.
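A minimal sketch, assuming this description belongs to the 'isotree' package:

    library(isotree)
    set.seed(1)
    X <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))  # one planted outlier
    iso <- isolation.forest(X, ntrees = 100)
    scores <- predict(iso, X)                # higher score = more anomalous
    which.max(scores)                        # typically flags the planted point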