Developed to help researchers who need to model the kinetics of carbon dioxide (CO2) production in the alcoholic fermentation of wines, beers and other fermented products. Three models are available for modeling the carbon dioxide production curve as a function of time: 5PL, Gompertz and 4PL. The package provides functions that fit the models to fermentation data and return the coefficients, assess the model fit through different statistical metrics, and calculate the kinetic parameters: maximum production of carbon dioxide; maximum rate of production of carbon dioxide; moment at which the maximum fermentation rate occurs; duration of the latency phase for carbon dioxide production; and carbon dioxide produced until the maximum fermentation rate occurs. In addition, a function that plots the observed and predicted data from the models, separately and combined, is available. Gava, A., Borsato, D., & Ficagna, E. (2020). "Effect of mixture of fining agents on the fermentation kinetics of base wine for sparkling wine production: Use of methodology for modeling". <doi:10.1016/j.lwt.2020.109660>.
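As an illustration of the kind of model involved, here is a minimal sketch of fitting a Gompertz-type CO2 curve with base R nls() on simulated data; the package's own fitting interface and exact parameterization may differ:

    set.seed(1)
    t <- seq(0, 200, by = 5)                                 # fermentation time (h)
    co2 <- 50 * exp(-exp(1 - 0.05 * (t - 60))) + rnorm(length(t), sd = 0.8)
    fit <- nls(co2 ~ A * exp(-exp(1 - k * (t - tl))),
               start = list(A = 45, k = 0.04, tl = 50))
    coef(fit)   # A ~ maximum CO2 production; k and tl govern rate and lag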
PaIRKAT is a model framework for assessing statistical relationships between networks of metabolites (pathways) and an outcome of interest (phenotype). PaIRKAT queries the KEGG database to determine interactions between metabolites, from which network connectivity is constructed. This model framework improves testing power on high-dimensional data by including graph topology in the kernel machine regression setting. Studies on high-dimensional data can struggle to include the complex relationships between variables. The semi-parametric kernel machine regression model is a powerful tool for capturing these types of relationships, providing a framework for testing for relationships between outcomes of interest and high-dimensional data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known biological connections between high-dimensional variables by representing them as edges of 'graphs' or 'networks'. It is common for nodes (e.g. metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power whether or not the graph information is included. We include a graph regularization or 'smoothing' approach to manage this issue.
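A minimal sketch of the graph-regularization idea: smooth a metabolite matrix Z over a pathway graph via a regularized Laplacian before forming the kernel. This is an illustration only, not the PaIRKAT API; the smoothing parameter rho is a free choice here:

    set.seed(1)
    A <- matrix(0, 5, 5); A[1, 2] <- A[2, 3] <- A[3, 4] <- 1   # toy pathway edges
    A <- A + t(A)
    L <- diag(rowSums(A)) - A             # graph Laplacian; node 5 is disconnected
    rho <- 1                              # regularization ('smoothing') parameter
    Z <- matrix(rnorm(20 * 5), 20, 5)     # 20 subjects x 5 metabolites
    Zs <- Z %*% solve(L + rho * diag(5))  # regularized-Laplacian smoothing
    K <- tcrossprod(Zs)                   # kernel for kernel machine regression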
Linnorm is an R package for the analysis of RNA-seq, scRNA-seq, ChIP-seq count data or any large-scale count data. It transforms such datasets for parametric tests. In addition to the transformation function (Linnorm), the following pipelines are implemented: library size/batch effect normalization (Linnorm.Norm); cell subpopulation analysis and visualization using t-SNE or PCA with K-means clustering or hierarchical clustering (Linnorm.tSNE, Linnorm.PCA, Linnorm.HClust); differential expression analysis or differential peak detection using limma (Linnorm.limma); highly variable gene discovery and visualization (Linnorm.HVar); gene correlation network analysis and visualization (Linnorm.Cor); stable gene selection for scRNA-seq data, for users without spike-in genes or who do not want to rely on them (Linnorm.SGenes); and data imputation (Linnorm.DataImput).
Linnorm can work with raw counts, CPM, RPKM, FPKM and TPM. Additionally, the RnaXSim function is included for simulating RNA-seq data for the evaluation of DEG analysis methods.
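A minimal usage sketch of the two core calls listed above, on a simulated count matrix (genes x cells):

    library(Linnorm)
    set.seed(1)
    counts <- matrix(rpois(1000 * 20, lambda = 5), nrow = 1000,
                     dimnames = list(paste0("g", 1:1000), paste0("c", 1:20)))
    transformed <- Linnorm(counts)        # transformed data for parametric tests
    normalized <- Linnorm.Norm(counts)    # library size/batch effect normalization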
This package provides a fast and flexible general-purpose implementation of Particle Swarm Optimization (PSO) and Differential Evolution (DE) for solving global minimization problems. It is designed to handle complex optimization tasks with nonlinear, non-differentiable, and multi-modal objective functions defined by users. Five PSO variants are included: Particle Swarm Optimization (PSO, Eberhart & Kennedy, 1995) <doi:10.1109/MHS.1995.494215>, Quantum-behaved Particle Swarm Optimization (QPSO, Sun et al., 2004) <doi:10.1109/CEC.2004.1330875>, Locally convergent rotationally invariant particle swarm optimization (LcRiPSO, Bonyadi & Michalewicz, 2014) <doi:10.1007/s11721-014-0095-1>, Competitive Swarm Optimizer (CSO, Cheng & Jin, 2015) <doi:10.1109/TCYB.2014.2322602> and Double exponential particle swarm optimization (DExPSO, Stehlik et al., 2024) <doi:10.1016/j.asoc.2024.111913>. For the DE algorithm, the six types in Storn, R. & Price, K. (1997) <doi:10.1023/A:1008202821328> are included: DE/rand/1, DE/rand/2, DE/best/1, DE/best/2, DE/rand-to-best/1 and DE/rand-to-best/2.
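For reference, a minimal sketch of the classic DE/rand/1 mutation and binomial crossover step from Storn & Price (1997), written in base R (illustration only, not this package's interface):

    de_rand_1 <- function(X, f = 0.8, CR = 0.9) {
      # X: population matrix, one candidate solution per row
      n <- nrow(X); d <- ncol(X)
      V <- X
      for (i in seq_len(n)) {
        r <- sample(setdiff(seq_len(n), i), 3)
        v <- X[r[1], ] + f * (X[r[2], ] - X[r[3], ])      # DE/rand/1 mutation
        jrand <- sample.int(d, 1)                         # force one mutated coordinate
        cross <- runif(d) < CR | seq_len(d) == jrand      # binomial crossover
        V[i, cross] <- v[cross]
      }
      V  # trial vectors; selection keeps the better of X[i, ] and V[i, ]
    }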
The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed MantaID, a data-driven, machine-learning-based approach that automates identifying IDs on a large scale. The MantaID model's prediction accuracy was shown to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. MantaID supports the discovery and exploitation of ID patterns from large numbers of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for MantaID to improve applicability. To our knowledge, MantaID is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.
We provide a toolbox to fit a continuous-time fractionally integrated ARMA process (CARFIMA) to univariate and irregularly spaced time series data via both frequentist and Bayesian machinery. A general-order CARFIMA(p, H, q) model for p>q is specified in Tsai and Chan (2005) <doi:10.1111/j.1467-9868.2005.00522.x>; it involves p+q+2 unknown model parameters, i.e., p AR parameters, q MA parameters, the Hurst parameter H, and the process uncertainty (standard deviation) sigma. The model can also account for heteroscedastic measurement errors if the measurement error standard deviations are known. The package produces their maximum likelihood estimates and asymptotic uncertainties using a global optimizer, the differential evolution algorithm. It also produces posterior samples of the model parameters via Metropolis-Hastings within a Gibbs sampler equipped with adaptive Markov chain Monte Carlo. These fitting procedures, however, may produce numerical errors if p>2. The toolbox also contains a function to simulate discrete time series data from a CARFIMA(p, H, q) process given the model parameters and observation times.
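A minimal usage sketch on simulated irregular observation times; the function names carfima() and carfima.sim() and their argument names follow the package documentation but should be treated as assumptions here:

    library(carfima)
    set.seed(1)
    t.obs <- cumsum(runif(100, 0.5, 1.5))   # irregularly spaced observation times
    param <- c(-0.5, 0.75, 0.2)             # assumed order: AR parameter, H, sigma
    y <- carfima.sim(parameter = param, time = t.obs, ar.p = 1, ma.q = 0)
    fit <- carfima(Y = y, time = t.obs, ar.p = 1, ma.q = 0, method = "mle")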
The remit of the European Clinical Trials Database (EudraCT <https://eudract.ema.europa.eu/>), or ClinicalTrials.gov <https://clinicaltrials.gov/>, is to provide open access to summaries of all registered clinical trial results, thus aiming to prevent non-reporting of negative results and to provide open access to results to inform future research. The amount of information required and the format of the results, however, impose a large extra workload at the end of studies on clinical trial units. In particular, the adverse-event-reporting component requires entering each unique combination of treatment group and safety event, and, for every such event, a further four pieces of information (body system, number of occurrences, number of subjects, number exposed) for non-serious events, plus an extra three pieces of data for serious adverse events (numbers of causally related events, deaths, causally related deaths). This package prepares the required statistics and formats them into the precise requirements to directly upload an XML file into the web portal, with no further data entry by hand.
Fast implementations to compute the genetic covariance matrix, the Jaccard similarity matrix, the s-matrix (the weighted Jaccard similarity matrix), and the (classic or robust) genomic relationship matrix of a (dense or sparse) input matrix (see Hahn, Lutz, Hecker, Prokopenko, Cho, Silverman, Weiss, and Lange (2020) <doi:10.1002/gepi.22356>). Full support for sparse matrices from the R package Matrix. Additionally, an implementation of the power method (von Mises iteration) to compute the eigenvector corresponding to the largest eigenvalue of a matrix is included, as well as a function to perform an automated full run of global and local correlations in population stratification data, a function to compute sliding windows, and a function to invert minor alleles and to select those variants/loci exceeding a minimal cutoff value. New functionality in locStra allows one to extract the k leading eigenvectors of the genetic covariance matrix, Jaccard similarity matrix, s-matrix, and genomic relationship matrix via fast PCA without actually computing the similarity matrices. The fast PCA to compute the k leading eigenvectors can now also be run directly from bed/bim/fam files.
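To make the quantities concrete, a minimal sketch of the (unweighted) Jaccard similarity between the columns of a sparse 0/1 genotype matrix, using only the Matrix package (an illustration of the definition, not the locStra API):

    library(Matrix)
    set.seed(1)
    X <- Matrix(rbinom(200 * 10, 1, 0.1), nrow = 200, sparse = TRUE)  # variants x individuals
    inter <- as.matrix(crossprod(X))           # pairwise intersection counts
    ones <- colSums(X)
    uni <- outer(ones, ones, "+") - inter      # union counts
    J <- inter / uni                           # Jaccard similarity matrix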
This package provides tools for statistical analysis using the binscatter methods developed by Cattaneo, Crump, Farrell and Feng (2024a) <doi:10.48550/arXiv.1902.09608>, Cattaneo, Crump, Farrell and Feng (2024b) <https://nppackages.github.io/references/Cattaneo-Crump-Farrell-Feng_2024_NonlinearBinscatter.pdf> and Cattaneo, Crump, Farrell and Feng (2024c) <doi:10.48550/arXiv.1902.09615>. Binscatter provides a flexible way of describing the relationship between two variables based on partitioning/binning of the independent variable of interest. binsreg(), binsqreg() and binsglm() implement binscatter least squares regression, quantile regression and generalized linear regression respectively, with particular focus on constructing binned scatter plots. They also implement robust (pointwise and uniform) inference of regression functions and derivatives thereof. binstest() implements hypothesis testing procedures for parametric functional forms of and nonparametric shape restrictions on the regression function. binspwc() implements hypothesis testing procedures for pairwise group comparison of binscatter estimators. binsregselect() implements data-driven procedures for selecting the number of bins for binscatter estimation. All the commands allow for covariate adjustment, smoothness restrictions and clustering.
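A minimal usage sketch on simulated data, using the function names above with default options (argument layout per the package documentation):

    library(binsreg)
    set.seed(123)
    n <- 1000
    x <- runif(n); w <- rnorm(n)
    y <- sin(3 * x) + 0.2 * w + rnorm(n, sd = 0.5)
    est <- binsreg(y, x, w = w)        # binned scatter plot via least squares
    sel <- binsregselect(y, x)         # data-driven choice of the number of bins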
Gene sets are fundamental for gene enrichment analysis. The package geneset enables querying gene sets from public databases including GO (Gene Ontology Consortium. (2004) <doi:10.1093/nar/gkh036>), KEGG (Minoru et al. (2000) <doi:10.1093/nar/28.1.27>), WikiPathway (Marvin et al. (2020) <doi:10.1093/nar/gkaa1024>), MsigDb (Arthur et al. (2015) <doi:10.1016/j.cels.2015.12.004>), Reactome (David et al. (2011) <doi:10.1093/nar/gkq1018>), MeSH (Ish et al. (2014) <doi:10.4103/0019-5413.139827>), DisGeNET (Janet et al. (2017) <doi:10.1093/nar/gkw943>), Disease Ontology (Lynn et al. (2011) <doi:10.1093/nar/gkr972>), Network of Cancer Genes (Dimitra et al. (2019) <doi:10.1186/s13059-018-1612-0>) and COVID-19 (Maxim et al. (2020) <doi:10.21203/rs.3.rs-28582/v1>). Gene sets are stored in a list object that provides the data frames geneset and geneset_name. The geneset data frame has two columns, term ID and gene ID; geneset_name has two columns, term ID and term description.
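A minimal usage sketch of the documented list layout; the accessor getGO() and its arguments follow the package documentation but should be treated as assumptions here:

    library(geneset)
    gs <- getGO(org = "human", ont = "bp")   # assumed query function for GO sets
    head(gs$geneset)        # columns: term ID, gene ID
    head(gs$geneset_name)   # columns: term ID, term description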
It provides the cumulative distribution function (CDF), quantile, p-value, statistical power calculator and random number generator for a collection of group-testing procedures, including the Higher Criticism tests, the one-sided Kolmogorov-Smirnov tests, the one-sided Berk-Jones tests, the one-sided phi-divergence tests, etc. The input is a group of p-values. The null hypothesis is that they are i.i.d. Uniform(0,1). In the context of signal detection, the null hypothesis means no signals. In the context of goodness-of-fit testing, which contrasts a group of i.i.d. random variables with a given continuous distribution, the input p-values can be obtained by the CDF transformation; the null hypothesis then means that these random variables follow the given distribution. For reference, see [1] Hong Zhang, Jiashun Jin and Zheyang Wu. "Distributions and power of optimal signal-detection statistics in finite case", IEEE Transactions on Signal Processing (2020) 68, 1021-1033; [2] Hong Zhang and Zheyang Wu. "The general goodness-of-fit tests for correlated data", Computational Statistics & Data Analysis (2022) 167, 107379.
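For concreteness, a minimal sketch of one member of the family, the Higher Criticism statistic for n p-values that are i.i.d. Uniform(0,1) under the null (a common stabilized variant; not this package's interface):

    hc_stat <- function(p) {
      n <- length(p)
      p <- sort(p)
      i <- seq_len(n)
      keep <- p > 1 / n & p < 1 - 1 / n        # common stabilizing restriction
      max(sqrt(n) * (i / n - p)[keep] / sqrt((p * (1 - p))[keep]))
    }
    set.seed(1)
    hc_stat(runif(1000))    # null behaviour on uniform p-values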
Power analysis and sample size calculation for Welch and Hsu (Hedderich and Sachs (2018), ISBN:978-3-662-56657-2) t-tests, including Monte-Carlo simulations of empirical power and type I error. Power and sample size calculation for Wilcoxon rank sum and signed rank tests via Monte-Carlo simulations. Power and sample size required for the evaluation of a diagnostic test(-system) (Flahault et al. (2005), <doi:10.1016/j.jclinepi.2004.12.009>; Dobbin and Simon (2007), <doi:10.1093/biostatistics/kxj036>) as well as for a single proportion (Fleiss et al. (2003), ISBN:978-0-471-52629-2; Piegorsch (2004), <doi:10.1016/j.csda.2003.10.002>; Thulin (2014), <doi:10.1214/14-ejs909>), comparing two negative binomial rates (Zhu and Lakkis (2014), <doi:10.1002/sim.5947>), ANCOVA (Shieh (2020), <doi:10.1007/s11336-019-09692-3>), reference ranges (Jennen-Steinmetz and Wellek (2005), <doi:10.1002/sim.2177>), multiple primary endpoints (Sozu et al. (2015), ISBN:978-3-319-22005-5), and AUC (Hanley and McNeil (1982), <doi:10.1148/radiology.143.1.7063747>).
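A minimal sketch of the Monte-Carlo approach to empirical power for a Welch t-test, in base R only (an illustration of the idea, not this package's interface):

    power_mc <- function(n1, n2, delta, sd1, sd2, nsim = 5000, alpha = 0.05) {
      rej <- replicate(nsim, {
        x <- rnorm(n1, 0, sd1); y <- rnorm(n2, delta, sd2)
        t.test(x, y)$p.value < alpha          # t.test() defaults to Welch
      })
      mean(rej)                               # empirical power
    }
    power_mc(n1 = 30, n2 = 30, delta = 0.5, sd1 = 1, sd2 = 2)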
Computes the interdaily stability (IS), intradaily variability (IV) and relative amplitude (RA) from actigraphy data as described in Blume et al. (2016) <doi:10.1016/j.mex.2016.05.006> and van Someren et al. (1999) <doi:10.3109/07420529908998724>. Additionally, it computes L5 (i.e. the 5 hours with the lowest average actigraphy amplitude) and M10 (the 10 hours with the highest average amplitude) as well as the respective start times. The flex versions will also compute the L-value for a user-defined number of minutes. IS describes the strength of coupling of a rhythm to supposedly stable zeitgebers. It varies between 0 (Gaussian noise) and 1 (perfect IS). IV describes the fragmentation of a rhythm, i.e. the frequency and extent of transitions between rest and activity. It is near 0 for a perfect sine wave, about 2 for Gaussian noise, and may be even higher when a definite ultradian period of about 2 hours is present. RA is the relative amplitude of a rhythm. Note that to obtain reliable results, actigraphy data should cover a reasonable number of days.
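For reference, a minimal sketch of IS and IV computed from an hourly activity vector, following the formulas in van Someren et al. (1999) (an illustration, not this package's interface):

    is_iv <- function(x, period = 24) {
      n <- length(x); xbar <- mean(x)
      hmeans <- tapply(x, (seq_len(n) - 1) %% period, mean)  # mean per hour of day
      IS <- n * sum((hmeans - xbar)^2) / (period * sum((x - xbar)^2))
      IV <- n * sum(diff(x)^2) / ((n - 1) * sum((x - xbar)^2))
      c(IS = IS, IV = IV)
    }
    set.seed(1)
    x <- rep(sin(2 * pi * (0:23) / 24), 7) + rnorm(168, sd = 0.3)  # 7 days, hourly
    is_iv(x)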
Several tests for high-dimensional generalized linear models have been proposed recently. In this package, we implemented a new test called adaptive sum of powered score (aSPU) for high-dimensional generalized linear models, which is often more powerful than existing methods in a wide range of scenarios. We also implemented permutation-based versions of several existing methods for research purposes. We recommend that users use the aSPU test for their real testing problems. You can learn more about the tests implemented in the package via the following papers: 1. Pan, W., Kim, J., Zhang, Y., Shen, X. and Wei, P. (2014) <DOI:10.1534/genetics.114.165035> A powerful and adaptive association test for rare variants, Genetics, 197(4). 2. Guo, B., and Chen, S. X. (2016) <DOI:10.1111/rssb.12152>. Tests for high dimensional generalized linear models. Journal of the Royal Statistical Society: Series B. 3. Goeman, J. J., Van Houwelingen, H. C., and Finos, L. (2011) <DOI:10.1093/biomet/asr016>. Testing against a high-dimensional alternative in the generalized linear model: asymptotic type I error control. Biometrika, 98(2).
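To fix ideas, a minimal sketch of the sum-of-powered-score (SPU) statistics and their adaptive (minimum p-value) combination via permutation, in the spirit of Pan et al. (2014) (illustration only, not this package's interface):

    spu <- function(U, g) if (is.infinite(g)) max(abs(U)) else sum(U^g)
    aspu_perm <- function(U, Uperm, gammas = c(1, 2, 3, Inf)) {
      # U: observed score vector; Uperm: B x length(U) matrix of permuted scores
      T0 <- sapply(gammas, function(g) spu(U, g))
      Tb <- t(apply(Uperm, 1, function(u) sapply(gammas, function(g) spu(u, g))))
      B <- nrow(Tb)
      p0 <- sapply(seq_along(gammas),
                   function(j) (sum(abs(Tb[, j]) >= abs(T0[j])) + 1) / (B + 1))
      pb <- sapply(seq_along(gammas),
                   function(j) (B + 1 - rank(abs(Tb[, j]))) / B)  # per-permutation p-values
      (sum(apply(pb, 1, min) <= min(p0)) + 1) / (B + 1)           # aSPU p-value
    }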
This package provides a collection of analysis-ready datasets for the U.S. Geological Survey - Idaho National Laboratory (USGS-INL) groundwater and surface-water monitoring networks, administered by the USGS-INL Project Office in cooperation with the U.S. Department of Energy. The data collected from wells and surface-water stations at the Idaho National Laboratory and surrounding areas have been used to describe the effects of waste disposal on water contained in the eastern Snake River Plain aquifer, located in the southeastern part of Idaho, and the availability of water for long-term consumptive and industrial use. The package includes long-term monitoring records dating back to 1922. Geospatial data describing the areas from which samples were collected or observations were made are also included. Bundling these data into a single package substantially reduces the data-processing burden for researchers and provides a way to distribute the data along with their documentation in a standard format. Geospatial datasets are made available in a common projection and datum, and geohydrologic data have been structured to facilitate analysis.
Assists actuaries and other insurance modellers in pricing, reserving and capital modelling for non-life insurance and reinsurance. Provides functions that help model excess levels, capping and pure Incurred But Not Reported (pure IBNR) claims. Includes capped means, exposure curves and increased limit factor (ILF) curves for the LogNormal, Gamma, Pareto, Sliced LogNormal-Pareto and Sliced Gamma-Pareto distributions. Includes the mean, probability density function (pdf), cumulative distribution function (cdf) and inverse cumulative distribution function for the Sliced LogNormal-Pareto and Sliced Gamma-Pareto distributions. Includes calculating pure IBNR exposure with LogNormal and Gamma distributions for the reporting delay. Includes three shiny tools: one to simulate insurance claims applying reinsurance structures, one to fit generalised linear models, and one to fit claims frequency or severity distributions. Methods used in the package refer to "Free for All" by Yiannis Parizas (2023) <https://www.theactuary.com/2023/03/02/free-all>; "Escaping the triangle" by Yiannis Parizas (2019) <https://www.theactuary.com/features/2019/06/2019/06/05/escaping-triangle>; "Taken to excess" by Yiannis Parizas (2019) <https://www.theactuary.com/features/2019/03/2019/03/06/taken-excess>.
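As an example of the quantities involved, a minimal sketch of the capped (limited) mean E[min(X, c)] for a LogNormal severity, using the standard limited-expected-value formula that underlies exposure curves and ILFs (illustration, not this package's interface):

    capped_mean_lnorm <- function(cap, mu, sigma) {
      exp(mu + sigma^2 / 2) * pnorm((log(cap) - mu - sigma^2) / sigma) +
        cap * (1 - pnorm((log(cap) - mu) / sigma))
    }
    capped_mean_lnorm(cap = 1e6, mu = 10, sigma = 1.5)   # mean capped at 1m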
Doubly robust estimation and inference of the log hazard ratio under the Cox marginal structural model with informative censoring. An augmented inverse probability weighted estimator that involves three working models: one for the conditional failure time T, one for the conditional censoring time C, and one for the propensity score. Both models for T and C can depend on both a binary treatment A and additional baseline covariates Z, while the propensity score model only depends on Z. With the help of cross-fitting techniques, it achieves the rate-doubly-robust property, which allows the use of most machine learning or non-parametric methods for all three working models, something not permitted in classic inverse probability weighting or doubly robust estimators. When the proportional hazards assumption is violated, CoxAIPW estimates a causal estimand that is a weighted average of the time-varying log hazard ratio. References: Luo, J. (2023). Statistical Robustness - Distributed Linear Regression, Informative Censoring, Causal Inference, and Non-Proportional Hazards [Unpublished doctoral dissertation]. University of California San Diego; Luo & Xu (2022) <doi:10.48550/arXiv.2206.02296>; Rava (2021) <https://escholarship.org/uc/item/8h1846gs>.
Routines for estimating tree fiber (tracheid) length distributions in the standing tree based on increment core samples. Two types of data can be used with the package: increment core data measured by an optical fiber analyzer (OFA), such as the Kajaani Fiber Lab, or data measured by microscopy. Increment core data analyzed by OFAs consist of the cell lengths of both cut and uncut fibres (tracheids) and fines (such as ray parenchyma cells), without it being possible to identify which cells are cut, or whether they are fines or fibres. The microscopy-measured data consist of the observed lengths of the uncut fibres in the increment core. A censored version of a mixture of the fine and fiber length distributions is proposed to fit the OFA data under distributional assumptions (Svensson et al., 2006) <doi:10.1111/j.1467-9469.2006.00501.x>. The package offers two choices for the underlying density functions of the true fiber (fine) lengths of those fibers (fines) that at least partially appear in the increment core: the generalized gamma and the log-normal densities.
Label-free bottom-up proteomics expression data is often affected by data heterogeneity and missing values. Normalization and missing value imputation are commonly used techniques to address these issues and make the dataset suitable for further downstream analysis. This package provides an optimal combination of normalization and imputation methods for the dataset. The package utilizes three normalization methods and three imputation methods. The statistical evaluation measures named pooled coefficient of variation, pooled estimate of variance and pooled median absolute deviation are used for selecting the best combination of normalization and imputation methods for a given dataset. The user can also visualize the results using the various plots available in this package, and can perform differential expression analysis between two sample groups with the function included in this package. The three normalization methods, three imputation methods and three evaluation measures were chosen for this study based on the research papers published by Välikangas et al. (2016) <doi:10.1093/bib/bbw095>, Jin et al. (2021) <doi:10.1038/s41598-021-81279-4> and Srivastava et al. (2023) <doi:10.2174/1574893618666230223150253>.
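A minimal sketch of one of the evaluation measures, a pooled coefficient of variation across sample groups (an illustration of the idea, not this package's interface):

    pooled_cv <- function(mat, groups) {
      # mat: proteins x samples intensity matrix; groups: group label per sample
      cvs <- lapply(unique(groups), function(g) {
        sub <- mat[, groups == g, drop = FALSE]
        apply(sub, 1, function(r) sd(r) / mean(r))   # per-protein CV in group g
      })
      mean(unlist(cvs), na.rm = TRUE)                # pooled across proteins and groups
    }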
Fast and multi-threaded implementation of isolation forest (Liu, Ting, Zhou (2008) <doi:10.1109/ICDM.2008.17>), extended isolation forest (Hariri, Kind, Brunner (2018) <doi:10.48550/arXiv.1811.02141>), SCiForest (Liu, Ting, Zhou (2010) <doi:10.1007/978-3-642-15883-4_18>), fair-cut forest (Cortes (2021) <doi:10.48550/arXiv.2110.13402>), robust random-cut forest (Guha, Mishra, Roy, Schrijvers (2016) <http://proceedings.mlr.press/v48/guha16.html>), and customizable variations of them, for isolation-based outlier detection, clustered outlier detection, distance or similarity approximation (Cortes (2019) <doi:10.48550/arXiv.1910.12362>), isolation kernel calculation (Ting, Zhu, Zhou (2018) <doi:10.1145/3219819.3219990>), and imputation of missing values (Cortes (2019) <doi:10.48550/arXiv.1911.06646>), based on random or guided decision tree splitting, and providing different metrics for scoring anomalies based on isolation depth or density (Cortes (2021) <doi:10.48550/arXiv.2111.11639>). Provides simple heuristics for fitting the model to categorical columns and handling missing data, and offers options for varying between random and guided splits, and for using different splitting criteria.
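A minimal usage sketch: fit an isolation forest and score observations (argument names per the isotree documentation; higher scores are more anomalous):

    library(isotree)
    set.seed(1)
    X <- data.frame(a = rnorm(500), b = rnorm(500))
    X[1, ] <- c(8, 8)                      # plant an obvious outlier
    model <- isolation.forest(X, ntrees = 100)
    scores <- predict(model, X)            # anomaly scores
    which.max(scores)                      # should flag row 1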
This package provides a class of methods that combine dimension reduction and clustering of continuous, categorical or mixed-type data (Markos, Iodice D'Enza and van de Velden 2019; <DOI:10.18637/jss.v091.i10>). For continuous data, the package contains implementations of factorial K-means (Vichi and Kiers 2001; <DOI:10.1016/S0167-9473(00)00064-5>) and reduced K-means (De Soete and Carroll 1994; <DOI:10.1007/978-3-642-51175-2_24>), both of which combine principal component analysis with K-means clustering. For categorical data, the package provides MCA K-means (Hwang, Dillon and Takane 2006; <DOI:10.1007/s11336-004-1173-x>), i-FCB (Iodice D'Enza and Palumbo 2013, <DOI:10.1007/s00180-012-0329-x>) and Cluster Correspondence Analysis (van de Velden, Iodice D'Enza and Palumbo 2017; <DOI:10.1007/s11336-016-9514-0>), which combine multiple correspondence analysis with K-means. For mixed-type data, it provides mixed Reduced K-means and mixed Factorial K-means (van de Velden, Iodice D'Enza and Markos 2019; <DOI:10.1002/wics.1456>), which combine PCA for mixed-type data with K-means.
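A minimal usage sketch of reduced K-means on continuous data; the function cluspca() and its arguments follow the package documentation but should be treated as assumptions here:

    library(clustrd)
    set.seed(1)
    out <- cluspca(iris[, 1:4], nclus = 3, ndim = 2, method = "RKM")
    table(out$cluster, iris$Species)   # assumed element name for cluster labels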
Early insights in probability theory were largely influenced by questions about gambling and games of chance, as noted by Blitzstein and Hwang (2019, ISBN:978-1138369917). In modern times, playing cards continue to serve as an effective teaching tool for probability, statistics, and even R programming, as demonstrated by Grolemund (2014, ISBN:978-1449359010). The mmcards package offers a collection of utility functions designed to aid in the creation, manipulation, and utilization of playing card decks in multiple formats. These include a standard 52-card deck, as well as alternative decks such as decks defined by custom anonymous functions and custom interleaved decks. Optimized for the development of educational shiny applications, the package is particularly useful for teaching statistics and probability through card-based games. Functions include shuffle_deck(), which creates either a shuffled standard deck or a shuffled custom alternative deck; deal_card(), which takes a deck and returns a list object containing both the dealt card and the updated deck; and i_deck(), which adds image paths to card objects, further enriching the package's utility in the development of interactive shiny application card games.
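A minimal usage sketch of the documented workflow; the list element names are assumptions based on the description above:

    library(mmcards)
    deck <- shuffle_deck()            # shuffled standard 52-card deck
    res <- deal_card(deck)            # list holding the dealt card and the rest
    res$dealt_card                    # assumed element name
    res$updated_deck                  # assumed element name; 51 cards remain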
Implementation of uniformity tests on the circle and (hyper)sphere. The main function of the package is unif_test(), which conveniently collects more than 35 tests for assessing uniformity on S^{p-1} = {x in R^p : ||x|| = 1}, p >= 2. The test statistics are implemented in the unif_stat() function, which allows computing several statistics for different samples within a single call, thus facilitating Monte Carlo experiments. Furthermore, the unif_stat_MC() function allows parallelizing them in a simple way. The asymptotic null distributions of the statistics are available through the function unif_stat_distr(). The core of sphunif is coded in C++ by relying on the Rcpp package. The package also provides several novel datasets and gives the replicability for the data applications/simulations in García-Portugués et al. (2021) <doi:10.1007/978-3-030-69944-4_12>, García-Portugués et al. (2023) <doi:10.3150/21-BEJ1454>, García-Portugués et al. (2024) <doi:10.48550/arXiv.2108.09874>, and Fernández-de-Marcos and García-Portugués (2024) <doi:10.48550/arXiv.2405.13531>.
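A minimal usage sketch: simulate points on the circle and test their uniformity. unif_test() is documented above, while the sampler r_unif_sph() and the test name are assumptions taken from the package documentation:

    library(sphunif)
    set.seed(1)
    samp <- r_unif_sph(n = 200, p = 2)   # assumed uniform sampler on the circle
    unif_test(samp, type = "PAD")        # projected Anderson-Darling test (assumed name)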
Suns-Voc (or Isc-Voc) curves can provide the current-voltage (I-V) characteristics of the diode of photovoltaic cells without the effect of series resistance. Here, Suns-Voc curves can be constructed from outdoor time-series I-V curves [1,2,3] of full-size photovoltaic (PV) modules instead of having to be measured in the lab. Time series of four different power loss modes can be calculated based on the obtained Isc-Voc curves. This material is based upon work supported by the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy (EERE) under Solar Energy Technologies Office (SETO) Agreement Number DE-EE0008172. Jennifer L. Braid is supported by the U.S. Department of Energy (DOE) Office of Energy Efficiency and Renewable Energy administered by the Oak Ridge Institute for Science and Education (ORISE) for the DOE. ORISE is managed by Oak Ridge Associated Universities (ORAU) under DOE contract number DE-SC0014664. [1] Wang, M. et al., 2018 <doi:10.1109/PVSC.2018.8547772>. [2] Walters et al., 2018 <doi:10.1109/PVSC.2018.8548187>. [3] Guo, S. et al., 2016 <doi:10.1117/12.2236939>.