Versatile method for ungrouping histograms (binned count data), assuming that counts are Poisson distributed and that the underlying sequence on a fine grid to be estimated is smooth. The method is based on the composite link model, and estimation is achieved by maximizing a penalized likelihood; smooth, detailed sequences of counts and rates are thereby estimated from the binned counts. Ungrouping binned data can be desirable for many reasons: bins can be too coarse to allow for accurate analysis; comparisons can be hindered when different grouping approaches are used in different histograms; and the last interval is often wide and open-ended and thus conceals a lot of information in the tail area. Age-at-death distributions grouped in age classes and abridged life tables are examples of binned data. Because of its modest assumptions, the approach is suitable for many demographic and epidemiological applications. For a detailed description of the method and applications see Rizzi et al. (2015) <doi:10.1093/aje/kwv020>.
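A minimal sketch of ungrouping a coarsely binned age-at-death distribution, assuming the package exposes a pclm() function that takes the left endpoints of the bins, the binned counts, and the assumed width of the open-ended last interval (argument names are from memory and may differ):

```r
library(ungroup)

# Left endpoints of coarse age bins and the corresponding death counts
x <- c(0, 1, seq(5, 85, by = 5))   # last bin (85+) is open-ended
y <- c(30, 12, 10, 14, 25, 40, 55, 70, 90, 120,
       160, 210, 260, 300, 320, 310, 250, 180, 90)

# Penalized composite link model; nlast is the assumed width of the open bin
fit <- pclm(x, y, nlast = 20)

plot(fit)          # smooth ungrouped distribution on a fine (single-year) grid
head(fit$fitted)   # estimated counts on the fine grid
```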
The Macroeconomics-at-Risk (MaR) approach is based on a two-step semi-parametric estimation procedure that allows forecasting the full conditional distribution of an economic variable at a given horizon, as a function of a set of factors. These density forecasts can then be used to produce coherent forecasts for any downside risk measure, e.g., value-at-risk, expected shortfall, or downside entropy. Initially introduced by Adrian et al. (2019) <doi:10.1257/aer.20161923> to reveal the vulnerability of economic growth to financial conditions, the MaR approach is now widely used by international financial institutions to provide Value-at-Risk (VaR) type forecasts for GDP growth (Growth-at-Risk) or inflation (Inflation-at-Risk). This package provides methods for estimating these models. Datasets for the US and the Eurozone are available to allow testing of the Adrian et al. (2019) model. This package constitutes a useful toolbox (data and functions) for private practitioners, scholars and policymakers.
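For illustration only (the estimation functions of this package are not named above), the first step of such a two-step procedure can be sketched with ordinary quantile regression from the quantreg package: conditional quantiles of future GDP growth are estimated as a function of a financial-conditions factor, and a low fitted quantile is read off as a Growth-at-Risk measure. The second step, smoothing the quantiles into a full density (e.g., by fitting a skew-t distribution), is not shown.

```r
library(quantreg)

# Toy data: future GDP growth and a financial-conditions index (illustrative)
set.seed(1)
n   <- 200
fci <- rnorm(n)
gdp <- 2 - 1.5 * pmin(fci, 0) + rnorm(n, sd = 1 + 0.5 * abs(fci))

# Step 1: conditional quantiles of growth given financial conditions
taus <- c(0.05, 0.25, 0.50, 0.75, 0.95)
fit  <- rq(gdp ~ fci, tau = taus)

# 5% quantile forecast (Growth-at-Risk) under tight financial conditions (fci = 1)
predict(fit, newdata = data.frame(fci = 1))[, 1]
```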
Data from statistical agencies and other institutions often need to be protected before they can be published. This package can be used to perturb statistical tables in a consistent way. The main idea is to add, at the micro-data level, a record key for each unit. Based on these keys, for any cell in a statistical table a cell key is computed as a function of the record keys contributing to that cell. The values that are added to the cell in order to perturb it are derived from a lookup table that maps cell keys to specific perturbation values. The theoretical basis for the methods implemented can be found in Thompson, Broadfoot and Elazar (2013) <https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2013/Topic_1_ABS.pdf>, which was extended and enhanced by Giessing and Tent (2019) <https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S2_Germany_Giessing_Tent_AD.pdf>.
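A minimal base-R sketch of the cell key idea described above (illustrative only, not this package's API): each micro-data unit receives a uniform record key, the cell key is the fractional part of the sum of the record keys contributing to a cell, and the cell key indexes a lookup table of perturbation values.

```r
set.seed(42)

# Micro data: every unit carries a fixed uniform record key
micro <- data.frame(region = sample(c("A", "B"), 100, replace = TRUE),
                    rkey   = runif(100))

# Cell key: fractional part of the sum of the record keys contributing to a cell
cellkeys <- tapply(micro$rkey, micro$region, function(k) sum(k) %% 1)
counts   <- table(micro$region)

# Toy lookup table mapping cell-key intervals to perturbation values
perturbation <- cut(cellkeys, breaks = seq(0, 1, by = 0.2),
                    labels = c(-2, -1, 0, 1, 2), include.lowest = TRUE)
counts + as.numeric(as.character(perturbation))   # perturbed cell counts
```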
Partitioning clustering divides the objects in a data set into non-overlapping subsets or clusters by using prototype-based probabilistic and possibilistic clustering algorithms. This package provides a set of functions for Fuzzy C-Means (Bezdek, 1974) <doi:10.1080/01969727308546047>, Possibilistic C-Means (Krishnapuram & Keller, 1993) <doi:10.1109/91.227387>, Possibilistic Fuzzy C-Means (Pal et al, 2005) <doi:10.1109/TFUZZ.2004.840099>, Possibilistic Clustering Algorithm (Yang et al, 2006) <doi:10.1016/j.patcog.2005.07.005>, Possibilistic C-Means with Repulsion (Wachs et al, 2006) <doi:10.1007/3-540-31662-0_6> and other variants of hard and soft clustering algorithms. The cluster prototypes and membership matrices required by these partitioning algorithms are initialized with different initialization techniques that are available in the package 'inaparc'. As distance metrics, not only the Euclidean distance but also a set of commonly used distance metrics are available for use with some of the algorithms in the package.
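A minimal sketch of fuzzy c-means clustering on the iris data, assuming the package exposes an fcm() function that takes a numeric data matrix and the number of clusters and returns membership degrees and prototypes (names of the result components are from memory):

```r
library(ppclust)

x   <- iris[, -5]               # numeric features only
res <- fcm(x, centers = 3)      # fuzzy c-means with 3 clusters

head(res$u)                     # fuzzy membership degrees
res$v                           # cluster prototypes (centers)
table(res$cluster, iris$Species)
```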
KnowSeq proposes a novel methodology that comprises the most relevant steps in transcriptomic gene expression analysis. KnowSeq is intended to serve as an integrative tool that allows processing and extracting relevant biomarkers, as well as assessing them through Machine Learning approaches. Finally, the last objective of KnowSeq is biological knowledge extraction from the biomarkers (Gene Ontology enrichment, pathway listing and visualization, and evidences related to the addressed disease). Although the package allows analyzing all the data manually, the main strength of KnowSeq is the possibility of generating an automatic and intelligent HTML report that collects all the involved steps in one document. It is important to highlight that the pipeline is totally modular and flexible, hence it can be started from any of the different steps. KnowSeq aims to serve as a novel tool that helps experts in the field to acquire robust knowledge and conclusions for the data and diseases under study.
Weakly supervised (WS), multiple instance (MI) data arises in numerous interesting applications such as drug discovery, object detection, and tumor prediction on whole slide images. The mildsvm package provides an easy way to learn from such data by training Support Vector Machine (SVM)-based classifiers. It also contains helpful functions for building and printing multiple instance data frames. The core methods in mildsvm come from the following references: Kent and Yu (2024) <doi:10.1214/24-AOAS1876>; Xiao, Liu, and Hao (2018) <doi:10.1109/TNNLS.2017.2766164>; Muandet et al. (2012) <https://proceedings.neurips.cc/paper/2012/file/9bf31c7ff062936a96d3c8bd1f8f2ff3-Paper.pdf>; Chu and Keerthi (2007) <doi:10.1162/neco.2007.19.3.792>; and Andrews et al. (2003) <https://papers.nips.cc/paper/2232-support-vector-machines-for-multiple-instance-learning.pdf>. Many functions use the Gurobi optimization back-end to improve the speed of the optimization; the gurobi R package and associated software can be downloaded from <https://www.gurobi.com> after obtaining a license.
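A rough sketch of fitting a multiple-instance SVM on a toy data set, assuming a misvm() function with a formula interface built around an mi() term marking the bag-level label and bag identifier; the exact argument names (e.g. new_data in predict()) are assumptions and may differ:

```r
library(mildsvm)

# Toy multiple instance data: instances grouped into bags, labels at bag level
set.seed(1)
df <- data.frame(
  bag_label = rep(c(0, 1), each = 25),
  bag_name  = rep(1:10, each = 5),
  x1 = rnorm(50),
  x2 = rnorm(50)
)
df$x1[df$bag_label == 1] <- df$x1[df$bag_label == 1] + 2   # signal in positive bags

# Multiple-instance SVM; mi() marks the bag-level label and the bag identifier
fit <- misvm(mi(bag_label, bag_name) ~ x1 + x2, data = df)
predict(fit, new_data = df)
```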
Sequencing and microarray samples often are collected or processed in multiple batches or at different times. This often produces technical biases that can lead to incorrect results in the downstream analysis. BatchQC is a software tool that streamlines batch preprocessing and evaluation by providing interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts the data. BatchQC diagnostics help determine whether batch adjustment needs to be done, and how correction should be applied before proceeding with a downstream analysis. Moreover, BatchQC interactively applies multiple common batch effect approaches to the data and the user can quickly see the benefits of each method. BatchQC is developed as a Shiny App. The output is organized into multiple tabs and each tab features an important part of the batch effect analysis and visualization of the data. The BatchQC interface has the following analysis groups: Summary, Differential Expression, Median Correlations, Heatmaps, Circular Dendrogram, PCA Analysis, Shape, ComBat and SVA.
This package provides a comprehensive collection of functions for conducting meta-analyses in R. The package includes functions to calculate various effect sizes or outcome measures, fit fixed-, random-, and mixed-effects models to such data, carry out moderator and meta-regression analyses, and create various types of meta-analytical plots (e.g., forest, funnel, radial, L'Abbe, Baujat, GOSH plots). For meta-analyses of binomial and person-time data, the package also provides functions that implement specialized methods, including the Mantel-Haenszel method, Peto's method, and a variety of suitable generalized linear (mixed-effects) models (i.e. mixed-effects logistic and Poisson regression models). Finally, the package provides functionality for fitting meta-analytic multivariate/multilevel models that account for non-independent sampling errors and/or true effects (e.g. due to the inclusion of multiple treatment studies, multiple endpoints, or other forms of clustering). Network meta-analyses and meta-analyses accounting for known correlation structures (e.g. due to phylogenetic relatedness) can also be conducted.
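This description matches the metafor package; a minimal random-effects meta-analysis using its bundled BCG vaccine data set looks like this (escalc() computes log relative risks, rma() fits the model):

```r
library(metafor)

# BCG vaccine trials shipped with metafor
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg)

# Random-effects model on the log relative risks
res <- rma(yi, vi, data = dat)
summary(res)

forest(res)   # forest plot
funnel(res)   # funnel plot
```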
This package provides a nonvisual procedure for screening time series for nonstationarity in the context of intensive longitudinal designs, such as ecological momentary assessments. The method combines two diagnostics: one for detecting trends (based on the split R-hat statistic from Bayesian convergence diagnostics) and one for detecting changes in variance (a novel extension inspired by Levene's test). This approach allows researchers to efficiently and reproducibly detect violations of the stationarity assumption, especially when visual inspection of many individual time series is impractical. The procedure is suitable for use in all areas of research where time series analysis is central. For a detailed description of the method and its validation through simulations and empirical application, see Zitzmann, S., Lindner, C., Lohmann, J. F., & Hecht, M. (2024) "A Novel Nonvisual Procedure for Screening for Nonstationarity in Time Series as Obtained from Intensive Longitudinal Designs" <https://www.researchgate.net/publication/384354932_A_Novel_Nonvisual_Procedure_for_Screening_for_Nonstationarity_in_Time_Series_as_Obtained_from_Intensive_Longitudinal_Designs>.
Intervention analysis is used to investigate structural changes in data resulting from external events. Traditional time series intervention models, viz. the Autoregressive Integrated Moving Average model with exogenous variables (ARIMA-X) and Artificial Neural Networks with exogenous variables (ANN-X), rely on linear intervention functions such as step or ramp functions, or their combinations. In this package, the Gompertz, Logistic, Monomolecular, Richard and Hoerl functions are used as non-linear intervention functions. The equations of these models are: Gompertz: A * exp(-B * exp(-k * t)); Logistic: K / (1 + ((K - N0) / N0) * exp(-r * t)); Monomolecular: A * exp(-k * t); Richard: A + (K - A) / (1 + exp(-B * (C - t)))^(1/beta); and Hoerl: a*(b^t)*(t^c). This package introduces an algorithm for time series intervention analysis employing ARIMA and ANN models with a non-linear intervention function. It has been developed using the algorithms of Yeasin et al. <doi:10.1016/j.hazadv.2023.100325> and Paul and Yeasin <doi:10.1371/journal.pone.0272999>.
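The non-linear intervention functions listed above, written out as plain R functions using the parameter names from the equations in the description (illustrative only, not the package's internal code):

```r
# Non-linear intervention functions as given in the description
gompertz      <- function(t, A, B, k)            A * exp(-B * exp(-k * t))
logistic      <- function(t, K, N0, r)           K / (1 + ((K - N0) / N0) * exp(-r * t))
monomolecular <- function(t, A, k)               A * exp(-k * t)
richard       <- function(t, A, K, B, C, beta)   A + (K - A) / (1 + exp(-B * (C - t)))^(1 / beta)
hoerl         <- function(t, a, b, c)            a * (b^t) * (t^c)

# Example: evaluate a Gompertz-shaped intervention over 20 time points
round(gompertz(t = 1:20, A = 10, B = 5, k = 0.3), 2)
```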
Alternative splicing produces a variety of different protein products from a given gene. VALERIE enables visualisation of alternative splicing events from high-throughput single-cell RNA-sequencing experiments. VALERIE computes percent spliced-in (PSI) values for user-specified genomic coordinates corresponding to alternative splicing events. PSI is the proportion of sequencing reads supporting the included exon/intron, as defined by Shiozawa (2018) <doi:10.1038/s41467-018-06063-x>. PSI values are inferred from sequencing read data based on specialised infrastructures for representing and computing annotated genomic ranges by Lawrence (2013) <doi:10.1371/journal.pcbi.1003118>. Computed PSI values for each single cell are subsequently presented in the form of a heatmap implemented using the pheatmap package by Kolde (2010) <https://CRAN.R-project.org/package=pheatmap>. A broad overview of the mean PSI difference and associated p-values across different user-defined groups of single cells is presented in the form of a line graph using the ggplot2 package by Wickham (2007) <https://CRAN.R-project.org/package=ggplot2>.
Applying the family of the Bayesian Expectation-Maximization-Maximization (BEMM) algorithm to estimate: (1) the three parameter logistic (3PL) model proposed by Birnbaum (1968, ISBN:9780201043105); (2) the four parameter logistic (4PL) model proposed by Barton & Lord (1981) <doi:10.1002/j.2333-8504.1981.tb01255.x>; (3) the one parameter logistic guessing (1PLG) and (4) the one parameter logistic ability-based guessing (1PLAG) models proposed by San Martín et al. (2006) <doi:10.1177/0146621605282773>. The BEMM family includes (1) the BEMM algorithm for the 3PL model proposed by Guo & Zheng (2019) <doi:10.3389/fpsyg.2019.01175>; (2) the BEMM algorithm for the 1PLG model and (3) the BEMM algorithm for the 1PLAG model proposed by Guo, Wu, Zheng, & Chen (2021) <doi:10.1177/0146621621990761>; (4) the BEMM algorithm for the 4PL model proposed by Zheng, Guo, & Kern (2021) <doi:10.1177/21582440211052556>; and (5) their maximum likelihood estimation versions proposed by Zheng, Meng, Guo, & Liu (2018) <doi:10.3389/fpsyg.2017.02302>. Thus, both Bayesian modal estimates and maximum likelihood estimates are available.
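A hypothetical usage sketch, assuming the package exposes a BEMM.3PL() function that takes a 0/1 response matrix and returns item and ability estimates; the function name, default priors, and return structure are assumptions and may differ:

```r
# Hypothetical sketch; BEMM.3PL() and its defaults are assumed, not verified
library(IRTBEMM)

# Simulated 0/1 response matrix: 500 examinees, 20 items
set.seed(1)
resp <- matrix(rbinom(500 * 20, 1, 0.6), nrow = 500, ncol = 20)

# Estimate the 3PL model with the BEMM algorithm
fit <- BEMM.3PL(resp)
str(fit)   # item parameters, ability estimates, convergence information
```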
The robustness of many of the statistical techniques, such as factor analysis, applied in the social sciences rests upon the assumption of item-level normality. However, when dealing with real data, these assumptions are often not met. The Box-Cox transformation (Box & Cox, 1964) <http://www.jstor.org/stable/2984418> provides an optimal transformation for non-normal variables. Yet, for large datasets of continuous variables, its application in current software programs is cumbersome with analysts having to take several steps to normalise each variable. We present an R package normalr that enables researchers to make convenient optimal transformations of multiple variables in datasets. This R package enables users to quickly and accurately: (1) anchor all of their variables at 1.00, (2) select the desired precision with which the optimal lambda is estimated, (3) apply each unique exponent to its variable, (4) rescale resultant values to within their original X1 and X(n) ranges, and (5) provide original and transformed estimates of skewness, kurtosis, and other inferential assessments of normality.
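For illustration of the underlying idea (not this package's own interface), an optimal Box-Cox lambda for a single positive variable can be located with MASS::boxcox() and then applied by hand:

```r
library(MASS)

set.seed(1)
x <- rexp(500, rate = 2) + 1    # positive, right-skewed variable

# Profile likelihood over lambda for an intercept-only model
bc     <- boxcox(x ~ 1, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the Box-Cox transformation with the selected lambda
x_t <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda

skew <- function(z) mean((z - mean(z))^3) / sd(z)^3
c(lambda = lambda, skew_before = skew(x), skew_after = skew(x_t))
```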
Modeling spatial dependencies in dependent variables, extending traditional spatial regression approaches. It allows for the joint modeling of both the mean and the variance of the dependent variable, incorporating semiparametric effects in both models. Based on generalized additive models (GAM), the package enables the inclusion of non-parametric terms while maintaining the classical theoretical framework of spatial regression. Additionally, it implements the Generalized Spatial Autoregression (GSAR) model, which extends classical methods like the logistic Spatial Autoregressive Model (SAR), probit Spatial Autoregressive Model (SAR), and Poisson Spatial Autoregressive Model (SAR), offering greater flexibility in modeling spatial dependencies and significantly improving computational efficiency and the statistical properties of the estimators. Related work includes: a) J.D. Toloza-Delgado, Melo O.O., Cruz N.A. (2024). "Joint spatial modeling of mean and non-homogeneous variance combining semiparametric SAR and GAMLSS models for hedonic prices". <doi:10.1016/j.spasta.2024.100864>. b) Cruz, N. A., Toloza-Delgado, J. D., Melo, O. O. (2024). "Generalized spatial autoregressive model". <doi:10.48550/arXiv.2412.00945>.
GOfuncR performs a gene ontology enrichment analysis based on the ontology enrichment software FUNC. GO-annotations are obtained from OrganismDb or OrgDb packages (Homo.sapiens by default); the GO-graph is included in the package and updated regularly. GOfuncR provides the standard candidate vs background enrichment analysis using the hypergeometric test, as well as three additional tests:
the Wilcoxon rank-sum test that is used when genes are ranked,
a binomial test that is used when genes are associated with two counts, and
a Chi-square or Fisher's exact test that is used in cases when genes are associated with four counts.
To correct for multiple testing and interdependency of the tests, family-wise error rates are computed based on random permutations of the gene-associated variables. GOfuncR also provides tools for exploring the ontology graph and the annotations, and options to take gene-length or spatial clustering of genes into account. It is also possible to provide custom gene coordinates, annotations and ontologies.
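A minimal candidate-vs-background enrichment run, assuming the package's go_enrich() function accepts a data frame with gene symbols in the first column and a 1/0 candidate indicator in the second (argument names are from memory and may differ slightly):

```r
library(GOfuncR)

# Candidate genes flagged with 1, background genes with 0 (gene symbols)
genes <- data.frame(
  gene_ids  = c("NCAPG", "APOL4", "NGFR", "NXPH4", "C21orf59",
                "CACNG2", "AGTR1", "ANO1", "BTBD3", "MTUS1"),
  candidate = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Hypergeometric candidate-vs-background enrichment (default test)
res <- go_enrich(genes, n_randsets = 100)
head(res$results)
```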
Simulate survival times from standard parametric survival distributions (exponential, Weibull, Gompertz), 2-component mixture distributions, or a user-defined hazard, log hazard, cumulative hazard, or log cumulative hazard function. Baseline covariates can be included under a proportional hazards assumption. Time dependent effects (i.e. non-proportional hazards) can be included by interacting covariates with linear time or a user-defined function of time. Clustered event times are also accommodated. The 2-component mixture distributions can allow for a variety of flexible baseline hazard functions reflecting those seen in practice. If the user wishes to provide a user-defined hazard or log hazard function then this is possible, and the resulting cumulative hazard function does not need to have a closed-form solution. For details see the supporting paper <doi:10.18637/jss.v097.i03>. Note that this package is modelled on the survsim package available in the Stata software (see Crowther and Lambert (2012) <https://www.stata-journal.com/sjpdf.html?articlenum=st0275> or Crowther and Lambert (2013) <doi:10.1002/sim.5823>).
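A short sketch following the interface described in the supporting paper: a Weibull baseline hazard, a proportional hazards effect of a binary treatment covariate, and administrative censoring at five years (argument names are from memory and may differ slightly):

```r
library(simsurv)

# Covariate data: treatment indicator for 500 subjects
set.seed(1)
cov <- data.frame(id = 1:500, trt = rbinom(500, 1, 0.5))

# Weibull baseline hazard, proportional hazards effect of treatment,
# administrative censoring at 5 years
dat <- simsurv(dist = "weibull", lambdas = 0.1, gammas = 1.5,
               x = cov, betas = c(trt = -0.5), maxt = 5)

head(merge(cov, dat, by = "id"))
```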
Linnorm is an R package for the analysis of RNA-seq, scRNA-seq, ChIP-seq count data or any large-scale count data. It transforms such datasets for parametric tests. In addition to the transformation function (Linnorm), the following pipelines are implemented:
Library size/batch effect normalization (Linnorm.Norm);
Cell subpopulation analysis and visualization using t-SNE or PCA, with K-means clustering or hierarchical clustering (Linnorm.tSNE, Linnorm.PCA, Linnorm.HClust);
Differential expression analysis or differential peak detection using limma (Linnorm.limma);
Highly variable gene discovery and visualization (Linnorm.HVar);
Gene correlation network analysis and visualization (Linnorm.Cor);
Stable gene selection for scRNA-seq data, for users without spike-in genes or who do not want to rely on them (Linnorm.SGenes);
Data imputation (Linnorm.DataImput).
Linnorm can work with raw counts, CPM, RPKM, FPKM and TPM. Additionally, the RnaXSim function is included for simulating RNA-seq data for the evaluation of DEG analysis methods.
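A minimal sketch of transforming a raw count matrix, assuming the main Linnorm() and Linnorm.Norm() functions take a matrix with genes in rows and samples in columns (as the description suggests):

```r
library(Linnorm)

# Toy count matrix: 1000 genes x 6 cells/samples
set.seed(1)
counts <- matrix(rnbinom(1000 * 6, size = 1, mu = 20), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

# Linnorm transformation for parametric downstream tests
transformed <- Linnorm(counts)

# Library size / batch effect normalization only
normalized <- Linnorm.Norm(counts)
```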
This package provides functions to produce accessible HTML slides, HTML, Word and PDF documents from input R markdown files. Accessible PDF files are produced only on a Windows operating system. One aspect of accessibility is providing a headings structure that is recognised by a screen reader, providing a navigational tool for a blind or partially-sighted person. A key aim is to produce documents of different formats easily from each of a collection of R markdown source files. Input R markdown files are rendered using the render() function from the rmarkdown package <https://cran.r-project.org/package=rmarkdown>. A zip file containing multiple output files can be produced from one function call. A user-supplied template Word document can be used to determine the formatting of an output Word document. Accessible PDF files are produced from Word documents using OfficeToPDF <https://github.com/cognidox/OfficeToPDF>. A convenience function, install_otp(), is provided to install this software. The option to print HTML output to (non-accessible) PDF files is also available.
Three games: proton, frequon and regression. Each one is a console-based data-crunching game for younger and older data scientists. Act as a data hacker and find Slawomir Pietraszko's credentials to the Proton server. In proton you have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basic knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. In frequon you will help to perform a statistical cryptanalytic attack on a corpus of ciphered messages. This time seven sub-tasks push the bar much higher. Do you accept the challenge? In regression you will test your modeling skills in a series of eight sub-tasks. Try it only if ANOVA is your close friend. It's part of the Beta and Bit project. You will find more about the Beta and Bit project at <https://github.com/BetaAndBit/Charts>.
Developed to help researchers who need to model the kinetics of carbon dioxide (CO2) production in the alcoholic fermentation of wines, beers and other fermented products. The following models are available for modeling the carbon dioxide production curve as a function of time: 5PL, Gompertz and 4PL. The package has different functions which, when applied, can: perform the modeling of the data obtained in the fermentation and return the coefficients; analyze the model fit and return different statistical metrics; and calculate the kinetic parameters: maximum production of carbon dioxide; maximum rate of production of carbon dioxide; moment at which the maximum fermentation rate occurs; duration of the latency phase for carbon dioxide production; and carbon dioxide produced until the maximum fermentation rate occurs. In addition, a function that generates graphs with the observed and predicted data from the models, isolated and combined, is available. Gava, A., Borsato, D., & Ficagna, E. (2020). "Effect of mixture of fining agents on the fermentation kinetics of base wine for sparkling wine production: Use of methodology for modeling". <doi:10.1016/j.lwt.2020.109660>.
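For illustration of the modeling step (not this package's own functions), a Gompertz curve for cumulative CO2 production can be fitted with base R nls(), and the time of maximum fermentation rate recovered from the fitted coefficients:

```r
set.seed(1)

# Toy fermentation data: time (h) and cumulative CO2 produced (illustrative units)
time <- seq(0, 200, by = 10)
co2  <- 50 * exp(-5 * exp(-0.04 * time)) + rnorm(length(time), sd = 1)

# Gompertz model: A * exp(-B * exp(-k * t))
fit <- nls(co2 ~ A * exp(-B * exp(-k * time)),
           start = list(A = 45, B = 4, k = 0.03))

coef(fit)                           # estimated kinetic coefficients
B <- coef(fit)["B"]; k <- coef(fit)["k"]
unname(log(B) / k)                  # time at which the maximum fermentation rate occurs
```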
PaIRKAT is a model framework for assessing statistical relationships between networks of metabolites (pathways) and an outcome of interest (phenotype). PaIRKAT queries the KEGG database to determine interactions between metabolites, from which network connectivity is constructed. This model framework improves testing power on high dimensional data by including graph topology in the kernel machine regression setting. Studies on high dimensional data can struggle to include the complex relationships between variables. The semi-parametric kernel machine regression model is a powerful tool for capturing these types of relationships. It provides a framework for testing for relationships between outcomes of interest and high dimensional data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known biological connections between high dimensional variables by representing them as edges of 'graphs' or 'networks'. It is common for nodes (e.g. metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power whether or not the graph information is included. A graph regularization or 'smoothing' approach is included for managing this issue.
This package provides a fast and flexible general-purpose implementation of Particle Swarm Optimization (PSO) and Differential Evolution (DE) for solving global minimization problems. It is designed to handle complex optimization tasks with nonlinear, non-differentiable, and multi-modal objective functions defined by users. Five types of PSO variants are included: Particle Swarm Optimization (PSO, Eberhart & Kennedy, 1995) <doi:10.1109/MHS.1995.494215>, Quantum-behaved Particle Swarm Optimization (QPSO, Sun et al., 2004) <doi:10.1109/CEC.2004.1330875>, Locally convergent Rotationally invariant Particle Swarm Optimization (LcRiPSO, Bonyadi & Michalewicz, 2014) <doi:10.1007/s11721-014-0095-1>, Competitive Swarm Optimizer (CSO, Cheng & Jin, 2015) <doi:10.1109/TCYB.2014.2322602> and Double Exponential Particle Swarm Optimization (DExPSO, Stehlik et al., 2024) <doi:10.1016/j.asoc.2024.111913>. For the DE algorithm, six types from Storn, R. & Price, K. (1997) <doi:10.1023/A:1008202821328> are included: DE/rand/1, DE/rand/2, DE/best/1, DE/best/2, DE/rand-to-best/1 and DE/rand-to-best/2.
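A hypothetical usage sketch, assuming the package exposes a globpso() solver taking an objective function and box bounds; the solver name, arguments, and result fields are assumptions and may differ:

```r
# Hypothetical sketch; globpso() and its result fields are assumed, not verified
library(globpso)

# Multi-modal test objective: 2-D Rastrigin function
rastrigin <- function(x) sum(x^2 - 10 * cos(2 * pi * x)) + 10 * length(x)

res <- globpso(objFunc = rastrigin,
               lower   = c(-5.12, -5.12),
               upper   = c(5.12, 5.12))
res$par   # best solution found (assumed field name)
res$val   # objective value at the best solution (assumed field name)
```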
The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed MantaID, a data-driven, machine-learning-based approach that automates identifying IDs on a large scale. The MantaID model's prediction accuracy was shown to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. MantaID supports the discovery and exploitation of ID patterns from a large number of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for MantaID to improve applicability. To our knowledge, MantaID is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.
Offers a flexible and user-friendly interface for visualizing conditional effects from a broad range of regression models, including mixed-effects and generalized additive (mixed) models. Compatible model types include lm(), rlm(), glm(), glm.nb(), and gam() (from 'mgcv'); nonlinear models via nls(); and generalized least squares via gls(). Mixed-effects models with random intercepts and/or slopes can be fitted using lmer(), glmer(), glmer.nb(), glmmTMB(), or gam() (from 'mgcv', via smooth terms). Plots are rendered using base R graphics with extensive customization options. Approximate confidence intervals for nls() models are computed using the delta method. Robust standard errors for rlm() are computed using the sandwich estimator (Zeileis 2004) <doi:10.18637/jss.v011.i10>. Methods for generalized additive models follow Wood (2017) <doi:10.1201/9781315370279>. For linear mixed-effects models with 'lme4', see Bates et al. (2015) <doi:10.18637/jss.v067.i01>. For mixed models using 'glmmTMB', see Brooks et al. (2017) <doi:10.32614/RJ-2017-066>.