Sequencing and microarray samples are often collected or processed in multiple batches or at different times. This can introduce technical biases that lead to incorrect results in downstream analyses. BatchQC is a software tool that streamlines batch preprocessing and evaluation by providing interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts the data. BatchQC diagnostics help determine whether batch adjustment is needed, and how correction should be applied before proceeding with downstream analysis. Moreover, BatchQC interactively applies multiple common batch correction approaches to the data, so the user can quickly see the benefits of each method. BatchQC is developed as a Shiny app. The output is organized into multiple tabs, each featuring an important part of the batch effect analysis and visualization of the data. The BatchQC interface has the following analysis groups: Summary, Differential Expression, Median Correlations, Heatmaps, Circular Dendrogram, PCA Analysis, Shape, ComBat and SVA.
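A minimal sketch of launching the interactive app on a simulated count matrix; the entry point has differed across BatchQC versions, so the batchQC() call and its arguments here are assumptions rather than a fixed API.

```r
library(BatchQC)
set.seed(1)
nbatch <- 3; nperbatch <- 10
batch <- rep(seq_len(nbatch), each = nperbatch)
condition <- rep(c(1, 2), nbatch * nperbatch / 2)
# simulated counts: 1000 features x 30 samples
counts <- matrix(rnbinom(1000 * nbatch * nperbatch, size = 10, mu = 100),
                 nrow = 1000)
batchQC(counts, batch = batch, condition = condition)  # launches the Shiny app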
This package provides a nonvisual procedure for screening time series for nonstationarity in the context of intensive longitudinal designs, such as ecological momentary assessments. The method combines two diagnostics: one for detecting trends (based on the split R-hat statistic from Bayesian convergence diagnostics) and one for detecting changes in variance (a novel extension inspired by Levene's test). This approach allows researchers to efficiently and reproducibly detect violations of the stationarity assumption, especially when visual inspection of many individual time series is impractical. The procedure is suitable for use in all areas of research where time series analysis is central. For a detailed description of the method and its validation through simulations and empirical application, see Zitzmann, S., Lindner, C., Lohmann, J. F., & Hecht, M. (2024) "A Novel Nonvisual Procedure for Screening for Nonstationarity in Time Series as Obtained from Intensive Longitudinal Designs" <https://www.researchgate.net/publication/384354932_A_Novel_Nonvisual_Procedure_for_Screening_for_Nonstationarity_in_Time_Series_as_Obtained_from_Intensive_Longitudinal_Designs>.
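A generic base-R illustration of the two diagnostics described above (a split R-hat for trend and a Levene-type comparison for variance change); this is not the package's own API, whose function names are not given in the description.

```r
split_rhat <- function(x) {
  # split the series in half; a trend inflates between-half variance
  n <- length(x) %/% 2
  halves <- list(x[1:n], x[(n + 1):(2 * n)])
  w <- mean(sapply(halves, var))           # within-half variance
  b <- n * var(sapply(halves, mean))       # between-half variance
  sqrt(((n - 1) / n * w + b / n) / w)      # close to 1 if no trend
}
levene_halves <- function(x) {
  # Levene-type check: compare absolute deviations across the two halves
  n <- length(x) %/% 2
  g <- factor(rep(1:2, each = n))
  z <- abs(x[1:(2 * n)] - ave(x[1:(2 * n)], g))  # deviations from group means
  anova(lm(z ~ g))$`Pr(>F)`[1]                   # small p: variance change
}
y <- rnorm(200) + seq(0, 1, length.out = 200)    # series with a linear trend
split_rhat(y); levene_halves(y)
```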
This package provides a comprehensive collection of functions for conducting meta-analyses in R. The package includes functions to calculate various effect sizes or outcome measures, fit fixed-, random-, and mixed-effects models to such data, carry out moderator and meta-regression analyses, and create various types of meta-analytical plots (e.g., forest, funnel, radial, L'Abbe, Baujat, GOSH plots). For meta-analyses of binomial and person-time data, the package also provides functions that implement specialized methods, including the Mantel-Haenszel method, Peto's method, and a variety of suitable generalized linear (mixed-effects) models (i.e. mixed-effects logistic and Poisson regression models). Finally, the package provides functionality for fitting meta-analytic multivariate/multilevel models that account for non-independent sampling errors and/or true effects (e.g. due to the inclusion of multiple treatment studies, multiple endpoints, or other forms of clustering). Network meta-analyses and meta-analyses accounting for known correlation structures (e.g. due to phylogenetic relatedness) can also be conducted.
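A minimal sketch of the typical workflow, using the BCG vaccine dataset bundled with the package:

```r
library(metafor)
# compute log risk ratios and sampling variances from 2x2 tables
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg)
res <- rma(yi, vi, data = dat, method = "REML")  # random-effects model
summary(res)
forest(res)   # forest plot of study and pooled estimates
funnel(res)   # funnel plot for small-study effects
```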
Intervention analysis is used to investigate structural changes in data resulting from external events. Traditional time series intervention models, viz. the Autoregressive Integrated Moving Average model with exogenous variables (ARIMA-X) and Artificial Neural Networks with exogenous variables (ANN-X), rely on linear intervention functions such as step or ramp functions, or their combinations. In this package, the Gompertz, Logistic, Monomolecular, Richard and Hoerl functions are used as non-linear intervention functions. The equations of these models are: Gompertz: A * exp(-B * exp(-k * t)); Logistic: K / (1 + ((K - N0) / N0) * exp(-r * t)); Monomolecular: A * exp(-k * t); Richard: A + (K - A) / (1 + exp(-B * (C - t)))^(1/beta); and Hoerl: a*(b^t)*(t^c). This package introduces an algorithm for time series intervention analysis employing ARIMA and ANN models with a non-linear intervention function. It has been developed using the algorithms of Yeasin et al. <doi:10.1016/j.hazadv.2023.100325> and Paul and Yeasin <doi:10.1371/journal.pone.0272999>.
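A generic illustration of an ARIMA-X fit with a nonlinear (Gompertz) intervention regressor, using the forecast package; the description does not name this package's own functions, so none are assumed here.

```r
library(forecast)
set.seed(1)
t  <- 1:120; t0 <- 60                                    # intervention at t = 60
z  <- ifelse(t < t0, 0,
             1.5 * exp(-4 * exp(-0.1 * (t - t0))))       # Gompertz intervention
y  <- ts(arima.sim(list(ar = 0.5), n = 120) + z)         # AR(1) noise + effect
fit <- auto.arima(y, xreg = z)                           # ARIMA-X fit
summary(fit)
```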
Alternative splicing produces a variety of different protein products from a given gene. VALERIE enables visualisation of alternative splicing events from high-throughput single-cell RNA-sequencing experiments. VALERIE computes percent spliced-in (PSI) values for user-specified genomic coordinates corresponding to alternative splicing events. PSI is the proportion of sequencing reads supporting the included exon/intron, as defined by Shiozawa (2018) <doi:10.1038/s41467-018-06063-x>. PSI values are inferred from sequencing read data using specialised infrastructure for representing and computing annotated genomic ranges by Lawrence (2013) <doi:10.1371/journal.pcbi.1003118>. Computed PSI values for each single cell are subsequently presented in the form of a heatmap implemented using the pheatmap package by Kolde (2010) <https://CRAN.R-project.org/package=pheatmap>. A broad overview of the mean PSI difference and associated p-values across different user-defined groups of single cells is presented in the form of a line graph using the ggplot2 package by Wickham (2007) <https://CRAN.R-project.org/package=ggplot2>.
Applies the family of Bayesian Expectation-Maximization-Maximization (BEMM) algorithms to estimate: (1) the three parameter logistic (3PL) model proposed by Birnbaum (1968, ISBN:9780201043105); (2) the four parameter logistic (4PL) model proposed by Barton & Lord (1981) <doi:10.1002/j.2333-8504.1981.tb01255.x>; (3) the one parameter logistic guessing (1PLG) and (4) the one parameter logistic ability-based guessing (1PLAG) models proposed by San Martín et al. (2006) <doi:10.1177/0146621605282773>. The BEMM family includes (1) the BEMM algorithm for the 3PL model proposed by Guo & Zheng (2019) <doi:10.3389/fpsyg.2019.01175>; (2) the BEMM algorithm for the 1PLG model and (3) the BEMM algorithm for the 1PLAG model proposed by Guo, Wu, Zheng, & Chen (2021) <doi:10.1177/0146621621990761>; (4) the BEMM algorithm for the 4PL model proposed by Zheng, Guo, & Kern (2021) <doi:10.1177/21582440211052556>; and (5) their maximum likelihood estimation versions proposed by Zheng, Meng, Guo, & Liu (2018) <doi:10.3389/fpsyg.2017.02302>. Thus, both Bayesian modal estimates and maximum likelihood estimates are available.
The robustness of many statistical techniques applied in the social sciences, such as factor analysis, rests upon the assumption of item-level normality. However, when dealing with real data, these assumptions are often not met. The Box-Cox transformation (Box & Cox, 1964) <http://www.jstor.org/stable/2984418> provides an optimal transformation for non-normal variables. Yet, for large datasets of continuous variables, its application in current software programs is cumbersome, with analysts having to take several steps to normalise each variable. The R package normalr enables researchers to make convenient optimal transformations of multiple variables in datasets. It enables users to quickly and accurately: (1) anchor all of their variables at 1.00, (2) select the desired precision with which the optimal lambda is estimated, (3) apply each unique exponent to its variable, (4) rescale resultant values to within their original X(1) and X(n) ranges, and (5) provide original and transformed estimates of skewness, kurtosis, and other inferential assessments of normality.
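A generic illustration of estimating the optimal Box-Cox lambda with MASS::boxcox on a variable anchored at 1.00; normalr's own wrapper functions are not assumed here.

```r
library(MASS)
set.seed(1)
x  <- rexp(500) + 1                      # anchored at >= 1
bc <- boxcox(lm(x ~ 1), lambda = seq(-2, 2, 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]          # lambda maximizing the profile likelihood
x_t <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
lambda
```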
Models spatial dependencies in dependent variables, extending traditional spatial regression approaches. It allows for the joint modeling of both the mean and the variance of the dependent variable, incorporating semiparametric effects in both models. Based on generalized additive models (GAM), the package enables the inclusion of non-parametric terms while maintaining the classical theoretical framework of spatial regression. Additionally, it implements the Generalized Spatial Autoregression (GSAR) model, which extends classical methods like the logistic, probit and Poisson Spatial Autoregressive (SAR) models, offering greater flexibility in modeling spatial dependencies and significantly improving computational efficiency and the statistical properties of the estimators. Related work includes: a) J.D. Toloza-Delgado, Melo O.O., Cruz N.A. (2024). "Joint spatial modeling of mean and non-homogeneous variance combining semiparametric SAR and GAMLSS models for hedonic prices". <doi:10.1016/j.spasta.2024.100864>. b) Cruz, N. A., Toloza-Delgado, J. D., Melo, O. O. (2024). "Generalized spatial autoregressive model". <doi:10.48550/arXiv.2412.00945>.
GOfuncR performs a gene ontology enrichment analysis based on the ontology enrichment software FUNC. GO annotations are obtained from OrganismDb or OrgDb packages (Homo.sapiens by default); the GO graph is included in the package and updated regularly. GOfuncR provides the standard candidate vs background enrichment analysis using the hypergeometric test, as well as three additional tests: the Wilcoxon rank-sum test, used when genes are ranked; a binomial test, used when genes are associated with two counts; and a chi-square or Fisher's exact test, used when genes are associated with four counts.
To correct for multiple testing and interdependency of the tests, family-wise error rates are computed based on random permutations of the gene-associated variables. GOfuncR also provides tools for exploring the ontology graph and the annotations, and options to take gene-length or spatial clustering of genes into account. It is also possible to provide custom gene coordinates, annotations and ontologies.
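A minimal sketch of the standard candidate-vs-background hypergeometric analysis; go_enrich() is GOfuncR's documented entry point, but treat the exact argument details as version-dependent.

```r
library(GOfuncR)
# input: gene symbols with a 1/0 indicator for candidate vs background
genes <- data.frame(gene_ids = c("NCAPG", "APOL4", "NGFR", "NXPH4",
                                 "C21orf59", "CACNG2", "AGTR1"),
                    is_candidate = c(1, 1, 1, 1, 0, 0, 0))
res <- go_enrich(genes, n_randsets = 100)  # FWERs from 100 random permutations
head(res$results)
```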
Simulate survival times from standard parametric survival distributions (exponential, Weibull, Gompertz), 2-component mixture distributions, or a user-defined hazard, log hazard, cumulative hazard, or log cumulative hazard function. Baseline covariates can be included under a proportional hazards assumption. Time-dependent effects (i.e. non-proportional hazards) can be included by interacting covariates with linear time or a user-defined function of time. Clustered event times are also accommodated. The 2-component mixture distributions allow for a variety of flexible baseline hazard functions reflecting those seen in practice. When a user-defined hazard or log hazard function is provided, the resulting cumulative hazard function does not need to have a closed-form solution. For details see the supporting paper <doi:10.18637/jss.v097.i03>. Note that this package is modelled on the survsim package available in the Stata software (see Crowther and Lambert (2012) <https://www.stata-journal.com/sjpdf.html?articlenum=st0275> or Crowther and Lambert (2013) <doi:10.1002/sim.5823>).
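A minimal sketch following the interface described in the supporting JSS paper: Weibull baseline hazard, one binary treatment covariate under proportional hazards, and administrative censoring at 5 years.

```r
library(simsurv)
set.seed(1)
covs <- data.frame(id = 1:200, trt = rbinom(200, 1, 0.5))
dat <- simsurv(dist = "weibull", lambdas = 0.1, gammas = 1.5,
               betas = c(trt = -0.5),   # log hazard ratio for treatment
               x = covs, maxt = 5)      # censor at t = 5
head(dat)  # columns: id, eventtime, status
```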
This package provides functions to produce accessible HTML slides, HTML, Word and PDF documents from input R markdown files. Accessible PDF files are produced only on a Windows operating system. One aspect of accessibility is providing a headings structure that is recognised by a screen reader, providing a navigational tool for a blind or partially-sighted person. A key aim is to produce documents of different formats easily from each of a collection of R markdown source files. Input R markdown files are rendered using the render() function from the rmarkdown package <https://cran.r-project.org/package=rmarkdown>. A zip file containing multiple output files can be produced from one function call. A user-supplied template Word document can be used to determine the formatting of an output Word document. Accessible PDF files are produced from Word documents using OfficeToPDF <https://github.com/cognidox/OfficeToPDF>. A convenience function, install_otp(), is provided to install this software. The option to print HTML output to (non-accessible) PDF files is also available.
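A minimal sketch; install_otp() is named in the description, while rmd2word() and rmd2html() follow the package's documented naming pattern — treat their exact arguments as assumptions.

```r
library(accessr)
install_otp()        # one-off: installs OfficeToPDF (used on Windows only)
rmd2word("report")   # report.Rmd -> accessible Word document (and PDF on Windows)
rmd2html("report")   # report.Rmd -> accessible HTML document
```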
Three games: proton, frequon and regression. Each one is a console-based data-crunching game for younger and older data scientists. Act as a data hacker and find Slawomir Pietraszko's credentials to the Proton server. In proton you have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basic knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. In frequon you will help to perform a statistical cryptanalytic attack on a corpus of ciphered messages. This time seven sub-tasks push the bar much higher. Do you accept the challenge? In regression you will test your modeling skills in a series of eight sub-tasks. Try only if ANOVA is your close friend. It's part of the Beta and Bit project. You will find more about the Beta and Bit project at <https://github.com/BetaAndBit/Charts>.
Developed to help researchers who need to model the kinetics of carbon dioxide (CO2) production in the alcoholic fermentation of wines, beers and other fermented products. The following models are available for modeling the carbon dioxide production curve as a function of time: 5PL, Gompertz and 4PL. The package's functions can: model the data obtained in the fermentation and return the coefficients; analyze the model fit and return different statistical metrics; and calculate the kinetic parameters, namely maximum production of carbon dioxide, maximum rate of production of carbon dioxide, moment at which the maximum fermentation rate occurs, duration of the latency phase for carbon dioxide production, and carbon dioxide produced until the maximum fermentation rate occurs. In addition, a function that generates graphs with the observed and predicted data from the models, isolated and combined, is available. Gava, A., Borsato, D., & Ficagna, E. (2020). "Effect of mixture of fining agents on the fermentation kinetics of base wine for sparkling wine production: Use of methodology for modeling". <doi:10.1016/j.lwt.2020.109660>.
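An illustrative Gompertz fit with base nls(); the description does not name the package's own fitting functions, so none are assumed here. The simulated data and starting values are arbitrary.

```r
set.seed(1)
t   <- seq(0, 200, by = 5)                                 # time (h)
co2 <- 80 * exp(-5 * exp(-0.05 * t)) + rnorm(length(t), sd = 2)
fit <- nls(co2 ~ A * exp(-B * exp(-k * t)),                # Gompertz model
           start = list(A = 70, B = 4, k = 0.04))
coef(fit)                              # A: max CO2; k: rate constant
log(coef(fit)["B"]) / coef(fit)["k"]   # inflection: time of max production rate
```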
PaIRKAT is a model framework for assessing statistical relationships between networks of metabolites (pathways) and an outcome of interest (phenotype). PaIRKAT queries the KEGG database to determine interactions between metabolites, from which network connectivity is constructed. This model framework improves testing power on high dimensional data by including graph topology in the kernel machine regression setting. Studies on high dimensional data can struggle to include the complex relationships between variables. The semi-parametric kernel machine regression model is a powerful tool for capturing these types of relationships, providing a framework for testing for relationships between outcomes of interest and high dimensional data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known biological connections between high dimensional variables by representing them as edges of ‘graphs’ or ‘networks’. It is common for nodes (e.g. metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power whether or not the graph information is included. We include a graph regularization or ‘smoothing’ approach for managing this issue.
Linnorm is an R package for the analysis of RNA-seq, scRNA-seq, ChIP-seq count data or any large-scale count data. It transforms such datasets for parametric tests. In addition to the transformation function (Linnorm), the following pipelines are implemented:
library size/batch effect normalization (Linnorm.Norm);
cell subpopulation analysis and visualization using t-SNE or PCA, with K-means clustering or hierarchical clustering (Linnorm.tSNE, Linnorm.PCA, Linnorm.HClust);
differential expression analysis or differential peak detection using limma (Linnorm.limma);
highly variable gene discovery and visualization (Linnorm.HVar);
gene correlation network analysis and visualization (Linnorm.Cor);
stable gene selection for scRNA-seq data, for users without spike-in genes or who do not want to rely on them (Linnorm.SGenes);
and data imputation (Linnorm.DataImput).
Linnorm can work with raw counts, CPM, RPKM, FPKM and TPM. Additionally, the RnaXSim function is included for simulating RNA-seq data for the evaluation of DEG analysis methods.
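A minimal sketch of the two core calls; the LIHC example dataset is assumed to be the one bundled with the package, as in its vignette.

```r
library(Linnorm)
data(LIHC)                          # example RNA-seq expression matrix
transformed <- Linnorm(LIHC)        # transformation for parametric tests
normalized  <- Linnorm.Norm(LIHC)   # library size/batch effect normalization
```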
This package provides a fast and flexible general-purpose implementation of Particle Swarm Optimization (PSO) and Differential Evolution (DE) for solving global minimization problems. It is designed to handle complex optimization tasks with nonlinear, non-differentiable, and multi-modal objective functions defined by users. Five PSO variants are included: Particle Swarm Optimization (PSO, Eberhart & Kennedy, 1995) <doi:10.1109/MHS.1995.494215>, Quantum-behaved Particle Swarm Optimization (QPSO, Sun et al., 2004) <doi:10.1109/CEC.2004.1330875>, Locally convergent rotationally invariant particle swarm optimization (LcRiPSO, Bonyadi & Michalewicz, 2014) <doi:10.1007/s11721-014-0095-1>, Competitive Swarm Optimizer (CSO, Cheng & Jin, 2015) <doi:10.1109/TCYB.2014.2322602> and Double exponential particle swarm optimization (DExPSO, Stehlik et al., 2024) <doi:10.1016/j.asoc.2024.111913>. For the DE algorithm, the six types in Storn, R. & Price, K. (1997) <doi:10.1023/A:1008202821328> are included: DE/rand/1, DE/rand/2, DE/best/1, DE/best/2, DE/rand-to-best/1 and DE/rand-to-best/2.
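A compact base-R sketch of the canonical PSO update rules (Eberhart & Kennedy, 1995), independent of this package's API, to show what the variants above build on:

```r
pso_min <- function(f, lower, upper, n = 40, iters = 200,
                    w = 0.7, c1 = 1.5, c2 = 1.5) {
  d <- length(lower)
  x <- sapply(1:d, function(j) runif(n, lower[j], upper[j]))  # positions
  v <- matrix(0, n, d)                                        # velocities
  p <- x; pf <- apply(x, 1, f)                                # personal bests
  g <- p[which.min(pf), ]                                     # global best
  for (it in 1:iters) {
    r1 <- matrix(runif(n * d), n, d); r2 <- matrix(runif(n * d), n, d)
    v <- w * v + c1 * r1 * (p - x) + c2 * r2 * sweep(-x, 2, g, "+")
    x <- pmin(pmax(x + v, matrix(lower, n, d, byrow = TRUE)),
              matrix(upper, n, d, byrow = TRUE))               # clamp to bounds
    fx <- apply(x, 1, f)
    imp <- fx < pf; p[imp, ] <- x[imp, ]; pf[imp] <- fx[imp]
    g <- p[which.min(pf), ]
  }
  list(par = g, value = min(pf))
}
# minimize the multi-modal Rastrigin function on [-5.12, 5.12]^2
rastrigin <- function(z) 10 * length(z) + sum(z^2 - 10 * cos(2 * pi * z))
pso_min(rastrigin, lower = c(-5.12, -5.12), upper = c(5.12, 5.12))
```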
The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed MantaID, a data-driven, machine-learning-based approach that automates identifying IDs on a large scale. The MantaID model's prediction accuracy was proven to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. MantaID supports the discovery and exploitation of ID patterns from large numbers of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for MantaID to improve applicability. To our knowledge, MantaID is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.
We provide a toolbox to fit a continuous-time fractionally integrated ARMA process (CARFIMA) on univariate and irregularly spaced time series data via both frequentist and Bayesian machinery. A general-order CARFIMA(p, H, q) model for p>q is specified in Tsai and Chan (2005) <doi:10.1111/j.1467-9868.2005.00522.x> and it involves p+q+2 unknown model parameters, i.e., p AR parameters, q MA parameters, Hurst parameter H, and process uncertainty (standard deviation) sigma. Also, the model can account for heteroscedastic measurement errors, if the information about measurement error standard deviations is known. The package produces their maximum likelihood estimates and asymptotic uncertainties using a global optimizer called the differential evolution algorithm. It also produces posterior samples of the model parameters via Metropolis-Hastings within a Gibbs sampler equipped with adaptive Markov chain Monte Carlo. These fitting procedures, however, may produce numerical errors if p>2. The toolbox also contains a function to simulate discrete time series data from CARFIMA(p, H, q) process given the model parameters and observation times.
The remit of the European Clinical Trials Database (EudraCT <https://eudract.ema.europa.eu/>), or ClinicalTrials.gov <https://clinicaltrials.gov/>, is to provide open access to summaries of all registered clinical trial results, aiming to prevent the non-reporting of negative results and to provide open access to results that inform future research. The amount of information required and the format of the results, however, impose a large extra workload at the end of studies on clinical trial units. In particular, the adverse-event-reporting component requires entering each unique combination of treatment group and safety event, and, for every such event, a further four pieces of information (body system, number of occurrences, number of subjects, number exposed) for non-serious events, plus an extra three pieces of data for serious adverse events (numbers of causally related events, deaths, causally related deaths). This package prepares the required statistics needed by EudraCT and formats them into the precise requirements to directly upload an XML file into the web portal, with no further data entry by hand.
Fast implementations to compute the genetic covariance matrix, the Jaccard similarity matrix, the s-matrix (the weighted Jaccard similarity matrix), and the (classic or robust) genomic relationship matrix of a (dense or sparse) input matrix (see Hahn, Lutz, Hecker, Prokopenko, Cho, Silverman, Weiss, and Lange (2020) <doi:10.1002/gepi.22356>). Sparse matrices from the R package Matrix are fully supported. Additionally, an implementation of the power method (von Mises iteration) to compute the largest eigenvector of a matrix is included, as well as a function to perform an automated full run of global and local correlations in population stratification data, a function to compute sliding windows, and a function to invert minor alleles and to select those variants/loci exceeding a minimal cutoff value. New functionality in locStra allows one to extract the k leading eigenvectors of the genetic covariance matrix, Jaccard similarity matrix, s-matrix, and genomic relationship matrix via fast PCA without actually computing the similarity matrices. The fast PCA to compute the k leading eigenvectors can also be run directly from bed/bim/fam files.
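A toy illustration of the (unweighted) Jaccard similarity between subjects on a sparse 0/1 variant matrix, using only the Matrix package; locStra's optimized functions are the ones to use in practice.

```r
library(Matrix)
set.seed(1)
# 100 variants x 6 subjects, sparse 0/1 genotype indicator matrix
X <- Matrix(rbinom(100 * 6, 1, 0.2), nrow = 100, sparse = TRUE)
inter <- as.matrix(crossprod(X))             # pairwise co-occurrence counts
s     <- colSums(X)
J     <- inter / (outer(s, s, "+") - inter)  # |A & B| / |A | B| (union > 0 assumed)
round(J, 2)
```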
This package provides tools for statistical analysis using the binscatter methods developed by Cattaneo, Crump, Farrell and Feng (2024a) <doi:10.48550/arXiv.1902.09608>, Cattaneo, Crump, Farrell and Feng (2024b) <https://nppackages.github.io/references/Cattaneo-Crump-Farrell-Feng_2024_NonlinearBinscatter.pdf> and Cattaneo, Crump, Farrell and Feng (2024c) <doi:10.48550/arXiv.1902.09615>. Binscatter provides a flexible way of describing the relationship between two variables based on partitioning/binning of the independent variable of interest. binsreg(), binsqreg() and binsglm() implement binscatter least squares regression, quantile regression and generalized linear regression respectively, with a particular focus on constructing binned scatter plots. They also implement robust (pointwise and uniform) inference of regression functions and derivatives thereof. binstest() implements hypothesis testing procedures for parametric functional forms of, and nonparametric shape restrictions on, the regression function. binspwc() implements hypothesis testing procedures for pairwise group comparison of binscatter estimators. binsregselect() implements data-driven procedures for selecting the number of bins for binscatter estimation. All the commands allow for covariate adjustment, smoothness restrictions and clustering.
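A minimal sketch on simulated data; the assumption that the fitted object exposes its plot as bins_plot follows the binsreg documentation and may vary across versions.

```r
library(binsreg)
set.seed(42)
d <- data.frame(x = runif(1000))
d$y <- sin(2 * pi * d$x) + rnorm(1000, sd = 0.5)
est <- binsreg(y, x, data = d)  # number of bins chosen in a data-driven way
est$bins_plot                   # ggplot object with the binned scatter plot
```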
Gene sets are fundamental for gene enrichment analysis. The package geneset enables querying gene sets from public databases including GO (Gene Ontology Consortium (2004) <doi:10.1093/nar/gkh036>), KEGG (Minoru et al. (2000) <doi:10.1093/nar/28.1.27>), WikiPathway (Marvin et al. (2020) <doi:10.1093/nar/gkaa1024>), MsigDb (Arthur et al. (2015) <doi:10.1016/j.cels.2015.12.004>), Reactome (David et al. (2011) <doi:10.1093/nar/gkq1018>), MeSH (Ish et al. (2014) <doi:10.4103/0019-5413.139827>), DisGeNET (Janet et al. (2017) <doi:10.1093/nar/gkw943>), Disease Ontology (Lynn et al. (2011) <doi:10.1093/nar/gkr972>), Network of Cancer Genes (Dimitra et al. (2019) <doi:10.1186/s13059-018-1612-0>) and COVID-19 (Maxim et al. (2020) <doi:10.21203/rs.3.rs-28582/v1>). Gene sets are stored in a list object that provides the data frames geneset and geneset_name. The geneset data frame has two columns, term ID and gene ID; geneset_name has two columns, term ID and term description.
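A minimal sketch; getGO() follows the geneset documentation's naming pattern, but treat the function and its argument values as assumptions.

```r
library(geneset)
gs <- getGO(org = "human", ont = "bp")  # GO biological process gene sets
head(gs$geneset)        # columns: term ID, gene ID
head(gs$geneset_name)   # columns: term ID, term description
```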
It provides the cumulative distribution function (CDF), quantiles, p-values, a statistical power calculator and a random number generator for a collection of group-testing procedures, including the Higher Criticism tests, the one-sided Kolmogorov-Smirnov tests, the one-sided Berk-Jones tests, the one-sided phi-divergence tests, etc. The input is a group of p-values. The null hypothesis is that they are i.i.d. Uniform(0,1). In the context of signal detection, the null hypothesis means no signals. In the context of goodness-of-fit testing, which contrasts a group of i.i.d. random variables against a given continuous distribution, the input p-values can be obtained by the CDF transformation; the null hypothesis then means that these random variables follow the given distribution. For reference, see [1] Hong Zhang, Jiashun Jin and Zheyang Wu. "Distributions and power of optimal signal-detection statistics in finite case", IEEE Transactions on Signal Processing (2020) 68, 1021-1033; [2] Hong Zhang and Zheyang Wu. "The general goodness-of-fit tests for correlated data", Computational Statistics & Data Analysis (2022) 167, 107379.
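An illustrative Higher Criticism statistic in base R, in the spirit of Donoho & Jin (2004); this is a generic sketch, not this package's implementation.

```r
hc_stat <- function(p) {
  n  <- length(p)
  ps <- sort(p)
  i  <- seq_len(n)
  hc <- sqrt(n) * (i / n - ps) / sqrt(ps * (1 - ps))
  max(hc[ps >= 1 / n & ps <= 1 - 1 / n])  # trim extremes for stability
}
set.seed(1)
hc_stat(runif(100))                        # null: i.i.d. Uniform(0,1)
hc_stat(c(rbeta(10, 0.2, 1), runif(90)))   # a few small p-values (signals)
```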
Power analysis and sample size calculation for Welch and Hsu (Hedderich and Sachs (2018), ISBN:978-3-662-56657-2) t-tests, including Monte-Carlo simulations of empirical power and type-I error. Power and sample size calculation for Wilcoxon rank sum and signed rank tests via Monte-Carlo simulations. Power and sample size required for the evaluation of a diagnostic test(-system) (Flahault et al. (2005), <doi:10.1016/j.jclinepi.2004.12.009>; Dobbin and Simon (2007), <doi:10.1093/biostatistics/kxj036>) as well as for a single proportion (Fleiss et al. (2003), ISBN:978-0-471-52629-2; Piegorsch (2004), <doi:10.1016/j.csda.2003.10.002>; Thulin (2014), <doi:10.1214/14-ejs909>), comparing two negative binomial rates (Zhu and Lakkis (2014), <doi:10.1002/sim.5947>), ANCOVA (Shieh (2020), <doi:10.1007/s11336-019-09692-3>), reference ranges (Jennen-Steinmetz and Wellek (2005), <doi:10.1002/sim.2177>), multiple primary endpoints (Sozu et al. (2015), ISBN:978-3-319-22005-5), and AUC (Hanley and McNeil (1982), <doi:10.1148/radiology.143.1.7063747>).
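A minimal sketch; power.welch.t.test() and its argument names are assumed from the package's documented naming pattern and may differ across versions. The base-R call is shown for comparison.

```r
library(MKpower)
# sample size for a Welch t-test with unequal SDs (assumed interface)
power.welch.t.test(delta = 0.5, sd1 = 1, sd2 = 1.5, power = 0.8)
# classical equal-variance calculation from base R, for comparison
power.t.test(delta = 0.5, sd = 1.25, power = 0.8)
```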