This package provides tools for quickly processing and analyzing field observation data and air quality data. These tools contain functions that facilitate analysis in atmospheric chemistry (especially ozone pollution). Some time series functions are also applicable to other fields. For details, see the homepage <https://github.com/tianshu129/foqat>. Scientific references: 1. The Hydroxyl Radical (OH) Reactivity: Roger Atkinson and Janet Arey (2003) <doi:10.1021/cr0206420>. 2. Ozone Formation Potential (OFP): <http://ww2.arb.ca.gov/sites/default/files/barcu/regact/2009/mir2009/mir10.pdf>, Zhang et al. (2021) <doi:10.5194/acp-21-11053-2021>. 3. Aerosol Formation Potential (AFP): Wenjing Wu et al. (2016) <doi:10.1016/j.jes.2016.03.025>. 4. TUV model: <https://www2.acom.ucar.edu/modeling/tropospheric-ultraviolet-and-visible-tuv-radiation-model>.
Analysis and visualization of experimentally elucidated mutational signatures -- the kind of analysis and visualization in Boot et al., "In-depth characterization of the cisplatin mutational signature in human cell lines and in esophageal and liver tumors", Genome Research 2018, <doi:10.1101/gr.230219.117> and "Characterization of colibactin-associated mutational signature in an Asian oral squamous cell carcinoma and in other mucosal tumor types", Genome Research 2020 <doi:10.1101/gr.255620.119>. ICAMS stands for In-depth Characterization and Analysis of Mutational Signatures. ICAMS has functions to read in variant call files (VCFs) and to collate the corresponding catalogs of mutational spectra and to analyze and plot catalogs of mutational spectra and signatures. Handles both "counts-based" and "density-based" (i.e. representation as mutations per megabase) mutational spectra or signatures.
This package provides functions for model-based response dimension reduction. The usual dimension reduction methods in multivariate regression focus on the reduction of predictors, not responses. Response dimension reduction is theoretically founded in Yoo and Cook (2008) <doi:10.1016/j.csda.2008.07.029>. Later, three model-based response dimension reduction approaches were proposed in Yoo (2016) <doi:10.1080/02331888.2017.1410152> and Yoo (2019) <doi:10.1016/j.jkss.2019.02.001>. The method by Yoo and Cook (2008) is based on non-parametric ordinary least squares, whereas the model-based approaches use maximum likelihood estimation. For two model-based response dimension reduction methods, called principal fitted response reduction and unstructured principal fitted response reduction, chi-squared tests are provided for determining the dimension of the response subspace.
We propose a package for classification that uses the Negative Binomial distribution within Linear Discriminant Analysis (NBLDA). It is an extension of the PoiClaClu package to the Negative Binomial distribution. The classification algorithms are based on the papers Dong et al. (2016, ISSN: 1471-2105) and Witten, DM (2011, ISSN: 1932-6157) for NBLDA and PLDA, respectively. Although PLDA is a sparse algorithm and can be used for variable selection, the algorithm proposed by Dong et al. is not sparse and therefore uses all variables in the classifier. Here, we extend Dong et al.'s algorithm to the sparse case by shrinking overdispersion towards 0 (Yu et al., 2013, ISSN: 1367-4803) and the offset parameter towards 1 (as proposed by Witten DM, 2011). This version supports only the classification task.
Fast and memory-efficient computation of the partial distance correlation for vectors and matrices. Permutation-based and asymptotic hypothesis tests for zero partial distance correlation are also provided. References include: Szekely G. J. and Rizzo M. L. (2014). "Partial distance correlation with methods for dissimilarities". The Annals of Statistics, 42(6): 2382--2412. <doi:10.1214/14-AOS1255>. Shen C., Panda S. and Vogelstein J. T. (2022). "The Chi-Square Test of Distance Correlation". Journal of Computational and Graphical Statistics, 31(1): 254--262. <doi:10.1080/10618600.2021.1938585>. Szekely G. J. and Rizzo M. L. (2023). "The Energy of Data and Distance Correlation". Chapman and Hall/CRC. <ISBN:9781482242744>. Kontemeniotis N., Vargiakakis R. and Tsagris M. (2025). "On independence testing using the (partial) distance correlation". <doi:10.48550/arXiv.2506.15659>.
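As an illustration of the quantity being computed, the following is a minimal pure-Python sketch of the partial distance correlation via U-centered distance matrices, following the definitions in Szekely and Rizzo (2014). The function names are hypothetical and are not the package's API, and this naive O(n^2) version makes no attempt at the speed or memory savings the package targets.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances; each point is a sequence of floats."""
    return [[math.dist(p, q) for q in points] for p in points]

def u_center(d):
    """U-centered version of a symmetric distance matrix (needs n > 3)."""
    n = len(d)
    row = [sum(r) for r in d]
    total = sum(row)
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal stays zero by definition
                u[i][j] = (d[i][j] - row[i] / (n - 2) - row[j] / (n - 2)
                           + total / ((n - 1) * (n - 2)))
    return u

def inner(a, b):
    """Inner product of two U-centered matrices."""
    n = len(a)
    s = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n) if i != j)
    return s / (n * (n - 3))

def bc_dcor(a, b):
    """Bias-corrected distance correlation from U-centered matrices."""
    denom = math.sqrt(inner(a, a) * inner(b, b))
    return inner(a, b) / denom if denom > 0 else 0.0

def pdcor(x, y, z):
    """Partial distance correlation of x and y with z removed."""
    a, b, c = (u_center(distance_matrix(v)) for v in (x, y, z))
    rxy, rxz, ryz = bc_dcor(a, b), bc_dcor(a, c), bc_dcor(b, c)
    denom = math.sqrt(1 - rxz ** 2) * math.sqrt(1 - ryz ** 2)
    return (rxy - rxz * ryz) / denom if denom > 0 else 0.0

# Illustrative data only: six observations of three scalar variables.
x = [(0.0,), (1.0,), (2.5,), (4.0,), (7.0,), (9.0,)]
y = [(1.0,), (0.5,), (3.0,), (3.5,), (8.0,), (6.0,)]
z = [(2.0,), (9.0,), (4.0,), (1.0,), (5.0,), (3.0,)]
print(pdcor(x, y, z))
```

Each observation is passed as a tuple so that the same code handles vectors and matrix rows alike.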
It finds transcription factor (TF) high accumulation DNA zones, i.e., regions along the genome where there is a high presence of different transcription factors. Starting from a dataset containing the genomic positions of TF binding regions, the accumulation of TFs is computed for each base of the selected chromosome. Three different types of accumulation (TF, region and base accumulation) are available. When computing the accumulation at a single base, it is also possible to consider the TFs present not only at that base but also in its neighborhood, within a window of a given width. Two different methods for the search of TF high accumulation DNA zones, called "binding regions" and "overlaps", are available. In addition, some functions are provided to analyze, visualize and compare results obtained with different input parameters.
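The per-base accumulation described above can be sketched as follows. This is a naive illustration under stated assumptions, not the package's implementation: regions are hypothetical `(tf_name, start, end)` tuples with 0-based inclusive coordinates, "tf" accumulation is taken to mean the number of distinct TFs overlapping the window around a base, and "region" accumulation the number of overlapping regions.

```python
def accumulation(regions, chrom_length, window=0, kind="tf"):
    """Per-base accumulation on one chromosome.

    regions: list of (tf_name, start, end), 0-based inclusive (assumed layout).
    window:  half-width of the neighborhood considered around each base.
    kind:    "tf" counts distinct TFs, "region" counts overlapping regions.
    """
    acc = [0] * chrom_length
    for base in range(chrom_length):
        lo, hi = base - window, base + window
        # A region overlaps the neighborhood if the two intervals intersect.
        hits = [(tf, s, e) for tf, s, e in regions if s <= hi and e >= lo]
        if kind == "tf":
            acc[base] = len({tf for tf, _, _ in hits})
        else:
            acc[base] = len(hits)
    return acc

# Illustrative regions only.
regions = [("A", 2, 4), ("B", 3, 6), ("A", 4, 7)]
tf_acc = accumulation(regions, 10, window=0, kind="tf")
region_acc = accumulation(regions, 10, window=0, kind="region")
```

High accumulation zones could then be found by thresholding the resulting vector; the choice of threshold is left open here.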
Mutations that rapidly accumulate in viral genomes during a pandemic can be used to track the evolution of the virus and, accordingly, unravel the viral infection network. To this end, sequencing samples of the virus can be employed to estimate models from genomic epidemiology and may serve, for instance, to estimate the proportion of undetected infected people by uncovering cryptic transmissions, as well as to predict likely trends in the number of infected, hospitalized, dead and recovered people. VERSO is an algorithmic framework that processes variant profiles from viral samples to produce phylogenetic models of viral evolution. The approach solves a Boolean Matrix Factorization problem with phylogenetic constraints by maximizing a log-likelihood function. VERSO includes two separate and subsequent steps; in this package we provide an R implementation of VERSO STEP 1.
We present corto (Correlation Tool), a simple package to infer gene regulatory networks and visualize master regulators from gene expression data using DPI (Data Processing Inequality) and bootstrapping to recover edges. An initial step is performed to calculate all significant edges between a list of source nodes (centroids) and target genes. Then all triplets containing two centroids and one target are tested in a DPI step which removes edges. A bootstrapping process then calculates the robustness of the network, eventually re-adding edges previously removed by DPI. The algorithm has been optimized to run outside a computing cluster, using a fast correlation implementation. The package finally provides functions to calculate network enrichment analysis from RNA-Seq and ATAC-Seq signatures as described in the article by Giorgi lab (2020) <doi:10.1093/bioinformatics/btaa223>.
doseR is a next-generation sequencing package for sex chromosome dosage compensation that can be applied broadly to detect shifts in gene expression among an arbitrary number of pre-defined groups of loci. It is a differential gene expression package for count data that detects directional shifts in expression for multiple, specific subsets of genes, with broad utility in systems biology research. doseR has been prepared to manage the nature of the data and the desired set of inferences. It uses S4 classes to store count data from sequencing experiments. It contains functions to normalize and filter count data, as well as to plot and calculate statistics of count data. It also contains a framework for linear modeling of count data. The package has been tested using real and simulated data.
This package provides a set of radiative transfer models to quantitatively describe the absorption, reflectance and transmission of solar energy in vegetation, and to model remotely sensed spectral signatures of vegetation at distinct spatial scales (leaf, canopy and stand). The main principle behind ccrtm is that many radiative transfer models can form a coupled chain, that is, models that feed into each other (from leaf, to canopy, to stand, to atmosphere). It allows the simulation of spectral datasets in the solar spectrum (400-2500 nm) using leaf models such as PROSPECT5, 5b, and D, which can be coupled with canopy models such as 'FLIM', 'SAIL' and 'SAIL2'. Currently, only a simple atmospheric model ('skyl') is implemented. Jacquemoud et al. (2008) provide the most comprehensive overview of these models <doi:10.1016/j.rse.2008.01.026>.
This package provides methods to estimate serial intervals and time-varying case reproduction numbers from infectious disease outbreak data. Serial intervals measure the time between symptom onset in linked transmission pairs, while case reproduction numbers quantify how many secondary cases each infected individual generates over time. These parameters are essential for understanding transmission dynamics, evaluating control measures, and informing public health responses. The package implements the maximum likelihood framework from Vink et al. (2014) <doi:10.1093/aje/kwu209> for serial interval estimation and the retrospective method from Wallinga & Lipsitch (2007) <doi:10.1098/rspb.2006.3754> for reproduction number estimation. Originally developed for scabies transmission analysis but applicable to other infectious diseases including influenza, COVID-19, and emerging pathogens. Designed for epidemiologists, public health researchers, and infectious disease modelers working with outbreak surveillance data.
This package implements an efficient and powerful Bayesian approach for sparse high-dimensional linear regression. It uses minimal prior assumptions on the parameters through plug-in empirical Bayes estimates of hyperparameters. An efficient Parameter-Expanded Expectation-Conditional-Maximization (PX-ECM) algorithm estimates maximum a posteriori (MAP) values of regression parameters and variable selection probabilities. The PX-ECM results in a robust computationally efficient coordinate-wise optimization, which adjusts for the impact of other predictor variables. The E-step is motivated by the popular two-group approach to multiple testing. The result is a PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse high-dimensional linear regression, implemented using one-at-a-time or all-at-once type optimization. More information can be found in McLain, Zgodic, and Bondell (2022) <arXiv:2209.08139>.
This package provides functions and data to accompany the 5th edition of the book "Applied Nonparametric Statistical Methods" (4th edition: Sprent & Smeeton, 2024, ISBN:158488701X), the revisions from the 4th edition including a move from describing the output from a miscellany of statistical software packages to using R. While the output from many of the functions can also be obtained using a range of other R functions, this package provides functions in a unified setting and gives output using both p-values and confidence intervals, exemplifying the book's approach of treating p-values as a guide to statistical importance and not an end product in their own right. Please note that in creating the ANSM5 package we do not claim to have produced software that is necessarily the most computationally efficient or the most comprehensive.
Reads and writes ARFF files. ARFF (Attribute-Relation File Format) files are like CSV files, with a little bit of added meta information in a header and standardized NA values. They are quite often used for machine learning data sets and were introduced for the WEKA machine learning Java toolbox. See <https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/> for further info on ARFF and <http://www.cs.waikato.ac.nz/ml/weka/> for more info on WEKA. farff gets rid of the Java dependency that RWeka enforces, and it is at least a faster reader (for bigger files). It uses readr as the parser back-end for the data section of the ARFF file. Consistency with RWeka is tested on GitHub and Travis CI with hundreds of ARFF files from OpenML.
This package provides a set of functions for performing null hypothesis testing on samples of persistence diagrams using the theory of permutations. Currently, only two-sample testing is implemented. Inputs can be either samples of persistence diagrams themselves or vectorizations. In the former case, they are embedded in a metric space using either the Bottleneck or Wasserstein distance. In the latter case, persistence data becomes functional data and inference is performed using tools available in the fdatest package. Main reference for the interval-wise testing method: Pini A., Vantini S. (2017) "Interval-wise testing for functional data" <doi:10.1080/10485252.2017.1306627>. Main reference for inference on populations of networks: Lovato, I., Pini, A., Stamm, A., & Vantini, S. (2020) "Model-free two-sample test for network-valued data" <doi:10.1016/j.csda.2019.106896>.
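The permutation principle behind such two-sample tests can be sketched generically. The example below is a minimal illustration, not the package's method: it assumes each observation has already been reduced to a single number (a stand-in for a vectorized persistence diagram), and the names `permutation_test` and `mean_gap` are hypothetical.

```python
import random

def permutation_test(sample_x, sample_y, statistic, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for a two-sample statistic.

    Large values of `statistic` are taken as evidence against the null
    hypothesis that both samples come from the same distribution.
    """
    rng = random.Random(seed)
    observed = statistic(sample_x, sample_y)
    pooled = list(sample_x) + list(sample_y)
    n_x = len(sample_x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        if statistic(pooled[:n_x], pooled[n_x:]) >= observed:
            exceed += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return (exceed + 1) / (n_perm + 1)

def mean_gap(xs, ys):
    """Absolute difference between sample means (illustrative statistic)."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

# Illustrative scalar summaries only.
group_a = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05]
group_b = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8]
p_value = permutation_test(group_a, group_b, mean_gap)
```

With diagrams rather than scalars, the statistic would instead be built from pairwise Bottleneck or Wasserstein distances, while the relabeling loop stays the same.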
Advanced fuzzy logic based techniques are implemented to compute the similarity among different objects or items. Typically, application areas consist of transforming raw data into the corresponding advanced fuzzy logic representation and determining the similarity between two objects using advanced fuzzy similarity techniques in various fields of research, such as text classification, pattern recognition, software projects, decision-making, medical diagnosis, and market prediction. Functions are designed to compute the membership, non-membership, hesitant-membership, indeterminacy-membership, and refusal-membership for the input matrices. Furthermore, it also includes a large number of advanced fuzzy logic based similarity measure functions to compute the Intuitionistic fuzzy similarity (IFS), Pythagorean fuzzy similarity (PFS), and Spherical fuzzy similarity (SFS) between two objects or items based on their fuzzy relationships. It also includes working examples for each function with sample data sets.
This package provides functions to access drug regulatory data from public RESTful APIs including the FDA Open API and the Health Canada Drug Product Database API, retrieving real-time or historical information on drug approvals, adverse events, recalls, and product details. Additionally, the package includes a curated collection of open datasets focused on drugs, pharmaceuticals, treatments, and clinical studies. These datasets cover diverse topics such as treatment dosages, pharmacological studies, placebo effects, drug reactions, misuses of pain relievers, and vaccine effectiveness. The package supports reproducible research and teaching in pharmacology, medicine, and healthcare by integrating reliable international APIs and structured datasets from public, academic, and government sources. For more information on the APIs, see: FDA API <https://open.fda.gov/apis/> and Health Canada API <https://health-products.canada.ca/api/documentation/dpd-documentation-en.html>.
In stability selection (N Meinshausen, P Bühlmann (2010) <doi:10.1111/j.1467-9868.2010.00740.x>) and consensus clustering (S Monti et al (2003) <doi:10.1023/A:1023949509487>), resampling techniques are used to enhance the reliability of the results. In this package (B Bodinier et al (2025) <doi:10.18637/jss.v112.i05>), hyper-parameters are calibrated by maximising model stability, which is measured under the null hypothesis that all selection (or co-membership) probabilities are identical (B Bodinier et al (2023a) <doi:10.1093/jrsssc/qlad058> and B Bodinier et al (2023b) <doi:10.1093/bioinformatics/btad635>). Functions are readily implemented for the use of LASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO in stability selection, and hierarchical clustering, partitioning around medoids, K means or Gaussian mixture models in consensus clustering.
These are miscellaneous functions for working with panel data, quantiles, and printing results. For panel data, the package includes functions for making a panel data set balanced (that is, dropping individuals that have missing observations in any time period), converting id numbers to row numbers, and treating repeated cross sections as panel data under the assumption of rank invariance. For quantiles, there are functions to make distribution functions from a set of data points (this is particularly useful when a distribution function is created in several steps), to combine distribution functions based on some external weights, and to invert distribution functions. Finally, there are several other miscellaneous functions for obtaining weighted means, weighted distribution functions, and weighted quantiles; for generating summary statistics and their differences for two groups; and for adding or dropping covariates from formulas.
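To make the notion of balancing concrete, here is a minimal sketch that keeps only the individuals observed in every time period. It assumes the panel is a list of dictionaries and that a missing observation shows up as an absent row; the function name and the `id` and `period` keys are hypothetical and do not reflect the package's interface.

```python
def make_balanced(rows, id_key="id", time_key="period"):
    """Drop every individual that is not observed in all time periods."""
    periods = {r[time_key] for r in rows}
    seen = {}
    for r in rows:
        seen.setdefault(r[id_key], set()).add(r[time_key])
    # Keep an individual only if its set of periods is the full set.
    keep = {i for i, p in seen.items() if p == periods}
    return [r for r in rows if r[id_key] in keep]

# Illustrative panel: individual 2 is missing period 2.
panel = [
    {"id": 1, "period": 1, "y": 3.0},
    {"id": 1, "period": 2, "y": 3.5},
    {"id": 2, "period": 1, "y": 1.0},
    {"id": 3, "period": 1, "y": 2.0},
    {"id": 3, "period": 2, "y": 2.2},
]
balanced = make_balanced(panel)
```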
This package provides functionality to perform a likelihood-free method for estimating the parameters of complex models that results in a simulated sample from the posterior distribution of model parameters given targets. The method begins with an accept/reject approximate Bayesian computation (ABC) step applied to a sample of points from the prior distribution of model parameters. Accepted points result in model predictions that are within the initially specified tolerance intervals around the target points. The sample is iteratively updated by drawing additional points from a mixture of multivariate normal distributions and accepting points within tolerance intervals. As the algorithm proceeds, the acceptance intervals are narrowed. The algorithm returns a set of points and sampling weights that account for the adaptive sampling scheme. For more details see Rutter, Ozik, DeYoreo, and Collier (2018) <arXiv:1804.02090>.
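The initial accept/reject step can be sketched in a few lines. This is only an illustration of that first step under simple assumptions (a one-parameter toy model, a uniform prior, symmetric tolerances); the iterative resampling from normal mixtures, the narrowing of intervals, and the sampling weights are not shown, and all names are hypothetical.

```python
import random

def abc_rejection(prior_sampler, model, targets, tolerances, n_draws, seed=0):
    """Keep prior draws whose predictions fall within target +/- tolerance."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)
        predictions = model(theta, rng)
        if all(abs(p - t) <= tol
               for p, t, tol in zip(predictions, targets, tolerances)):
            accepted.append(theta)
    return accepted

# Toy setup: the model doubles its parameter, and the target is 6 +/- 1,
# so accepted parameter values should lie between 2.5 and 3.5.
accepted = abc_rejection(
    prior_sampler=lambda rng: rng.uniform(0.0, 10.0),
    model=lambda theta, rng: (2.0 * theta,),
    targets=(6.0,),
    tolerances=(1.0,),
    n_draws=2000,
)
```

The model callback receives the random generator so that stochastic simulators can be plugged in without changing the loop.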
This package contains model fitting functions for linear and non-linear adsorption kinetic and diffusion models. Adsorption kinetics is used for characterizing the rate of solute adsorption and the time necessary for the adsorption process. Adsorption kinetics offers vital information on adsorption rate, adsorbent performance in response time, and mass transfer processes. In addition, diffusion models are included in the package as solute diffusion affects the adsorption kinetic experiments. This package consists of 20 adsorption and diffusion models, including Pseudo First Order (PFO), Pseudo Second Order (PSO), Elovich, and Weber-Morris model (commonly called the intraparticle model) stated by Plazinski et al. (2009) <doi:10.1016/j.cis.2009.07.009>. This package also contains a summary function where the statistical errors of each model are ranked for a more straightforward determination of the best fit model.
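For orientation, the four named models can be written as functions of time in their common textbook (integrated, non-linear) forms. This is a sketch for illustration only: the parameter names are assumptions, the package may use different parameterizations or linearized forms, and no fitting is performed here.

```python
import math

def pfo(t, qe, k1):
    """Pseudo First Order: q_t = qe * (1 - exp(-k1 * t))."""
    return qe * (1.0 - math.exp(-k1 * t))

def pso(t, qe, k2):
    """Pseudo Second Order: q_t = k2 * qe^2 * t / (1 + k2 * qe * t)."""
    return (k2 * qe ** 2 * t) / (1.0 + k2 * qe * t)

def elovich(t, alpha, beta):
    """Elovich (integrated form): q_t = ln(1 + alpha * beta * t) / beta."""
    return math.log(1.0 + alpha * beta * t) / beta

def weber_morris(t, kid, c):
    """Weber-Morris intraparticle diffusion: q_t = kid * sqrt(t) + c."""
    return kid * math.sqrt(t) + c

# Illustrative parameter values only (not taken from any data set).
times = [0, 5, 10, 30, 60, 120]
pfo_curve = [pfo(t, qe=10.0, k1=0.05) for t in times]
pso_curve = [pso(t, qe=10.0, k2=0.01) for t in times]
```

Here `qe` is the amount adsorbed at equilibrium and `q_t` the amount adsorbed at time `t`; both PFO and PSO start at zero and approach `qe` as `t` grows.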
Calculates ratings for two-player or multi-player challenges. Methods included in the package can estimate ratings (players' strengths) and their evolution over time, and can predict the outcome of a challenge. The algorithms are based on a Bayesian approximation method and involve neither matrix inversion nor likelihood estimation. Parameters are updated sequentially, and computation does not require additional RAM to make estimation feasible. Additionally, the core of the package is written in C++, which makes computation even faster. Methods used in the package refer to Mark E. Glickman (1999) <https://www.glicko.net/research/glicko.pdf>; Mark E. Glickman (2001) <doi:10.1080/02664760120059219>; Ruby C. Weng, Chih-Jen Lin (2011) <https://www.jmlr.org/papers/volume12/weng11a/weng11a.pdf>; W. Penny, Stephen J. Roberts (1999) <doi:10.1109/IJCNN.1999.832603>.
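As an example of the sequential updates involved, the sketch below implements the single-period Glicko update from Glickman's description of the system. It is an illustration in Python rather than the package's C++ code, the function names are hypothetical, and the step that inflates the rating deviation between periods is omitted.

```python
import math

Q = math.log(10) / 400

def g(rd):
    """Attenuation factor for an opponent's rating deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)

def expected(r, r_j, rd_j):
    """Expected score against an opponent rated r_j with deviation rd_j."""
    return 1.0 / (1.0 + 10 ** (-g(rd_j) * (r - r_j) / 400.0))

def glicko_update(r, rd, games):
    """Update one player's rating and deviation after a rating period.

    games: list of (opponent_rating, opponent_rd, score), score in {0, 0.5, 1}.
    """
    d2_inv = Q ** 2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1.0 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in games
    )
    precision = 1.0 / rd ** 2 + d2_inv
    surprise = sum(g(rd_j) * (s - expected(r, r_j, rd_j))
                   for r_j, rd_j, s in games)
    return r + Q / precision * surprise, math.sqrt(1.0 / precision)

# Worked example from Glickman's note: a 1500-rated player (RD 200) beats a
# 1400 (RD 30) and loses to a 1550 (RD 100) and a 1700 (RD 300).
new_r, new_rd = glicko_update(
    1500, 200, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]
)
# The note reports a new rating of about 1464 and a new RD of about 151.4.
```

No matrix operations appear anywhere in the update, which is what keeps this family of methods cheap to run game by game.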
Salmonella enterica is a major cause of bacterial food-borne disease worldwide. Serotype identification is the most commonly used typing method to characterize Salmonella isolates. However, experimental serotyping is costly in manpower and resources. Recently, we found that the newly incorporated spacer in the clustered regularly interspaced short palindromic repeat (CRISPR) could serve as an effective marker for typing of Salmonella. It was further revealed by Li et al. (2014) <doi:10.1128/JCM.00696-14> that recognized types based on the combination of the two newly incorporated spacers in both CRISPR loci showed high accordance with serotypes. Here, we developed an R package, CSESA, to predict the serotype based on this finding. Because it is time-saving and highly accurate, we recommend predicting the serotypes of unknown Salmonella isolates with CSESA before performing traditional serotyping.
Genes that are differentially expressed between two or more experimental conditions can be detected in RNA-Seq. High biological variability may affect the discovery of these genes because it may diverge between the fixed effects. However, this variability can be accounted for by random effects. DEGRE was designed to identify differentially expressed genes considering fixed and random effects on individuals. These effects are identified beforehand in the experimental design matrix. DEGRE implements preprocessing procedures to remove near-zero gene reads from the count matrix and to normalize by RLE as published in the DESeq2 package, Love et al. (2014) <doi:10.1186/s13059-014-0550-8>. It then fits a regression for each gene using the Generalized Linear Mixed Model with the negative binomial distribution, followed by a Wald test to assess the regression coefficients.
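The RLE normalization step mentioned above is commonly computed as a median of ratios: each sample's size factor is the median, over genes, of its count divided by the gene's geometric mean across samples. The sketch below illustrates that calculation in plain Python under the usual convention of skipping genes with a zero count in any sample; it is not DEGRE's or DESeq2's code, and the function name is hypothetical.

```python
import math
from statistics import median

def rle_size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts, one per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue  # geometric mean is zero, so the ratio is undefined
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# Illustrative counts: the second sample was sequenced twice as deeply.
counts = [
    [10, 20],
    [5, 10],
    [100, 200],
    [0, 7],  # skipped: contains a zero
]
factors = rle_size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

Dividing each count by its sample's size factor puts the samples on a common scale before the per-gene regression is fitted.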