Chemical analysis of proteins based on their amino acid compositions. Amino acid compositions can be read from FASTA files and used to calculate chemical metrics including carbon oxidation state and stoichiometric hydration state, as described in Dick et al. (2020) <doi:10.5194/bg-17-6145-2020>. Other properties that can be calculated include protein length, grand average of hydropathy (GRAVY), isoelectric point (pI), molecular weight (MW), standard molal volume (V0), and metabolic costs (Akashi and Gojobori, 2002 <doi:10.1073/pnas.062526999>; Wagner, 2005 <doi:10.1093/molbev/msi126>; Zhang et al., 2018 <doi:10.1038/s41467-018-06461-1>). A database of amino acid compositions of human proteins derived from UniProt is provided.
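As an illustration of the kind of calculation described above, the following minimal sketch reads sequences from FASTA text, tallies an amino acid composition, and computes GRAVY as the mean Kyte-Doolittle hydropathy per residue. The function names (`parse_fasta`, `amino_acid_composition`, `gravy`) are hypothetical and are not the package's API; non-standard residue letters are simply ignored here, which is an assumption of this sketch.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; concatenate them.
            records[header] += line.upper()
    return records


def amino_acid_composition(sequence):
    """Count the 20 standard amino acids (other letters are ignored)."""
    counts = Counter(sequence)
    return {aa: counts.get(aa, 0) for aa in KYTE_DOOLITTLE}


def gravy(composition):
    """Grand average of hydropathy: mean hydropathy value per residue."""
    length = sum(composition.values())
    if length == 0:
        raise ValueError("composition is empty")
    total = sum(KYTE_DOOLITTLE[aa] * n for aa, n in composition.items())
    return total / length
```

Working from the composition rather than the raw sequence mirrors the description above, where metrics are derived from amino acid counts.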
Given a multivariate dataset and some knowledge about the dependencies between its features, it is customary to fit a statistical model to the features to infer parameters of interest. Such a procedure implicitly assumes that the sample is exchangeable. This package provides a flexible non-parametric test of this exchangeability assumption, allowing the user to specify the feature dependencies by hand as long as features can be grouped into disjoint independent sets. This package also allows users to test a dual hypothesis, which is, given that the sample is exchangeable, does a proposed grouping of the features into disjoint sets also produce statistically independent sets of features? See Aw, Spence and Song (2023) for the accompanying paper.
Analyzes the function calls in an R package and creates a hive plot of the calls, dividing them among functions that only make outgoing calls (sources), functions that have only incoming calls (sinks), and those that have both incoming calls and make outgoing calls (managers). Function calls can be mapped by their absolute numbers, their normalized absolute numbers, or their rank. FuncMap should be useful for comparing packages at a high level for their overall design. Plus, it's just plain fun. The hive plot concept was developed by Martin Krzywinski (www.hiveplot.com) and inspired this package. Note: this package is maintained for historical reasons. HiveR is a full package for creating hive plots.
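The source/sink/manager division described above can be sketched as a simple set computation over (caller, callee) pairs. This is an illustrative sketch only; the function name `classify_functions` and the input format are assumptions and do not reflect FuncMap's interface.

```python
def classify_functions(calls):
    """Split functions into sources, sinks, and managers.

    calls: iterable of (caller, callee) pairs.
    """
    calls = list(calls)
    callers = {caller for caller, _ in calls}
    callees = {callee for _, callee in calls}
    return {
        # Only make outgoing calls.
        "sources": sorted(callers - callees),
        # Only receive incoming calls.
        "sinks": sorted(callees - callers),
        # Both receive and make calls.
        "managers": sorted(callers & callees),
    }
```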
This is a method for Allele-specific DNA Copy Number profiling for whole-Exome sequencing data. Given the allele-specific coverage and site biases at the variant loci, this program segments the genome into regions of homogeneous allele-specific copy number. It requires, as input, the read counts for each variant allele in a pair of case and control samples, as well as the site biases. For detection of somatic mutations, the case and control samples can be the tumor and normal sample from the same individual. The implemented method is based on the paper: Chen, H., Jiang, Y., Maxwell, K., Nathanson, K. and Zhang, N. (under review). Allele-specific copy number estimation by whole Exome sequencing.
This package implements the online Bayesian inference framework for joint state and parameter estimation in a stochastic Susceptible-Exposed-Infectious-Recovered (SEIR) epidemic model with a time-varying transmission rate. The log-transmission rate is modelled as a latent Ornstein-Uhlenbeck (OU) process with exact Gaussian discrete-time transitions. Inference is performed via the nested particle filter (NPF) of Crisan and Miguez (2018) <doi:10.3150/17-BEJ954>, which maintains an outer particle layer over the OU hyperparameters and, for each outer particle, an inner bootstrap filter over epidemic states. The Cori-style renewal-equation estimator follows Cori et al. (2013) <doi:10.1093/aje/kwt133>. The package also provides utilities for simulation, posterior summarisation, and forecasting.
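The "exact Gaussian discrete-time transitions" of an OU process mentioned above can be sketched as follows. For dX = theta (mu - X) dt + sigma dW, the state after a step of length dt is Gaussian with mean mu + (x - mu) exp(-theta dt) and variance sigma^2 (1 - exp(-2 theta dt)) / (2 theta). The parameter names and both functions are assumptions made for illustration, not the package's interface.

```python
import math
import random


def ou_transition(x, theta, mu, sigma, dt):
    """Mean and variance of the exact OU transition over a step dt."""
    decay = math.exp(-theta * dt)
    mean = mu + (x - mu) * decay
    var = sigma ** 2 * (1.0 - decay ** 2) / (2.0 * theta)
    return mean, var


def simulate_log_transmission(x0, theta, mu, sigma, dt, n_steps, rng):
    """Simulate a latent log-transmission-rate path with exact OU steps."""
    path = [x0]
    for _ in range(n_steps):
        mean, var = ou_transition(path[-1], theta, mu, sigma, dt)
        path.append(rng.gauss(mean, math.sqrt(var)))
    return path
```

Because the transition is exact, the step size dt does not introduce discretisation error, unlike an Euler-Maruyama scheme.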
This package provides an exact Goodness-of-Fit test for multinomial data with fixed probabilities. It can be used to determine whether a set of counts fits a given expected ratio. To see whether a set of observed counts fits an expectation, one can examine all possible outcomes with xmulti() or a random sample of them with xmonte() and find the probability of an observation deviating from the expectation by at least as much as the observed. As a measure of deviation from the expected, one can use the log-likelihood ratio, the multinomial probability, or the classic chi-square statistic. A histogram of the test statistic can also be plotted and compared with the asymptotic curve.
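A minimal sketch of the exact test idea, using the multinomial probability as the measure of deviation: enumerate every possible outcome with the same total count and sum the probabilities of those no more likely than the observed one. The function names are hypothetical and unrelated to xmulti() or xmonte(); full enumeration is only practical for small counts and few categories.

```python
from math import factorial, prod


def multinomial_prob(counts, probs):
    """Probability of the given counts under a multinomial model."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p ** c for p, c in zip(probs, counts))


def compositions(n, k):
    """Yield all k-tuples of non-negative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def exact_multinomial_pvalue(observed, probs):
    """Sum the probabilities of outcomes no more likely than the observed one."""
    p_obs = multinomial_prob(observed, probs)
    tol = 1e-12  # guard against floating-point ties
    total = 0.0
    for outcome in compositions(sum(observed), len(observed)):
        p = multinomial_prob(outcome, probs)
        if p <= p_obs * (1.0 + tol):
            total += p
    return total
```

Swapping the comparison for one based on the log-likelihood ratio or the chi-square statistic would give the other two deviation measures named above.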
Differential expression analysis is a prevalent method utilised in the examination of diverse biological data. The reproducibility-optimized test statistic (ROTS) modifies a t-statistic based on the data's intrinsic characteristics and ranks features according to their statistical significance for differential expression between two or more groups (F-statistic). Focussing on proteomics and metabolomics, the current ROTS implementation cannot account for technical or biological covariates such as MS batches or gender differences among the samples. Consequently, we developed LimROTS, which employs a reproducibility-optimized test statistic utilising the limma methodology to simulate complex experimental designs. LimROTS is a hybrid method integrating empirical Bayes and reproducibility-optimized statistics for robust analysis of proteomics and metabolomics data.
Interactions between proteins occur in many, if not most, biological processes. Most proteins perform their functions in networks associated with other proteins and other biomolecules. This fact has motivated the development of a variety of experimental methods for the identification of protein interactions. This variety has in turn ushered in the development of numerous different computational approaches for modeling and predicting protein interactions. Sometimes an experiment is aimed at identifying proteins closely related to some interesting proteins. A network-based statistical learning method is used to infer the putative functions of proteins from the known functions of their neighboring proteins on a PPI network. This package identifies such proteins, which are often involved in the same or similar biological functions.
Highly efficient functions for estimating various rank (centrality) measures of nodes in bipartite graphs (two-mode networks). Includes methods for estimating HITS, CoHITS, BGRM, and BiRank with implementation primarily inspired by He et al. (2016) <doi:10.1109/TKDE.2016.2611584>. Also provides easy-to-use tools for efficiently estimating PageRank in one-mode graphs, incorporating or removing edge-weights during rank estimation, projecting two-mode graphs to one-mode, and for converting edgelists and matrices to sparseMatrix format. Best of all, the package's rank estimators can work directly with common formats of network data including edgelists (class data.frame, data.table, or tbl_df) and adjacency matrices (class matrix or dgCMatrix).
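For the one-mode case mentioned above, PageRank with optional edge weights can be sketched as a plain power iteration. This is an illustrative, unoptimised sketch; the function name `pagerank`, its parameters, and the (source, target, weight) edge format are assumptions and not the package's interface. Rank held by nodes with no outgoing edges is redistributed uniformly, which is one common convention.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on weighted (source, target, weight) edges."""
    edges = list(edges)
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    n_nodes = len(nodes)
    # Total outgoing weight per node, used to normalise transitions.
    out_weight = {n: 0.0 for n in nodes}
    for src, _, w in edges:
        out_weight[src] += w
    rank = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(max_iter):
        # Rank sitting on dangling nodes is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping) / n_nodes + damping * dangling / n_nodes
        new_rank = {n: base for n in nodes}
        for src, dst, w in edges:
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

Setting every weight to 1.0 corresponds to removing edge weights during rank estimation.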
Emissions are the mass of pollutants released into the atmosphere. Air quality models need emissions data, with spatial and temporal distribution, to represent air pollutant concentrations. This package, eixport, creates inputs for the air quality models WRF-Chem Grell et al. (2005) <doi:10.1016/j.atmosenv.2005.04.027>, MUNICH Kim et al. (2018) <doi:10.5194/gmd-11-611-2018>, BRAMS-SPM Freitas et al. (2005) <doi:10.1016/j.atmosenv.2005.07.017> and RLINE Snyder et al. (2013) <doi:10.1016/j.atmosenv.2013.05.074>. See the eixport website (<https://atmoschem.github.io/eixport/>) for more information, documentation and examples. More details are given in Ibarra-Espinosa et al. (2018) <doi:10.21105/joss.00607>.
Fit occupancy models in Stan via brms. The full variety of brms formula-based effects structures is available to use in multiple classes of occupancy model, including single-season models, models with data augmentation for never-observed species, dynamic (multiseason) models with explicit colonization and extinction processes, and dynamic models with autologistic occupancy dynamics. Formulas can be specified for all relevant distributional terms, including detection and one or more of occupancy, colonization, extinction, and autologistic depending on the model type. Several important forms of model post-processing are provided. References: Bürkner (2017) <doi:10.18637/jss.v080.i01>; Carpenter et al. (2017) <doi:10.18637/jss.v076.i01>; Socolar & Mills (2023) <doi:10.1101/2023.10.26.564080>.
This package implements the five-parameter Generalized Kumaraswamy ('gkw') distribution proposed by Carrasco, Ferrari and Cordeiro (2010) <doi:10.48550/arXiv.1004.0911> and its seven nested sub-families for modeling bounded continuous data on the unit interval (0,1). The gkw distribution extends the Kumaraswamy distribution described by Jones (2009) <doi:10.1016/j.stamet.2008.04.001>. Provides density, distribution, quantile, and random generation functions, along with analytical log-likelihood, gradient, and Hessian functions implemented in C++ via RcppArmadillo for computational efficiency. Suitable for modeling proportions, rates, percentages, and indices exhibiting complex features such as asymmetry, heavy tails, and other shapes not adequately captured by standard distributions like the simple Beta or Kumaraswamy.
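For orientation, the density, distribution, quantile, and random generation functions of the two-parameter Kumaraswamy distribution (the base case that the gkw family extends) have simple closed forms. The sketch below covers only that base case, not the five-parameter gkw distribution; the function names and the shape parameter names `a` and `b` are assumptions for illustration.

```python
import random


def kw_pdf(x, a, b):
    """Kumaraswamy density on (0, 1): a*b*x^(a-1)*(1-x^a)^(b-1)."""
    return a * b * x ** (a - 1.0) * (1.0 - x ** a) ** (b - 1.0)


def kw_cdf(x, a, b):
    """Kumaraswamy distribution function: 1 - (1 - x^a)^b."""
    return 1.0 - (1.0 - x ** a) ** b


def kw_quantile(u, a, b):
    """Inverse of the distribution function, in closed form."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def kw_random(n, a, b, rng):
    """Draw n variates by inverse-transform sampling."""
    return [kw_quantile(rng.random(), a, b) for _ in range(n)]
```

With a = b = 1 the distribution reduces to the uniform distribution on (0, 1), which gives a quick sanity check.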
This package provides a self-contained, static build of the HDF5 (Hierarchical Data Format 5) C library (release 2.1.1) for R package developers. Designed for use in the LinkingTo field, it enables zero-dependency integration by building the library entirely from source during installation. Additionally, it compiles and internally links a comprehensive suite of advanced compression filters and their HDF5 plugins (Zstd, LZ4, Blosc/Blosc2, Snappy, ZFP, Bzip2, LZF, Bitshuffle, szip, and gzip). These plugins are integrated out-of-the-box, allowing downstream packages to utilize high-performance compression directly through the standard HDF5 API while keeping the underlying third-party headers fully encapsulated. HDF5 is developed by The HDF Group <https://www.hdfgroup.org/>.
Instrumental variables (IVs) are a popular and powerful tool for estimating causal effects in the presence of unobserved confounding. However, classical methods rely on strong assumptions such as the exclusion criterion, which states that instrumental effects must be entirely mediated by treatments. In the so-called "leaky" IV setting, candidate instruments are allowed to have some direct influence on outcomes, rendering the average treatment effect (ATE) unidentifiable. But with limits on the amount of information leakage, we may still recover sharp bounds on the ATE, providing partial identification. This package implements methods for ATE bounding in the leaky IV setting with linear structural equations. For details, see Watson et al. (2024) <doi:10.48550/arXiv.2404.04446>.
Monte Carlo confidence intervals for free and defined parameters in models fitted in the structural equation modeling package lavaan can be generated using the semmcci package. semmcci has three main functions, namely, MC(), MCMI(), and MCStd(). The output of lavaan is passed as the first argument to the MC() function or the MCMI() function to generate Monte Carlo confidence intervals. Monte Carlo confidence intervals for the standardized estimates can also be generated by passing the output of the MC() function or the MCMI() function to the MCStd() function. A description of the package and code examples are presented in Pesigan and Cheung (2024) <doi:10.3758/s13428-023-02114-4>.
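The general Monte Carlo idea can be sketched for a defined parameter such as an indirect effect a*b: draw the free parameters from their estimated sampling distribution, compute the defined parameter for each draw, and take percentiles. The sketch below is a simplification that assumes the two estimates are independent normals (zero sampling covariance); in general the draws would come from a multivariate normal with the full sampling covariance matrix. The function name and arguments are hypothetical and unrelated to MC(), MCMI(), or MCStd().

```python
import random


def monte_carlo_ci(a_hat, se_a, b_hat, se_b, n_draws=20000, level=0.95, seed=1):
    """Percentile Monte Carlo interval for the product a*b.

    Assumes independent normal sampling distributions for a and b.
    """
    rng = random.Random(seed)
    draws = sorted(
        rng.gauss(a_hat, se_a) * rng.gauss(b_hat, se_b) for _ in range(n_draws)
    )
    alpha = 1.0 - level
    lower_idx = int((alpha / 2.0) * (n_draws - 1))
    upper_idx = int((1.0 - alpha / 2.0) * (n_draws - 1))
    return draws[lower_idx], draws[upper_idx]
```

The interval need not be symmetric around the point estimate, which is the main reason to prefer it over a normal-theory interval for products of coefficients.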
Stepwise regression is a statistical technique used for model selection. This package streamlines stepwise regression analysis by supporting multiple regression types (linear, Cox, logistic, Poisson, Gamma, and negative binomial), incorporating popular selection strategies (forward, backward, bidirectional, and subset), and offering essential metrics. It enables users to apply multiple selection strategies and metrics in a single function call, visualize variable selection processes, and export results in various formats. StepReg offers a data-splitting option to address potential issues with invalid statistical inference and a randomized forward selection option to avoid overfitting. We validated StepReg's accuracy using public datasets within the SAS software environment. For an interactive web interface, users can install the companion StepRegShiny package.
This package provides visualizations for SHAP (SHapley Additive exPlanations) such as waterfall plots, force plots, various types of importance plots, dependence plots, and interaction plots. These plots act on a shapviz object created from a matrix of SHAP values and a corresponding feature dataset. Wrappers for the R packages xgboost, lightgbm, fastshap, shapr, h2o, treeshap, DALEX, and kernelshap are added for convenience. By separating visualization and computation, it is possible to display factor variables in graphs, even if the SHAP values are calculated by a model that requires numerical features. The plots are inspired by those provided by the shap package in Python, but there is no dependency on it.
This package provides functions for pedagogical purposes in visually learning Bayesian networks and Markov chain Monte Carlo (MCMC) computations. It enables users to: a) Create and examine the (starting) graphical structure of Bayesian networks; b) Create random Bayesian networks using a dataset with customized constraints; c) Generate Stan code for structures of Bayesian networks for sampling the data and learning parameters; d) Plot the network graphs; e) Perform Markov chain Monte Carlo computations and produce graphs for posterior checks. The methods and algorithms are described in Vuong, Quan-Hoang and La, Viet-Phuong (2019), The bayesvl R package, Open Science Framework (May 18) <doi:10.31219/osf.io/w5dx6>.
Fits either generalized linear models or regressions with autoregressive moving-average (ARMA) errors for time series data. The package makes it easy to incorporate constraints on the model's coefficients. The model is specified by an objective function (Gaussian, Binomial or Poisson) or an ARMA order (p,q), a vector of bound constraints for the coefficients (e.g., beta1 > 0), and optional restrictions among coefficients (e.g., beta1 > beta2). The references for this package are the same as those of the glm() and arima() functions in the stats package. See Brockwell, P. J. and Davis, R. A. (1996, ISBN-10: 9783319298528). For the different optimizers implemented, it is recommended to consult the documentation of the corresponding packages.
This package provides R utilities to build unlevered and levered discounted cash flow (DCF) tables for commercial real estate (CRE) assets. Functions generate bullet and amortising debt schedules, compute credit metrics such as debt service coverage ratios (DSCR), debt yield ratios, and forward loan-to-value ratios (LTV), and expose an explicit property-level operating chain from gross effective income (GEI) to net operating income (NOI) and property before-tax cash flow (PBTCF). The toolkit supports end-to-end scenario execution from a YAML (YAML Ain't Markup Language) configuration file parsed with yaml, includes helpers for effective rent, constrained loan underwriting, and simplified SPV-level tax simulations, and ships reproducible vignettes for methodological and applied use cases.
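The debt schedules and credit metrics named above follow standard definitions, sketched here under simple assumptions: a constant per-period interest rate, level payments for the amortising loan, and interest-only payments with full repayment at maturity for the bullet loan. All function and field names are hypothetical and do not reflect the package's interface.

```python
def amortising_schedule(principal, rate, n_periods):
    """Level-payment schedule; returns one dict per period."""
    if rate == 0:
        payment = principal / n_periods
    else:
        payment = principal * rate / (1.0 - (1.0 + rate) ** -n_periods)
    rows, balance = [], principal
    for period in range(1, n_periods + 1):
        interest = balance * rate
        balance -= payment - interest  # principal repaid this period
        rows.append({"period": period, "payment": payment,
                     "interest": interest, "balance": balance})
    return rows


def bullet_schedule(principal, rate, n_periods):
    """Interest-only payments, with principal repaid in the final period."""
    rows = []
    for period in range(1, n_periods + 1):
        last = period == n_periods
        interest = principal * rate
        rows.append({"period": period,
                     "payment": interest + (principal if last else 0.0),
                     "interest": interest,
                     "balance": 0.0 if last else principal})
    return rows


def credit_metrics(noi, debt_service, loan_balance, property_value):
    """DSCR, debt yield, and LTV for a single period."""
    return {
        "dscr": noi / debt_service,
        "debt_yield": noi / loan_balance,
        "ltv": loan_balance / property_value,
    }
```

A forward LTV would use the projected balance and projected value at a future date rather than the figures at origination.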
Compute price indices using various Hedonic and multilateral methods, including Laspeyres, Paasche, Fisher, and HMTS (Hedonic Multilateral Time series re-estimation with splicing). The central function calculate_price_index() offers a unified interface for running these methods on structured datasets. This package is designed to support index construction workflows for real estate and other domains where quality-adjusted price comparisons over time are essential. The development of this package was funded by Eurostat and Statistics Netherlands (CBS), and carried out by Statistics Netherlands. The HMTS method implemented here is described in Ishaak, Ouwehand and Remøy (2024) <doi:10.1177/0282423X241246617>. For broader methodological context, see Eurostat (2013, ISBN:978-92-79-25984-5, <doi:10.2785/34007>).
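The three classical bilateral indices named above have standard textbook formulas, sketched below for a base period 0 and a comparison period 1. This does not cover the hedonic or HMTS methods and is unrelated to the signature of calculate_price_index(); the function names and list-based inputs are assumptions.

```python
import math


def laspeyres(p0, q0, p1):
    """Base-period quantities as weights: sum(p1*q0) / sum(p0*q0)."""
    return sum(a * q for a, q in zip(p1, q0)) / sum(a * q for a, q in zip(p0, q0))


def paasche(p0, p1, q1):
    """Current-period quantities as weights: sum(p1*q1) / sum(p0*q1)."""
    return sum(a * q for a, q in zip(p1, q1)) / sum(a * q for a, q in zip(p0, q1))


def fisher(p0, q0, p1, q1):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return math.sqrt(laspeyres(p0, q0, p1) * paasche(p0, p1, q1))
```

For example, with prices p0 = [1, 2] and p1 = [2, 2] and quantities q0 = [10, 5] and q1 = [5, 10], the Laspeyres index is 30/20 = 1.5 and the Paasche index is 30/25 = 1.2, so the Fisher index is the square root of 1.8.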
An interface to DifferentialEquations.jl <https://diffeq.sciml.ai/dev/> from the R programming language. It has unique high performance methods for solving ordinary differential equations (ODE), stochastic differential equations (SDE), delay differential equations (DDE), differential-algebraic equations (DAE), and more. Much of the functionality, including features like adaptive time stepping in SDEs, is unique and allows for multiple orders of magnitude speedup over more common methods. Supports GPUs, with support for CUDA (NVIDIA), AMD GPUs, Intel oneAPI GPUs, and Apple's Metal (M-series chip GPUs). diffeqr attaches an R interface onto the package, allowing seamless use of this tooling by R users. For more information, see Rackauckas and Nie (2017) <doi:10.5334/jors.151>.
Converts TXT and XML data curated by the United States Patent and Trademark Office (USPTO). Allows conversion of bulk data after downloading directly from the USPTO bulk data website, eliminating need for users to wrangle multiple data formats to get large patent databases in tidy, rectangular format. Data details can be found on the USPTO website <https://bulkdata.uspto.gov/>. Currently, all 3 formats: 1. TXT data (1976-2001); 2. XML format 1 data (2002-2004); and 3. XML format 2 data (2005-current) can be converted to rectangular, CSV format. Relevant literature that uses data from USPTO includes Wada (2020) <doi:10.1007/s11192-020-03674-4> and Plaza & Albert (2008) <doi:10.1007/s11192-007-1763-3>.
This package implements the Savvy Parity Regression (savvyPR) methodology for multivariate linear regression analysis. The package solves an optimization problem that balances the contribution of each predictor variable to ensure estimation stability in the presence of multicollinearity. It supports two distinct parameterization methods: a Budget-based approach that allocates a fixed loss contribution to each predictor, and a Target-based approach (t-tuning) that utilizes a relative elasticity weight for the response variable. The package provides comprehensive tools for model estimation, risk distribution analysis, and parameter tuning via cross-validation (PR1, PR2, and PR3 model types) to optimize predictive accuracy. Methods are based on Asimit, Chen, Ichim and Millossovich (2026) <https://openaccess.city.ac.uk/id/eprint/37017/>.