Fit Generalized Linear Models to continuous and count outcomes, and estimate the prevalence of misrepresentation of an important binary predictor. Misrepresentation typically arises when there is an incentive for the binary factor to be misclassified in one direction (e.g., in insurance settings where policy holders may purposely deny a risk status in order to lower the insurance premium). This is accomplished by treating a subset of the response variable as resulting from a mixture distribution. Model parameters are estimated via the Expectation Maximization algorithm and standard errors of the estimates are obtained from closed forms of the Observed Fisher Information. For an introduction to the models and the misrepresentation framework, see Xia et al. (2023) <https://variancejournal.org/article/73151-maximum-likelihood-approaches-to-misrepresentation-models-in-glm-ratemaking-model-comparisons>.
Implementation of various estimation methods for dynamic factor models (DFMs) including principal components analysis (PCA) Stock and Watson (2002) <doi:10.1198/016214502388618960>, 2Stage Giannone et al. (2008) <doi:10.1016/j.jmoneco.2008.05.010>, expectation-maximisation (EM) Banbura and Modugno (2014) <doi:10.1002/jae.2306>, and the novel EM-sparse approach for sparse DFMs Mosley et al. (2023) <arXiv:2303.11892>. Options to use classic multivariate Kalman filter and smoother (KFS) equations from Shumway and Stoffer (1982) <doi:10.1111/j.1467-9892.1982.tb00349.x> or fast univariate KFS equations from Koopman and Durbin (2000) <doi:10.1111/1467-9892.00186>, and options for independent and identically distributed (IID) white noise or auto-regressive (AR(1)) idiosyncratic errors. Algorithms coded in C++ and linked to R via RcppArmadillo.
This package provides tools to safely and efficiently organize and execute Monte Carlo simulation experiments in R. The package controls the structure and back-end of Monte Carlo simulation experiments by utilizing a generate-analyse-summarise workflow. The workflow safeguards against common simulation coding issues: it automatically re-simulates non-convergent results, prevents inadvertent overwriting of simulation files, catches error and warning messages during execution, implicitly supports parallel processing with high-quality random number generation, and provides tools for managing high-performance computing (HPC) array jobs submitted to schedulers such as SLURM. For a pedagogical introduction to the package see Sigal and Chalmers (2016) <doi:10.1080/10691898.2016.1246953>. For a more in-depth overview of the package and its design philosophy see Chalmers and Adkins (2020) <doi:10.20982/tqmp.16.4.p248>.
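The generate-analyse-summarise workflow can be illustrated with a minimal sketch. It is written in Python for illustration only and does not reproduce the package's R interface; the function names `generate`, `analyse`, `summarise`, and `run_simulation` are hypothetical.

```python
import random
import statistics

def generate(n, rng):
    """Generate step: draw one sample of size n from a standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def analyse(sample):
    """Analyse step: compute the estimators of interest for one replication."""
    return {"mean": statistics.fmean(sample), "median": statistics.median(sample)}

def summarise(results):
    """Summarise step: average each estimator over replications.
    The true location is 0, so these averages are bias estimates."""
    return {name: statistics.fmean([r[name] for r in results]) for name in results[0]}

def run_simulation(n, replications, seed):
    rng = random.Random(seed)  # seeded generator so the experiment is reproducible
    results = [analyse(generate(n, rng)) for _ in range(replications)]
    return summarise(results)

bias = run_simulation(n=30, replications=500, seed=1)
```

Keeping the three steps as separate functions is what lets a driver re-run failed replications or distribute replications across workers without touching the analysis code.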
Collection of routines for efficient scientific computations in physics and astrophysics. These routines include utility functions, numerical computation tools, as well as visualisation tools. They can be used, for example, for generating random numbers from spherical and custom distributions, information and entropy analysis, special Fourier transforms, two-point correlation estimation (e.g. as in Landy & Szalay (1993) <doi:10.1086/172900>), binning & gridding of point sets, 2D interpolation, Monte Carlo integration, vector arithmetic and coordinate transformations. Also included is a non-exhaustive list of important constants and cosmological conversion functions. The graphics routines can be used to produce and export publication-ready scientific plots and movies, e.g. as used in Obreschkow et al. (2020, MNRAS Vol 493, Issue 3, Pages 4551–4569). These routines include special color scales, projection functions, and bitmap handling routines.
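As an illustration of two-point correlation estimation, the sketch below evaluates the Landy & Szalay estimator, (DD - 2DR + RR) / RR, for a single separation bin using normalised pair counts. It is a brute-force Python sketch, not the package's routine; all function names are hypothetical.

```python
import math
from itertools import combinations, product

def pair_fraction_auto(points, r_min, r_max):
    """Fraction of distinct pairs within one set with separation in [r_min, r_max)."""
    pairs = list(combinations(points, 2))
    hits = sum(1 for p, q in pairs if r_min <= math.dist(p, q) < r_max)
    return hits / len(pairs)

def pair_fraction_cross(set_a, set_b, r_min, r_max):
    """Fraction of cross pairs (one point from each set) with separation in the bin."""
    hits = sum(1 for p, q in product(set_a, set_b) if r_min <= math.dist(p, q) < r_max)
    return hits / (len(set_a) * len(set_b))

def landy_szalay(data, randoms, r_min, r_max):
    """Landy-Szalay estimate of the two-point correlation in one separation bin."""
    dd = pair_fraction_auto(data, r_min, r_max)
    rr = pair_fraction_auto(randoms, r_min, r_max)
    dr = pair_fraction_cross(data, randoms, r_min, r_max)
    return (dd - 2.0 * dr + rr) / rr

# Tiny hand-checkable example: DD = 1, DR = 1/2, RR = 2/3.
data = [(0.0, 0.0), (1.0, 0.0)]
randoms = [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
xi = landy_szalay(data, randoms, r_min=0.5, r_max=3.5)
```

In practice the random catalogue is much larger than the data set and the pair counting uses a spatial index rather than an exhaustive loop.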
Create lipidome-wide heatmaps of statistics with the lipidomeR. The lipidomeR provides a streamlined pipeline for the systematic interpretation of the lipidome through publication-ready visualizations of regression models fitted on lipidomics data. With lipidomeR, associations between covariates and the lipidome can be interpreted systematically and intuitively through heatmaps, where lipids are categorized by the lipid class and are presented on two-dimensional maps organized by the lipid size and level of saturation. This way, the lipidomeR helps you gain an immediate understanding of the multivariate patterns in the lipidome already at first glance. You can create lipidome-wide heatmaps of statistical associations, changes, differences, variation, or other lipid-specific values. The heatmaps are provided with publication-ready quality and the results behind the visualizations are based on rigorous statistical models.
The Mutual Information Index (M) introduced to social science literature by Theil and Finizza (1971) <doi:10.1080/0022250X.1971.9989795> is a multigroup segregation measure that is highly decomposable and that according to Frankel and Volij (2011) <doi:10.1016/j.jet.2010.10.008> and Mora and Ruiz-Castillo (2011) <doi:10.1111/j.1467-9531.2011.01237.x> satisfies the Strong Unit Decomposability and Strong Group Decomposability properties. This package allows computing and decomposing the total index value into its "between" and "within" terms. These last terms can also be decomposed into their contributions, either by group or unit characteristics. The factors that produce each "within" term can also be displayed at the user's request. The results can be computed considering a variable or sets of variables that define separate clusters.
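The between/within decomposition can be sketched as follows. This is an illustrative Python sketch, not the package's interface; the function names, the example counts, and the cluster labels are all hypothetical. It uses the identity M = M_between + sum over clusters of (cluster share) x M_within.

```python
import math
from collections import defaultdict

def mutual_information(counts):
    """M index (in nats) for a dict mapping (unit, group) -> count."""
    total = sum(counts.values())
    unit_tot, group_tot = defaultdict(float), defaultdict(float)
    for (unit, group), n in counts.items():
        unit_tot[unit] += n
        group_tot[group] += n
    m = 0.0
    for (unit, group), n in counts.items():
        if n > 0:  # empty cells contribute nothing
            m += (n / total) * math.log(n * total / (unit_tot[unit] * group_tot[group]))
    return m

def decompose(counts, cluster_of):
    """Split M into a between-cluster term and a share-weighted within-cluster term."""
    total = sum(counts.values())
    between_counts = defaultdict(float)
    by_cluster = defaultdict(dict)
    for (unit, group), n in counts.items():
        cluster = cluster_of[unit]
        between_counts[(cluster, group)] += n
        by_cluster[cluster][(unit, group)] = n
    between = mutual_information(between_counts)
    within = sum(sum(sub.values()) / total * mutual_information(sub)
                 for sub in by_cluster.values())
    return between, within

# Hypothetical data: three schools (units), two groups, two districts (clusters).
counts = {("s1", "a"): 30, ("s1", "b"): 10, ("s2", "a"): 10,
          ("s2", "b"): 30, ("s3", "a"): 20, ("s3", "b"): 20}
cluster_of = {"s1": "north", "s2": "north", "s3": "south"}
total_m = mutual_information(counts)
between, within = decompose(counts, cluster_of)
```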
The biases introduced in association measures, particularly mutual information, are influenced by factors such as tumor purity, mutation burden, and hypermethylation. This package provides the estimation of conditional mutual information (CMI) and its statistical significance with a focus on its application to multi-omics data. Utilizing B-spline functions (inspired by Daub et al. (2004) <doi:10.1186/1471-2105-5-118>), the package offers tools to estimate the association between heterogeneous multi-omics data, while removing the effects of confounding factors. This helps to unravel complex biological interactions. In addition, it includes methods to evaluate the statistical significance of these associations, providing a robust framework for multi-omics data integration and analysis. This package is ideal for researchers in computational biology, bioinformatics, and systems biology seeking a comprehensive tool for understanding interdependencies in omics data.
This is the first package allowing for the estimation, visualization and prediction of the most well-known football models: double Poisson, bivariate Poisson, Skellam, student_t, diagonal-inflated bivariate Poisson, and zero-inflated Skellam. It supports both maximum likelihood estimation (MLE, for static models only) and Bayesian inference. For Bayesian methods, it incorporates several techniques: MCMC sampling with Hamiltonian Monte Carlo, variational inference using either the Pathfinder algorithm or Automatic Differentiation Variational Inference (ADVI), and the Laplace approximation. The package compiles all the CmdStan models once during installation using the instantiate package. The model construction relies on the most well-known football references, such as Dixon and Coles (1997) <doi:10.1111/1467-9876.00065>, Karlis and Ntzoufras (2003) <doi:10.1111/1467-9884.00366> and Egidi, Pauli and Torelli (2018) <doi:10.1177/1471082X18798414>.
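To make the simplest of these models concrete, the sketch below computes match outcome probabilities under a double Poisson model, in which home and away goals are independent Poisson counts (so the goal difference follows a Skellam distribution). It is a Python illustration with fixed, hypothetical scoring rates, not the package's estimation code; the function names are hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def match_outcome_probs(home_rate, away_rate, max_goals=25):
    """Home-win, draw, and away-win probabilities, truncating each score at max_goals."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Hypothetical scoring rates for one fixture.
home_win, draw, away_win = match_outcome_probs(home_rate=1.6, away_rate=1.1)
```

In a fitted model the two rates would come from team attack and defence effects plus a home advantage term rather than being supplied directly.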
Facilitates modeling species ecological niches and geographic distributions based on occurrences and environments that have a vertical as well as horizontal component, and projecting models into three-dimensional geographic space. Working in three dimensions is useful in an aquatic context when the organisms one wishes to model can be found across a wide range of depths in the water column. The package also contains functions to automatically generate training regions for marine models using machine learning, and to interpolate and smooth patchily sampled environmental rasters using thin plate splines. Davis Rabosky AR, Cox CL, Rabosky DL, Title PO, Holmes IA, Feldman A, McGuire JA (2016) <doi:10.1038/ncomms11484>. Nychka D, Furrer R, Paige J, Sain S (2021) <doi:10.5065/D6W957CT>. Pateiro-Lopez B, Rodriguez-Casal A (2022) <https://CRAN.R-project.org/package=alphahull>.
This package contains a suite of functions for health economic evaluations with missing outcome data. The package can fit different types of statistical models under a fully Bayesian approach using the software JAGS (which should be installed locally and which is loaded in missingHE via the R package R2jags). Three classes of models can be fitted under a variety of missing data assumptions: selection models, pattern mixture models and hurdle models. In addition to model fitting, missingHE provides a set of specialised functions to assess model convergence and fit, and to summarise the statistical and economic results using different types of measures and graphs. The methods implemented are described in Mason (2018) <doi:10.1002/hec.3793>, Molenberghs (2000) <doi:10.1007/978-1-4419-0300-6_18> and Gabrio (2019) <doi:10.1002/sim.8045>.
This package implements confidence interval and sample size methods that are especially useful in psychological research. The methods can be applied in 1-group, 2-group, paired-samples, and multiple-group designs and to a variety of parameters including means, medians, proportions, slopes, standardized mean differences, standardized linear contrasts of means, plus several measures of correlation and association. Confidence interval and sample size functions are given for single parameters as well as differences, ratios, and linear contrasts of parameters. The sample size functions can be used to approximate the sample size needed to estimate a parameter or function of parameters with desired confidence interval precision or to perform a variety of hypothesis tests (directional two-sided, equivalence, superiority, noninferiority) with desired power. For details see: Statistical Methods for Psychologists, Volumes 1–4, <https://dgbonett.sites.ucsc.edu/>.
Package test2norm contains functions to generate formulas for normative standards applied to cognitive tests. It takes raw test scores (e.g., number of correct responses) and converts them to scaled scores and demographically adjusted scores, using methods described in Heaton et al. (2003) <doi:10.1016/B978-012703570-3/50010-9> & Heaton et al. (2009, ISBN:9780199702800). The scaled scores are calculated as quantiles of the raw test scores, scaled to have the mean of 10 and standard deviation of 3, such that higher values always correspond to better performance on the test. The demographically adjusted scores are calculated from the residuals of a model that regresses scaled scores on demographic predictors (e.g., age). The norming procedure makes use of the mfp2() function from the mfp2 package to explore nonlinear associations between cognition and demographic variables.
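One way to read the scaled-score step is sketched below in Python. It is an assumption about the general approach, not the package's exact algorithm: each raw score is mapped to a mid-rank quantile, the quantile is converted to a normal deviate, and the deviate is rescaled to mean 10 and standard deviation 3. The function name and the `higher_is_better` argument are hypothetical.

```python
from statistics import NormalDist

def scaled_scores(raw, higher_is_better=True):
    """Convert raw scores to scaled scores (mean 10, SD 3) via normal quantiles."""
    # Flip the sign when a lower raw score means better performance,
    # so that higher scaled scores always indicate better performance.
    oriented = list(raw) if higher_is_better else [-x for x in raw]
    n = len(oriented)
    standard_normal = NormalDist()
    result = []
    for x in oriented:
        below = sum(1 for y in oriented if y < x)
        ties = sum(1 for y in oriented if y == x)
        quantile = (below + 0.5 * ties) / n  # mid-rank quantile, strictly inside (0, 1)
        result.append(10.0 + 3.0 * standard_normal.inv_cdf(quantile))
    return result

# Hypothetical raw scores (number of correct responses).
scores = scaled_scores([3, 7, 5])
```

The demographically adjusted scores would then come from the residuals of a regression of these scaled scores on demographic predictors, which is not shown here.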
Maximum likelihood estimation of copula-based zero-inflated (and non-inflated) Poisson and negative binomial count models, based on the article <doi:10.18637/jss.v109.i01>. Supports Frank and Gaussian copulas. Allows for mixed margins (e.g., one margin Poisson, the other zero-inflated negative binomial), and several marginal link functions. Built-in methods for publication-quality tables using texreg, post-estimation diagnostics using DHARMa, and testing for marginal zero-modification via <doi:10.1177/0962280217749991>. For information on copula regression for count data, see Genest and Nešlehová (2007) <doi:10.1017/S0515036100014963> as well as Nikoloulopoulos (2013) <doi:10.1007/978-3-642-35407-6_11>. For information on zero-inflated count regression generally, see Lambert (1992) <https://www.jstor.org/stable/1269547?origin=crossref>. The author acknowledges support by NSF DMS-1925119 and DMS-212324.
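For readers unfamiliar with zero inflation, the sketch below shows the probability mass function of a single zero-inflated Poisson margin: with probability `zero_prob` the count is a structural zero, otherwise it is drawn from a Poisson distribution. This Python sketch covers one margin only and omits the copula; the function and parameter names are hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """Ordinary Poisson probability of observing the count k."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def zip_pmf(k, rate, zero_prob):
    """Zero-inflated Poisson probability of observing the count k."""
    if k == 0:
        # Structural zeros plus zeros generated by the Poisson component.
        return zero_prob + (1.0 - zero_prob) * math.exp(-rate)
    return (1.0 - zero_prob) * poisson_pmf(k, rate)

# Hypothetical parameters: Poisson rate 2.0 with 30% structural zeros.
probs = [zip_pmf(k, rate=2.0, zero_prob=0.3) for k in range(61)]
mean_count = sum(k * p for k, p in enumerate(probs))
```

The mean of this distribution is (1 - zero_prob) x rate, which is why an unmodelled excess of zeros biases a plain Poisson fit downward.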
This package provides a dataframe-friendly implementation of ComBat Harmonization which uses an empirical Bayesian framework to remove batch effects. Johnson WE & Li C (2007) <doi:10.1093/biostatistics/kxj037> "Adjusting batch effects in microarray expression data using empirical Bayes methods." Fortin J-P, Cullen N, Sheline YI, Taylor WD, Aselcioglu I, Cook PA, Adams P, Cooper C, Fava M, McGrath PJ, McInnes M, Phillips ML, Trivedi MH, Weissman MM, & Shinohara RT (2017) <doi:10.1016/j.neuroimage.2017.11.024> "Harmonization of cortical thickness measurements across scanners and sites." Fortin J-P, Parker D, Tunç B, Watanabe T, Elliott MA, Ruparel K, Roalf DR, Satterthwaite TD, Gur RC, Gur RE, Schultz RT, Verma R, & Shinohara RT (2017) <doi:10.1016/j.neuroimage.2017.08.047> "Harmonization of multi-site diffusion tensor imaging data."
Implementation of the Generalized Pairwise Comparisons (GPC) as defined in Buyse (2010) <doi:10.1002/sim.3923> for complete observations, and extended in Peron (2018) <doi:10.1177/0962280216658320> to deal with right-censoring. GPC compares two groups of observations (intervention vs. control group) regarding several prioritized endpoints to estimate the probability that a random observation drawn from one group performs better/worse/equivalently than a random observation drawn from the other group. Summary statistics such as the net treatment benefit, win ratio, or win odds are then deduced from these probabilities. Confidence intervals and p-values are obtained based on asymptotic results (Ozenne 2021 <doi:10.1177/09622802211037067>), non-parametric bootstrap, or permutations. The software enables the use of thresholds of minimal importance difference, stratification, non-prioritized endpoints (O'Brien test), and can handle right-censoring and competing-risks.
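The pairwise comparison idea can be sketched for complete observations as follows. This Python sketch is illustrative and is not the package's interface: function names are hypothetical, higher values are assumed to be better on every endpoint, and a difference must strictly exceed the threshold to count (an assumption). Endpoints are examined in priority order, and a pair moves to the next endpoint only while it remains neutral.

```python
from itertools import product

def classify_pair(treated, control, thresholds):
    """Return 'win', 'loss', or 'tie' for one treated/control pair."""
    for t, c, threshold in zip(treated, control, thresholds):
        if t - c > threshold:
            return "win"
        if c - t > threshold:
            return "loss"
    return "tie"  # neutral on every endpoint

def gpc_summary(treatment, control, thresholds):
    """Net benefit, win ratio, and win odds from all treated/control pairs."""
    counts = {"win": 0, "loss": 0, "tie": 0}
    for t, c in product(treatment, control):
        counts[classify_pair(t, c, thresholds)] += 1
    n_pairs = len(treatment) * len(control)
    p_win, p_loss, p_tie = (counts[k] / n_pairs for k in ("win", "loss", "tie"))
    return {
        "net_benefit": p_win - p_loss,
        "win_ratio": p_win / p_loss if p_loss > 0 else float("inf"),
        "win_odds": (p_win + 0.5 * p_tie) / (p_loss + 0.5 * p_tie),
    }

# Hypothetical data: each observation is (endpoint 1, endpoint 2).
treatment = [(5, 1), (3, 0)]
control = [(4, 0), (1, 1), (7, 0)]
summary = gpc_summary(treatment, control, thresholds=(1, 0))
```

With these six pairs there are three wins, two losses, and one tie, giving a net benefit of 1/6, a win ratio of 1.5, and win odds of 1.4.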
The Grouphmap was implemented in R, an open-source programming environment, and was released under the provided website. The difference analysis is based on the limma package, which can cover gene and protein expression profiles (Reference: Matthew E Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, Gordon K Smyth (2015) <doi:10.1093/nar/gkv007>). The GO enrichment analysis is based on the clusterProfiler package and supports three common species: human, mouse, and yeast (Reference: Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He (2012) <doi:10.1089/omi.2011.0118>). The results of batch difference analysis and enrichment analysis are output in separate folders for easy viewing and further visualization of the results during the process. The results are returned as a heatmap in R and exported to three folders named DEG, go, and merge.
This package provides functions to design and apply tests that are anytime valid. The functions can be used to design hypothesis tests in the prospective/randomised control trial setting or in the observational/retrospective setting. The resulting tests remain valid under both optional stopping and optional continuation. The current version includes safe t-tests and safe tests of two proportions. For details on the theory of safe tests, see Grunwald, de Heide and Koolen (2019) "Safe Testing" <arXiv:1906.07801>, for details on safe logrank tests see ter Schure, Perez-Ortiz, Ly and Grunwald (2020) "The Safe Logrank Test: Error Control under Continuous Monitoring with Unlimited Horizon" <arXiv:2011.06931v3> and Turner, Ly and Grunwald (2021) "Safe Tests and Always-Valid Confidence Intervals for contingency tables and beyond" <arXiv:2106.02693> for details on safe contingency table tests.
Data practitioners regularly use the R and Python programming languages to prepare data for analyses. Thus, they encode important data preprocessing decisions in R and Python code. The smallsets package subsequently decodes these decisions into a Smallset Timeline, a static, compact visualisation of data preprocessing decisions (Lucchesi et al. (2022) <doi:10.1145/3531146.3533175>). The visualisation consists of small data snapshots of different preprocessing steps. The smallsets package builds this visualisation from a user's dataset and preprocessing code located in an R, R Markdown, Python, or Jupyter Notebook file. Users simply add structured comments with snapshot instructions to the preprocessing code. One optional feature in smallsets requires installation of the Gurobi optimisation software and gurobi R package, available from <https://www.gurobi.com>. More information regarding the optional feature and gurobi installation can be found in the smallsets vignette.
Social network analysis is becoming commonplace in many social science disciplines, but access to useful network data, especially among marginalized populations, still remains a formidable challenge. This package mitigates that problem by providing tools to simulate spatial Bernoulli networks as proposed in Carter T. Butts (2002, ISBN:978-0-493-72676-2), "Spatial models of large-scale interpersonal networks." Using this package, network analysts can simulate a spatial point process or sequence with a given number of nodes inside a geographical boundary and estimate the probability of a tie formation between all node pairs. When simulating a network, an analyst can choose between five spatial interaction functions. The package also enables quick comparison of summary statistics for simulated networks and provides simple to use plotting methods for its classes that return plots which can be further refined with the ggplot2 package.
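A spatial Bernoulli network can be sketched in a few lines: nodes are placed at random inside a boundary, and each pair is tied independently with a probability that depends on the distance between them. The Python sketch below is illustrative, not the package's interface; the rectangular boundary, the power-law form of the spatial interaction function, its parameter values, and all names are assumptions.

```python
import math
import random
from itertools import combinations

def power_law_sif(distance, base_prob=0.9, scale=1.0, exponent=2.0):
    """Hypothetical spatial interaction function: tie probability decays with distance."""
    return base_prob / (1.0 + scale * distance) ** exponent

def simulate_network(n_nodes, width, height, sif, seed):
    """Place nodes uniformly in a width x height rectangle and draw Bernoulli ties."""
    rng = random.Random(seed)  # seeded for reproducibility
    coords = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n_nodes)]
    edges = []
    for i, j in combinations(range(n_nodes), 2):
        tie_prob = sif(math.dist(coords[i], coords[j]))
        if rng.random() < tie_prob:  # independent Bernoulli draw for each pair
            edges.append((i, j))
    return coords, edges

coords, edges = simulate_network(n_nodes=20, width=10.0, height=10.0,
                                 sif=power_law_sif, seed=42)
density = len(edges) / math.comb(20, 2)
```

Swapping `power_law_sif` for another decreasing function of distance changes how strongly geography constrains tie formation without altering the rest of the simulation.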
Efficient implementations of functions for the creation, modification and analysis of phylogenetic trees. Applications include: generation of trees with specified shapes; tree rearrangement; analysis of tree shape; rooting of trees and extraction of subtrees; calculation and depiction of split support; plotting the position of rogue taxa (Klopfstein & Spasojevic 2019) <doi:10.1371/journal.pone.0212942>; calculation of ancestor-descendant relationships, of stemwardness (Asher & Smith, 2022) <doi:10.1093/sysbio/syab072>, and of tree balance (Mir et al. 2013, Lemant et al. 2022) <doi:10.1016/j.mbs.2012.10.005>, <doi:10.1093/sysbio/syac027>; artificial extinction (Asher & Smith, 2022) <doi:10.1093/sysbio/syab072>; import and export of trees from Newick, Nexus (Maddison et al. 1997) <doi:10.1093/sysbio/46.4.590>, and TNT <https://www.lillo.org.ar/phylogeny/tnt/> formats; and analysis of splits and cladistic information.
This package provides convenient methods for accessing the data in dist objects with minimal memory and computational overhead. disttools can be used to extract the distance between any pair or combination of points encoded by a dist object using only the indices of those points. This is an improvement over existing functionality, which requires either coercing a dist object into a matrix or calculating the one dimensional index corresponding to a pair of observations. Coercion to a matrix is undesirable because doing so doubles the amount of memory required for storage. In contrast, there is no inherent downside to the latter solution. However, in part due to several edge cases, correctly and efficiently implementing such a solution can be challenging. disttools abstracts away these challenges and provides a simple interface to access the data in a dist object using the latter approach.
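The one-dimensional index mentioned above can be sketched as follows. A dist object stores the lower triangle of the distance matrix column by column; the Python sketch mimics that layout with a flat list and uses 0-based indices, so it illustrates the indexing arithmetic rather than the package's R interface. All function names are hypothetical.

```python
import math

def condensed_distances(points):
    """Lower triangle in column order: d(1,0), d(2,0), ..., d(n-1,0), d(2,1), ..."""
    n = len(points)
    return [math.dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]

def condensed_index(i, j, n):
    """Position in the flat vector of the distance between observations i and j."""
    if i == j:
        raise ValueError("the diagonal is not stored; the distance is 0")
    if i > j:
        i, j = j, i  # the layout only stores pairs with i < j
    # n*i - i*(i+1)/2 entries precede column i; j - i - 1 is the offset within it.
    return n * i - i * (i + 1) // 2 + (j - i - 1)

def get_distance(condensed, i, j, n):
    """Look up one pairwise distance without expanding to a full matrix."""
    return 0.0 if i == j else condensed[condensed_index(i, j, n)]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
flat = condensed_distances(points)
d_12 = get_distance(flat, 1, 2, len(points))
```

The swap and the diagonal check are two of the edge cases that make a hand-rolled version easy to get wrong.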
The Clutter model is a significant forest growth simulation tool. Grounded on individual trees and comprehensively considering factors such as competition among trees and the impact of environmental elements on growth, it can accurately reflect the growth process of forest stands. It can be applied in areas like forest resource management, harvesting planning, and ecological research. With the help of the Clutter model, people can better understand the dynamic changes of forests and provide a scientific basis for rational forest management and protecting the ecological environment. This R package can effectively realize the construction of forest growth and harvest models based on the Clutter model and achieve optimized forest management. References: Farias A, Soares C, Leite H et al. (2021) <doi:10.1007/s10342-021-01380-1>. Guera O, Silva J, Ferreira R, et al. (2019) <doi:10.1590/2179-8087.038117>.
This package provides functions to prepare time priors for MCMCtree analyses in the PAML software from Yang (2007) <doi:10.1093/molbev/msm088> and plot time-scaled phylogenies from any Bayesian divergence time analysis. Most time-calibrated node prior distributions require user-specified parameters. The package provides functions to refine these parameters, so that the resulting prior distributions accurately reflect confidence in known, usually fossil, time information. These functions also enable users to visualise distributions and write MCMCtree-ready input files. Additionally, the package supplies flexible functions to visualise age uncertainty on a plotted tree using node bars, branch widths proportional to the age uncertainty, or full posterior distributions plotted on nodes. Time-scaled phylogenetic plots can be visualised with absolute and geological timescales. All plotting functions are applicable with output from any Bayesian software, not just MCMCtree.
This package provides a set of functions for applying a restricted linear algebra to the analysis of count-based data. See the accompanying preprint manuscript: "Normalizing need not be the norm: count-based math for analyzing single-cell data" Church et al. (2022) <doi:10.1101/2022.06.01.494334>. This tool is specifically designed to analyze count matrices from single cell RNA sequencing assays. The tools implement several count-based approaches for standard steps in single-cell RNA-seq analysis, including scoring genes and cells, comparing cells and clustering, calculating differential gene expression, and several methods for rank reduction. There are many opportunities for further optimization that may prove useful in the analysis of other data. The source code is freely available at <https://github.com/shchurch/countland>, and users and developers are encouraged to fork the code for their own purposes.