This package provides access to word predictability estimates using large language models (LLMs) based on transformer architectures via integration with the Hugging Face ecosystem <https://huggingface.co/>. The package interfaces with pre-trained neural networks and supports both causal/auto-regressive LLMs (e.g., GPT-2') and masked/bidirectional LLMs (e.g., BERT') to compute the probability of words, phrases, or tokens given their linguistic context. For details on GPT-2 and causal models, see Radford et al. (2019) <https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf>, for details on BERT and masked models, see Devlin et al. (2019) <doi:10.48550/arXiv.1810.04805>
. By enabling a straightforward estimation of word predictability, the package facilitates research in psycholinguistics, computational linguistics, and natural language processing (NLP).
This package provides methods for statistical analysis of count data and quantal data. For the analysis of count data an implementation of the Closure Principle Computational Approach Test ("CPCAT") is provided (Lehmann, R et al. (2016) <doi:10.1007/s00477-015-1079-4>), as well as an implementation of a "Dunnett GLM" approach using a Quasi-Poisson regression (Hothorn, L, Kluxen, F (2020) <doi:10.1101/2020.01.15.907881>). For the analysis of quantal data an implementation of the Closure Principle Fisherâ Freemanâ Halton test ("CPFISH") is provided (Lehmann, R et al. (2018) <doi:10.1007/s00477-017-1392-1>). P-values and no/lowest observed (adverse) effect concentration values are calculated. All implemented methods include further functions to evaluate the power and the minimum detectable difference using a bootstrapping approach.
Analysis of complex plant root system architectures (RSA) using the output files created by Data Analysis of Root Tracings (DART), an open-access software dedicated to the study of plant root architecture and development across time series (Le Bot et al (2010) "DART: a software to analyse root system architecture and development from captured images", Plant and Soil, <DOI:10.1007/s11104-009-0005-2>), and RSA data encoded with the Root System Markup Language (RSML) (Lobet et al (2015) "Root System Markup Language: toward a unified root architecture description language", Plant Physiology, <DOI:10.1104/pp.114.253625>). More information can be found in Delory et al (2016) "archiDART
: an R package for the automated computation of plant root architectural traits", Plant and Soil, <DOI:10.1007/s11104-015-2673-4>.
The primary goal of phase I clinical trials is to find the maximum tolerated dose (MTD). To reach this objective, we introduce a new design for phase I clinical trials, the posterior predictive (PoP
) design. The PoP
design is an innovative model-assisted design that is as simply as the conventional algorithmic designs as its decision rules can be pre-tabulated prior to the onset of trial, but is of more flexibility of selecting diverse target toxicity rates and cohort sizes. The PoP
design has desirable properties, such as coherence and consistency. Moreover, the PoP
design provides better empirical performance than the BOIN and Keyboard design with respect to high average probabilities of choosing the MTD and slightly lower risk of treating patients at subtherapeutic or overly toxic doses.
`SPOTlight` provides a method to deconvolute spatial transcriptomics spots using a seeded NMF approach along with visualization tools to assess the results. Spatially resolved gene expression profiles are key to understand tissue organization and function. However, novel spatial transcriptomics (ST) profiling techniques lack single-cell resolution and require a combination with single-cell RNA sequencing (scRNA-seq
) information to deconvolute the spatially indexed datasets. Leveraging the strengths of both data types, we developed SPOTlight, a computational tool that enables the integration of ST with scRNA-seq
data to infer the location of cell types and states within a complex tissue. SPOTlight is centered around a seeded non-negative matrix factorization (NMF) regression, initialized using cell-type marker genes and non-negative least squares (NNLS) to subsequently deconvolute ST capture locations (spots).
Fit Generalized Linear Models to continuous and count outcomes, as well as estimate the prevalence of misrepresentation of an important binary predictor. Misrepresentation typically arises when there is an incentive for the binary factor to be misclassified in one direction (e.g., in insurance settings where policy holders may purposely deny a risk status in order to lower the insurance premium). This is accomplished by treating a subset of the response variable as resulting from a mixture distribution. Model parameters are estimated via the Expectation Maximization algorithm and standard errors of the estimates are obtained from closed forms of the Observed Fisher Information. For an introduction to the models and the misrepresentation framework, see Xia et. al., (2023) <https://variancejournal.org/article/73151-maximum-likelihood-approaches-to-misrepresentation-models-in-glm-ratemaking-model-comparisons>.
Implementation of various estimation methods for dynamic factor models (DFMs) including principal components analysis (PCA) Stock and Watson (2002) <doi:10.1198/016214502388618960>, 2Stage Giannone et al. (2008) <doi:10.1016/j.jmoneco.2008.05.010>, expectation-maximisation (EM) Banbura and Modugno (2014) <doi:10.1002/jae.2306>, and the novel EM-sparse approach for sparse DFMs Mosley et al. (2023) <arXiv:2303.11892>
. Options to use classic multivariate Kalman filter and smoother (KFS) equations from Shumway and Stoffer (1982) <doi:10.1111/j.1467-9892.1982.tb00349.x> or fast univariate KFS equations from Koopman and Durbin (2000) <doi:10.1111/1467-9892.00186>, and options for independent and identically distributed (IID) white noise or auto-regressive (AR(1)) idiosyncratic errors. Algorithms coded in C++ and linked to R via RcppArmadillo
'.
Collection of routines for efficient scientific computations in physics and astrophysics. These routines include utility functions, numerical computation tools, as well as visualisation tools. They can be used, for example, for generating random numbers from spherical and custom distributions, information and entropy analysis, special Fourier transforms, two-point correlation estimation (e.g. as in Landy & Szalay (1993) <doi:10.1086/172900>), binning & gridding of point sets, 2D interpolation, Monte Carlo integration, vector arithmetic and coordinate transformations. Also included is a non-exhaustive list of important constants and cosmological conversion functions. The graphics routines can be used to produce and export publication-ready scientific plots and movies, e.g. as used in Obreschkow et al. (2020, MNRAS Vol 493, Issue 3, Pages 4551â 4569). These routines include special color scales, projection functions, and bitmap handling routines.
We provide functions for identifying the core community phylogeny in any microbiome, drawing phylogenetic Venn diagrams, calculating the core Faithâ s PD for a set of communities, and calculating the core UniFrac
distance between two sets of communities. All functions rely on construction of a core community phylogeny, which is a phylogeny where branches are defined based on their presence in multiple samples from a single type of habitat. Our package provides two options for constructing the core community phylogeny, a tip-based approach, where the core community phylogeny is identified based on incidence of leaf nodes and a branch-based approach, where the core community phylogeny is identified based on incidence of individual branches. We suggest use of the microViz
package, which can be downloaded from the website provided under Additional repositories.
Create lipidome-wide heatmaps of statistics with the lipidomeR
'. The lipidomeR
provides a streamlined pipeline for the systematic interpretation of the lipidome through publication-ready visualizations of regression models fitted on lipidomics data. With lipidomeR
', associations between covariates and the lipidome can be interpreted systematically and intuitively through heatmaps, where lipids are categorized by the lipid class and are presented on two-dimensional maps organized by the lipid size and level of saturation. This way, the lipidomeR
helps you gain an immediate understanding of the multivariate patterns in the lipidome already at first glance. You can create lipidome-wide heatmaps of statistical associations, changes, differences, variation, or other lipid-specific values. The heatmaps are provided with publication-ready quality and the results behind the visualizations are based on rigorous statistical models.
The Mutual Information Index (M) introduced to social science literature by Theil and Finizza (1971) <doi:10.1080/0022250X.1971.9989795> is a multigroup segregation measure that is highly decomposable and that according to Frankel and Volij (2011) <doi:10.1016/j.jet.2010.10.008> and Mora and Ruiz-Castillo (2011) <doi:10.1111/j.1467-9531.2011.01237.x> satisfies the Strong Unit Decomposability and Strong Group Decomposability properties. This package allows computing and decomposing the total index value into its "between" and "within" terms. These last terms can also be decomposed into their contributions, either by group or unit characteristics. The factors that produce each "within" term can also be displayed at the user's request. The results can be computed considering a variable or sets of variables that define separate clusters.
ExCluster
flattens Ensembl and GENCODE GTF files into GFF files, which are used to count reads per non-overlapping exon bin from BAM files. This read counting is done using the function featureCounts
from the package Rsubread. Library sizes are normalized across all biological replicates, and ExCluster
then compares two different conditions to detect signifcantly differentially spliced genes. This process requires at least two independent biological repliates per condition, and ExCluster
accepts only exactly two conditions at a time. ExCluster
ultimately produces false discovery rates (FDRs) per gene, which are used to detect significance. Exon log2 fold change (log2FC) means and variances may be plotted for each significantly differentially spliced gene, which helps scientists develop hypothesis and target differential splicing events for RT-qPCR
validation in the wet lab.
This package provides tools to safely and efficiently organize and execute Monte Carlo simulation experiments in R. The package controls the structure and back-end of Monte Carlo simulation experiments by utilizing a generate-analyse-summarise workflow. The workflow safeguards against common simulation coding issues, such as automatically re-simulating non-convergent results, prevents inadvertently overwriting simulation files, catches error and warning messages during execution, implicitly supports parallel processing with high-quality random number generation, and provides tools for managing high-performance computing (HPC) array jobs submitted to schedulers such as SLURM. For a pedagogical introduction to the package see Sigal and Chalmers (2016) <doi:10.1080/10691898.2016.1246953>. For a more in-depth overview of the package and its design philosophy see Chalmers and Adkins (2020) <doi:10.20982/tqmp.16.4.p248>.
This is the first package allowing for the estimation, visualization and prediction of the most well-known football models: double Poisson, bivariate Poisson, Skellam, student_t, diagonal-inflated bivariate Poisson, and zero-inflated Skellam. It supports both maximum likelihood estimation (MLE, for static models only) and Bayesian inference. For Bayesian methods, it incorporates several techniques: MCMC sampling with Hamiltonian Monte Carlo, variational inference using either the Pathfinder algorithm or Automatic Differentiation Variational Inference (ADVI), and the Laplace approximation. The package compiles all the CmdStan
models once during installation using the instantiate package. The model construction relies on the most well-known football references, such as Dixon and Coles (1997) <doi:10.1111/1467-9876.00065>, Karlis and Ntzoufras (2003) <doi:10.1111/1467-9884.00366> and Egidi, Pauli and Torelli (2018) <doi:10.1177/1471082X18798414>.
Facilitates modeling species ecological niches and geographic distributions based on occurrences and environments that have a vertical as well as horizontal component, and projecting models into three-dimensional geographic space. Working in three dimensions is useful in an aquatic context when the organisms one wishes to model can be found across a wide range of depths in the water column. The package also contains functions to automatically generate marine training model training regions using machine learning, and interpolate and smooth patchily sampled environmental rasters using thin plate splines. Davis Rabosky AR, Cox CL, Rabosky DL, Title PO, Holmes IA, Feldman A, McGuire
JA (2016) <doi:10.1038/ncomms11484>. Nychka D, Furrer R, Paige J, Sain S (2021) <doi:10.5065/D6W957CT>. Pateiro-Lopez B, Rodriguez-Casal A (2022) <https://CRAN.R-project.org/package=alphahull>.
This package contains a suite of functions for health economic evaluations with missing outcome data. The package can fit different types of statistical models under a fully Bayesian approach using the software JAGS (which should be installed locally and which is loaded in missingHE
via the R package R2jags'). Three classes of models can be fitted under a variety of missing data assumptions: selection models, pattern mixture models and hurdle models. In addition to model fitting, missingHE
provides a set of specialised functions to assess model convergence and fit, and to summarise the statistical and economic results using different types of measures and graphs. The methods implemented are described in Mason (2018) <doi:10.1002/hec.3793>, Molenberghs (2000) <doi:10.1007/978-1-4419-0300-6_18> and Gabrio (2019) <doi:10.1002/sim.8045>.
This package implements confidence interval and sample size methods that are especially useful in psychological research. The methods can be applied in 1-group, 2-group, paired-samples, and multiple-group designs and to a variety of parameters including means, medians, proportions, slopes, standardized mean differences, standardized linear contrasts of means, plus several measures of correlation and association. Confidence interval and sample size functions are given for single parameters as well as differences, ratios, and linear contrasts of parameters. The sample size functions can be used to approximate the sample size needed to estimate a parameter or function of parameters with desired confidence interval precision or to perform a variety of hypothesis tests (directional two-sided, equivalence, superiority, noninferiority) with desired power. For details see: Statistical Methods for Psychologists, Volumes 1 â 4, <https://dgbonett.sites.ucsc.edu/>.
Package test2norm contains functions to generate formulas for normative standards applied to cognitive tests. It takes raw test scores (e.g., number of correct responses) and converts them to scaled scores and demographically adjusted scores, using methods described in Heaton et al. (2003) <doi:10.1016/B978-012703570-3/50010-9> & Heaton et al. (2009, ISBN:9780199702800). The scaled scores are calculated as quantiles of the raw test scores, scaled to have the mean of 10 and standard deviation of 3, such that higher values always correspond to better performance on the test. The demographically adjusted scores are calculated from the residuals of a model that regresses scaled scores on demographic predictors (e.g., age). The norming procedure makes use of the mfp2()
function from the mfp2 package to explore nonlinear associations between cognition and demographic variables.
Maximum likelihood estimation of copula-based zero-inflated (and non-inflated) Poisson and negative binomial count models, based on the article <doi:10.18637/jss.v109.i01>. Supports Frank and Gaussian copulas. Allows for mixed margins (e.g., one margin Poisson, the other zero-inflated negative binomial), and several marginal link functions. Built-in methods for publication-quality tables using texreg', post-estimation diagnostics using DHARMa', and testing for marginal zero-modification via <doi:10.1177/0962280217749991>. For information on copula regression for count data, see Genest and Nešlehová (2007) <doi:10.1017/S0515036100014963> as well as Nikoloulopoulos (2013) <doi:10.1007/978-3-642-35407-6_11>. For information on zero-inflated count regression generally, see Lambert (1992) <https://www.jstor.org/stable/1269547?origin=crossref>. The author acknowledges support by NSF DMS-1925119 and DMS-212324.
This package provides a dataframe-friendly implementation of ComBat
Harmonization which uses an empirical Bayesian framework to remove batch effects. Johnson WE & Li C (2007) <doi:10.1093/biostatistics/kxj037> "Adjusting batch effects in microarray expression data using empirical Bayes methods." Fortin J-P, Cullen N, Sheline YI, Taylor WD, Aselcioglu I, Cook PA, Adams P, Cooper C, Fava M, McGrath
PJ, McInnes
M, Phillips ML, Trivedi MH, Weissman MM, & Shinohara RT (2017) <doi:10.1016/j.neuroimage.2017.11.024> "Harmonization of cortical thickness measurements across scanners and sites." Fortin J-P, Parker D, Tun<e7> B, Watanabe T, Elliott MA, Ruparel K, Roalf DR, Satterthwaite TD, Gur RC, Gur RE, Schultz RT, Verma R, & Shinohara RT (2017) <doi:10.1016/j.neuroimage.2017.08.047> "Harmonization of multi-site diffusion tensor imaging data.".
Implementation of the Generalized Pairwise Comparisons (GPC) as defined in Buyse (2010) <doi:10.1002/sim.3923> for complete observations, and extended in Peron (2018) <doi:10.1177/0962280216658320> to deal with right-censoring. GPC compare two groups of observations (intervention vs. control group) regarding several prioritized endpoints to estimate the probability that a random observation drawn from one group performs better/worse/equivalently than a random observation drawn from the other group. Summary statistics such as the net treatment benefit, win ratio, or win odds are then deduced from these probabilities. Confidence intervals and p-values are obtained based on asymptotic results (Ozenne 2021 <doi:10.1177/09622802211037067>), non-parametric bootstrap, or permutations. The software enables the use of thresholds of minimal importance difference, stratification, non-prioritized endpoints (O Brien test), and can handle right-censoring and competing-risks.
The Grouphmap was implemented in R, an open-source programming environment, and was released under the provided website. The difference analysis is based on the limma package, which can cover gene and protein expression profiles (Reference: Matthew E Ritchie , Belinda Phipson , Di Wu , Yifang Hu , Charity W Law , Wei Shi , Gordon K Smyth (2015) <doi:10.1093/nar/gkv007>). The GO enrichment analysis is based on the clusterProfiler
package and supports three common species: human, mouse, and yeast (Reference: Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He (2012) <doi:10.1089/omi.2011.0118>). The results of batch difference analysis and enrichment analysis are output in separate folders for easy viewing and further visualization of the results during the process. The results returned a heatmap in R and exported to 3 folders named DEG, go, and merge.
Social network analysis is becoming commonplace in many social science disciplines, but access to useful network data, especially among marginalized populations, still remains a formidable challenge. This package mitigates that problem by providing tools to simulate spatial Bernoulli networks as proposed in Carter T. Butts (2002, ISBN:978-0-493-72676-2), "Spatial models of large-scale interpersonal networks." Using this package, network analysts can simulate a spatial point process or sequence with a given number of nodes inside a geographical boundary and estimate the probability of a tie formation between all node pairs. When simulating a network, an analyst can choose between five spatial interaction functions. The package also enables quick comparison of summary statistics for simulated networks and provides simple to use plotting methods for its classes that return plots which can be further refined with the ggplot2 package.
This package provides functions to design and apply tests that are anytime valid. The functions can be used to design hypothesis tests in the prospective/randomised control trial setting or in the observational/retrospective setting. The resulting tests remain valid under both optional stopping and optional continuation. The current version includes safe t-tests and safe tests of two proportions. For details on the theory of safe tests, see Grunwald, de Heide and Koolen (2019) "Safe Testing" <arXiv:1906.07801>
, for details on safe logrank tests see ter Schure, Perez-Ortiz, Ly and Grunwald (2020) "The Safe Logrank Test: Error Control under Continuous Monitoring with Unlimited Horizon" <arXiv:2011.06931v3>
and Turner, Ly and Grunwald (2021) "Safe Tests and Always-Valid Confidence Intervals for contingency tables and beyond" <arXiv:2106.02693>
for details on safe contingency table tests.