High-quality real-world data can be transformed into scientific real-world evidence for regulatory and healthcare decision-making using proven analytical methods and techniques. For example, propensity score (PS) methodology can be applied to select a subset of real-world data containing patients that are similar to those in the current clinical study in terms of baseline covariates, and to stratify the selected patients together with those in the current study into more homogeneous strata. Then, statistical methods such as the power prior approach or composite likelihood approach can be applied in each stratum to draw inference for the parameters of interest. This package provides functions that implement the PS-integrated real-world evidence analysis methods such as Wang et al. (2019) <doi:10.1080/10543406.2019.1657133>, Wang et al. (2020) <doi:10.1080/10543406.2019.1684309>, and Chen et al. (2020) <doi:10.1080/10543406.2020.1730877>.
Composite Kernel Machine Regression based on Likelihood Ratio Test (CKLRT): in this package, we develop a kernel machine regression framework to model the overall genetic effect of a SNP-set, considering the possible GE interaction. Specifically, we use a composite kernel to specify the overall genetic effect via a nonparametric function and we model additional covariates parametrically within the regression framework. The composite kernel is constructed as a weighted average of two kernels, one corresponding to the genetic main effect and one corresponding to the GE interaction effect. We propose a likelihood ratio test (LRT) and a restricted likelihood ratio test (RLRT) for statistical significance. We derive a Monte Carlo approach for the finite sample distributions of LRT and RLRT statistics. (N. Zhao, H. Zhang, J. Clark, A. Maity, M. Wu. Composite Kernel Machine Regression based on Likelihood Ratio Test with Application for Combined Genetic and Gene-environment Interaction Effect (Submitted).).
In a clinical trial with repeated measures designs, outcomes are often taken from subjects at fixed time-points. The focus of the trial may be to compare the mean outcome in two or more groups at some pre-specified time after enrollment. In the presence of missing data auxiliary assumptions are necessary to perform such comparisons. One commonly employed assumption is the missing at random assumption (MAR). The samon package allows the user to perform a (parameterized) sensitivity analysis of this assumption. In particular it can be used to examine the sensitivity of tests in the difference in outcomes to violations of the MAR assumption. The sensitivity analysis can be performed under two scenarios, a) where the data exhibit a monotone missing data pattern (see the samon() function), and, b) where in addition to a monotone missing data pattern the data exhibit intermittent missing values (see the samonIM() function).
This package implements data-driven identification methods for structural vector autoregressive (SVAR) models as described in Lange et al. (2021) <doi:10.18637/jss.v097.i05>. Based on an existing VAR model object (provided by e.g. VAR() from the vars package), the structural impact matrix is obtained via data-driven identification techniques (i.e. changes in volatility (Rigobon, R. (2003) <doi:10.1162/003465303772815727>), patterns of GARCH (Normadin, M., Phaneuf, L. (2004) <doi:10.1016/j.jmoneco.2003.11.002>), independent component analysis (Matteson, D. S, Tsay, R. S., (2013) <doi:10.1080/01621459.2016.1150851>), least dependent innovations (Herwartz, H., Ploedt, M., (2016) <doi:10.1016/j.jimonfin.2015.11.001>), smooth transition in variances (Luetkepohl, H., Netsunajev, A. (2017) <doi:10.1016/j.jedc.2017.09.001>) or non-Gaussian maximum likelihood (Lanne, M., Meitz, M., Saikkonen, P. (2017) <doi:10.1016/j.jeconom.2016.06.002>)).
Assume that a temporal process is composed of contiguous segments with differing slopes and replicated noise-corrupted time series measurements are observed. The unknown mean of the data generating process is modelled as a piecewise linear function of time with an unknown number of change-points. The package infers the joint posterior distribution of the number and position of change-points as well as the unknown mean parameters per time-series by MCMC sampling. A-priori, the proposed model uses an overfitting number of mean parameters but, conditionally on a set of change-points, only a subset of them influences the likelihood. An exponentially decreasing prior distribution on the number of change-points gives rise to a posterior distribution concentrating on sparse representations of the underlying sequence, but also available is the Poisson distribution. See Papastamoulis et al (2019) <doi:10.1515/ijb-2018-0052> for a detailed presentation of the method.
To overcome the memory limitations for fitting linear (LM) and Generalized Linear Models (GLMs) to large data sets, this package implements the Divide and Recombine (D&R) strategy. It basically divides the entire large data set into suitable subsets manageable in size and then fits model to each subset. Finally, results from each subset are aggregated to obtain the final estimate. This package also supports fitting GLMs to data sets that cannot fit into memory and provides methods for fitting GLMs under linear regression, binomial regression, Poisson regression, and multinomial logistic regression settings. Respective models are fitted using different D&R strategies as described by: Xi, Lin, and Chen (2009) <doi:10.1109/TKDE.2008.186>, Xi, Lin and Chen (2006) <doi:10.1109/TKDE.2006.196>, Zuo and Li (2018) <doi:10.4236/ojs.2018.81003>, Karim, M.R., Islam, M.A. (2019) <doi:10.1007/978-981-13-9776-9>.
Helps find meaningful patterns in complex genetic experiments. First gimap takes data from paired CRISPR (Clustered regularly interspaced short palindromic repeats) screens that has been pre-processed to counts table of paired gRNA (guide Ribonucleic Acid) reads. The input data will have cell counts for how well cells grow (or don't grow) when different genes or pairs of genes are disabled. The output of the gimap package is genetic interaction scores which are the distance between the observed CRISPR score and the expected CRISPR score. The expected CRISPR scores are what we expect for the CRISPR values to be for two unrelated genes. The further away an observed CRISPR score is from its expected score the more we suspect genetic interaction. The work in this package is based off of original research from the Alice Berger lab at Fred Hutchinson Cancer Center (2021) <doi:10.1016/j.celrep.2021.109597>.
Projects mean squared out-of-sample error for a linear regression based upon the methodology developed in Rohlfs (2022) <doi:10.48550/arXiv.2209.01493>. It consumes as inputs the lm object from an estimated OLS regression (based on the "training sample") and a data.frame of out-of-sample cases (the "test sample") that have non-missing values for the same predictors. The test sample may or may not include data on the outcome variable; if it does, that variable is not used. The aim of the exercise is to project what what mean squared out-of-sample error can be expected given the predictor values supplied in the test sample. Output consists of a list of three elements: the projected mean squared out-of-sample error, the projected out-of-sample R-squared, and a vector of out-of-sample "hat" or "leverage" values, as defined in the paper.
This package performs Bayesian nonparametric density estimation using Martingale posterior distributions including the Copula Resampling (CopRe) algorithm. Also included are a Gibbs sampler for the marginal Gibbs-type mixture model and an extension to include full uncertainty quantification via a predictive sequence resampling (SeqRe) algorithm. The CopRe and SeqRe samplers generate random nonparametric distributions as output, leading to complete nonparametric inference on posterior summaries. Routines for calculating arbitrary functionals from the sampled distributions are included as well as an important algorithm for finding the number and location of modes, which can then be used to estimate the clusters in the data using, for example, k-means. Implements work developed in Moya B., Walker S. G. (2022). <doi:10.48550/arxiv.2206.08418>, Fong, E., Holmes, C., Walker, S. G. (2021) <doi:10.48550/arxiv.2103.15671>, and Escobar M. D., West, M. (1995) <doi:10.1080/01621459.1995.10476550>.
Analysis of elliptical tubes with applications in biological modeling. The package is based on the references: Taheri, M., Pizer, S. M., & Schulz, J. (2024) "The Mean Shape under the Relative Curvature Condition." Journal of Computational and Graphical Statistics <doi:10.1080/10618600.2025.2535600> and arXiv <doi:10.48550/arXiv.2404.01043>. Mohsen Taheri Shalmani (2024) "Shape Statistics via Skeletal Structures", PhD Thesis, University of Stavanger, Norway <doi:10.13140/RG.2.2.34500.23685>. Key features include constructing discrete elliptical tubes, calculating transformations, validating structures under the Relative Curvature Condition (RCC), computing means, and generating simulations. Supports intrinsic and non-intrinsic mean calculations and transformations, size estimation, plotting, and random sample generation based on a reference tube. The intrinsic approach relies on the interior path of the original non-convex space, incorporating the RCC, while the non-intrinsic approach uses a basic robotic arm transformation that disregards the RCC.
Collection of tools to work with European basketball data. Functions available are related to friendly web scraping, data management and visualization. Data were obtained from <https://www.euroleaguebasketball.net/euroleague/>, <https://www.euroleaguebasketball.net/eurocup/> and <https://www.acb.com/>, following the instructions of their respectives robots.txt files, when available. Box score data are available for the three leagues. Play-by-play and spatial shooting data are also available for the Spanish league. Methods for analysis include a population pyramid, 2D plots, circular plots of players percentiles, plots of players monthly/yearly stats, team heatmaps, team shooting plots, team four factors plots, cross-tables with the results of regular season games, maps of nationalities, combinations of lineups, possessions-related variables, timeouts, performance by periods, personal fouls, offensive rebounds and different types of shooting charts. Please see Vinue (2020) <doi:10.1089/big.2018.0124> and Vinue (2024) <doi:10.1089/big.2023.0177>.
Patients Mental Health (MH) status, Substance Use (SU) status, and concurrent MH/SU status in the American/Canadian Healthcare Administrative Databases can be identified. The detection is based on given parameters of interest by clinicians including the list of plausible ICD MH/SU codes (3/4/5 characters), the required number of visits of hospital for MH/SU , the required number of visits of service physicians for MH/SU, and the maximum time span within MH visits, within SU visits, and, between MH and SU visits. Methods are described in: Khan S <https://pubmed.ncbi.nlm.nih.gov/29044442/>, Keen C, et al. (2021) <doi:10.1111/add.15580>, Lavergne MR, et al. (2022) <doi:10.1186/s12913-022-07759-z>, Casillas, S M, et al. (2022) <doi:10.1016/j.abrep.2022.100464>, CIHI (2022) <https://www.cihi.ca/en>, CDC (2024) <https://www.cdc.gov>, WHO (2019) <https://icd.who.int/en>.
This package provides generalized L-moments estimation methods for the generalized extreme value ('GEV') distribution. Implements both stationary GEV and non-stationary GEV11 models where location and scale parameters vary with time. Includes various penalty functions ('Martins'-'Stedinger', Park, Cannon, Coles'-Dixon) for shape parameter regularization. Also provides model averaging estimation ('ma.gev') that combines MLE and L-moment methods with multiple weighting schemes for robust high quantile estimation. The GLME methodology is described in Shin et al. (2025a) <doi:10.48550/arXiv.2512.20385>. The non-stationary L-moment method is based on Shin et al. (2025b) <doi:10.1007/s42952-025-00325-3>. The model averaging method is described in Shin et al. (2026) <doi:10.1007/s00477-025-03167-x>. See also Hosking (1990) <doi:10.1111/j.2517-6161.1990.tb01775.x> for L-moments theory and Martins and Stedinger (2000) <doi:10.1029/1999WR900330> for penalized likelihood methods.
The routine gof_test() in this package runs the goodness-of-fit test using various test statistic for multivariate data. Models under the null hypothesis can either be simple or allow for parameter estimation. p values are found via the parametric bootstrap (simulation). The routine gof_test_adjusted_pvalues() runs several tests and then finds a p value adjusted for simultaneous inference. The routine gof_power() allows the estimation of the power of the tests. hybrid_test() and hybrid_power() do the same by first generating a Monte Carlo data set under the null hypothesis and then running a number of two-sample methods. The routine run.studies() allows a user to quickly study the power of a new method and how it compares to those included in the package via a large number of case studies. For details of the methods and references see the included vignettes.
This package creates geographic map tiles from geospatial map files or non-geographic map tiles from simple image files. This package provides a tile generator function for creating map tile sets for use with packages such as leaflet'. In addition to generating map tiles based on a common raster layer source, it also handles the non-geographic edge case, producing map tiles from arbitrary images. These map tiles, which have a non-geographic, simple coordinate reference system (CRS), can also be used with leaflet when applying the simple CRS option. Map tiles can be created from an input file with any of the following extensions: tif, grd and nc for spatial maps and png, jpg and bmp for basic images. This package requires Python and the gdal library for Python'. Windows users are recommended to install OSGeo4W (<https://trac.osgeo.org/osgeo4w/>) as an easy way to obtain the required gdal support for Python'.
The genetic algorithm can be used directly to find the similarity of users and more effectively to increase the efficiency of the collaborative filtering method. By identifying the nearest neighbors to the active user, before the genetic algorithm, and by identifying suitable starting points, an effective method for user-based collaborative filtering method has been developed. This package uses an optimization algorithm (continuous genetic algorithm) to directly find the optimal similarities between active users (users for whom current recommendations are made) and others. First, by determining the nearest neighbor and their number, the number of genes in a chromosome is determined. Each gene represents the neighbor's similarity to the active user. By estimating the starting points of the genetic algorithm, it quickly converges to the optimal solutions. The positive point is the independence of the genetic algorithm on the number of data that for big data is an effective help in solving the problem.
This package provides a simulation-based tool made to help researchers to become familiar with multilevel variations, and to build up sampling designs for their study. This tool has two main objectives: First, it provides an educational tool useful for students, teachers and researchers who want to learn to use mixed-effects models. Users can experience how the mixed-effects model framework can be used to understand distinct biological phenomena by interactively exploring simulated multilevel data. Second, it offers research opportunities to those who are already familiar with mixed-effects models, as it enables the generation of data sets that users may download and use for a range of simulation-based statistical analyses such as power and sensitivity analysis of multilevel and multivariate data [Allegue, H., Araya-Ajoy, Y.G., Dingemanse, N.J., Dochtermann N.A., Garamszegi, L.Z., Nakagawa, S., Reale, D., Schielzeth, H. and Westneat, D.F. (2016) <doi: 10.1111/2041-210X.12659>].
Fit multiclass Classification version of Bayesian Adaptive Smoothing Splines (CBASS) to data using reversible jump MCMC. The multiclass classification problem consists of a response variable that takes on unordered categorical values with at least three levels, and a set of inputs for each response variable. The CBASS model consists of a latent multivariate probit formulation, and the means of the latent Gaussian random variables are specified using adaptive regression splines. The MCMC alternates updates of the latent Gaussian variables and the spline parameters. All the spline parameters (variables, signs, knots, number of interactions), including the number of basis functions used to model each latent mean, are inferred. Functions are provided to process inputs, initialize the chain, run the chain, and make predictions. Predictions are made on a probabilistic basis, where, for a given input, the probabilities of each categorical value are produced. See Marrs and Francom (2023) "Multiclass classification using Bayesian multivariate adaptive regression splines" Under review.
Joint and Individual Variation Explained (JIVE) is a method for decomposing multiple datasets obtained on the same subjects into shared structure, structure unique to each dataset, and noise. The two most common implementations are R.JIVE, an iterative approach, and AJIVE, which uses principal angle analysis. JIVE estimates subspaces but interpreting these subspaces can be challenging with AJIVE or R.JIVE. We expand upon insights into AJIVE as a canonical correlation analysis (CCA) of principal component scores. This reformulation, which we call CJIVE, 1) provides an ordering of joint components by the degree of correlation between corresponding canonical variables; 2) uses a computationally efficient permutation test for the number of joint components, which provides a p-value for each component; and 3) can be used to predict subject scores for out-of-sample observations. Please cite the following article when utilizing this package: Murden, R., Zhang, Z., Guo, Y., & Risk, B. (2022) <doi:10.3389/fnins.2022.969510>.
An R interface to FLINT <https://flintlib.org/>, a C library for number theory. FLINT extends GNU MPFR <https://www.mpfr.org/> and GNU MP <https://gmplib.org/> with support for operations on standard rings (the integers, the integers modulo n, finite fields, the rational, p-adic, real, and complex numbers) as well as matrices and polynomials over rings. FLINT implements midpoint-radius interval arithmetic, also known as ball arithmetic, in the real and complex numbers, enabling computation in arbitrary precision with rigorous propagation of rounding and other errors; see Johansson (2017) <doi:10.1109/TC.2017.2690633>. Finally, FLINT provides ball arithmetic implementations of many special mathematical functions, with high coverage of reference works such as the NIST Digital Library of Mathematical Functions <https://dlmf.nist.gov/>. The R interface defines S4 classes, generic functions, and methods for representation and basic operations as well as plain R functions mirroring and vectorizing entry points in the C library.
Toolbox for different kinds of spatio-temporal analyses to be performed on observed point patterns, following the growing stream of literature on point process theory. This R package implements functions to perform different kinds of analyses on point processes, proposed in the papers (Siino, Adelfio, and Mateu 2018<doi:10.1007/s00477-018-1579-0>; Siino et al. 2018<doi:10.1002/env.2463>; Adelfio et al. 2020<doi:10.1007/s00477-019-01748-1>; Dâ Angelo, Adelfio, and Mateu 2021<doi:10.1016/j.spasta.2021.100534>; Dâ Angelo, Adelfio, and Mateu 2022<doi:10.1007/s00362-022-01338-4>; Dâ Angelo, Adelfio, and Mateu 2023<doi:10.1016/j.csda.2022.107679>). The main topics include modeling, statistical inference, and simulation issues on spatio-temporal point processes on Euclidean space and linear networks. Version 1.0.0 has been updated for accompanying the journal publication D Angelo and Adelfio 2025 <doi:10.18637/jss.v113.i10>.
Due to a limited availability of observed high-resolution precipitation records with adequate length, simulations with stochastic precipitation models are used to generate series for subsequent studies [e.g. Khaliq and Cunmae, 1996, <doi:10.1016/0022-1694(95)02894-3>, Vandenberghe et al., 2011, <doi:10.1029/2009WR008388>]. This package contains an R implementation of the original Bartlett-Lewis rectangular pulse model (BLRPM), developed by Rodriguez-Iturbe et al. (1987) <doi:10.1098/rspa.1987.0039>. It contains a function for simulating a precipitation time series based on storms and cells generated by the model with given or estimated model parameters. Additionally BLRPM parameters can be estimated from a given or simulated precipitation time series. The model simulations can be plotted in a three-layer plot including an overview of generated storms and cells by the model (which can also be plotted individually), a continuous step-function and a discrete precipitation time series at a chosen aggregation level.
This package contains a mixture of statistical methods including the MCMC methods to analyze normal mixtures. Additionally, model based clustering methods are implemented to perform classification based on (multivariate) longitudinal (or otherwise correlated) data. The basis for such clustering is a mixture of multivariate generalized linear mixed models. The package is primarily related to the publications Komárek (2009, Comp. Stat. and Data Anal.) <doi:10.1016/j.csda.2009.05.006> and Komárek and Komárková (2014, J. of Stat. Soft.) <doi:10.18637/jss.v059.i12>. It also implements methods published in Komárek and Komárková (2013, Ann. of Appl. Stat.) <doi:10.1214/12-AOAS580>, Hughes, Komárek, Bonnett, Czanner, Garcà a-Fiñana (2017, Stat. in Med.) <doi:10.1002/sim.7397>, Jaspers, Komárek, Aerts (2018, Biom. J.) <doi:10.1002/bimj.201600253> and Hughes, Komárek, Czanner, Garcà a-Fiñana (2018, Stat. Meth. in Med. Res) <doi:10.1177/0962280216674496>.
Canonical correlation analysis (CCA) via reduced-rank regression with support for regularization and cross-validation. Several methods for estimating CCA in high-dimensional settings are implemented. The first set of methods, cca_rrr() (and variants: cca_group_rrr() and cca_graph_rrr()), assumes that one dataset is high-dimensional and the other is low-dimensional, while the second, ecca() (for Efficient CCA) assumes that both datasets are high-dimensional. For both methods, standard l1 regularization as well as group-lasso regularization are available. cca_graph_rrr further supports total variation regularization when there is a known graph structure among the variables of the high-dimensional dataset. In this case, the loadings of the canonical directions of the high-dimensional dataset are assumed to be smooth on the graph. For more details see Donnat and Tuzhilina (2024) <doi:10.48550/arXiv.2405.19539> and Wu, Tuzhilina and Donnat (2025) <doi:10.48550/arXiv.2507.11160>.