This package provides the function qqtest which incorporates uncertainty in its qqplot display(s) so that the user might have a better sense of the evidence against the specified distributional hypothesis. qqtest draws a quantile-quantile plot for visually assessing whether the data come from a test distribution that has been defined in one of many ways. The vertical axis plots the data quantiles, the horizontal axis those of a test distribution. The default behaviour generates 1000 samples from the test distribution and overlays the plot with shaded pointwise interval estimates for the ordered quantiles from the test distribution. A small number of independently generated exemplar quantile plots can also be overlaid. Both the interval estimates and the exemplars provide different comparative information to assess the evidence provided by the qqplot for or against the hypothesis that the data come from the test distribution (default is normal or gaussian). Finally, a visual test of significance (a lineup plot) can also be displayed to test the null hypothesis that the data come from the test distribution.
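A minimal sketch of such a comparison (only the data argument is used here; the test distribution defaults to gaussian):

    library(qqtest)
    set.seed(1)
    x <- rexp(50)   # deliberately non-normal sample
    qqtest(x)       # qqplot with shaded pointwise intervals simulated from the test distribution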
This package provides a leadership-inference framework for multivariate time series. The framework for multiple-faction-leadership inference from coordinated activities, or mFLICA, uses a notion of a leader as an individual who initiates collective patterns that everyone in a group follows. Given a set of time series of individual activities, the goal is to identify periods of coordinated activity, find factions of coordination if more than one exists, and identify the leader of each faction. For each time step, the framework infers following relations between individual time series, then identifies the leader of each faction: someone whom many individuals follow but who follows no one. A faction is defined as a group of individuals who all follow the same leader. mFLICA reports following relations, leaders of factions, and members of each faction for each time step. Please see Chainarong Amornbunchornvej and Tanya Berger-Wolf (2018) <doi:10.1137/1.9781611975321.62> for methodology and Chainarong Amornbunchornvej (2021) <doi:10.1016/j.softx.2021.100781> for software when referring to this package in publications.
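To make the notion of following concrete, a small base-R sketch (not using this package's functions) simulates a leader and a follower whose series is a time-lagged copy of the leader's; the cross-correlation peaks at a non-zero lag, revealing the following relation:

    set.seed(1)
    leader   <- cumsum(rnorm(200))                                               # leader initiates a movement pattern
    follower <- c(rep(leader[1], 10), head(leader, 190)) + rnorm(200, sd = 0.1)  # follows with a 10-step lag
    ccf(follower, leader, lag.max = 20)                                          # peak away from lag 0 indicates who follows whom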
This package provides a stand-alone function that generates a user-specified number of random datasets and computes their eigenvalues (i.e., it implements Horn's [1965, Psychometrika] parallel analysis <doi:10.1007/BF02289447>). Users then compare the resulting eigenvalues (the mean or a specified percentile) from the random datasets (i.e., eigenvalues resulting from noise) to the eigenvalues generated from the user's data. It can be used for both principal components analysis (PCA) and common/exploratory factor analysis (EFA). The output table shows how large eigenvalues can be merely as a result of randomly generated data. If the user's own dataset has actual eigenvalues greater than the corresponding random-data eigenvalues, that lends support to retaining that factor/component. In other words, if the i-th eigenvalue from the actual data is larger than the chosen percentile of the i-th eigenvalue generated from random data, there is empirical support to retain that factor/component. Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185.
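The core of Horn's comparison can be sketched in base R (illustrative only, not this package's interface):

    set.seed(123)
    obs_eig  <- eigen(cor(mtcars))$values                       # eigenvalues from the user's data
    rand_eig <- replicate(500, eigen(cor(matrix(rnorm(nrow(mtcars) * ncol(mtcars)), nrow(mtcars))))$values)
    crit     <- apply(rand_eig, 1, quantile, probs = 0.95)      # 95th-percentile eigenvalues from random data
    sum(obs_eig > crit)                                         # number of components with empirical support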
This package provides functionality to perform machine-learning-based modeling in a computation pipeline. Its functions cover the basic steps of machine-learning-based knowledge discovery workflows, including model training and optimization, model evaluation, and model testing. To perform these tasks, the package builds heavily on existing machine-learning packages, such as caret <https://github.com/topepo/caret/> and associated packages. The package can train multiple models, optimize model hyperparameters by performing a grid search or a random search, and evaluate model performance with different metrics. Models can be validated either on a test data set or, in the case of a small sample size, by k-fold cross-validation or repeated bootstrapping. It also allows for null-hypothesis generation by performing permutation experiments. Additionally, it offers methods for model interpretation and item categorization to identify the most informative features in a high-dimensional data space. The functions of this package can easily be integrated into computation pipelines (e.g. nextflow <https://www.nextflow.io/>) and thereby improve scalability, standardization, and reproducibility in the context of machine learning.
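As an illustration, the kind of caret-based grid search with k-fold cross-validation that this package builds on looks roughly like this (plain caret, not this package's own wrappers):

    library(caret)
    ctrl <- trainControl(method = "cv", number = 5)               # 5-fold cross-validation
    grid <- expand.grid(mtry = 1:3)                               # hyperparameter grid
    fit  <- train(Species ~ ., data = iris, method = "rf",
                  trControl = ctrl, tuneGrid = grid, metric = "Accuracy")
    fit$bestTune                                                  # best hyperparameter setting found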
This package implements multiple existing open-source algorithms for coding cause of death from verbal autopsies. The methods implemented include InterVA4 by Byass et al (2012) <doi:10.3402/gha.v5i0.19281>, InterVA5 by Byass et al (2019) <doi:10.1186/s12916-019-1333-6>, InSilicoVA by McCormick et al (2016) <doi:10.1080/01621459.2016.1152191>, NBC by Miasnikof et al (2015) <doi:10.1186/s12916-015-0521-2>, and a replication of the Tariff method by James et al (2011) <doi:10.1186/1478-7954-9-31> and Serina et al (2015) <doi:10.1186/s12916-015-0527-9>. It also provides tools for data manipulation tasks commonly used in verbal autopsy analysis and implements easy graphical visualization of individual and population level statistics. The NBC method is implemented by the nbc4va package, which can be installed from <https://github.com/rrwen/nbc4va>. Note that this package was not developed by authors affiliated with the Institute for Health Metrics and Evaluation and thus unintentional discrepancies may exist in the implementation of the Tariff method.
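A minimal sketch of coding causes of death with one of these methods; the codeVA() call, its arguments, and the RandomVA1 example data reflect my understanding of this package's interface and should be treated as assumptions:

    library(openVA)
    data(RandomVA1)                                  # simulated WHO2012-format example data (assumed)
    fit <- codeVA(RandomVA1, data.type = "WHO2012", model = "InterVA",
                  version = "4.03", HIV = "h", Malaria = "h")
    summary(fit)                                     # individual and population-level cause assignments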
This package provides an efficient and very flexible framework to conduct data-driven epidemiological modeling in realistic large-scale disease spread simulations. The framework integrates infection dynamics in subpopulations as continuous-time Markov chains using the Gillespie stochastic simulation algorithm and incorporates available data such as births, deaths and movements as scheduled events at predefined time-points. Using C code for the numerical solvers and OpenMP (if available) to divide work over multiple processors ensures high performance when simulating a sample outcome. One of our design goals was to make the package extendable and enable usage of the numerical solvers from other R extension packages in order to facilitate complex epidemiological research. The package contains template models and can be extended with user-defined models. For more details see the paper by Widgren, Bauer, Eriksson and Engblom (2019) <doi:10.18637/jss.v091.i12>. The package also provides functionality to fit models to time series data using the Approximate Bayesian Computation Sequential Monte Carlo ('ABC-SMC') algorithm of Toni and others (2009) <doi:10.1098/rsif.2008.0172>.
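A minimal sketch using a built-in SIR template model; the SIR() and run() calls follow my understanding of this package's interface and should be treated as assumptions:

    library(SimInf)
    u0 <- data.frame(S = rep(99, 5), I = rep(1, 5), R = rep(0, 5))    # five subpopulations
    model  <- SIR(u0 = u0, tspan = 1:100, beta = 0.16, gamma = 0.077)
    result <- run(model)                                              # one stochastic (Gillespie) outcome
    plot(result)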
This package provides a new diagram for the verification of vector variables (wind, current, etc.) generated by multiple models against a set of observations. It has been designed as a generalization of the Taylor diagram to two-dimensional quantities. It is based on the analysis of the two-dimensional structure of the mean squared error matrix between model and observations. The matrix is divided into the part corresponding to the relative rotation and the bias of the empirical orthogonal functions of the data. The full set of diagnostics produced by the analysis of the errors between model and observational vector datasets comprises the errors in the means, the analysis of the total variance of both datasets, the rotation matrix corresponding to the principal components in observation and model, the angle of rotation of model-derived empirical orthogonal functions with respect to the ones from observations, the standard deviation of model and observations, the root mean squared error between both datasets and the squared two-dimensional correlation coefficient. See the output of the function UVError() in this package.
This software package provides Cox survival analysis for high-dimensional and multiblock datasets. It encompasses a suite of functions ranging from classical Cox regression to more recent methods, including the Cox proportional hazards model, stepwise Cox regression, Elastic-Net Cox regression, Sparse Partial Least Squares Cox regression (sPLS-COX) incorporating three distinct strategies, and two Multiblock-PLS Cox regression (MB-sPLS-COX) methods. This tool is designed to adeptly handle high-dimensional data, and provides tools for cross-validation, plot generation, and additional resources for interpreting results. While references are available within the corresponding functions, key literature is mentioned below. Terry M Therneau (2024) <https://CRAN.R-project.org/package=survival>, Noah Simon et al. (2011) <doi:10.18637/jss.v039.i05>, Philippe Bastien et al. (2005) <doi:10.1016/j.csda.2004.02.005>, Philippe Bastien (2008) <doi:10.1016/j.chemolab.2007.09.009>, Philippe Bastien et al. (2014) <doi:10.1093/bioinformatics/btu660>, Kassu Mehari Beyene and Anouar El Ghouch (2020) <doi:10.1002/sim.8671>, Florian Rohart et al. (2017) <doi:10.1371/journal.pcbi.1005752>.
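As an illustration of the Elastic-Net Cox step cited above (using glmnet and survival directly, not this package's own wrappers):

    library(glmnet); library(survival)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)                   # 20 candidate predictors
    y <- Surv(rexp(100), rbinom(100, 1, 0.7))               # survival times and event indicators
    cvfit <- cv.glmnet(x, y, family = "cox", alpha = 0.5)   # cross-validated elastic-net Cox regression
    coef(cvfit, s = "lambda.min")                           # coefficients at the selected penalty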
When considering count data, it is often the case that many more zero counts are observed than would be expected under a given distribution. It is well established that such data can be reliably modelled using zero-inflated or hurdle distributions, both of which may be applied using the functions in this package. Bayesian analysis methods are used to best model problematic count data that cannot be fit to any typical distribution. The package functions are flexible and versatile: they can be applied to various count distributions, support parameter estimation with or without explanatory variable information, and allow for multiple hurdles, since it is also not uncommon for count data to have an abundance of large-number observations that would be considered outliers under the typical distribution. Instead of throwing out data or misspecifying the typical distribution, these extreme observations can be assigned to a second, extreme distribution. With the functions of this package, such a two-hurdle model may be easily specified in order to best manage data that are both zero-inflated and over-dispersed.
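To make the two-hurdle idea concrete, a base-R simulation (illustrative only, not using this package's functions) generates count data with both excess zeros and an extreme upper component:

    set.seed(42)
    n <- 1000
    comp   <- sample(c("zero", "typical", "extreme"), n, replace = TRUE, prob = c(0.30, 0.65, 0.05))
    counts <- ifelse(comp == "zero", 0,
              ifelse(comp == "typical", rpois(n, 4), rpois(n, 100)))
    c(prop_zero = mean(counts == 0), prop_large = mean(counts > 30))  # far more zeros and large counts than Poisson(4) implies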
The Ontario Marginalization Index is a socioeconomic model that is built on Statistics Canada census data. The model consists of four dimensions: in 2021, these dimensions were updated to "Material Resources" (previously called "Material Deprivation"), "Households and Dwellings" (previously called "Residential Instability"), "Age and Labour Force" (previously called "Dependency"), and "Racialized and Newcomer Populations" (previously called "Ethnic Concentration"). This update reflects a movement away from deficit-based language. 2021 data will load with these new dimension names, whereas 2011 and 2016 data will load with the historical dimension names. Each of these dimensions is imported for a variety of geographic levels (DA, CD, etc.) for the 2021, 2011 and 2016 administrations of the census. These data sets contribute to community analysis of equity with respect to Ontario's Anti-Racism Act. The Ontario Marginalization Index data is retrieved from the Public Health Ontario website: <https://www.publichealthontario.ca/en/data-and-analysis/health-equity/ontario-marginalization-index>. The shapefile data is retrieved from the Statistics Canada website: <https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-eng.cfm>.
This package provides a tool that "multiply imputes" missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements our bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches and can handle many more variables. Unlike Amelia I and other statistically rigorous imputation software, it virtually never crashes (but please let us know if you find to the contrary!). The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. Amelia II also includes useful diagnostics of the fit of multiple imputation models. The program works from the R command line or via a graphical user interface that does not require users to know R.
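A typical call on time-series-cross-sectional data looks like the following; the africa example data and the logs argument are recalled from Amelia's documentation and should be treated as assumptions:

    library(Amelia)
    data(africa)                                                     # country-year panel with missing values (assumed)
    a.out <- amelia(africa, m = 5, ts = "year", cs = "country", logs = "gdp_pc")
    summary(a.out)                                                   # five completed data sets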
Geospatial data computation is parallelized by grid, hierarchy, or raster files. Based on future (Bengtsson, 2024 <doi:10.32614/CRAN.package.future>) and mirai (Gao et al., 2025 <doi:10.32614/CRAN.package.mirai>) parallel back-ends, terra (Hijmans et al., 2025 <doi:10.32614/CRAN.package.terra>) and sf (Pebesma et al., 2024 <doi:10.32614/CRAN.package.sf>) functions as well as convenience functions in the package can be distributed over multiple threads. The simplest way of parallelizing generic geospatial computation is to start from the par_pad_*() functions and then use the par_grid(), par_hierarchy(), or par_multirasters() functions. Virtually any function accepting classes from the terra or sf packages can be used in the three parallelization functions. A common raster-vector overlay operation is provided as the function extract_at(), which uses exactextractr (Baston, 2023 <doi:10.32614/CRAN.package.exactextractr>), with options for kernel weights when summarizing raster values at vector geometries. Other convenience functions for vector-vector operations, including simple areal interpolation (summarize_aw()) and summation of exponentially decaying weights (summarize_sedc()), are also provided.
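The underlying raster-vector overlay that extract_at() wraps can be sketched with exactextractr directly (not this package's own function; the demo files ship with terra):

    library(terra); library(sf); library(exactextractr)
    r <- rast(system.file("ex/elev.tif", package = "terra"))             # demo elevation raster
    v <- st_as_sf(vect(system.file("ex/lux.shp", package = "terra")))    # demo polygons
    exact_extract(r, v, "mean")                                          # mean raster value per polygon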
This package is a computational tool to estimate the item-sum score's reliability (composite reliability, CR) in multidimensional scales with overlapping items. An item that measures more than one domain construct is called an overlapping item. The estimation is based on factor models allowing unlimited cross-factor loadings, such as exploratory structural equation modeling (ESEM) and Bayesian structural equation modeling (BSEM). The factor models include correlated-factor models and bi-factor models. Specifically for bi-factor models, a type of hierarchical factor model, the package estimates the CR of the hierarchical subscale/hierarchy and the CR of the subscale/scale total. The CR estimator Omega-generic was proposed by Mai, Srivastava, and Krull (2021) <https://whova.com/embedded/subsession/enars_202103/1450751/1452993/>. The current version can only handle continuous data. Yujiao Mai contributes to the algorithms, R programming, and application example. Deo Kumar Srivastava contributes to the algorithms and the application example. Kevin R. Krull contributes to the application example. The package OmegaG was sponsored by American Lebanese Syrian Associated Charities (ALSAC); however, the contents of OmegaG do not necessarily represent the policy of the ALSAC.
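For orientation, the composite reliability of a single factor with standardized loadings can be computed by hand (a generic omega calculation with illustrative numbers, not this package's Omega-generic estimator for overlapping items):

    lambda <- c(0.70, 0.60, 0.80, 0.75)                    # standardized factor loadings
    theta  <- 1 - lambda^2                                 # item unique variances
    CR <- sum(lambda)^2 / (sum(lambda)^2 + sum(theta))     # composite reliability (omega)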
Spatio-temporal data have become increasingly popular in many research fields. Such data often have complex structures that are difficult to describe and estimate. This package provides reliable tools for modeling complicated spatio-temporal data. It also includes tools of online process monitoring to detect possible change-points in a spatio-temporal process over time. More specifically, the package implements the spatio-temporal mean estimation procedure described in Yang and Qiu (2018) <doi:10.1002/sim.7622>, the spatio-temporal covariance estimation procedure discussed in Yang and Qiu (2019) <doi:10.1002/sim.8315>, the three-step method for the joint estimation of spatio-temporal mean and covariance functions suggested by Yang and Qiu (2022) <doi:10.1007/s10463-021-00787-2>, the spatio-temporal disease surveillance method discussed in Qiu and Yang (2021) <doi:10.1002/sim.9150> that can accommodate the covariate effect, the spatial-LASSO-based process monitoring method proposed by Qiu and Yang (2023) <doi:10.1080/00224065.2022.2081104>, and the online spatio-temporal disease surveillance method described in Yang and Qiu (2020) <doi:10.1080/24725854.2019.1696496>.
Easily export R graphs and statistical output to Microsoft Office / LibreOffice, LaTeX and HTML documents, using sensible defaults that result in publication-quality output with simple, straightforward commands. Output to Microsoft Office is in editable DrawingML vector format for graphs, and can use corporate template documents for styling. This enables the production of standardized reports and also allows for manual tidy-up of the layout of R graphs in PowerPoint before final publication. Export of graphs is flexible, and functions enable the currently showing R graph or the currently showing R stats object to be exported, but also allow the graphical or tabular output to be passed as objects. The package relies on the officer package for export to Office documents, and output files are also fully compatible with LibreOffice. Base R, ggplot2 and lattice plots are supported, as well as a wide variety of R stats objects, via wrappers to xtable(), broom::tidy() and stargazer(), including aov(), lm(), glm(), lme(), glmnet() and coxph(), as well as matrices and data frames, and many more.
Owing to the rich shapes of Generalised Lambda Distributions (GLDs), GLD standard/quantile/Accelerated Failure Time (AFT) regression is a competitive, flexible model compared to standard/quantile/AFT regression. The proposed method has some major advantages: 1) it provides a reference line which is very robust to outliers, with the attractive property of zero-mean residuals, and 2) it gives a unified, elegant quantile regression model from the reference line with smooth regression coefficients across different quantiles. For the AFT model, it also eliminates the need to try several different AFT models, owing to the flexible shapes of the GLD. The goodness of fit of the proposed model can be assessed via QQ plots, Kolmogorov-Smirnov tests and data-driven smooth tests, to ensure the appropriateness of the statistical inference under consideration. Statistical distributions of the coefficients of the GLD regression line are obtained using simulation, and interval estimates are obtained directly from simulated data. References include the following: Su (2015) "Flexible Parametric Quantile Regression Model" <doi:10.1007/s11222-014-9457-1>, Su (2021) "Flexible parametric accelerated failure time model" <doi:10.1080/10543406.2021.1934854>.
In a typical protein labelling procedure, proteins are chemically tagged with a functional group, usually at specific sites, then digested into peptides, which are then analyzed using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) to generate a peptide fingerprint. Relative to the control, peptides that are heavier by the mass of the labelling group are informative for sequence determination. Searching for peptides with such mass shifts, however, can be difficult. This package, designed to tackle this inconvenience, takes as input two or more MALDI-TOF MS mass lists, makes pairwise comparisons between the labelled groups and the control, and restores centroid mass spectra with highlighted peaks of interest for easier visual examination. In particular, peaks differing by the mass of the labelling group are defined as a 'pair', those with equal masses as a 'match', and all other peaks as a 'mismatch'. For more bioanalytical background information, refer to the following publications: Jingjing Deng (2015) <doi:10.1007/978-1-4939-2550-6_19>; Elizabeth Chang (2016) <doi:10.7171/jbt.16-2702-002>.
Forecasters predicting the chances of a future event may disagree due to differing evidence or noise. To harness the collective evidence of the crowd, Ville Satopää (2021) "Regularized Aggregation of One-off Probability Predictions" <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3769945> proposes a Bayesian aggregator that is regularized by analyzing the forecasters' disagreement and ascribing over-dispersion to noise. This aggregator requires no user intervention and can be computed efficiently even for a large number of predictions. The author evaluates the aggregator on subjective probability predictions collected during a four-year forecasting tournament sponsored by the US intelligence community. The aggregator improves the accuracy of simple averaging by around 20% and that of other state-of-the-art aggregators by 10-25%. The advantage stems almost exclusively from improved calibration. This aggregator -- known as "the revealed aggregator" -- takes as input a) the forecasters' probability predictions (p) of a future binary event and b) the forecasters' common prior (p0) of the future event. In this R package, the function sample_aggregator(p,p0,...) allows the user to calculate the revealed aggregator. Its use is illustrated with a simple example.
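A minimal sketch using the function named above, after loading this package; additional arguments are left at their defaults and the return format is an assumption:

    p   <- c(0.60, 0.75, 0.55, 0.80)       # four forecasters' probability predictions of the event
    p0  <- 0.50                            # their common prior
    agg <- sample_aggregator(p, p0)        # posterior draws of the revealed aggregator (assumed)
    mean(agg)                              # aggregated probability estimate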
Model selection algorithms for regression and classification, where the predictors can be continuous or categorical and the number of regressors may exceed the number of observations. The selected model consists of a subset of numerical regressors and partitions of levels of factors. Szymon Nowakowski, Piotr Pokarowski, Wojciech Rejchel and Agnieszka Sołtys, 2023. Improving Group Lasso for High-Dimensional Categorical Data. In: Computational Science - ICCS 2023. Lecture Notes in Computer Science, vol 14074, p. 455-470. Springer, Cham. <doi:10.1007/978-3-031-36021-3_47>. Aleksandra Maj-Kańska, Piotr Pokarowski and Agnieszka Prochenka, 2015. Delete or merge regressors for linear model selection. Electronic Journal of Statistics 9(2): 1749-1778. <doi:10.1214/15-EJS1050>. Piotr Pokarowski and Jan Mielniczuk, 2015. Combined l1 and greedy l0 penalized least squares for linear model selection. Journal of Machine Learning Research 16(29): 961-992. <https://www.jmlr.org/papers/volume16/pokarowski15a/pokarowski15a.pdf>. Piotr Pokarowski, Wojciech Rejchel, Agnieszka Sołtys, Michał Frej and Jan Mielniczuk, 2022. Improving Lasso for model selection and prediction. Scandinavian Journal of Statistics, 49(2): 831-863. <doi:10.1111/sjos.12546>.
This package provides functions to impute large gaps within multivariate time series based on Dynamic Time Warping methods. Gaps of size 1 or smaller than a defined threshold are filled using a simple average or a weighted moving average, respectively. Larger gaps are filled using the methodology provided by Phan et al. (2017) <DOI:10.1109/MLSP.2017.8168165>: a query is built immediately before/after a gap and a moving window is used to find the most similar sequence to this query using Dynamic Time Warping. To lower the calculation time, similar sequences are pre-selected using global features. Contrary to the univariate method (package DTWBI), these global features are not estimated over the sequence containing the gap(s); instead, a feature matrix is built to summarize general features of the whole multivariate signal. Once the most similar sequence to the query has been identified, the sequence adjacent to this window is used to fill the gap considered. This function can deal with multiple gaps over all the sequences composing the input multivariate signal. However, for better consistency, large gaps at the same location over all sequences should be avoided.
The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., gene). We present SEAGLE, a Scalable Exact AlGorithm for Large-scale Set-based GxE tests, to permit GxE VC tests scalable to biobank data. SEAGLE employs modern matrix computations to achieve the same 'exact' results as the original GxE VC tests, and does not impose additional assumptions nor rely on approximations. SEAGLE can easily accommodate sample sizes on the order of 10^5, is implementable on standard laptops, and does not require specialized equipment. The accompanying manuscript for this package can be found at Chi, Ipsen, Hsiao, Lin, Wang, Lee, Lu, and Tzeng (2021+) <arXiv:2105.03228>.
This package provides a comprehensive set of geostatistical, visual, and analytical methods, together with an expanded version of the acclaimed J.E. Klovan mining dataset. This makes the package an excellent learning resource for Principal Component Analysis (PCA), Factor Analysis (FA), kriging, and other geostatistical techniques. Originally published in the 1976 book 'Geological Factor Analysis', the included mining dataset was assembled by Professor J. E. Klovan of the University of Calgary. Being one of the first applications of FA in the geosciences, this dataset has significant historical importance. As a well-regarded and published dataset, it is an excellent resource for demonstrating the capabilities of PCA, FA, kriging, and other geostatistical techniques in the geosciences. For those interested in these methods, the klovan datasets provide a valuable and illustrative resource. Note that some methods require the RGeostats package. Please refer to the README or Additional_repositories for installation instructions. This material is based upon research in the Materials Data Science for Stockpile Stewardship Center of Excellence (MDS3-COE), and supported by the Department of Energy's National Nuclear Security Administration under Award Number DE-NA0004104.
This package performs Bayesian posterior inference for deep Gaussian processes following Sauer, Gramacy, and Higdon (2023, <doi:10.48550/arXiv.2012.08015>). See Sauer (2023, <http://hdl.handle.net/10919/114845>) for comprehensive methodological details and <https://bitbucket.org/gramacylab/deepgp-ex/> for a variety of coding examples. Models are trained through MCMC including elliptical slice sampling of latent Gaussian layers and Metropolis-Hastings sampling of kernel hyperparameters. Vecchia approximation for faster computation is implemented following Sauer, Cooper, and Gramacy (2023, <doi:10.48550/arXiv.2204.02904>). Optional monotonic warpings are implemented following Barnett et al. (2024, <doi:10.48550/arXiv.2408.01540>). Downstream tasks include sequential design through active learning Cohn/integrated mean squared error (ALC/IMSE; Sauer, Gramacy, and Higdon, 2023), optimization through expected improvement (EI; Gramacy, Sauer, and Wycoff, 2022 <doi:10.48550/arXiv.2112.07457>), and contour location through entropy (Booth, Renganathan, and Gramacy, 2024 <doi:10.48550/arXiv.2308.04420>). Models extend up to three layers deep; a one-layer model is equivalent to typical Gaussian process regression. Incorporates OpenMP and SNOW parallelization and utilizes C/C++ under the hood.
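A minimal sketch of a two-layer fit; the fit_two_layer(), trim(), and predict() calls follow my understanding of this package's interface and should be treated as assumptions:

    library(deepgp)
    set.seed(1)
    x <- matrix(seq(0, 1, length.out = 30))
    y <- sin(8 * x[, 1]) + rnorm(30, sd = 0.05)
    fit  <- fit_two_layer(x, y, nmcmc = 2000)                  # MCMC with elliptical slice sampling of the latent layer
    fit  <- trim(fit, burn = 1000, thin = 2)                   # drop burn-in and thin the chains
    pred <- predict(fit, matrix(seq(0, 1, length.out = 100)))  # posterior predictive mean and variance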
The Dynamic Time Warping (DTW) distance measure for time series allows non-linear alignments of time series to match similar patterns in time series of different lengths and/or different speeds. IncDTW is characterized by (1) the incremental calculation of DTW (reducing the runtime complexity to a linear level for updating the DTW distance), especially for live data streams or subsequence matching, (2) a vector-based implementation of DTW which is faster because no matrices are allocated (reducing the space complexity from a quadratic to a linear level in the number of observations) for all runtime-intensive DTW computations, (3) the subsequence matching algorithm runDTW, which efficiently finds the k-NN to a query pattern in a long time series, and (4) C++ at its core. For details about DTW see the original paper "Dynamic programming algorithm optimization for spoken word recognition" by Sakoe and Chiba (1978) <DOI:10.1109/TASSP.1978.1163055>. For details about this package, Dynamic Time Warping and Incremental Dynamic Time Warping please see "IncDTW: An R Package for Incremental Calculation of Dynamic Time Warping" by Leodolter et al. (2021) <doi:10.18637/jss.v099.i09>.
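A minimal sketch of the basic global DTW computation; the dtw() call and the returned distance element reflect my understanding of this package's interface and should be treated as assumptions:

    library(IncDTW)
    set.seed(1)
    Q <- cumsum(rnorm(100))   # query time series
    C <- cumsum(rnorm(120))   # longer candidate time series
    res <- dtw(Q, C)          # global DTW alignment
    res$distance              # accumulated DTW distance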