This software was developed to meet the need for statistical power calculations in stepped wedge cluster randomized trials. Users can specify different parameters for different scenarios, including: cross-sectional and cohort designs, binary and continuous outcomes, marginal (GEE) and conditional (mixed effects) models, three link functions (identity, log, logit), and models with and without time effects (the default specification assumes no time effect) under exchangeable, nested exchangeable and block exchangeable correlation structures. Unequal numbers of clusters per sequence are also allowed. The methods included in this package are described in Zhou et al. (2020) <doi:10.1093/biostatistics/kxy031> and Li et al. (2018) <doi:10.1111/biom.12918>. Supplementary documents can be found at: <https://ysph.yale.edu/cmips/research/software/study-design-power-calculation/swdpwr/>. The Shiny app for swdpwr can be accessed at: <https://jiachenchen322.shinyapps.io/swdpwr_shinyapp/>. The package also includes functions that calculate the intra-cluster correlation coefficients from the random effects variances supplied as inputs, for continuous and binary outcomes respectively.
Cancer genomes contain large numbers of somatic alterations, but few genes drive tumor development. Identifying cancer driver genes is critical for precision oncology. Most current approaches identify driver genes either from mutational recurrence or from estimated scores predicting the functional consequences of mutations. driveR is a tool for personalized or batch analysis of genomic data that prioritizes driver genes by combining genomic information with prior biological knowledge. As features, driveR uses coding impact metaprediction scores, non-coding impact scores, somatic copy number alteration scores, hotspot gene/double-hit gene condition, phenolyzer gene scores and membership in cancer-related KEGG pathways. It uses these features to estimate each gene's cancer-type-specific probability of being a cancer driver, using the relevant task of a multi-task learning classification model. The method is described in detail in Ulgen E, Sezerman OU. 2021. driveR: a novel method for prioritizing cancer driver genes using somatic genomics data. BMC Bioinformatics <doi:10.1186/s12859-021-04203-7>.
This package provides the setup and calculations needed to run a likelihood-based continual reassessment method (CRM) dose-finding trial and performs simulations to assess design performance under various scenarios. Three dose-finding designs are included: the ordinal proportional odds model (POM) CRM, the ordinal continuation ratio (CR) model CRM, and the binary two-parameter logistic model CRM. The functions allow customization of design characteristics, including sample size, cohort sizes, target dose-limiting toxicity (DLT) rates, discrete or continuous dose levels, combining ordinal grades 0 and 1 into one category, and the incorporation of safety and/or stopping rules. For POM and CR model designs, ordinal toxicity grades are specified by the Common Terminology Criteria for Adverse Events (CTCAE) version 4.0. The function pseudodata creates the necessary starting models for these three designs, and the function nextdose estimates the next dose to test in a cohort of patients for a target DLT rate. The function crmsimulations assesses the performance of these three dose-finding designs under various scenarios.
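As a hypothetical sketch of this workflow (the function names come from the description above, but the argument names are assumptions for illustration, not the package's documented signature):

  ## load the package providing pseudodata(), nextdose() and crmsimulations()
  start <- pseudodata(design = "POM", targetDLT = 0.30)   # starting model for a POM CRM design
  nextdose(start, targetDLT = 0.30)                       # recommended dose for the next cohort
  crmsimulations(design = "POM", targetDLT = 0.30,
                 samplesize = 30, cohortsize = 3)         # operating characteristics by simulation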
Choice models are a widely used technique across numerous scientific disciplines. The Apollo package is a very flexible tool for the estimation and application of choice models in R. Users are able to write their own model functions or use a mix of already available ones. Random heterogeneity, both continuous and discrete and at the level of individuals and choices, can be incorporated for all models. There is support for both standalone models and hybrid model structures. Both classical and Bayesian estimation are available, and multiple discrete-continuous models are covered in addition to discrete choice. Multi-threaded processing is supported for estimation, and a large number of pre- and post-estimation routines, including routines for computing posterior (individual-level) distributions, are available. For examples, a manual, and a support forum, visit <https://www.ApolloChoiceModelling.com>. For an overview of the field of choice models, see Train, K. (2009) <isbn:978-0-521-74738-7> and Hess, S. & Daly, A.J. (2014) <isbn:978-1-781-00314-5>.
Estimation of treatment hierarchies in network meta-analysis using a novel frequentist approach based on treatment choice criteria (TCC) and probabilistic ranking models, as described by Evrenoglou et al. (2024) <DOI:10.48550/arXiv.2406.10612>. The TCC are defined using a rule based on the smallest worthwhile difference (SWD). Using the defined TCC, the NMA estimates (i.e., treatment effects and standard errors) are first transformed into treatment preferences, indicating either a preference (e.g., treatment A > treatment B) or a tie (treatment A = treatment B). These treatment preferences are then synthesized using a probabilistic ranking model, which estimates the latent ability parameter of each treatment and produces the final treatment hierarchy. This parameter represents each treatment's ability to outperform all other competing treatments in the network. Here, "ability to outperform" indicates the propensity of a treatment to yield clinically important and beneficial effects when compared to all other treatments in the network. Consequently, larger ability estimates indicate higher positions in the ranking list.
This package provides tools for fitting and simulating mixtures of Watson distributions. The package is described in Sablica, Hornik and Leydold (2026) <doi:10.18637/jss.v115.i04>. The random sampling scheme of the package offers two sampling algorithms based on the results of Sablica, Hornik and Leydold (2022) <doi:10.1080/10618600.2024.2416521>. Moreover, the package offers a smart tool to combine these two methods: based on the selected parameters, it approximates the relative sampling speed of both methods and picks the faster one. In addition, the package offers a fitting function for mixtures of Watson distributions that uses the expectation-maximization (EM) algorithm. Special features are the possibility to use multiple variants of the E-step and M-step, sparse matrices for the data representation, and state-of-the-art methods for numerical evaluation of the needed special functions using the results of Sablica and Hornik (2022) <doi:10.1090/mcom/3690> and Sablica and Hornik (2024) <doi:10.1016/j.jmaa.2024.128262>.
Population-averaged models have been increasingly used in the design and analysis of cluster randomized trials (CRTs). To facilitate the application of population-averaged models in CRTs, the package implements the generalized estimating equations (GEE) and matrix-adjusted estimating equations (MAEE) approaches to jointly estimate the marginal mean and correlation models, both for general CRTs and stepped wedge CRTs. Beyond the general GEE/MAEE approach, the package also implements a fast cluster-period GEE method by Li et al. (2022) <doi:10.1093/biostatistics/kxaa056>, developed specifically for stepped wedge CRTs with large and variable cluster-period sizes, which gives a simple and efficient estimating equations approach based on the cluster-period means to estimate the intervention effects as well as the correlation parameters. In addition, the package provides functions for generating correlated binary data with a specified mean vector and correlation matrix based on the multivariate probit method in Emrich and Piedmonte (1991) <doi:10.1080/00031305.1991.10475828> or the conditional linear family method in Qaqish (2003) <doi:10.1093/biomet/90.2.455>.
This package performs genetic association analyses of case-parent triad (trio) data with multiple markers. It can also incorporate complete or incomplete control triads, for instance independent control children. Estimation is based on haplotypes, for instance SNP haplotypes, even though phase is not known from the genetic data. Haplin estimates the relative risk (RR + conf.int.) and p-value associated with each haplotype. It uses maximum likelihood estimation to make optimal use of data from triads with missing genotypic data, for instance if some SNPs have not been typed for some individuals. Haplin also allows estimation of effects of maternal haplotypes and parent-of-origin effects, which is particularly appropriate in perinatal epidemiology. Haplin allows special models, like X-inactivation, to be fitted on the X chromosome. A GxE analysis allows testing of interactions between environment and all estimated genetic effects. The models were originally described in "Gjessing HK and Lie RT. Case-parent triads: Estimating single- and double-dose effects of fetal and maternal disease gene haplotypes. Annals of Human Genetics (2006) 70, pp. 382-396".
This package provides the function qqtest, which incorporates uncertainty in its qqplot display(s) so that the user might have a better sense of the evidence against the specified distributional hypothesis. qqtest draws a quantile-quantile plot for visually assessing whether the data come from a test distribution that can be defined in one of many ways. The vertical axis plots the data quantiles, the horizontal axis those of a test distribution. The default behaviour generates 1000 samples from the test distribution and overlays the plot with shaded pointwise interval estimates for the ordered quantiles from the test distribution. A small number of independently generated exemplar quantile plots can also be overlaid. Both the interval estimates and the exemplars provide different comparative information to assess the evidence provided by the qqplot for or against the hypothesis that the data come from the test distribution (the default is normal or gaussian). Finally, a visual test of significance (a lineup plot) can also be displayed to test the null hypothesis that the data come from the test distribution.
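For example, a basic call against the default gaussian test distribution might look like this (a minimal sketch; see the package documentation for the many optional arguments):

  library(qqtest)
  set.seed(314)
  x <- rnorm(50)   # sample data
  qqtest(x)        # qqplot with shaded pointwise intervals built from 1000 simulated samples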
This package provides a leadership-inference framework for multivariate time series. The framework for multiple-faction-leadership inference from coordinated activities, or mFLICA, uses a notion of a leader as an individual who initiates collective patterns that everyone in a group follows. Given a set of time series of individual activities, the goal is to identify periods of coordinated activity, find factions of coordination if more than one exists, and identify the leader of each faction. For each time step, the framework infers following relations between individual time series, then identifies the leader of each faction as the individual whom many individuals follow but who follows no one. A faction is defined as a group of individuals who all follow the same leader. mFLICA reports following relations, leaders of factions, and members of each faction for each time step. Please see Chainarong Amornbunchornvej and Tanya Berger-Wolf (2018) <doi:10.1137/1.9781611975321.62> for the methodology and Chainarong Amornbunchornvej (2021) <doi:10.1016/j.softx.2021.100781> for the software when referring to this package in publications.
Relative transcript abundance has proven to be a valuable tool for understanding the function of genes in biological systems. For the differential analysis of transcript abundance using RNA sequencing data, the negative binomial model is by far the most frequently adopted. However, common methods based on a negative binomial model are not robust to extreme outliers, which we found to be abundant in public datasets. So far, no rigorous and probabilistic methods for the detection of outliers have been developed for RNA sequencing data, leaving the identification mostly to visual inspection. Recent advances in Bayesian computation allow large-scale comparison of observed data against its theoretical distribution given a statistical model. Here we propose ppcseq, a key quality-control tool for identifying, in differential expression analyses, transcripts that include outlier data points which do not follow a negative binomial distribution. Applying ppcseq to analyse several publicly available datasets using popular tools, we show that from 3 to 10 percent of differentially abundant transcripts across algorithms and datasets had statistics inflated by the presence of outliers.
This package provides a stand-alone function that generates a user-specified number of random datasets and computes their eigenvalues, i.e., it implements Horn's (1965, Psychometrika) parallel analysis <doi:10.1007/BF02289447>. Users then compare the resulting eigenvalues (the mean or a specified percentile) from the random datasets (i.e., eigenvalues resulting from noise) to the eigenvalues generated from the user's own data. It can be used for both principal components analysis (PCA) and common/exploratory factor analysis (EFA). The output table shows how large eigenvalues can be merely as a result of using randomly generated datasets. If the user's own dataset has actual eigenvalues greater than the corresponding random-data eigenvalues, that lends support to retaining that factor/component. In other words, if the i-th eigenvalue from the actual data is larger than the chosen percentile of the i-th eigenvalue generated from random data, empirical support is provided to retain that factor/component. Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185.
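As an illustration, a call could look like the following sketch, assuming a hornpa-style interface (the function and argument names are assumptions, since the description does not name them):

  library(hornpa)
  ## 500 random datasets with 15 variables and 300 observations each;
  ## compare observed eigenvalues against the mean and 95th percentile of these
  hornpa(k = 15, size = 300, reps = 500, seed = 123)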
This package provides a system for calculating the optimal sampling effort, based on the ideas of "ecological cost-benefit optimization" as developed by A. Underwood (1997, ISBN 0 521 55696 1). Data are obtained from simulated ecological communities, and the optimization proceeds through the following functions: (1) prep_data() takes the original dataset, formats and arranges it, and creates simulated sets that can be used as a basis for estimating statistical power and type II error. (2) sim_beta() estimates the statistical power for the different sampling efforts specified by the user. (3) sim_cbo() then calculates the optimal sampling effort, based on the statistical power and the sampling costs. Additionally, (4) scompvar() calculates the variation components necessary for (5) Underwood_cbo() to calculate the optimal combination of the number of sites and samples depending on either an economic budget or a desired statistical accuracy. Lastly, (6) plot_power() helps the user visualize the results of sim_beta().
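A compact sketch of this pipeline, using the function names above with placeholder arguments (the exact signatures may differ from what is shown):

  library(ecocbo)   # assumed package name
  ## `pilot_data` stands for a community matrix from a pilot study (placeholder)
  sims  <- prep_data(pilot_data)   # simulated datasets derived from the pilot data
  power <- sim_beta(sims)          # statistical power per candidate sampling effort
  sim_cbo(power)                   # cost-benefit optimal sampling effort
  comps <- scompvar(sims)          # variation components for Underwood_cbo()
  plot_power(power)                # visualize the sim_beta() results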
This package implements multiple existing open-source algorithms for coding cause of death from verbal autopsies. The methods implemented include InterVA4 by Byass et al. (2012) <doi:10.3402/gha.v5i0.19281>, InterVA5 by Byass et al. (2019) <doi:10.1186/s12916-019-1333-6>, InSilicoVA by McCormick et al. (2016) <doi:10.1080/01621459.2016.1152191>, NBC by Miasnikof et al. (2015) <doi:10.1186/s12916-015-0521-2>, and a replication of the Tariff method by James et al. (2011) <doi:10.1186/1478-7954-9-31> and Serina et al. (2015) <doi:10.1186/s12916-015-0527-9>. It also provides tools for data manipulation tasks commonly used in verbal autopsy analysis and implements easy graphical visualization of individual- and population-level statistics. The NBC method is implemented by the nbc4va package, which can be installed from <https://github.com/rrwen/nbc4va>. Note that this package was not developed by authors affiliated with the Institute for Health Metrics and Evaluation, and thus unintentional discrepancies may exist in the implementation of the Tariff method.
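For instance, the different algorithms are typically dispatched through a single wrapper; a sketch assuming the codeVA() interface (check the package documentation for the exact arguments and required data format):

  library(openVA)
  ## `va_data` stands for a data frame of verbal autopsy records in WHO 2016 format (placeholder)
  fit <- codeVA(data = va_data, data.type = "WHO2016", model = "InterVA")
  summary(fit)   # individual- and population-level cause-of-death summaries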
This package provides a new diagram for the verification of vector variables (wind, current, etc.) generated by multiple models against a set of observations. It has been designed as a generalization of the Taylor diagram to two-dimensional quantities. It is based on the analysis of the two-dimensional structure of the mean squared error matrix between model and observations. The matrix is divided into the part corresponding to the relative rotation and the bias of the empirical orthogonal functions of the data. The full set of diagnostics produced by the analysis of the errors between model and observational vector datasets comprises the errors in the means, the analysis of the total variance of both datasets, the rotation matrix corresponding to the principal components in observations and model, the angle of rotation of the model-derived empirical orthogonal functions with respect to those from the observations, the standard deviations of model and observations, the root mean squared error between both datasets, and the squared two-dimensional correlation coefficient. See the output of the function UVError() in this package.
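A minimal sketch of invoking the diagnostic function named above (the input object names are placeholders and the exact argument names may differ):

  ## `obs_uv` and `mod_uv` stand for data structures holding the observed and modelled
  ## vector components (e.g. wind u and v); load the package providing UVError() first
  diagnostics <- UVError(obs_uv, mod_uv)
  diagnostics   # errors in the means, rotation angle, standard deviations, RMSE, etc.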
This software package provides Cox survival analysis for high-dimensional and multiblock datasets. It encompasses a suite of functions ranging from classical Cox regression to the newest analyses, including the Cox proportional hazards model, stepwise Cox regression, Elastic-Net Cox regression, sparse partial least squares Cox regression (sPLS-COX) incorporating three distinct strategies, and two multiblock-PLS Cox regression (MB-sPLS-COX) methods. The tool is designed to adeptly handle high-dimensional data and provides tools for cross-validation, plot generation, and additional resources for interpreting results. While references are available within the corresponding functions, key literature is mentioned below. Terry M Therneau (2024) <https://CRAN.R-project.org/package=survival>, Noah Simon et al. (2011) <doi:10.18637/jss.v039.i05>, Philippe Bastien et al. (2005) <doi:10.1016/j.csda.2004.02.005>, Philippe Bastien (2008) <doi:10.1016/j.chemolab.2007.09.009>, Philippe Bastien et al. (2014) <doi:10.1093/bioinformatics/btu660>, Kassu Mehari Beyene and Anouar El Ghouch (2020) <doi:10.1002/sim.8671>, Florian Rohart et al. (2017) <doi:10.1371/journal.pcbi.1005752>.
When considering count data, it is often the case that many more zero counts are observed than would be expected under a given distribution. It is well established that such data can be reliably modelled using zero-inflated or hurdle distributions, both of which may be applied using the functions in this package. Bayesian analysis methods are used to best model problematic count data that cannot be fit by any typical distribution. The package functions are flexible and versatile: they can be applied to varying count distributions and to parameter estimation with or without explanatory variable information, and they allow for multiple hurdles. This matters because it is also not uncommon for count data to have an abundance of large-valued observations that would be considered outliers of the typical distribution. Rather than discarding data or misspecifying the typical distribution, these extreme observations can be assigned to a second, extreme distribution. With the functions in this package, such a two-hurdle model may be easily specified in order to best manage data that are both zero-inflated and over-dispersed.
The Ontario Marginalization Index is a socioeconomic model that is built on Statistics Canada census data. The model consists of four dimensions. In 2021, these dimensions were renamed to "Material Resources" (previously called "Material Deprivation"), "Households and Dwellings" (previously called "Residential Instability"), "Age and Labour Force" (previously called "Dependency"), and "Racialized and Newcomer Populations" (previously called "Ethnic Concentration"). This update reflects a movement away from deficit-based language. 2021 data will load with these new dimension names, whereas 2011 and 2016 data will load with the historical dimension names. Each of these dimensions is imported for a variety of geographic levels (DA, CD, etc.) for the 2021, 2011 and 2016 administrations of the census. These data sets contribute to community analysis of equity with respect to Ontario's Anti-Racism Act. The Ontario Marginalization Index data is retrieved from the Public Health Ontario website: <https://www.publichealthontario.ca/en/data-and-analysis/health-equity/ontario-marginalization-index>. The shapefile data is retrieved from the Statistics Canada website: <https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-eng.cfm>.
This package provides a tool that "multiply imputes" missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements our bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches, and can handle many more variables. Unlike Amelia I and other statistically rigorous imputation software, it virtually never crashes (but please let us know if you find otherwise!). The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. Amelia II also includes useful diagnostics of the fit of multiple imputation models. The program works from the R command line or via a graphical user interface that does not require users to know R.
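For example, using the africa time-series-cross-sectional dataset shipped with the package (a minimal sketch of a typical call):

  library(Amelia)
  data(africa)
  ## five imputed datasets, with `year` as the time index and `country` as the cross-sectional unit
  a.out <- amelia(africa, m = 5, ts = "year", cs = "country")
  summary(a.out)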
Geospatial data computation is parallelized by grid, hierarchy, or raster files. Based on the future (Bengtsson, 2024 <doi:10.32614/CRAN.package.future>) and mirai (Gao et al., 2025 <doi:10.32614/CRAN.package.mirai>) parallel back-ends, terra (Hijmans et al., 2025 <doi:10.32614/CRAN.package.terra>) and sf (Pebesma et al., 2024 <doi:10.32614/CRAN.package.sf>) functions, as well as convenience functions in the package, can be distributed over multiple threads. The simplest way of parallelizing generic geospatial computation is to start with the par_pad_*() functions and then move to the par_grid(), par_hierarchy(), or par_multirasters() functions. Virtually any function accepting classes from the terra or sf packages can be used in the three parallelization functions. A common raster-vector overlay operation is provided as the function extract_at(), which uses exactextractr (Baston, 2023 <doi:10.32614/CRAN.package.exactextractr>), with options for kernel weights for summarizing raster values at vector geometries. Other convenience functions for vector-vector operations, including simple areal interpolation (summarize_aw()) and summation of exponentially decaying weights (summarize_sedc()), are also provided.
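A rough sketch of the grid-based workflow (the inputs are placeholders and the argument names shown are assumptions; consult the package documentation for the exact signatures):

  library(chopin)
  ## `points` stands for an sf object of locations and "elev.tif" for a raster file path (placeholders)
  grids <- par_pad_grid(points, nx = 4L, ny = 4L, padding = 1000)     # padded computation grids
  par_grid(grids, fun_dist = extract_at, x = "elev.tif", y = points)  # overlay run per grid in parallel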
This package is a tool to estimate the item-sum score's reliability (composite reliability, CR) in multidimensional scales with overlapping items. An item that measures more than one domain construct is called an overlapping item. The estimation is based on factor models allowing unlimited cross-factor loadings, such as exploratory structural equation modeling (ESEM) and Bayesian structural equation modeling (BSEM). The factor models include correlated-factor models and bi-factor models. Specifically for bi-factor models, a type of hierarchical factor model, the package estimates the CR of the hierarchical subscale/hierarchy and the CR of the subscale/scale total. The CR estimator Omega-generic was proposed by Mai, Srivastava, and Krull (2021) <https://whova.com/embedded/subsession/enars_202103/1450751/1452993/>. The current version can only handle continuous data. Yujiao Mai contributed to the algorithms, R programming, and the application example. Deo Kumar Srivastava contributed to the algorithms and the application example. Kevin R. Krull contributed to the application example. The package OmegaG was sponsored by the American Lebanese Syrian Associated Charities (ALSAC). However, the contents of OmegaG do not necessarily represent the policy of ALSAC.
Spatio-temporal data have become increasingly popular in many research fields. Such data often have complex structures that are difficult to describe and estimate. This package provides reliable tools for modeling complicated spatio-temporal data. It also includes tools for online process monitoring to detect possible change-points in a spatio-temporal process over time. More specifically, the package implements the spatio-temporal mean estimation procedure described in Yang and Qiu (2018) <doi:10.1002/sim.7622>, the spatio-temporal covariance estimation procedure discussed in Yang and Qiu (2019) <doi:10.1002/sim.8315>, the three-step method for the joint estimation of spatio-temporal mean and covariance functions suggested by Yang and Qiu (2022) <doi:10.1007/s10463-021-00787-2>, the spatio-temporal disease surveillance method discussed in Qiu and Yang (2021) <doi:10.1002/sim.9150> that can accommodate covariate effects, the spatial-LASSO-based process monitoring method proposed by Qiu and Yang (2023) <doi:10.1080/00224065.2022.2081104>, and the online spatio-temporal disease surveillance method described in Yang and Qiu (2020) <doi:10.1080/24725854.2019.1696496>.
Easily export R graphs and statistical output to Microsoft Office/LibreOffice, LaTeX and HTML documents, using sensible defaults that result in publication-quality output with simple, straightforward commands. Output to Microsoft Office is in editable DrawingML vector format for graphs, and can use corporate template documents for styling. This enables the production of standardized reports and also allows for manual tidy-up of the layout of R graphs in PowerPoint before final publication. Export of graphs is flexible: functions can export the currently showing R graph or the currently showing R stats object, but also accept graphical or tabular output passed as objects. The package relies on the officer package for export to Office documents, and output files are also fully compatible with LibreOffice. Base R, ggplot2 and lattice plots are supported, as well as a wide variety of R stats objects, via wrappers to xtable(), broom::tidy() and stargazer(), including aov(), lm(), glm(), lme(), glmnet() and coxph(), as well as matrices and data frames, and many more.
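For example (a brief sketch assuming the package's graph2ppt() and table2doc() helpers; file names are placeholders):

  library(export)
  library(ggplot2)
  print(ggplot(mtcars, aes(wt, mpg)) + geom_point())   # make this the currently showing graph
  graph2ppt(file = "scatter.pptx")                     # export it as editable DrawingML vector graphics
  fit <- lm(mpg ~ wt, data = mtcars)
  table2doc(fit, file = "model.docx")                  # export the model summary to a Word document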
Owing to the rich shapes of Generalised Lambda Distributions (GLDs), GLD standard/quantile/Accelerated Failure Time (AFT) regression is a competitive, flexible model compared to standard/quantile/AFT regression. The proposed method has some major advantages: 1) it provides a reference line that is very robust to outliers, with the attractive property of zero-mean residuals, and 2) it gives a unified, elegant quantile regression model from the reference line with smooth regression coefficients across different quantiles. For the AFT model, it also eliminates the need to try several different AFT models, owing to the flexible shapes of the GLD. The goodness of fit of the proposed model can be assessed via QQ plots, Kolmogorov-Smirnov tests and a data-driven smooth test, to ensure the appropriateness of the statistical inference under consideration. Statistical distributions of the coefficients of the GLD regression line are obtained using simulation, and interval estimates are obtained directly from the simulated data. References include the following: Su (2015) "Flexible Parametric Quantile Regression Model" <doi:10.1007/s11222-014-9457-1>, Su (2021) "Flexible parametric accelerated failure time model" <doi:10.1080/10543406.2021.1934854>.