Cancer genomes contain large numbers of somatic alterations, but few genes drive tumor development. Identifying cancer driver genes is critical for precision oncology. Most current approaches identify driver genes either from mutational recurrence or from estimated scores predicting the functional consequences of mutations. driveR is a tool for personalized or batch analysis of genomic data for driver gene prioritization, combining genomic information with prior biological knowledge. As features, driveR uses coding impact metaprediction scores, non-coding impact scores, somatic copy number alteration scores, hotspot gene/double-hit gene condition, phenolyzer gene scores and memberships to cancer-related KEGG pathways. It uses these features to estimate a cancer-type-specific probability of being a cancer driver for each gene, using the related task of a multi-task learning classification model. The method is described in detail in Ulgen E, Sezerman OU. 2021. driveR: a novel method for prioritizing cancer driver genes using somatic genomics data. BMC Bioinformatics <doi:10.1186/s12859-021-04203-7>.
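A minimal sketch of a driveR session (prioritize_driver_genes() matches the package's documented workflow to the best of our knowledge; the input object and argument names are assumptions):

    library(driveR)
    # 'features_df' stands for a hypothetical per-gene feature data frame,
    # as produced by the package's feature-building step.
    res <- prioritize_driver_genes(features_df = features_df,
                                   cancer_type = "LUAD")  # assumed argument names
    head(res)  # genes with estimated driver probabilities, highest first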
An implementation of several Hierarchical Ensemble Methods (HEMs) for Directed Acyclic Graphs (DAGs). The HEMDAG package: 1) reconciles flat predictions with the topology of the ontology; 2) can enhance the predictions of virtually any flat learning method by taking into account the hierarchical relationships between ontology classes; 3) provides biologically meaningful predictions that always obey the true-path rule, the biological and logical rule that governs the internal coherence of biomedical ontologies; 4) is specifically designed for exploiting the hierarchical relationships of DAG-structured taxonomies, such as the Human Phenotype Ontology (HPO) or the Gene Ontology (GO), but can be safely applied to tree-structured taxonomies as well (such as FunCat), since trees are DAGs; 5) scales nicely both in the complexity of the taxonomy and in the cardinality of the examples; 6) provides several utility functions to process and analyze graphs; 7) provides several performance metrics to evaluate HEM algorithms. (Marco Notaro, Max Schubach, Peter N. Robinson and Giorgio Valentini (2017) <doi:10.1186/s12859-017-1854-y>).
This package provides the setup and calculations needed to run a likelihood-based continual reassessment method (CRM) dose-finding trial and performs simulations to assess design performance under various scenarios. Three dose-finding designs are included: the ordinal proportional odds model (POM) CRM, the ordinal continuation ratio (CR) model CRM, and the binary two-parameter logistic model CRM. These functions allow customization of design characteristics: sample size, cohort sizes, target dose-limiting toxicity (DLT) rates, discrete or continuous dose levels, the option to combine ordinal grades 0 and 1 into one category, and the incorporation of safety and/or stopping rules. For POM and CR model designs, ordinal toxicity grades are specified by the Common Terminology Criteria for Adverse Events (CTCAE) version 4.0. Function pseudodata creates the necessary starting models for these three designs, and function nextdose estimates the next dose to test in a cohort of patients for a target DLT rate. We also provide the function crmsimulations to assess the performance of these three dose-finding designs under various scenarios.
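A sketch of the workflow implied above (the three function names come from the package itself; the package name and all argument names are assumptions):

    library(ordcrm)                                  # assumed package name
    # 1. Create the starting (pseudo) data for a POM CRM design:
    start <- pseudodata(design = "POM",
                        dose = c(1, 2, 3, 4, 5),
                        targetDLT = 0.30)            # hypothetical arguments
    # 2. After observing a cohort, estimate the next dose to assign:
    nextdose(start, combine01 = TRUE)                # hypothetical arguments
    # 3. Assess operating characteristics by simulation:
    crmsimulations(design = "POM", ntrials = 1000)   # hypothetical arguments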
Choice models are a widely used technique across numerous scientific disciplines. The Apollo package is a very flexible tool for the estimation and application of choice models in R. Users are able to write their own model functions or use a mix of already available ones. Random heterogeneity, both continuous and discrete and at the level of individuals and choices, can be incorporated for all models. There is support for both standalone models and hybrid model structures. Both classical and Bayesian estimation are available, and multiple discrete-continuous models are covered in addition to discrete choice. Multi-threaded processing is supported for estimation, and a large number of pre- and post-estimation routines, including routines for computing posterior (individual-level) distributions, are available. For examples, a manual, and a support forum, visit <https://www.ApolloChoiceModelling.com>. For an overview of the field of choice models, see Train, K. (2009) <isbn:978-0-521-74738-7> and Hess, S. & Daly, A.J. (2014) <isbn:978-1-781-00314-5>.
Estimation of treatment hierarchies in network meta-analysis (NMA) using a novel frequentist approach based on treatment choice criteria (TCC) and probabilistic ranking models, as described by Evrenoglou et al. (2024) <DOI:10.48550/arXiv.2406.10612>. The TCC are defined using a rule based on the smallest worthwhile difference (SWD). Using the defined TCC, the NMA estimates (i.e., treatment effects and standard errors) are first transformed into treatment preferences, indicating either a preference (e.g., treatment A > treatment B) or a tie (treatment A = treatment B). These treatment preferences are then synthesized using a probabilistic ranking model, which estimates the latent ability parameter of each treatment and produces the final treatment hierarchy. This parameter represents each treatment's ability to outperform all the other competing treatments in the network; that is, the propensity of a treatment to yield clinically important and beneficial effects when compared to all the other treatments in the network. Consequently, larger ability estimates indicate higher positions in the ranking list.
Population-averaged models have been increasingly used in the design and analysis of cluster randomized trials (CRTs). To facilitate the application of population-averaged models in CRTs, the package implements the generalized estimating equations (GEE) and matrix-adjusted estimating equations (MAEE) approaches to jointly estimate the marginal mean and correlation models, both for general CRTs and for stepped wedge CRTs. Beyond the general GEE/MAEE approach, the package also implements the fast cluster-period GEE method by Li et al. (2022) <doi:10.1093/biostatistics/kxaa056> specifically for stepped wedge CRTs with large and variable cluster-period sizes; this is a simple and efficient estimating equations approach based on the cluster-period means to estimate the intervention effects as well as the correlation parameters. In addition, the package provides functions for generating correlated binary data with a specific mean vector and correlation matrix, based on the multivariate probit method of Emrich and Piedmonte (1991) <doi:10.1080/00031305.1991.10475828> or the conditional linear family method of Qaqish (2003) <doi:10.1093/biomet/90.2.455>.
This package performs genetic association analyses of case-parent triad (trio) data with multiple markers. It can also incorporate complete or incomplete control triads, for instance independent control children. Estimation is based on haplotypes, for instance SNP haplotypes, even though phase is not known from the genetic data. Haplin estimates the relative risk (RR + conf. int.) and p-value associated with each haplotype. It uses maximum likelihood estimation to make optimal use of data from triads with missing genotypic data, for instance if some SNPs have not been typed for some individuals. Haplin also allows estimation of effects of maternal haplotypes and parent-of-origin effects, particularly appropriate in perinatal epidemiology. Haplin allows special models, like X-inactivation, to be fitted on the X-chromosome. A GxE analysis allows testing interactions between environment and all estimated genetic effects. The models were originally described in "Gjessing HK and Lie RT. Case-parent triads: Estimating single- and double-dose effects of fetal and maternal disease gene haplotypes. Annals of Human Genetics (2006) 70, pp. 382-396".
This package provides the function qqtest, which incorporates uncertainty into its qqplot display(s) so that the user might have a better sense of the evidence against the specified distributional hypothesis. qqtest draws a quantile-quantile plot for visually assessing whether the data come from a test distribution that has been defined in one of many ways. The vertical axis plots the data quantiles, the horizontal axis those of a test distribution. The default behaviour generates 1000 samples from the test distribution and overlays the plot with shaded pointwise interval estimates for the ordered quantiles from the test distribution. A small number of independently generated exemplar quantile plots can also be overlaid. The interval estimates and the exemplars provide different comparative information with which to assess the evidence provided by the qqplot for or against the hypothesis that the data come from the test distribution (default is normal or gaussian). Finally, a visual test of significance (a lineup plot) can also be displayed to test the null hypothesis that the data come from the test distribution.
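A minimal sketch (qqtest() is the function described above; the 'dist' option value shown is an assumption about its interface):

    library(qqtest)
    set.seed(1)
    x <- rexp(50)                     # data that is clearly not gaussian
    qqtest(x)                         # default test distribution: normal/gaussian
    qqtest(x, dist = "exponential")   # assumed option name; compare to exponential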
This package provides a leadership-inference framework for multivariate time series. The framework for multiple-faction-leadership inference from coordinated activities, or mFLICA, uses a notion of a leader as an individual who initiates collective patterns that everyone in a group follows. Given a set of time series of individual activities, the goal is to identify periods of coordinated activity, find factions of coordination if more than one exist, and identify the leader of each faction. For each time step, the framework infers following relations between individual time series, then identifies the leader of each faction: an individual whom many follow but who follows no one. A faction is defined as a group of individuals who all follow the same leader. mFLICA reports following relations, leaders of factions, and members of each faction for each time step. Please see Chainarong Amornbunchornvej and Tanya Berger-Wolf (2018) <doi:10.1137/1.9781611975321.62> for methodology and Chainarong Amornbunchornvej (2021) <doi:10.1016/j.softx.2021.100781> for software when referring to this package in publications.
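A sketch of a call, assuming mFLICA() takes a 3D array of individual time series (individuals x time steps x dimensions) and a sliding window length; the argument names are assumptions:

    library(mFLICA)
    set.seed(1)
    # 10 individuals, 400 time steps, 2 movement coordinates (x, y)
    TS <- array(rnorm(10 * 400 * 2), dim = c(10, 400, 2))
    res <- mFLICA(TS = TS, timeWindow = 60)   # assumed argument names
    str(res)   # following relations, faction leaders and members per time step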
This package provides a stand-alone function that generates a user-specified number of random datasets and computes eigenvalues using the random datasets (i.e., implements Horn's [1965, Psychometrika] parallel analysis <doi:10.1007/BF02289447>). Users then compare the resulting eigenvalues (the mean or a specified percentile) from the random datasets (i.e., eigenvalues resulting from noise) to the eigenvalues generated with the user's data. It can be used for both principal components analysis (PCA) and common/exploratory factor analysis (EFA). The output table shows how large eigenvalues can be merely as a result of using randomly generated datasets. If the user's own dataset has actual eigenvalues greater than the corresponding random-data eigenvalues, that lends support to retain that factor/component. In other words, if the i-th eigenvalue from the actual data is larger than the chosen percentile of the i-th eigenvalue generated from random data, empirical support is provided to retain that factor/component. Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185.
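For intuition, Horn's procedure can be sketched in a few lines of base R (a sketch independent of this package's actual interface):

    set.seed(1)
    n <- 200; p <- 10
    dat <- matrix(rnorm(n * p), n, p)                  # replace with your own data
    obs_eig <- eigen(cor(dat))$values                  # eigenvalues of the actual data
    rand_eig <- replicate(1000,
                          eigen(cor(matrix(rnorm(n * p), n, p)))$values)
    crit <- apply(rand_eig, 1, quantile, probs = 0.95) # 95th percentile per position
    sum(obs_eig > crit)                                # factors/components to retain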
Relative transcript abundance has proven to be a valuable tool for understanding the function of genes in biological systems. For the differential analysis of transcript abundance using RNA sequencing data, the negative binomial model is by far the most frequently adopted. However, common methods that are based on a negative binomial model are not robust to extreme outliers, which we found to be abundant in public datasets. So far, no rigorous and probabilistic methods for the detection of outliers have been developed for RNA sequencing data, leaving the identification mostly to visual inspection. Recent advances in Bayesian computation allow large-scale comparison of observed data against its theoretical distribution given in a statistical model. Here we propose ppcseq, a key quality-control tool for identifying transcripts that include outlier data points, i.e., that do not follow a negative binomial distribution, in differential expression analysis. Applying ppcseq to analyse several publicly available datasets using popular tools, we show that from 3 to 10 percent of differentially abundant transcripts across algorithms and datasets had statistics inflated by the presence of outliers.
This package provides a system for calculating the optimal sampling effort, based on the ideas of "ecological cost-benefit optimization" as developed by A. Underwood (1997, ISBN 0 521 55696 1). Data is obtained from simulated ecological communities, and the optimization proceeds through six functions: (1) prep_data() takes the original dataset, formats and arranges the initial data, and creates simulated sets that can be used as a basis for estimating statistical power and type II error. (2) sim_beta() estimates the statistical power for the different sampling efforts specified by the user. (3) sim_cbo() then calculates the optimal sampling effort, based on the statistical power and the sampling costs. Additionally, (4) scompvar() calculates the variation components necessary for (5) Underwood_cbo() to calculate the optimal combination of number of sites and samples depending on either an economic budget or a desired statistical accuracy. Lastly, (6) plot_power() helps the user visualize the results of sim_beta().
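A sketch of that six-step pipeline (the function names come from the package; the package name, argument names, and cost/budget values are assumptions):

    library(ecocbo)                               # assumed package name
    sim_data  <- prep_data(pilot_community)       # 1: simulate from pilot data
    power_res <- sim_beta(sim_data)               # 2: power per sampling effort
    sim_cbo(power_res, cost = 10)                 # 3: cost-benefit optimum (hypothetical cost)
    comp_var  <- scompvar(sim_data)               # 4: variation components
    Underwood_cbo(comp_var, budget = 5000)        # 5: optimal sites x samples
    plot_power(power_res)                         # 6: visualize sim_beta() results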
This package provides functionality to perform machine-learning-based modeling in a computation pipeline. Its functions contain the basic steps of machine-learning-based knowledge discovery workflows, including model training and optimization, model evaluation, and model testing. To perform these tasks, the package builds heavily on existing machine-learning packages, such as caret <https://github.com/topepo/caret/> and associated packages. The package can train multiple models, optimize model hyperparameters by performing a grid search or a random search, and evaluate model performance with different metrics. Models can be validated either on a test data set or, in case of a small sample size, by k-fold cross-validation or repeated bootstrapping. It also allows for null-hypothesis (H0) generation by performing permutation experiments. Additionally, it offers methods of model interpretation and item categorization to identify the most informative features from a high-dimensional data space. The functions of this package can easily be integrated into computation pipelines (e.g. nextflow <https://www.nextflow.io/>) and thereby improve scalability, standardization, and reproducibility in the context of machine learning.
This package implements multiple existing open-source algorithms for coding cause of death from verbal autopsies. The methods implemented include InterVA4 by Byass et al. (2012) <doi:10.3402/gha.v5i0.19281>, InterVA5 by Byass et al. (2019) <doi:10.1186/s12916-019-1333-6>, InSilicoVA by McCormick et al. (2016) <doi:10.1080/01621459.2016.1152191>, NBC by Miasnikof et al. (2015) <doi:10.1186/s12916-015-0521-2>, and a replication of the Tariff method by James et al. (2011) <doi:10.1186/1478-7954-9-31> and Serina et al. (2015) <doi:10.1186/s12916-015-0527-9>. It also provides tools for data manipulation tasks commonly used in verbal autopsy analysis and implements easy graphical visualization of individual- and population-level statistics. The NBC method is implemented by the nbc4va package, which can be installed from <https://github.com/rrwen/nbc4va>. Note that this package was not developed by authors affiliated with the Institute for Health Metrics and Evaluation, and thus unintentional discrepancies may exist in the implementation of the Tariff method.
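A minimal sketch with the package's codeVA() wrapper (the example data set and option values follow its documented usage, to the best of our knowledge):

    library(openVA)
    data(RandomVA1)                       # bundled example verbal autopsy records
    fit <- codeVA(RandomVA1, data.type = "WHO2012",
                  model = "InterVA", version = "4.03",
                  HIV = "h", Malaria = "h")  # assumed prevalence settings
    summary(fit)                          # population-level cause-of-death fractions
    plotVA(fit)                           # graphical summary of top causes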
This package provides an efficient and very flexible framework to conduct data-driven epidemiological modeling in realistic large scale disease spread simulations. The framework integrates infection dynamics in subpopulations as continuous-time Markov chains using the Gillespie stochastic simulation algorithm and incorporates available data such as births, deaths and movements as scheduled events at predefined time-points. Using C code for the numerical solvers and OpenMP (if available) to divide work over multiple processors ensures high performance when simulating a sample outcome. One of our design goals was to make the package extendable and enable usage of the numerical solvers from other R extension packages in order to facilitate complex epidemiological research. The package contains template models and can be extended with user-defined models. For more details see the paper by Widgren, Bauer, Eriksson and Engblom (2019) <doi:10.18637/jss.v091.i12>. The package also provides functionality to fit models to time series data using the Approximate Bayesian Computation Sequential Monte Carlo ('ABC-SMC') algorithm of Toni and others (2009) <doi:10.1098/rsif.2008.0172>.
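A minimal sketch of a run of one of the package's template models, following its documented SIR pattern:

    library(SimInf)
    # five subpopulations, each starting with 999 susceptible and 1 infected
    u0 <- data.frame(S = rep(999, 5), I = rep(1, 5), R = rep(0, 5))
    model <- SIR(u0 = u0, tspan = 1:150, beta = 0.16, gamma = 0.077)
    result <- run(model)   # one stochastic sample outcome (Gillespie SSA)
    plot(result)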
This package provides a new diagram for the verification of vector variables (wind, current, etc.) generated by multiple models against a set of observations. It has been designed as a generalization of the Taylor diagram to two-dimensional quantities. It is based on the analysis of the two-dimensional structure of the mean squared error matrix between model and observations. The matrix is divided into the part corresponding to the relative rotation and the bias of the empirical orthogonal functions of the data. The full set of diagnostics produced by the analysis of the errors between model and observational vector datasets comprises the errors in the means, the analysis of the total variance of both datasets, the rotation matrix corresponding to the principal components in observation and model, the angle of rotation of the model-derived empirical orthogonal functions with respect to the ones from the observations, the standard deviation of model and observations, the root mean squared error between both datasets and the squared two-dimensional correlation coefficient. See the output of the function UVError() in this package.
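A sketch of a call (UVError() is the function named above; the argument structure shown is an assumption):

    # 'obs_uv' and 'mod_uv' are hypothetical data frames holding the U and V
    # components of the observed and modelled vector field, respectively.
    diag <- UVError(Ref = obs_uv, Mod = mod_uv)   # assumed argument names
    diag   # mean errors, EOF rotation angle, RMSE, 2D correlation, etc.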
This software package provides Cox survival analysis for high-dimensional and multiblock datasets. It encompasses a suite of functions ranging from classical Cox regression to the newest analyses, including the Cox proportional hazards model, stepwise Cox regression, Elastic-Net Cox regression, Sparse Partial Least Squares Cox regression (sPLS-COX) incorporating three distinct strategies, and two Multiblock-PLS Cox regression (MB-sPLS-COX) methods. This tool is designed to handle high-dimensional data adeptly, and provides tools for cross-validation, plot generation, and additional resources for interpreting results. While references are available within the corresponding functions, key literature is mentioned below. Terry M Therneau (2024) <https://CRAN.R-project.org/package=survival>, Noah Simon et al. (2011) <doi:10.18637/jss.v039.i05>, Philippe Bastien et al. (2005) <doi:10.1016/j.csda.2004.02.005>, Philippe Bastien (2008) <doi:10.1016/j.chemolab.2007.09.009>, Philippe Bastien et al. (2014) <doi:10.1093/bioinformatics/btu660>, Kassu Mehari Beyene and Anouar El Ghouch (2020) <doi:10.1002/sim.8671>, Florian Rohart et al. (2017) <doi:10.1371/journal.pcbi.1005752>.
When considering count data, it is often the case that many more zero counts are observed than would be expected under some given distribution. It is well established that such data can be reliably modelled using zero-inflated or hurdle distributions, both of which may be applied using the functions in this package. Bayesian analysis methods are used to best model problematic count data that cannot be fit to any typical distribution. The package functions are flexible and versatile: they can be applied to varying count distributions and to parameter estimation with or without explanatory variable information, and they allow for multiple hurdles. This matters because it is not uncommon for count data to also have an abundance of large-number observations which would be considered outliers of the typical distribution. In lieu of throwing out data or misspecifying the typical distribution, these extreme observations can be assigned to a second, extreme distribution. With the given functions of this package, such a two-hurdle model may be easily specified in order to best manage data that is both zero-inflated and over-dispersed.
The Ontario Marginalization Index is a socioeconomic model that is built on Statistics Canada census data. The model consists of four dimensions. In 2021, these dimensions were updated to "Material Resources" (previously called "Material Deprivation"), "Households and Dwellings" (previously called "Residential Instability"), "Age and Labour Force" (previously called "Dependency"), and "Racialized and Newcomer Populations" (previously called "Ethnic Concentration"). This update reflects a movement away from deficit-based language. 2021 data will load with these new dimension names, whereas 2011 and 2016 data will load with the historical dimension names. Each of these dimensions is imported for a variety of geographic levels (DA, CD, etc.) for the 2021, 2011 and 2016 administrations of the census. These data sets contribute to community analysis of equity with respect to Ontario's Anti-Racism Act. The Ontario Marginalization Index data is retrieved from the Public Health Ontario website: <https://www.publichealthontario.ca/en/data-and-analysis/health-equity/ontario-marginalization-index>. The shapefile data is retrieved from the Statistics Canada website: <https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-eng.cfm>.
This package provides a tool that "multiply imputes" missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements our bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches, and can handle many more variables. Unlike Amelia I and other statistically rigorous imputation software, it virtually never crashes (but please let us know if you find otherwise!). The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. Amelia II also includes useful diagnostics of the fit of multiple imputation models. The program works from the R command line or via a graphical user interface that does not require users to know R.
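A minimal sketch using the package's bundled africa example data, following its documented usage:

    library(Amelia)
    data(africa)
    a.out <- amelia(africa, m = 5, ts = "year", cs = "country")  # 5 imputed datasets
    summary(a.out)                 # imputation and missingness summaries
    plot(a.out)                    # diagnostics comparing observed vs. imputed
    head(a.out$imputations[[1]])   # the first completed dataset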
Geospatial data computation is parallelized by grid, hierarchy, or raster files. Based on the future (Bengtsson, 2024 <doi:10.32614/CRAN.package.future>) and mirai (Gao et al., 2025 <doi:10.32614/CRAN.package.mirai>) parallel back-ends, terra (Hijmans et al., 2025 <doi:10.32614/CRAN.package.terra>) and sf (Pebesma et al., 2024 <doi:10.32614/CRAN.package.sf>) functions, as well as convenience functions in the package, can be distributed over multiple threads. The simplest way of parallelizing generic geospatial computation is to start from the par_pad_*() functions and feed their results into the par_grid(), par_hierarchy(), or par_multirasters() functions. Virtually any function accepting classes from the terra or sf packages can be used in the three parallelization functions. A common raster-vector overlay operation is provided as the function extract_at(), which uses exactextractr (Baston, 2023 <doi:10.32614/CRAN.package.exactextractr>), with options for kernel weights for summarizing raster values at vector geometries. Other convenience functions for vector-vector operations, including simple areal interpolation (summarize_aw()) and summation of exponentially decaying weights (summarize_sedc()), are also provided.
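A sketch of the grid-parallel pattern described above (par_pad_grid() is assumed to be one of the par_pad_*() generators; all argument names and values are assumptions):

    library(chopin)
    library(future)
    plan(multisession, workers = 4)                   # parallel back-end via future
    # 'points_sf' is a hypothetical sf object of point locations
    grids <- par_pad_grid(points_sf, nx = 4, ny = 4,
                          padding = 1000)             # hypothetical grid arguments
    res <- par_grid(grids = grids,
                    fun_dist = extract_at,            # raster-vector overlay per tile
                    x = "elevation.tif", y = points_sf,
                    radius = 500)                     # hypothetical overlay arguments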
OmegaG is a tool to estimate the item-sum score's reliability (composite reliability, CR) in multidimensional scales with overlapping items. An item that measures more than one domain construct is called an overlapping item. The estimation is based on factor models that allow unlimited cross-factor loadings, such as exploratory structural equation modeling (ESEM) and Bayesian structural equation modeling (BSEM). The factor models include correlated-factor models and bi-factor models. Specifically for bi-factor models, a type of hierarchical factor model, the package estimates the CR of hierarchical subscales/hierarchy and the CR of subscales/scale total. The CR estimator Omega-generic was proposed by Mai, Srivastava, and Krull (2021) <https://whova.com/embedded/subsession/enars_202103/1450751/1452993/>. The current version can only handle continuous data. Yujiao Mai contributed the algorithms, R programming, and the application example. Deo Kumar Srivastava contributed to the algorithms and the application example. Kevin R. Krull contributed to the application example. The OmegaG package was sponsored by the American Lebanese Syrian Associated Charities (ALSAC). However, the contents of OmegaG do not necessarily represent the policy of ALSAC.
Spatio-temporal data have become increasingly popular in many research fields. Such data often have complex structures that are difficult to describe and estimate. This package provides reliable tools for modeling complicated spatio-temporal data. It also includes tools for online process monitoring to detect possible change-points in a spatio-temporal process over time. More specifically, the package implements the spatio-temporal mean estimation procedure described in Yang and Qiu (2018) <doi:10.1002/sim.7622>, the spatio-temporal covariance estimation procedure discussed in Yang and Qiu (2019) <doi:10.1002/sim.8315>, the three-step method for the joint estimation of spatio-temporal mean and covariance functions suggested by Yang and Qiu (2022) <doi:10.1007/s10463-021-00787-2>, the spatio-temporal disease surveillance method discussed in Qiu and Yang (2021) <doi:10.1002/sim.9150> that can accommodate the covariate effect, the spatial-LASSO-based process monitoring method proposed by Qiu and Yang (2023) <doi:10.1080/00224065.2022.2081104>, and the online spatio-temporal disease surveillance method described in Yang and Qiu (2020) <doi:10.1080/24725854.2019.1696496>.
Easily export R graphs and statistical output to Microsoft Office / LibreOffice, LaTeX and HTML documents, using sensible defaults that result in publication-quality output with simple, straightforward commands. Output to Microsoft Office is in editable DrawingML vector format for graphs, and can use corporate template documents for styling. This enables the production of standardized reports and also allows for manual tidy-up of the layout of R graphs in PowerPoint before final publication. Export of graphs is flexible: functions enable the currently showing R graph or the currently showing R stats object to be exported, but also allow the graphical or tabular output to be passed as objects. The package relies on the officer package for export to Office documents, and output files are also fully compatible with LibreOffice. Base R, ggplot2 and lattice plots are supported, as well as a wide variety of R stats objects, via wrappers to xtable(), broom::tidy() and stargazer(), including aov(), lm(), glm(), lme(), glmnet() and coxph(), as well as matrices and data frames, and many more.
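A minimal sketch (graph2ppt() and table2doc() are two of the package's export helpers; the option values shown are assumptions):

    library(export)
    library(ggplot2)
    print(ggplot(mtcars, aes(wt, mpg)) + geom_point())        # make the graph current
    graph2ppt(file = "scatter.pptx", width = 7, height = 5)   # editable vector graph
    fit <- lm(mpg ~ wt + cyl, data = mtcars)
    table2doc(x = summary(fit), file = "fit.docx")            # stats table to Word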