Various cladogenesis-related calculations that are slow in pure R are implemented in C++ with Rcpp. These include the calculation of the probability of various scenarios for the inheritance of geographic range at the divergence events on a phylogenetic tree, and other calculations necessary for models which are not continuous-time Markov chains (CTMC), but where change instead occurs instantaneously at speciation events. Typically these models must assess the probability of every possible combination of (ancestor state, left descendant state, right descendant state). This means that there are up to (# of states)^3 combinations to investigate, and in biogeographical models there can easily be hundreds of states, so calculation time becomes an issue. A C++ implementation plus clever tricks (many combinations can be eliminated a priori) can greatly reduce computation time compared with naive R implementations. CITATION INFO: This package is the result of my Ph.D. research, please cite the package if you use it! Type: citation(package="cladoRcpp") to get the citation information.
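To make the (# of states)^3 growth concrete, here is a minimal Python sketch. It assumes, purely for illustration, that states are the non-empty subsets of a set of areas, and it uses a hypothetical pruning rule (each descendant range must lie within the ancestor range). This is not the set of rules cladoRcpp applies; it only shows how an a priori filter shrinks the number of triples to score.

```python
from itertools import combinations, product


def nonempty_ranges(areas):
    """All non-empty subsets of the areas (assumed state space)."""
    return [
        frozenset(combo)
        for size in range(1, len(areas) + 1)
        for combo in combinations(areas, size)
    ]


def all_triples(ranges):
    """Every (ancestor, left descendant, right descendant) combination."""
    return list(product(ranges, repeat=3))


def plausible_triples(ranges):
    """Keep triples passing a hypothetical rule: both descendants
    occupy a subset of the ancestral range."""
    return [
        (anc, left, right)
        for anc, left, right in product(ranges, repeat=3)
        if left <= anc and right <= anc
    ]


ranges = nonempty_ranges("ABC")           # 2**3 - 1 states
n_all = len(all_triples(ranges))          # (# of states)**3
n_kept = len(plausible_triples(ranges))   # after the a priori filter
print(n_all, n_kept)
```

With more areas the full enumeration grows very quickly, which is why eliminating impossible triples before scoring them matters.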
An implementation to reconstruct individual patient data from Kaplan-Meier (K-M) survival curves, visualize and assess the accuracy of the reconstruction, and then perform secondary analysis on the reconstructed data. We include a simple function to extract the coordinates from published K-M curves. The function is based on Poisot T.'s digitize package (2011) <doi:10.32614/RJ-2011-004>. For more complex or tangled graphs, digitizing software such as DigitizeIt (for Mac or Windows) or ScanIt (for Windows) can be used to get the coordinates. Additional information should also be supplied to increase accuracy, such as the numbers of patients at risk (often reported at 5-10 time points under the x-axis of the K-M graph), the total number of patients, and the total number of events. The package implements the modified iterative K-M estimation algorithm (modified-iKM), which improves upon the approach proposed by Guyot (2012) <doi:10.1186/1471-2288-12-9>.
We propose a framework that provides real-time support for early detection of anomalous series within a large collection of streaming time series data. By definition, anomalies are rare in comparison to a system's typical behaviour. We define an anomaly as an observation that is very unlikely given the forecast distribution. The algorithm first forecasts a boundary for the system's typical behaviour using a representative sample of the typical behaviour of the system. An approach based on extreme value theory is used for this boundary prediction process. Then a sliding window is used to test for anomalous series within the newly arrived collection of series. A feature-based representation of time series is used as the input to the model. To cope with concept drift, the forecast boundary for the system's typical behaviour is updated periodically. More details regarding the algorithm can be found in Talagala, P. D., Hyndman, R. J., Smith-Miles, K., et al. (2019) <doi:10.1080/10618600.2019.1617160>.
Implementation of the classic Genz algorithm and a novel tile-low-rank algorithm for computing relatively high-dimensional multivariate normal (MVN) and Student-t (MVT) probabilities. References used for this package: Foley, James, Andries van Dam, Steven Feiner, and John Hughes. "Computer Graphics: Principles and Practice". Addison-Wesley Publishing Company. Reading, Massachusetts (1987, ISBN:0-201-84840-6 1); Genz, A., "Numerical computation of multivariate normal probabilities," Journal of Computational and Graphical Statistics, 1, 141-149 (1992) <doi:10.1080/10618600.1992.10477010>; Cao, J., Genton, M. G., Keyes, D. E., & Turkiyyah, G. M. "Exploiting Low Rank Covariance Structures for Computing High-Dimensional Normal and Student-t Probabilities," Statistics and Computing, 31.1, 1-16 (2021) <doi:10.1007/s11222-020-09978-y>; Cao, J., Genton, M. G., Keyes, D. E., & Turkiyyah, G. M. "tlrmvnmvt: Computing High-Dimensional Multivariate Normal and Student-t Probabilities with Low-Rank Methods in R," Journal of Statistical Software, 101.4, 1-25 (2022) <doi:10.18637/jss.v101.i04>.
This package provides functions for the evaluation of basket trial designs with binary endpoints. Operating characteristics of a basket trial design are assessed by simulating trial data according to scenarios, analyzing the data with Bayesian hierarchical models (BHMs), and assessing decision probabilities at the stratum and trial levels based on Go / No-go decision making. The package is built for high flexibility regarding decision rules, number of interim analyses, number of strata, and recruitment. The BHMs proposed by Berry et al. (2013) <doi:10.1177/1740774513497539> and Neuenschwander et al. (2016) <doi:10.1002/pst.1730>, as well as a model that combines both approaches, are implemented. Functions are provided to implement Bayesian decision rules, for example as proposed by Fisch et al. (2015) <doi:10.1177/2168479014533970>. In addition, posterior point estimates (mean/median) and credible intervals for response rates and some model parameters can be calculated. For simulated trial data, bias and mean squared errors of posterior point estimates for response rates can be provided.
Enrichment analysis enables researchers to uncover mechanisms underlying a phenotype. However, conventional methods for enrichment analysis do not take into account protein-protein interaction information, resulting in incomplete conclusions. pathfindR is a tool for enrichment analysis utilizing active subnetworks. The main function identifies active subnetworks in a protein-protein interaction network using a user-provided list of genes and associated p values. It then performs enrichment analyses on the identified subnetworks, identifying enriched terms (i.e. pathways or, more broadly, gene sets) that possibly underlie the phenotype of interest. pathfindR also offers functionalities to cluster the enriched terms and identify representative terms in each cluster, to score the enriched terms per sample and to visualize analysis results. The enrichment, clustering and other methods implemented in pathfindR are described in detail in Ulgen E, Ozisik O, Sezerman OU. 2019. pathfindR: An R Package for Comprehensive Identification of Enriched Pathways in Omics Data Through Active Subnetworks. Front. Genet. <doi:10.3389/fgene.2019.00858>.
Paternal recombination rate and maternal linkage disequilibrium (LD) are estimated for pairs of biallelic markers such as single nucleotide polymorphisms (SNPs) from progeny genotypes and sire haplotypes. The implementation relies on paternal half-sib families. If maternal half-sib families are used, the roles of sire/dam are swapped. Multiple families can be considered. For parameter estimation, at least one sire has to be double heterozygous at the investigated pairs of SNPs. Based on recombination rates, genetic distances between markers can be estimated. Markers with an unusually large recombination rate to markers in close proximity (i.e. putatively misplaced markers) should be discarded in this derivation. A workflow description is attached as a vignette. *A pipeline is available at GitHub* <https://github.com/wittenburg/hsrecombi> Hampel, Teuscher, Gomez-Raya, Doschoris, Wittenburg (2018) "Estimation of recombination rate and maternal linkage disequilibrium in half-sibs" <doi:10.3389/fgene.2018.00186>. Gomez-Raya (2012) "Maximum likelihood estimation of linkage disequilibrium in half-sib families" <doi:10.1534/genetics.111.137521>.
A comprehensive R package for accessing and working with publicly available and free resources from the Agency for Healthcare Research and Quality (AHRQ) Healthcare Cost and Utilization Project (HCUP). The package provides streamlined access to HCUP's Clinical Classifications Software Refined (CCSR) mapping files and Summary Trend Tables, enabling researchers and analysts to efficiently map ICD-10-CM diagnosis codes and ICD-10-PCS procedure codes to CCSR categories and access HCUP statistical reports. Key features include: direct download from the HCUP website, multiple output formats (long/wide/default), cross-classification support, version management, citation generation, and intelligent caching. The package does not redistribute HCUP data files but facilitates direct download from the official HCUP website, ensuring users always have access to the latest versions and maintain compliance with HCUP data use policies. This package only accesses free public tools and reports; it does NOT access HCUP databases (NIS, KID, SID, NEDS, etc.) that require purchase. For more information, see <https://hcup-us.ahrq.gov/>.
Fit survival data and perform dynamic prediction under joint frailty-copula models for tumour progression and death. Likelihood-based methods are employed for estimating model parameters, where the baseline hazard functions are modeled by the cubic M-spline or the Weibull model. The methods are applicable for meta-analytic data containing individual-patient information from several studies. Survival outcomes need information on both terminal event time (e.g., time-to-death) and non-terminal event time (e.g., time-to-tumour progression). Methodologies were published in Emura et al. (2017) <doi:10.1177/0962280215604510>, Emura et al. (2018) <doi:10.1177/0962280216688032>, Emura et al. (2020) <doi:10.1177/0962280219892295>, Shinohara et al. (2020) <doi:10.1080/03610918.2020.1855449>, Wu et al. (2020) <doi:10.1007/s00180-020-00977-1>, and Emura et al. (2021) <doi:10.1177/09622802211046390>. See also the book of Emura et al. (2019) <doi:10.1007/978-981-13-3516-7>. Survival data from ovarian cancer patients are also available.
Weighted Deming regression, also known as errors-in-variables regression, is applied with suitable weights. Weights are modeled via a precision profile; thus the methods implemented here are referred to as precision profile weighted Deming (PWD) regression. The package covers two settings: one where the precision profiles are known either from external studies or from adequate replication of the X and Y readings, and one in which there is a plausible functional form for the precision profiles but the exact (unknown) function must be estimated from the (generally singlicate) readings. The function set includes tools for: estimated standard errors (via jackknifing); standardized-residual analysis with regression diagnostic tools for normality, linearity and constant variance; and an outlier analysis identifying significant outliers for closer investigation. The following reference provides further information on mathematical derivations and applications. Hawkins, D.M., and J.J. Kraker (2026). Precision Profile Weighted Deming Regression for Methods Comparison. The Journal of Applied Laboratory Medicine 11, 379-392 <doi:10.1093/jalm/jfaf183>.
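As background for the Deming approach, the following Python sketch fits the plain, unweighted Deming line using the standard closed-form slope. It does not implement the precision-profile weighting described above. The function name `deming_fit` and the parameter `delta` (assumed ratio of the Y error variance to the X error variance) are illustrative choices, not part of the package.

```python
import math


def deming_fit(x, y, delta=1.0):
    """Unweighted Deming regression; returns (intercept, slope).

    delta is the assumed ratio of the Y error variance to the X error
    variance; delta = 1 gives orthogonal regression.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    # Closed-form slope; undefined when the sample covariance is zero.
    diff = s_yy - delta * s_xx
    slope = (diff + math.sqrt(diff ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    intercept = y_bar - slope * x_bar
    return intercept, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * v + 1.0 for v in x]   # error-free readings on a known line
intercept, slope = deming_fit(x, y)
print(intercept, slope)
```

Unlike ordinary least squares, this fit treats both X and Y as measured with error, which is the usual situation in method-comparison studies.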
Help and demo in Spanish for the orloca package. Objects and methods to handle and solve the min-sum location problem, also known as the Fermat-Weber problem. The min-sum location problem seeks a point such that the weighted sum of the distances to the demand points is minimized. See "The Fermat-Weber location problem revisited" by Brimberg, Mathematical Programming, 1, pp. 71-76, 1995. <doi:10.1007/BF01592245>. General global optimization algorithms are used to solve the problem, along with the specific Weiszfeld method; see "Sur le point pour lequel la Somme des distance de n points donnes est minimum", by Weiszfeld, Tohoku Mathematical Journal, First Series, 43, pp. 355-386, 1937, or "On the point for which the sum of the distances to n given points is minimum", by E. Weiszfeld and F. Plastria, Annals of Operations Research, 167, pp. 7-41, 2009. <doi:10.1007/s10479-008-0352-z>.
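The Weiszfeld iteration for the Fermat-Weber problem can be sketched in a few lines of Python. This is a minimal illustration, not the orloca implementation: the function names, the centroid starting point, and the simple handling of an iterate that lands on a demand point are all assumptions made here.

```python
import math


def weighted_distance_sum(point, demand, weights):
    """Objective: weighted sum of Euclidean distances to demand points."""
    return sum(
        w * math.hypot(point[0] - px, point[1] - py)
        for (px, py), w in zip(demand, weights)
    )


def weiszfeld(demand, weights, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing the weighted distance sum."""
    total = sum(weights)
    # Start from the weighted centroid of the demand points.
    x = sum(w * p[0] for p, w in zip(demand, weights)) / total
    y = sum(w * p[1] for p, w in zip(demand, weights)) / total
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for (px, py), w in zip(demand, weights):
            d = math.hypot(x - px, y - py)
            if d < 1e-12:
                # The update is undefined at a demand point; this sketch
                # simply stops there rather than applying a correction.
                return (x, y)
            num_x += w * px / d
            num_y += w * py / d
            denom += w / d
        new_x, new_y = num_x / denom, num_y / denom
        if math.hypot(new_x - x, new_y - y) < tol:
            return (new_x, new_y)
        x, y = new_x, new_y
    return (x, y)


demand = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
weights = [1.0, 1.0, 1.0]
print(weiszfeld(demand, weights))
```

Each step replaces the current point with an average of the demand points, weighted by weight divided by current distance, so nearby and heavily weighted points pull hardest.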
This package performs automatic creation of short forms of scales with an ant colony optimization algorithm and a Tabu search. As implemented in the package, the ant colony algorithm randomly selects items to build a model of a specified length, then updates the probability of item selection according to the fit of the best model within each set of searches. The algorithm continues until the same items are selected by multiple ants a given number of times in a row. On the other hand, the Tabu search changes one parameter at a time to be either free, constrained, or fixed while keeping track of the changes made and putting changes that result in worse fit in a "tabu" list so that the algorithm does not revisit them for some number of searches. See Leite, Huang, & Marcoulides (2008) <doi:10.1080/00273170802285743> for an applied example of the ant colony algorithm, and Marcoulides & Falk (2018) <doi:10.1080/10705511.2017.1409074> for an applied example of the Tabu search.
We develop a new class of distribution-free multiple testing rules for false discovery rate (FDR) control under general dependence. A key element in our proposal is a symmetrized data aggregation (SDA) approach to incorporating the dependence structure via sample splitting, data screening and information pooling. The proposed SDA filter first constructs a sequence of ranking statistics that fulfill global symmetry properties, and then chooses a data-driven threshold along the ranking to control the FDR. For more information, see the website below and the accompanying paper: Du et al. (2023), "False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation", <doi:10.1080/01621459.2021.1945459>. Some optional functionality uses the archived R packages 'huge' and 'pfa', which are not available from CRAN's main repositories. Users who need this optional functionality can obtain them from the CRAN Archive as follows: 'huge' at <https://cran.r-project.org/src/contrib/Archive/huge/>; 'pfa' at <https://cran.r-project.org/src/contrib/Archive/pfa/>.
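The thresholding step can be illustrated with a short Python sketch. It assumes the ranking statistics have already been computed and are roughly symmetric about zero under the null, and it uses a threshold rule of the kind common to symmetry-based FDR filters: the count of large negative statistics estimates the number of false positives among the large positive ones. The function names and the example statistics are hypothetical; the exact rule used in the package should be checked against the paper.

```python
def symmetric_threshold(stats, alpha):
    """Smallest t > 0 with (1 + #{W <= -t}) / max(#{W >= t}, 1) <= alpha."""
    candidates = sorted({abs(v) for v in stats if v != 0})
    for t in candidates:
        n_negative = sum(1 for v in stats if v <= -t)
        n_positive = sum(1 for v in stats if v >= t)
        if (1 + n_negative) / max(n_positive, 1) <= alpha:
            return t
    return float("inf")  # no threshold achieves the target level


def select_discoveries(stats, alpha):
    """Indices of hypotheses whose statistic reaches the threshold."""
    t = symmetric_threshold(stats, alpha)
    return [j for j, v in enumerate(stats) if v >= t]


# Hypothetical ranking statistics: large positive values suggest signals.
w = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.8, 1.5, 1.2, 1.0, -0.5, 0.4, -0.3, 0.2]
print(symmetric_threshold(w, alpha=0.2), select_discoveries(w, alpha=0.2))
```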
Cluster-randomized trials (CRTs) assign treatment to groups rather than individuals, so valid analyses must distinguish cluster-level and individual-level effects and define estimands within a potential-outcomes framework. This package supports right-censored survival outcomes for both single-state (binary) and multi-state settings. For single-state outcomes, it provides estimands based on stage-specific survival contrasts (SPCE) and restricted mean survival time (RMST). For multi-state outcomes, it provides SPCE as well as a generalized win-based restricted mean time-in-favor estimand (RMT-IF). The package implements doubly robust estimators that accommodate covariate-dependent censoring and remain consistent if either the outcome model or the censoring model is correctly specified. Users can choose marginal Cox or gamma-frailty Cox working models for nuisance estimation, and inference is supported via leave-one-cluster-out jackknife variance and confidence interval estimation. Methods are described in Fang et al. (2025) "Estimands and doubly robust estimation for cluster-randomized trials with survival outcomes" <doi:10.48550/arXiv.2510.08438>.
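For readers unfamiliar with restricted mean survival time, the sketch below computes it as the area under a Kaplan-Meier curve up to a horizon tau. It is a plain single-sample illustration in Python: it ignores clustering, covariate-dependent censoring, and the doubly robust estimators described above, and the function names are chosen here for illustration only.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 for an observed event and 0 for right-censoring.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


def rmst(curve, tau):
    """Area under the step survival curve from 0 to tau."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)
    return area


curve = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])
print(rmst(curve, tau=4))
```

An RMST contrast between arms is then the difference of two such areas evaluated at the same tau.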
This package provides a tool for comprehensive transcriptomic data analysis, with a focus on transcript-level data preprocessing, expression profiling, differential expression analysis, and functional enrichment. It enables researchers to identify key biological processes, disease biomarkers, and gene regulatory mechanisms. TransProR is aimed at researchers and bioinformaticians working with RNA-Seq data, providing an intuitive framework for in-depth analysis and visualization of transcriptomic datasets. The package includes comprehensive documentation and usage examples to guide users through the entire analysis pipeline. The differential expression analysis methods incorporated in the package include limma (Ritchie et al., 2015, <doi:10.1093/nar/gkv007>; Smyth, 2005, <doi:10.1007/0-387-29362-0_23>), edgeR (Robinson et al., 2010, <doi:10.1093/bioinformatics/btp616>), DESeq2 (Love et al., 2014, <doi:10.1186/s13059-014-0550-8>), and Wilcoxon tests (Li et al., 2022, <doi:10.1186/s13059-022-02648-4>), providing flexible and robust approaches to RNA-Seq data analysis. For more information, refer to the package vignettes and related publications.
The advent of genomic technologies has enabled the generation of two-dimensional or even multi-dimensional high-throughput data, e.g., monitoring multiple changes in gene expression in genome-wide siRNA screens across many different cell types (E Robert McDonald 3rd (2017) <doi:10.1016/j.cell.2017.07.005> and Tsherniak A (2017) <doi:10.1016/j.cell.2017.06.010>) or single cell transcriptomics under different experimental conditions. We found that simple computational methods based on a single statistical criterion are no longer adequate for analyzing such multi-dimensional data. We herein introduce ZetaSuite, a statistical package initially designed to score hits from two-dimensional RNAi screens. We also illustrate a unique utility of ZetaSuite in analyzing single cell transcriptomics to differentiate rare cells from damaged ones (Vento-Tormo R (2018) <doi:10.1038/s41586-018-0698-6>). ZetaSuite comprises the following steps: QC of input datasets, normalization using Z-transformation, Zeta score calculation and hit selection based on a defined Screen Strength.
This package provides a collection of miscellaneous basic statistical functions and convenience wrappers for efficiently describing data. The author's intention was to create a toolbox that facilitates the (notoriously time-consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. The package furthermore contains functions to produce documents using MS Word (or PowerPoint) and functions to import data from Excel. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R. The reason for collecting them here was primarily to have them consolidated in ONE instead of dozens of packages (which themselves might depend on other packages which are not needed at all), and to provide a common and consistent interface as far as function and argument naming, NA handling, recycling rules etc. are concerned. Google style guides were used as naming rules (in the absence of convincing alternatives). The BigCamelCase style was consistently applied to functions borrowed from contributed R packages as well.
FamSKAT-RC is a family-based association kernel test for both rare and common variants. This test is general, and several special cases are known as other methods: famSKAT, which only focuses on rare variants in family-based data; SKAT, which focuses on rare variants in population-based data (unrelated individuals); and SKAT-RC, which focuses on both rare and common variants in population-based data. When one applies famSKAT-RC and sets the value of phi to 1, famSKAT-RC becomes famSKAT. When one applies famSKAT-RC and sets the value of phi to 1 and the kinship matrix to the identity matrix, famSKAT-RC becomes SKAT. When one applies famSKAT-RC and sets the kinship matrix (fullkins) to the identity matrix (and phi is not equal to 1), famSKAT-RC becomes SKAT-RC. We also include a small synthetic sample pedigree to demonstrate the method. For more details see Saad M and Wijsman EM (2014) <doi:10.1002/gepi.21844>.
This package provides a collection of functions to construct A-optimal block designs for comparing test treatments with one or more control(s). Mainly A-optimal balanced treatment incomplete block designs, weighted A-optimal balanced treatment incomplete block designs, A-optimal group divisible treatment designs and A-optimal balanced bipartite block designs can be constructed using the package. The designs are constructed using algorithms based on linear integer programming. To the best of our knowledge, these facilities to construct A-optimal block designs for comparing test treatments with one or more controls are not available in the existing R packages. For more details on designs for tests versus control(s) comparisons, please see Hedayat, A. S. and Majumdar, D. (1984) <doi:10.1080/00401706.1984.10487989> A-Optimal Incomplete Block Designs for Control-Test Treatment Comparisons, Technometrics, 26, 363-370 and Mandal, B. N., Gupta, V. K., Parsad, Rajender (2017) <doi:10.1080/03610926.2015.1071394> Balanced treatment incomplete block designs through integer programming. Communications in Statistics - Theory and Methods 46(8), 3728-3737.
Missing values often occur in financial data due to a variety of reasons (errors in the collection process or in the processing stage, lack of asset liquidity, lack of reporting of funds, etc.). However, most data analysis methods expect complete data and cannot be employed with missing values. One convenient way to deal with this issue without having to redesign the data analysis method is to impute the missing values. This package provides an efficient way to impute the missing values based on modeling the time series with a random walk or an autoregressive (AR) model, convenient to model log-prices and log-volumes in financial data. In the current version, the imputation is univariate-based (so no asset correlation is used). In addition, outliers can be detected and removed. The package is based on the paper: J. Liu, S. Kumar, and D. P. Palomar (2019). Parameter Estimation of Heavy-Tailed AR Model With Missing Data Via Stochastic EM. IEEE Trans. on Signal Processing, vol. 67, no. 8, pp. 2159-2172. <doi:10.1109/TSP.2019.2899816>.
Navigating the shift of clinical laboratory data from primary everyday clinical use to secondary research purposes presents a significant challenge. Given the substantial time and expertise required for lab data pre-processing and cleaning and the lack of all-in-one tools tailored for this need, we developed our algorithm lab2clean as an open-source R package. The lab2clean package is designed to automate and standardize the intricate process of cleaning clinical laboratory results. With a keen focus on improving the data quality of laboratory result values and units, our goal is to equip researchers with a straightforward, plug-and-play tool, making it smoother for them to unlock the true potential of clinical laboratory data in clinical research and clinical machine learning (ML) model development. Functions to clean & validate result values (Version 1.0) are described in detail in Zayed et al. (2024) <doi:10.1186/s12911-024-02652-7>. Functions to standardize & harmonize result units (added in Version 2.0) are described in detail in Zayed et al. (2025) <doi:10.1016/j.ijmedinf.2025.106131>.
We provide a toolbox to estimate the time delay between the brightness time series of gravitationally lensed quasar images via Bayesian and profile likelihood approaches. The model is based on a state-space representation for irregularly observed time series data generated from a latent continuous-time Ornstein-Uhlenbeck process. Our Bayesian method adopts scientifically motivated hyper-prior distributions and a Metropolis-Hastings within Gibbs sampler, producing posterior samples of the model parameters that include the time delay. A profile likelihood of the time delay is a simple approximation to the marginal posterior distribution of the time delay. Both Bayesian and profile likelihood approaches complement each other, producing almost identical results; the Bayesian way is more principled but the profile likelihood is easier to implement. A new functionality is added in version 1.0.9 for estimating the time delay between doubly-lensed light curves observed in two bands. See also Tak et al. (2017) <doi:10.1214/17-AOAS1027>, Tak et al. (2018) <doi:10.1080/10618600.2017.1415911>, Hu and Tak (2020) <arXiv:2005.08049>.
Autoregressive distributed lag (A[R]DL) models (and their reparameterized equivalent, the Generalized Error-Correction Model [GECM]) are the workhorse models in uncovering dynamic inferences. ADL models are simple to estimate; this is what makes them attractive. Once these models are estimated, what is less clear is how to uncover a rich set of dynamic inferences from these models. We provide tools for recovering those inferences. These tools apply to traditional time-series quantities of interest: especially instantaneous effects for any period and cumulative effects for any period (including the long-run effect). They also allow for a variety of shock histories to be applied to the independent variable (beyond just a one-time, one-unit increase) as well as the recovery of inferences in levels for shocks applied to (in)dependent variables in differences (what we call the Generalized Dynamic Response Function). These effects are also available for the general conditional dynamic model advocated by Warner, Vande Kamp, and Jordan (2026 <doi:10.1017/psrm.2026.10087>). We also provide the actual formulae for these effects.
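The period, cumulative, and long-run effects mentioned above can be illustrated for the simplest case, an ADL(1,1) model y_t = alpha * y_{t-1} + beta0 * x_t + beta1 * x_{t-1}, under a one-time, one-unit shock to x. The Python sketch below is a textbook calculation with hypothetical function names and coefficient values; it is not the package's interface and does not cover the more general shock histories.

```python
def period_effects(alpha, beta0, beta1, horizon):
    """Effect on y in periods 0..horizon of a one-time unit shock to x."""
    effects = [beta0]  # instantaneous effect in the shock period
    for h in range(1, horizon + 1):
        # The lagged x term contributes only one period after the shock;
        # afterwards the effect decays through the lagged y term.
        direct = beta1 if h == 1 else 0.0
        effects.append(alpha * effects[-1] + direct)
    return effects


def cumulative_effects(effects):
    """Running sum of the period effects."""
    total = 0.0
    out = []
    for e in effects:
        total += e
        out.append(total)
    return out


def long_run_effect(alpha, beta0, beta1):
    """Limit of the cumulative effect; requires |alpha| < 1."""
    return (beta0 + beta1) / (1 - alpha)


effects = period_effects(alpha=0.5, beta0=1.0, beta1=0.5, horizon=60)
print(effects[:3], cumulative_effects(effects)[-1], long_run_effect(0.5, 1.0, 0.5))
```

The cumulative effect converges to the long-run effect as the horizon grows, which gives a quick consistency check on estimated coefficients.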
This package implements a variety of nonparametric and parametric methods that are commonly used when the data set is a mixture of paired observations and independent samples. The package also calculates and returns values of different tests with their corresponding p-values. Bhoj, D. S. (1991) <doi:10.1002/bimj.4710330108> "Testing equality of means in the presence of correlation and missing data". Dubnicka, S. R., Blair, R. C., and Hettmansperger, T. P. (2002) <doi:10.22237/jmasm/1020254460> "Rank-based procedures for mixed paired and two-sample designs". Einsporn, R. L. and Habtzghi, D. (2013) <https://pdfs.semanticscholar.org/89a3/90bafeb2bc41ed4414533cfd5ab84a6b54b6.pdf> "Combining paired and two-sample data using a permutation test". Ekbohm, G. (1976) <doi:10.1093/biomet/63.2.299> "On comparing means in the paired case with incomplete data on both responses". Lin, P. E. and Stivers, L. E. (1974) <doi:10.1093/biomet/61.2.325> "On difference of means with incomplete data". Maritz, J. S. (1995) <doi:10.1111/j.1467-842x.1995.tb00649.x> "A permutation paired test allowing for missing values".