This package provides a comprehensive toolkit for generating continuous test norms in psychometrics and biometrics and for analyzing model fit. It offers both distribution-free modeling using Taylor polynomials and parametric modeling using the beta-binomial distribution. Originally developed for achievement tests, it is applicable to a wide range of mental, physical, or other test scores that depend on continuous or discrete explanatory variables. The package provides several advantages: it minimizes deviations from representativeness in subsamples, interpolates between discrete levels of explanatory variables, and significantly reduces the required sample size compared to conventional norming per age group. cNORM enables graphical and analytical evaluation of model fit, accommodates a wide range of scales including those with negative and descending values, and even supports conventional norming. It generates norm tables including confidence intervals, and it includes methods for addressing representativeness issues through Iterative Proportional Fitting.
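A minimal sketch of the basic workflow, using the elfe demo dataset that ships with the package; the normTable() arguments for confidence intervals (CI, reliability) are assumptions based on the package documentation:

    library(cNORM)
    # distribution-free continuous norming on the bundled 'elfe'
    # reading-comprehension demo data (raw scores by age group)
    model <- cnorm(raw = elfe$raw, group = elfe$group)
    # norm table for age 3.5; 'CI' and 'reliability' are needed for
    # confidence intervals (argument names per the package docs)
    normTable(3.5, model, CI = 0.9, reliability = 0.94)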
Models the relationship between dose levels and responses in a pharmacological experiment using the four-parameter logistic (4PL) model. Traditional dose-response packages such as drc and nplr often fail with convergence errors, especially when the data contain outliers or lack a logistic shape. This package provides robust estimation methods that are less affected by outliers, together with alternative initialization methods that work well for data lacking logistic shapes. We provide bounds on the parameters of the 4PL model that prevent parameter estimates from diverging or converging to zero, and we justify them with a statistical principle. These methods serve as remedies to convergence failure problems. Gadagkar, S. R. and Call, G. B. (2015) <doi:10.1016/j.vascn.2014.08.006>; Ritz, C., Baty, F., Streibig, J. C. and Gerhard, D. (2015) <doi:10.1371/journal.pone.0146021>.
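A brief hedged example of the robust fit, using the sample_data_1 dataset bundled with dr4pl; the method.init and method.robust options shown are those documented for handling non-logistic shapes and outliers:

    library(dr4pl)
    # 4PL fit with Huber loss (downweights outliers) and logistic
    # initialization (helps when the data lack a clear logistic shape)
    fit <- dr4pl(Response ~ Dose, data = sample_data_1,
                 method.init = "logistic", method.robust = "Huber")
    summary(fit)
    plot(fit)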
This univariate statistical quality control tool addresses measurement error effects when constructing exponentially weighted moving average (EWMA) p control charts. The method primarily targets binary random variables, but it can be applied to any continuous random variable by using a sign statistic to transform it into a discrete one. With the correction of measurement error effects, we can obtain the corrected control limits of the EWMA p control chart and reasonably adjusted EWMA p control charts. The methods in this package are described in Chen and Yang (2022) <arXiv:2203.03384>; Yang et al. (2011) <doi:10.1016/j.eswa.2010.11.044>; Yang and Arnold (2014) <doi:10.1155/2014/238719>; Yang (2016) <doi:10.1080/03610918.2013.763980>; and Yang and Arnold (2016) <doi:10.1080/00949655.2015.1125901>.
Quantitative trait loci (QTL) mapping and genome-wide association analysis are used to find candidate molecular markers or regions associated with a phenotype, based on linkage analysis and linkage disequilibrium. Gene expression QTL mapping is used to find candidate molecular markers or regions associated with gene expression. In this package, we apply the methods of Liu W. (2011) <doi:10.1007/s00122-011-1631-7> and Gusev A. (2016) <doi:10.1038/ng.3506> to genome- and transcriptome-wide association studies, aiming to reveal the associations between a phenotype and molecular markers, expression levels, molecular marker effects nested within related expression effects, and expression effects nested within related molecular marker effects. F tests based on full and reduced models are performed to obtain p values or likelihood ratio statistics. The best linear model can be obtained by stepwise regression analysis.
Factor models have been widely applied in areas such as economics and finance, and the well-known heavy-tailedness of macroeconomic/financial data should be taken into account when conducting factor analysis. We propose two algorithms for robust factor analysis based on the Huber loss. One minimizes the Huber loss of the idiosyncratic error's L2 norm, which turns out to perform Principal Component Analysis (PCA) on a weighted sample covariance matrix and is hence named Huber PCA. The other minimizes the element-wise Huber loss, which can be solved by an iterative Huber regression algorithm. The package also provides code for traditional PCA, the Robust Two Step (RTS) method by He et al. (2022), and the Quantile Factor Analysis (QFA) method by Chen et al. (2021) and He et al. (2023).
This package provides a bunch of algorithms based on linear programming for estimating, under the homogeneity hypothesis, RxC ecological contingency tables (or vote transition matrices) using mainly aggregate data (from voting units). References: Pavía and Romero (2024) <doi:10.1177/00491241221092725>. Pavía and Romero (2024) <doi:10.1093/jrsssa/qnae013>. Pavía (2023) <doi:10.1007/s43545-023-00658-y>. Pavía (2024) <doi:10.1080/0022250X.2024.2423943>. Pavía (2024) <doi:10.1177/07591063241277064>. Pavía and Penadés (2024). A bottom-up approach for ecological inference. Romero, Pavía, Martín and Romero (2020) <doi:10.1080/02664763.2020.1804842>. Acknowledgements: The authors wish to thank Consellería de Educación, Universidades y Empleo, Generalitat Valenciana (grants AICO/2021/257, CIAICO/2023/031) and Ministerio de Economía e Innovación (grant PID2021-128228NB-I00) for supporting this research.
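As a rough sketch of intended use (the dataset name, column split, and the VTM element are taken from the package examples and should be verified against the documentation):

    library(lphom)
    # estimate an RxC vote transition matrix from aggregate results of
    # two elections recorded in the bundled France2017P data
    mt <- lphom(France2017P[, 1:8], France2017P[, 9:12],
                new_and_exit_voters = "raw")
    mt$VTM  # row-standardized transfer matrix (element name per docs)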
This package provides a regression and classification algorithm based on random forests, which takes the form of a short list of rules. SIRUS combines the simplicity of decision trees with a predictivity close to random forests. The core aggregation principle of random forests is kept, but instead of aggregating predictions, SIRUS aggregates the forest structure: the most frequent nodes of the forest are selected to form a stable rule ensemble model. The algorithm is fully described in the following articles: Benard C., Biau G., da Veiga S., Scornet E. (2021), Electron. J. Statist., 15:427-505 <DOI:10.1214/20-EJS1792> for classification, and Benard C., Biau G., da Veiga S., Scornet E. (2021), AISTATS, PMLR 130:937-945 <http://proceedings.mlr.press/v130/benard21a>, for regression. This R package is a fork from the project ranger (<https://github.com/imbs-hl/ranger>).
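A minimal regression sketch with the iris data, assuming the package's documented sirus.fit()/sirus.predict() interface:

    library(sirus)
    X <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
    y <- iris$Petal.Width                 # numeric response: regression
    model <- sirus.fit(X, y)              # stable rule ensemble
    sirus.print(model)                    # the short list of rules
    pred <- sirus.predict(model, X)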
iPath is a Bioconductor package for calculating personalized pathway scores and testing their association with survival outcomes. Abundant single-gene biomarkers have been identified and used in the clinic. However, hundreds of oncogenes or tumor-suppressor genes are involved in the process of tumorigenesis. We believe individual-level expression patterns of pre-defined pathways or gene sets are better biomarkers than single genes. In this study, we devised a computational method named iPath to identify prognostic biomarker pathways, one sample at a time. To test its utility, we conducted a pan-cancer analysis across 14 cancer types from The Cancer Genome Atlas and demonstrated that iPath is capable of identifying highly predictive biomarkers for clinical outcomes, including overall survival, tumor subtypes, and tumor stage classifications. We found that pathway-based biomarkers are more robust and effective than single genes.
This package contains a collection of functions for nonparametric measurement error problems using deconvolution kernel methods. We focus on two measurement error models in the package: (1) an additive measurement error model, where the goal is to estimate the density or distribution function from contaminated data; (2) a nonparametric regression model with errors-in-variables. The R functions allow the measurement errors to be either homoscedastic or heteroscedastic. To make the deconvolution estimators computationally more efficient in R, we adapt the Fast Fourier Transform (FFT) algorithm for density estimation with error-free data to the deconvolution kernel estimation. Several data-driven methods for selecting the smoothing parameter are also provided. See details in: Wang, X.F. and Wang, B. (2011). Deconvolution estimation in measurement error models: The R package decon. Journal of Statistical Software, 39(10), 1-24.
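A small simulated sketch of case (1), the additive model W = X + U with homoscedastic normal error, using the package's DeconPdf() estimator with the FFT option:

    library(decon)
    set.seed(1)
    n <- 500
    x <- rnorm(n)                  # unobserved true values X
    w <- x + rnorm(n, sd = 0.5)    # contaminated observations W = X + U
    # deconvolution kernel density estimate of X, FFT-accelerated
    fhat <- DeconPdf(w, sig = 0.5, error = "normal", fft = TRUE)
    plot(fhat, main = "Deconvolution density estimate")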
An update to the Joint Location-Scale (JLS) testing framework that identifies SNPs, gene sets, and pathways with main and/or interaction effects on quantitative traits (Soave et al., 2015; <doi:10.1016/j.ajhg.2015.05.015>). The JLS method simultaneously tests the null hypothesis of equal mean and equal variance across genotypes by aggregating association evidence from the individual location/mean-only and scale/variance-only tests using Fisher's method. The generalized joint location-scale (gJLS) framework was developed to deal specifically with sample correlation and group uncertainty (Soave and Sun, 2017; <doi:10.1111/biom.12651>). The current release, gJLS2, includes additional functionality that enables analyses of X-chromosome genotype data through novel methods for location (Chen et al., 2021; <doi:10.1002/gepi.22422>) and scale (Deng et al., 2019; <doi:10.1002/gepi.22247>).
The calibrated population ratio estimator under a two-phase random sampling design has gained enormous popularity in recent times. This package provides functions for estimating the calibrated population ratio under a two-phase sampling design, including the approximate variance of the ratio estimator. The improved ratio estimator is applicable both when auxiliary data are available at the unit level and when they are available at the aggregate level (e.g., mean or total) for the first-phase sample. The calibration weight of each unit of the second-phase sample is calculated. Single and combined inclusion probabilities are also estimated for both phases under two-phase random sampling [simple random sampling without replacement (SRSWOR)]. The improved ratio estimator's percentage coefficient of variation is also reported as a measure of accuracy. This package has been developed based on the theoretical developments of Islam et al. (2021) and Ozgul (2020) <doi:10.1080/00949655.2020.1844702>.
This package provides a modular and computationally efficient R package for parameterizing, simulating, and analyzing health economic simulation models. The package supports cohort discrete time state transition models (Briggs et al. 1998) <doi:10.2165/00019053-199813040-00003>, N-state partitioned survival models (Glasziou et al. 1990) <doi:10.1002/sim.4780091106>, and individual-level continuous time state transition models (Siebert et al. 2012) <doi:10.1016/j.jval.2012.06.014>, encompassing both Markov (time-homogeneous and time-inhomogeneous) and semi-Markov processes. Decision uncertainty from a cost-effectiveness analysis is quantified with standard graphical and tabular summaries of a probabilistic sensitivity analysis (Claxton et al. 2005, Barton et al. 2008) <doi:10.1002/hec.985>, <doi:10.1111/j.1524-4733.2008.00358.x>. Use of C++ and data.table make individual-patient simulation, probabilistic sensitivity analysis, and incorporation of patient heterogeneity fast.
This package provides the hyphenation algorithm used by TeX/LaTeX and similar software, as proposed by Liang (1983, <https://tug.org/docs/liang/>). It mainly contains the function hyphen() for hyphenation and syllable counting of text objects. It was originally developed as part of the koRpus package, but was later released as a separate package so that other packages can use this particular functionality with a lighter dependency. Support for various languages needs to be added on the fly or by plugin packages (<https://undocumeantit.github.io/repos/>); this package does not include any language-specific data. Due to some restrictions on CRAN, the full package sources are only available from the project homepage. To ask for help, report bugs, request features, or discuss the development of the package, please subscribe to the koRpus-dev mailing list (<http://korpusml.reaktanz.de>).
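A short sketch, assuming the sylly.en language plugin from the project repository is installed (this package itself ships no language data):

    library(sylly)
    library(sylly.en)   # English hyphenation patterns (plugin package)
    hyph <- hyphen(c("hyphenation", "statistics"), hyph.pattern = "en")
    hyph                # hyphenated words with syllable counts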
This package contains various functions to be used for simulation education, including simple Monte Carlo simulation functions, queueing simulation functions, variate generation functions capable of producing independent streams and antithetic variates, functions for illustrating random variate generation for various discrete and continuous distributions, and functions to compute time-persistent statistics. It also contains functions for visualizing: event-driven details of a single-server queue model; a Lehmer random number generator; variate generation via acceptance-rejection; and generation of a non-homogeneous Poisson process via thinning. It further includes two queueing data sets (one fabricated, one real-world) to facilitate input modeling. More details on the use of these functions can be found in Lawson and Leemis (2017) <doi:10.1109/WSC.2017.8248124>, in Kudlay, Lawson, and Leemis (2020) <doi:10.1109/WSC48552.2020.9384010>, and in Lawson and Leemis (2021) <doi:10.1109/WSC52266.2021.9715299>.
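A brief sketch of the variate generators and the single-server queue simulator; the ssq() arguments shown are assumed from the package documentation:

    library(simEd)
    # independent streams and antithetic variates for exponential variates
    x  <- vexp(5, rate = 2, stream = 1)
    xa <- vexp(5, rate = 2, stream = 1, antithetic = TRUE)
    # event-driven single-server queue simulation with saved statistics
    out <- ssq(maxArrivals = 100, seed = 8675309, saveAllStats = TRUE)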
Analyses of proportions can be performed on the Anscombe (arcsine-related) transformed data. The ANOPA package can analyze proportions obtained from up to four factors. The factors can be within-subject or between-subject, or a mix of within- and between-subject. The main, omnibus analysis can be followed by additive decompositions into interaction effects, main effects, simple effects, contrast effects, etc., mimicking precisely the logic of ANOVA. For that reason, we call this set of tools ANOPA (Analysis of Proportion using Anscombe transform) to highlight its similarities with ANOVA. The ANOPA framework also makes plots of proportions, along with confidence intervals, easy to obtain. Finally, effect sizes can be computed and statistical power planned easily under this framework. One particularity: ANOPA computes F statistics whose denominator has infinite degrees of freedom. See Laurencelle and Cousineau (2023) <doi:10.3389/fpsyg.2022.1045436>.
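An illustrative sketch using the package's compiled-format formula notation ({successes; total} ~ factors); the dataset and factor names are taken from the package examples and should be checked against the vignettes:

    library(ANOPA)
    # omnibus ANOPA on proportions of successes s out of n observations
    w <- anopa({s; n} ~ Location * Diel * Trophism, ArringtonEtAl2021)
    summary(w)      # F statistics with infinite denominator df
    anopaPlot(w)    # proportions with confidence intervals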
Collection of tools to work with European basketball data. Functions available are related to friendly web scraping, data management, and visualization. Data were obtained from <https://www.euroleaguebasketball.net/euroleague/>, <https://www.euroleaguebasketball.net/eurocup/> and <https://www.acb.com/>, following the instructions of their respective robots.txt files, when available. Box score data are available for the three leagues. Play-by-play data are also available for the Spanish league. Methods for analysis include a population pyramid, 2D plots, circular plots of players' percentiles, plots of players' monthly/yearly stats, team heatmaps, team shooting plots, team four factors plots, cross-tables with the results of regular season games, maps of nationalities, combinations of lineups, possessions-related variables, timeouts, performance by periods, personal fouls, and offensive rebounds. Please see Vinue (2020) <doi:10.1089/big.2018.0124> and Vinue (2024) <doi:10.1089/big.2023.0177>.
Estimation of Bayesian Global Vector Autoregressions (BGVAR) with different prior setups and the possibility to introduce stochastic volatility. Built-in priors include the Minnesota, the stochastic search variable selection, and the Normal-Gamma (NG) prior. For a reference see also Crespo Cuaresma, J., Feldkircher, M. and F. Huber (2016) "Forecasting with Global Vector Autoregressive Models: a Bayesian Approach", Journal of Applied Econometrics, Vol. 31(7), pp. 1371-1391 <doi:10.1002/jae.2504>. Post-processing functions allow for computing predictions, structurally identifying the model with short-run or sign restrictions, and computing impulse response functions, historical decompositions, and forecast error variance decompositions. Plotting functions are also available. The package has a companion paper: Boeck, M., Feldkircher, M. and F. Huber (2022) "BGVAR: Bayesian Global Vector Autoregressions with Shrinkage Priors in R", Journal of Statistical Software, Vol. 104(9), pp. 1-28 <doi:10.18637/jss.v104.i09>.
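A compact sketch following the package's vignette-style workflow with the bundled eerData example (draw counts are kept unrealistically small here purely for illustration):

    library(BGVAR)
    data(eerData)   # example data set and trade weight matrix W.trade0012
    # BGVAR with Normal-Gamma shrinkage prior and stochastic volatility
    model <- bgvar(Data = eerData, W = W.trade0012, plag = 1,
                   draws = 100, burnin = 100, prior = "NG", SV = TRUE)
    fcast <- predict(model, n.ahead = 4)   # post-processing: predictions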
This software does Multi-Reader, Multi-Case (MRMC) analyses of data from imaging studies where clinicians (readers) evaluate patient images (cases). What does this mean? ... Many imaging studies are designed so that every reader reads every case in all modalities, a fully-crossed study. In this case, the data is cross-correlated, and we consider the readers and cases to be cross-correlated random effects. An MRMC analysis accounts for the variability and correlations from the readers and cases when estimating variances, confidence intervals, and p-values. The functions in this package can treat arbitrary study designs and studies with missing data, not just fully-crossed study designs. An overview of this software, including references presenting details on the methods, can be found here: <https://www.fda.gov/medical-devices/science-and-research-medical-devices/imrmc-software-do-multi-reader-multi-case-statistical-analysis-reader-studies>.
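A minimal sketch, assuming the Roe & Metz simulation helpers bundled with the package to create a fully-crossed dataset before running the analysis:

    library(iMRMC)
    config <- sim.gRoeMetz.config()  # default fully-crossed study design
    df     <- sim.gRoeMetz(config)   # simulated MRMC reader/case scores
    result <- doIMRMC(df)            # MRMC variance analysis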
Optimal Subset Cardinality Regression (OSCAR) models offer regularized linear regression using the L0-pseudonorm, conventionally defined as the number of non-zero coefficients. The package estimates an optimal subset of features using L0-penalization via cross-validation, bootstrapping, and visual diagnostics. Efficient Fortran implementations are bundled with the package for finding optima of the DC decomposition, which transforms the discrete L0-regularized optimization problem into a continuous non-convex optimization task. These optimization modules include DBDC ('Double Bundle method for nonsmooth DC optimization', as described in Joki et al. (2018) <doi:10.1137/16M1115733>) and LMBM ('Limited Memory Bundle Method' for large-scale nonsmooth optimization, as in Haarala et al. (2004) <doi:10.1080/10556780410001689225>). The OSCAR models are comprehensively exemplified in Halkola et al. (2023) <doi:10.1371/journal.pcbi.1010333>. Multiple regression model families are supported: Cox, logistic, and Gaussian.
Forms a query for US Treasury yield curve data and posts it to the US Treasury web site's data feed service. By default the download includes yield data for 12 products from January 1, 1990, some of which are NA during this span. The caller can pass parameters to limit the query to a certain year, or year and month, but the full download is not especially large. The data downloaded from the service is in XML format. The package's main function transforms that XML data into a numeric data frame with Treasury product items (constant maturity yields for 12 kinds of bills, notes, and bonds) as columns and dates as row names. The function returns a list which includes this data frame as well as query-related values for reference and the update date from the service.
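A short usage sketch; the list element names (df, updated) are assumptions based on the description above and should be checked against the returned object:

    library(ustyc)
    yc <- getYieldCurve(year = 2023)  # restrict the query to one year
    head(yc$df)     # constant maturity yields: products as columns,
                    # dates as row names
    yc$updated      # update date reported by the service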
mitch is an R package for multi-contrast enrichment analysis. At its heart, it uses a rank-MANOVA based statistical approach to detect sets of genes that exhibit enrichment in the multidimensional space as compared to the background. The rank-MANOVA concept dates back to work by Cox and Mann <doi:10.1186/1471-2105-13-S16-S12>. mitch is useful for pathway analysis of profiling studies with one, two, or more contrasts, or in studies with multiple omics profiling, for example proteomic, transcriptomic, and epigenomic analysis of the same samples. mitch is perfectly suited for pathway-level differential analysis of scRNA-seq data. We have an established routine for pathway enrichment of Infinium Methylation Array data (see vignette). The main strengths of mitch are that it can import datasets easily from many upstream tools and has advanced plotting features to visualise these enrichments.
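A minimal multi-contrast sketch; de1 and de2 (limma-style differential expression tables) and genesets (a named list of gene sets) are hypothetical inputs:

    library(mitch)
    # de1, de2: limma topTable-style data frames; genesets: named list
    x   <- mitch_import(list(ctrl = de1, trt = de2), DEtype = "limma")
    res <- mitch_calc(x, genesets, priority = "significance")
    mitch_report(res, "enrichment_report.html")   # HTML report with plots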
Perform association tests using generalized linear mixed models (GLMMs) in genome-wide association studies (GWAS) and sequencing association studies. First, GMMAT fits a GLMM with covariate adjustment and random effects to account for population structure and familial or cryptic relatedness. For GWAS, GMMAT performs score tests for each genetic variant as proposed in Chen et al. (2016) <DOI:10.1016/j.ajhg.2016.02.012>. For candidate gene studies, GMMAT can also perform Wald tests to get the effect size estimate for each genetic variant. For rare variant analysis from sequencing association studies, GMMAT performs the variant Set Mixed Model Association Tests (SMMAT) as proposed in Chen et al. (2019) <DOI:10.1016/j.ajhg.2018.12.012>, including the burden test, the sequence kernel association test (SKAT), SKAT-O and an efficient hybrid test of the burden test and SKAT, based on user-defined variant sets.
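A condensed sketch along the lines of the package vignette, using its bundled toy data; the genotype file path is a placeholder:

    library(GMMAT)
    data(example)          # toy phenotypes and genetic relationship matrix
    attach(example)
    # fit the null GLMM with a random effect accounting for relatedness
    model0 <- glmmkin(disease ~ age + sex, data = pheno, kins = GRM,
                      id = "id", family = binomial(link = "logit"))
    # score tests for each variant in a genotype file (placeholder path)
    glmm.score(model0, infile = "genotypes.gds", outfile = "scores.out")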
This package applies the discrete nonlinear filter (DNF) of Kitagawa (1987) <doi:10.1080/01621459.1987.10478534> to a wide class of stochastic volatility (SV) models with return and volatility jumps, following the work of Bégin and Boudreault (2021) <doi:10.1080/10618600.2020.1840995>, to obtain likelihood evaluations and maximum likelihood parameter estimates. It offers several built-in SV models and a flexible framework for users to create customized models by specifying drift and diffusion functions along with an arrival distribution for the return and volatility jumps. It allows for the estimation of factor models with stochastic volatility (e.g., heteroskedastic volatility CAPM) by incorporating expected return predictors. It also includes functions to compute filtering and prediction distribution estimates, to simulate data from built-in and custom SV models with jumps, and to forecast future returns and volatility values using Monte Carlo simulation from a given SV model.
Fast categorization of items based on external code data identified by regular expressions. A typical use case considers patients with medically coded data, such as codes from the International Classification of Diseases (ICD) or the Anatomical Therapeutic Chemical (ATC) classification system. Functions of the package rely on a triad of objects: (1) case data with unit IDs and possible dates of interest; (2) external code data for corresponding units in (1), with optional dates of interest; and (3) a classification scheme (a 'classcodes' object) with regular expressions to identify and categorize relevant codes from (2). It is easy to introduce new classification schemes ('classcodes' objects) or to use the default schemes included in the package. Use cases include patient categorization based on comorbidity indices such as 'Charlson', 'Elixhauser', 'RxRisk V', or the comorbidity-polypharmacy score (CPS), as well as adverse events after hip and knee replacement surgery.
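A one-call sketch using the example data bundled with the package (ex_people as case data, ex_icd10 as external code data), as in the package README:

    library(coder)
    # Charlson comorbidity categorization of patients by their ICD-10 codes
    categorize(ex_people, codedata = ex_icd10, cc = "charlson",
               id = "name", code = "icd10")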