This package contains six common accuracy measures for evaluating multi-category classification: Hypervolume Under Manifold (HUM), described in Li and Fine (2008) <doi:10.1093/biostatistics/kxm050>; Correct Classification Percentage (CCP), Integrated Discrimination Improvement (IDI), Net Reclassification Improvement (NRI) and R-Squared Value (RSQ), described in Li, Jiang and Fine (2013) <doi:10.1093/biostatistics/kxs047>; and Polytomous Discrimination Index (PDI), described in Van Calster et al. (2012) <doi:10.1007/s10654-012-9733-3> and Li et al. (2018) <doi:10.1177/0962280217692830>. All of these measures, and the mcca package itself, are described in Li, Gao and D'Agostino (2019) <doi:10.1002/sim.8103>.
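As a minimal plain-R illustration (not the mcca interface) of the simplest of these measures, the correct classification percentage is just the proportion of subjects whose predicted category matches the true one:

```r
# Toy data: true and predicted categories for 10 subjects
truth <- factor(c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1))
pred  <- factor(c(1, 2, 3, 1, 3, 3, 1, 2, 2, 1))

# Correct classification percentage (CCP): proportion of exact matches
ccp <- mean(truth == pred)
ccp  # 0.8
```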
Price volatility refers to the degree of variation in a price series over a certain period of time. This volatility is especially noticeable in agricultural commodities, adding uncertainty for farmers, traders, and others in the agricultural supply chain. Four commonly used volatility models are implemented: GARCH, Glosten-Jagannathan-Runkle GARCH (GJR-GARCH), the exponentially weighted moving average (EWMA) model and the Multiplicative Error Model (MEM). PWAVE, a weighted ensemble model based on particle swarm optimization (PSO), is proposed to combine the forecasts obtained from all the candidate models. This package has been developed using the algorithms of Paul et al. <doi:10.1007/s40009-023-01218-x> and Yeasin and Paul (2024) <doi:10.1007/s11227-023-05542-3>.
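For intuition, here is a minimal plain-R sketch (not this package's interface) of the EWMA variance recursion sigma2[t] = lambda * sigma2[t-1] + (1 - lambda) * r[t-1]^2, with the RiskMetrics value lambda = 0.94 assumed:

```r
# Daily log returns of a toy price series
set.seed(1)
r <- rnorm(250, sd = 0.01)

# EWMA variance recursion with smoothing parameter lambda
lambda <- 0.94                      # assumed RiskMetrics default
sigma2 <- numeric(length(r))
sigma2[1] <- var(r)                 # initialize at the sample variance
for (t in 2:length(r)) {
  sigma2[t] <- lambda * sigma2[t - 1] + (1 - lambda) * r[t - 1]^2
}
head(sqrt(sigma2))                  # conditional volatility estimates
```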
Although model selection is ubiquitous in scientific discovery, the stability and uncertainty of the selected model are often hard to evaluate. Characterizing the random behavior of the model selection procedure is key to understanding and quantifying model selection uncertainty. This R package offers several graphical tools to visualize the distribution of the selected model, for example Gplot(), Hplot(), VDSM_scatterplot() and VDSM_heatmap(). To the best of our knowledge, this is the first attempt to visualize such a distribution. For what the distribution of the selected model is and how it works, see Qin, Y. and Wang, L. (2021) "Visualization of Model Selection Uncertainty" <https://homepages.uc.edu/~qinyn/VDSM/VDSM.html>.
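To make the underlying object concrete, the following plain-R sketch (not the VDSM plotting functions) tabulates the empirical distribution of the selected model across bootstrap resamples of a toy regression:

```r
# Illustration of the object being visualized: the distribution of the
# selected model across bootstrap resamples (not the VDSM plotting API).
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 4), n, 4)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(n)
dat <- data.frame(y, x)

selected <- replicate(200, {
  idx <- sample(n, replace = TRUE)
  fit <- step(lm(y ~ ., data = dat[idx, ]), trace = 0)
  paste(sort(names(coef(fit))[-1]), collapse = "+")
})
sort(table(selected), decreasing = TRUE)  # empirical model distribution
```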
This package provides a Bayesian tool to test for population trends and changes in trends under arbitrary designs, including before-after (BA), control-intervention (CI) and before-after-control-intervention (BACI) designs commonly used to assess conservation impact. It infers changes in trends jointly from data obtained with multiple survey methods, as well as from limited and noisy data not necessarily collected in standardized ecological surveys. Observed counts can be modeled as following either a Poisson or a negative binomial model, and both deterministic and stochastic trend models are available. For more details on the model see Singer et al. (2025) <doi:10.1101/2025.01.08.631844>, and the file AUTHORS for a list of copyright holders and contributors.
Estimation of the average treatment effect when controlling for high-dimensional confounders using debiased inverse propensity score weighting (DIPW). DIPW relies on the propensity score following a sparse logistic regression model, but the regression curves are not required to be estimable. Nevertheless, the package also allows users to estimate the regression curves and take the estimated curves as input to our methods. Details of the methodology can be found in Yuhao Wang and Rajen D. Shah (2020), "Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders" <arXiv:2011.08661>. The package relies on the optimisation software MOSEK <https://www.mosek.com/>, which must be installed separately; see the documentation for Rmosek.
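For intuition only, here is a plain (non-debiased) IPW estimate of the average treatment effect in plain R; the package's debiasing step and MOSEK-based optimisation are not reproduced here:

```r
# Plain (non-debiased) IPW estimate of the average treatment effect,
# for intuition only; dipw's debiasing and MOSEK steps are not shown.
set.seed(1)
n <- 500
x <- rnorm(n)
p <- plogis(0.5 * x)                 # true propensity score
a <- rbinom(n, 1, p)                 # treatment assignment
y <- a + x + rnorm(n)                # outcome with true ATE = 1

pi_hat <- fitted(glm(a ~ x, family = binomial))
ate_ipw <- mean(a * y / pi_hat) - mean((1 - a) * y / (1 - pi_hat))
ate_ipw                              # close to 1
```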
User-friendly general package providing standard methods for meta-analysis and supporting Schwarzer, Carpenter and Rücker (2015), "Meta-Analysis with R" <DOI:10.1007/978-3-319-21416-0>: - common effect and random effects meta-analysis; - several plots (forest, funnel, Galbraith / radial, L'Abbe, Baujat, bubble); - three-level meta-analysis model; - generalised linear mixed model; - logistic regression with penalised likelihood for rare events; - Hartung-Knapp method for random effects model; - Kenward-Roger method for random effects model; - prediction interval; - statistical tests for funnel plot asymmetry; - trim-and-fill method to evaluate bias in meta-analysis; - meta-regression; - cumulative meta-analysis and leave-one-out meta-analysis; - import data from RevMan 5; - produce forest plot summarising several (subgroup) meta-analyses.
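For example, a generic inverse-variance meta-analysis of study estimates and standard errors can be run and displayed as a forest plot; this minimal sketch assumes the package's metagen() and forest() functions and uses made-up data:

```r
library(meta)

# Generic inverse-variance meta-analysis of estimates and standard errors
dat <- data.frame(TE    = c(0.12, 0.30, 0.25, 0.05),
                  seTE  = c(0.10, 0.12, 0.08, 0.15),
                  study = paste("Study", 1:4))
m <- metagen(TE = TE, seTE = seTE, studlab = study, data = dat, sm = "MD")
summary(m)   # common effect and random effects results
forest(m)    # forest plot
```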
Computation of predictive information criteria (PIC) from select model object classes for model selection in predictive contexts. In contrast to the more widely used Akaike Information Criterion (AIC), which is derived under the assumption that the target(s) of prediction (i.e., validation data) are independently and identically distributed with the fitting data, the PIC are derived under less restrictive assumptions and thus generalize AIC to the more practically relevant case of training/validation data heterogeneity. The methodology featured in this package is based on Flores (2021) <https://iro.uiowa.edu/esploro/outputs/doctoral/A-new-class-of-information-criteria/9984097169902771?institution=01IOWA_INST>, "A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity".
This package provides a novel meta-learning framework for forecast model selection using time series features. Many applications require a large number of time series to be forecast. Providing better forecasts for these time series is important in decision and policy making. We propose a classification framework which selects forecast models based on features calculated from the time series. We call this framework FFORMS (Feature-based FORecast Model Selection). FFORMS builds a mapping that relates the features of a time series to the best forecast model using a random forest. The seer package implements the FFORMS algorithm. For more details, see our paper at <https://www.monash.edu/business/econometrics-and-business-statistics/research/publications/ebs/wp06-2018.pdf>.
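As a rough plain-R sketch of the idea (not the seer interface), one can compute simple features for each series and train a random forest to map features to a best-model label; the two features and the labels below are toy assumptions:

```r
# Sketch of the FFORMS idea (not the seer API): map simple time-series
# features to a best-model label with a random forest.
library(randomForest)

set.seed(1)
features <- function(x) {
  c(acf1  = acf(x, plot = FALSE)$acf[2],       # lag-1 autocorrelation
    trend = summary(lm(x ~ seq_along(x)))$r.squared)
}
series <- replicate(60, cumsum(rnorm(80)), simplify = FALSE)
X <- t(vapply(series, features, numeric(2)))
best <- factor(sample(c("ets", "arima"), 60, replace = TRUE))  # toy labels
rf <- randomForest(X, best)
predict(rf, X[1:3, , drop = FALSE])
```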
In randomized controlled trials (RCTs), balancing covariates is often one of the most important concerns. The CARM package provides functions to balance covariates and generate allocation sequences by covariate-adjusted Adaptive Randomization via Mahalanobis distance (ARM) for RCTs. For details on what ARM is and how it works, see Y. Qin, Y. Li, W. Ma, H. Yang, and F. Hu (2022), "Adaptive randomization via Mahalanobis distance", Statistica Sinica <doi:10.5705/ss.202020.0440>. In addition, the package is also suitable for the randomization process of multi-arm trials. For details, see Yang H, Qin Y, Wang F, et al. (2023), "Balancing covariates in multi-arm trials via adaptive randomization", Computational Statistics & Data Analysis <doi:10.1016/j.csda.2022.107642>.
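The following plain-R sketch (not the CARM interface) illustrates the core ARM idea: each incoming patient is assigned with a biased coin that favours the arm yielding the smaller Mahalanobis distance between arm covariate means; the coin probability 0.75 is an assumption for illustration:

```r
# Sketch of the ARM idea (not the CARM interface): assign each new
# patient with a biased coin favouring the arm that keeps the
# Mahalanobis distance between arm covariate means small.
set.seed(1)
n <- 100; x <- matrix(rnorm(n * 2), n, 2)
arm <- integer(n); arm[1:2] <- c(1, 2)      # initialize one per arm

mdist <- function(x, arm) {
  d <- colMeans(x[arm == 1, , drop = FALSE]) -
       colMeans(x[arm == 2, , drop = FALSE])
  drop(t(d) %*% solve(cov(x)) %*% d)
}
for (i in 3:n) {
  arm[i] <- 1; d1 <- mdist(x[1:i, ], arm[1:i])
  arm[i] <- 2; d2 <- mdist(x[1:i, ], arm[1:i])
  p <- if (d1 < d2) 0.75 else 0.25          # assumed biased-coin probability
  arm[i] <- sample(1:2, 1, prob = c(p, 1 - p))
}
table(arm)
```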
The single largest source of data on dams in the United States is the National Inventory of Dams (NID) <http://nid.usace.army.mil> from the US Army Corps of Engineers. The entire NID data set cannot be obtained at once, and the NID website limits extraction to a couple of thousand records at a time. Moreover, selected data from the NID's user interface cannot be saved to a file. To make analysis of this data easier, all the data from the NID was extracted manually. Subsequently, the raw data was checked for potential errors and cleaned. This package provides sample cleaned data from the NID and functionality to access the entire cleaned NID data set.
This package performs iterative proportional updating given a seed table and an arbitrary number of marginal distributions. This is commonly used in population synthesis, survey raking, matrix rebalancing, and other applications. For example, a household survey may be weighted to match the known distribution of households by size from the census, or an origin/destination trip matrix might be balanced to match traffic counts. The approach used by this package is based on a paper from Arizona State University (Ye, Xin, et al. (2009) <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf>). Some enhancements have been made to their work, including primary and secondary target balance/importance, general marginal agreement, and weight restriction.
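A classic two-dimensional special case, plain-R iterative proportional fitting of a seed matrix to row and column targets, illustrates the updating that this package generalizes to arbitrary numbers of marginals:

```r
# Classic two-dimensional iterative proportional fitting: scale a seed
# matrix until its margins match the row and column targets.
seed  <- matrix(1, 3, 3)              # uninformative seed table
row_t <- c(20, 30, 50)                # target row margins
col_t <- c(35, 40, 25)                # target column margins

for (iter in 1:50) {
  seed <- seed * (row_t / rowSums(seed))              # match row targets
  seed <- sweep(seed, 2, col_t / colSums(seed), `*`)  # match column targets
}
round(rowSums(seed)); round(colSums(seed))            # both margins reproduced
```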
Data smoothing with penalized splines is a popular method and is well established for one- or two-dimensional covariates. The extension to multiple covariates is straightforward but suffers from exponentially increasing memory requirements and computational complexity. This toolbox provides a matrix-free implementation of a conjugate gradient (CG) method for the regularized least squares problem resulting from tensor product B-spline smoothing with multivariate and scattered data. It further provides matrix-free preconditioned versions of the CG algorithm, where the user can choose between a simpler diagonal preconditioner and an advanced geometric multigrid preconditioner. The main advantage is that all algorithms are performed matrix-free and therefore require only a small amount of memory. For further details, see Siebenborn & Wagner (2021).
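The matrix-free pattern is generic: the CG solver only needs a function that applies the operator to a vector. A minimal plain-R sketch (not this toolbox's functions) for a regularized least squares system:

```r
# Generic matrix-free conjugate gradient: A is available only as a
# function computing A %*% v, never formed explicitly.
cg <- function(Av, b, tol = 1e-8, maxit = 200) {
  x <- numeric(length(b)); r <- b; p <- r
  rs <- sum(r * r)
  for (k in 1:maxit) {
    Ap <- Av(p)
    alpha <- rs / sum(p * Ap)
    x <- x + alpha * p
    r <- r - alpha * Ap
    rs_new <- sum(r * r)
    if (sqrt(rs_new) < tol) break
    p <- r + (rs_new / rs) * p
    rs <- rs_new
  }
  x
}

# Example: solve (t(B) %*% B + lambda * I) x = b without forming the matrix
set.seed(1); B <- matrix(rnorm(200), 20, 10); b <- rnorm(10); lambda <- 0.1
Av <- function(v) drop(crossprod(B, B %*% v)) + lambda * v
x <- cg(Av, b)
max(abs(Av(x) - b))   # small residual
```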
Causal and statistical inference on an arbitrary treatment effect curve requires care in both estimation and inference. This package implements the Method of Direct Estimation and Inference as introduced in "Estimation and Inference on Nonlinear and Heterogeneous Effects" by Ratkovic and Tingley (2023) <doi:10.1086/723811>. The method takes an outcome, a variable of theoretical interest (treatment), and a set of covariates, and then returns a partial derivative (marginal effect) of the treatment variable at each point, along with uncertainty intervals. The approach offers two advances. First, a split-sample approach is used as a guard against over-fitting. Second, the method uses a data-driven interval derived from conformal inference, rather than relying on a normality assumption on the error terms.
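To illustrate the target quantity (not the method itself), the pointwise marginal effect of a treatment variable can be approximated by finite differences on any fitted curve; here a polynomial fit on simulated data:

```r
# Sketch of a pointwise marginal effect by finite differences on a
# fitted curve (illustrating the target quantity, not the MDEI method).
set.seed(1)
n <- 300
t <- runif(n, -2, 2)                       # treatment of interest
y <- sin(t) + rnorm(n, sd = 0.3)
fit <- lm(y ~ poly(t, 5, raw = TRUE))

h  <- 0.01
me <- (predict(fit, data.frame(t = t + h)) -
       predict(fit, data.frame(t = t - h))) / (2 * h)
plot(t, me, ylab = "marginal effect")      # compare with the truth cos(t)
curve(cos(x), add = TRUE, col = "red")
```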
A multiple instance data set consists of many independent subjects (called bags), each composed of several components (called instances). The outcomes of such a data set are binary or categorical responses, and only the subject-level outcomes can be observed. For example, in manufacturing processes, a subject is labeled "defective" if at least one of its own components is defective, and otherwise is labeled "non-defective". The milr package focuses on the predictive model for multiple instance data with binary outcomes and performs maximum likelihood estimation with the Expectation-Maximization algorithm under the framework of logistic regression. Moreover, a LASSO penalty is attached to the likelihood function for simultaneous parameter estimation and variable selection.
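The standard multiple-instance assumption behind this setup is that a bag is positive exactly when at least one of its instances is, so the bag-level probability follows from the instance-level ones. A plain-R sketch (not the milr interface):

```r
# The bag-level probability under the standard multiple-instance
# assumption: a bag is positive iff at least one instance is positive.
instance_prob <- function(x, beta) plogis(x %*% beta)

set.seed(1)
beta  <- c(1, -0.5)
bag_x <- matrix(rnorm(10), 5, 2)        # one bag with 5 instances
p_ij  <- instance_prob(bag_x, beta)     # instance-level probabilities
p_bag <- 1 - prod(1 - p_ij)             # P(at least one defective instance)
p_bag
```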
With the provision of several tools and templates, the MOSAIC project (DFG grant number HO 1937/2-1) supports the implementation of central data management in epidemiological research projects. The MOQA package enables epidemiologists with little or no experience in R to generate basic data quality reports for a wide range of application scenarios. See <https://mosaic-greifswald.de/> for more information. Please read and cite the corresponding open access publication (using the former package name) in Methods of Information in Medicine by M. Bialke, H. Rau, T. Schwaneberg, R. Walk, T. Bahls and W. Hoffmann (2017) <doi:10.3414/ME16-01-0123> <https://methods.schattauer.de/en/contents/most-recent-articles/issue/2483/issue/special/manuscript/27573/show.html>.
An updated and extended version of the spm package, introducing further functions for modern statistical methods (i.e., generalised linear models, glmnet, generalised least squares), thin plate splines, support vector machines, kriging methods (i.e., simple kriging, universal kriging, block kriging, kriging with an external drift), and novel hybrid methods (228 hybrids plus numerous variants) of modern statistical or machine learning methods with mathematical and/or univariate geostatistical methods for spatial predictive modelling. For each method, two functions are provided: one for assessing the predictive errors and accuracy of the method based on cross-validation, and the other for generating spatial predictions. It also contains a couple of functions for data preparation and predictive accuracy assessment.
This package provides three methods proposed by Shang and Apley (2019) <doi:10.1080/00224065.2019.1705207> to generate fully-sequential space-filling designs inside a unit hypercube. A fully-sequential space-filling design is a sequence of nested designs (as the design size varies from one point up to some maximum number of points) with the design points added one at a time, such that the design at each size has good space-filling properties. Two methods target the minimum pairwise distance criterion and generate maximin designs, one of which is more efficient when the design size is large. The third method targets the maximum hole size criterion and uses a heuristic to generate a design closer to a minimax design.
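A simple greedy maximin construction over a candidate set conveys the flavour of fully-sequential designs (this is an illustration, not the paper's algorithms): each added point maximizes its minimum distance to the current design, so every nested subset remains reasonably space-filling:

```r
# Greedy maximin sequential design on a candidate grid: each new point
# maximizes its minimum distance to the points already in the design.
set.seed(1)
cand   <- matrix(runif(2000), ncol = 2)   # candidate points in [0,1]^2
design <- cand[1, , drop = FALSE]         # start from an arbitrary point

for (k in 2:20) {
  d_min <- apply(cand, 1, function(p)
    min(sqrt(colSums((t(design) - p)^2))))
  design <- rbind(design, cand[which.max(d_min), ])
}
plot(design, pch = 19, xlab = "x1", ylab = "x2")
```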
Gradient boosting is a powerful statistical learning method known for its ability to model complex relationships between predictors and outcomes while performing inherent variable selection. However, traditional gradient boosting methods lack flexibility in handling longitudinal data, where within-subject correlations play a critical role. In this package, we propose a novel approach, Mixed Effect Gradient Boosting (MEGB), designed specifically for high-dimensional longitudinal data. MEGB incorporates a flexible semi-parametric model that embeds random effects within the gradient boosting framework, allowing it to account for within-individual covariance over time. Additionally, the method efficiently handles scenarios where the number of predictors greatly exceeds the number of observations (p >> n), making it particularly suitable for genomics data and other large-scale biomedical studies.
Interaction between a genetic variant (e.g., a single nucleotide polymorphism) and an environmental variable (e.g., physical activity) can have a shared effect on multiple phenotypes (e.g., blood lipids). We implement a two-step method to test for an overall interaction effect on multiple phenotypes. In the first step, the method tests for an overall marginal genetic association between the genetic variant and the multivariate phenotype. The genetic variants that show evidence of an overall marginal genetic effect in the first step are prioritized while testing for an overall gene-environment interaction effect in the second step. The methodology is described in: A. Majumdar, K.S. Burch, S. Sankararaman, B. Pasaniuc, W.J. Gauderman, J.S. Witte (2020) <doi:10.1101/2020.07.06.190256>.
Supplements for the book "iTOS" ("Introduction to the Theory of Observational Studies"). Data sets are aHDL from Rosenbaum (2023a) <doi:10.1111/biom.13558> and bingeM from Rosenbaum (2023b) <doi:10.1111/biom.13921>. The function makematch() uses two-criteria matching from Zhang et al. (2023) <doi:10.1080/01621459.2021.1981337> to create the matched data set bingeM from binge. The makematch() function also implements optimal matching (Rosenbaum (1989) <doi:10.2307/2290079>) and matching with fine or near-fine balance (Rosenbaum et al. (2007) <doi:10.1198/016214506000001059> and Yang et al. (2012) <doi:10.1111/j.1541-0420.2011.01691.x>). The book makes use of two other R packages, weightedRank and tightenBlock.
Plug-in and difference-based long-run covariance matrix estimation for time series regression. Two applications to hypothesis testing are also provided: the first tests for structural stability in coefficient functions; the second is aimed at detecting long memory in time series regression. References: Lujia Bai and Weichi Wu (2024) <doi:10.3150/23-BEJ1680>; Zhou Zhou and Wei Biao Wu (2010) <doi:10.1111/j.1467-9868.2010.00743.x>; Jianqing Fan and Wenyang Zhang <doi:10.1214/aos/1017939139>; Lujia Bai and Weichi Wu (2024) <doi:10.1093/biomet/asae013>; Dimitris N. Politis, Joseph P. Romano and Michael Wolf (1999) <doi:10.1007/978-1-4612-1554-7>; Weichi Wu and Zhou Zhou (2018) <doi:10.1214/17-AOS1582>.
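As a point of reference, a textbook Bartlett-kernel (Newey-West type) long-run variance estimate for a univariate series can be written in a few lines of plain R; the plug-in and difference-based estimators in this package are more refined:

```r
# Textbook Bartlett-kernel long-run variance of a univariate series:
# lrv = gamma(0) + 2 * sum_{k=1}^{q} (1 - k/(q+1)) * gamma(k).
lrv_bartlett <- function(x, q) {
  x <- x - mean(x); n <- length(x)
  gam <- function(k) sum(x[(k + 1):n] * x[1:(n - k)]) / n
  gam(0) + 2 * sum(sapply(1:q, function(k) (1 - k / (q + 1)) * gam(k)))
}

set.seed(1)
e <- arima.sim(list(ar = 0.5), n = 1000)   # AR(1) errors
lrv_bartlett(e, q = 10)                    # compare with true value 4
```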
Training of neural networks for classification and regression tasks using mini-batch gradient descent. Special features include a function for training autoencoders, which can be used to detect anomalies, and some related plotting functions. Multiple activation functions are supported, including tanh, relu, step and ramp. For the use of the step and ramp activation functions in detecting anomalies using autoencoders, see Hawkins et al. (2002) <doi:10.1007/3-540-46145-0_17>. Furthermore, several loss functions are supported, including robust ones such as Huber and pseudo-Huber loss, as well as L1 and L2 regularization. The possible options for optimization algorithms are RMSprop, Adam and SGD with momentum. The package contains a vectorized C++ implementation that facilitates fast training through mini-batch learning.
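The training scheme itself is easy to sketch in plain R (this is not the package's interface): mini-batch gradient descent for least squares linear regression, the simplest special case:

```r
# Minimal mini-batch gradient descent for linear regression (a sketch of
# the training scheme, not this package's interface).
set.seed(1)
n <- 1000; x <- cbind(1, rnorm(n)); y <- x %*% c(2, 3) + rnorm(n)
w <- c(0, 0); lr <- 0.1; batch <- 32

for (epoch in 1:50) {
  idx <- sample(n)                         # shuffle each epoch
  for (s in seq(1, n, by = batch)) {
    b <- idx[s:min(s + batch - 1, n)]
    g <- crossprod(x[b, ], x[b, ] %*% w - y[b]) / length(b)  # L2 gradient
    w <- w - lr * drop(g)
  }
}
w   # close to c(2, 3)
```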
This package provides a toolkit for Flux Balance Analysis (FBA) and related metabolic modeling techniques. Functions are provided for: parsing models in tabular format, converting parsed metabolic models to input formats for common linear programming solvers, and evaluating and applying gene-protein-reaction mappings. In addition, there are wrappers to parse a model, select a solver, find the metabolic fluxes, and return the results applied to the original model. Compared to other packages in this field, this package puts a much heavier focus on providing reusable components that can be used in the design of new implementations of metabolic modeling techniques, in particular those that involve large parameter sweeps. For a background on the theory, see "What is flux balance analysis?" <doi:10.1038/nbt.1614>.
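At its core, FBA is the linear program: maximize c'v subject to S v = 0 and bounds on v. A toy three-reaction network solved with the lpSolve package (not this package's wrappers):

```r
# Flux balance analysis as a linear program: maximize c'v subject to
# S v = 0 and flux bounds, here for a toy three-reaction network.
library(lpSolve)

S <- rbind(A = c(1, -1, 0),    # R1 produces A, R2 consumes A
           B = c(0, 1, -1))    # R2 produces B, R3 consumes B
obj   <- c(0, 0, 1)            # maximize flux through R3 (biomass)
const <- rbind(S, diag(3))     # steady state plus upper bounds
dir   <- c("=", "=", rep("<=", 3))
rhs   <- c(0, 0, 10, 100, 100) # uptake R1 capped at 10

sol <- lp("max", obj, const, dir, rhs)
sol$solution                   # optimal fluxes: 10 10 10
```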
This package provides a novel search scheme for the tuning parameter in high-dimensional penalized regression. We propose a new estimate of the regularization parameter based on an estimated lower bound on the proportion of false null hypotheses (Meinshausen and Rice (2006) <doi:10.1214/009053605000000741>). The bound is estimated by applying the empirical null distribution of the higher criticism statistic, a second-level significance test constructed from dependent p-values obtained by a multi-split regression and aggregation method (Jeng, Zhang and Tzeng (2019) <doi:10.1080/01621459.2018.1518236>). The tuning parameter estimate in the penalized regression is then chosen to correspond to the estimated lower bound. Different penalized regression methods are provided in the multi-split algorithm.
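A standard form of the higher criticism statistic can be computed from sorted p-values in a few lines of plain R (variants restrict the range of the maximization; this is an illustration, not this package's implementation):

```r
# A standard form of the higher criticism statistic from sorted p-values:
# HC = max_i sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))).
hc_stat <- function(p) {
  n <- length(p); p <- sort(p); i <- seq_len(n)
  max(sqrt(n) * (i / n - p) / sqrt(p * (1 - p)))
}

set.seed(1)
p_null <- runif(1000)                       # global null
p_alt  <- c(runif(950), rbeta(50, 0.2, 1))  # a few small p-values mixed in
hc_stat(p_null); hc_stat(p_alt)             # larger under the alternative
```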