We propose a new procedure, called model uncertainty variance, which quantifies the uncertainty of model selection for Autoregressive Moving Average (ARMA) models. The model uncertainty variance does not focus on prediction accuracy; instead, it focuses on model selection uncertainty and provides more information about the model selection results. To estimate the model uncertainty variance, we propose a simplified and faster algorithm based on the bootstrap, which is shown to be effective and feasible by Monte Carlo simulation. We have also made some optimizations and adjustments to the Model Confidence Bounds algorithm so that it can be applied to time series model selection. The consistency of the algorithm's results is also verified by Monte Carlo simulation. Please see Li, Y., Luo, Y., Ferrari, D., Hu, X. and Qin, Y. (2019) Model Confidence Bounds for Variable Selection. Biometrics, 75:392-403 <doi:10.1111/biom.13024> for more information.
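A minimal sketch of the general idea, not the package's algorithm: bootstrap the fitted ARMA process, reselect the order by AIC on each replicate, and summarize how often each order is chosen. Only base R functions (arima(), arima.sim(), AIC()) are used; the candidate grid and number of replicates are illustrative choices.

    # Sketch only: bootstrap-based summary of ARMA order-selection variability.
    set.seed(1)
    y <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 200)

    select_order <- function(x, max.p = 2, max.q = 2) {   # AIC-based order selection
      grid <- expand.grid(p = 0:max.p, q = 0:max.q)
      aics <- apply(grid, 1, function(g) {
        fit <- try(arima(x, order = c(g["p"], 0, g["q"])), silent = TRUE)
        if (inherits(fit, "try-error")) Inf else AIC(fit)
      })
      grid[which.min(aics), ]
    }

    fit0 <- arima(y, order = c(1, 0, 1))                  # model fitted to the observed series
    B <- 50                                               # bootstrap replicates (small, for speed)
    picks <- t(replicate(B, {
      y.b <- arima.sim(model = list(ar = coef(fit0)["ar1"], ma = coef(fit0)["ma1"]),
                       n = length(y), sd = sqrt(fit0$sigma2))
      unlist(select_order(y.b))
    }))
    table(paste0("ARMA(", picks[, "p"], ",", picks[, "q"], ")"))  # selection frequencies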
Machine learning algorithms have been used for performing single missing data imputation and, most recently, multiple imputation. However, this is the first attempt at using automated machine learning algorithms for performing both single and multiple imputation. Automated machine learning is a procedure for fine-tuning the model automatically, performing a random search for a model that results in less error without overfitting the data. The main idea is to allow the model to set its own parameters for imputing each variable separately instead of setting fixed predefined parameters to impute all variables of the dataset. Using automated machine learning, the package fine-tunes an Elastic Net (default) or Gradient Boosting, Random Forest, Deep Learning, Extreme Gradient Boosting, or Stacked Ensemble machine learning model (from one or a combination of other supported algorithms) for imputing the missing observations. This procedure has been implemented for the first time by this package and is expected to outperform other packages for imputing missing data that do not fine-tune their models. Multiple imputation is implemented via bootstrapping without letting the duplicated observations harm the cross-validation procedure, which is the way imputed variables are evaluated. Most notably, the package implements an automated procedure for imputing imbalanced data (the class rarity problem), which occurs when a factor variable has a level that is far more prevalent than the other(s). This is known to result in biased predictions and, hence, biased imputation of missing data. However, the autobalancing procedure ensures that, instead of focusing on maximizing accuracy (minimizing classification error) when imputing factor variables, a fairer procedure and imputation method is applied.
Loss reserving generally focuses on identifying a single model that can generate superior predictive performance. However, different loss reserving models specialise in capturing different aspects of loss data. This is recognised in practice in the sense that results from different models are often considered, and sometimes combined. For instance, actuaries may take a weighted average of the prediction outcomes from various loss reserving models, often based on subjective assessments. This package allows for the use of a systematic framework to objectively combine (i.e. ensemble) multiple stochastic loss reserving models such that the strengths offered by different models can be utilised effectively. Our framework is developed in Avanzi et al. (2023). Firstly, our model combination criteria consider the full distributional properties of the ensemble and not just the central estimate, which is of particular importance in the reserving context. Secondly, our framework is tailored to the features inherent in reserving data. These include, for instance, accident, development, calendar, and claim maturity effects. Crucially, the relative importance and scarcity of data across accident periods render the problem distinct from traditional ensemble techniques in statistical learning. Our framework is illustrated with a complex synthetic dataset. In the results, the optimised ensemble outperforms both (i) traditional model selection strategies, and (ii) an equally weighted ensemble. In particular, the improvement occurs not only with central estimates but also with relevant quantiles, such as the 75th percentile of reserves (typically of interest to both insurers and regulators). Reference: Avanzi, B., Li, Y., Wong, B. and Xian, A. (2023) "Ensemble distributional forecasting for insurance loss reserving" <doi:10.48550/arXiv.2206.08541>.
This package provides tools for estimating length-based indicators from length frequency data to assess fish stock status and manage fisheries sustainably. Implements methods from Cope and Punt (2009) <doi:10.1577/C08-025.1> for data-limited stock assessment and Froese (2004) <doi:10.1111/j.1467-2979.2004.00144.x> for detecting overfishing using simple indicators. Key functions include: FrequencyTable(): Calculates a frequency table from the collected length data and also extracts the length frequency data from the frequency table using the upper limit of each length range; the bin width for class intervals can be supplied as a numeric value, and if not provided it is automatically calculated using Sturges (1926) <doi:10.1080/01621459.1926.10502161> formula. CalPar(): Calculates various lengths used in fish stock assessment as biological length indicators, such as asymptotic length (Linf), maximum length (Lmax), length at sexual maturity (Lm), and optimal length (Lopt). FishPar(): Calculates the length-based indicators (LBIs) proposed by Froese (2004) <doi:10.1111/j.1467-2979.2004.00144.x>, such as the percentage of mature fish (Pmat), percentage of optimal length fish (Popt), percentage of mega spawners (Pmega), and the sum of these as Pobj. This function also estimates confidence intervals for the different lengths, visualizes length frequency distributions, and provides data frames containing the calculated values. FishSS(): Makes decisions based on the criteria of Cope and Punt (2009) <doi:10.1577/C08-025.1> and the parameters calculated by FishPar() (e.g., Pobj, Pmat, Popt, LM_ratio) to determine stock status relative to the target spawning biomass (TSB40) and limit spawning biomass (LSB25). These tools support fisheries management decisions by providing robust, data-driven insights.
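A hedged workflow sketch: the function names come from the description above, but the argument names and bare one-argument calls are assumptions for illustration only, not the package's documented interface.

    # Hypothetical workflow; consult each function's help page for the actual arguments.
    set.seed(1)
    lengths <- rnorm(500, mean = 45, sd = 8)          # simulated fish lengths (cm)
    ft   <- FrequencyTable(data = lengths)            # bin width defaults to Sturges' formula
    pars <- CalPar(data = lengths)                    # Linf, Lmax, Lm, Lopt
    lbi  <- FishPar(data = lengths)                   # Pmat, Popt, Pmega, Pobj with CIs and plots
    FishSS(data = lbi)                                # stock status vs. TSB40 / LSB25 criteria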
Pre-made models that can be rapidly tailored to various chemicals and species using chemical-specific in vitro data and physiological information. These tools allow incorporation of chemical toxicokinetics ("TK") and in vitro-in vivo extrapolation ("IVIVE") into bioinformatics, as described by Pearce et al. (2017) (<doi:10.18637/jss.v079.i04>). Chemical-specific in vitro data characterizing toxicokinetics have been obtained from relatively high-throughput experiments. The chemical-independent ("generic") physiologically-based ("PBTK") and empirical (for example, one compartment) "TK" models included here can be parameterized with in vitro data or in silico predictions which are provided for thousands of chemicals, multiple exposure routes, and various species. High throughput toxicokinetics ("HTTK") is the combination of in vitro data and generic models. We establish the expected accuracy of HTTK for chemicals without in vivo data through statistical evaluation of HTTK predictions for chemicals where in vivo data do exist. The models are systems of ordinary differential equations that are developed in MCSim and solved using compiled (C-based) code for speed. A Monte Carlo sampler is included for simulating human biological variability (Ring et al., 2017 <doi:10.1016/j.envint.2017.06.004>) and propagating parameter uncertainty (Wambaugh et al., 2019 <doi:10.1093/toxsci/kfz205>). Empirically calibrated methods are included for predicting tissue:plasma partition coefficients and volume of distribution (Pearce et al., 2017 <doi:10.1007/s10928-017-9548-7>). These functions and data provide a set of tools for using IVIVE to convert concentrations from high-throughput screening experiments (for example, Tox21, ToxCast) to real-world exposures via reverse dosimetry (also known as "RTK") (Wetmore et al., 2015 <doi:10.1093/toxsci/kfv171>).
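A few illustrative httk calls (argument names follow recent httk documentation; treat this as a sketch and check the manual for your installed version):

    library(httk)
    # Analytic steady-state plasma concentration (uM) for a 1 mg/kg/day oral dose:
    calc_analytic_css(chem.name = "bisphenol a", output.units = "uM")
    # Reverse dosimetry: oral dose (mg/kg/day) whose steady-state concentration reaches
    # 1 uM at the 95th percentile of simulated human variability:
    calc_mc_oral_equiv(conc = 1, chem.name = "bisphenol a", which.quantile = 0.95)
    # Full PBTK time course over 10 days of once-daily dosing:
    out <- solve_pbtk(chem.name = "bisphenol a", days = 10, doses.per.day = 1)
    head(out)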
Multivariate Information-based Inductive Causation, better known by its acronym MIIC, is a causal discovery method, based on information theory principles, which learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables. Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature of causality in observational data. The recent more interpretable MIIC extension (iMIIC) further distinguishes genuine causes from putative and latent causal effects, while scaling to very large datasets (hundreds of thousands of samples). Since version 2.0, MIIC also includes a temporal mode (tMIIC) to learn temporal causal graphs from stationary time series data. MIIC has been applied to a wide range of biological and biomedical data, such as single cell gene expression data, genomic alterations in tumors, live-cell time-lapse imaging data (CausalXtract), as well as medical records of patients. MIIC brings unique insights based on causal interpretation and could be used in a broad range of other data science domains (technology, climatology, economy, ...). For more information, you can refer to: Simon et al., eLife 2024, <doi:10.1101/2024.02.06.579177>, Ribeiro-Dantas et al., iScience 2024, <doi:10.1016/j.isci.2024.109736>, Cabeli et al., NeurIPS 2021, <https://why21.causalai.net/papers/WHY21_24.pdf>, Cabeli et al., PLoS Comput. Biol. 2020, <doi:10.1371/journal.pcbi.1007866>, Li et al., NeurIPS 2019, <https://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets>, Verny et al., PLoS Comput. Biol. 2017, <doi:10.1371/journal.pcbi.1005662>, Affeldt et al., UAI 2015, <https://auai.org/uai2015/proceedings/papers/293.pdf>. Changes from the previous 1.5.3 release on CRAN are available at <https://github.com/miicTeam/miic_R_package/blob/master/NEWS.md>.
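A minimal sketch using the haematopoiesis dataset shipped with miic; the input_data argument name follows recent documentation and may differ in older releases.

    library(miic)
    data(hematoData)                       # single-cell gene expression data bundled with miic
    res <- miic(input_data = hematoData)   # network reconstruction from observational data
    str(res, max.level = 1)                # inspect returned components (edge summaries, adjacency, ...)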
In many cases, experiments must be repeated across multiple seasons or locations to ensure applicability of findings. A single experiment conducted in one location and season may yield limited conclusions, as results can vary under different environmental conditions. In agricultural research, treatment × location and treatment × season interactions play a crucial role. Analyzing a series of experiments across diverse conditions allows for more generalized and reliable recommendations. The CANE package facilitates the pooled analysis of experiments conducted over multiple years, seasons, or locations. It is designed to assess treatment interactions with environmental factors (such as location and season) using various experimental designs. The package supports pooled analysis of variance (ANOVA) for the following designs: (1) PooledCRD(): completely randomized design; (2) PooledRBD(): randomized block design; (3) PooledLSD(): Latin square design; (4) PooledSPD(): split plot design; and (5) PooledStPD(): strip plot design. Each function provides the following outputs: (i) individual ANOVA tables based on independent analysis for each location or year; (ii) testing of homogeneity of error variances among distinct locations using Bartlett's chi-square test; (iii) if Bartlett's test is significant, Aitken's transformation, defined as the ratio of the response to the square root of the error mean square, is applied to the response variable; otherwise, the data are used as is; (iv) combined analysis to obtain a pooled ANOVA table; (v) multiple comparison tests, including Tukey's honestly significant difference (HSD) test, Duncan's multiple range test (DMRT), and the least significant difference (LSD) test, for treatment comparisons. The statistical theory and steps of analysis of these designs are available in Dean et al. (2017) <doi:10.1007/978-3-319-52250-0> and Ruíz et al. (2024) <doi:10.1007/978-3-031-65575-3>. By broadening the scope of experimental conclusions, CANE enables researchers to derive robust, widely applicable recommendations. This package is particularly valuable in agricultural research, where accounting for treatment × location and treatment × season interactions is essential for ensuring the validity of findings across multiple settings.
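Not the CANE interface: a base-R sketch, with assumed column names, of the kind of pooled randomized-block analysis the package automates (a variance-homogeneity check, a combined ANOVA with a treatment by location interaction, and Tukey comparisons).

    # Base-R illustration of a pooled analysis across locations (not CANE's own functions).
    set.seed(1)
    dat <- expand.grid(treatment = factor(1:4), block = factor(1:3),
                       location  = factor(c("L1", "L2")))
    dat$yield <- rnorm(nrow(dat), mean = 10 + as.numeric(dat$treatment))

    bartlett.test(yield ~ location, data = dat)  # rough stand-in for the error-variance homogeneity check

    fit <- aov(yield ~ location + location:block + treatment + treatment:location,
               data = dat)                       # pooled ANOVA with treatment x location interaction
    summary(fit)
    TukeyHSD(fit, which = "treatment")           # multiple comparisons of treatment means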
EQ-5D is a popular health-related quality of life instrument used in the clinical and economic evaluation of health care. Developed by the EuroQol group <https://euroqol.org/>, the instrument consists of two components: health state description and evaluation. For the description component, a subject self-rates their health in terms of five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression, using either a three-level (EQ-5D-3L, <https://euroqol.org/information-and-support/euroqol-instruments/eq-5d-3l/>) or a five-level (EQ-5D-5L, <https://euroqol.org/information-and-support/euroqol-instruments/eq-5d-5l/>) scale. Frequently the scores on these five dimensions are converted to a single utility index using country-specific value sets, which can be used in the clinical and economic evaluation of health care as well as in population health surveys. The eq5d package provides methods to calculate index scores from a subject's dimension scores. Included are 32 TTO and 11 VAS EQ-5D-3L value sets, including those for countries in Szende et al. (2007) <doi:10.1007/1-4020-5511-0> and Szende et al. (2014) <doi:10.1007/978-94-007-7596-1>, 46 EQ-5D-5L EQ-VT value sets, the EQ-5D-5L crosswalk value sets developed by van Hout et al. (2012) <doi:10.1016/j.jval.2012.02.008>, the crosswalk value sets for Bermuda, Jordan and Russia, and the reverse crosswalk value sets. 10 EQ-5D-Y value sets are also included, as are the NICE DSU age-sex based EQ-5D-3L to EQ-5D-5L and EQ-5D-5L to EQ-5D-3L mappings. Methods are also included for the analysis of EQ-5D profiles, including those from the book "Methods for Analyzing and Reporting EQ-5D data" by Devlin et al. (2020) <doi:10.1007/978-3-030-47622-9>. Additionally, a shiny web tool is included to enable the calculation, visualisation and automated statistical analysis of EQ-5D data via a web browser using EQ-5D dimension scores stored in CSV or Excel files.
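A short example with the eq5d() function (the country, version and type arguments follow the package documentation; verify against your installed version):

    library(eq5d)
    # Single EQ-5D-3L profile converted with the UK TTO value set:
    eq5d(scores = c(MO = 1, SC = 2, UA = 3, PD = 2, AD = 1),
         country = "UK", version = "3L", type = "TTO")
    # Several EQ-5D-5L respondents stored as a data frame of dimension scores:
    df <- data.frame(MO = c(1, 2), SC = c(1, 3), UA = c(2, 5), PD = c(3, 4), AD = c(1, 2))
    eq5d(scores = df, country = "Canada", version = "5L", type = "VT")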
Understanding the current status of forest resources is essential for monitoring changes in forest ecosystems and generating related statistics. In South Korea, the National Forest Inventory (NFI) surveys over 4,500 sample plots nationwide every five years and records 70 items, including forest stand, forest resource, and forest vegetation surveys. Many researchers use NFI as the primary data for research, such as biomass estimation or analyzing the importance value of each species over time and space, depending on the research purpose. However, the large volume of accumulated forest survey data from across the country can make it challenging to manage and utilize such a vast dataset. To address this issue, we developed an R package that efficiently handles large-scale NFI data across time and space. The package offers a comprehensive workflow for NFI data analysis. It starts with data processing, where the read_nfi() function reconstructs NFI data according to the researcher's needs while performing basic integrity checks for data quality. Following this, the package provides analytical tools that operate on the verified data. These include functions like summary_nfi() for summary statistics, diversity_nfi() for biodiversity analysis, iv_nfi() for calculating species importance values, and biomass_nfi() and cwd_biomass_nfi() for biomass estimation. Finally, for visualization, the tsvis_nfi() function generates graphs and maps, allowing users to visualize forest ecosystem changes across various spatial and temporal scales. This integrated approach and its specialized functions can enhance the efficiency of processing and analyzing NFI data, providing researchers with insights into forest ecosystems. The NFI Excel files (.xlsx) are not included in the R package and must be downloaded separately. Users can access these NFI Excel files by visiting the Korea Forest Service Forestry Statistics Platform <https://kfss.forest.go.kr/stat/ptl/article/articleList.do?curMenu=11694&bbsId=microdataboard> to download the annual NFI Excel files, which are bundled in .zip archives. Please note that this website is only available in Korean, and direct download links can be found in the notes section of the read_nfi() function documentation.
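A hedged workflow sketch: the function names come from the description above, but the file path and the bare one-argument calls are assumptions for illustration, not the package's documented interface.

    # Hypothetical workflow; consult each function's help page for the actual arguments.
    nfi <- read_nfi("path/to/unzipped/NFI/xlsx/files")   # import + basic integrity checks
    summary_nfi(nfi)                                     # plot- and stand-level summary statistics
    diversity_nfi(nfi)                                   # biodiversity indices
    iv_nfi(nfi)                                          # species importance values
    biomass_nfi(nfi)                                     # biomass estimation
    tsvis_nfi(nfi)                                       # time-series / spatial visualisation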
This package contains heteroscedastic ANOVA tests for normally and two-parameter exponentially distributed populations. For normal distributions: the Alexander-Govern test by Alexander and Govern (1994) <doi:10.2307/1165140>, the Alvandi et al. Generalized F test by Alvandi et al. (2012) <doi:10.1080/03610926.2011.573160>, the Approximate F test by Asiribo and Gurland (1990) <doi:10.1080/03610929008830427>, the Box F test by Box (1954) <doi:10.1214/aoms/1177728786>, the Brown-Forsythe test by Brown and Forsythe (1974) <doi:10.2307/1267501>, the B2 test by Ozdemir and Kurt (2006) <http://sjam.selcuk.edu.tr/sjam/article/view/174>, the Cochran F test by Cochran (1937) <https://www.jstor.org/stable/pdf/2984123.pdf>, the Fiducial Approach test by Li et al. (2011) <doi:10.1016/j.csda.2010.12.009>, the Generalized F test by Weerahandi (1995) <doi:10.2307/2532947>, the Johansen F test by Johansen (1980) <doi:10.1093/biomet/67.1.85>, the Modified Brown-Forsythe test by Mehrotra (1997) <doi:10.1080/03610919708813431>, the Modified Welch test by Hartung et al. (2002) <doi:10.1007/s00362-002-0097-8>, the One-Stage test by Chen and Chen (1998) <doi:10.1080/03610919808813501>, the One-Stage Range test by Chen and Chen (2000) <doi:10.1080/01966324.2000.10737505>, the Parametric Bootstrap test by Krishnamoorthy et al. (2007) <doi:10.1016/j.csda.2006.09.039>, the Permutation F test by Berry and Mielke (2002) <doi:10.2466/pr0.2002.90.2.495>, the Scott-Smith test by Scott and Smith (1971) <doi:10.2307/2346757>, the Welch test by Welch (1951) <doi:10.2307/2332579>, and the Welch-Aspin test by Aspin (1948) <doi:10.1093/biomet/35.1-2.88>. These tests are used to test the equality of group means under unequal variances. Also, a modified version of the Generalized F test is provided to test the equality of non-normal group means under unequal variances, and a revised version of the Generalized F test is given to test the equality of non-normal group means in the presence of skewness. Furthermore, the package includes procedures for testing the equality of several two-parameter exponentially distributed population means under unequal scale parameters, such as the generalized p-value, parametric bootstrap and fiducial approach tests by Malekzadeh and Jafari (2019) <doi:10.1080/03610918.2018.1538452>. There is also the Hsieh test by Hsieh (1986) <doi:10.2307/1270452> for testing the equality of location parameters of two-parameter exponentially distributed populations under unequal scale parameters.
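For orientation, base R's oneway.test() illustrates the unequal-variance one-way ANOVA setting that these tests address (this is Welch's heteroscedastic F test, not a function from this package):

    set.seed(1)
    g <- factor(rep(1:3, each = 15))
    y <- rnorm(45, mean = rep(c(10, 11, 12), each = 15),
               sd  = rep(c(1, 2, 4),  each = 15))   # groups with unequal variances
    oneway.test(y ~ g, var.equal = FALSE)           # Welch's test of equal means under heteroscedasticity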
Enables: (1) plotting two-dimensional confidence regions, (2) coverage analysis of confidence region simulations, (3) calculating confidence intervals and the associated actual coverage for binomial proportions, (4) calculating the support values and the probability mass function of the Kaplan-Meier product-limit estimator, and (5) plotting the actual coverage function associated with a confidence interval for the survivor function from a randomly right-censored data set. Each is given in greater detail next. (1) Plots the two-dimensional confidence region for probability distribution parameters (supported distribution suffixes: cauchy, gamma, invgauss, logis, llogis, lnorm, norm, unif, weibull) corresponding to a user-given complete or right-censored dataset and level of significance. The crplot() algorithm plots more points in areas of greater curvature to ensure a smooth appearance throughout the confidence region boundary. An alternative heuristic plots a specified number of points at roughly uniform intervals along its boundary. Both heuristics build upon the radial profile log-likelihood ratio technique for plotting confidence regions given by Jaeger (2016) <doi:10.1080/00031305.2016.1182946>, and are detailed in a publication by Weld et al. (2019) <doi:10.1080/00031305.2018.1564696>. (2) Performs confidence region coverage simulations for a random sample drawn from a user-specified parametric population distribution, or for a user-specified dataset and point of interest, with coversim(). (3) Calculates confidence interval bounds for a binomial proportion with binomTest(), calculates the actual coverage with binomTestCoverage(), and plots the actual coverage with binomTestCoveragePlot(). Calculates confidence interval bounds for the binomial proportion using an ensemble of constituent confidence intervals with binomTestEnsemble(). Calculates confidence interval bounds for the binomial proportion using a complete enumeration of all possible transitions from one actual coverage acceptance curve to another, which minimizes the root mean square error for n <= 15 and follows the transitions for well-known confidence intervals for n > 15, using binomTestMSE(). (4) The km.support() function calculates the support values of the Kaplan-Meier product-limit estimator for a given sample size n using an induction algorithm described in Qin et al. (2023) <doi:10.1080/00031305.2022.2070279>. The km.outcomes() function generates a matrix containing all possible outcomes (all possible sequences of failure times and right-censoring times) of the value of the Kaplan-Meier product-limit estimator for a particular sample size n. The km.pmf() function generates the probability mass function for the support values of the Kaplan-Meier product-limit estimator for a particular sample size n and probability h of observing a failure at the time of interest, expressed as the cumulative probability percentile associated with X = min(T, C), where T is the failure time and C is the censoring time, under a random-censoring scheme. The km.surv() function generates multiple probability mass functions of the Kaplan-Meier product-limit estimator for the same arguments as those given for km.pmf(). (5) The km.coverage() function plots the actual coverage function associated with a confidence interval for the survivor function from a randomly right-censored data set for one or more of the following confidence intervals: Greenwood, log-minus-log, Peto, arcsine, and exponential Greenwood. The actual coverage function is plotted for a small number of items on test, stated coverage, failure rate, and censoring rate. The km.coverage() function can print an optional table containing all possible failure/censoring orderings, along with their contribution to the actual coverage function.
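A few short examples (assuming the crplot(dataset, alpha, distn), binomTest(n, x, alpha, intervalType) and km.support(n) signatures described in the package documentation; verify against your installed version):

    library(conf)
    # Two-dimensional 95% confidence region for Weibull parameters from a small complete sample:
    crplot(dataset = c(1.2, 1.8, 2.3, 3.1, 4.2, 5.0), alpha = 0.05, distn = "weibull")
    # Clopper-Pearson confidence interval for a binomial proportion (7 successes in 12 trials):
    binomTest(n = 12, x = 7, alpha = 0.05, intervalType = "Clopper-Pearson")
    # Support values of the Kaplan-Meier product-limit estimator for sample size n = 4:
    km.support(n = 4)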
An ODBC database interface.
Queries data from RDAP servers.
Color palettes from famous artists and paintings.
Communications simulation package supporting forward error correction.
Client for the Ocean Biodiversity Information System (<https://obis.org>).
This package provides a common framework for calculating distance matrices.
This package provides functions for performing spatial microsimulation ('raking') in R.
The Rmisc package contains functions for data analysis and utility operations.
Constrained clustering, transfer functions, and other methods for analysing Quaternary science data.
This package provides recursive partitioning functions for classification, regression and survival trees.
Create production-ready Rich Text Format (RTF) tables and figures with flexible formatting.
Interactive viewing and exploration of graphs, connecting R to Cytoscape.js, using websockets.