Statistical procedures to perform stability analysis in plant breeding and to identify stable genotypes under diverse environments. The package computes the coefficient of homeostaticity of Khangildin et al. (1979), the variance of specific adaptive ability of Kilchevsky & Khotyleva (1989), the weighted homeostaticity index of Martynov (1990), the steadiness of stability index of Udachin (1990), the superiority measure of Lin & Binns (1988) <doi:10.4141/cjps88-018>, regression on an environmental index of Eberhart & Russell (1966) <doi:10.2135/cropsci1966.0011183X000600010011x>, Tai's (1971) stability parameters <doi:10.2135/cropsci1971.0011183X001100020006x>, the stability variance of Shukla (1972) <doi:10.1038/hdy.1972.87>, the ecovalence of Wricke (1962), the nonparametric stability parameters of Nassar & Huehn (1987) <doi:10.2307/2531947>, and Francis & Kannenberg's (1978) parameters of stability <doi:10.4141/cjps78-157>.
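As an illustration of one of these statistics, Wricke's ecovalence for genotype i is the sum over environments of the squared genotype-by-environment interaction residuals, W_i = sum_j (x_ij - xbar_i. - xbar_.j + xbar_..)^2. A minimal base-R sketch of this computation (illustrative only; the package's own function names are not given in this description):

```r
# Wricke's (1962) ecovalence from a genotype x environment matrix of means;
# rows are genotypes, columns are environments.
ecovalence <- function(x) {
  int <- sweep(sweep(x, 1, rowMeans(x)), 2, colMeans(x)) + mean(x)  # GxE residuals
  rowSums(int^2)  # ecovalence per genotype; smaller = more stable
}

set.seed(1)
yield <- matrix(rnorm(20, 50, 5), nrow = 4,
                dimnames = list(paste0("G", 1:4), paste0("E", 1:5)))
ecovalence(yield)
```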
This package provides a tool for matching ICD-10 codes to corresponding Clinical Classification Software Refined (CCSR) codes. The main function, CCSRfind(), identifies each CCSR code that applies to an individual given their diagnosis codes, and also provides a summary of the CCSR codes matched to a dataset. The package contains 3 datasets: DXCCSR (mapping of ICD-10 codes to CCSR codes), Legend (conversion of DXCCSR to a CCSRfind-usable format for CCSR codes with at most 1000 ICD-10 diagnosis codes), and LegendExtend (the same conversion for CCSR codes with more than 1000 ICD-10 diagnosis codes). The disc() function applies grepl() ('base') to multiple columns and is used in CCSRfind().
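A hypothetical usage sketch; the description names CCSRfind() but not its arguments, so the argument names below are assumptions:

```r
library(CCSRfind)  # assumed package name, matching its main function

# Hypothetical example data: one row per patient, ICD-10 codes in columns.
dx <- data.frame(id  = 1:3,
                 dx1 = c("E119", "I10", "J449"),
                 dx2 = c("N183", NA, "I2510"))

# Assumed call: map each patient's diagnosis codes to CCSR categories and
# summarize the CCSR codes matched across the dataset.
# res <- CCSRfind(df = dx, dxcols = c("dx1", "dx2"))
```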
Estimation of the generalized beta distribution of the second kind (GB2) and related models using grouped data in the form of income shares. The GB2 family is a general class of distributions that provides an accurate fit to income data. GB2group includes functions to estimate the GB2, the Singh-Maddala, the Dagum, the Beta 2, the Lognormal and the Fisk distributions. GB2group deploys two different econometric strategies to estimate these parametric distributions: the equally weighted minimum distance (EWMD) estimator and the optimally weighted minimum distance (OMD) estimator. Asymptotic standard errors are reported for the OMD estimates. Standard errors of the EWMD estimates are obtained by Monte Carlo simulation. See Jorda et al. (2018) <arXiv:1808.09831> for a detailed description of the estimation procedure.
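For reference, the GB2 density being fitted is f(x; a, b, p, q) = a x^{ap-1} / (b^{ap} B(p, q) [1 + (x/b)^a]^{p+q}) for x > 0. A minimal base-R implementation of this density (illustrative only, not GB2group's interface):

```r
# GB2 density with shape parameters a, p, q > 0 and scale b > 0.
dgb2 <- function(x, a, b, p, q) {
  a * x^(a * p - 1) / (b^(a * p) * beta(p, q) * (1 + (x / b)^a)^(p + q))
}

# The nested models arise by restriction, e.g. Singh-Maddala is GB2 with
# p = 1, Dagum is q = 1, and Fisk is p = q = 1:
curve(dgb2(x, a = 2, b = 1, p = 1, q = 3), from = 0.01, to = 5,
      ylab = "density", main = "Singh-Maddala as a GB2 special case")
```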
Corset plots are a visualization technique used strictly to visualize repeat measures at 2 time points (such as pre- and post- data). The distribution of measurements is visualized at each time point, whilst the trajectories of individual change are visualized by connecting the pre- and post- values linearly. These lines can be coloured to represent the magnitude of change, or another user-defined value. This method of visualization is ideal for showing the heterogeneity of data, including differences by sub-groups. The package relies on ggplot2, allowing for easy integration so that users can customize their visualizations as required. Users can create corset plots from data in either wide or long format using the functions gg_corset() or gg_corset_elongated(), respectively.
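A sketch of a wide-format corset plot; gg_corset() is named above, but the argument names used below are assumptions to be checked against the package's documentation:

```r
library(ggcorset)  # assumed package name for gg_corset()

# Hypothetical wide-format data: one row per subject, pre/post measurements.
set.seed(42)
dat <- data.frame(id   = 1:20,
                  pre  = rnorm(20, 10, 2),
                  post = rnorm(20, 12, 2))

# Assumed argument names: y_var1/y_var2 are the two time points and group
# identifies the subject whose change trajectory is drawn.
p <- gg_corset(data = dat, y_var1 = "pre", y_var2 = "post", group = "id")
p + ggplot2::theme_minimal()  # a ggplot object, so layers can be added
```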
Statistical hypothesis tests for loadings in principal component analysis (PCA) (Yamamoto, H. et al. (2014) <doi:10.1186/1471-2105-15-51>), orthogonal smoothed PCA (OS-PCA) (Yamamoto, H. et al. (2021) <doi:10.3390/metabo11030149>), one-sided kernel PCA (Yamamoto, H. (2023) <doi:10.51094/jxiv.262>), partial least squares (PLS) and PLS discriminant analysis (PLS-DA) (Yamamoto, H. et al. (2009) <doi:10.1016/j.chemolab.2009.05.006>), PLS with rank order of groups (PLS-ROG) (Yamamoto, H. (2017) <doi:10.1002/cem.2883>), regularized canonical correlation analysis discriminant analysis (RCCA-DA) (Yamamoto, H. et al. (2008) <doi:10.1016/j.bej.2007.12.009>), and multiset PLS and PLS-ROG (Yamamoto, H. (2022) <doi:10.1101/2022.08.30.505949>).
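The description does not name the package's test functions, so the sketch below only shows, in base R, the PCA loadings to which such tests would apply:

```r
# PCA loadings with base R; the package's hypothesis tests assess the
# statistical significance of loadings like these.
X <- scale(iris[, 1:4])   # autoscaled data matrix
pca <- prcomp(X)
pca$rotation[, 1:2]       # eigenvector loadings for the first two PCs
# (some conventions rescale these by the component standard deviations)
```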
This package performs the O2PLS data integration method for two datasets, yielding joint and data-specific parts for each dataset. The algorithm automatically switches to a memory-efficient approach to fit O2PLS to high-dimensional data. It provides a rigorous cross-validation method, as well as a faster alternative, for selecting the number of components, together with functions to report proportions of explained variation and to construct plots of the results. See the software article by el Bouhaddani et al. (2018) <doi:10.1186/s12859-018-2371-3>, and Trygg and Wold (2003) <doi:10.1002/cem.775>. It also performs Sparse Group (Penalized) O2PLS, see Gu et al. (2020) <doi:10.1186/s12859-021-03958-3>, with cross-validation for the degree of sparsity.
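A minimal fitting sketch; I believe the package is OmicsPLS and its main function is o2m() with joint and data-specific component counts n, nx and ny, but this should be verified against the documentation:

```r
library(OmicsPLS)  # assumed package name implementing o2m()

set.seed(2)
X <- matrix(rnorm(100 * 10), 100, 10)  # first data set
Y <- matrix(rnorm(100 * 8), 100, 8)    # second data set

# n joint components, nx/ny data-specific (orthogonal) components:
fit <- o2m(X, Y, n = 2, nx = 1, ny = 1)
summary(fit)  # proportions of explained variation, among other output
```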
The hybrid model is a highly effective forecasting approach that integrates decomposition techniques with machine learning to enhance time series prediction accuracy. Each decomposition technique breaks down a time series into multiple intrinsic mode functions (IMFs), which are then individually modeled and forecasted using machine learning algorithms. The final forecast is obtained by aggregating the predictions of all IMFs, producing an ensemble output for the time series. The performance of the developed models is evaluated using international monthly maize price data, assessed through metrics such as root mean squared error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). For method details see Choudhary, K. et al. (2023). <https://ssca.org.in/media/14_SA44052022_R3_SA_21032023_Girish_Jha_FINAL_Finally.pdf>.
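The decompose-forecast-aggregate logic can be sketched generically; decompose_ts() and forecast_imf() below are hypothetical stand-ins for a decomposition routine and a machine-learning forecaster, not functions from this package:

```r
# Generic hybrid-forecast skeleton: decompose a series into IMFs, forecast
# each IMF separately, then sum the IMF forecasts into the ensemble output.
hybrid_forecast <- function(y, h, decompose_ts, forecast_imf) {
  imfs  <- decompose_ts(y)                    # list of intrinsic mode functions
  preds <- lapply(imfs, forecast_imf, h = h)  # h-step-ahead forecast per IMF
  Reduce(`+`, preds)                          # aggregate into the final forecast
}
```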
This package performs robust multiple testing for means in the presence of known and unknown latent factors, as presented in Fan et al. (2019) "FarmTest: Factor-Adjusted Robust Multiple Testing With Approximate False Discovery Control" <doi:10.1080/01621459.2018.1527700>. It implements a series of adaptive Huber methods combined with fast data-driven tuning schemes proposed in Ke et al. (2019) "User-Friendly Covariance Estimation for Heavy-Tailed Distributions" <doi:10.1214/19-STS711> to estimate model parameters and construct test statistics that are robust against heavy-tailed and/or asymmetric error distributions. Extensions to two-sample simultaneous mean comparison problems are also included. As by-products, this package contains functions that compute adaptive Huber mean, covariance and regression estimators that are of independent interest.
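A minimal sketch; I believe the package's main interface is farm.test(), but the exact arguments and output fields should be checked against its documentation:

```r
library(FarmTest)

set.seed(3)
n <- 50; p <- 100
X <- matrix(rt(n * p, df = 3), n, p)  # heavy-tailed data, p hypotheses

# One-sample factor-adjusted robust multiple testing of the p means:
out <- farm.test(X)
out  # object holding test statistics, p-values and rejection decisions
```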
Novel method to unbiasedly include studies with Non-statistically Significant Unreported Effects (NSUEs) in a meta-analysis. First, the function calculates the interval where the unreported effects (e.g., t-values) should lie according to the threshold of statistical significance used in each study. Afterward, the method uses maximum likelihood techniques to impute the expected effect size of each study with NSUEs, accounting for between-study heterogeneity and potential covariates. Multiple imputations of the NSUEs are then randomly created based on the expected value, variance, and statistical significance bounds. Finally, it conducts a restricted maximum likelihood random-effects meta-analysis separately for each set of imputations and combines the estimates obtained from these meta-analyses. Please read the reference in metansue for details of the procedure.
This package provides a collection of white noise hypothesis tests for functional time series and related visualizations. These include tests based on the norms of autocovariance operators that are built under both strong and weak white noise assumptions. Additionally, tests based on the spectral density operator and on principal component dimension reduction are included, which are built under strong white noise assumptions. The package also provides goodness-of-fit tests for functional autoregressive models of order 1. These methods are described in Kokoszka et al. (2017) <doi:10.1016/j.jmva.2017.08.004>, Characiejus and Rice (2019) <doi:10.1016/j.ecosta.2019.01.003>, Gabrys and Kokoszka (2007) <doi:10.1198/016214507000001111>, and Kim et al. (2023) <doi:10.1214/23-SS143>, respectively.
Simulates age-at-onset traits associated with a segregating major gene in family data obtained from population-based, clinic-based, or multi-stage designs. Appropriate ascertainment correction is utilized to estimate age-dependent penetrance functions either parametrically from the fitted model or nonparametrically from the data. The Expectation-Maximization algorithm can infer missing genotypes and carrier probabilities estimated from a family's genotype and phenotype information or from a fitted model. Plot functions include pedigrees of simulated families and predicted penetrance curves based on specified parameter values. For more information see Choi, Y.-H., Briollais, L., He, W. and Kopciuk, K. (2021) FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs, Journal of Statistical Software 97(7), 1-30.
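A simulation sketch using FamEvent's simfam() function; the argument names and values below follow my reading of the package and are assumptions to verify against its documentation:

```r
library(FamEvent)

set.seed(4)
# Assumed arguments: population-based design, Weibull baseline hazard with
# given parameters, and covariate effects supplied through vbeta.
fam <- simfam(N.fam = 100, design = "pop", variation = "none",
              base.dist = "Weibull", base.parms = c(0.016, 3),
              vbeta = c(-1.13, 2.35), agemin = 20)
summary(fam)
plot(fam)  # pedigree plot of a simulated family
```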
This package provides functions for performing least-squares bilinear clustering of three-way data. The method uses the bilinear decomposition (or bi-additive model) to model two-way matrix slices while clustering over the third way. Up to four different types of clusters are included, one for each term of the bilinear decomposition. In this way, matrices are clustered simultaneously on (a subset of) their overall means, row margins, column margins and row-column interactions. The orthogonality of the bilinear model results in separability of the joint clustering problem into four separate ones. Three of these sub-problems are specific k-means problems, while a special algorithm is implemented for the interactions. Plotting methods are provided, including biplots for the low-rank approximations of the interactions.
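A base-R illustration of the bi-additive decomposition of a single two-way slice into the four orthogonal terms mentioned above (not the package's interface):

```r
# Bi-additive model for one slice X: x_ij = mu + a_i + b_j + gamma_ij.
X  <- matrix(c(5, 7, 6, 9, 11, 10), nrow = 2, byrow = TRUE)
mu <- mean(X)             # overall mean
a  <- rowMeans(X) - mu    # row margins
b  <- colMeans(X) - mu    # column margins
gamma <- X - mu - outer(a, rep(1, ncol(X))) - outer(rep(1, nrow(X)), b)

# The four terms add back up to X; clustering acts on (subsets of) them.
all.equal(X, mu + outer(a, rep(1, ncol(X))) + outer(rep(1, nrow(X)), b) + gamma)
```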
Visualization of decision rules for binary classification and Receiver Operating Characteristic (ROC) curve estimation under different generalizations proposed in the literature: (i) making the classification subsets flexible, by considering two thresholds, to cover scenarios where both extremes of the marker are associated with a higher risk of being positive (gROC() function); (ii) transforming the marker by a proper function in an attempt to improve classification performance (hROC() function); (iii) for multivariate markers, considering a proper transformation to univariate space that tries to maximize the area under the resulting ROC curve (multiROC() function). The classification regions behind each point of the ROC curve are displayed in static graphics (plot_buildROC(), plot_regions() or plot_funregions() functions) or videos (movieROC() function).
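A sketch of estimating a two-threshold generalized ROC curve; gROC() is named above, and the marker/status arguments X and D are assumptions to verify:

```r
library(movieROC)  # assumed package name, matching the movieROC() function

set.seed(5)
X <- c(rnorm(100, 0, 1), rnorm(100, 1.5, 1))  # marker values
D <- rep(c(0, 1), each = 100)                 # true binary status

# Assumed interface: generalized ROC curve allowing two thresholds.
roc <- gROC(X, D)
plot(roc)
```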
Collision Risk Models for avian fauna (seabirds and migratory birds) at offshore wind farms. The base deterministic model is derived from Band (2012) <https://tethys.pnnl.gov/publications/using-collision-risk-model-assess-bird-collision-risks-offshore-wind-farms>. This was further expanded by Masden (2015) <doi:10.7489/1659-1>, and the code used here is heavily derived from that work, with input from Dr A. Cook at the British Trust for Ornithology. These collision risk models are useful for marine ornithologists working in the offshore wind industry, particularly in UK waters. However, many of the species included in the stochastic collision risk models can also be found in the North Atlantic in the United States and Canada, so the models could be applied there as well.
This package implements functions that calculate upper prediction bounds on the false discovery proportion (FDP) in the list of discoveries returned by competition-based setups, implementing Ebadi et al. (2022) <arXiv:2302.11837>. Such setups include target-decoy competition (TDC) in computational mass spectrometry and the knockoff construction in linear regression (note this package typically uses the terminology of TDC). Included are the standardized (TDC-SB) and uniform (TDC-UB) bounds on TDC's FDP, as well as the simultaneous standardized and uniform bands. The package requires pre-computed Monte Carlo statistics available at <https://github.com/uni-Arya/fdpbandsdata>. This data can be downloaded by running the command devtools::install_github("uni-Arya/fdpbandsdata") in R and restarting R after installation. The size of this data is roughly 81Mb.
Identification of causal effects from arbitrary observational and experimental probability distributions via do-calculus and standard probability manipulations using a search-based algorithm by Tikka, Hyttinen and Karvanen (2021) <doi:10.18637/jss.v099.i05>. Allows for the presence of mechanisms related to selection bias (Bareinboim and Tian, 2015) <doi:10.1609/aaai.v29i1.9679>, transportability (Bareinboim and Pearl, 2014) <http://ftp.cs.ucla.edu/pub/stat_ser/r443.pdf>, missing data (Mohan, Pearl, and Tian, 2013) <http://ftp.cs.ucla.edu/pub/stat_ser/r410.pdf>, and arbitrary combinations of these. Also supports identification in the presence of context-specific independence (CSI) relations through labeled directed acyclic graphs (LDAGs). For details on CSIs see Corander et al. (2019) <doi:10.1016/j.apal.2019.04.004>.
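A minimal example; I believe this describes the dosearch package, whose string-based interface takes the available distributions, the query, and the graph as character strings:

```r
library(dosearch)

data  <- "P(x, y, z)"        # available observational distribution
query <- "P(y | do(x))"      # causal effect to be identified
graph <- "
  x -> y
  z -> x
  z -> y
"
# Returns an identifying formula when the effect is identifiable:
dosearch(data, query, graph)
```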
Estimation of DIFferential COexpressed NETworks using diverse built-in and user-defined metrics. The package serves three purposes related to the estimation of differential coexpression. First, to estimate differential coexpression, where the coexpression is estimated, by default, by Spearman correlation. For this, a metric to compare two correlation distributions is needed. The package includes 6 metrics, some of which need a threshold. A new metric can also be specified as a user function with specific parameters (see difconet.run). The significance is estimated by permutations. Second, to generate datasets with controlled differential correlation, by either adding noise or adding specific correlation structure. Third, to display the results of differential correlation analyses. Please see <http://bioinformatica.mty.itesm.mx/difconet> for further information.
The hybrid model is a promising forecasting method that combines decomposition and deep learning techniques to improve the accuracy of time series forecasting. Each decomposition technique decomposes a time series into a set of intrinsic mode functions (IMFs), and the obtained IMFs are modelled and forecasted separately using deep learning models. Finally, the forecasts of all IMFs are combined to provide an ensemble output for the time series. The prediction ability of the developed models is evaluated using the international monthly price series of maize in terms of evaluation criteria such as root mean squared error, mean absolute percentage error and mean absolute error. For method details see Choudhary, K. et al. (2023). <https://ssca.org.in/media/14_SA44052022_R3_SA_21032023_Girish_Jha_FINAL_Finally.pdf>.
Generate reports that enable quick visual review of temporal shifts in record-level data. Time series plots showing aggregated values are automatically created for each data field (column) depending on its contents (e.g. min/max/mean values for numeric data, no. of distinct values for categorical data), as well as overviews for missing values, non-conformant values, and duplicated rows. The resulting reports are shareable and can contribute to forming a transparent record of the entire analysis process. It is designed with Electronic Health Records in mind, but can be used for any type of record-level temporal data (i.e. tabular data where each row represents a single "event", one column contains the "event date", and other columns contain any associated values for the event).
Routines for model-based functional cluster analysis for functional data with optional covariates. The idea is to cluster functional subjects (often called functional objects) into homogeneous groups by using spline smoothers (for the functional data) together with scalar covariates. The spline coefficients and the covariates are modelled as a multivariate Gaussian mixture model, where the number of mixtures corresponds to the number of clusters. The parameters of the model are estimated by maximizing the observed mixture likelihood via an EM algorithm (Arnqvist and Sjöstedt de Luna, 2019) <doi:10.48550/arXiv.1904.10265>. The clustering method is used to analyze annual lake sediments from Lake Kassjön (northern Sweden), which cover more than 6400 years and can be seen as a historical record of weather and climate.
It offers a versatile tool for creating and evaluating artificial-intelligence-based neural network models tailored for regression analysis on datasets with continuous target variables. Leveraging the power of neural networks, it allows users to experiment with various hidden-neuron configurations across two layers, optimizing model performance through 5-fold or 10-fold cross-validation. The package normalizes input data to ensure efficient training and assesses model accuracy using key metrics such as R-squared (R2), root mean square error (RMSE), mean absolute error (MAE), and percentage error (PER). By storing and visualizing the best-performing models, it provides a comprehensive solution for precise and efficient regression modeling, making it a useful tool for data scientists and researchers applying AI to predictive analytics.
This package provides a method for the multiresolution analysis of spatial fields and images to capture scale-dependent features. mrbsizeR is based on scale space smoothing and uses differences of smooths at neighbouring scales to find features on different scales. Bayesian analysis is used to infer which of the captured features are credible. The scale space multiresolution analysis has three steps: (1) Bayesian signal reconstruction. (2) Using differences of smooths, scale-dependent features of the reconstructed signal are found. (3) Posterior credibility analysis of the differences of smooths created. The method was first proposed by Holmstrom, Pasanen, Furrer and Sain (2011) <doi:10.1016/j.csda.2011.04.011> and extended in Flury, Gerber, Schmid and Furrer (2021) <doi:10.1016/j.spasta.2020.100483>.
Penalized regression methods, such as lasso and elastic net, are used in many biomedical applications when simultaneous regression coefficient estimation and variable selection is desired. However, missing data complicate the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm to each imputed dataset will likely lead to different sets of selected predictors, making it difficult to ascertain a final active set without resorting to ad hoc combination rules. miselect presents the Stacked Adaptive Elastic Net (saenet) and Grouped Adaptive LASSO (galasso) for continuous and binary outcomes, developed by Du et al. (2022) <doi:10.1080/10618600.2022.2035739>. By construction, they force selection of the same variables across multiply imputed data. miselect also provides cross-validated variants of these methods.
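A workflow sketch with mice-based imputation followed by the cross-validated grouped adaptive LASSO; the input format (lists of per-imputation design matrices and responses) and the cv.galasso() arguments follow my reading of miselect and should be verified:

```r
library(mice)      # multiple imputation
library(miselect)  # saenet()/galasso() and cross-validated variants

# Impute, then build one design matrix and response per imputed data set.
imp <- mice(nhanes, m = 5, printFlag = FALSE)
dfs <- lapply(1:5, function(i) complete(imp, i))
x   <- lapply(dfs, function(d) as.matrix(d[, c("age", "hyp", "chl")]))
y   <- lapply(dfs, function(d) d$bmi)

# Assumed interface: pf and adWeight are penalty factors/adaptive weights,
# one entry per candidate predictor.
fit <- cv.galasso(x, y, pf = rep(1, 3), adWeight = rep(1, 3))
```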
Bayesian regularized quantile regression utilizing sparse priors to impose exact sparsity leads to efficient Bayesian shrinkage estimation, variable selection and statistical inference. This package implements robust Bayesian variable selection with spike-and-slab priors under high-dimensional linear regression models (Fan et al. (2024) <doi:10.3390/e26090794> and Ren et al. (2023) <doi:10.1111/biom.13670>), and regularized quantile varying coefficient models (Zhou et al. (2023) <doi:10.1016/j.csda.2023.107808>). In particular, both models yield valid robust Bayesian inference in finite samples in the presence of heavy-tailed errors. Additional models, including the robust Bayesian group LASSO, are also included. The Markov Chain Monte Carlo (MCMC) algorithms of the proposed and alternative models are implemented in C++.