This package provides a wrapper for machine learning (ML) methods to select among a portfolio of algorithms based on the value of a key performance indicator (KPI). A number of features is used to adjust a model to predict the value of the KPI for each algorithm, then, for a new value of the features the KPI is estimated and the algorithm with the best one is chosen. To learn it can use the regression methods in caret package or a custom function defined by the user. Several graphics available to analyze the results obtained. This library has been used in Ghaddar et al. (2023) <doi:10.1287/ijoc.2022.0090>).
This package provides a comprehensive framework for early epidemic detection through school absenteeism surveillance. The package offers three core functionalities: (1) simulation of population structures, epidemic spread, and resulting school absenteeism patterns; (2) implementation of surveillance models that generate alerts for impending epidemics based on absenteeism data and (3) evaluation of alert timeliness and accuracy through alert time quality metrics to optimize model parameters. These tools enable public health officials and researchers to develop and assess early warning systems before implementation. Methods are based on research published in Vanderkruk et al. (2023) <doi:10.1186/s12889-023-15747-z> and Ward et al. (2019) <doi:10.1186/s12889-019-7521-7>.
This package performs genomic mediation analysis with adaptive confounding adjustment (GMAC) proposed by Yang et al. (2017) <doi:10.1101/078683>. It implements large scale mediation analysis and adaptively selects potential confounding variables to adjust for each mediation test from a pool of candidate confounders. The package is tailored for but not limited to genomic mediation analysis (e.g., cis-gene mediating trans-gene regulation pattern where an eQTL
, its cis-linking gene transcript, and its trans-gene transcript play the roles as treatment, mediator and the outcome, respectively), restricting to scenarios with the presence of cis-association (i.e., treatment-mediator association) and random eQTL
(i.e., treatment).
Higher order likelihood inference is a promising approach for analyzing small sample size data. The holi package provides web applications for higher order likelihood inference. It currently supports linear, logistic, and Poisson generalized linear models through the rstar_glm()
function, based on Pierce and Bellio (2017) <doi:10.1111/insr.12232> and likelihoodAsy
'. The package offers two main features: LA_rstar()
, which launches an interactive shiny application allowing users to fit models with rstar_glm()
through their web browser, and sim_rstar_glm_pgsql()
, which streamlines the process of launching a web-based shiny simulation application that saves results to a user-created PostgreSQL
database.
This package provides a collection of statistical hypothesis tests and other techniques for identifying certain spatial relationships/phenomena in DNA sequences. In particular, it provides tests and graphical methods for determining whether or not DNA sequences comply with Chargaff's second parity rule or exhibit purine-pyrimidine parity. In addition, there are functions for efficiently simulating discrete state space Markov chains and testing arbitrary symbolic sequences of symbols for the presence of first-order Markovianness. Also, it has functions for counting words/k-mers (and cylinder patterns) in arbitrary symbolic sequences. Functions which take a DNA sequence as input can handle sequences stored as SeqFastadna
objects from the seqinr package.
This package provides a flexible and robust joint test of the single nucleotide polymorphism (SNP) main effect and genotype-by-treatment interaction effect for continuous and binary endpoints. Two analytic procedures, Cauchy weighted joint test (CWOT) and adaptively weighted joint test (AWOT), are proposed to accurately calculate the joint test p-value. The proposed methods are evaluated through extensive simulations under various scenarios. The results show that the proposed AWOT and CWOT control type I error well and outperform existing methods in detecting the most interesting signal patterns in pharmacogenetics (PGx) association studies. For reference, see Hong Zhang, Devan Mehrotra and Judong Shen (2022) <doi:10.13140/RG.2.2.28323.53280>.
The gamma-Orthogonal Matching Pursuit (gamma-OMP) is a recently suggested modification of the OMP feature selection algorithm for a wide range of response variables. The package offers many alternative regression models, such linear, robust, survival, multivariate etc., including k-fold cross-validation. References: Tsagris M., Papadovasilakis Z., Lakiotaki K. and Tsamardinos I. (2018). "Efficient feature selection on gene expression data: Which algorithm to use?" BioRxiv
. <doi:10.1101/431734>. Tsagris M., Papadovasilakis Z., Lakiotaki K. and Tsamardinos I. (2022). "The gamma-OMP algorithm for feature selection with application to gene expression data". IEEE/ACM Transactions on Computational Biology and Bioinformatics 19(2): 1214--1224. <doi:10.1109/TCBB.2020.3029952>.
This package provides a multivariate Gaussian mixture model framework to integrate multiple types of genomic data and allow modeling of inter-data-type correlations for association analysis. IMIX can be implemented to test whether a disease is associated with genes in multiple genomic data types, such as DNA methylation, copy number variation, gene expression, etc. It can also study the integration of multiple pathways. IMIX uses the summary statistics of association test outputs and conduct integration analysis for two or three types of genomics data. IMIX features statistically-principled model selection, global FDR control and computational efficiency. Details are described in Ziqiao Wang and Peng Wei (2020) <doi:10.1093/bioinformatics/btaa1001>.
The Minorize-Maximization(MM) algorithm based on Assembly-Decomposition(AD) technology can be used for model estimation of parametric models, semi-parametric models and non-parametric models. We selected parametric models including left truncated normal distribution, type I multivariate zero-inflated generalized poisson distribution and multivariate compound zero-inflated generalized poisson distribution; semiparametric models include Cox model and gamma frailty model; nonparametric model is estimated for type II interval-censored data. These general methods are proposed based on the following papers, Tian, Huang and Xu (2019) <doi:10.5705/SS.202016.0488>, Huang, Xu and Tian (2019) <doi:10.5705/ss.202016.0516>, Zhang and Huang (2022) <doi:10.1117/12.2642737>.
Bayesian network learning using the PCHC, FEDHC, MMHC and variants of these algorithms. PCHC stands for PC Hill-Climbing, a new hybrid algorithm that uses PC to construct the skeleton of the BN and then applies the Hill-Climbing greedy search. More algorithms and variants have been added, such as MMHC, FEDHC, and the Tabu search variants, PCTABU, MMTABU and FEDTABU. The relevant papers are: a) Tsagris M. (2021). "A new scalable Bayesian network learning algorithm with applications to economics". Computational Economics, 57(1): 341-367. <doi:10.1007/s10614-020-10065-7>. b) Tsagris M. (2022). "The FEDHC Bayesian Network Learning Algorithm". Mathematics 2022, 10(15): 2604. <doi:10.3390/math10152604>.
The two-parameter Xgamma and Poisson Xgamma distributions are analyzed, covering standard distribution and regression functions, maximum likelihood estimation, quantile functions, probability density and mass functions, cumulative distribution functions, and random number generation. References include: "Sen, S., Chandra, N. and Maiti, S. S. (2018). On properties and applications of a two-parameter XGamma distribution. Journal of Statistical Theory and Applications, 17(4): 674--685. <doi:10.2991/jsta.2018.17.4.9>." "Wani, M. A., Ahmad, P. B., Para, B. A. and Elah, N. (2023). A new regression model for count data with applications to health care data. International Journal of Data Science and Analytics. <doi:10.1007/s41060-023-00453-1>.".
This is a package for the analysis of discrete response data using unidimensional and multidimensional item analysis models under the Item Response Theory paradigm (Chalmers (2012) <doi:10.18637/jss.v048.i06>). Exploratory and confirmatory item factor analysis models are estimated with quadrature (EM) or stochastic (MHRM) methods. Confirmatory bi-factor and two-tier models are available for modeling item testlets using dimension reduction EM algorithms, while multiple group analyses and mixed effects designs are included for detecting differential item, bundle, and test functioning, and for modeling item and person covariates. Finally, latent class models such as the DINA, DINO, multidimensional latent class, mixture IRT models, and zero-inflated response models are supported.
Learning and inference over dynamic Bayesian networks of arbitrary Markovian order. Extends some of the functionality offered by the bnlearn package to learn the networks from data and perform exact inference. It offers three structure learning algorithms for dynamic Bayesian networks: Trabelsi G. (2013) <doi:10.1007/978-3-642-41398-8_34>, Santos F.P. and Maciel C.D. (2014) <doi:10.1109/BRC.2014.6880957>, Quesada D., Bielza C. and Larrañaga P. (2021) <doi:10.1007/978-3-030-86271-8_14>. It also offers the possibility to perform forecasts of arbitrary length. A tool for visualizing the structure of the net is also provided via the visNetwork
package.
Import data of tests and questionnaires from FormScanner
. FormScanner
is an open source software that converts scanned images to data using optical mark recognition (OMR) and it can be downloaded from <http://sourceforge.net/projects/formscanner/>. The spreadsheet file created by FormScanner
is imported in a convenient format to perform the analyses provided by the package. These analyses include the conversion of multiple responses to binary (correct/incorrect) data, the computation of the number of corrected responses for each subject or item, scoring using weights,the computation and the graphical representation of the frequencies of the responses to each item and the report of the responses of a few subjects.
Kernel Fisher Discriminant Analysis (KFDA) is performed using Kernel Principal Component Analysis (KPCA) and Fisher Discriminant Analysis (FDA). There are some similar packages. First, lfda is a package that performs Local Fisher Discriminant Analysis (LFDA) and performs other functions. In particular, lfda seems to be impossible to test because it needs the label information of the data in the function argument. Also, the ks package has a limited dimension, which makes it difficult to analyze properly. This package is a simple and practical package for KFDA based on the paper of Yang, J., Jin, Z., Yang, J. Y., Zhang, D., and Frangi, A. F. (2004) <DOI:10.1016/j.patcog.2003.10.015>.
Solves kernel ridge regression, within the the mixed model framework, for the linear, polynomial, Gaussian, Laplacian and ANOVA kernels. The model components (i.e. fixed and random effects) and variance parameters are estimated using the expectation-maximization (EM) algorithm. All the estimated components and parameters, e.g. BLUP of dual variables and BLUP of random predictor effects for the linear kernel (also known as RR-BLUP), are available. The kernel ridge mixed model (KRMM) is described in Jacquin L, Cao T-V and Ahmadi N (2016) A Unified and Comprehensible View of Parametric and Kernel Methods for Genomic Prediction with Application to Rice. Front. Genet. 7:145. <doi:10.3389/fgene.2016.00145>.
Maximum likelihood estimation for stochastic frontier analysis (SFA) of production (profit) and cost functions. The package includes the basic stochastic frontier for cross-sectional or pooled data with several distributions for the one-sided error term (i.e., Rayleigh, gamma, Weibull, lognormal, uniform, generalized exponential and truncated skewed Laplace), the latent class stochastic frontier model (LCM) as described in Dakpo et al. (2021) <doi:10.1111/1477-9552.12422>, for cross-sectional and pooled data, and the sample selection model as described in Greene (2010) <doi:10.1007/s11123-009-0159-1>, and applied in Dakpo et al. (2021) <doi:10.1111/agec.12683>. Several possibilities in terms of optimization algorithms are proposed.
This package performs approximate GP regression for large computer experiments and spatial datasets. The approximation is based on finding small local designs for prediction (independently) at particular inputs. OpenMP
and SNOW parallelization are supported for prediction over a vast out-of-sample testing set; GPU acceleration is also supported for an important subroutine. OpenMP
and GPU features may require special compilation. An interface to lower-level (full) GP inference and prediction is provided. Wrapper routines for blackbox optimization under mixed equality and inequality constraints via an augmented Lagrangian scheme, and for large scale computer model calibration, are also provided. For details and tutorial, see Gramacy (2016 <doi:10.18637/jss.v072.i01>.
Several functions can be used to analyze multiblock multivariable data. If the input is a single matrix, then principal components analysis (PCA) is implemented. If the input is a list of matrices, then multiblock PCA is implemented. If the input is two matrices, for exploratory and objective variables, then partial least squares (PLS) analysis is implemented. If the input is two lists of matrices, for exploratory and objective variables, then multiblock PLS analysis is implemented. Additionally, if an extra outcome variable is specified, then a supervised version of the methods above is implemented. For each method, sparse modeling is also incorporated. Functions for selecting the number of components and regularized parameters are also provided.
In ecology, spatial data is often represented using polygons. These polygons can represent a variety of spatial entities, such as ecological patches, animal home ranges, or gaps in the forest canopy. Researchers often need to determine if two spatial processes, represented by these polygons, are independent of each other. For instance, they might want to test if the home range of a particular animal species is influenced by the presence of a certain type of vegetation. To address this, Godoy et al. (2022) (<doi:10.1016/j.spasta.2022.100695>) developed conditional Monte Carlo tests. These tests are designed to assess spatial independence while taking into account the shape and size of the polygons.
An implementation of the feature Selection procedure by Partitioning the entire Solution Paths (namely SPSP) to identify the relevant features rather than using a single tuning parameter. By utilizing the entire solution paths, this procedure can obtain better selection accuracy than the commonly used approach of selecting only one tuning parameter based on existing criteria, cross-validation (CV), generalized CV, AIC, BIC, and extended BIC (Liu, Y., & Wang, P. (2018) <doi:10.1214/18-EJS1434>). It is more stable and accurate (low false positive and false negative rates) than other variable selection approaches. In addition, it can be flexibly coupled with the solution paths of Lasso, adaptive Lasso, ridge regression, and other penalized estimators.
This package provides tools to evaluate the value of using a risk prediction instrument to decide treatment or intervention (versus no treatment or intervention). Given one or more risk prediction instruments (risk models) that estimate the probability of a binary outcome, rmda provides functions to estimate and display decision curves and other figures that help assess the population impact of using a risk model for clinical decision making. Here, "population" refers to the relevant patient population. Decision curves display estimates of the (standardized) net benefit over a range of probability thresholds used to categorize observations as high risk'. The curves help evaluate a treatment policy that recommends treatment for patients who are estimated to be high risk by comparing the population impact of a risk-based policy to "treat all" and "treat none" intervention policies. Curves can be estimated using data from a prospective cohort. In addition, rmda can estimate decision curves using data from a case-control study if an estimate of the population outcome prevalence is available. Version 1.4 of the package provides an alternative framing of the decision problem for situations where treatment is the standard-of-care and a risk model might be used to recommend that low-risk patients (i.e., patients below some risk threshold) opt out of treatment. Confidence intervals calculated using the bootstrap can be computed and displayed. A wrapper function to calculate cross-validated curves using k-fold cross-validation is also provided.
The Central Bank of the Republic of Turkey (CBRT) provides one of the most comprehensive time series databases on the Turkish economy. The CBRT package provides functions for accessing the CBRT's electronic data delivery system <https://evds2.tcmb.gov.tr/>. It contains the lists of all data categories and data groups for searching the available variables (data series). As of November 3, 2024, there were 40,826 variables in the dataset. The lists of data categories and data groups can be updated by the user at any time. A specific variable, a group of variables, or all variables in a data group can be downloaded at different frequencies using a variety of aggregation methods.
The method proposed in this package takes into account the impact of dependence on the multiple testing procedures for high-throughput data as proposed by Friguet et al. (2009). The common information shared by all the variables is modeled by a factor analysis structure. The number of factors considered in the model is chosen to reduce the false discoveries variance in multiple tests. The model parameters are estimated thanks to an EM algorithm. Adjusted tests statistics are derived, as well as the associated p-values. The proportion of true null hypotheses (an important parameter when controlling the false discovery rate) is also estimated from the FAMT model. Graphics are proposed to interpret and describe the factors.