Univariate and multivariate temporal and spatial diversity indices, rank abundance curves, and community stability measures. The functions implement measures that are either explicitly temporal and include the option to calculate them over multiple replicates, or spatial and include the option to calculate them over multiple time points. Functions fall into five categories: static diversity indices, temporal diversity indices, spatial diversity indices, rank abundance curves, and community stability measures. The diversity indices are temporal and spatial analogs to traditional diversity indices. Specifically, the package includes functions to calculate community richness, evenness and diversity at a given point in space and time. In addition, it contains functions to calculate species turnover, mean rank shifts, and lags in community similarity between two time points. Details of the methods are available in Hallett et al. (2016) <doi:10.1111/2041-210X.12569> and Avolio et al. (2019) <doi:10.1002/ecs2.2881>.
Analyses of frequencies can be performed using an alternative test based on the G statistic. The test has similar type-I error rates and power as the chi-square test. However, it is based on a total statistic that can be decomposed in an additive fashion into interaction effects, main effects, simple effects, contrast effects, etc., mimicking precisely the logic of ANOVA. We call this set of tools ANOFA (Analysis of Frequency data) to highlight its similarities with ANOVA. This framework also renders plots of frequencies along with confidence intervals. Finally, effect sizes and planning statistical power are easily done under this framework. The ANOFA is a tool that assesses the significance of effects instead of the significance of parameters; as such, it is more intuitive to most researchers than alternative approaches based on generalized linear models. See Laurencelle and Cousineau (2023) <doi:10.20982/tqmp.19.2.p173>.
This package provides a software package for using DEXi models. DEXi models are hierarchical qualitative multi-criteria decision models developed according to the method DEX (Decision EXpert, <https://dex.ijs.si/documentation/DEX_Method/DEX_Method.html>), using the program DEXi (<https://kt.ijs.si/MarkoBohanec/dexi.html>
) or DEXiWin
(<https://dex.ijs.si/dexisuite/dexiwin.html>). A typical workflow with DEXiR
consists of: (1) reading a .dxi file, previously made using the DEXi software (function read_dexi()
), (2) making a data frame containing input values of one or more decision alternatives, (3) evaluating those alternatives (function evaluate()
), (4) analyzing alternatives (selective_explanation()
, plus_minus()
, compare_alternatives()
), (5) drawing charts. DEXiR
is restricted to using models produced externally by the DEXi software and does not provide functionality for creating and/or editing DEXi models directly in R'.
Gene-Ranking Analysis of Pathway Expression (GRAPE) is a tool for summarizing the consensus behavior of biological pathways in the form of a template, and for quantifying the extent to which individual samples deviate from the template. GRAPE templates are based only on the relative rankings of the genes within the pathway and can be used for classification of tissue types or disease subtypes. GRAPE can be used to represent gene-expression samples as vectors of pathway scores, where each pathway score indicates the departure from a given collection of reference samples. The resulting pathway- space representation can be used as the feature set for various applications, including survival analysis and drug-response prediction. Users of GRAPE should use the following citation: Klein MI, Stern DF, and Zhao H. GRAPE: A pathway template method to characterize tissue-specific functionality from gene expression profiles. BMC Bioinformatics, 18:317 (June 2017).
Graphical model is an informative and powerful tool to explore the conditional dependence relationships among variables. The traditional Gaussian graphical model and its extensions either have a Gaussian assumption on the data distribution or assume the data are homogeneous. However, there are data with complex distributions violating these two assumptions. For example, the air pollutant concentration records are non-negative and, hence, non-Gaussian. Moreover, due to climate changes, distributions of these concentration records in different months of a year can be far different, which means it is uncertain whether datasets from different months are homogeneous. Methods with a Gaussian or homogeneous assumption may incorrectly model the conditional dependence relationships among variables. Therefore, we propose a heterogeneous graphical model for non-negative data (HGMND) to simultaneously cluster multiple datasets and estimate the conditional dependence matrix of variables from a non-Gaussian and non-negative exponential family in each cluster.
Gaze data from the Visual World Paradigm requires significant preprocessing prior to plotting and analyzing the data. This package provides functions for preparing visual world eye-tracking data for statistical analysis and plotting. It can prepare data for linear analyses (e.g., ANOVA, Gaussian-family LMER, Gaussian-family GAMM) as well as logistic analyses (e.g., binomial-family LMER and binomial-family GAMM). Additionally, it contains various plotting functions for creating grand average and conditional average plots. See the vignette for samples of the functionality. Currently, the functions in this package are designed for handling data collected with SR Research Eyelink eye trackers using Sample Reports created in SR Research Data Viewer. While we would like to add functionality for data collected with other systems in the future, the current package is considered to be feature-complete; further updates will mainly entail maintenance and the addition of minor functionality.
Assume that a temporal process is composed of contiguous segments with differing slopes and replicated noise-corrupted time series measurements are observed. The unknown mean of the data generating process is modelled as a piecewise linear function of time with an unknown number of change-points. The package infers the joint posterior distribution of the number and position of change-points as well as the unknown mean parameters per time-series by MCMC sampling. A-priori, the proposed model uses an overfitting number of mean parameters but, conditionally on a set of change-points, only a subset of them influences the likelihood. An exponentially decreasing prior distribution on the number of change-points gives rise to a posterior distribution concentrating on sparse representations of the underlying sequence, but also available is the Poisson distribution. See Papastamoulis et al (2017) <arXiv:1709.06111>
for a detailed presentation of the method.
This package provides analysis tools for big data where the sample size is very large. It offers a suite of functions for fitting and predicting joint models, which allow for the simultaneous analysis of longitudinal and time-to-event data. This statistical methodology is particularly useful in medical research where there is often interest in understanding the relationship between a longitudinal biomarker and a clinical outcome, such as survival or disease progression. This can be particularly useful in a clinical setting where it is important to be able to predict how a patient's health status may change over time. Overall, this package provides a comprehensive set of tools for joint modeling of BIG data obtained as survival and longitudinal outcomes with both Bayesian and non-Bayesian approaches. Its versatility and flexibility make it a valuable resource for researchers in many different fields, particularly in the medical and health sciences.
Composite Kernel Machine Regression based on Likelihood Ratio Test (CKLRT): in this package, we develop a kernel machine regression framework to model the overall genetic effect of a SNP-set, considering the possible GE interaction. Specifically, we use a composite kernel to specify the overall genetic effect via a nonparametric function and we model additional covariates parametrically within the regression framework. The composite kernel is constructed as a weighted average of two kernels, one corresponding to the genetic main effect and one corresponding to the GE interaction effect. We propose a likelihood ratio test (LRT) and a restricted likelihood ratio test (RLRT) for statistical significance. We derive a Monte Carlo approach for the finite sample distributions of LRT and RLRT statistics. (N. Zhao, H. Zhang, J. Clark, A. Maity, M. Wu. Composite Kernel Machine Regression based on Likelihood Ratio Test with Application for Combined Genetic and Gene-environment Interaction Effect (Submitted).).
This package implements data-driven identification methods for structural vector autoregressive (SVAR) models as described in Lange et al. (2021) <doi:10.18637/jss.v097.i05>. Based on an existing VAR model object (provided by e.g. VAR()
from the vars package), the structural impact matrix is obtained via data-driven identification techniques (i.e. changes in volatility (Rigobon, R. (2003) <doi:10.1162/003465303772815727>), patterns of GARCH (Normadin, M., Phaneuf, L. (2004) <doi:10.1016/j.jmoneco.2003.11.002>), independent component analysis (Matteson, D. S, Tsay, R. S., (2013) <doi:10.1080/01621459.2016.1150851>), least dependent innovations (Herwartz, H., Ploedt, M., (2016) <doi:10.1016/j.jimonfin.2015.11.001>), smooth transition in variances (Luetkepohl, H., Netsunajev, A. (2017) <doi:10.1016/j.jedc.2017.09.001>) or non-Gaussian maximum likelihood (Lanne, M., Meitz, M., Saikkonen, P. (2017) <doi:10.1016/j.jeconom.2016.06.002>)).
In a clinical trial with repeated measures designs, outcomes are often taken from subjects at fixed time-points. The focus of the trial may be to compare the mean outcome in two or more groups at some pre-specified time after enrollment. In the presence of missing data auxiliary assumptions are necessary to perform such comparisons. One commonly employed assumption is the missing at random assumption (MAR). The samon package allows the user to perform a (parameterized) sensitivity analysis of this assumption. In particular it can be used to examine the sensitivity of tests in the difference in outcomes to violations of the MAR assumption. The sensitivity analysis can be performed under two scenarios, a) where the data exhibit a monotone missing data pattern (see the samon()
function), and, b) where in addition to a monotone missing data pattern the data exhibit intermittent missing values (see the samonIM()
function).
To overcome the memory limitations for fitting linear (LM) and Generalized Linear Models (GLMs) to large data sets, this package implements the Divide and Recombine (D&R) strategy. It basically divides the entire large data set into suitable subsets manageable in size and then fits model to each subset. Finally, results from each subset are aggregated to obtain the final estimate. This package also supports fitting GLMs to data sets that cannot fit into memory and provides methods for fitting GLMs under linear regression, binomial regression, Poisson regression, and multinomial logistic regression settings. Respective models are fitted using different D&R strategies as described by: Xi, Lin, and Chen (2009) <doi:10.1109/TKDE.2008.186>, Xi, Lin and Chen (2006) <doi:10.1109/TKDE.2006.196>, Zuo and Li (2018) <doi:10.4236/ojs.2018.81003>, Karim, M.R., Islam, M.A. (2019) <doi:10.1007/978-981-13-9776-9>.
Helps find meaningful patterns in complex genetic experiments. First gimap takes data from paired CRISPR (Clustered regularly interspaced short palindromic repeats) screens that has been pre-processed to counts table of paired gRNA
(guide Ribonucleic Acid) reads. The input data will have cell counts for how well cells grow (or don't grow) when different genes or pairs of genes are disabled. The output of the gimap package is genetic interaction scores which are the distance between the observed CRISPR score and the expected CRISPR score. The expected CRISPR scores are what we expect for the CRISPR values to be for two unrelated genes. The further away an observed CRISPR score is from its expected score the more we suspect genetic interaction. The work in this package is based off of original research from the Alice Berger lab at Fred Hutchinson Cancer Center (2021) <doi:10.1016/j.celrep.2021.109597>.
Projects mean squared out-of-sample error for a linear regression based upon the methodology developed in Rohlfs (2022) <doi:10.48550/arXiv.2209.01493>
. It consumes as inputs the lm object from an estimated OLS regression (based on the "training sample") and a data.frame of out-of-sample cases (the "test sample") that have non-missing values for the same predictors. The test sample may or may not include data on the outcome variable; if it does, that variable is not used. The aim of the exercise is to project what what mean squared out-of-sample error can be expected given the predictor values supplied in the test sample. Output consists of a list of three elements: the projected mean squared out-of-sample error, the projected out-of-sample R-squared, and a vector of out-of-sample "hat" or "leverage" values, as defined in the paper.
This package performs Bayesian nonparametric density estimation using Martingale posterior distributions including the Copula Resampling (CopRe
) algorithm. Also included are a Gibbs sampler for the marginal Gibbs-type mixture model and an extension to include full uncertainty quantification via a predictive sequence resampling (SeqRe
) algorithm. The CopRe
and SeqRe
samplers generate random nonparametric distributions as output, leading to complete nonparametric inference on posterior summaries. Routines for calculating arbitrary functionals from the sampled distributions are included as well as an important algorithm for finding the number and location of modes, which can then be used to estimate the clusters in the data using, for example, k-means. Implements work developed in Moya B., Walker S. G. (2022). <doi:10.48550/arxiv.2206.08418>, Fong, E., Holmes, C., Walker, S. G. (2021) <doi:10.48550/arxiv.2103.15671>, and Escobar M. D., West, M. (1995) <doi:10.1080/01621459.1995.10476550>.
Estimation, selection and comparison of several families of transformations. The families of transformations included in the package are the following: Bickel-Doksum (Bickel and Doksum 1981 <doi:10.2307/2287831>), Box-Cox, Dual (Yang 2006 <doi:10.1016/j.econlet.2006.01.011>), Glog (Durbin et al. 2002 <doi:10.1093/bioinformatics/18.suppl_1.S105>), gpower (Kelmansky et al. 2013 <doi:10.1515/sagmb-2012-0030>), Log, Log-shift opt (Feng et al. 2016 <doi:10.1002/sta4.104>), Manly, modulus (John and Draper 1980 <doi:10.2307/2986305>), Neglog (Whittaker et al. 2005 <doi:10.1111/j.1467-9876.2005.00520.x>), Reciprocal and Yeo-Johnson. The package simplifies to compare linear models with untransformed and transformed dependent variable as well as linear models where the dependent variable is transformed with different transformations. Furthermore, the package employs maximum likelihood approaches, moments optimization and divergence minimization to estimate the optimal transformation parameter.
Patients Mental Health (MH) status, Substance Use (SU) status, and concurrent MH/SU status in the American/Canadian Healthcare Administrative Databases can be identified. The detection is based on given parameters of interest by clinicians including the list of plausible ICD MH/SU codes (3/4/5 characters), the required number of visits of hospital for MH/SU , the required number of visits of service physicians for MH/SU, and the maximum time span within MH visits, within SU visits, and, between MH and SU visits. Methods are described in: Khan S <https://pubmed.ncbi.nlm.nih.gov/29044442/>, Keen C, et al. (2021) <doi:10.1111/add.15580>, Lavergne MR, et al. (2022) <doi:10.1186/s12913-022-07759-z>, Casillas, S M, et al. (2022) <doi:10.1016/j.abrep.2022.100464>, CIHI (2022) <https://www.cihi.ca/en>, CDC (2024) <https://www.cdc.gov>, WHO (2019) <https://icd.who.int/en>.
This package creates geographic map tiles from geospatial map files or non-geographic map tiles from simple image files. This package provides a tile generator function for creating map tile sets for use with packages such as leaflet'. In addition to generating map tiles based on a common raster layer source, it also handles the non-geographic edge case, producing map tiles from arbitrary images. These map tiles, which have a non-geographic, simple coordinate reference system (CRS), can also be used with leaflet when applying the simple CRS option. Map tiles can be created from an input file with any of the following extensions: tif, grd and nc for spatial maps and png, jpg and bmp for basic images. This package requires Python and the gdal library for Python'. Windows users are recommended to install OSGeo4W (<https://trac.osgeo.org/osgeo4w/>) as an easy way to obtain the required gdal support for Python'.
The genetic algorithm can be used directly to find the similarity of users and more effectively to increase the efficiency of the collaborative filtering method. By identifying the nearest neighbors to the active user, before the genetic algorithm, and by identifying suitable starting points, an effective method for user-based collaborative filtering method has been developed. This package uses an optimization algorithm (continuous genetic algorithm) to directly find the optimal similarities between active users (users for whom current recommendations are made) and others. First, by determining the nearest neighbor and their number, the number of genes in a chromosome is determined. Each gene represents the neighbor's similarity to the active user. By estimating the starting points of the genetic algorithm, it quickly converges to the optimal solutions. The positive point is the independence of the genetic algorithm on the number of data that for big data is an effective help in solving the problem.
An R interface to FLINT <https://flintlib.org/>, a C library for number theory. FLINT extends GNU MPFR <https://www.mpfr.org/> and GNU MP <https://gmplib.org/> with support for operations on standard rings (the integers, the integers modulo n, finite fields, the rational, p-adic, real, and complex numbers) as well as matrices and polynomials over rings. FLINT implements midpoint-radius interval arithmetic, also known as ball arithmetic, in the real and complex numbers, enabling computation in arbitrary precision with rigorous propagation of rounding errors; see Johansson (2017) <doi:10.1109/TC.2017.2690633>. Finally, FLINT provides ball arithmetic implementations of many special mathematical functions, with high coverage of reference works such as the NIST Digital Library of Mathematical Functions <https://dlmf.nist.gov/>. The R interface defines S4 classes, generic functions, and methods for representation and basic operations as well as plain R functions mirroring and vectorizing entry points in the C library.
This package provides a simulation-based tool made to help researchers to become familiar with multilevel variations, and to build up sampling designs for their study. This tool has two main objectives: First, it provides an educational tool useful for students, teachers and researchers who want to learn to use mixed-effects models. Users can experience how the mixed-effects model framework can be used to understand distinct biological phenomena by interactively exploring simulated multilevel data. Second, it offers research opportunities to those who are already familiar with mixed-effects models, as it enables the generation of data sets that users may download and use for a range of simulation-based statistical analyses such as power and sensitivity analysis of multilevel and multivariate data [Allegue, H., Araya-Ajoy, Y.G., Dingemanse, N.J., Dochtermann N.A., Garamszegi, L.Z., Nakagawa, S., Reale, D., Schielzeth, H. and Westneat, D.F. (2016) <doi: 10.1111/2041-210X.12659>].
This package implements a spatial Bayesian non-parametric factor analysis model with inference in a Bayesian setting using Markov chain Monte Carlo (MCMC). Spatial correlation is introduced in the columns of the factor loadings matrix using a Bayesian non-parametric prior, the probit stick-breaking process. Areal spatial data is modeled using a conditional autoregressive (CAR) prior and point-referenced spatial data is treated using a Gaussian process. The response variable can be modeled as Gaussian, probit, Tobit, or Binomial (using Polya-Gamma augmentation). Temporal correlation is introduced for the latent factors through a hierarchical structure and can be specified as exponential or first-order autoregressive. Full details of the package can be found in the accompanying vignette. Furthermore, the details of the package can be found in "Bayesian Non-Parametric Factor Analysis for Longitudinal Spatial Surfaces", by Berchuck et al (2019), <arXiv:1911.04337>
. The paper is in press at the journal Bayesian Analysis.
Fit multiclass Classification version of Bayesian Adaptive Smoothing Splines (CBASS) to data using reversible jump MCMC. The multiclass classification problem consists of a response variable that takes on unordered categorical values with at least three levels, and a set of inputs for each response variable. The CBASS model consists of a latent multivariate probit formulation, and the means of the latent Gaussian random variables are specified using adaptive regression splines. The MCMC alternates updates of the latent Gaussian variables and the spline parameters. All the spline parameters (variables, signs, knots, number of interactions), including the number of basis functions used to model each latent mean, are inferred. Functions are provided to process inputs, initialize the chain, run the chain, and make predictions. Predictions are made on a probabilistic basis, where, for a given input, the probabilities of each categorical value are produced. See Marrs and Francom (2023) "Multiclass classification using Bayesian multivariate adaptive regression splines" Under review.
Joint and Individual Variation Explained (JIVE) is a method for decomposing multiple datasets obtained on the same subjects into shared structure, structure unique to each dataset, and noise. The two most common implementations are R.JIVE, an iterative approach, and AJIVE, which uses principal angle analysis. JIVE estimates subspaces but interpreting these subspaces can be challenging with AJIVE or R.JIVE. We expand upon insights into AJIVE as a canonical correlation analysis (CCA) of principal component scores. This reformulation, which we call CJIVE, 1) provides an ordering of joint components by the degree of correlation between corresponding canonical variables; 2) uses a computationally efficient permutation test for the number of joint components, which provides a p-value for each component; and 3) can be used to predict subject scores for out-of-sample observations. Please cite the following article when utilizing this package: Murden, R., Zhang, Z., Guo, Y., & Risk, B. (2022) <doi:10.3389/fnins.2022.969510>.