This package provides a design-based approach to statistical inference, with a focus on spatial data. Spatially balanced samples are selected using the Generalized Random Tessellation Stratified (GRTS) algorithm. The GRTS algorithm can be applied to finite resources (point geometries) and infinite resources (linear / linestring and areal / polygon geometries) and flexibly accommodates a diverse set of sampling design features, including stratification, unequal inclusion probabilities, proportional (to size) inclusion probabilities, legacy (historical) sites, a minimum distance between sites, and two options for replacement sites (reverse hierarchical order and nearest neighbor). Data are analyzed using a wide range of analysis functions that perform categorical variable analysis, continuous variable analysis, attributable risk analysis, risk difference analysis, relative risk analysis, change analysis, and trend analysis. spsurvey can also be used to summarize objects, visualize objects, select samples that are not spatially balanced, select panel samples, measure the amount of spatial balance in a sample, adjust design weights, and more. For additional details, see Dumelle et al. (2023) <doi:10.18637/jss.v105.i03>.
Implement different Item Response Theory (IRT) based procedures for the development of static short test forms (STFs) from a test. Two main procedures are considered (Epifania, Anselmi & Robusto, 2022 <doi:10.1007/978-3-031-27781-8_7>). The procedures differ in how the most informative items are selected for the inclusion in the STF, either by considering their item information functions without any reference to any specific latent trait level (benchmark procedure) or by considering their information with respect to specific latent trait levels, denoted as theta targets (theta target procedure). Three methods are implemented for the definition of the theta targets: (i) as the midpoints of equal intervals on the latent trait, (ii) as the centroids of the clusters obtained by clustering the latent trait, and (iii) as user-defined values. Importantly, the number of theta targets defines the number of items included in the STF. For further details on the procedure, please refer to Epifania, Anselmi & Robusto (2022) <doi:10.1007/978-3-031-27781-8_7>.
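The theta target procedure can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) model and a without-replacement selection rule, neither of which is stated in the description above; the function names and item parameters are hypothetical and do not reflect the package's API.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item (assumed model) at latent trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def midpoint_targets(lower, upper, n_targets):
    """Theta targets as midpoints of n_targets equal-width intervals on [lower, upper]."""
    width = (upper - lower) / n_targets
    return [lower + width * (k + 0.5) for k in range(n_targets)]


def select_items(items, targets):
    """Pick, for each theta target, the most informative item not yet selected.

    items maps an item name to its (discrimination a, difficulty b) pair.
    Selecting without replacement is an assumption of this sketch.
    """
    remaining = dict(items)
    chosen = []
    for theta in targets:
        best = max(remaining, key=lambda name: item_information(theta, *remaining[name]))
        chosen.append(best)
        del remaining[best]
    return chosen


# Hypothetical item bank: name -> (a, b)
items = {"i1": (1.0, -2.0), "i2": (1.0, 0.0), "i3": (1.0, 2.0), "i4": (0.5, 0.0)}
targets = midpoint_targets(-3.0, 3.0, 3)
short_form = select_items(items, targets)
```

With three theta targets the resulting short form contains three items, mirroring the statement that the number of theta targets defines the number of items in the STF.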
This package implements measures of tree similarity, including information-based generalized Robinson-Foulds distances (Phylogenetic Information Distance, Clustering Information Distance, Matching Split Information Distance; Smith 2020) <doi:10.1093/bioinformatics/btaa614>; Jaccard-Robinson-Foulds distances (Bocker et al. 2013) <doi:10.1007/978-3-642-40453-5_13>, including the Nye et al. (2006) metric <doi:10.1093/bioinformatics/bti720>; the Matching Split Distance (Bogdanowicz & Giaro 2012) <doi:10.1109/TCBB.2011.48>; the Hierarchical Mutual Information (Perotti et al. 2015) <doi:10.1103/PhysRevE.92.062825>; Maximum Agreement Subtree distances; the Kendall-Colijn (2016) distance <doi:10.1093/molbev/msw124>, and the Nearest Neighbour Interchange (NNI) distance, approximated per Li et al. (1996) <doi:10.1007/3-540-61332-3_168>. Includes tools for visualizing mappings of tree space (Smith 2022) <doi:10.1093/sysbio/syab100>, for identifying islands of trees (Silva and Wilkinson 2021) <doi:10.1093/sysbio/syab015>, for calculating the median of sets of trees, and for computing the information content of trees and splits.
Seahtrue organizes oxygen consumption and extracellular acidification analysis data from experiments performed on an XF analyzer into structured nested tibbles. This allows for detailed processing of raw data and advanced data visualization and statistics. Seahtrue introduces an open and reproducible way to analyze these XF experiments. It takes file paths to .xlsx files, which the user generates in the Wave software from Agilent from the assay result files (.asyr). The .xlsx file contains different sheets of important data for the experiment: 1. Assay Information - Details about how the experiment was set up. 2. Rate Data - Information about the OCR and ECAR rates. 3. Raw Data - The original raw data collected during the experiment. 4. Calibration Data - Data related to calibrating the instrument. Seahtrue focuses on getting the specific data needed for analysis. Once this data is extracted, it is prepared for calculations through preprocessing. To make sure everything is accurate, both the initial data and the preprocessed data go through thorough checks.
This package provides a series of R functions that come in handy while working with metabarcoding data. The aim is to keep the functions we use all the time stored in a curated, reproducible way. In essence, it combines the grammar of the tidyverse from Wickham et al. (2019) <doi:10.21105/joss.01686> with the functions we have used in community ecology, compiled in packages like vegan from Dixon (2003) <doi:10.1111/j.1654-1103.2003.tb02228.x> and phyloseq from McMurdie & Holmes (2013) <doi:10.1371/journal.pone.0061217>. The package includes functions to read sequences from FAST(A/Q) into a tibble (fasta_reader and fastq_reader) and to process cutadapt (Martin 2011, <doi:10.14806/ej.17.1.200>) info-file output. When it comes to sequence counts across samples, the package works with the long format in mind (a three-column tibble with Sample, Sequence and counts), with functions to move from there to the wider format.
The EMDomics algorithm is used to perform a supervised multi-class analysis to measure the magnitude and statistical significance of observed continuous genomics data between groups. Usually the data will be gene expression values from array-based or sequence-based experiments, but data from other types of experiments can also be analyzed (e.g. copy number variation). Traditional methods like Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA) use significance tests based on summary statistics (mean and standard deviation) of the distributions. This approach lacks power to identify expression differences between groups that show high levels of intra-group heterogeneity. The Earth Mover's Distance (EMD) algorithm instead computes the "work" needed to transform one distribution into another, thus providing a metric of the overall difference in shape between two distributions. Permutation of sample labels is used to generate q-values for the observed EMD scores. This package also incorporates the Kolmogorov-Smirnov (K-S) test and the Cramer-von Mises (CVM) test, which are both common distribution comparison tests.
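The idea of scoring a gene by EMD and calibrating the score by permuting sample labels can be sketched as follows. This is a simplified illustration, not the package's implementation: it assumes two groups of equal size, uses the one-dimensional identity that EMD equals the mean absolute difference of the sorted samples, and returns a permutation p-value for a single gene rather than q-values across genes. All names are hypothetical.

```python
import random


def emd_1d(x, y):
    """Earth Mover's Distance between two equal-sized 1-D samples.

    For equal sample sizes, the optimal transport plan pairs the sorted values.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Share of label permutations whose EMD is at least the observed EMD."""
    rng = random.Random(seed)
    observed = emd_1d(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if emd_1d(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


group_a = [float(v) for v in range(10)]
group_b = [float(v) for v in range(10, 20)]
p_value = permutation_p_value(group_a, group_b)
```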
This package provides useful tools for cognitive diagnosis modeling (CDM). The package includes functions for empirical Q-matrix estimation and validation, such as the Hull method (Nájera, Sorrel, de la Torre, & Abad, 2021, <doi:10.1111/bmsp.12228>) and the discrete factor loading method (Wang, Song, & Ding, 2018, <doi:10.1007/978-3-319-77249-3_29>). It also contains dimensionality assessment procedures for CDM, including parallel analysis and automated fit comparison as explored in Nájera, Abad, and Sorrel (2021, <doi:10.3389/fpsyg.2021.614470>). Other relevant methods and features for CDM applications, such as the restricted DINA model (Nájera et al., 2023; <doi:10.3102/10769986231158829>), the general nonparametric classification method (Chiu et al., 2018; <doi:10.1007/s11336-017-9595-4>), and corrected estimation of the classification accuracy via multiple imputation (Kreitchmann et al., 2022; <doi:10.3758/s13428-022-01967-5>) are also available. Lastly, the package provides some useful functions for CDM simulation studies, such as random Q-matrix generation and detection of complete/identified Q-matrices.
Offers tools for visualizing and analyzing size and power properties of tests for equal predictive accuracy, including Diebold-Mariano and related procedures. Provides multiple Diebold-Mariano test implementations based on fixed-smoothing approaches, including fixed-b methods such as Kiefer and Vogelsang (2005) <doi:10.1017/S0266466605050565>, and applications to tests for equal predictive accuracy as in Coroneo and Iacone (2020) <doi:10.1002/jae.2756>, alongside conventional large-sample approximations. HAR inference involves nonparametric estimation of the long-run variance, and a key tuning parameter (the truncation parameter) trades off size and power. Lazarus, Lewis, and Stock (2021) <doi:10.3982/ECTA15404> theoretically characterize the size-power frontier for the Gaussian multivariate location model. ForeComp computes and visualizes the finite-sample size-power frontier of the Diebold-Mariano test based on fixed-b asymptotics together with the Bartlett kernel. To compute finite-sample size and power, it fits a best approximating ARMA process to the input data and reports how the truncation parameter performs and how robust testing outcomes are to its choice.
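The role of the truncation parameter can be seen in a minimal sketch of a Diebold-Mariano statistic whose long-run variance is estimated with the Bartlett kernel. The function names are hypothetical, the weight convention 1 - j/M for lags j < M is one common choice, and the sketch stops at the statistic: under fixed-b asymptotics the critical values are nonstandard and are not computed here.

```python
import math


def bartlett_lrv(d, bandwidth):
    """Bartlett-kernel long-run variance estimate of the series d.

    bandwidth is the truncation parameter M; lag j receives weight 1 - j/M.
    """
    n = len(d)
    mean = sum(d) / n
    dev = [v - mean for v in d]

    def autocov(lag):
        return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n

    lrv = autocov(0)
    for lag in range(1, min(bandwidth, n)):
        lrv += 2.0 * (1.0 - lag / bandwidth) * autocov(lag)
    return lrv


def dm_statistic(loss_a, loss_b, bandwidth):
    """Diebold-Mariano statistic for the loss differential loss_a - loss_b."""
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    mean = sum(d) / n
    return math.sqrt(n) * mean / math.sqrt(bartlett_lrv(d, bandwidth))


stat = dm_statistic([2.0, 1.0, 3.0, 2.0], [1.0, 1.0, 1.0, 1.0], bandwidth=1)
```

Recomputing the statistic over a grid of bandwidths shows how sensitive a testing outcome is to the truncation parameter.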
This package provides a suite of functions that fit models that use PPM type priors for partitions. Models include hierarchical Gaussian and probit ordinal models with a (covariate dependent) PPM. If a covariate dependent product partition model is selected, then all the options detailed in Page, G.L.; Quintana, F.A. (2018) <doi:10.1007/s11222-017-9777-z> are available. If covariate values are missing, then the approach detailed in Page, G.L.; Quintana, F.A.; Mueller, P (2020) <doi:10.1080/10618600.2021.1999824> is employed. Also included in the package is a function that fits a Gaussian likelihood spatial product partition model that is detailed in Page, G.L.; Quintana, F.A. (2016) <doi:10.1214/15-BA971>, and multivariate PPM change point models that are detailed in Quinlan, J.J.; Page, G.L.; Castro, L.M. (2023) <doi:10.1214/22-BA1344>. In addition, a function that fits a univariate or bivariate functional data model that employs a PPM or a PPMx to cluster curves based on B-spline coefficients is provided.
This package provides a scaling method to obtain a standardized Moran's I measure. Moran's I is a measure of the spatial autocorrelation of a data set; it quantifies the similarity between data and its surroundings. In theory the value should lie in [-1,1], but this does not always happen in practice. This package scales the Moran's I value and maps it into the theoretical range of [-1,1]. Rescaling facilitates comparison between projects. For instance, a researcher can calculate Moran's I in a city in China, with a sample size of n1 and area of interest a1, while another researcher runs a similar experiment in a city in Mexico with a different sample size, n2, and area of interest a2. Due to the differences between the conditions, it is not possible to compare the raw Moran's I values in a straightforward way. In this version of the package, the spatial autocorrelation Moran's I is calculated as proposed in Chen (2013) <arXiv:1606.03658>.
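For orientation, the unscaled statistic can be sketched with the standard textbook formula, I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2. This is an assumption for illustration only: it is not necessarily the Chen (2013) formulation used by the package, and the package's rescaling step is not reproduced. The function name and the chain-shaped weight matrix are hypothetical.

```python
def morans_i(values, weights):
    """Textbook Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total weight
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


# Four sites on a line; each site neighbours the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)          # neighbours are similar
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)   # neighbours are dissimilar
```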
This package provides a comprehensive toolset for any useR conducting topological data analysis, specifically via the calculation of persistent homology in a Vietoris-Rips complex. The tools this package currently provides can be conveniently split into three main sections: (1) calculating persistent homology; (2) conducting statistical inference on persistent homology calculations; (3) visualizing persistent homology and statistical inference. The published form of TDAstats can be found in Wadhwa et al. (2018) <doi:10.21105/joss.00860>. For a general background on computing persistent homology for topological data analysis, see Otter et al. (2017) <doi:10.1140/epjds/s13688-017-0109-5>. To learn more about how the permutation test is used for nonparametric statistical inference in topological data analysis, read Robinson & Turner (2017) <doi:10.1007/s41468-017-0008-7>. To learn more about how TDAstats calculates persistent homology, you can visit the GitHub repository for Ripser, the software that works behind the scenes at <https://github.com/Ripser/ripser>.
A number of statistical tests have been proposed to compare two survival curves, including the difference in (or ratio of) t-year survival, difference in (or ratio of) p-th percentile survival, difference in (or ratio of) restricted mean survival time, and the weighted log-rank test. Despite the multitude of options, the convention in survival studies is to assume proportional hazards and to use the unweighted log-rank test for design and analysis. This package provides sample size and power calculation for all of the above statistical tests with allowance for flexible accrual, censoring, and survival (e.g., Weibull, piecewise-exponential, mixture cure). It is the companion R package to the paper by Yung and Liu (2020) <doi:10.1111/biom.13196>. Specific to the weighted log-rank test, users may specify which approximations they wish to use to estimate the large-sample mean and variance. The default option has been shown to provide substantial improvement over the conventional sample size and power equations based on Schoenfeld (1981) <doi:10.1093/biomet/68.1.316>.
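As a point of reference, the conventional Schoenfeld-type calculation mentioned above gives the required number of events for the unweighted log-rank test under proportional hazards as d = (z_{1-alpha/2} + z_{1-beta})^2 / (p (1 - p) (log HR)^2), where p is the allocation proportion. The sketch below implements only this baseline formula, not the package's default approximation; the function name and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.80, allocation=0.5):
    """Events needed by the two-sided log-rank test under proportional hazards."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    denom = allocation * (1.0 - allocation) * math.log(hazard_ratio) ** 2
    return math.ceil((z_alpha + z_beta) ** 2 / denom)


events_needed = schoenfeld_events(0.7)
```

Note that the formula returns a number of events, not participants; converting to a sample size requires assumptions about accrual, censoring, and the survival distribution.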
This package provides the tools to produce catseye plots, principally via the catseyesplot() function, which calls R's standard plot() function internally, or alternatively via the catseyes() function, which overlays the catseye plot onto an existing R plot window. Catseye plots illustrate the normal distribution of the mean (picture a normal bell curve reflected over its base and rotated 90 degrees), with a shaded confidence interval; they are an intuitive way of illustrating and comparing normally distributed estimates, and are arguably a superior alternative to standard confidence intervals, since they show the full distribution rather than fixed quantile bounds. The catseyesplot and catseyes functions require pre-calculated means and standard errors (or standard deviations), provided as numeric vectors; this allows the flexibility of obtaining this information from a variety of sources, such as direct calculation or prediction from a model. Catseye plots, as illustrations of the normal distribution of the means, are described in Cumming (2013 & 2014). Cumming, G. (2013). The new statistics: Why and how. Psychological Science, 27, 7-29. <doi:10.1177/0956797613504966> pmid:24220629.
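The geometry of a catseye can be sketched from a mean and a standard error alone. The code below is an illustrative assumption, not the package's plotting code: it returns the confidence interval to shade and, for a grid of values on the vertical axis, the normal density that serves as the half-width of the mirrored curve. The function name and its defaults are hypothetical.

```python
from statistics import NormalDist


def catseye(mean, se, conf=0.95, n_points=101, span=4.0):
    """Return the (lower, upper) interval and (value, half_width) outline points."""
    dist = NormalDist(mean, se)
    z = NormalDist().inv_cdf(0.5 + conf / 2.0)
    interval = (mean - z * se, mean + z * se)
    # Evenly spaced grid covering mean +/- span standard errors.
    step = 2.0 * span * se / (n_points - 1)
    grid = [mean - span * se + k * step for k in range(n_points)]
    outline = [(y, dist.pdf(y)) for y in grid]
    return interval, outline


interval, outline = catseye(10.0, 2.0)
```

Drawing each outline point at both plus and minus its half-width, and shading the part that falls inside the interval, yields the catseye shape.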
This package provides tools to calculate the theoretical hydrodynamic response of an aquifer undergoing harmonic straining or pressurization, or analyze measured responses. There are two classes of models here, designed for use with confined aquifers: (1) for sealed wells, based on the model of Kitagawa et al (2011, <doi:10.1029/2010JB007794>), and (2) for open wells, based on the models of Cooper et al (1965, <doi:10.1029/JZ070i016p03915>), Hsieh et al (1987, <doi:10.1029/WR023i010p01824>), Rojstaczer (1988, <doi:10.1029/JB093iB11p13619>), Liu et al (1989, <doi:10.1029/JB094iB07p09453>), and Wang et al (2018, <doi:10.1029/2018WR022793>). Wang's solution is a special exception which allows for leakage out of the aquifer (semi-confined); it is equivalent to Hsieh's model when there is no leakage (the confined case). These models treat strain (or aquifer head) as an input to the physical system, and fluid-pressure (or water height) as the output. The applicable frequency band of these models is characteristic of seismic waves, atmospheric pressure fluctuations, and solid earth tides.
Randomized clinical trials commonly follow participants for a time-to-event efficacy endpoint for a fixed period of time. Consequently, at the time when the last enrolled participant completes their follow-up, the number of observed endpoints is a random variable. Assuming data collected through an interim timepoint, simulation-based estimation and inferential procedures in the standard right-censored failure time analysis framework are conducted for the distribution of the number of endpoints--in total as well as by treatment arm--at the end of the follow-up period. The future (i.e., yet unobserved) enrollment, endpoint, and dropout times are generated according to mechanisms specified in the simTrial() function in the seqDesign package. A Bayesian model for the endpoint rate, offering the option to specify a robust mixture prior distribution, is used for generating future data (see the vignette for details). Inference can be restricted to participants who received treatment according to the protocol and are observed to be at risk for the endpoint at a specified timepoint. Plotting functions are provided for graphical display of results.
This package performs combination tests and sample size calculations for fixed designs with survival endpoints under either proportional hazards or non-proportional hazards. The combination tests include the maximum weighted log-rank test and the projection test. The sample size calculation procedure is very flexible, allowing for a user-defined hazard ratio function and considering various trial conditions such as staggered entry and drop-out. The sample size calculation also applies to various cure models, such as the proportional hazards cure model and the cure model with (random) delayed treatment effects. A trial simulation function is also provided to facilitate empirical power calculation. The references for the projection test and the maximum weighted log-rank test include Brendel et al. (2014) <doi:10.1111/sjos.12059> and Cheng and He (2021) <arXiv:2110.03833>. The references for sample size calculation under proportional hazards include Schoenfeld (1981) <doi:10.1093/biomet/68.1.316> and Freedman (1982) <doi:10.1002/sim.4780010204>. The references for calculation under non-proportional hazards include Lakatos (1988) <doi:10.2307/2531910> and Cheng and He (2023) <doi:10.1002/bimj.202100403>.
The aim of postpack is to provide the infrastructure for a standardized workflow for mcmc.list objects. These objects can be used to store output from models fitted with Bayesian inference using JAGS, WinBUGS, OpenBUGS, NIMBLE, Stan, or even custom MCMC algorithms. Although the coda R package provides some methods for these objects, it is somewhat limited in easily performing post-processing tasks for specific nodes. Models are ever increasing in their complexity and the number of tracked nodes, and oftentimes a user may wish to summarize/diagnose sampling behavior for only a small subset of nodes at a time for a particular question or figure. Thus, many postpack functions support performing tasks on a subset of nodes, where the subset is specified with regular expressions. The functions in postpack streamline the extraction, summarization, and diagnostics of specific monitored nodes after model fitting. Further, because there is rarely only ever one model under consideration, postpack scales efficiently to perform the same tasks on output from multiple models simultaneously, facilitating rapid assessment of model sensitivity to changes in assumptions.
Analysis of relative cell type proportions in bulk gene expression data. Provides a well-validated set of brain cell type-specific marker genes derived from multiple types of experiments, as described in McKenzie (2018) <doi:10.1038/s41598-018-27293-5>. For brain tissue data sets, there are marker genes available for astrocytes, endothelial cells, microglia, neurons, oligodendrocytes, and oligodendrocyte precursor cells, derived from each of human, mouse, and combined human/mouse data sets. However, if you have access to your own marker genes, the functions can be applied to bulk gene expression data from any tissue. Also implements multiple options for relative cell type proportion estimation using these marker genes, adapting and expanding on approaches from the CellCODE R package described in Chikina (2015) <doi:10.1093/bioinformatics/btv015>. The number of cell type marker genes used in a given analysis can be increased or decreased based on your preferences and the data set. Finally, provides functions to use the estimates to adjust for variability in the relative proportion of cell types across samples prior to downstream analyses.
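Selecting a subset of monitored nodes with a regular expression can be sketched as below. This is a generic illustration of the idea, not postpack's interface: the function names, the dictionary layout of the draws, and the node names are all hypothetical.

```python
import re


def match_nodes(node_names, pattern):
    """Return the node names that match the regular expression, in original order."""
    regex = re.compile(pattern)
    return [name for name in node_names if regex.search(name)]


def posterior_means(draws, pattern):
    """Posterior mean of every matched node; draws maps node name -> list of samples."""
    return {name: sum(draws[name]) / len(draws[name]) for name in match_nodes(draws, pattern)}


# Hypothetical MCMC output with four monitored nodes.
draws = {
    "alpha": [0.9, 1.1],
    "beta[1]": [2.0, 4.0],
    "beta[2]": [5.0, 7.0],
    "sigma": [0.5, 0.7],
}
beta_means = posterior_means(draws, r"^beta\[")
```

Brackets are regex metacharacters, so an index such as `[1]` has to be escaped when it appears in the pattern.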
Assess the calibration of an existing (i.e. previously developed) multistate model through calibration plots. Calibration is assessed using one of three methods. 1) Calibration methods for binary logistic regression models applied at a fixed time point in conjunction with inverse probability of censoring weights. 2) Calibration methods for multinomial logistic regression models applied at a fixed time point in conjunction with inverse probability of censoring weights. 3) Pseudo-values estimated using the Aalen-Johansen estimator of observed risk. All methods are applied in conjunction with landmarking when required. These calibration plots evaluate the calibration (in a validation cohort of interest) of the transition probabilities estimated from an existing multistate model. While package development has focused on multistate models, calibration plots can be produced for any model which utilises information post baseline to update predictions (e.g. dynamic models); competing risks models; or standard single outcome survival models, where predictions can be made at any landmark time. Please see Pate et al. (2024) <doi:10.1002/sim.10094> and Pate et al. (2024) <https://alexpate30.github.io/calibmsm/articles/Overview.html>.
This package implements estimation and testing procedures for evaluating an intermediate biomarker response as a principal surrogate of a clinical response to treatment (i.e., principal stratification effect modification analysis), as described in Juraska M, Huang Y, and Gilbert PB (2020), Inference on treatment effect modification by biomarker response in a three-phase sampling design, Biostatistics, 21(3): 545-560 <doi:10.1093/biostatistics/kxy074>. The methods avoid the restrictive placebo structural risk modeling assumption common to past methods and further improve robustness by the use of nonparametric kernel smoothing for biomarker density estimation. A randomized controlled two-group clinical efficacy trial is assumed with an ordered categorical or continuous univariate biomarker response measured at a fixed timepoint post-randomization and with a univariate baseline surrogate measure allowed to be observed in only a subset of trial participants with an observed biomarker response (see the flexible three-phase sampling design in the paper for details). Bootstrap-based procedures are available for pointwise and simultaneous confidence intervals and testing of four relevant hypotheses. Summary and plotting functions are provided for estimation results.
This package provides functions that fit two modern education-based value-added models. One of these models is the quantile value-added model. This model permits estimating a school's value-added based on specific quantiles of the post-test distribution. Estimating value-added based on quantiles of the post-test distribution provides a more complete picture of an education institution's contribution to learning for students of all abilities. See Page, G.L.; San Martín, E.; Orellana, J.; Gonzalez, J. (2017) <doi:10.1111/rssa.12195> for more details. The second model is a temporally dependent value-added model. This model takes into account the temporal dependence that may exist in school performance between two cohorts in one of two ways. The first is by modeling school random effects with a non-stationary AR(1) process. The second is by modeling school effects based on the previous cohort's post-test performance. In addition to more efficiently estimating value-added, this model permits making statements about the persistence of a school's effectiveness. The standard value-added model is also an option.
US VAERS vaccine data for 01/01/2018 - 06/14/2018. If you want to explore the full VAERS data for 1990 - Present (data, symptoms, and vaccines), then check out the vaers package from the URL below. The URL and BugReports below correspond to the vaers package, of which vaersvax is a small subset (2018 only). vaers is not hosted on CRAN due to the large size of the data set. To install the Suggested vaers and vaersND packages, use the following R code: devtools::install_git("https://gitlab.com/iembry/vaers.git", build_vignettes = TRUE) and devtools::install_git("https://gitlab.com/iembry/vaersND.git", build_vignettes = TRUE). "The Vaccine Adverse Event Reporting System (VAERS) is a national early warning system to detect possible safety problems in U.S.-licensed vaccines. VAERS is co-managed by the Centers for Disease Control and Prevention (CDC) and the U.S. Food and Drug Administration (FDA)." For more information about the data, visit <https://vaers.hhs.gov/>. For information about vaccination/immunization hazards, visit <http://www.questionuniverse.com/rethink.html#vaccine>.
This package provides a comprehensive Shiny application for analyzing Whole Genome Duplication ('WGD') events. It offers a user-friendly Shiny web application that lets non-experienced researchers prepare input data and execute command lines for several well-known WGD analysis tools, including wgd, ksrates, i-ADHoRe, OrthoFinder, and Whale. The source code is also provided so that experienced researchers can adjust and install the package on their own server. Key features: 1) Input data preparation: users can conveniently upload and format their data, making it compatible with various WGD analysis tools. 2) Command line generation: the package automatically generates the necessary command lines for selected WGD analysis tools, reducing manual errors and saving time. 3) Visualization: interactive visualizations help explore and interpret WGD results, facilitating in-depth WGD analysis. 4) Comparative genomics: users can study and compare WGD events across different species, aiding evolutionary and comparative genomics studies. 5) User-friendly interface: the Shiny web application provides an intuitive interface, making WGD analysis accessible to researchers and bioinformaticians of all levels.
Testing homogeneity of k multivariate distributions is a classical and challenging problem in statistics, and this becomes even more challenging when the dimension of the data exceeds the sample size. We construct some tests for this purpose which are exact level (size) alpha tests based on clustering. These tests are easy to implement and distribution-free in finite sample situations. Under appropriate regularity conditions, these tests have the consistency property in the HDLSS asymptotic regime, where the dimension of data grows to infinity while the sample size remains fixed. We also consider a multiscale approach, where the results for different numbers of partitions are aggregated judiciously. Details are in Biplab Paul, Shyamal K De and Anil K Ghosh (2021) <doi:10.1016/j.jmva.2021.104897>; Soham Sarkar and Anil K Ghosh (2019) <doi:10.1109/TPAMI.2019.2912599>; William M Rand (1971) <doi:10.1080/01621459.1971.10482356>; Cyrus R Mehta and Nitin R Patel (1983) <doi:10.2307/2288652>; Joseph C Dunn (1973) <doi:10.1080/01969727308546046>; Sture Holm (1979) <doi:10.2307/4615733>; Yoav Benjamini and Yosef Hochberg (1995) <doi:10.2307/2346101>.