Implementation of a novel frequentist approach to produce clinically relevant treatment hierarchies in network meta-analysis. The method is based on treatment choice criteria (TCC) and probabilistic ranking models, as described by Evrenoglou et al. (2024) <DOI:10.48550/arXiv.2406.10612>. The TCC are defined using a rule based on the minimal clinically important difference. Using the defined TCC, the study-level data (i.e., treatment effects and standard errors) are first transformed into a preference format, indicating either a treatment preference (e.g., treatment A > treatment B) or a tie (treatment A = treatment B). The preference data are then synthesized using a probabilistic ranking model, which estimates the latent ability parameter of each treatment and produces the final treatment hierarchy. This parameter represents each treatment's ability to outperform all the other competing treatments in the network. Consequently, larger ability estimates indicate higher positions in the ranking list.
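A minimal base-R sketch of the preference step is shown below. The cutoff rule used here (prefer A over B only when the confidence interval of the relative effect lies entirely beyond the minimal clinically important difference, otherwise declare a tie) is an illustrative assumption, not necessarily the exact TCC defined by Evrenoglou et al. (2024):

    # Illustrative only: convert one pairwise estimate into a preference or a tie.
    # te = relative effect of A vs. B, se = its standard error,
    # mcid = minimal clinically important difference.
    tcc_preference <- function(te, se, mcid, level = 0.95) {
      z  <- qnorm(1 - (1 - level) / 2)
      lo <- te - z * se
      hi <- te + z * se
      if (lo > mcid)  return("A > B")
      if (hi < -mcid) return("B > A")
      "A = B"   # tie: no clinically relevant preference either way
    }

    tcc_preference(te = 0.40, se = 0.10, mcid = 0.20)   # "A > B"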
This package provides a molecular genetics tool that processes binary data from fragment analysis. It consolidates replicate sample pairs, outputs summary statistics, and produces hierarchical clustering trees and nMDS plots. This package was developed from the publication available here: <https://www.sciencedirect.com/science/article/pii/S1049964420306538>. The GUI version of this package is available on the R Shiny online server at <https://clarkevansteenderen.shinyapps.io/BINMAT/>, or it can be accessed via GitHub by typing shiny::runGitHub("BinMat", "clarkevansteenderen") into the R console. Two real-world datasets accompany the package: an AFLP dataset of Bunias orientalis samples from Tewes et al. (2017) <https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2745.12869>, and an ISSR dataset of Nymphaea specimens from Reid et al. (2021) <https://www.sciencedirect.com/science/article/pii/S0304377021000218>. The authors of these publications are thanked for allowing the use of their data.
This package creates realistic random trajectories in a 3-D space between two given fixed points, so-called conditional empirical random walks (CERWs). The trajectory generation is based on empirical distribution functions extracted from observed trajectories (training data) and thus reflects the geometrical movement characteristics of the mover. A digital elevation model (DEM), representing the Earth's surface, and a background layer of probabilities (e.g. food sources, uplift potential, waterbodies, etc.) can be used to influence the trajectories. Unterfinger M (2018). "3-D Trajectory Simulation in Movement Ecology: Conditional Empirical Random Walk". Master's thesis, University of Zurich. <https://www.geo.uzh.ch/dam/jcr:6194e41e-055c-4635-9807-53c5a54a3be7/MasterThesis_Unterfinger_2018.pdf>. Technitis G, Weibel R, Kranstauber B, Safi K (2016). "An algorithm for empirically informed random trajectory generation between two endpoints". GIScience 2016: Ninth International Conference on Geographic Information Science, 9, online. <doi:10.5167/uzh-130652>.
This package provides a toolkit that allows scientists to work with data from single cell sequencing technologies such as scRNA-seq, scVDJ-seq, scATAC-seq, CITE-Seq and Spatial Transcriptomics (ST). The Single (i) Cell R package ('iCellR') provides unprecedented flexibility at every step of the analysis pipeline, including normalization, clustering, dimensionality reduction, imputation, and visualization. Users can design both unsupervised and supervised models to best suit their research. In addition, the toolkit provides 2D and 3D interactive visualizations, differential expression analysis, filters based on cells, genes and clusters, data merging, normalization for dropouts, data imputation methods, correction for batch differences, pathway analysis, tools to find marker genes for clusters and conditions, cell type prediction, and pseudotime analysis. See Khodadadi-Jamayran et al. (2020) <doi:10.1101/2020.05.05.078550> and Khodadadi-Jamayran et al. (2020) <doi:10.1101/2020.03.31.019109> for more details.
This package provides a weather generator to simulate precipitation and temperature for regions with seasonality. Users input training data containing precipitation, temperature, and seasonality (up to 26 seasons). Including weather season as a training variable allows users to explore the effects of potential changes in season duration as well as average start- and end-time dates due to phenomena like climate change. Data for training should be a single time series but can originate from station data, basin averages, grid cells, etc. Bearup, L., Gangopadhyay, S., & Mikkelson, K. (2021). "Hydroclimate Analysis Lower Santa Cruz River Basin Study (Technical Memorandum No ENV-2020-056)." Bureau of Reclamation. Gangopadhyay, S., Bearup, L. A., Verdin, A., Pruitt, T., Halper, E., & Shamir, E. (2019, December 1). "A collaborative stochastic weather generator for climate impacts assessment in the Lower Santa Cruz River Basin, Arizona." Fall Meeting 2019, American Geophysical Union. <https://ui.adsabs.harvard.edu/abs/2019AGUFMGC41G1267G>.
This package performs iterative extrapolation of species haplotype accumulation curves using a nonparametric stochastic (Monte Carlo) optimization method for assessing specimen sampling completeness, based on the approach of Phillips et al. (2015) <doi:10.1515/dna-2015-0008>, Phillips et al. (2019) <doi:10.1002/ece3.4757> and Phillips et al. (2020) <doi:10.7717/peerj-cs.243>. HACSim outputs a number of useful summary statistics of sampling coverage ("Measures of Sampling Closeness"), including an estimate of the likely required sample size (along with confidence intervals at the desired level) necessary to recover a given number or proportion of the observed unique species haplotypes. Any genomic marker can be targeted to assess likely required specimen sample sizes for genetic diversity assessment. The method is particularly well-suited to assessing sampling sufficiency for DNA barcoding initiatives. Users can also simulate their own DNA sequences according to various models of nucleotide substitution. A Shiny app is also available.
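The notion of a haplotype accumulation curve can be pictured with a few lines of base R; this is a conceptual sketch only, not HACSim's iterative extrapolation algorithm:

    # Mean number of unique haplotypes recovered as more specimens are sampled
    # at random from a population of haplotype labels (Monte Carlo average).
    set.seed(1)
    pop <- sample(1:15, size = 500, replace = TRUE, prob = rexp(15))
    accum <- sapply(1:100, function(n) {
      mean(replicate(200, length(unique(sample(pop, n)))))
    })
    plot(accum, type = "l", xlab = "Specimens sampled",
         ylab = "Mean unique haplotypes recovered")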
This package provides a workflow for the use of the EaSIeR tool, developed to assess a patient's likelihood of responding to ICB therapies using only the patient's RNA-seq data as input. We integrate RNA-seq data with different types of prior knowledge to extract quantitative descriptors of the tumor microenvironment from several points of view, including the composition of the immune repertoire and the activity of intra- and extra-cellular communications. Then, we use multi-task machine learning trained on TCGA data to identify how these descriptors can simultaneously predict several state-of-the-art hallmarks of anti-cancer immune response. In this way we derive cancer-specific models and identify cancer-specific systems biomarkers of immune response. These biomarkers have been experimentally validated in the literature, and the performance of EaSIeR predictions has been validated using independent datasets from four different cancer types with patients treated with anti-PD1 or anti-PDL1 therapy.
Incorporates Approximate Bayesian Computation (ABC) to obtain a posterior distribution and to select optimal model parameters for an observation point. Additionally, a meta-sampling heuristic algorithm is implemented for parameter estimation, which requires no model runs and is dimension-independent. A sampling scheme that allows model runs and uses the meta-sampling for point generation is also presented, and a predictor based on the meta-sampling of the model output is provided. All the algorithms leverage a machine learning method utilizing the maxima weighted Isolation Kernel approach ('MaxWiK'). The method involves transforming raw data to a Hilbert space (mapping) and measuring the similarity between simulated points and the maxima weighted Isolation Kernel mapping corresponding to the observation point. Comprehensive details of the methodology can be found in the papers Iurii Nagornov (2024) <doi:10.1007/978-3-031-66431-1_16> and Iurii Nagornov (2023) <doi:10.1007/978-3-031-29168-5_18>.
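For orientation, the basic rejection-ABC idea underlying the package can be sketched in base R; this is plain rejection ABC, not the maxima weighted Isolation Kernel scheme that MaxWiK implements:

    # Plain rejection ABC for the mean of a normal model.
    set.seed(1)
    obs   <- rnorm(50, mean = 2)                 # "observed" data
    s_obs <- mean(obs)                           # summary statistic
    prior <- runif(2e4, -5, 5)                   # parameter draws from the prior
    s_sim <- sapply(prior, function(th) mean(rnorm(50, mean = th)))
    post  <- prior[abs(s_sim - s_obs) < 0.05]    # keep draws that reproduce s_obs
    quantile(post, c(0.025, 0.5, 0.975))         # approximate posterior summary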
Pathway analysis statistically links observations on the molecular level to biological processes or pathways on the systems (i.e., organism, organ, tissue, cell) level. Traditionally, pathway analysis methods regard pathways as collections of single genes and treat all genes in a pathway as equally informative. However, this can lead to identifying spurious pathways as statistically significant, since components are often shared amongst pathways. SIGORA seeks to avoid this pitfall by focusing on genes or gene pairs that are (as a combination) specific to a single pathway. In relying on such pathway gene-pair signatures (Pathway-GPS), SIGORA inherently uses the status of other genes in the experimental context to identify the most relevant pathways. The current version allows for pathway analysis of human and mouse datasets. In addition, it contains pre-computed Pathway-GPS data for pathways in the KEGG and Reactome pathway repositories, as well as mechanisms for extracting GPS for user-supplied repositories.
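For contrast, the traditional single-gene over-representation test that SIGORA refines can be written in one line of base R using the hypergeometric distribution (a textbook baseline, not SIGORA's gene-pair signature test):

    # P(observing >= 12 pathway genes in a 300-gene hit list) when the pathway
    # has 150 genes out of a 20000-gene universe.
    universe <- 20000; pathway <- 150; hits <- 300; overlap <- 12
    phyper(overlap - 1, pathway, universe - pathway, hits, lower.tail = FALSE)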
This package implements the conditionally symmetric multidimensional Gaussian mixture model (csmGmm) for large-scale testing of composite null hypotheses in genetic association applications such as mediation analysis, pleiotropy analysis, and replication analysis. In such analyses, we typically have J sets of K test statistics where K is a small number (e.g. 2 or 3) and J is large (e.g. 1 million). For each one of the J sets, we want to know whether we can reject all K individual nulls. Please see the vignette for a quickstart guide. The methods are described in "Testing a Large Number of Composite Null Hypotheses Using Conditionally Symmetric Multidimensional Gaussian Mixtures in Genome-Wide Studies" by Sun R, McCaw Z, & Lin X (2024, <doi:10.1080/01621459.2024.2422124>), accepted and published online (but not yet in print) in the Journal of the American Statistical Association as of December 1, 2024.
The GenSVM classifier is a generalized multiclass support vector machine (SVM). This classifier aims to find decision boundaries that separate the classes with as wide a margin as possible. In GenSVM, the loss function is very flexible in the way that misclassifications are penalized. This allows the user to tune the classifier to the dataset at hand and potentially obtain higher classification accuracy than alternative multiclass SVMs. Moreover, this flexibility means that GenSVM has a number of other multiclass SVMs as special cases. Another advantage of GenSVM is that it is trained in the primal space, allowing the use of warm starts during optimization. This means that for common tasks such as cross validation or repeated model fitting, GenSVM can be trained very quickly. Based on: G.J.J. van den Burg and P.J.F. Groenen (2018) <https://www.jmlr.org/papers/v17/14-526.html>.
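A minimal usage sketch with the accompanying R package is shown below; it assumes the package exports a gensvm() fitting function with a matching predict() method, so check the package documentation for the exact interface:

    # Hedged sketch: fit a generalized multiclass SVM on iris and predict.
    library(gensvm)

    x <- as.matrix(iris[, 1:4])
    y <- iris$Species
    fit  <- gensvm(x, y)        # trained in the primal space
    pred <- predict(fit, x)
    mean(pred == y)             # training accuracy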
This package provides a high performance interface to the Global Biodiversity Information Facility ('GBIF'). In contrast to 'rgbif', which can access small subsets of GBIF data through web-based queries to a central server, gbifdb provides enhanced performance for R users performing large-scale analyses on servers and cloud computing providers, with full support for arbitrary SQL or dplyr operations on the complete GBIF data tables (now over 1 billion records, and over a terabyte in size). gbifdb accesses a copy of the GBIF data in parquet format, which is readily available in commercial computing clouds such as the Amazon Open Data portal and the Microsoft Planetary Computer; it can be queried directly without downloading, or downloaded to any server with suitable bandwidth and storage space. The high-performance techniques for local and remote access are described in <https://duckdb.org/why_duckdb> and <https://arrow.apache.org/docs/r/articles/fs.html>, respectively.
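A short usage sketch follows; it assumes the package exposes gbif_remote() (with a local counterpart, gbif_local()) returning a dplyr-compatible table of the GBIF occurrence snapshot, as described in its documentation:

    # Lazily query the remote GBIF parquet snapshot with dplyr verbs;
    # nothing is materialized locally until collect() is called.
    library(gbifdb)
    library(dplyr)

    gbif <- gbif_remote()
    gbif |>
      filter(phylum == "Chordata", year >= 2010) |>
      count(class, sort = TRUE) |>
      collect()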
Offers a general framework of multivariate mixed-effects models for the joint analysis of multiple correlated outcomes with clustered data structures and potential missingness, proposed by Wang et al. (2018) <doi:10.1093/biostatistics/kxy022>. The missingness of outcome values may depend on the values themselves (missing not at random and non-ignorable), or only on the covariates (missing at random and ignorable), or both. This package provides functions for two models: 1) mvMISE_b() allows correlated outcome-specific random intercepts with a factor-analytic structure, and 2) mvMISE_e() allows correlated outcome-specific error terms with a graphical lasso penalty on the error precision matrix. Both functions are motivated by the multivariate analysis of clustered data from labelling-based quantitative proteomic studies. These models and functions can also be applied to univariate and multivariate analyses of clustered data with balanced or unbalanced design and no missingness.
We present the Genotype-imputed Gene Set Enrichment Analysis (GIGSEA), a novel method that uses GWAS-and-eQTL-imputed trait-associated differential gene expression to interrogate gene set enrichment for trait-associated SNPs. By incorporating eQTLs from large gene expression studies, e.g. GTEx, GIGSEA appropriately addresses such challenges for SNP enrichment as gene size, gene boundary, SNP distal regulation, and multiple-marker regulation. A weighted linear regression model, taking as weights both imputation accuracy and model completeness, is used to perform the enrichment test, properly adjusting for the bias due to redundancy among different gene sets. A permutation test is then used to evaluate the significance of enrichment, whose efficiency can be greatly improved by expressing the computationally intensive part as large matrix operations. We have shown appropriate type I error rates for GIGSEA (<5%), and preliminary results also demonstrate its good performance in uncovering real signals.
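The core test can be pictured as a weighted regression of imputed differential expression on gene-set membership; the base-R sketch below is conceptual only and omits GIGSEA's adjustment for gene-set redundancy and its permutation step:

    # Conceptual weighted enrichment test: regress imputed differential
    # expression on gene-set membership, weighting genes by imputation quality.
    set.seed(1)
    n_genes  <- 5000
    de_score <- rnorm(n_genes)                 # imputed differential expression
    in_set   <- rbinom(n_genes, 1, 0.05)       # gene-set membership indicator
    w        <- runif(n_genes, 0.3, 1)         # imputation accuracy weights
    de_score <- de_score + 0.3 * in_set        # plant a weak enrichment signal
    fit <- lm(de_score ~ in_set, weights = w)
    summary(fit)$coefficients["in_set", ]      # membership effect and p-value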
Demo and datasets accompanying the books: "De l'analyse des réseaux expérimentaux à la méta-analyse: Méthodes et applications avec le logiciel R pour les sciences agronomiques et environnementales" (published 2018-06-28, Quae, French version) by David Makowski, Francois Piraux and Francois Brun <https://www.quae.com/produit/1514/9782759228164/de-l-analyse-des-reseaux-experimentaux-a-la-meta-analyse>, and "Knowledge Synthesis in Agriculture: from Experimental Network to Meta-Analysis" (in preparation for 2018-06, Springer, English version) by David Makowski, Francois Piraux and Francois Brun. A full description of all the material is given in both books. ACKNOWLEDGMENTS: The French network "RMT modeling and data analysis for agriculture" (<http://www.modelia.org>) contributed to the development of this R package. This project and network are led by ACTA (the French Technical Institute for Agriculture) and were funded by a grant from the Ministry of Agriculture and Fishing of France.
Tool for exploring DNA and amino acid variation and inferring the presence of target lineages from microbial high-throughput genomic DNA samples that potentially contain mixtures of variants/lineages. MixviR was originally created to help analyze environmental SARS-CoV-2/COVID-19 samples from sources such as wastewater or dust, but it can be applied to any microbial group. Inputs include reference genome information in commonly-used file formats (fasta, bed) and one or more variant call format (VCF) files, which can be generated with programs such as Illumina's DRAGEN, the Genome Analysis Toolkit, or bcftools. See DePristo et al. (2011) <doi:10.1038/ng.806> and Danecek et al. (2021) <doi:10.1093/gigascience/giab008> for these tools, respectively. Available outputs include a table of mutations observed in the sample(s), estimates of the proportions of target lineages in the sample(s), and an R Shiny dashboard to interactively explore the data.
RNA sequencing analysis methods are often derived by relying on hypothetical parametric models for read counts that are not likely to be precisely satisfied in practice. Methods are often tested by analyzing data that have been simulated according to the assumed model. This testing strategy can result in an overly optimistic view of the performance of an RNA-seq analysis method. We develop a data-based simulation algorithm for RNA-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source RNA-seq dataset provided by the user. Users control the proportion of genes simulated to be differentially expressed (DE) and can provide a vector of weights to control the distribution of effect sizes. The algorithm requires a matrix of RNA-seq read counts with large sample sizes in at least two treatment groups. Many datasets are available that fit this standard.
This package provides a general framework to perform statistical inference for each gene pair and global inference of whole-scale gene pairs in gene networks using the well-known Gaussian graphical model (GGM) in a time-efficient manner. We focus on the high-dimensional settings where p (the number of genes) is allowed to be far larger than n (the number of subjects). Four main approaches are supported in this package: (1) the bivariate nodewise scaled Lasso (Ren et al. (2015) <doi:10.1214/14-AOS1286>); (2) the de-sparsified nodewise scaled Lasso (Jankova and van de Geer (2017) <doi:10.1007/s11749-016-0503-5>); (3) the de-sparsified graphical Lasso (Jankova and van de Geer (2015) <doi:10.1214/15-EJS1031>); (4) GGM estimation with false discovery rate (FDR) control using scaled Lasso or Lasso (Liu (2013) <doi:10.1214/13-AOS1169>). Windows users should install Rtools before installing this package.
This package provides tools for fitting and simulating mixtures of Watson distributions. The random sampling scheme of the package offers two sampling algorithms that are based on the results of Sablica, Hornik and Leydold (2022) <doi:10.1080/10618600.2024.2416521>. Moreover, the package offers a smart tool to combine these two methods: based on the selected parameters, it approximates the relative sampling speed of both methods and picks the faster one. In addition, the package offers a fitting function for mixtures of Watson distributions that uses the expectation-maximization (EM) algorithm. Special features are the possibility to use multiple variants of the E-step and M-step, sparse matrices for the data representation, and state-of-the-art methods for the numerical evaluation of the needed special functions using the results of Sablica and Hornik (2022) <doi:10.1090/mcom/3690> and Sablica and Hornik (2024) <doi:10.1016/j.jmaa.2024.128262>.
The heterogeneous treatment effect estimation procedure proposed by Imai and Ratkovic (2013) <DOI:10.1214/12-AOAS593>. The proposed method is applicable, for example, when selecting a small number of the most (or least) efficacious treatments from a large number of alternative treatments, as well as when identifying subsets of the population who benefit from (or are harmed by) a treatment of interest. The method adapts the Support Vector Machine classifier by placing separate LASSO constraints over the pre-treatment parameters and the causal heterogeneity parameters of interest. This allows for the qualitative distinction between causal and other parameters, thereby making the variable selection suitable for the exploration of causal heterogeneity. The package also contains a class of functions, CausalANOVA, which estimates the average marginal interaction effects (AMIEs) by a regularized ANOVA as proposed by Egami and Imai (2019) <DOI:10.1080/01621459.2018.1476246>. It contains a variety of regularization techniques to facilitate analysis of large factorial experiments.
This package implements the Interval-Censored Sequence Kernel Association (ICSKAT) test for testing the association between interval-censored time-to-event outcomes and groups of single nucleotide polymorphisms (SNPs). Interval-censored time-to-event data occur when the event time is not known exactly but can be deduced to fall within a given interval. For example, some medical conditions like bone mineral density deficiency are generally only diagnosed at clinical visits. If a patient goes for clinical checkups yearly and is diagnosed at, say, age 30, then the onset of the deficiency is only known to fall between the date of their age 29 checkup and the date of the age 30 checkup. Interval-censored data include right- and left-censored data as special cases. This package also implements the interval-censored Burden test and the ICSKATO test, which is the optimal combination of the ICSKAT and Burden tests. Please see the vignette for a quickstart guide.
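To make the data structure concrete, interval-censored times are commonly stored as pairs of interval bounds, e.g. with the survival package's Surv(..., type = "interval2") representation shown below; this illustrates interval censoring in general, not the ICSKAT interface:

    # The onset is only known to lie between the last negative visit (left
    # bound) and the first positive visit (right bound); NA = right-censored.
    library(survival)

    left  <- c(29, 45, 50)     # age at last checkup without the condition
    right <- c(30, 47, NA)     # age at first checkup with it
    Surv(time = left, time2 = right, type = "interval2")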
This package performs Bayesian variable selection under normal linear models, with the model parameters assigned either the power-expected-posterior (PEP) prior or the intrinsic prior (a special case of the former) (Fouskakis and Ntzoufras (2022) <doi:10.1214/21-BA1288>, Fouskakis and Ntzoufras (2020) <doi:10.3390/econometrics8020017>). The prior distribution on the model space is either the uniform over all models or the uniform on model dimension (a special case of the beta-binomial prior). The selection is performed either by a full enumeration and evaluation of all possible models or by using the Markov Chain Monte Carlo Model Composition (MC3) algorithm (Madigan and York (1995) <doi:10.2307/1403615>). Complementary functions for hypothesis testing, estimation and prediction under Bayesian model averaging, as well as for plotting and printing the results, are also provided. The results can be compared to those obtained under other well-known priors on model parameters and model spaces.
This package provides a method to purify a cell type or cell population of interest from heterogeneous datasets. The scGate package automates marker-based purification of specific cell populations, without requiring training data or reference gene expression profiles. scGate takes as input a gene expression matrix stored in a Seurat object and a gating model (GM), consisting of a set of marker genes that define the cell population of interest. It evaluates the strength of signature marker expression in each cell using the rank-based method UCell, and then performs kNN smoothing by calculating the mean UCell score across neighboring cells. kNN smoothing aims to compensate for the large degree of sparsity in scRNA-seq data. Finally, a universal threshold over kNN-smoothed signature scores is applied in binary decision trees generated from the user-provided gating model, to annotate cells as either "pure" or "impure" with respect to the cell population of interest.
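The kNN-smoothing step can be illustrated in a few lines of base R (a conceptual sketch, not scGate's implementation): each cell's signature score is replaced by the mean score over its k nearest neighbors in a low-dimensional embedding.

    # Conceptual kNN smoothing of per-cell signature scores.
    set.seed(1)
    emb    <- matrix(rnorm(200 * 2), ncol = 2)   # e.g. 2 embedding dimensions, 200 cells
    score  <- runif(200)                         # per-cell UCell-like scores
    k      <- 10
    d      <- as.matrix(dist(emb))               # pairwise cell distances
    smooth <- apply(d, 1, function(di) mean(score[order(di)[1:k]]))  # includes the cell itself
    head(smooth)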
Variable/feature selection in high or ultra-high dimensional settings has gained a lot of attention recently, especially in cancer genomic studies. This package provides a Bayesian approach to tackle this problem, exploiting a mixture of point masses at zero and nonlocal priors to improve the performance of variable selection and coefficient estimation. Product moment (pMOM) and product inverse moment (piMOM) nonlocal priors are implemented and can be used for the analyses. The package performs variable selection for binary response and survival time response datasets, which are widely used in the biostatistics and bioinformatics communities. Benefiting from parallel computing, it reports the key outputs of Bayesian variable selection, such as the Highest Posterior Probability Model (HPPM), the Median Probability Model (MPM), and the posterior inclusion probability for each of the covariates in the model. The option to use Bayesian Model Averaging (BMA) is also part of this package and can be exploited for predictive power measurements in real datasets.
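What makes these priors "nonlocal" is that their density vanishes at zero, so negligible coefficients are actively penalized. The base-R sketch below plots first-order pMOM and piMOM densities using the standard forms from Johnson and Rossell's work, written from memory here, so verify against the package documentation before relying on them:

    # First-order (r = 1) nonlocal prior densities with scale tau; both are
    # exactly zero at beta = 0, unlike a normal ("local") prior.
    tau   <- 1
    beta  <- seq(-4, 4, length.out = 400)
    pmom  <- (beta^2 / tau) * dnorm(beta, 0, sqrt(tau))       # product moment
    pimom <- sqrt(tau / pi) * beta^-2 * exp(-tau / beta^2)    # product inverse moment
    matplot(beta, cbind(pmom, pimom), type = "l", lty = 1,
            xlab = expression(beta), ylab = "density")
    legend("topright", legend = c("pMOM (r = 1)", "piMOM (r = 1)"),
           col = 1:2, lty = 1)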