Interpreting the differences between mean scale scores across various forms of an assessment can be challenging. This difficulty arises from different mappings between raw scores and scale scores, complex mathematical relationships, adjustments based on judgmental procedures, and diverse equating functions applied to different assessment forms. An alternative method involves running simulations to explore the effect of incrementing raw scores on mean scale scores. The idmact package provides an implementation of this approach based on the algorithm detailed in Schiel (1998) <https://www.act.org/content/dam/act/unsecured/documents/ACT_RR98-01.pdf>, which was developed to help interpret differences between mean scale scores on the American College Testing (ACT) assessment. The function idmact_subj() within the package offers a framework for running simulations on subject-level scores. In contrast, the idmact_comp() function provides a framework for conducting simulations on composite scores.
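For illustration, a minimal base-R sketch of the simulation idea (not the package's exact interface): increment each raw score by one point, convert both the raw and the adjusted scores to scale scores through a lookup map, and compare the resulting means. The conversion map and scores below are invented for the example.

    ## Hypothetical raw-to-scale conversion map and observed raw scores.
    map_raw   <- 0:10
    map_scale <- c(1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8)
    raw       <- c(3, 5, 5, 8, 10)

    raw_to_scale <- function(r) map_scale[match(r, map_raw)]

    adj <- pmin(raw + 1, max(map_raw))  # add one raw point, capped at the maximum
    mean(raw_to_scale(adj)) - mean(raw_to_scale(raw))  # change in mean scale score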
This package provides a graphical user interface (GUI) for calculating the rainfall-runoff relation using the Natural Resources Conservation Service Curve Number method (NRCS-CN method), including the modifications to the initial abstraction proposed by Hawkins et al. (2002). The GUI follows the programming logic of previously published software (Hernandez-Guzman et al., 2011) <doi:10.1016/j.envsoft.2011.07.006>. It is a raster-based GIS tool that outputs runoff estimates from land use/land cover and hydrologic soil group maps. The package has been published in the Journal of Hydroinformatics (Hernandez-Guzman et al., 2021) <doi:10.2166/hydro.2020.087> and is under continuous development at the Institute for Research on Natural Resources (INIRENA) of the Universidad Michoacana de San Nicolas de Hidalgo; it represents a collaborative effort between the Hydro-Geomatic Lab (INIRENA) and the Environmental Management Lab (CIAD, A.C.).
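The curve-number arithmetic behind the tool can be sketched in a few lines of base R (the package itself operates on raster maps through the GUI). Note that the full Hawkins et al. (2002) procedure also re-scales the retention parameter when changing the initial abstraction ratio; this sketch only exposes lambda.

    ## Direct runoff Q (mm) from storm depth P (mm) and curve number CN.
    cn_runoff <- function(P, CN, lambda = 0.05) {  # lambda = 0.2 in the classical method
      S  <- 25400 / CN - 254              # potential maximum retention (mm)
      Ia <- lambda * S                    # initial abstraction (mm)
      ifelse(P > Ia, (P - Ia)^2 / (P - Ia + S), 0)
    }

    cn_runoff(P = 60, CN = 75)            # e.g. a 60 mm storm on a CN 75 surface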
This package provides post-estimation methods for categorical response models (CRMs). Inputs from objects of class serp(), clm(), polr(), multinom(), mlogit(), vglm() and glm() are currently supported. Available tests include the Hosmer-Lemeshow tests for binary, multinomial and ordinal logistic regression, and the Lipsitz and Pulkstenis-Robinson tests for ordinal models. The proportional odds, adjacent-category, and constrained continuation-ratio models are specifically supported at the ordinal level. Tests of the proportional odds assumption in ordinal models are also possible with the Brant and likelihood-ratio tests. Moreover, several summary measures of predictive strength (pseudo R-squared) and useful error metrics, including the Brier score, misclassification rate and log-loss, are available for binary, multinomial and ordinal models. Ugba, E. R. and Gertheiss, J. (2018) <http://www.statmod.org/workshops_archive_proceedings_2018.html>.
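A sketch of how such tests are typically applied: fit an ordinal model with MASS::polr() (a supported class), then pass the fit to the test functions. The test-function names below are assumptions about this package's interface and should be checked against its index.

    library(MASS)
    ## Proportional-odds model on the housing data shipped with MASS.
    fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

    # hosmerlem(fit, group = 10)  # assumed name: ordinal Hosmer-Lemeshow test
    # brant.test(fit)             # assumed name: test of the proportional odds assumption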
This natural language processing toolkit provides language-agnostic tokenization, part-of-speech tagging, lemmatization and dependency parsing of raw text. Next to text parsing, the package also allows you to train annotation models based on treebank data in CoNLL-U format, as provided at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionality for common manipulations of texts enriched with the parser output: collocations, token co-occurrence, document-term matrix handling, term frequency-inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.
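A typical annotation workflow looks as follows (model download, loading, annotation, and conversion of the CoNLL-U output to a data frame):

    library(udpipe)
    model <- udpipe_download_model(language = "english")
    ud    <- udpipe_load_model(model$file_model)
    x     <- udpipe_annotate(ud, x = "The toolkit tokenizes, tags and parses raw text.")
    head(as.data.frame(x)[, c("token", "lemma", "upos", "dep_rel")])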
Learning, manipulation and evaluation of mixtures of truncated basis functions (MoTBFs), which include mixtures of polynomials (MOPs) and mixtures of truncated exponentials (MTEs). MoTBFs are a flexible framework for modelling hybrid Bayesian networks (I. Pérez-Bernabé, A. Salmerón, H. Langseth (2015) <doi:10.1007/978-3-319-20807-7_36>; H. Langseth, T.D. Nielsen, I. Pérez-Bernabé, A. Salmerón (2014) <doi:10.1016/j.ijar.2013.09.012>; I. Pérez-Bernabé, A. Fernández, R. Rumí, A. Salmerón (2016) <doi:10.1007/s10618-015-0429-7>). The package provides functionality for learning univariate, multivariate and conditional densities, with the possibility of incorporating prior knowledge. Structural learning of hybrid Bayesian networks is also provided. A set of useful tools is provided, including plotting, printing and likelihood evaluation. This package makes use of S3 objects, with two new classes called 'motbf' and 'jointmotbf'.
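A minimal sketch of fitting a univariate density; univMoTBF() and its POTENTIAL_TYPE argument follow my reading of the package documentation and should be verified against the help pages.

    library(MoTBFs)
    set.seed(1)
    x   <- rnorm(500, mean = 2)                 # toy continuous data
    fit <- univMoTBF(x, POTENTIAL_TYPE = "MOP") # fit a mixture of polynomials
    print(fit)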
Functional enrichment analysis methods such as gene set enrichment analysis (GSEA) have been widely used for analyzing gene expression data. GSEA is a powerful method to infer results of gene expression data at the level of gene sets by calculating enrichment scores for predefined sets of genes. GSEA depends on the availability and accuracy of gene sets. There are overlaps between terms of gene sets or categories because multiple terms may exist for a single biological process, which can lead to redundancy within enriched terms; in other words, sets of related terms overlap. Using deep learning, this package aims to predict enrichment scores for unique tokens or words from the text in gene set names, in order to resolve this overlapping-set issue. Furthermore, one can coin a new term by combining tokens and find its enrichment score by predicting the score for the combined tokens.
Power and sample size calculations for genetic association studies allowing for misspecification of the model of genetic susceptibility (Hum Hered. 2019;84(6):256-271, <doi:10.1159/000508558>, Epub 2020 Jul 28). Power and/or sample size can be calculated for logistic (case/control study design) and linear (continuous phenotype) regression models, using additive, dominant, recessive or degree-of-freedom coding of the genetic covariate while assuming a true dominant, recessive or additive genetic effect. In addition, power and sample size calculations can be performed for gene-by-environment interactions. These methods extend Gauderman (2002) <doi:10.1093/aje/155.5.478> and Gauderman (2002) <doi:10.1002/sim.973> and are described in: Moore CM, Jacobson S, Fingerlin TE. Power and Sample Size Calculations for Genetic Association Studies in the Presence of Genetic Model Misspecification. American Society of Human Genetics. October 2018, San Diego.
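A sketch of a power calculation; the argument names follow my reading of genpwr.calc() and should be verified with the package help.

    library(genpwr)
    ## Power for a case/control study of N = 2000 under a true additive model,
    ## testing with all genetic models.
    pw <- genpwr.calc(calc = "power", model = "logistic",
                      N = 2000, Case.Rate = 0.3, MAF = 0.2,
                      OR = 1.5, Alpha = 0.05,
                      True.Model = "Additive", Test.Model = "All")
    head(pw)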
The saemix package implements the Stochastic Approximation EM (SAEM) algorithm for parameter estimation in (non)linear mixed effects models. The SAEM algorithm (i) computes the maximum likelihood estimator of the population parameters, without any approximation of the model (linearisation, quadrature approximation, ...), (ii) provides standard errors for the maximum likelihood estimator, and (iii) estimates the conditional modes, conditional means and conditional standard deviations of the individual parameters, using the Hastings-Metropolis algorithm (see Comets et al. (2017) <doi:10.18637/jss.v080.i03>). Many applications of SAEM in agronomy, animal breeding and PKPD analysis have been published by members of the Monolix group. The full PDF documentation for the package, including references about the algorithm and examples, can be downloaded from the GitHub repository of the IAME research institute: <https://github.com/iame-researchCenter/saemix/blob/7638e1b09ccb01cdff173068e01c266e906f76eb/docsaem.pdf>.
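A compact version of the one-compartment pharmacokinetic example from the package documentation; the column names assume the theophylline dataset shipped with saemix.

    library(saemix)
    data(theo.saemix)
    smx_data <- saemixData(name.data = theo.saemix, name.group = "Id",
                           name.predictors = c("Dose", "Time"),
                           name.response = "Concentration")
    model1cpt <- function(psi, id, xidep) {
      dose <- xidep[, 1]; tim <- xidep[, 2]
      ka <- psi[id, 1]; V <- psi[id, 2]; CL <- psi[id, 3]
      k  <- CL / V
      dose * ka / (V * (ka - k)) * (exp(-k * tim) - exp(-ka * tim))
    }
    smx_model <- saemixModel(model = model1cpt,
                             psi0 = matrix(c(1, 20, 0.5), ncol = 3,
                                           dimnames = list(NULL, c("ka", "V", "CL"))))
    fit <- saemix(smx_model, smx_data, list(seed = 123))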
This package covers several Bayesian analysis of variance (BANOVA) models used in the analysis of experimental designs in which both within- and between-subjects factors are manipulated. They can be applied to data that are common in the behavioral and social sciences. The package includes hierarchical Bayes ANOVA models with normal, t, binomial (Bernoulli), Poisson, ordered multinomial and multinomial response variables. All models accommodate unobserved heterogeneity by including a normal distribution of the parameters across individuals. Outputs of the package include tables of sums of squares, effect sizes and p-values, and tables of predictions, which are easily interpretable for behavioral and social researchers. Floodlight analysis and mediation analysis based on these models are also provided. BANOVA uses Stan and JAGS as the computational platform. References: Dong and Wedel (2017) <doi:10.18637/jss.v081.i09>; Wedel and Dong (2020) <doi:10.1002/jcpy.1111>.
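A sketch of a call on simulated data; BANOVA.run() and its data/model_name/id arguments reflect my reading of the package interface and should be checked with ?BANOVA.run.

    library(BANOVA)
    set.seed(1)
    df <- data.frame(y    = rnorm(200),
                     cond = factor(rep(c("ctrl", "treat"), each = 100)),
                     id   = factor(rep(1:50, 4)))       # 50 subjects, repeated measures
    fit <- BANOVA.run(y ~ cond, data = df, model_name = "Normal",
                      id = "id", iter = 1000)
    summary(fit)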
This package provides a collection of tools and data for analyzing the Gause microcosm experiments, and for fitting Lotka-Volterra models to time series data. It includes methods for fitting single-species logistic growth and multi-species interaction models, e.g. of competition, predator/prey relationships, or mutualism. See the documentation of individual functions for examples. In general, see the lv_optim() function for examples of how to fit parameter values in multi-species systems. The general methods applied here, as well as the form of the differential equations used, are described in detail in the Quantitative Ecology textbook by Lehman et al., available at <http://hdl.handle.net/11299/204551>, and in Lina K. Mühlbauer, Maximilienne Schulze, W. Stanley Harpole, and Adam T. Clark, 'gauseR': Simple methods for fitting Lotka-Volterra models describing Gause's Struggle for Existence, in the journal Ecology and Evolution.
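The form of the fitted equations can be made concrete with deSolve: two-species Lotka-Volterra competition with per-capita growth linear in both densities, as in the textbook cited above. The parameter values here are invented; this illustrates the model, not the package's fitting API.

    library(deSolve)
    lv <- function(t, state, parms) {
      with(as.list(c(state, parms)), {
        dN1 <- r1 * N1 * (1 + a11 * N1 + a12 * N2)  # negative a's => competition
        dN2 <- r2 * N2 * (1 + a22 * N2 + a21 * N1)
        list(c(dN1, dN2))
      })
    }
    out <- ode(y = c(N1 = 2, N2 = 2), times = seq(0, 100, 1), func = lv,
               parms = c(r1 = 0.2, r2 = 0.15, a11 = -0.002, a12 = -0.001,
                         a21 = -0.001, a22 = -0.002))
    matplot(out[, 1], out[, -1], type = "l", xlab = "time", ylab = "density")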
This package provides a collection of clinical trial designs and methods, implemented in rstan and R, including: the Continual Reassessment Method by O'Quigley et al. (1990) <doi:10.2307/2531628>; EffTox by Thall & Cook (2004) <doi:10.1111/j.0006-341X.2004.00218.x>; the two-parameter logistic method of Neuenschwander, Branson & Gsponer (2008) <doi:10.1002/sim.3230>; the Augmented Binary method by Wason & Seaman (2013) <doi:10.1002/sim.5867>; and more. We provide functions to aid model-fitting and analysis. The rstan implementations may also serve as a cookbook for anyone looking to extend or embellish these models. We hope that this package encourages the use of Bayesian methods in clinical trials. There is a preponderance of early-phase trial designs because this is where Bayesian methods are used most. If there is a method you would like implemented, please get in touch.
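A sketch of a CRM dose-finding analysis; stan_crm() and its arguments follow the package interface as I recall it (outcome_str encodes cohorts as a dose level plus N/T outcomes; beta_sd is the prior standard deviation for the empiric model), and should be verified against the package documentation.

    library(trialr)
    ## Three cohorts: dose 1 (no tox), dose 2 (one tox), dose 2 (no tox).
    fit <- stan_crm(outcome_str = "1NNN 2NTN 2NNN",
                    skeleton = c(0.1, 0.2, 0.35, 0.5), target = 0.25,
                    model = "empiric", beta_sd = sqrt(1.34), seed = 123)
    fit$recommended_dose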
This package provides a simple approach to using probit or logit analysis to calculate lethal concentrations (LC) or lethal times (LT) and the fiducial confidence limits for selected LC or LT values in ecotoxicology studies (Finney 1971; Wheeler et al. 2006; Robertson et al. 2007). The simplicity of ecotox comes from its syntax, which mirrors that of familiar functions such as glm() and lm(). In addition, a comprehensive data frame is produced that gives the user the predicted LC or LT value for the desired level together with a suite of important parameters such as the fiducial confidence limits and slope. Finney, D.J. (1971, ISBN: 052108041X); Wheeler, M.W., Park, R.M., and Bailer, A.J. (2006) <doi:10.1897/05-320R.1>; Robertson, J.L., Savin, N.E., Russell, R.M., and Preisler, H.K. (2007, ISBN: 0849323312).
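A sketch of an LC estimation; the LC_probit() formula/weights interface follows the package examples as I recall them, and the dose-response counts are invented.

    library(ecotox)
    tox <- data.frame(dose     = c(1, 2, 4, 8, 16),
                      response = c(1, 4, 9, 16, 19),   # organisms responding
                      total    = rep(20, 5))           # organisms exposed
    LC_probit((response / total) ~ log10(dose),
              p = c(50, 99), weights = total, data = tox)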
The Gene Ontology (GO) Consortium <https://geneontology.org/> organizes genes into hierarchical categories based on biological process (BP), molecular function (MF) and cellular component (CC, i.e., subcellular localization). Tools such as GoMiner (see Zeeberg, B.R., Feng, W., Wang, G. et al. (2003) <doi:10.1186/gb-2003-4-4-r28>) can leverage GO to perform ontological analysis of microarray and proteomics studies, typically generating a list of significant functional categories. Microarray studies are usually analyzed with BP, whereas proteomics researchers often prefer CC. To capture the benefit of both of those ontologies, I developed a two-dimensional version of High-Throughput GoMiner ('HTGM2D'). I generate a 2D heat map whose axes are any two of BP, MF, or CC, and the value within a picture element of the heat map reflects the Jaccard metric p-value for the number of genes in common for the corresponding pair.
This package creates realistic random trajectories in 3-D space between two given fixed points, so-called conditional empirical random walks (CERWs). The trajectory generation is based on empirical distribution functions extracted from observed trajectories (training data) and thus reflects the geometrical movement characteristics of the mover. A digital elevation model (DEM), representing the Earth's surface, and a background layer of probabilities (e.g. food sources, uplift potential, waterbodies, etc.) can be used to influence the trajectories. Unterfinger M (2018). "3-D Trajectory Simulation in Movement Ecology: Conditional Empirical Random Walk". Master's thesis, University of Zurich. <https://www.geo.uzh.ch/dam/jcr:6194e41e-055c-4635-9807-53c5a54a3be7/MasterThesis_Unterfinger_2018.pdf>. Technitis G, Weibel R, Kranstauber B, Safi K (2016). "An algorithm for empirically informed random trajectory generation between two endpoints". GIScience 2016: Ninth International Conference on Geographic Information Science, 9, online. <doi:10.5167/uzh-130652>.
This package provides a toolkit that allows scientists to work with data from single-cell sequencing technologies such as scRNA-seq, scVDJ-seq, scATAC-seq, CITE-Seq and Spatial Transcriptomics (ST). The Single (i) Cell R package ('iCellR') provides unprecedented flexibility at every step of the analysis pipeline, including normalization, clustering, dimensionality reduction, imputation, visualization, and so on. Users can design both unsupervised and supervised models to best suit their research. In addition, the toolkit provides 2D and 3D interactive visualizations, differential expression analysis, filters based on cells, genes and clusters, data merging, normalization for dropouts, data imputation methods, correction for batch differences, pathway analysis, tools to find marker genes for clusters and conditions, cell type prediction and pseudotime analysis. See Khodadadi-Jamayran et al. (2020) <doi:10.1101/2020.05.05.078550> and Khodadadi-Jamayran et al. (2020) <doi:10.1101/2020.03.31.019109> for more details.
This package provides a weather generator to simulate precipitation and temperature for regions with seasonality. Users input training data containing precipitation, temperature, and seasonality (up to 26 seasons). Including weather season as a training variable allows users to explore the effects of potential changes in season duration as well as average start- and end-time dates due to phenomena like climate change. Data for training should be a single time series but can originate from station data, basin averages, grid cells, etc. Bearup, L., Gangopadhyay, S., & Mikkelson, K. (2021). "Hydroclimate Analysis Lower Santa Cruz River Basin Study (Technical Memorandum No ENV-2020-056)." Bureau of Reclamation. Gangopadhyay, S., Bearup, L. A., Verdin, A., Pruitt, T., Halper, E., & Shamir, E. (2019, December 1). "A collaborative stochastic weather generator for climate impacts assessment in the Lower Santa Cruz River Basin, Arizona." Fall Meeting 2019, American Geophysical Union. <https://ui.adsabs.harvard.edu/abs/2019AGUFMGC41G1267G>.
This package performs iterative extrapolation of species haplotype accumulation curves using a nonparametric stochastic (Monte Carlo) optimization method for assessing specimen sampling completeness, based on the approach of Phillips et al. (2015) <doi:10.1515/dna-2015-0008>, Phillips et al. (2019) <doi:10.1002/ece3.4757> and Phillips et al. (2020) <doi:10.7717/peerj-cs.243>. HACSim outputs a number of useful summary statistics of sampling coverage ("Measures of Sampling Closeness"), including an estimate of the likely required sample size (along with confidence intervals at the desired level) necessary to recover a given number/proportion of the observed unique species haplotypes. Any genomic marker can be targeted to assess likely required specimen sample sizes for genetic diversity assessment. The method is particularly well suited to assessing sampling sufficiency for DNA barcoding initiatives. Users can also simulate their own DNA sequences according to various models of nucleotide substitution. A Shiny app is also available.
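A sketch of a run for a hypothetical species; HACHypothetical() and HAC.simrep() follow the package paper as I recall it, so the argument names should be verified against the documentation.

    library(HACSim)
    ## N specimens, Hstar = 10 observed haplotypes with equal frequencies.
    obj <- HACHypothetical(N = 100, Hstar = 10, probs = rep(1 / 10, 10))
    HAC.simrep(obj)  # iterate until the default haplotype-recovery target is met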
This package provides a workflow for using the EaSIeR tool, developed to assess patients' likelihood of responding to ICB therapies given just the patients' RNA-seq data as input. We integrate RNA-seq data with different types of prior knowledge to extract quantitative descriptors of the tumor microenvironment from several points of view, including composition of the immune repertoire and activity of intra- and extra-cellular communications. Then, we use multi-task machine learning trained on TCGA data to identify how these descriptors can simultaneously predict several state-of-the-art hallmarks of anti-cancer immune response. In this way we derive cancer-specific models and identify cancer-specific systems biomarkers of immune response. These biomarkers have been experimentally validated in the literature, and the performance of EaSIeR predictions has been validated using independent datasets from four different cancer types with patients treated with anti-PD1 or anti-PDL1 therapy.
Incorporates Approximate Bayesian Computation to obtain a posterior distribution and to select an optimal model parameter for an observation point. Additionally, a meta-sampling heuristic algorithm is implemented for parameter estimation, which requires no model runs and is dimension-independent. A sampling scheme is also presented that allows model runs and uses the meta-sampling for point generation. A predictor for the model output is likewise implemented as meta-sampling. All the algorithms leverage a machine learning method utilizing the maxima weighted Isolation Kernel approach, or 'MaxWiK'. The method involves transforming raw data to a Hilbert space (mapping) and measuring the similarity between simulated points and the maxima weighted Isolation Kernel mapping corresponding to the observation point. Comprehensive details of the methodology can be found in the papers Iurii Nagornov (2024) <doi:10.1007/978-3-031-66431-1_16> and Iurii Nagornov (2023) <doi:10.1007/978-3-031-29168-5_18>.
Pathway analysis statistically links observations at the molecular level to biological processes or pathways at the systems (i.e., organism, organ, tissue, cell) level. Traditionally, pathway analysis methods regard pathways as collections of single genes and treat all genes in a pathway as equally informative. However, this can lead to identifying spurious pathways as statistically significant, since components are often shared amongst pathways. SIGORA seeks to avoid this pitfall by focusing on genes or gene pairs that are (as a combination) specific to a single pathway. In relying on such pathway gene-pair signatures (Pathway-GPS), SIGORA inherently uses the status of other genes in the experimental context to identify the most relevant pathways. The current version allows for pathway analysis of human and mouse datasets. In addition, it contains pre-computed Pathway-GPS data for pathways in the KEGG and Reactome pathway repositories, as well as mechanisms for extracting GPS for user-supplied repositories.
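A sketch of a signature-based over-representation analysis; the dataset name kegH (precomputed human KEGG GPS data) and the helper genesFromRandomPathways() follow the package vignette as I recall it, so both should be checked against the package index.

    library(sigora)
    data(kegH)                                          # KEGG Pathway-GPS, human
    a1  <- genesFromRandomPathways(GPSrepo = kegH, np = 3, ng = 50)  # toy query list
    res <- sigora(GPSrepo = kegH, queryList = a1$genes, level = 4)
    head(res$summary_results)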
This package implements the conditionally symmetric multidimensional Gaussian mixture model (csmGmm) for large-scale testing of composite null hypotheses in genetic association applications such as mediation analysis, pleiotropy analysis, and replication analysis. In such analyses, we typically have J sets of K test statistics where K is a small number (e.g. 2 or 3) and J is large (e.g. 1 million). For each one of the J sets, we want to know if we can reject all K individual nulls. Please see the vignette for a quickstart guide. The paper describing these methods is "Testing a Large Number of Composite Null Hypotheses Using Conditionally Symmetric Multidimensional Gaussian Mixtures in Genome-Wide Studies" by Sun R, McCaw Z, & Lin X (2024, <doi:10.1080/01621459.2024.2422124>). The paper is accepted and published online (but not yet in print) in the Journal of the American Statistical Association as of Dec 1, 2024.
This package provides a high-performance interface to the Global Biodiversity Information Facility ('GBIF'). In contrast to 'rgbif', which can access small subsets of GBIF data through web-based queries to a central server, 'gbifdb' provides enhanced performance for R users performing large-scale analyses on servers and cloud computing providers, with full support for arbitrary SQL or dplyr operations on the complete GBIF data tables (now over 1 billion records, and over a terabyte in size). 'gbifdb' accesses a copy of the GBIF data in parquet format, which is readily available in commercial computing clouds such as the Amazon Open Data portal and the Microsoft Planetary Computer; the data can be queried directly without downloading, or downloaded to any server with suitable bandwidth and storage space. The high-performance techniques for local and remote access are described in <https://duckdb.org/why_duckdb> and <https://arrow.apache.org/docs/r/articles/fs.html>, respectively.
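Typical usage: open the remote parquet snapshot and push dplyr verbs down to it, so that only collect() transfers results into R. The column names assume the standard GBIF occurrence table.

    library(gbifdb)
    library(dplyr)
    gbif <- gbif_remote()                 # AWS Open Data copy; nothing is downloaded
    gbif |>
      filter(phylum == "Chordata", year > 1990) |>
      count(class, year) |>
      collect()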
The GenSVM classifier is a generalized multiclass support vector machine (SVM). This classifier aims to find decision boundaries that separate the classes with as wide a margin as possible. In GenSVM, the loss function is very flexible in the way that misclassifications are penalized. This allows the user to tune the classifier to the dataset at hand and potentially obtain higher classification accuracy than alternative multiclass SVMs. Moreover, this flexibility means that GenSVM has a number of other multiclass SVMs as special cases. One of the other advantages of GenSVM is that it is trained in the primal space, allowing the use of warm starts during optimization. This means that for common tasks such as cross validation or repeated model fitting, GenSVM can be trained very quickly. Based on: G.J.J. van den Burg and P.J.F. Groenen (2018) <https://www.jmlr.org/papers/v17/14-526.html>.
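A sketch of fitting the classifier on a feature matrix and label vector; gensvm() and its predict() method follow the package interface as I recall it.

    library(gensvm)
    set.seed(1)
    idx  <- sample(nrow(iris), 100)                    # train/test split
    fit  <- gensvm(as.matrix(iris[idx, 1:4]), iris[idx, 5])
    pred <- predict(fit, as.matrix(iris[-idx, 1:4]))
    mean(pred == iris[-idx, 5])                        # held-out accuracy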
Offers a general framework of multivariate mixed-effects models for the joint analysis of multiple correlated outcomes with clustered data structures and potential missingness, proposed by Wang et al. (2018) <doi:10.1093/biostatistics/kxy022>. The missingness of outcome values may depend on the values themselves (missing not at random and non-ignorable), or may depend only on the covariates (missing at random and ignorable), or both. This package provides functions for two models: 1) mvMISE_b() allows correlated outcome-specific random intercepts with a factor-analytic structure, and 2) mvMISE_e() allows correlated outcome-specific error terms with a graphical lasso penalty on the error precision matrix. Both functions are motivated by multivariate analysis of data with clustered structures from labelling-based quantitative proteomic studies. These models and functions can also be applied to univariate and multivariate analyses of clustered data with balanced or unbalanced design and no missingness.
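A sketch of the expected inputs; the argument names (Y, X, id) are assumptions about the mvMISE_b()/mvMISE_e() interface, so the calls are left commented and should be checked against the help pages.

    library(mvMISE)
    set.seed(1)
    n <- 60; K <- 4
    id <- rep(1:20, each = 3)                 # cluster labels
    X  <- cbind(1, rnorm(n))                  # covariates
    Y  <- matrix(rnorm(n * K), n, K)          # K correlated outcomes per sample
    Y[sample(length(Y), 30)] <- NA            # illustrative missing values
    # fit_b <- mvMISE_b(Y = Y, X = X, id = id)  # assumed call: correlated random intercepts
    # fit_e <- mvMISE_e(Y = Y, X = X, id = id)  # assumed call: correlated error terms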