Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical for the identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization.
The sqldf function is typically passed a single argument which is an SQL select statement where the table names are ordinary R data frame names. sqldf transparently sets up a database, imports the data frames into that database, performs the SQL statement and returns the result using a heuristic to determine which class to assign to each column of the returned data frame. The sqldf or read.csv.sql functions can also be used to read filtered files into R even if the original files are larger than R itself can handle.
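The workflow described above (load tabular data into a temporary database, run a select statement, get rows back) can be illustrated outside R. The following is a minimal Python sketch using the standard-library sqlite3 module. It is not the sqldf API: the helper name query_rows, the table name df, and the sample data are hypothetical and chosen only for illustration.

```python
import sqlite3

def query_rows(rows, columns, sql, table="df"):
    """Load rows into an in-memory SQLite table and run one SELECT on it.

    Hypothetical helper: it mimics the idea of querying in-memory tabular
    data with SQL, not the behaviour of any sqldf function.
    """
    con = sqlite3.connect(":memory:")  # temporary database, discarded on close
    try:
        col_defs = ", ".join(f'"{c}"' for c in columns)
        con.execute(f'CREATE TABLE "{table}" ({col_defs})')
        placeholders = ", ".join("?" for _ in columns)
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
        return con.execute(sql).fetchall()
    finally:
        con.close()

# Illustrative data: a group label and a numeric measurement.
rows = [("a", 1.5), ("b", 3.0), ("a", 2.5)]
result = query_rows(
    rows, ["grp", "x"],
    "SELECT grp, AVG(x) FROM df GROUP BY grp ORDER BY grp",
)
print(result)
```

The database lives only for the duration of the call, which mirrors the "transparent" setup and teardown the description mentions.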
This package provides functions to specify, fit and visualize nested partially-latent class models (Wu, Deloria-Knoll, Hammitt, and Zeger (2016) <doi:10.1111/rssc.12101>; Wu, Deloria-Knoll, and Zeger (2017) <doi:10.1093/biostatistics/kxw037>; Wu and Chen (2021) <doi:10.1002/sim.8804>) for inference of population disease etiology and individual diagnosis. In the motivating Pneumonia Etiology Research for Child Health (PERCH) study, because both quantities of interest sum to one hundred percent, the PERCH scientists frequently refer to them as population etiology pie and individual etiology pie, hence the name of the package.
Analysis and visualization of dropout between conditions in surveys and (online) experiments. Features include computation of dropout statistics, comparing dropout between conditions (e.g. Chi square), analyzing survival (e.g. Kaplan-Meier estimation), comparing conditions with the most different rates of dropout (Kolmogorov-Smirnov) and visualizing the result of each in designated plotting functions. Sources: Andrea Frick, Marie-Terese Baechtiger & Ulf-Dietrich Reips (2001) <https://www.researchgate.net/publication/223956222_Financial_incentives_personal_information_and_drop-out_in_online_studies>; Ulf-Dietrich Reips (2002) "Standards for Internet-Based Experimenting" <doi:10.1027//1618-3169.49.4.243>.
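As an illustration of the survival analysis mentioned above, the following Python sketch computes a Kaplan-Meier estimate of the probability of remaining in a study past each item. It is a generic textbook implementation, not the package's own function; the name kaplan_meier, the argument layout, and the sample data are assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier curve as a list of (time, survival) pairs.

    times:  last item each participant reached (hypothetical encoding)
    events: True if the participant dropped out at that item,
            False if censored (e.g. finished the study)
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        dropouts = 0
        seen = 0
        # Collect everyone whose record ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            dropouts += 1 if pairs[i][1] else 0
            seen += 1
            i += 1
        if dropouts > 0:
            surv *= 1 - dropouts / at_risk
            curve.append((t, surv))
        at_risk -= seen  # dropouts and censored cases both leave the risk set
    return curve

# Five participants: two drop out (items 1 and 2), three are censored.
curve = kaplan_meier([1, 2, 2, 3, 3], [True, True, False, False, False])
print(curve)
```

Computing one such curve per experimental condition gives the quantities that a between-condition comparison would then be based on.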
The multiple contrast tests for univariate functional data were proposed by Munko, Ditzhaus, Pauly, Smaga, and Zhang (2023) <doi:10.48550/arXiv.2306.15259>. Recently, they were extended to multivariate functional data in Munko, Ditzhaus, Pauly, and Smaga (2024) <doi:10.48550/arXiv.2406.01242>. These procedures enable us to evaluate the overall hypothesis regarding equality, as well as specific hypotheses defined by contrasts. In particular, we can perform post hoc tests to examine particular comparisons of interest. Different experimental designs are supported, e.g., one-way and multi-way analysis of variance for functional data.
Power logit regression models for bounded continuous data, in which the density generator may be normal, Student-t, power exponential, slash, hyperbolic, sinh-normal, or type II logistic. Diagnostic tools associated with the fitted model, such as the residuals, local influence measures, leverage measures, and goodness-of-fit statistics, are implemented. The estimation process follows the maximum likelihood approach and, currently, the package supports two types of estimators: the usual maximum likelihood estimator and the penalized maximum likelihood estimator. More details about power logit regression models are described in Queiroz and Ferrari (2022) <arXiv:2202.01697>.
When multiple alternative treatments or interventions are available, different population groups may respond differently to different treatments. This package implements a method that discovers the population subgroups in which a certain treatment has a better effect than the other alternative treatments. This is done by first estimating the treatment effect for a given treatment and its uncertainty by computing random forests. The resulting model is then summarized by a decision tree in which the probability that the given treatment is best for a given subgroup is shown in the corresponding terminal node of the tree.
This package provides functions for Bayesian Predictive Stacking within the Bayesian transfer learning framework for geospatial artificial systems, as introduced in "Bayesian Transfer Learning for Artificially Intelligent Geospatial Systems: A Predictive Stacking Approach" (Presicce and Banerjee, 2024) <doi:10.48550/arXiv.2410.09504>. This methodology enables efficient Bayesian geostatistical modeling, utilizing predictive stacking to improve inference across spatial datasets. The core functions leverage C++ for high-performance computation, making the framework well-suited for large-scale spatial data analysis in parallel and distributed computing environments. Designed for scalability, it allows seamless application in computationally demanding scenarios.
An efficient tool for fitting the nested common and shared atoms models using variational Bayes approximate inference for fast computation. Specifically, the package implements the common atoms model (Denti et al., 2023), its finite version (D'Angelo et al., 2023), and a hybrid finite-infinite model. All models use Gaussian mixtures with a normal-inverse-gamma prior distribution on the parameters. Additional functions are provided to help analyze the results of the fitting procedure. References: Denti, Camerlenghi, Guindani, Mira (2023) <doi:10.1080/01621459.2021.1933499>, D'Angelo, Canale, Yu, Guindani (2023) <doi:10.1111/biom.13626>.
Fits generalized additive models (GAMs) using a variational approximations (VA) framework. In brief, the VA framework provides a fully or at least close to fully tractable lower bound approximation to the marginal likelihood of a GAM when it is parameterized as a mixed model (using penalized splines, say). In doing so, the VA framework aims to offer both the stability and natural inference tools available in the mixed model approach to GAMs, while achieving computation times comparable to those of the penalized likelihood approach to GAMs. See Hui et al. (2018) <doi:10.1080/01621459.2018.1518235>.
Single cell Higher Order Testing (scHOT) is an R package that facilitates testing changes in higher order structure of gene expression either along a developmental trajectory or across space. scHOT is general and modular in nature: it can be run in multiple data contexts, such as along a continuous trajectory, between discrete groups, and over spatial orientations, and it can accommodate any higher order measurement such as variability or correlation. scHOT meaningfully adds to first order effect testing, such as differential expression, and provides a framework for interrogating higher order interactions from single cell data.
Radare2 is a complete framework for reverse-engineering, debugging, and analyzing binaries. It is composed of a set of small utilities that can be used together or independently from the command line.
Radare2 is built around a scriptable disassembler and hexadecimal editor that support a variety of executable formats for different processors and operating systems, through multiple back ends for local and remote files and disk images.
It can also compare (diff) binaries with graphs and extract information like relocation symbols. It is able to deal with malformed binaries, making it suitable for security research and analysis.
Designed for studies where animals tagged with acoustic tags are expected to move through receiver arrays. This package combines the advantages of automatic sorting and checking of animal movements with the possibility for user intervention on tags that deviate from expected behaviour. The three analysis functions (explore(), migration() and residency()) allow the users to analyse their data in a systematic way, making it easy to compare results from different studies. CJS calculations are based on Perry et al. (2012) <https://www.researchgate.net/publication/256443823_Using_mark-recapture_models_to_estimate_survival_from_telemetry_data>.
The Pritchard-Stephens-Donnelly (PSD) admixture model has k intermediate subpopulations from which n individuals draw their alleles dictated by their individual-specific admixture proportions. The BN-PSD model additionally imposes the Balding-Nichols (BN) allele frequency model to the intermediate populations, which therefore evolved independently from a common ancestral population T with subpopulation-specific FST (Wright's fixation index) parameters. The BN-PSD model can be used to yield complex population structures. This simulation approach is now extended to subpopulations related by a tree. Method described in Ochoa and Storey (2021) <doi:10.1371/journal.pgen.1009241>.
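The generative steps of the BN-PSD model can be sketched in a few lines of Python. This is a minimal standard-library illustration under stated assumptions, not the package's implementation: the function name simulate_bnpsd, the uniform range for ancestral allele frequencies, and the example inputs are all hypothetical. The Balding-Nichols step draws each subpopulation frequency from a Beta distribution with mean equal to the ancestral frequency p and parameters p(1-F)/F and (1-p)(1-F)/F, where F is that subpopulation's FST.

```python
import random

def simulate_bnpsd(admix, fst, n_loci, seed=0):
    """Simulate genotypes (0, 1 or 2) under a basic BN-PSD model.

    admix: n rows of k admixture proportions, each row summing to 1
    fst:   k subpopulation-specific FST values, each in (0, 1)
    Returns a list of n_loci rows, each holding n genotypes.
    """
    rng = random.Random(seed)
    genotypes = []
    for _ in range(n_loci):
        # Ancestral allele frequency; the range is an illustrative assumption.
        p_anc = rng.uniform(0.05, 0.95)
        # Balding-Nichols: one independent Beta draw per subpopulation.
        p_sub = [
            rng.betavariate(p_anc * (1 - f) / f, (1 - p_anc) * (1 - f) / f)
            for f in fst
        ]
        row = []
        for q in admix:
            # PSD: individual-specific frequency is the admixture-weighted mean.
            pi = sum(qk * pk for qk, pk in zip(q, p_sub))
            # Genotype: number of alternative alleles in two independent draws.
            row.append(sum(rng.random() < pi for _ in range(2)))
        genotypes.append(row)
    return genotypes

# Three individuals, two intermediate subpopulations.
admix = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
geno = simulate_bnpsd(admix, fst=[0.1, 0.3], n_loci=4)
print(geno)
```

The tree-structured extension mentioned above is not covered by this sketch, which treats the subpopulations as independent given the ancestral frequency.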
This package provides a tool for extracting information (entities and relations between them) in text datasets. It also emphasizes the results exploration with graphical displays. It is a rule-based system and works with hand-made dictionaries and local grammars defined by users. x.ent uses parsing with Perl functions and JavaScript to define user preferences through a browser and R to display and support analysis of the results extracted. Local grammars are defined and compiled with the tool Unitex, a tool developed by University Paris Est that supports multiple languages. See ?xconfig for an introduction.
Various methods are employed for statistical analysis and graphical presentation of real-time PCR (quantitative PCR or qPCR) data. rtpcr handles amplification efficiency calculation, statistical analysis and graphical representation of real-time PCR data based on up to two reference genes. rtpcr accounts for amplification efficiency values using a general calculation method described by Ganger et al. (2017) <doi:10.1186/s12859-017-1949-5> and Taylor et al. (2019) <doi:10.1016/j.tibtech.2018.12.002>, covering both the Livak and Pfaffl methods. Based on the experimental conditions, the functions of the rtpcr package use t-test (for experiments with a two-level factor), analysis of variance (ANOVA), analysis of covariance (ANCOVA) or analysis of repeated measures data to calculate the fold change (FC, Delta Delta Ct method) or relative expression (RE, Delta Ct method). The functions further provide standard errors and confidence intervals for means, apply statistical mean comparisons and present significance. To facilitate function application, different data sets are used as examples and the outputs are explained. The rtpcr package also provides bar plots using various controlling arguments. The rtpcr package is user-friendly and easy to work with and provides an applicable resource for analyzing real-time PCR data.
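The two fold-change calculations named above can be written out directly. The Python sketch below shows the standard Livak (Delta Delta Ct) and Pfaffl (efficiency-corrected) formulas for a single target gene and a single reference gene. It is not the rtpcr API; the function names, argument names, and Ct values are hypothetical, and amplification efficiency is expressed as an amplification factor E, where E = 2 means perfect doubling per cycle.

```python
def fold_change_livak(ct_target_trt, ct_ref_trt, ct_target_ctl, ct_ref_ctl):
    """Livak method: FC = 2 ** (-ddCt), assuming perfect efficiency (E = 2)."""
    delta_ct_trt = ct_target_trt - ct_ref_trt  # Delta Ct, treated sample
    delta_ct_ctl = ct_target_ctl - ct_ref_ctl  # Delta Ct, control sample
    return 2.0 ** (-(delta_ct_trt - delta_ct_ctl))

def fold_change_pfaffl(ct_target_trt, ct_ref_trt, ct_target_ctl, ct_ref_ctl,
                       e_target=2.0, e_ref=2.0):
    """Pfaffl method: each gene uses its own amplification factor E."""
    target_ratio = e_target ** (ct_target_ctl - ct_target_trt)
    ref_ratio = e_ref ** (ct_ref_ctl - ct_ref_trt)
    return target_ratio / ref_ratio

# Illustrative mean Ct values (not real data).
livak = fold_change_livak(22.0, 18.0, 24.0, 18.0)
pfaffl_ideal = fold_change_pfaffl(22.0, 18.0, 24.0, 18.0)
pfaffl_adj = fold_change_pfaffl(22.0, 18.0, 24.0, 18.0, e_target=1.9)
print(livak, pfaffl_ideal, pfaffl_adj)
```

When both amplification factors equal 2 the Pfaffl formula reduces to the Livak formula, which is why a single efficiency-aware calculation can cover both methods.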
This package provides a comprehensive statistical analysis of the accuracy of blood pressure devices based on the method of AAMI/ANSI SP10 standards developed by the AAMI Sphygmomanometer Committee for indirect measurement of blood pressure, incorporated into ISO 81060-2. The bpAcc package gives the exact probability of accepting a device D derived from the joint distribution of the sample standard deviation and a non-linear transformation of the sample mean for a specified sample size introduced by Chandel et al. (2023) and by the Association for the Advancement of Medical Instrumentation (2003, ISBN:1-57020-183-8).
This package implements parameter estimation using a Bayesian approach for Multivariate Threshold Autoregressive (MTAR) models with missing data using Markov Chain Monte Carlo methods. Performs the simulation of MTAR processes (mtarsim()), estimation of matrix parameters and the threshold values (mtarns()), identification of the autoregressive orders using Bayesian variable selection (mtarstr()), identification of the number of regimes using Metropolised Carlin and Chib (mtarnumreg()) and estimation of missing data, coefficients and covariance matrices conditional on the autoregressive orders, the threshold values and the number of regimes (mtarmissing()). See Calderon and Nieto (2017) <doi:10.1080/03610926.2014.990758>.
In the era of big data, data redundancy and distributed characteristics present novel challenges to data analysis. This package introduces a method for estimating optimal subsets of redundant distributed data, based on PPCDT (Conjunction of Power and P-value in Distributed Settings). Leveraging PPC technology, this approach can efficiently extract valuable information from redundant distributed data and determine the optimal subset. Experimental results demonstrate that this method not only enhances data quality and utilization efficiency but also assesses its performance effectively. The philosophy of the package is described in Guo G. (2020) <doi:10.1007/s00180-020-00974-4>.
Computes the D', Wn, and conditional asymmetric linkage disequilibrium (ALD) measures for pairs of genetic loci. Performs these linkage disequilibrium (LD) calculations on phased genotype data recorded using Genotype List (GL) String or columnar formats. Alternatively, generates expectation-maximization (EM) estimated haplotypes from phased data, or performs LD calculations on EM estimated haplotypes. Performs sign tests comparing LD values for phased and unphased datasets, and generates heat-maps for each LD measure. Described by Osoegawa et al. (2019a) <doi:10.1016/j.humimm.2019.01.010>, and Osoegawa et al. (2019b) <doi:10.1016/j.humimm.2019.05.018>.
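For orientation, the simplest case of the D' measure can be computed by hand. The Python sketch below implements Lewontin's D' for one pair of alleles at two loci, given haplotype and allele frequencies. It is a deliberately reduced illustration: the package's measures cover multi-allelic loci and also include Wn and conditional ALD, none of which are shown here, and the function name d_prime and its arguments are hypothetical.

```python
def d_prime(p_ab, p_a, p_b):
    """Signed Lewontin's D' for alleles A (locus 1) and B (locus 2).

    p_ab: frequency of the A-B haplotype
    p_a:  frequency of allele A
    p_b:  frequency of allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    # A monomorphic locus gives d_max == 0; report no disequilibrium.
    return 0.0 if d_max == 0 else d / d_max

complete = d_prime(p_ab=0.5, p_a=0.5, p_b=0.5)      # A always travels with B
independent = d_prime(p_ab=0.25, p_a=0.5, p_b=0.5)  # no association
print(complete, independent)
```

The value is scaled by the largest disequilibrium possible for the given allele frequencies, so its magnitude lies between 0 and 1 regardless of how common the alleles are.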
High-throughput analysis of growth curves and fluorescence data using three methods: linear regression, growth model fitting, and smooth spline fit. Analysis of dose-response relationships via smoothing splines or dose-response models. Complete data analysis workflows can be executed in a single step via user-friendly wrapper functions. The results of these workflows are summarized in detailed reports as well as intuitively navigable R data containers. A shiny application provides access to all features without requiring any programming knowledge. The package is described in further detail in Wirth et al. (2023) <doi:10.1038/s41596-023-00850-7>.
Rapidly build accurate genetic prediction models for genome-wide association or whole-genome sequencing study data by the smooth-threshold multivariate genetic prediction (STMGP) method. Variable selection is performed using marginal association test p-values with an optimal p-value cutoff selected by a Cp-type criterion. Quantitative and binary traits are modeled via linear and logistic regression models, respectively. A function that works through PLINK software (Purcell et al. 2007 <DOI:10.1086/519795>, Chang et al. 2015 <DOI:10.1186/s13742-015-0047-8>) <https://www.cog-genomics.org/plink2> is provided. Covariates can be included in the regression model.
This package implements the SISAL algorithm by Tikka and Hollmén. It is a sequential backward selection algorithm which uses a linear model in a cross-validation setting. Starting from the full model, one variable at a time is removed based on the regression coefficients. From this set of models, a parsimonious (sparse) model is found by choosing the model with the smallest number of variables among those models where the validation error is smaller than a threshold. Also implements extensions which explore larger parts of the search space and/or use ridge regression instead of ordinary least squares.
Health research using data from electronic health records (EHR) has gained popularity, but misclassification of EHR-derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error for association tests. Here, the assumed target of inference is the relationship between binary disease status and predictors modeled using a logistic regression model. SAMBA implements several methods for obtaining bias-corrected point estimates along with valid standard errors as proposed in Beesley and Mukherjee (2020) <doi:10.1101/2019.12.26.19015859>, currently under review.