This package provides a wrapped LASSO approach that integrates an ensemble learning strategy to help select efficient, stable, and high-confidence variables from omics-based data. It uses a bagging strategy in combination with a parametric method or an inflection point search method to determine the cut-off threshold. The package can integrate and vote on variables generated from multiple LASSO models to determine the optimal candidates. See Luo H, Zhao Q, et al. (2020) <doi:10.1126/scitranslmed.aax7533> for more details.
This package provides a pipeline to discern RNA structure at and proximal to the site of protein binding within regions of the transcriptome defined by the user. CLIP protein-binding data can be input as either aligned BAM or peak-called bedGraph files. RNA structure can either be predicted internally from sequence or users have the option to input their own RNA structure data. RNA structure binding profiles can be visually and quantitatively compared across multiple formats.
AnyStyle is a very fast and smart parser for academic reference lists and bibliographies. AnyStyle uses powerful machine learning heuristics based on Conditional Random Fields and aims to make it easy to train the model with data that is relevant to your parsing needs.
This package provides the Ruby module AnyStyle. AnyStyle can also be used via the anystyle command-line utility or a web application, though the latter has not yet been packaged for Guix.
Finds the most likely originating tissue(s) and developmental stage(s) of tissue-specific RNA sequencing data. The package identifies both pure transcriptomes and mixtures of transcriptomes. The most likely identity is found through comparisons of the sequencing data with high-throughput in situ hybridisation patterns. Typical uses are the identification of cancer cell origins, validation of cell culture strain identities, validation of single-cell transcriptomes, and validation of identity and purity of flow-sorting and dissection sequencing products.
This package implements fast Monte Carlo simulations for goodness-of-fit (GOF) tests for discrete distributions. This includes tests based on the Chi-squared statistic, the log-likelihood-ratio (G^2) statistic, the Freeman-Tukey (Hellinger-distance) statistic, the Kolmogorov-Smirnov statistic, the Cramer-von Mises statistic as described in Choulakian, Lockhart and Stephens (1994) <doi:10.2307/3315828>, and the root-mean-square statistic, see Perkins, Tygert, and Ward (2011) <doi:10.1016/j.amc.2011.03.124>.
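The Monte Carlo approach described above can be illustrated with a minimal sketch for the Chi-squared statistic only. It is not the package's implementation: the function names, the default of 2000 simulations, and the add-one correction of the p-value are assumptions made for this example.

```python
import random
from collections import Counter


def chi_squared(counts, probs, n):
    """Pearson chi-squared statistic of observed counts against expected n * p."""
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


def mc_gof_pvalue(observed, probs, n_sim=2000, seed=0):
    """Monte Carlo p-value: share of simulated statistics >= the observed one.

    observed: list of category counts; probs: hypothesised category probabilities.
    n_sim and seed are illustrative defaults, not values from any package.
    """
    rng = random.Random(seed)
    n = sum(observed)
    k = len(probs)
    stat_obs = chi_squared(observed, probs, n)
    exceed = 0
    for _ in range(n_sim):
        # Draw one sample of size n from the hypothesised discrete distribution.
        draws = Counter(rng.choices(range(k), weights=probs, k=n))
        simulated = [draws[i] for i in range(k)]
        if chi_squared(simulated, probs, n) >= stat_obs:
            exceed += 1
    # Add-one correction (an assumption of this sketch) keeps the estimate positive.
    return (exceed + 1) / (n_sim + 1)


p_good = mc_gof_pvalue([25, 25, 25, 25], [0.25] * 4)
p_poor = mc_gof_pvalue([70, 10, 10, 10], [0.25] * 4)
```

Here `p_good` comes from counts that match the hypothesised distribution exactly, while `p_poor` comes from counts that deviate strongly from it, so the second p-value should be far smaller than the first.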
An implementation of several functions for feature extraction in ordinal time series datasets. Specifically, some of the features proposed by Weiss (2019) <doi:10.1080/01621459.2019.1604370> can be computed. These features can be used to perform inferential tasks or to feed machine learning algorithms for ordinal time series, among others. The package also includes some interesting datasets containing financial time series. Practitioners from a broad variety of fields could benefit from the general framework provided by otsfeatures.
Using Gaussian graphical models we propose a novel approach to perform pathway analysis using gene expression. Given the structure of a graph (a pathway) we introduce two statistical tests to compare the mean and the concentration matrices between two groups. Specifically, these tests can be performed on the graph and on its connected components (cliques). The package is based on the method described in Massa M.S., Chiogna M., Romualdi C. (2010) <doi:10.1186/1752-0509-4-121>.
This package contains functions for a variational Bayesian method for sparse PCA proposed by Ning (2020) <arXiv:2102.00305>. There are two algorithms: the PX-CAVI algorithm (if assuming the loadings matrix is jointly row-sparse) and the batch PX-CAVI algorithm (without this assumption). The outputs of the main function, VBsparsePCA(), include the mean and covariance of the loadings matrix, the score functions, the variable selection results, and the estimated variance of the random noise.
This package provides a class and subclasses for storing non-scalar objects in matrix entries. This is akin to a ragged array but the raggedness is in the third dimension, much like a bumpy surface--hence the name. Of particular interest is the BumpyDataFrameMatrix, where each entry is a Bioconductor data frame. This allows us to naturally represent multivariate data in a format that is compatible with two-dimensional containers like the SummarizedExperiment and MultiAssayExperiment objects.
Function-oriented Make-like declarative pipelines for statistics and data science are supported in the targets R package. As an extension to targets, the tarchetypes package provides convenient user-side functions to make targets easier to use. By establishing reusable archetypes for common kinds of targets and pipelines, these functions help express complicated reproducible pipelines concisely and compactly. The methods in this package were influenced by the drake R package by Will Landau (2018) <doi:10.21105/joss.00550>.
inline-c is a small crate that allows a user to write C (including C++) code inside Rust. Both environments are strictly sandboxed. The C code is transformed into a string which is written to a temporary file. This file is then compiled into an object file, that is finally executed.
The primary goal of inline-c is to ease the testing of a C API of a Rust program (generated with cbindgen for example).
The fusion learning method uses a model selection algorithm to learn from multiple data sets across different experimental platforms through group penalization. The responses of interest may include a mix of discrete and continuous variables. The responses may share the same set of predictors; however, the models and parameters differ across different platforms. Integrating information from different data sets can enhance the power of model selection. The package is based on Xin Gao, Raymond J. Carroll (2017) <arXiv:1610.00667v1>.
This package provides functions to estimate a factor model using discrete and continuous proxy variables. The function dproxyme estimates a factor model of discrete proxy variables using an EM algorithm (Dempster, Laird, Rubin (1977) <doi:10.1111/j.2517-6161.1977.tb01600.x>; Hu (2008) <doi:10.1016/j.jeconom.2007.12.001>; Hu (2017) <doi:10.1016/j.jeconom.2017.06.002>). The function cproxyme estimates a linear factor model (Cunha, Heckman, and Schennach (2010) <doi:10.3982/ECTA6551>).
An API wrapper around the ProPublica API <https://projects.propublica.org/api-docs/congress-api/> for U.S. Congressional Bills. Users can include their API key, U.S. Congress, branch, and offset ranges, to return a dataframe of all results within those parameters. This package is different from the RPublica package because it is for the ProPublica U.S. Congress data API, and the RPublica package is for the Nonprofit Explorer, Forensics, and Free the Files data APIs.
This package contains functions for analysis and summary of tidal datasets. Also provides access to tidal data collected by the National Oceanic and Atmospheric Administration's Center for Operational Oceanographic Products and Services and the Permanent Service for Mean Sea Level. For detailed description and application examples, see Hill, T.D. and S.C. Anisfeld (2021) <doi:10.6084/m9.figshare.14161202.v1> and Hill, T.D. and S.C. Anisfeld (2015) <doi:10.1016/j.ecss.2015.06.004>.
The package CellBarcode performs Cellular DNA Barcode analysis. It can handle all kinds of DNA barcodes, as long as the barcode is within a single sequencing read and has a pattern that can be matched by a regular expression. CellBarcode can handle barcodes with flexible lengths, with or without UMI (unique molecular identifier). This tool can also be used for pre-processing some amplicon data such as CRISPR gRNA screening, immune repertoire sequencing, and metagenome data.
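The idea of matching a barcode of flexible length, optionally preceded by a UMI, with a regular expression can be sketched as follows. This is an illustration only and does not use the CellBarcode API: the read layout (a 6-base UMI, the flank CTCGAG, an 8 to 12 base barcode, the flank GAATTC), the pattern, and the function name are all hypothetical.

```python
import re

# Hypothetical read layout, invented for this example:
# 6-base UMI, constant flank CTCGAG, barcode of 8 to 12 bases, constant flank GAATTC.
PATTERN = re.compile(r"(?P<umi>[ACGT]{6})CTCGAG(?P<barcode>[ACGT]{8,12})GAATTC")


def count_barcodes(reads, pattern=PATTERN):
    """Count distinct UMIs per barcode; reads that do not match are skipped."""
    umis_by_barcode = {}
    for read in reads:
        match = pattern.search(read)
        if match is None:
            continue
        umis_by_barcode.setdefault(match.group("barcode"), set()).add(match.group("umi"))
    return {barcode: len(umis) for barcode, umis in umis_by_barcode.items()}


reads = [
    "AAACCC" + "CTCGAG" + "ACGTACGTAC" + "GAATTC",
    "AAACCC" + "CTCGAG" + "ACGTACGTAC" + "GAATTC",  # same UMI: a PCR duplicate
    "GGGTTT" + "CTCGAG" + "ACGTACGTAC" + "GAATTC",  # same barcode, new UMI
    "TTTTTT" + "CTCGAG" + "TTGGCCAA" + "GAATTC",    # shorter barcode
    "NNNNNNNNNNNN",                                 # no match, ignored
]
barcode_counts = count_barcodes(reads)
```

Counting distinct UMIs rather than reads is one common way to discount PCR duplicates; whether and how to do so in a real analysis depends on the protocol.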
Routines to handle family data with a Pedigree object. The initial purpose was to create correlation structures that describe family relationships such as kinship and identity-by-descent, which can be used to model family data in mixed effects models, such as in the coxme function. Also includes a tool for Pedigree drawing which is focused on producing compact layouts without intervention. Recent additions include utilities to trim the Pedigree object with various criteria, and kinship for the X chromosome.
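As an illustration of the kinship correlation structure mentioned above, the standard recursive definition of the kinship coefficient can be sketched as below. This is not the package's code: the dictionary-based pedigree, the function name, and the requirement that parents are listed before their children are assumptions of this example.

```python
from functools import lru_cache


def kinship_matrix(pedigree):
    """Kinship coefficients for every pair of individuals.

    pedigree: dict mapping id -> (father, mother), with None for unknown
    parents. Parents must be listed before their children (assumption).
    """
    order = {ind: position for position, ind in enumerate(pedigree)}

    @lru_cache(maxsize=None)
    def phi(a, b):
        if a is None or b is None:
            return 0.0  # unknown parents are treated as unrelated founders
        if a == b:
            father, mother = pedigree[a]
            return 0.5 * (1.0 + phi(father, mother))
        # Recurse through the individual listed later, so the recursion
        # never passes through a descendant of the other individual.
        if order[a] < order[b]:
            a, b = b, a
        father, mother = pedigree[a]
        return 0.5 * (phi(father, b) + phi(mother, b))

    return {(a, b): phi(a, b) for a in pedigree for b in pedigree}


family = {
    "dad": (None, None),
    "mom": (None, None),
    "kid1": ("dad", "mom"),
    "kid2": ("dad", "mom"),
}
kin = kinship_matrix(family)
```

With this definition an individual with unrelated parents has a kinship of 0.5 with itself, and parent-child and full-sibling pairs each have a kinship of 0.25.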
SpectralTAD is an R package designed to identify Topologically Associated Domains (TADs) from Hi-C contact matrices. It uses a modified version of spectral clustering that uses a sliding window to quickly detect TADs. The function works on a range of different formats of contact matrices and returns a bed file of TAD coordinates. The method does not require users to adjust any parameters to work and gives them control over the number of hierarchical levels to be returned.
SpatialCPie is an R package designed to facilitate cluster evaluation for spatial transcriptomics data by providing intuitive visualizations that display the relationships between clusters in order to guide the user during cluster identification and other downstream applications. The package is built around a shiny "gadget" to allow the exploration of the data with multiple plots in parallel and an interactive UI. The user can easily toggle between different cluster resolutions in order to choose the most appropriate visual cues.
Programming oncology-specific Clinical Data Interchange Standards Consortium (CDISC) compliant Analysis Data Model (ADaM) datasets in R. ADaM datasets are a mandatory part of any New Drug or Biologics License Application submitted to the United States Food and Drug Administration (FDA). Analysis derivations are implemented in accordance with the "Analysis Data Model Implementation Guide" (CDISC Analysis Data Model Team (2021), <https://www.cdisc.org/standards/foundational/adam>). The package is an extension package of the admiral package.
This package provides functions to compute distances between probability measures or any other data object that can be posed in this way, entropy measures for samples of curves, distances and depth measures for functional data, and the Generalized Mahalanobis Kernel distance for high dimensional data. For further details about the metrics please refer to Martos et al (2014) <doi:10.3233/IDA-140706>; Martos et al (2018) <doi:10.3390/e20010033>; Hernandez et al (2018, submitted); Martos et al (2018, submitted).
The four-gamete test is based on the infinite-sites model which assumes that the probability of the same mutation occurring twice (recurrent or parallel mutations) and the probability of a mutation back to the original state (reverse mutations) are close to zero. Without these types of mutations, the only explanation for observing the four dilocus genotypes (example below) is recombination (Hudson and Kaplan 1985, Genetics 111:147-164). Thus, the presence of all four gametes is also called phylogenetic incompatibility.
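The check at the heart of the test can be sketched in a few lines. The sketch assumes phased haplotypes given as strings over the alleles '0' and '1', with one biallelic site per position; the function names and this input format are hypothetical and do not belong to any particular program.

```python
from itertools import combinations


def has_four_gametes(haplotypes, i, j):
    """True if all four two-locus gametes (00, 01, 10, 11) occur at sites i and j."""
    gametes = {(hap[i], hap[j]) for hap in haplotypes}
    return len(gametes) == 4  # biallelic sites allow at most four combinations


def incompatible_pairs(haplotypes):
    """All pairs of sites at which the four-gamete condition is met."""
    n_sites = len(haplotypes[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if has_four_gametes(haplotypes, i, j)
    ]


haplotypes = ["000", "010", "100", "110"]
pairs = incompatible_pairs(haplotypes)
```

In this toy sample only sites 0 and 1 show all four gametes; the third site is monomorphic, so no pair that includes it can do so.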
A tandem repeat in DNA is two or more adjacent, approximate copies of a pattern of nucleotides. Tandem Repeats Finder is a program to locate and display tandem repeats in DNA sequences. In order to use the program, the user submits a sequence in FASTA format. The output consists of two files: a repeat table file and an alignment file. Submitted sequences may be of arbitrary length. Repeats with pattern size in the range from 1 to 2000 bases are detected.
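To make the definition concrete, the following naive sketch reports exact tandem repeats only. It is not the algorithm used by Tandem Repeats Finder, which also detects approximate copies; the function names, the maximum period of 6, and the minimum of 3 copies are assumptions chosen for this example.

```python
def is_primitive(unit):
    """True if unit is not itself a repetition of a shorter string."""
    return unit not in (unit + unit)[1:-1]


def find_exact_tandem_repeats(seq, max_period=6, min_copies=3):
    """Return (start, period, copies, unit) for runs of exact adjacent copies.

    Naive scan: for every period, extend a run while each base equals the
    base one period earlier. Approximate copies are NOT detected.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + period * min_copies <= n:
            j = i + period
            while j < n and seq[j] == seq[j - period]:
                j += 1
            copies = (j - i) // period
            unit = seq[i:i + period]
            if copies >= min_copies and is_primitive(unit):
                hits.append((i, period, copies, unit))
                i = j - period + 1  # continue just past the reported run
            else:
                i += 1
    return hits


repeats = find_exact_tandem_repeats("GGCACACACATT")
```

The primitivity check avoids reporting the same run twice, for example a run of A both as copies of "A" and as copies of "AA".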
Estimator augmentation methods for statistical inference on high-dimensional data, as described in Zhou, Q. (2014) <arXiv:1401.4425v2> and Zhou, Q. and Min, S. (2017) <doi:10.1214/17-EJS1309>. It provides several simulation-based inference methods: (a) Gaussian and wild multiplier bootstrap for lasso, group lasso, scaled lasso, scaled group lasso and their de-biased estimators, (b) importance sampler for approximating p-values in these methods, (c) Markov chain Monte Carlo lasso sampler with applications in post-selection inference.