Rasqal is a C library that handles Resource Description Framework (RDF) query language syntaxes, query construction and execution of queries returning results as bindings, boolean, RDF graphs/triples or syntaxes. The supported query languages are SPARQL Query 1.0, SPARQL Query 1.1, SPARQL Update 1.1 (without execution) and the experimental SPARQL extensions (LAQRS). Rasqal can write binding query results in the SPARQL XML, SPARQL JSON, CSV, TSV, HTML, ASCII table, RDF/XML and Turtle/N3 formats and read them in the SPARQL XML, RDF/XML and Turtle/N3 formats.
The main purpose of this package is to provide the algorithmic complexity for short strings, an approximation of the Kolmogorov Complexity of a short string using the coding theorem method. While the database containing the complexity is provided in the data-only package acss.data, this package provides functions for accessing the data, such as prob_random, which returns the posterior probability that a given string was produced by a random process. In addition, two traditional (but problematic) measures of complexity are also provided: entropy and change complexity.
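As a rough illustration of the entropy measure mentioned above, here is a minimal Python sketch of first-order Shannon entropy over the symbols of a string. It is not the package's implementation; the function name shannon_entropy is hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def shannon_entropy(string):
    """First-order Shannon entropy (in bits) of the symbols in `string`.

    Hypothetical helper for illustration; not part of the package.
    """
    counts = Counter(string)
    n = len(string)
    # Sum p * log2(1 / p) over the observed symbols, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


balanced = shannon_entropy("0101")   # two symbols, equally frequent
constant = shannon_entropy("0000")   # a single repeated symbol
```

The sketch also hints at why entropy is called problematic: "0101" and "0110" get the same value although one is visibly more regular than the other.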
RNA abundance and cell size parameters could improve RNA-seq deconvolution algorithms to more accurately estimate cell type proportions given the different cell type transcription activity levels. A Total RNA Expression Gene (TREG) can facilitate estimating total RNA content using single molecule fluorescent in situ hybridization (smFISH). We developed a data-driven approach using a measure of expression invariance to find candidate TREGs in postmortem human brain single nucleus RNA-seq. This R package implements the method for identifying candidate TREGs from snRNA-seq data.
Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for estimating variation of gene set enrichment through the samples of an expression data set. GSVA performs a change in coordinate systems, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner.
This package facilitates easy manipulation of variant call format (VCF) data. Functions are provided to rapidly read from and write to VCF files. Once VCF data is read into R, a parser function extracts matrices of data. This information can then be used for quality control or other purposes. Additional functions provide visualization of genomic data. Once processing is complete data may be written to a VCF file. It also may be converted into other popular R objects. This package provides a link between VCF data and familiar R software.
TSIS is used for detecting transcript isoform switches in time-series data. Transcript isoform switches occur when a pair of alternatively spliced isoforms reverse the order of their relative expression levels. TSIS characterizes the transcript switch by defining the isoform switch time-points for any pair of transcript isoforms within a gene. In addition, this tool describes the switch using five different features or metrics. It also filters the results according to the user's specifications and visualizes them with different plots so that the user can examine the switches in further detail.
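The definition of a switch given above (two isoforms reversing the order of their expression levels) can be sketched in a few lines of Python. This is only an illustration on made-up numbers, assuming one expression value per time point; the function name find_switch_intervals is hypothetical and this is not how TSIS itself locates or scores switches.

```python
def find_switch_intervals(iso_a, iso_b):
    """Return each index i where the two isoforms swap order between
    time point i and time point i + 1.

    Hypothetical helper for illustration only.
    """
    switches = []
    for i in range(len(iso_a) - 1):
        before = iso_a[i] - iso_b[i]
        after = iso_a[i + 1] - iso_b[i + 1]
        # A sign change of the difference means the order has reversed.
        if before * after < 0:
            switches.append(i)
    return switches


# Made-up expression values for two isoforms over five time points.
iso_a = [10.0, 8.0, 5.0, 2.0, 1.0]
iso_b = [2.0, 4.0, 6.0, 9.0, 9.5]
switch_points = find_switch_intervals(iso_a, iso_b)
```

Here the difference between the isoforms changes sign between the second and third time points, so a single switch interval is reported.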
REDUCE is a portable general-purpose computer algebra system. It is a system for doing scalar, vector and matrix algebra by computer, which also supports arbitrary precision numerical approximation and interfaces to gnuplot to provide graphics. It can be used interactively for simple calculations but also provides a full programming language, with a syntax similar to other modern programming languages. REDUCE supports alternative user interfaces including Run-REDUCE, TeXmacs and GNU Emacs. This package provides the Codemist Standard Lisp (CSL) version of REDUCE. It uses the gnuplot program, if installed, to draw figures.
Protein Group Code Algorithm (PGCA) is a computationally inexpensive algorithm to merge protein summaries from multiple quantitative proteomics experiments. The algorithm connects two or more groups with overlapping accession numbers. In some cases, pairwise groups are mutually exclusive but they may still be connected by another group (or set of groups) with overlapping accession numbers. Thus, groups created by PGCA from multiple experimental runs (i.e., global groups) are called "connected" groups. These identified global protein groups enable the analysis of quantitative data available for protein groups instead of unique protein identifiers.
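The idea of connecting groups through shared accession numbers can be illustrated with a small union-find sketch in Python. This is an assumption about how such merging could be done, not the PGCA implementation; the function name connect_groups and the accession labels are hypothetical.

```python
def connect_groups(groups):
    """Merge protein groups that share accession numbers, directly or
    through a chain of other groups (hypothetical illustration)."""
    parent = list(range(len(groups)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # accession -> index of the first group containing it
    for idx, accessions in enumerate(groups):
        for acc in accessions:
            if acc in first_seen:
                parent[find(idx)] = find(first_seen[acc])
            else:
                first_seen[acc] = idx

    merged = {}
    for idx, accessions in enumerate(groups):
        merged.setdefault(find(idx), set()).update(accessions)
    return sorted(sorted(members) for members in merged.values())


# The first two groups share no accession but are linked by the third.
groups = [{"P1", "P2"}, {"P3", "P4"}, {"P2", "P3"}, {"P9"}]
global_groups = connect_groups(groups)
```

In this made-up input the first two groups are mutually exclusive, yet they end up in the same connected group because the third group overlaps with both.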
Publicly available RNA-seq data is routinely used for retrospective analysis to elucidate new biology. Novel transcript discovery enabled by large collections of RNA-seq datasets has emerged as one such analysis. To increase the power of transcript discovery from large collections of RNA-seq datasets, we developed a new R package named Pooling RNA-seq and Assembling Models (PRAM), which builds transcript models in intergenic regions from pooled RNA-seq datasets. This package includes functions for defining intergenic regions, extracting and pooling related RNA-seq alignments, and predicting, selecting, and evaluating transcript models.
This package provides an up-to-date copy of the Internet Assigned Numbers Authority (IANA) Time Zone Database. It is updated periodically to reflect changes made by political bodies to time zone boundaries, UTC offsets, and daylight saving time rules. Additionally, this package provides a C++ interface for working with the date library. date provides comprehensive support for working with dates and date-times, which this package exposes to make it easier for other R packages to utilize. Headers are provided for calendar specific calculations, along with a limited interface for time zone manipulations.
Format dates and times flexibly and to whichever locales make sense. This package parses dates, times, and date-times in various formats (including string-based ISO 8601 constructions). The formatting syntax gives the user many options for formatting the date and time output in a precise manner. Time zones in the input can be expressed in multiple ways and there are many options for formatting time zones in the output as well. Several of the provided helper functions allow for automatic generation of locale-aware formatting patterns based on date/time skeleton formats and standardized date/time formats with varying specificity.
Multi-omic Pathway Analysis of Cells (MPAC) integrates multi-omic data for understanding cellular mechanisms. It predicts novel patient groups with distinct pathway profiles and identifies key pathway proteins with potential clinical associations. From CNA and RNA-seq data, it determines genes’ DNA and RNA states (i.e., repressed, normal, or activated), which serve as the input for PARADIGM to calculate Inferred Pathway Levels (IPLs). It also permutes DNA and RNA states to create a background distribution to filter IPLs as a way to remove events observed by chance. It provides multiple methods for downstream analysis and visualization.
The Tweedie compound Poisson distribution is a mixture of a degenerate distribution at the origin and a continuous distribution on the positive real line. It has been applied in a wide range of fields in which continuous data with exact zeros regularly arise. The cplm package provides likelihood based and Bayesian procedures for fitting common Tweedie compound Poisson linear models. In particular, models with hierarchical structures or extra zero inflation can be handled. Further, the package implements the Gini index based on an ordered version of the Lorenz curve as a robust model comparison tool involving zero-inflated and highly skewed distributions.
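To make the "exact zeros plus a continuous positive part" structure concrete, here is a minimal Python simulation sketch of a compound Poisson-gamma variable: a Poisson number of gamma-distributed terms is summed, and a count of zero gives an exact zero. The function names and parameter values are assumptions chosen for illustration and are unrelated to the cplm API.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson(lam) count with Knuth's multiplication method
    (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_compound_poisson_gamma(rng, lam, shape, scale):
    """Sum of N gamma(shape, scale) terms with N ~ Poisson(lam).
    When N is 0 the sum is empty and the draw is exactly zero."""
    n = sample_poisson(rng, lam)
    return sum(rng.gammavariate(shape, scale) for _ in range(n))


rng = random.Random(42)
draws = [sample_compound_poisson_gamma(rng, lam=1.0, shape=2.0, scale=1.5)
         for _ in range(2000)]
zero_fraction = sum(d == 0 for d in draws) / len(draws)
sample_mean = sum(draws) / len(draws)
```

With these illustrative parameters the probability of an exact zero is exp(-1), roughly 0.37, and the theoretical mean is lam * shape * scale = 3.0, so zero_fraction and sample_mean should land near those values.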
The aim of XINA is to determine which proteins exhibit similar patterns within and across experimental conditions, since proteins with co-abundance patterns may have common molecular functions. XINA imports multiple datasets, tags the datasets in silico, and combines the data for subsequent subgrouping into multiple clusters. The result is a single output depicting the variation across all conditions. XINA not only extracts co-abundance profiles within and across experiments, but also incorporates protein-protein interaction databases and integrative resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) to infer interactors and molecular functions, respectively, and produces intuitive graphical outputs.
Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm as described in http://doi.org/10.18637/jss.v045.i03. Each variable has its own imputation model. Built-in imputation models are provided for continuous data (predictive mean matching, normal), binary data (logistic regression), unordered categorical data (polytomous logistic regression) and ordered categorical data (proportional odds). MICE can also impute continuous two-level data (normal model, pan, second-level variables). Passive imputation can be used to maintain consistency between variables. Various diagnostic plots are available to inspect the quality of the imputations.
This package implements the fast cross-validation via sequential testing (CVST) procedure. CVST is an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. In addition to CVST, the package contains an implementation of ordinary k-fold cross-validation with a flexible and powerful set of helper objects and methods to handle the overall model selection process. The implementations of Cochran's Q test with permutations and of Wald's sequential testing framework are generic and can therefore also be used in other contexts.
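For readers unfamiliar with the ordinary k-fold cross-validation mentioned above, the splitting step can be sketched as follows in Python. This covers only plain k-fold index splitting, not the CVST procedure, and the function name k_fold_indices is hypothetical rather than part of the package.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k consecutive folds and return a list
    of (train_indices, test_indices) pairs (hypothetical helper)."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(10, 3)
```

Each sample appears in exactly one test fold, and a model would be fitted on the training indices of each pair and scored on the matching test indices.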
Rclone is a command line program to sync files and directories to and from different cloud storage providers.
Features include:
MD5/SHA1 hashes checked at all times for file integrity
Timestamps preserved on files
Partial syncs supported on a whole file basis
Copy mode to just copy new/changed files
Sync (one way) mode to make a directory identical
Check mode to check for file hash equality
Can sync to and from network, e.g., two different cloud accounts
Optional encryption (Crypt)
Optional cache (Cache)
Optional FUSE mount (rclone mount)
ASEB is an R package to predict lysine sites that can be acetylated by a specific KAT (K-acetyl-transferases) family. Lysine acetylation is a well-studied posttranslational modification on many kinds of proteins. About four thousand lysine acetylation sites and over 20 lysine KATs have been identified. However, which KAT is responsible for a given protein or lysine site acetylation is mostly unknown. In this package, we use a GSEA-like (Gene Set Enrichment Analysis) method to make predictions. The GSEA method was developed and successfully used to detect coordinated expression changes and find the putative functions of long non-coding RNAs.
Zero-variance control variates (ZV-CV) is a post-processing method to reduce the variance of Monte Carlo estimators of expectations using the derivatives of the log target. Once the derivatives are available, the only additional computational effort is in solving a linear regression problem. This method has been extended to higher dimensions using regularisation. This package can be used to easily perform ZV-CV or regularised ZV-CV when a set of samples, derivatives and function evaluations are available. Additional functions for applying ZV-CV to two estimators for the normalising constant of the posterior distribution in Bayesian statistics are also supplied.
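The "linear regression" step described above can be illustrated with a one-dimensional, first-order sketch in Python. It assumes a normal target with unit variance, whose log-density derivative is -(x - mu), and uses the intercept of a least-squares fit of the function values on the derivatives as the estimate. The function name zvcv_first_order is hypothetical and this is not the package's interface.

```python
import random


def zvcv_first_order(f_values, grad_log_target):
    """Intercept of the least-squares fit f ~ a + b * grad in one dimension.

    Because the derivative of the log target has expectation zero under
    the target, the intercept estimates E[f]. Hypothetical helper.
    """
    n = len(f_values)
    mean_f = sum(f_values) / n
    mean_g = sum(grad_log_target) / n
    cov_fg = sum((f - mean_f) * (g - mean_g)
                 for f, g in zip(f_values, grad_log_target))
    var_g = sum((g - mean_g) ** 2 for g in grad_log_target)
    slope = cov_fg / var_g
    return mean_f - slope * mean_g


rng = random.Random(0)
mu = 2.0                                          # assumed target mean
samples = [rng.gauss(mu, 1.0) for _ in range(200)]
grads = [-(x - mu) for x in samples]              # d/dx log N(x; mu, 1)
plain_estimate = sum(samples) / len(samples)      # ordinary Monte Carlo
zv_estimate = zvcv_first_order(samples, grads)    # control-variate estimate
```

In this toy case f(x) = x is an exact linear function of the derivative, so the control-variate estimate recovers mu up to rounding error, while the plain sample mean still carries Monte Carlo noise.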
This package implements algorithms for calculating microarray enrichment (ACME). It is a set of tools for analysing tiling array data from chromatin immunoprecipitation combined with DNA microarrays (ChIP/chip), DNAse hypersensitivity, or other experiments that result in regions of the genome showing enrichment. It does not rely on a specific array technology (although the array should be a tiling array), is very general (it can be applied to any experiment resulting in regions of enrichment), and is very insensitive to array noise or normalization methods. It is also very fast and, given enough memory, can easily be applied to whole-genome tiling array experiments.
The purpose of this package is to discover the genes that are differentially expressed between two conditions in RNA-seq experiments. Gene expression is measured in counts of transcripts and modeled with the Negative Binomial (NB) distribution using a shrinkage approach for dispersion estimation. The method of moments (MM) estimates for dispersion are shrunk towards an estimated target, which minimizes the average squared difference between the shrinkage estimates and the initial estimates. The exact per-gene probability under the NB model is calculated and used to test the hypothesis that the expected expression of a gene in the two conditions follows the same NB distribution.
This package provides methods for manipulating regression models and for describing these in a style adapted for medical journals. It contains functions for generating an HTML table with crude and adjusted estimates, plotting hazard ratios, and plotting model estimates and confidence intervals using forest plots, including comparisons of multiple models in a single forest plot. In addition to the descriptive methods, there are functions for the robust covariance matrix provided by the sandwich package, a function for adding non-linearities to a model, and a wrapper around the Epi package's Lexis() functions for time-splitting a dataset when modeling non-proportional hazards in Cox regressions.
This is a package for the analysis of discrete response data using unidimensional and multidimensional item analysis models under the Item Response Theory paradigm (Chalmers (2012) <doi:10.18637/jss.v048.i06>). Exploratory and confirmatory item factor analysis models are estimated with quadrature (EM) or stochastic (MHRM) methods. Confirmatory bi-factor and two-tier models are available for modeling item testlets using dimension reduction EM algorithms, while multiple group analyses and mixed effects designs are included for detecting differential item, bundle, and test functioning, and for modeling item and person covariates. Finally, latent class models such as the DINA, DINO, multidimensional latent class, mixture IRT models, and zero-inflated response models are supported.
tLOH, or transcriptomicsLOH, assesses evidence for loss of heterozygosity (LOH) in pre-processed spatial transcriptomics data. This tool requires spatial transcriptomics cluster and allele count information at likely heterozygous single-nucleotide polymorphism (SNP) positions in VCF format. Bayes factors are calculated at each SNP to determine the likelihood of a potential loss of heterozygosity event. Two plotting functions are included to visualize allele fraction and aggregated Bayes factor per chromosome. Data generated with the 10X Genomics Visium Spatial Gene Expression platform must be pre-processed to obtain an individual sample VCF with columns for each cluster. Required fields are allele depth (AD) with counts for reference/alternative alleles and read depth (DP).