Linear regression models and generalized linear models with nonparametric network effects on network-linked observations. The model was originally proposed by Le and Li (2022) <doi:10.48550/arXiv.2007.00803> and is assumed on observations that are connected by a network or similar relational data structure. More recent work by Wang, Le and Li (2024) <doi:10.48550/arXiv.2410.01163> further extends the framework to generalized linear models. All these models are implemented in the current package. The model does not assume that the relational data or network structure is precisely observed; thus, the method is provably robust to a certain level of perturbation of the network structure. The package contains the estimation and inference functions for the model.
The core algorithm is described in "Ball mapper: a shape summary for topological data analysis" by Pawel Dlotko (2019) <arXiv:1901.07410>. Please consult the following YouTube video <https://www.youtube.com/watch?v=M9Dm1nl_zSQ> for the idea of functionality. Ball Mapper provides a topologically accurate summary of data in the form of an abstract graph. To create it, please provide the coordinates of points (in the points array), values of a function of interest at those points (these can be initialized randomly if you do not have them) and the value epsilon, which is the radius of the ball in the Ball Mapper construction. Epsilon can be understood as the minimal resolution at which the model of the data is created.
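The construction described above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the package's implementation: the function name `ball_mapper`, the greedy choice of ball centres in input order, and the use of the mean function value as vertex colour are all assumptions made for this example.

```python
import math

def ball_mapper(points, values, epsilon):
    """Greedy Ball Mapper sketch (hypothetical helper, not the package API).

    Returns (landmarks, cover, edges, colors).
    """
    landmarks = []  # indices of the points chosen as ball centres
    for i, p in enumerate(points):
        # A point becomes a new centre only if no existing ball covers it.
        if all(math.dist(p, points[c]) > epsilon for c in landmarks):
            landmarks.append(i)
    # cover[k] lists the indices of all points within epsilon of centre k.
    cover = [
        [i for i, p in enumerate(points) if math.dist(p, points[c]) <= epsilon]
        for c in landmarks
    ]
    # Two balls are joined by an edge when they share at least one point.
    edges = [
        (a, b)
        for a in range(len(cover))
        for b in range(a + 1, len(cover))
        if set(cover[a]) & set(cover[b])
    ]
    # Colour each vertex by the mean function value over its ball (assumption).
    colors = [sum(values[i] for i in ball) / len(ball) for ball in cover]
    return landmarks, cover, edges, colors

# Toy data: four points on a line plus one distant outlier.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (5.0, 5.0)]
values = [0.0, 1.0, 2.0, 3.0, 10.0]
landmarks, cover, edges, colors = ball_mapper(points, values, epsilon=0.6)
print(landmarks, edges, colors)
```

A smaller epsilon yields more balls and a finer graph; a larger epsilon merges structure into fewer vertices.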
This package provides a generic, easy-to-use and expandable implementation of a pharmacokinetic (PK) / pharmacodynamic (PD) model based on the S4 class system. This package allows the user to read/write a pharmacometric model from/to files and adapt it further on the fly in the R environment. For this purpose, this package provides an intuitive API to add, modify or delete equations, ordinary differential equations (ODEs), model parameters or compartment properties (like infusion duration or rate, bioavailability and initial values). Finally, this package also provides a useful export of the model for use with simulation packages rxode2 and mrgsolve. This package is designed and intended to be used with package campsis, a PK/PD simulation platform built on top of rxode2 and mrgsolve.
PubTator <https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/> is a National Center for Biotechnology Information (NCBI) tool that enhances the annotation of articles on PubMed <https://www.ncbi.nlm.nih.gov/pubmed/>. It makes it possible to rapidly identify potential relationships between genes or proteins using text mining techniques. In contrast, manually searching for and reading the annotated articles would be very time consuming. PubTator offers both an online interface and a RESTful API; however, neither of these approaches is well suited for frequent, high-throughput analyses. The package pubtatordb provides a set of functions that make it easy for the average R user to download PubTator annotations, then create and query a local version of the database.
This toolkit is designed for manipulation and analysis of peptides. It provides functionalities to assist researchers in peptide engineering and proteomics. Users can manipulate peptides by adding amino acids at every position, count occurrences of each amino acid at each position, and transform amino acid counts based on probabilities. The package offers functionalities to select the best versus the worst peptides and analyze these peptides, which includes counting specific residues, reducing peptide sequences, extracting features through One Hot Encoding (OHE), and utilizing Quantitative Structure-Activity Relationship (QSAR) properties (based on the package Peptides by Osorio et al. (2015) <doi:10.32614/RJ-2015-001>). This package is intended for both researchers and bioinformatics enthusiasts working on peptide-based projects, especially for use with machine learning.
The package xmapbridge can plot graphs in the X:Map genome browser. X:Map uses the Google Maps API to provide a scrollable view of the genome. It supports a number of species, and can be accessed at http://xmap.picr.man.ac.uk. This package exports plotting files in a suitable format. Graph plotting in R is done using calls to the functions xmap.plot and xmap.points, which have parameters that aim to be similar to those used by the standard plot methods in R. These result in data being written to a set of files (in a specific directory structure) that contain the data to be displayed, as well as some additional meta-data describing each of the graphs.
Statistical inference with non-probability samples when auxiliary information from external sources such as probability samples or population totals or means is available. The package implements various methods such as inverse probability (propensity score) weighting, mass imputation and the doubly robust approach. Details can be found in: Chen et al. (2020) <doi:10.1080/01621459.2019.1677241>, Yang et al. (2020) <doi:10.1111/rssb.12354>, Kim et al. (2021) <doi:10.1111/rssa.12696>, Yang et al. (2021) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm> and Wu (2022) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm>. For details on the package and its functionalities see <doi:10.48550/arXiv.2504.04255>.
An updated implementation of R package ranger by Wright et al. (2017) <doi:10.18637/jss.v077.i01> for training and predicting from random forests, particularly suited to high-dimensional data, and for embedding in Multiple Imputation by Chained Equations (MICE) by van Buuren (2007) <doi:10.1177/0962280206074463>. Ensembles of classification and regression trees are currently supported. Sparse data of class dgCMatrix (R package Matrix) can be directly analyzed. Conventional bagged predictions are available alongside an efficient prediction for MICE via the algorithm proposed by Doove et al. (2014) <doi:10.1016/j.csda.2013.10.025>. Survival and probability forests are not supported in the update, nor is data of class gwaa.data (R package GenABEL); use the original ranger package for these analyses.
This package provides a graph community detection algorithm that aims to be performant on large graphs and robust, returning consistent results across runs. SpeakEasy 2 (SE2), the underlying algorithm, is described in Chris Gaiteri, David R. Connell & Faraz A. Sultan et al. (2023) <doi:10.1186/s13059-023-03062-0>. The core algorithm is written in C, providing speed and keeping the memory requirements low. This implementation can take advantage of multiple computing cores without increasing memory usage. SE2 can detect community structure across scales, making it a good choice for biological data, which often has hierarchical structure. Graphs can be passed to the algorithm as adjacency matrices using base R matrices, the Matrix library, igraph graphs, or any data that can be coerced into a matrix.
epidecodeR is a package capable of analysing the impact of the degree of DNA/RNA epigenetic chemical modifications on the dysregulation of genes or proteins. This package integrates chemical modification data generated from a host of epigenomic or epitranscriptomic techniques such as ChIP-seq, ATAC-seq, m6A-seq, etc. with dysregulated gene lists in the form of differential gene expression, ribosome occupancy or differential protein translation, and identifies the impact on gene dysregulation of varying degrees of chemical modification associated with the genes. epidecodeR generates cumulative distribution function (CDF) plots showing shifts in the trend of overall log2FC between genes divided into groups based on the degree of modification associated with the genes. The tool also tests for significance of the difference in log2FC between groups of genes.
This package provides an integrated pipeline for the analysis of PAR-CLIP data. PAR-CLIP-induced transitions are first discriminated from sequencing errors, SNPs and additional non-experimental sources by a non-parametric mixture model. The protein binding sites (clusters) are then resolved at high resolution and cluster statistics are estimated using a rigorous Bayesian framework. Post-processing of the results, data export for UCSC genome browser visualization and motif search analysis are provided. In addition, the package integrates RNA-Seq data to estimate the False Discovery Rate of cluster detection. Key functions support parallel multicore computing. While wavClusteR was designed for PAR-CLIP data analysis, it can be applied to the analysis of other NGS data obtained from experimental procedures that induce nucleotide substitutions (e.g. BisSeq).
This package provides a powerful and flexible tool for visualizing proportional data across spatially resolved contexts. By combining the concepts of scatter plots and stacked bar charts, scatterbar allows users to create scattered bar chart plots, which effectively display the proportions of different categories at each (x, y) location. This visualization is particularly useful for applications where understanding the distribution of categories across spatial coordinates is essential. This package features automatic determination of optimal scaling factors based on data, customizable scaling and padding options for both x and y axes, flexibility to specify custom colors for each category, options to customize the legend title, and integration with ggplot2 for robust and high-quality visualizations. For more details, see Velazquez et al. (2024) <doi:10.1101/2024.08.14.606810>.
This package provides model data and functions for easily using machine learning models that use data from the DNA methylome to classify cancer type and phenotype from a sample. The primary motivation for the development of this package is to abstract away the granular and accessibility-limiting code required to utilize machine learning models in R. Our package provides this abstraction for RandomForest, e1071 Support Vector, Extreme Gradient Boosting, and Tensorflow models. This is paired with an ExperimentHub component, which contains models developed for epigenetic cancer classification and predicting phenotypes. This includes CNS tumor classification, Pan-cancer classification, race prediction, cell of origin classification, and subtype classification models. The package links to our models on ExperimentHub. The package currently supports HM450, EPIC, EPICv2, MSA, and MM285.
The Datasaurus Dozen is a set of datasets with the same summary statistics. They retain the same summary statistics despite having radically different distributions. The datasets represent a larger and quirkier object lesson that is typically taught via Anscombe's Quartet (available in the 'datasets' package). Anscombe's Quartet contains four very different distributions with the same summary statistics and as such highlights the value of visualisation in understanding data, over and above summary statistics. As well as being an engaging variant on the Quartet, the data is generated in a novel way. The simulated annealing process used to derive datasets from the original Datasaurus is detailed in "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" <doi:10.1145/3025453.3025912>.
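The point about identical summary statistics can be checked numerically. The sketch below is an illustration in Python rather than R, using the first two sets of Anscombe's Quartet (they share the same x values); the helper names `pearson` and `summary` are introduced only for this example.

```python
import math
from statistics import mean, variance

# First two sets of Anscombe's Quartet; both use the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def summary(xs, ys):
    """Mean and sample variance of y, plus the x-y correlation."""
    return {"mean_y": mean(ys), "var_y": variance(ys), "corr": pearson(xs, ys)}

s1, s2 = summary(x, y1), summary(x, y2)
for key in s1:
    # The statistics agree to about two decimals although y1 and y2 differ.
    print(key, round(s1[key], 2), round(s2[key], 2))
```

Plotting y1 and y2 against x shows a roughly linear cloud in one case and a clear curve in the other, which is the lesson the summary statistics alone hide.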
Space-filling designs have had great impact in computer experiments. The most popular space-filling designs are uniform designs (UDs) and Latin hypercube designs (LHDs). For further references see McKay (1979) <DOI:10.1080/00401706.1979.10489755> and Fang (1980) <https://cir.nii.ac.jp/crid/1570291225616774784>. This package provides algorithms for generating efficient LHDs and UDs. The generated LHDs are efficient in that they possess lower values of the Maxpro measure, the Phi_p value and the Maximum Absolute Correlation (MAC) value, based on the weightage given to each criterion. The produced UDs have good space-filling properties, as they always attain the lower bound of the Discrete Discrepancy measure. The package also includes some additional useful functions.
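To make the LHD and MAC terminology concrete, here is a minimal sketch in Python. It is not the package's algorithm: it builds a random LHD by permuting the levels 1..n in each column and then keeps the best of several random candidates by MAC alone, a naive random search assumed for this example. The function names are hypothetical.

```python
import math
import random
from itertools import combinations

def random_lhd(n_runs, n_factors, rng):
    """Random Latin hypercube: every column is a permutation of 1..n_runs."""
    columns = []
    for _ in range(n_factors):
        col = list(range(1, n_runs + 1))
        rng.shuffle(col)
        columns.append(col)
    return [list(row) for row in zip(*columns)]  # rows are design points

def pearson(a, b):
    """Pearson correlation of two equal-length columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def max_abs_correlation(design):
    """MAC: largest absolute pairwise correlation between design columns."""
    cols = list(zip(*design))
    return max(abs(pearson(a, b)) for a, b in combinations(cols, 2))

def best_of_random(n_runs, n_factors, n_tries=200, seed=0):
    """Naive random search (assumption): keep the candidate with lowest MAC."""
    rng = random.Random(seed)
    candidates = (random_lhd(n_runs, n_factors, rng) for _ in range(n_tries))
    return min(candidates, key=max_abs_correlation)

design = best_of_random(n_runs=8, n_factors=3)
mac = max_abs_correlation(design)
print(design, round(mac, 3))
```

A lower MAC means the factor columns are closer to mutually uncorrelated; a weighted criterion would combine this with the other measures named above.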
This package provides a customisable R shiny app for immersively visualising, mapping and annotating panospheric (360 degree) imagery. The flexible interface allows annotation of any geocoded images using up to 4 user specified dropdown menus. The app uses leaflet to render maps that display the geo-locations of images and pannellum <https://pannellum.org/>, a lightweight panorama viewer for the web, to render images in virtual 360 degree viewing mode. Key functions include the ability to draw on and export parts of 360 images for downstream applications. Users can also draw polygons and points on map imagery related to the panoramic images and export them for further analysis. Downstream applications include using annotations to train Artificial Intelligence/Machine Learning (AI/ML) models and geospatial modelling and analysis of camera based survey data.
Unleash the power of time-series data visualization with ease using our package. Designed with simplicity in mind, it offers three key features through the shiny package output. The first tab shows time-series charts with forecasts, allowing users to visualize trends and changes effortlessly. The second tab displays averages per country presented in tables with accompanying sparklines, providing a quick and attractive overview of the data. The last tab presents a customizable world map colored based on user-defined variables for any chosen number of countries, offering an advanced visual approach to understanding geographical data distributions. This package operates with just a few simple arguments, enabling users to conduct sophisticated analyses without the need for complex programming skills. Transform your time-series data analysis experience with our user-friendly tool.
DNA methylation is generally considered to be associated with transcriptional silencing. However, comprehensive, genome-wide investigation of this relationship requires the evaluation of potentially millions of correlation values between the methylation of individual genomic loci and expression of associated transcripts in a relatively large number of samples. Methodical makes this process quick and easy while keeping a low memory footprint. It also provides a novel method for identifying regions where a number of methylation sites are consistently strongly associated with transcriptional expression. In addition, Methodical enables housing DNA methylation data from diverse sources (e.g. WGBS, RRBS and methylation arrays) within a common framework, lifting over DNA methylation data between different genome builds and creating base-resolution plots of the association between DNA methylation and transcriptional activity at transcriptional start sites.
The cyclotomic numbers are complex numbers that can be thought of as the rational numbers extended with the roots of unity. They are represented exactly, enabling exact computations. They contain the Gaussian rationals (complex numbers with rational real and imaginary parts) as well as the square roots of all rational numbers. They also contain the sine and cosine of all rational multiples of pi. The algorithms implemented in this package are taken from the Haskell package cyclotomic, whose algorithms are adapted from code by Martin Schoenert and Thomas Breuer in the GAP project (<https://www.gap-system.org/>). Cyclotomic numbers have applications in number theory, algebraic geometry, algebraic number theory, coding theory, and in the theory of graphs and combinatorics. They have connections to the theory of modular functions and modular curves.
Data analysis often requires coding, especially when data are collected through interviews, observations, or questionnaires. As a result, code counting and data preparation are essential steps in the analysis process. Analysts may need to count the codes in a text (tokenization, counting of pre-established codes, computing the co-occurrence matrix by line) and prepare the data (e.g., min-max normalization, Z-score, robust scaling, Box-Cox transformation, and non-parametric bootstrap). For the Box-Cox transformation (Box & Cox, 1964, <https://www.jstor.org/stable/2984418>), the optimal lambda is determined using the log-likelihood method. Non-parametric bootstrap involves randomly sampling data with replacement. Two random number generators are also integrated: a Lehmer congruential generator for the uniform distribution and a Box-Muller generator for the normal distribution. This package is intended for educational purposes.
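Several of the techniques named above fit in a short sketch. The Python code below is an illustration, not the package's code: the Lehmer constants (multiplier 16807, modulus 2^31 - 1, the classic "minimal standard" choice) and the use of the sample standard deviation in the Z-score are assumptions, and all names are hypothetical.

```python
import math
from statistics import mean, stdev

class Lehmer:
    """Lehmer congruential generator: x_{k+1} = a * x_k mod m."""

    def __init__(self, seed=12345, a=16807, m=2**31 - 1):
        if seed % m == 0:
            raise ValueError("seed must not be a multiple of m")
        self.a, self.m, self.state = a, m, seed % m

    def uniform(self):
        """Next value, strictly between 0 and 1."""
        self.state = (self.a * self.state) % self.m
        return self.state / self.m

def box_muller(rng):
    """Turn two uniforms into two independent standard normal draws."""
    u1, u2 = rng.uniform(), rng.uniform()
    radius = math.sqrt(-2.0 * math.log(u1))
    return radius * math.cos(2 * math.pi * u2), radius * math.sin(2 * math.pi * u2)

def min_max(xs):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Centre on the mean and divide by the sample standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def bootstrap_sample(xs, rng):
    """Non-parametric bootstrap: resample len(xs) values with replacement."""
    return [xs[int(rng.uniform() * len(xs))] for _ in xs]

rng = Lehmer(seed=1)
normals = [z for _ in range(500) for z in box_muller(rng)]
print(round(mean(normals), 2), round(stdev(normals), 2))
print(min_max([2, 4, 6]), z_score([1, 2, 3]))
print(bootstrap_sample([10, 20, 30, 40], rng))
```

With a prime modulus and a nonzero seed the state never reaches zero, so the logarithm in `box_muller` is always defined.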
This package provides functions to classify mass spectra in known categories, and to determine discriminant mass-over-charge values. It includes easy-to-use functions for pre-processing mass spectra, functions to determine discriminant mass-over-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotypes, etc.) associated to a mass spectrum from a list of selected mass-over-charge values. Three vignettes illustrating how to use the functions of this package from real data sets are also available online to help users: <https://agodmer.github.io/MSclassifR_examples/Vignettes/Vignettemsclassifr_Ecrobiav3.html>, <https://agodmer.github.io/MSclassifR_examples/Vignettes/Vignettemsclassifr_Klebsiellav3.html> and <https://agodmer.github.io/MSclassifR_examples/Vignettes/Vignettemsclassifr_DAv3.html>.
Calculates exact hypothesis tests to compare a treatment and a reference group with respect to multiple binary endpoints. The tested null hypothesis is an identical multidimensional distribution of successes and failures in both groups. The alternative hypothesis is a larger success proportion in the treatment group in at least one endpoint. The tests are based on the multivariate permutation distribution of subjects between the two groups. For this permutation distribution, rejection regions are calculated that satisfy one of different possible optimization criteria. In particular, regions with maximal exhaustion of the nominal significance level, maximal power under a specified alternative or maximal number of elements can be found. Optimization is achieved by a branch-and-bound algorithm. By application of the closed testing principle, the global hypothesis tests are extended to multiple testing procedures.
This package performs multi-omic differential network analysis by revealing differential interactions between molecular entities (genes, proteins, transcription factors, or other biomolecules) across the omic datasets provided. For each omic dataset, a differential network is constructed where links represent statistically significant differential interactions between entities. These networks are then integrated into a comprehensive visualization using distinct colors to distinguish interactions from different omic layers. This unified display allows interactive exploration of cross-omic patterns, such as differential interactions present at both transcript and protein levels. For each link, users can access differential statistical significance metrics (p values or adjusted p values, calculated via robust or traditional linear regression with interaction term) and differential regression plots. The methods implemented in this package are described in Sciacca et al. (2023) <doi:10.1093/bioinformatics/btad192>.
Fast randomization-based two-sample tests. These test the hypothesis that two samples come from the same distribution, using randomization to create p-values. Included tests are: Kolmogorov-Smirnov, Kuiper, Cramer-von Mises, Anderson-Darling, Wasserstein, and DTS. The default test (two_sample) is based on the DTS test statistic, as it is the most powerful, and thus most useful to most users. The DTS test statistic builds on the Wasserstein distance by using a weighting scheme like that of Anderson-Darling. See the companion paper at <arXiv:2007.01360> or <https://codowd.com/public/DTS.pdf> for details of that test statistic, and non-standard uses of the package (parallel for big N, weighted observations, one sample tests, etc). We also include the permutation scheme to make test building simple for others.
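The randomization idea can be sketched independently of any particular statistic. The Python example below is an illustration, not the package's implementation: it uses the Kolmogorov-Smirnov statistic rather than DTS, and the function names, the default of 999 permutations and the "+1" p-value correction are assumptions made for this example.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    pooled = sorted(set(sa) | set(sb))
    return max(
        abs(bisect_right(sa, t) / len(sa) - bisect_right(sb, t) / len(sb))
        for t in pooled
    )

def permutation_p_value(a, b, stat=ks_statistic, n_perm=999, seed=0):
    """P-value from reshuffling the pooled sample into two new groups."""
    rng = random.Random(seed)
    observed = stat(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Count permutations at least as extreme as the observed split.
        if stat(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself, so p is never zero.
    return (hits + 1) / (n_perm + 1)

low = [0.1 * i for i in range(10)]          # values 0.0 .. 0.9
high = [5.0 + 0.1 * i for i in range(10)]   # values 5.0 .. 5.9
print(permutation_p_value(low, high))
print(permutation_p_value(low, low))
```

Any other two-sample statistic can be passed through the `stat` argument, which mirrors the idea of reusing one permutation scheme across tests.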