Enhancing T cell receptor (TCR) sequence analysis, 'ClusTCR2', based on the ClusTCR Python program, leverages Hamming distance to compare complementarity-determining region 3 (CDR3) sequences for sequence similarity, stratified by variable (V) gene and sequence length. The second step employs the Markov Cluster Algorithm to identify clusters within an undirected graph, providing a summary of amino acid motifs and a matrix for generating network plots. Tailored for single-cell RNA-seq data with integrated TCR-seq information, ClusTCR2 is integrated into the Single Cell TCR and Expression Grouped Ontologies (STEGO) R application, or 'STEGO.R'. See the two publications for more details: Sebastiaan Valkiers, Max Van Houcke, Kris Laukens, Pieter Meysman (2021) <doi:10.1093/bioinformatics/btab446>; Kerry A. Mullan, My Ha, Sebastiaan Valkiers, Nicky de Vrij, Benson Ogunjimi, Kris Laukens, Pieter Meysman (2023) <doi:10.1101/2023.09.27.559702>.
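A minimal base R sketch of the comparison ClusTCR2 builds on (hamming() is a hypothetical helper for illustration, not a ClusTCR2 function):

    # Hamming distance is defined only for equal-length strings, which is why
    # CDR3 sequences are first stratified by length (and V gene).
    hamming <- function(a, b) {
      stopifnot(nchar(a) == nchar(b))
      sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
    }
    hamming("CASSLGQAYEQYF", "CASSLGQSYEQYF")  # 1 mismatch -> candidate graph edge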
This package provides a two-stage procedure for the denoising and clustering of a stack of noisy images acquired over time. Clustering only assumes that the data contain an unknown but small number of dynamic features. The method first denoises the signals using local spatial and full temporal information. The clustering step then uses the denoised output to aggregate voxels based on knowledge of their spatial neighborhood. Both steps rely on a single key tool: the statistical comparison of the difference of two signals with the null signal. No assumption is therefore required on the shape of the signals. The data are assumed to be normally distributed (or at least to follow a symmetric distribution) with a known constant variance. Working pixelwise, the method can be time-consuming depending on the size of the data array, but it harnesses the power of multicore CPUs.
The new (dQTG.seq1 and dQTG.seq2) and existing (SmoothLOD, G', deltaSNP and ED) bulked segregant analysis methods are used to identify various types of quantitative trait loci for complex traits via extreme-phenotype individuals in bi-parental segregating populations (F2, backcross, doubled haploid and recombinant inbred line). The existing methods use the numbers of marker alleles in the extreme low and high pools to identify trait-related genes, whereas the new methods use both the numbers of marker alleles and the genotypes in the extreme low and high pools to construct a new statistic, Gw, for identifying trait-related genes. dQTG.seq2 can identify extremely over-dominant and small-effect genes in F2 populations. Li P, Li G, Zhang YW, Zuo JF, Liu JY, Zhang YM (2022) <doi:10.1016/j.xplc.2022.100319>.
This package provides a protocol that facilitates the processing and analysis of Hydrogen-Deuterium Exchange Mass Spectrometry data using p-value statistics and Critical Interval analysis. It provides a pipeline for analyzing data from HDXExaminer (Sierra Analytics, Trajan Scientific), automating the matching and comparison of protein states through Welch's t-test and the Critical Interval statistical framework. Additionally, it simplifies data export, generates PyMol scripts, and ensures calculations meet publication standards. HDXBoxeR assists in various aspects of hydrogen-deuterium exchange data analysis, including reprocessing data, calculating parameters, identifying significant peptides, generating plots, and facilitating comparison between protein states. For details, see Hageman and Weis (2019) <doi:10.1021/acs.analchem.9b01325> and Masson et al. (2019) <doi:10.1038/s41592-019-0459-y>. HDXBoxeR citation: Janowska et al. (2024) <doi:10.1093/bioinformatics/btae479>.
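The comparison statistic at the core of the pipeline is Welch's t-test, available in base R; a toy illustration with hypothetical uptake values (HDXBoxeR automates this across peptides and protein states):

    state_A <- c(35.1, 34.8, 35.6)  # hypothetical % deuterium uptake, replicate runs
    state_B <- c(38.2, 38.9, 38.4)
    t.test(state_A, state_B, var.equal = FALSE)  # Welch's t-test (unequal variances)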
This package provides fast generalized edit distance and string alignment computation, mainly for linguistic purposes. As a generalization of the classic edit distance algorithms, the package allows users to define a custom cost for every symbol's insertion, deletion, and substitution. It also allows character combinations of any length to be treated as a single symbol, which is very useful for International Phonetic Alphabet (IPA) transcriptions with diacritics. In addition to the edit distance itself, users can obtain detailed alignment information, such as all possible alignment scenarios between two strings, which is useful for testing, illustration, or further processing. Either the distance matrix or its long-table form can be obtained, and tools for such conversions are provided. All functions in the package are implemented in C++, and the distance matrix computation is parallelized using the RcppThread package.
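For orientation, base R's adist() computes the classic, uniform-cost edit distance that this package generalizes with per-symbol costs and multi-character symbols:

    adist("kitten", "sitting")  # 3: two substitutions and one insertion
    adist("kitten", "sitting",
          costs = list(insertions = 2, deletions = 2, substitutions = 1))  # 4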
This package provides a set of tools to facilitate data sonification and handle the musicXML format <https://usermanuals.musicxml.com/MusicXML/Content/XS-MusicXML.htm>. Several classes are defined for basic musical objects such as note pitch, note duration, note, measure and score. Moreover, sonification utility functions are provided, e.g. to map data into musical attributes such as pitch, loudness or duration. A typical sonification workflow hence looks like: get data; map them to musical attributes; create and write the musicXML score, which can then be further processed using specialized music software (e.g. 'MuseScore', 'GuitarPro', etc.). Examples can be found in the blog <https://globxblog.github.io/>, the presentation by Renard and Le Bescond (2022) <https://hal.science/hal-03710340v1> or the poster by Renard et al. (2023) <https://hal.inrae.fr/hal-04388845v1>.
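A generic sketch of the "map data to musical attributes" step (map_to_pitch() is a hypothetical helper, not necessarily the package's own interface): rescale a data series onto MIDI note numbers.

    map_to_pitch <- function(x, low = 60, high = 84) {  # C4 (60) to C6 (84)
      round(low + (x - min(x)) / (max(x) - min(x)) * (high - low))
    }
    map_to_pitch(c(2.1, 3.7, 9.4, 5.5))  # 60 65 84 71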
Reported properties of silver nanoparticles (AgNPs) vary across studies because of differences in the characterization techniques and testing metrics employed. To address this problem, we have developed a systematic evaluation framework called 'sysAgNPs'. Within this framework, Distribution Entropy (DE) measures the uncertainty of feature categories of AgNPs, Proclivity Entropy (PE) assesses the preference of these categories, and Combination Entropy (CE) quantifies the uncertainty of feature combinations of AgNPs. Additionally, a Markov chain model is employed to examine the relationships among the sub-features of AgNPs and to determine a Transition Score (TS) scoring standard based on steady-state probabilities. The sysAgNPs framework provides metrics for evaluating AgNPs, which helps to unravel their complexity and enables effective comparisons among different AgNPs, thereby advancing the scientific research and application of AgNPs.
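A hedged sketch of the underlying entropy idea (plain Shannon entropy over feature-category frequencies; the package's own DE/PE/CE functions are not reproduced here):

    shannon_entropy <- function(counts) {
      p <- counts / sum(counts)
      -sum(p[p > 0] * log2(p[p > 0]))
    }
    shannon_entropy(c(sphere = 40, rod = 35, cube = 25))  # mixed categories: high entropy
    shannon_entropy(c(sphere = 98, rod = 1, cube = 1))    # one dominant category: low entropy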
MapScape integrates clonal prevalence, clonal hierarchy, anatomic and mutational information to provide interactive visualization of spatial clonal evolution. There are four inputs to MapScape: (i) the clonal phylogeny, (ii) clonal prevalences, (iii) an image reference, which may be a medical image or drawing and (iv) pixel locations for each sample on the referenced image. Optionally, MapScape can accept a data table of mutations for each clone and their variant allele frequencies in each sample. The output of MapScape consists of a cropped anatomical image surrounded by two representations of each tumour sample. The first, a cellular aggregate, visually displays the prevalence of each clone. The second shows a skeleton of the clonal phylogeny while highlighting only those clones present in the sample. Together, these representations enable the analyst to visualize the distribution of clones throughout anatomic space.
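A minimal call sketch mirroring the four inputs listed above; the data frames are hypothetical placeholders, and argument details should be checked against the package documentation:

    library(mapscape)
    mapscape(clonal_prev      = clonal_prev,       # sample_id, clone_id, clonal_prev
             tree_edges       = tree_edges,        # source, target (clonal phylogeny)
             sample_locations = sample_locations,  # sample_id, x, y pixel locations
             img_ref          = "anatomy.png")     # medical image or drawing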
SCAN is a microarray normalization method to facilitate personalized-medicine workflows. Rather than processing microarray samples as groups, which can introduce biases and present logistical challenges, SCAN normalizes each sample individually by modeling and removing probe- and array-specific background noise using only data from within each array. SCAN can be applied to one-channel (e.g., Affymetrix) or two-channel (e.g., Agilent) microarrays. The Universal exPression Codes (UPC) method is an extension of SCAN that estimates whether a given gene/transcript is active above background levels in a given sample. The UPC method can be applied to one-channel or two-channel microarrays as well as to RNA-Seq read counts. Because UPC values are represented on the same scale and have an identical interpretation for each platform, they can be used for cross-platform data integration.
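A hedged sketch of typical single-sample usage, following the workflow described above (file paths are placeholders):

    library(SCAN.UPC)
    normalized <- SCAN("celFiles/*.CEL")  # per-array background modeling and removal
    activity   <- UPC("celFiles/*.CEL")   # probability each gene is active above background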
This package implements Lagrangian multiplier smoothing splines for flexible nonparametric regression and function estimation. It provides tools for fitting, prediction, and inference using a constrained optimization approach to enforce smoothness. It supports generalized linear models, Weibull accelerated failure time (AFT) models, quadratic programming problems, and customizable correlation structures, and offers options for fitting in parallel. The method builds upon the framework described by Ezhov et al. (2018) <doi:10.1515/jag-2017-0029>, using Lagrangian multipliers to fit cubic splines. For more information on correlation structure estimation, see Searle et al. (2009) <ISBN:978-0470009598>. For quadratic programming and constrained optimization in general, see Nocedal & Wright (2006) <doi:10.1007/978-0-387-40065-5>. For a comprehensive background on smoothing splines, see Wahba (1990) <doi:10.1137/1.9781611970128> and Wood (2006) <ISBN:978-1584884743>, "Generalized Additive Models: An Introduction with R".
This package provides a set of functions that use the Expectation-Maximisation (EM) algorithm (Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) <doi:10.1111/j.2517-6161.1977.tb01600.x>, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, 39(1), 1-22) to take a finite mixture model approach to clustering. The package is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. The method is described in Hunt, L. and Jorgensen, M. (1999) <doi:10.1111/1467-842X.00071>, Australian & New Zealand Journal of Statistics, 41(2), 153-171, and Hunt, L. and Jorgensen, M. (2003) <doi:10.1016/S0167-9473(02)00190-1>, "Mixture model clustering for mixed data with missing information", Computational Statistics & Data Analysis, 41(3-4), 429-440.
Dichotomous responses having two categories can be analyzed with stats::glm() or lme4::glmer() using the family=binomial option. Unfortunately, polytomous responses with three or more unordered categories cannot be analyzed similarly because there is no analogous family=multinomial option. For between-subjects data, nnet::multinom() can address this need, but it cannot handle random factors and therefore cannot handle repeated measures. To address this gap, we transform nominal response data into counts for each categorical alternative. These counts are then analyzed using (mixed) Poisson regression as per Baker (1994) <doi:10.2307/2348134>. Omnibus analyses of variance can be run along with post hoc pairwise comparisons. For users wishing to analyze nominal responses from surveys or experiments, the functions in this package essentially act as though stats::glm() or lme4::glmer() provide a family=multinomial option.
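A base R sketch of the Baker (1994) transformation the package builds on: each nominal response is expanded into one row per candidate category with a 0/1 count, and the counts are then modeled with Poisson regression (data here are hypothetical):

    trials <- data.frame(subject  = c(1, 1, 2, 2),
                         response = c("A", "C", "B", "A"))
    cats   <- c("A", "B", "C")
    counts <- do.call(rbind, lapply(seq_len(nrow(trials)), function(i)
      data.frame(subject  = trials$subject[i],
                 category = cats,
                 count    = as.integer(trials$response[i] == cats))))
    m <- glm(count ~ category, data = counts, family = poisson)  # add predictors as needed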
Simulation of the stochastic 3D structure model for the nanoporous binder-conductive additive phase in battery cathodes introduced in P. Gräfensteiner, M. Osenberg, A. Hilger, N. Bohn, J. R. Binder, I. Manke, V. Schmidt, M. Neumann (2024) <doi:10.48550/arXiv.2409.11080>. The model is developed for a binder-conductive additive phase consisting of carbon black, polyvinylidene difluoride binder and graphite particles. For its stochastic 3D modeling, a three-step procedure based on methods from stochastic geometry is used. First, the graphite particles are described by a Boolean model with ellipsoidal grains. Second, the mixture of carbon black and binder is modeled by an excursion set of a Gaussian random field in the complement of the graphite particles. Third, large pore regions within the mixture of carbon black and binder are described by a Boolean model with spherical grains.
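A toy base R illustration of the second step (an excursion set of a Gaussian random field), not the package's own code: smooth white noise to induce spatial correlation, then threshold it.

    set.seed(1)
    n <- 128
    z <- matrix(rnorm(n * n), n, n)
    k <- rep(1 / 9, 9)  # separable moving-average kernel
    z <- apply(z, 2, stats::filter, filter = k, circular = TRUE)     # smooth columns
    z <- t(apply(z, 1, stats::filter, filter = k, circular = TRUE))  # smooth rows
    excursion <- (z > quantile(z, 0.7)) * 1  # 1 = carbon black/binder mixture phase
    image(excursion, asp = 1)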
Linear cross-section factor model fitting with least squares and with robust fitting via the lmrobdetMM() function from 'RobStatTM'; related volatility, Value at Risk and Expected Shortfall risk and performance attribution (factor-contributed vs idiosyncratic returns); tabular displays of risk and performance reports; and factor model Monte Carlo. The package authors would like to thank the Center for Research in Security Prices, LLC (CRSP) for the cross-section of about 300 CRSP stocks (in the data.table object 'stocksCRSP'), and S&P Global Market Intelligence for contributing 14 factor scores (a.k.a. "alpha factors" and "factor exposures") of fundamental data on the 300 companies in the data.table object 'factorsSPGMI'. The stocksCRSP and factorsSPGMI data are not covered by the GPL-2 license, are not provided as open source of any kind, and are not to be redistributed in any form.
Bayesian regularized quantile regression utilizing sparse priors to promote exact sparsity leads to efficient Bayesian shrinkage estimation, variable selection and statistical inference. In this package, we have implemented robust Bayesian variable selection with spike-and-slab priors under high-dimensional linear regression models (Fan et al. (2024) <doi:10.3390/e26090794> and Ren et al. (2023) <doi:10.1111/biom.13670>), and regularized quantile varying coefficient models (Zhou et al. (2023) <doi:10.1016/j.csda.2023.107808>). In particular, both models support valid robust Bayesian inference in the presence of heavy-tailed errors, which can be verified on finite samples. Additional models with spike-and-slab priors include robust Bayesian group LASSO and robust binary Bayesian LASSO (Fan and Wu (2025) <doi:10.1002/sta4.70078>). The Markov chain Monte Carlo (MCMC) algorithms of the proposed and alternative models are implemented in C++.
The textrank algorithm is an extension of the PageRank algorithm for text. It summarizes text by calculating how sentences are related to one another, which is done by looking at overlapping terminology between sentences in order to set up links between them. The resulting sentence network is then plugged into the PageRank algorithm, which identifies the most important sentences in your text and ranks them. In a similar way, textrank can also be used to extract keywords: a word network is constructed by checking whether words follow one another, the PageRank algorithm is applied on top of that network to extract relevant words, and relevant words that follow one another are combined into keywords. More information can be found in the paper by Mihalcea, Rada & Tarau, Paul (2004) <https://www.aclweb.org/anthology/W04-3252/>.
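A hedged usage sketch: textrank_sentences() takes one data frame of sentences and one of the terms they contain, joined by a shared id column (column names as in the package vignette; verify against the documentation):

    library(textrank)
    sentences   <- data.frame(textrank_id = 1:2,
                              sentence = c("Cats sleep a lot.", "Cats and dogs sleep."))
    terminology <- data.frame(textrank_id = c(1, 1, 2, 2, 2),
                              lemma = c("cat", "sleep", "cat", "dog", "sleep"))
    tr <- textrank_sentences(data = sentences, terminology = terminology)
    summary(tr, n = 1)  # top-ranked sentence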
This package provides convenience functions for common data modification and analysis tasks in communication research. This includes functions for univariate and bivariate data analysis, index generation and reliability computation, and intercoder reliability tests. All functions follow the style and syntax of the tidyverse and are constructed to perform their computations on multiple variables at once. Functions for univariate and bivariate data analysis comprise summary statistics for continuous and categorical variables, as well as several tests of bivariate association including effect sizes. Functions for data modification comprise index generation and automated reliability analysis of index variables. Functions for intercoder reliability comprise tests of several intercoder reliability estimates, including simple and mean pairwise percent agreement, Krippendorff's Alpha (Krippendorff 2004, ISBN: 9780761915454), and various Kappa coefficients (Brennan & Prediger 1981 <doi:10.1177/001316448104100307>; Cohen 1960 <doi:10.1177/001316446002000104>; Fleiss 1971 <doi:10.1037/h0031619>).
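A hedged sketch of the tidyverse-style interface, using the example datasets (WoJ, fbposts) bundled with the package:

    library(tidycomm)
    WoJ |> describe(autonomy_selection, autonomy_emphasis)             # univariate summaries
    WoJ |> add_index(autonomy, autonomy_selection, autonomy_emphasis)  # build an index variable
    fbposts |> test_icr(post_id, coder_id, pop_elite, pop_othering)    # intercoder reliability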
An R frontend for the WhiteboxTools library, which is an advanced geospatial data analysis platform developed by Prof. John Lindsay at the University of Guelph's Geomorphometry and Hydrogeomatics Research Group. WhiteboxTools can be used to perform common geographical information systems (GIS) analysis operations, such as cost-distance analysis, distance buffering, and raster reclassification. Remote sensing and image processing tasks include image enhancement (e.g. panchromatic sharpening, contrast adjustments), image mosaicing, numerous filtering operations, simple classification (k-means), and common image transformations. WhiteboxTools also contains advanced tooling for spatial hydrological analysis (e.g. flow-accumulation, watershed delineation, stream network analysis, sink removal), terrain analysis (e.g. common terrain indices such as slope, curvatures, wetness index, hillshading; hypsometric analysis; multi-scale topographic position analysis), and LiDAR data processing. Suggested citation: Lindsay (2016) <doi:10.1016/j.cageo.2016.07.003>.
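A hedged sketch of a typical call chain; WhiteboxTools is file-based, with each tool reading and writing rasters on disk (file names here are placeholders):

    library(whitebox)
    wbt_breach_depressions(dem = "dem.tif", output = "dem_breached.tif")      # sink removal
    wbt_d8_flow_accumulation(input = "dem_breached.tif", output = "flow.tif") # hydrology
    wbt_slope(dem = "dem.tif", output = "slope.tif")                          # terrain index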
In a typical microarray setting with gene expression data observed under two conditions, the local false discovery rate describes the probability that a gene is not differentially expressed between the two conditions given its corresponding observed score or p-value level. The resulting curve of p-values versus local false discovery rate offers an insight into the twilight zone between clear differential and clear non-differential gene expression. Package twilight contains two main functions: twilight.pval performs a two-condition test on differences in means for a given input matrix or expression set and computes permutation-based p-values, and twilight performs a stochastic downhill search to estimate local false discovery rates and effect size distributions. The package further provides means to filter for permutations that describe the null distribution correctly. Using filtered permutations, the influence of hidden confounders can be diminished.
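A hedged sketch of the two-step workflow named above (expr_matrix is a placeholder expression matrix; argument details may differ, see the package vignette):

    library(twilight)
    pval <- twilight.pval(expr_matrix, c(rep(1, 10), rep(2, 10)))  # two-condition test
    fdr  <- twilight(pval)  # stochastic downhill search for local FDR
    plot(fdr)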
Implementation of No-Effect-Concentration estimation that uses brms (see Burkner (2017) <doi:10.18637/jss.v080.i01>; Burkner (2018) <doi:10.32614/RJ-2018-017>; Carpenter et al. (2017) <doi:10.18637/jss.v076.i01>) to fit concentration (dose)-response data using Bayesian methods for the purpose of estimating ECx values, but more particularly NEC (see Fox (2010) <doi:10.1016/j.ecoenv.2009.09.012>), NSEC (see Fisher and Fox (2023) <doi:10.1002/etc.5610>), and N(S)EC (see Fisher et al. (2023) <doi:10.1002/ieam.4809>). A full description of this package can be found in Fisher et al. (2024) <doi:10.18637/jss.v110.i05>. This package expands and supersedes an original version implemented in R2jags (see Su and Yajima (2020) <https://CRAN.R-project.org/package=R2jags>; Fisher et al. (2020) <doi:10.5281/ZENODO.3966864>).
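A hedged sketch of the formula interface described in Fisher et al. (2024) (dat is a hypothetical data frame of concentration-response observations):

    library(bayesnec)
    fit <- bnec(response ~ crf(concentration, model = "nec3param"), data = dat)
    summary(fit)   # includes the NEC estimate among the model parameters
    autoplot(fit)  # fitted curve with the no-effect concentration marked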
Computes confidence intervals for the positive predictive value (PPV) and negative predictive value (NPV) under a variety of scenarios. In situations where the proportion of diseased subjects does not correspond to the disease prevalence (e.g. case-control studies), this package provides two types of solutions: 1) five methods for estimating confidence intervals for PPV and NPV via the ratio of two binomial proportions, including Gart & Nam (1988), Walter (1975), MOVER-J (Laud, 2017), Fieller (1954), and Bootstrap (Efron, 1979); 2) three direct methods that compute the confidence intervals, including Pepe (2003), Zhou (2007), and Delta. In prospective studies where the proportion of diseased subjects is an unbiased estimate of the disease prevalence, this package provides several methods for calculating the confidence intervals for PPV and NPV, including Clopper-Pearson, Wald, Wilson, Agresti-Coull, and Beta. See the Details and References sections of the corresponding functions.
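For orientation, the quantities being interval-estimated follow from Bayes' theorem; a worked base R example of the point estimates:

    sens <- 0.90; spec <- 0.95; prev <- 0.10
    ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
    npv <- (spec * (1 - prev)) / ((1 - sens) * prev + spec * (1 - prev))
    c(PPV = ppv, NPV = npv)  # PPV = 0.667, NPV = 0.988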
Illustrates graphically the most common Null Hypothesis Significance Testing procedures. More specifically, this package provides functions to plot Chi-squared, F, t (one- and two-tailed) and z (one- and two-tailed) tests, by plotting the probability density under the null hypothesis as a function of the test statistic value. Although highly flexible (color theme, fonts, etc.), only a minimal number of arguments (observed test statistic, degrees of freedom) is necessary for a clear and useful graph to be plotted, showing the observed test statistic and the p value together with their corresponding value labels. The axes are automatically scaled to present the relevant part and the overall shape of the probability density function. This package is especially intended for education purposes, as it provides helpful support for explaining the Null Hypothesis Significance Testing process, its use and/or its shortcomings.
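A generic base R sketch of what such a plot shows (the package adds the labels, scaling, and theming described above): the t density under the null, with the observed statistic and the upper-tail p value shaded.

    t_obs <- 2.3; df <- 18
    x <- seq(-4, 4, length.out = 400)
    plot(x, dt(x, df), type = "l", xlab = "t", ylab = "density")
    xs <- seq(t_obs, 4, length.out = 100)
    polygon(c(t_obs, xs, 4), c(0, dt(xs, df), 0), col = "grey")  # p-value region
    abline(v = t_obs, lty = 2)  # observed statistic; p = pt(t_obs, df, lower.tail = FALSE)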
This package is designed to interactively and reproducibly visualize and filter SNP (single-nucleotide polymorphism) datasets. This R-based implementation of SNP and genotype filters facilitates an interactive and iterative SNP filtering pipeline, which can be documented reproducibly via 'rmarkdown'. SNPfiltR contains functions for visualizing various quality and missing-data metrics for a SNP dataset, and then filtering the dataset based on user-specified cutoffs. All functions take vcfR objects as input, which can easily be generated by reading standard vcf (variant call format) files into R using the R package vcfR, authored by Knaus and Grünwald (2017) <doi:10.1111/1755-0998.12549>. Each SNPfiltR function can return a newly filtered vcfR object, which can then be written to a local directory in standard vcf format using the vcfR package, for downstream population genetic and phylogenetic analyses.
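A hedged sketch of a minimal filtering pass (function and argument names per the package documentation; file names and cutoffs are arbitrary examples):

    library(SNPfiltR)
    library(vcfR)
    vcf <- read.vcfR("raw_snps.vcf")
    vcf <- hard_filter(vcfR = vcf, depth = 5, gq = 30)  # genotype-level quality filter
    missing_by_snp(vcf)                                 # visualize per-SNP missingness
    vcf <- missing_by_snp(vcf, cutoff = 0.8)            # filter at a chosen completeness cutoff
    write.vcf(vcf, "filtered_snps.vcf.gz")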
LOBSTAHS is a multifunction package for screening, annotation, and putative identification of mass spectral features in large, HPLC-MS lipid datasets. In silico data for a wide range of lipids, oxidized lipids, and oxylipins can be generated from user-supplied structural criteria with a database generation function. LOBSTAHS then applies these databases to assign putative compound identities to features in any high-mass accuracy dataset that has been processed using xcms and CAMERA. Users can then apply a series of orthogonal screening criteria based on adduct ion formation patterns, chromatographic retention time, and other properties, to evaluate and assign confidence scores to this list of preliminary assignments. During the screening routine, LOBSTAHS rejects assignments that do not meet the specified criteria, identifies potential isomers and isobars, and assigns a variety of annotation codes to assist the user in evaluating the accuracy of each assignment.
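A hedged sketch of the screening workflow, after peak picking with xcms and adduct/isotope annotation with CAMERA (xsA denotes the resulting xsAnnotate object; check argument details against the package documentation):

    library(LOBSTAHS)
    db     <- generateLOBdbase(polarity = "positive")  # in silico database from default criteria
    screen <- doLOBscreen(xsA, polarity = "positive", database = db)
    peaks  <- getLOBpeaklist(screen)                   # putative assignments with annotation codes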