Statistical and computational method to analyze the co-expression of gene pairs at the single-cell level. It provides the foundation for single-cell gene interactome analysis. The basic idea is to study the distribution of zero UMI counts instead of focusing on positive counts, within a generalized contingency-table framework. COTAN can effectively assess the correlated or anti-correlated expression of gene pairs, providing a numerical index related to the correlation and an approximate p-value for the associated independence test. COTAN can also evaluate whether single genes are differentially expressed, scoring them with a newly defined global differentiation index. Moreover, this approach provides ways to plot and cluster genes according to their co-expression pattern with other genes, effectively helping the study of gene interactions and serving as a new tool to identify cell-identity marker genes.
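The underlying idea can be illustrated in base R: build a 2x2 contingency table of zero versus non-zero UMI counts for a gene pair and test it for independence. This is a minimal sketch of the concept on simulated data, not COTAN's actual coex index or its asymptotics:

```r
set.seed(1)
n_cells <- 1000
on <- rbinom(n_cells, 1, 0.4)               # shared latent "on" state
g1 <- rpois(n_cells, ifelse(on, 2, 0.1))    # toy UMI counts, gene 1
g2 <- rpois(n_cells, ifelse(on, 3, 0.1))    # toy UMI counts, gene 2

tab <- table(zero1 = g1 == 0, zero2 = g2 == 0)   # contingency table of zeros
res <- chisq.test(tab)
res$p.value                                       # approximate independence test
# an excess of joint zeros over independence suggests correlated expression
sign(res$observed["TRUE", "TRUE"] - res$expected["TRUE", "TRUE"])
```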
Computes the Exposure-At-Default based on the standardized approach of CRR2 (SA-CCR). The simplified version of SA-CCR has been included, as well as the OEM methodology. Multiple trade types across all five major asset classes are supported, including the Other Exposure, and, given the inheritance-based structure of the application, the addition of further trade types is straightforward. The application returns a list of trees per counterparty and CSA after automatically separating the trades based on the counterparty, the CSAs, the hedging sets, the netting sets, and the risk factors. Basis and volatility transactions are also identified and treated in specific hedging sets, where the corresponding penalty factors are applied. All the examples appearing in the regulatory papers (both for the margined and the unmargined workflow) have been implemented, including the latest CRR2 developments.
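At its core, the unmargined SA-CCR aggregation reduces to EAD = 1.4 * (RC + PFE). A minimal sketch, taking the aggregate add-on as given and using a hypothetical helper name (the package's own tree-based workflow is far richer):

```r
# Unmargined SA-CCR skeleton (cf. CRR2): EAD = alpha * (RC + PFE), alpha = 1.4.
# addon_total stands in for the fully aggregated add-on across asset classes.
sa_ccr_ead <- function(V, C, addon_total) {
  rc <- max(V - C, 0)                           # replacement cost
  floor_ <- 0.05                                # 5% multiplier floor
  multiplier <- min(1, floor_ + (1 - floor_) *
                      exp((V - C) / (2 * (1 - floor_) * addon_total)))
  pfe <- multiplier * addon_total               # potential future exposure
  1.4 * (rc + pfe)
}
sa_ccr_ead(V = 80,  C = 0, addon_total = 120)   # in-the-money netting set
sa_ccr_ead(V = -50, C = 0, addon_total = 120)   # multiplier dampens the PFE
```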
Analyze the co-adaptation of codon usage between a virus and its host, and calculate various codon usage bias measures, such as: effective number of codons (ENc), Novembre (2002) <doi:10.1093/oxfordjournals.molbev.a004201>; codon adaptation index (CAI), Sharp and Li (1987) <doi:10.1093/nar/15.3.1281>; relative codon deoptimization index (RCDI), Puigbò et al. (2010) <doi:10.1186/1756-0500-3-87>; similarity index (SiD), Zhou et al. (2013) <doi:10.1371/journal.pone.0077239>; synonymous codon usage orderliness (SCUO), Wan et al. (2004) <doi:10.1186/1471-2148-4-19>; and relative synonymous codon usage (RSCU), Sharp et al. (1986) <doi:10.1093/nar/14.13.5125>. It also provides statistical tests of dinucleotide over- and under-representation with three different models, and implements several visualization methods for codon usage, such as ENc.GC3plot() and PR2.plot().
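To make one of these measures concrete, RSCU for a codon is its observed count divided by the average count among its synonymous codons. A minimal base-R sketch for a single amino acid family with toy counts:

```r
# RSCU for the six Leucine codons: count / mean count among synonyms.
counts <- c(TTA = 10, TTG = 30, CTT = 20, CTC = 25, CTA = 5, CTG = 60)
rscu <- counts / mean(counts)
round(rscu, 2)   # values > 1 indicate preferred codons, < 1 avoided ones
```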
Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as “features”) whose expression patterns will then be used for downstream clustering. A good feature set should include the genes that distinguish different cell types, and its quality can have a significant impact on clustering accuracy. FEAST is an R library for selecting the most representative features before performing the core of scRNA-seq clustering. It can be used as a plug-in for established clustering algorithms such as SC3, TSCAN, SHARP, SIMLR, and Seurat. The core of the FEAST algorithm includes three steps: 1. consensus clustering; 2. gene-level significance inference; 3. validation of an optimized feature set.
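The gene-level significance step can be illustrated with a per-gene F statistic across inferred clusters. A minimal sketch on simulated data, with a single k-means run standing in for consensus clustering:

```r
set.seed(7)
# toy expression matrix: 100 genes x 60 cells, the first 20 genes separate groups
expr <- matrix(rnorm(100 * 60), 100, 60)
expr[1:20, 31:60] <- expr[1:20, 31:60] + 2
cl <- kmeans(t(expr), centers = 2, nstart = 10)$cluster  # consensus stand-in

# per-gene F statistic measuring separation across the inferred clusters
fstat <- apply(expr, 1, function(g)
  summary(aov(g ~ factor(cl)))[[1]]$`F value`[1])
head(order(fstat, decreasing = TRUE), 10)   # top-ranked candidate features
```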
The core function of this R package is to provide an implementation of the well-cited and well-reviewed QUBIC algorithm, aiming to deliver effective and efficient biclustering. The package also includes the following related functions: (i) a qualitative representation of the input gene expression data, via a well-designed discretization scheme that considers the underlying data properties, which can be used directly in other biclustering programs; (ii) visualization of identified biclusters using heatmaps in support of overall expression pattern analysis; (iii) bicluster-based co-expression network elucidation and visualization, where different correlation coefficient scores between a pair of genes are provided; and (iv) a generalized output format for biclusters and the corresponding networks, which can be freely downloaded so that a user can easily carry out follow-up comprehensive functional enrichment analysis (e.g., DAVID) and advanced network visualization (e.g., Cytoscape).
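The qualitative representation in (i) can be sketched as a rank-based discretization: per gene, the most extreme fractions of values become -1 or 1 and the rest 0. A minimal illustration in the spirit of QUBIC (the quantile fraction q is a toy choice here, not the package's tuned default behavior):

```r
# per-gene qualitative coding: bottom q fraction -> -1, top q fraction -> 1
discretize_gene <- function(x, q = 0.06) {
  lo <- quantile(x, q); hi <- quantile(x, 1 - q)
  ifelse(x <= lo, -1L, ifelse(x >= hi, 1L, 0L))
}
expr <- matrix(rnorm(20 * 10), 20, 10)       # genes x samples (toy)
disc <- t(apply(expr, 1, discretize_gene))
table(disc)                                   # mostly 0, sparse -1/1 signals
```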
Item response theory based methods are used to compute linking constants and conduct chain linking of unidimensional or multidimensional tests for multiple groups under a common item design. The unidimensional methods include the Mean/Mean, Mean/Sigma, Haebara, and Stocking-Lord methods for dichotomous (1PL, 2PL, and 3PL) and/or polytomous (graded response, partial credit/generalized partial credit, nominal, and multiple-choice model) items. The multidimensional methods include the least squares method and extensions of the Haebara and Stocking-Lord methods using single or multiple dilation parameters for multidimensional extensions of all the unidimensional dichotomous and polytomous item response models. The package also includes functions for importing item and/or ability parameters from common IRT software, conducting IRT true score and observed score equating, and plotting item response curves/surfaces, vector plots, information plots, and comparison plots for examining parameter drift.
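For intuition, the Mean/Sigma method derives the linking constants A and B from the common items' difficulty parameters, placing form X on the scale of form Y via b* = A*b + B. A minimal sketch with toy difficulties:

```r
# Mean/Sigma linking constants from common-item difficulties
bX <- c(-1.2, -0.4, 0.3, 0.9, 1.6)   # common items, form X scale
bY <- c(-0.9, -0.1, 0.7, 1.2, 2.0)   # same items, form Y scale
A <- sd(bY) / sd(bX)
B <- mean(bY) - A * mean(bX)
c(A = A, B = B)                       # transform X parameters by A * b + B
```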
We analyzed the nucleotide composition of genes with a special emphasis on the stability of DNA sequences. In addition, a variety of organisms show unequal use of synonymous codons, or codon usage bias, which also varies among genes within the same genome. Codon usage bias is shaped by both selective constraints and mutation bias, which allows us to examine and detect changes in these two evolutionary forces between genomes or along a single genome. Therefore, we determined the codon adaptation index (CAI) and effective number of codons (ENC), performed codon usage analysis with calculation of the relative synonymous codon usage (RSCU), and subsequently predicted translation efficiency and accuracy through GC-rich codon usage. Furthermore, we estimated the relative stability of the DNA sequence by calculating the average free energy (Delta G) and the dimer base-stacking energy level.
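Of these measures, the CAI has a particularly simple form: the geometric mean of each codon's relative adaptiveness w (its frequency divided by that of the most-used synonym in a reference set). A minimal sketch with toy w values:

```r
# CAI = geometric mean of relative adaptiveness values over a gene's codons
w <- c(0.9, 1.0, 0.4, 0.7, 1.0, 0.2)   # toy w values for one gene
cai <- exp(mean(log(w)))
cai                                     # closer to 1 = better host adaptation
```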
Symptomatic heterogeneity in complex diseases reveals differences in molecular states that need to be investigated. However, selecting the numerous parameters of an exploratory clustering analysis in RNA profiling studies requires a deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent, and further gene association analyses need to be performed independently. We have developed a suite of tools to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine-learning-based functions. The efficiency of each tool was tested with four datasets characterised by different expression signal strengths. Our toolkit's decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Even in datasets with less clear biological distinctions, stable subgroups with different expression profiles and clinical associations were found.
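One simple ingredient of this kind of automation is choosing the number of partitions by an internal quality index. A minimal flavor of the question using average silhouette width (the toolkit itself automates a much richer model selection):

```r
library(cluster)   # silhouette() ships with base R's recommended packages
set.seed(3)
x <- rbind(matrix(rnorm(100, 0), 50), matrix(rnorm(100, 4), 50))  # 2 groups
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(x, k, nstart = 20)$cluster
  mean(silhouette(cl, dist(x))[, "sil_width"])
})
which.max(avg_sil) + 1   # chosen number of stable partitions (here, 2)
```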
High-throughput omics data are often affected by systematic biases introduced throughout all the steps of a clinical study, from sample collection to quantification. Normalization methods aim to adjust for these biases to make the actual biological signal more prominent. However, selecting an appropriate normalization method is challenging due to the wide range of available approaches. Therefore, a comparative evaluation of unnormalized and normalized data is essential for identifying an appropriate normalization strategy for a specific data set. This R package provides functions for preprocessing and normalizing data and for evaluating different normalization approaches. Furthermore, normalization methods can be evaluated on downstream steps, such as differential expression analysis and statistical enrichment analysis. Spike-in data sets with known ground truth and real-world data sets from biological experiments acquired by either tandem mass tag (TMT) or label-free quantification (LFQ) can be analyzed.
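As one example of the kind of method such a comparison would cover, median normalization shifts each sample so that sample medians agree. A minimal sketch on simulated log-scale intensities (illustrative only, not this package's API):

```r
# median normalization: remove per-sample median offsets (log-scale data)
normalize_median <- function(mat) {
  med <- apply(mat, 2, median, na.rm = TRUE)
  sweep(mat, 2, med - mean(med), "-")
}
set.seed(1)
raw  <- matrix(rnorm(1000, 20, 2), 100, 10) +
        rep(rnorm(10, 0, 1), each = 100)    # per-sample systematic shifts
norm <- normalize_median(raw)
round(apply(norm, 2, median), 2)            # sample medians now aligned
```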
Fit data from a continuous population with a smooth density on a finite interval by an approximate Bernstein polynomial model, which is a mixture of certain beta distributions, and find the maximum approximate Bernstein likelihood estimator of the unknown coefficients. Consequently, maximum likelihood estimates of the unknown density, distribution function, and more can be obtained. If the support of the density is not the unit interval, a transformation can be applied. This is an implementation of the methods proposed by the author of this package published in the Journal of Nonparametric Statistics: Guan (2016) <doi:10.1080/10485252.2016.1163349> and Guan (2017) <doi:10.1080/10485252.2017.1374384>. For data with covariates, under some semiparametric regression models such as the Cox proportional hazards model and the accelerated failure time model, the baseline survival function can be estimated smoothly based on general interval-censored data.
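The model itself is easy to write down: a Bernstein polynomial density of degree m on [0,1] is a mixture of Beta(i+1, m-i+1) densities with weights p_i summing to 1. A minimal evaluation sketch with toy coefficients (estimation of p is the package's job):

```r
# Bernstein polynomial density: f(x) = sum_i p_i * dbeta(x, i+1, m-i+1)
dbern <- function(x, p) {
  m <- length(p) - 1
  rowSums(sapply(0:m, function(i) p[i + 1] * dbeta(x, i + 1, m - i + 1)))
}
p <- c(0.05, 0.1, 0.3, 0.35, 0.15, 0.05)   # toy coefficients, degree m = 5
curve(dbern(x, p), 0, 1, ylab = "density")
```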
Multiple 2 by 2 tables often arise in meta-analysis, which combines statistical evidence from multiple studies. Two risks within the same study are possibly correlated because they share common factors such as environment and population structure. This package implements a set of novel Bayesian approaches for multivariate meta-analysis when the risks within the same study are independent or correlated. Exact posterior inference of the odds ratio, relative risk, and risk difference given either a single 2 by 2 table or multiple 2 by 2 tables is provided. Luo, Chen, Su, Chu (2014) <doi:10.18637/jss.v056.i11>; Chen, Luo (2011) <doi:10.1002/sim.4248>; Chen, Chu, Luo, Nie, Chen (2015) <doi:10.1177/0962280211430889>; Chen, Luo, Chu, Su, Nie (2014) <doi:10.1080/03610926.2012.700379>; Chen, Luo, Chu, Wei (2013) <doi:10.1080/19466315.2013.791483>.
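To fix ideas, the simplest case is the posterior of the odds ratio for a single table with independent Beta priors on the two risks. A Monte Carlo sketch of that case (the package provides exact inference and handles correlated risks):

```r
# posterior of the odds ratio, one 2x2 table, independent Beta(0.5, 0.5) priors
set.seed(1)
x1 <- 15; n1 <- 100   # events / total, treatment arm
x2 <- 30; n2 <- 100   # events / total, control arm
p1 <- rbeta(1e5, x1 + 0.5, n1 - x1 + 0.5)
p2 <- rbeta(1e5, x2 + 0.5, n2 - x2 + 0.5)
or <- (p1 / (1 - p1)) / (p2 / (1 - p2))
quantile(or, c(0.025, 0.5, 0.975))   # posterior median and 95% interval
```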
This package contains a collection of functions for nonparametric measurement error problems using deconvolution kernel methods. We focus on two measurement error models in the package: (1) an additive measurement error model, where the goal is to estimate the density or distribution function from contaminated data; (2) a nonparametric regression model with errors-in-variables. The R functions allow the measurement errors to be either homoscedastic or heteroscedastic. To make the deconvolution estimators computationally more efficient in R, we adapt the Fast Fourier Transform (FFT) algorithm for density estimation with error-free data to the deconvolution kernel estimation. Several methods for the selection of the data-driven smoothing parameter are also provided in the package. See details in: Wang, X.F. and Wang, B. (2011). Deconvolution estimation in measurement error models: The R package decon. Journal of Statistical Software, 39(10), 1-24.
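The deconvolution density estimator is the Fourier inversion of the empirical characteristic function of the contaminated data, divided by the error's characteristic function and damped by a kernel. A minimal sketch for Laplace errors using direct numerical integration (the package uses the FFT for speed; the bandwidth here is a toy choice, not data-driven):

```r
set.seed(42)
n <- 500; s <- 0.4
U <- ifelse(runif(n) < 0.5, 1, -1) * rexp(n, rate = 1 / s)  # Laplace(0, s)
W <- rnorm(n) + U                       # contaminated observations of N(0,1)

h  <- 0.4                               # bandwidth (data-driven in the package)
tg <- seq(-1 / h, 1 / h, length.out = 201)            # grid where phi_K(h t) > 0
phiK <- (1 - (h * tg)^2)^3                             # kernel c.f. on [-1, 1]
phiU <- 1 / (1 + s^2 * tg^2)                           # Laplace error c.f.
phiW <- sapply(tg, function(u) mean(exp(1i * u * W)))  # empirical c.f. of W

xs <- seq(-4, 4, length.out = 200)
dt <- tg[2] - tg[1]
fhat <- sapply(xs, function(x)
  Re(sum(exp(-1i * tg * x) * phiW * phiK / phiU)) * dt / (2 * pi))
plot(xs, fhat, type = "l", ylab = "density")   # deconvolution estimate
curve(dnorm(x), add = TRUE, lty = 2)           # true density of X
```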
Models the relationship between dose levels and responses in a pharmacological experiment using the 4 Parameter Logistic (4PL) model. Traditional dose-response packages such as drc and nplr often raise errors due to convergence failure, especially when the data have outliers or non-logistic shapes. This package provides robust estimation methods that are less affected by outliers, together with alternative initialization methods that work well for data lacking a logistic shape. We provide bounds on the parameters of the 4PL model that prevent parameter estimates from diverging or converging to zero, and justify them with a statistical principle. These methods serve as remedies for convergence failure problems. Gadagkar, S. R. and Call, G. B. (2015) <doi:10.1016/j.vascn.2014.08.006>; Ritz, C., Baty, F., Streibig, J. C. and Gerhard, D. (2015) <doi:10.1371/journal.pone.0146021>.
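The 4PL mean function and the role of parameter bounds can be sketched in base R with a bounded nls() fit (parameterizations and the robust loss differ across packages; this is illustrative only):

```r
# 4PL: f(x) = d + (a - d) / (1 + (x / c)^b)
# a = response at zero dose, d = asymptote at high dose, c = EC50, b = slope
f4pl <- function(x, a, d, c, b) d + (a - d) / (1 + (x / c)^b)

set.seed(9)
dose <- rep(c(0.01, 0.1, 0.3, 1, 3, 10, 100), each = 3)
resp <- f4pl(dose, a = 100, d = 5, c = 1.5, b = 1.2) + rnorm(length(dose), 0, 2)

# box constraints keep c and b away from 0 and infinity, avoiding divergence
fit <- nls(resp ~ f4pl(dose, a, d, c, b),
           start = list(a = 90, d = 10, c = 1, b = 1),
           algorithm = "port",
           lower = c(0, 0, 1e-4, 0.1),
           upper = c(200, 100, 1e4, 10))
coef(fit)
```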
This univariate statistical quality control tool addresses measurement error effects when constructing exponentially weighted moving average (EWMA) p control charts. The method primarily targets binary random variables, but it can be applied to any continuous random variable by using a sign statistic to transform it into a discrete one. With the correction of measurement error effects, we can obtain corrected control limits of the EWMA p control chart and reasonably adjusted EWMA p control charts. The methods in this package can be found in the relevant references, such as Chen and Yang (2022) <arXiv:2203.03384>; Yang et al. (2011) <doi:10.1016/j.eswa.2010.11.044>; Yang and Arnold (2014) <doi:10.1155/2014/238719>; Yang (2016) <doi:10.1080/03610918.2013.763980>; and Yang and Arnold (2016) <doi:10.1080/00949655.2015.1125901>.
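For reference, the uncorrected EWMA p chart recursion and its asymptotic limits are straightforward. A minimal sketch (this deliberately omits the measurement-error correction that is the package's contribution):

```r
# Z_t = lambda * p_t + (1 - lambda) * Z_{t-1}, started at the target p0;
# limits: p0 +/- L * sqrt(lambda / (2 - lambda) * p0 * (1 - p0) / n)
lambda <- 0.2; L <- 3; p0 <- 0.1; n <- 50
set.seed(5)
p_hat <- rbinom(30, n, p0) / n          # 30 in-control samples of size n
z <- Reduce(function(zp, p) lambda * p + (1 - lambda) * zp,
            p_hat, init = p0, accumulate = TRUE)[-1]
half <- L * sqrt(lambda / (2 - lambda) * p0 * (1 - p0) / n)
ucl <- p0 + half; lcl <- max(0, p0 - half)
any(z > ucl | z < lcl)                  # any out-of-control signal?
```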
Quantitative trait loci (QTL) mapping and genome-wide association analysis are used to find candidate molecular markers or regions associated with a phenotype based on linkage analysis and linkage disequilibrium. Gene expression QTL mapping is used to find candidate molecular markers or regions associated with gene expression. In this package, we apply the methods of Liu W. (2011) <doi:10.1007/s00122-011-1631-7> and Gusev A. (2016) <doi:10.1038/ng.3506> to genome- and transcriptome-wide association studies, aiming to reveal the association between phenotype and molecular markers, expression levels, molecular markers nested within related expression effects, and expression effects nested within related molecular marker effects. F tests based on full and reduced models are performed to obtain p-values or likelihood ratio statistics. The best linear model can be obtained by stepwise regression analysis.
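The full-versus-reduced-model F test at a single marker can be sketched with base R's anova() on nested linear models (simulated data; the package scans this over the genome and richer nested effects):

```r
set.seed(11)
n <- 200
marker <- sample(0:2, n, replace = TRUE)      # genotype coded 0/1/2
covar  <- rnorm(n)
trait  <- 0.5 * marker + covar + rnorm(n)

full    <- lm(trait ~ covar + factor(marker)) # full model: marker + covariate
reduced <- lm(trait ~ covar)                  # reduced model: covariate only
anova(reduced, full)                          # F statistic and p-value
```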
Factor models have been widely applied in areas such as economics and finance, and the well-known heavy-tailedness of macroeconomic/financial data should be taken into account when conducting factor analysis. We propose two algorithms for robust factor analysis based on the Huber loss. One minimizes the Huber loss of the idiosyncratic error's L2 norm, which turns out to perform Principal Component Analysis (PCA) on a weighted sample covariance matrix and is hence named Huber PCA. The other minimizes the element-wise Huber loss and can be solved by an iterative Huber regression algorithm. The package also provides code for traditional PCA, the Robust Two Step (RTS) method by He et al. (2022), and the Quantile Factor Analysis (QFA) method by Chen et al. (2021) and He et al. (2023).
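The weighted-covariance idea can be sketched in one pass: downweight observations with large norms via a Huber-type weight, then eigendecompose the weighted covariance (the actual Huber PCA algorithm iterates; tau here is a toy choice):

```r
huber_weight <- function(r, tau) ifelse(abs(r) <= tau, 1, tau / abs(r))
set.seed(2)
X <- matrix(rt(200 * 5, df = 3), 200, 5)        # heavy-tailed data
Xc <- scale(X, scale = FALSE)                   # center columns
r <- sqrt(rowSums(Xc^2))                        # observation norms
w <- huber_weight(r, tau = median(r))           # downweight outlying rows
S <- crossprod(sqrt(w) * Xc) / sum(w)           # weighted sample covariance
eigen(S)$values                                  # robust factor spectrum
```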
This package provides a regression and classification algorithm based on random forests, which takes the form of a short list of rules. SIRUS combines the simplicity of decision trees with predictive accuracy close to that of random forests. The core aggregation principle of random forests is kept, but instead of aggregating predictions, SIRUS aggregates the forest structure: the most frequent nodes of the forest are selected to form a stable rule ensemble model. The algorithm is fully described in the following articles: Benard C., Biau G., da Veiga S., Scornet E. (2021), Electron. J. Statist., 15:427-505 <DOI:10.1214/20-EJS1792> for classification, and Benard C., Biau G., da Veiga S., Scornet E. (2021), AISTATS, PMLR 130:937-945 <http://proceedings.mlr.press/v130/benard21a> for regression. This R package is a fork of the ranger project (<https://github.com/imbs-hl/ranger>).
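The structure-aggregation idea can be approximated with rpart: grow shallow trees on bootstrap samples and tally recurring (variable, threshold) splits. A rough sketch only; SIRUS quantizes thresholds to empirical quantiles, for which signif() is a crude stand-in:

```r
library(rpart)
set.seed(4)
d <- data.frame(y = as.numeric(iris$Species == "versicolor"), iris[, 1:4])

# frequent splits across the bootstrap forest seed stable rules
splits <- replicate(100, {
  b <- d[sample(nrow(d), replace = TRUE), ]
  s <- rpart(y ~ ., data = b,
             control = rpart.control(maxdepth = 2, cp = 0,
                                     maxcompete = 0, maxsurrogate = 0))$splits
  if (is.null(s)) character(0)
  else paste(rownames(s), signif(s[, "index"], 2))
}, simplify = FALSE)
head(sort(table(unlist(splits)), decreasing = TRUE))  # most frequent nodes
```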
Surface Protein abundance Estimation using CKmeans-based clustered thresholding ('SPECK') is an unsupervised learning-based method that performs receptor abundance estimation for single cell RNA-sequencing data based on reduced rank reconstruction (RRR) and a clustered thresholding mechanism. Seurat's normalization method is described in: Hao et al. (2021) <doi:10.1016/j.cell.2021.04.048>, Stuart et al. (2019) <doi:10.1016/j.cell.2019.05.031>, Butler et al. (2018) <doi:10.1038/nbt.4096> and Satija et al. (2015) <doi:10.1038/nbt.3192>. The RRR method is further detailed in: Erichson et al. (2019) <doi:10.18637/jss.v089.i11> and Halko et al. (2009) <doi:10.48550/arXiv.0909.4061>. The clustering method is outlined in: Song et al. (2020) <doi:10.1093/bioinformatics/btaa613> and Wang et al. (2011) <doi:10.32614/RJ-2011-015>.
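The two ingredients can be sketched with base R: a truncated SVD for the reduced rank reconstruction, then a two-group split per gene to set an abundance threshold (base kmeans stands in for the Ckmeans.1d.dp clustering; toy data and rank):

```r
set.seed(8)
expr <- matrix(rpois(500 * 40, 3), 500, 40)     # cells x genes (toy counts)
s <- svd(scale(expr))
k <- 10                                          # retained rank
rrr <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # RRR of the matrix

g  <- rrr[, 1]                                   # one gene's smoothed values
cl <- kmeans(g, centers = 2, nstart = 10)$cluster
threshold <- max(g[cl == which.min(tapply(g, cl, mean))])
threshold   # values above this are called "abundant" for this gene
```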
iPath is a Bioconductor package for calculating personalized pathway scores and testing their association with survival outcomes. Abundant single-gene biomarkers have been identified and used in the clinic. However, hundreds of oncogenes or tumor-suppressor genes are involved in the process of tumorigenesis. We believe individual-level expression patterns of pre-defined pathways or gene sets are better biomarkers than single genes. In this study, we devised a computational method named iPath to identify prognostic biomarker pathways, one sample at a time. To test its utility, we conducted a pan-cancer analysis across 14 cancer types from The Cancer Genome Atlas and demonstrated that iPath is capable of identifying highly predictive biomarkers for clinical outcomes, including overall survival, tumor subtypes, and tumor stage classifications. We found that pathway-based biomarkers are more robust and effective than single genes.
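The overall workflow, score a pathway per sample and test the score against survival, can be sketched as follows. The naive mean z-score used here is only a stand-in for iPath's actual per-sample statistic:

```r
library(survival)
set.seed(6)
expr <- matrix(rnorm(200 * 80), 200, 80)        # genes x patients (toy)
pathway <- 1:20                                  # indices of a gene set

score <- colMeans(t(scale(t(expr)))[pathway, ]) # per-patient mean z-score
time   <- rexp(80, rate = exp(0.5 * score))     # survival tied to the score
status <- rbinom(80, 1, 0.8)                    # censoring indicator
coxph(Surv(time, status) ~ score)               # association with survival
```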
An update to the Joint Location-Scale (JLS) testing framework that identifies associated SNPs, gene-sets and pathways with main and/or interaction effects on quantitative traits (Soave et al., 2015; <doi:10.1016/j.ajhg.2015.05.015>). The JLS method simultaneously tests the null hypothesis of equal mean and equal variance across genotypes by aggregating association evidence from the individual location/mean-only and scale/variance-only tests using Fisher's method. The generalized joint location-scale (gJLS) framework has been developed to deal specifically with sample correlation and group uncertainty (Soave and Sun, 2017; <doi:10.1111/biom.12651>). The current release, gJLS2, includes additional functionality that enables analyses of X-chromosome genotype data through novel methods for location (Chen et al., 2021; <doi:10.1002/gepi.22422>) and scale (Deng et al., 2019; <doi:10.1002/gepi.22247>).
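The Fisher aggregation step is easy to sketch: combine a location p-value and a scale p-value into one chi-square statistic on 4 degrees of freedom. A base-R illustration (Bartlett's test stands in for the Levene-type scale test; Fisher's method assumes independent components, which is exactly the issue the gJLS work addresses):

```r
set.seed(10)
geno <- sample(0:2, 300, replace = TRUE)
y <- rnorm(300, mean = 0.3 * geno, sd = 1 + 0.2 * geno)  # mean & variance effects

p_loc   <- anova(lm(y ~ factor(geno)))$`Pr(>F)`[1]       # location test
p_scale <- bartlett.test(y, factor(geno))$p.value        # scale test (stand-in)
fisher  <- -2 * (log(p_loc) + log(p_scale))
pchisq(fisher, df = 4, lower.tail = FALSE)               # joint JLS p-value
```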
The calibrated population ratio estimator under a two-phase random sampling design has gained enormous popularity in recent times. This package provides functions for estimating the population ratio (calibrated) under a two-phase sampling design, including the approximate variance of the ratio estimator. The improved ratio estimator is applicable both when auxiliary data are available at the unit level and when they are available only at the aggregate level (e.g., mean or total) for the first-phase sample. Calibration weights are calculated for each unit of the second-phase sample. Single and combined inclusion probabilities are also estimated for both phases under two-phase simple random sampling without replacement (SRSWOR). The improved ratio estimator's percentage coefficient of variation is also provided as a measure of accuracy. This package has been developed based on the theoretical development of Islam et al. (2021) and Ozgul (2020) <doi:10.1080/00949655.2020.1844702>.
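The basic two-phase ratio idea can be sketched in a few lines: phase 1 measures only the cheap auxiliary variable; phase 2 subsamples it and measures the study variable; the ratio then rescales the phase-1 auxiliary mean (an uncalibrated sketch, without the package's calibration weights or variance formulas):

```r
set.seed(12)
x1  <- rgamma(1000, 5)                 # phase-1 sample: auxiliary x only
idx <- sample(1000, 150)               # phase-2 SRSWOR subsample
y2  <- 2 * x1[idx] + rnorm(150, 0, 0.5)  # study variable, phase 2 only

R_hat <- mean(y2) / mean(x1[idx])      # sample ratio from phase 2
ybar_ratio <- R_hat * mean(x1)         # two-phase ratio estimate of mean(y)
c(R = R_hat, ybar = ybar_ratio)
```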
At Novartis, we aimed at standardizing the set of diagnostic plots used for modeling activities in order to reduce the overall effort required for generating such plots. To this end, we developed a guidance that proposes an adequate set of diagnostics and a toolbox, called ggPMX, to execute them. ggPMX can generate all diagnostic plots at a quality sufficient for publication and submissions using a few lines of code. This package focuses on plots recommended by ISoP <doi:10.1002/psp4.12161>. While not required, you can get/install the R lixoftConnectors package from the Monolix installation, as described at the following url <https://monolixsuite.slp-software.com/r-functions/2024R1/installation-and-initialization>. When lixoftConnectors is available, R can use Monolix directly to create the required Chart Data instead of exporting it from the Monolix GUI.
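A "few lines of code" usage sketch, based on the example controller and plot-function names documented in the ggPMX vignette; treat the exact calls as assumptions if your version differs:

```r
library(ggPMX)
ctr <- theophylline()        # built-in demo controller shipped with ggPMX
pmx_plot_dv_pred(ctr)        # observations versus population predictions
pmx_plot_npde_time(ctr)      # NPDE versus time
```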
A modular and computationally efficient R package for parameterizing, simulating, and analyzing health economic simulation models. The package supports cohort discrete time state transition models (Briggs et al. 1998) <doi:10.2165/00019053-199813040-00003>, N-state partitioned survival models (Glasziou et al. 1990) <doi:10.1002/sim.4780091106>, and individual-level continuous time state transition models (Siebert et al. 2012) <doi:10.1016/j.jval.2012.06.014>, encompassing both Markov (time-homogeneous and time-inhomogeneous) and semi-Markov processes. Decision uncertainty from a cost-effectiveness analysis is quantified with standard graphical and tabular summaries of a probabilistic sensitivity analysis (Claxton et al. 2005, Barton et al. 2008) <doi:10.1002/hec.985>, <doi:10.1111/j.1524-4733.2008.00358.x>. Use of C++ and data.table makes individual-patient simulation, probabilistic sensitivity analysis, and incorporation of patient heterogeneity fast.
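The simplest of these models, a time-homogeneous cohort discrete-time state transition model, amounts to repeated multiplication of a state-occupancy vector by a transition matrix. A base-R concept sketch (toy probabilities, not this package's API):

```r
states <- c("Healthy", "Sick", "Dead")
P <- matrix(c(0.85, 0.10, 0.05,        # annual transition probabilities
              0.00, 0.80, 0.20,
              0.00, 0.00, 1.00),
            3, 3, byrow = TRUE, dimnames = list(states, states))

trace <- matrix(0, 11, 3, dimnames = list(0:10, states))
trace[1, ] <- c(1, 0, 0)               # cohort starts Healthy
for (t in 2:11) trace[t, ] <- trace[t - 1, ] %*% P
round(trace, 3)                        # state occupancy by cycle
```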
This package provides a collection of algorithms based on linear programming for estimating, under the homogeneity hypothesis, RxC ecological contingency tables (or vote transition matrices) using mainly aggregate data (from voting units). References: Pavía and Romero (2024) <doi:10.1177/00491241221092725>. Pavía and Romero (2024) <doi:10.1093/jrsssa/qnae013>. Pavía (2023) <doi:10.1007/s43545-023-00658-y>. Pavía (2024) <doi:10.1080/0022250X.2024.2423943>. Pavía (2024) <doi:10.1177/07591063241277064>. Pavía and Penadés (2024). A bottom-up approach for ecological inference. Romero, Pavía, Martín and Romero (2020) <doi:10.1080/02664763.2020.1804842>. Acknowledgements: The authors wish to thank Consellería de Educación, Cultura, Universidades y Empleo, Generalitat Valenciana (grants AICO/2021/257, CIAICO/2023/031) and MICIU/AEI/10.13039/501100011033/FEDER, UE (grant PID2021-128228NB-I00) for supporting this research.
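The LP formulation can be sketched for a 2x2 case: under homogeneity, one row-stochastic transfer matrix P links every unit's election-1 votes to its election-2 votes, and minimizing the total absolute residual is a linear program. A minimal sketch with lpSolve on noise-free toy data (the package's formulations are more general):

```r
library(lpSolve)
set.seed(1)
n <- 6                                    # polling units
P_true <- rbind(c(0.8, 0.2),              # true (unknown) transfer matrix
                c(0.3, 0.7))
X <- matrix(runif(2 * n, 50, 200), n, 2)  # election-1 votes per unit
Y <- X %*% P_true                         # implied election-2 votes

# variables: p11 p12 p21 p22, then e+ and e- slacks for each of the 2n residuals
nv  <- 4 + 4 * n
obj <- c(rep(0, 4), rep(1, 4 * n))        # minimize total absolute residual
con <- matrix(0, 2 + 2 * n, nv)
con[1, 1:2] <- 1                           # row sums of P equal 1
con[2, 3:4] <- 1
for (i in 1:n) for (j in 1:2) {
  r <- 2 + 2 * (i - 1) + j
  con[r, c(j, j + 2)] <- X[i, ]            # x_i1 * p1j + x_i2 * p2j
  con[r, 4 + 2 * (i - 1) + j] <- -1        # - e_plus
  con[r, 4 + 2 * n + 2 * (i - 1) + j] <- 1 # + e_minus
}
rhs <- c(1, 1, t(Y))
sol <- lp("min", obj, con, rep("=", length(rhs)), rhs)
matrix(sol$solution[1:4], 2, byrow = TRUE)  # recovers P_true here
```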