This package provides a tool for comprehensive transcriptomic data analysis, with a focus on transcript-level data preprocessing, expression profiling, differential expression analysis, and functional enrichment. It enables researchers to identify key biological processes, disease biomarkers, and gene regulatory mechanisms. TransProR is aimed at researchers and bioinformaticians working with RNA-Seq data, providing an intuitive framework for in-depth analysis and visualization of transcriptomic datasets. The package includes comprehensive documentation and usage examples to guide users through the entire analysis pipeline. The differential expression analysis methods incorporated in the package include limma (Ritchie et al., 2015, <doi:10.1093/nar/gkv007>; Smyth, 2005, <doi:10.1007/0-387-29362-0_23>), edgeR (Robinson et al., 2010, <doi:10.1093/bioinformatics/btp616>), DESeq2 (Love et al., 2014, <doi:10.1186/s13059-014-0550-8>), and Wilcoxon tests (Li et al., 2022, <doi:10.1186/s13059-022-02648-4>), providing flexible and robust approaches to RNA-Seq data analysis. For more information, refer to the package vignettes and related publications.
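As a flavour of the differential-expression step, here is a minimal sketch using the standard DESeq2 API directly (invented counts and conditions; TransProR's own wrapper functions are not shown here):

    ## Standard DESeq2 workflow on invented data, not TransProR's own interface
    library(DESeq2)
    counts <- matrix(rnbinom(6000, mu = 100, size = 1), nrow = 1000,
                     dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))
    coldata <- data.frame(condition = factor(rep(c("control", "treated"), each = 3)),
                          row.names = colnames(counts))
    dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                                  design = ~ condition)
    dds <- DESeq(dds)                  # size factors, dispersions, Wald tests
    res <- results(dds)                # log2 fold changes and adjusted p-values
    head(res[order(res$padj), ])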
The advent of genomic technologies has enabled the generation of two-dimensional or even multi-dimensional high-throughput data, e.g., monitoring multiple changes in gene expression in genome-wide siRNA screens across many different cell types (McDonald et al. (2017) <doi:10.1016/j.cell.2017.07.005>; Tsherniak et al. (2017) <doi:10.1016/j.cell.2017.06.010>) or single-cell transcriptomics under different experimental conditions. We found that simple computational methods based on a single statistical criterion are no longer adequate for analyzing such multi-dimensional data. We herein introduce ZetaSuite, a statistical package initially designed to score hits from two-dimensional RNAi screens. We also illustrate a unique utility of ZetaSuite in analyzing single-cell transcriptomics to differentiate rare cells from damaged ones (Vento-Tormo et al. (2018) <doi:10.1038/s41586-018-0698-6>). ZetaSuite proceeds in the following steps: QC of input datasets, normalization using Z-transformation, Zeta score calculation, and hit selection based on a defined Screen Strength.
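The normalization step can be pictured with a toy example; the sketch below shows only the Z-transformation on an invented screen matrix (ZetaSuite's actual QC, Zeta scoring and Screen Strength steps are more involved):

    ## Z-transformation step only, on invented data; not ZetaSuite's full pipeline
    set.seed(1)
    screen <- matrix(rnorm(200, mean = 5), nrow = 20,
                     dimnames = list(paste0("siRNA", 1:20), paste0("readout", 1:10)))
    z <- scale(screen, center = TRUE, scale = TRUE)   # column-wise Z-scores
    summary(as.vector(z))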
FamSKAT-RC is a family-based association kernel test for both rare and common variants. This test is general, and several special cases are known as other methods: famSKAT, which focuses only on rare variants in family-based data; SKAT, which focuses on rare variants in population-based data (unrelated individuals); and SKAT-RC, which focuses on both rare and common variants in population-based data. Setting phi to 1 reduces famSKAT-RC to famSKAT; setting phi to 1 and the kinship matrix to the identity matrix reduces it to SKAT; and setting the kinship matrix (fullkins) to the identity matrix while phi is not equal to 1 reduces it to SKAT-RC. We also include a small synthetic pedigree to demonstrate the method. For more details see Saad M and Wijsman EM (2014) <doi:10.1002/gepi.21844>.
This package provides a collection of miscellaneous basic statistic functions and convenience wrappers for efficiently describing data. The author's intention was to create a toolbox which facilitates the (notoriously time-consuming) first descriptive tasks in data analysis: calculating descriptive statistics, drawing graphical summaries and reporting the results. The package furthermore contains functions to produce documents using MS Word (or PowerPoint) and functions to import data from Excel. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R. The reason for collecting them here was primarily to have them consolidated in ONE package instead of dozens (which themselves might depend on other packages that are not needed at all), and to provide a common and consistent interface as far as function and argument naming, NA handling, recycling rules etc. are concerned. Google style guides were used as naming rules (in the absence of convincing alternatives). The BigCamelCase style was consistently applied to functions borrowed from contributed R packages as well.
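As a taste of the intended workflow, a minimal sketch using the package's Desc() function (the iris data stand in for the user's own; output is printed interactively):

    ## One-call description of a variable, numerically and graphically
    library(DescTools)
    Desc(iris$Sepal.Length, plotit = TRUE)     # stats, quantiles, summary plot
    Desc(Sepal.Length ~ Species, data = iris)  # description by group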
This package provides a collection of functions to construct A-optimal block designs for comparing test treatments with one or more control(s). Mainly, A-optimal balanced treatment incomplete block designs, weighted A-optimal balanced treatment incomplete block designs, A-optimal group divisible treatment designs and A-optimal balanced bipartite block designs can be constructed using the package. The designs are constructed using algorithms based on linear integer programming. To the best of our knowledge, these facilities to construct A-optimal block designs for comparing test treatments with one or more controls are not available in existing R packages. For more details on designs for tests-versus-control(s) comparisons, please see Hedayat, A. S. and Majumdar, D. (1984) <doi:10.1080/00401706.1984.10487989> "A-Optimal Incomplete Block Designs for Control-Test Treatment Comparisons", Technometrics, 26, 363-370, and Mandal, B. N., Gupta, V. K. and Parsad, Rajender (2017) <doi:10.1080/03610926.2015.1071394> "Balanced treatment incomplete block designs through integer programming", Communications in Statistics - Theory and Methods, 46(8), 3728-3737.
Automatic generation and selection of spatial predictors for spatial regression with Random Forest. Spatial predictors are surrogates of variables driving the spatial structure of a response variable. The package offers two methods to generate spatial predictors from a distance matrix among training cases: 1) Moran's Eigenvector Maps (MEMs; Dray, Legendre, and Peres-Neto 2006 <DOI:10.1016/j.ecolmodel.2006.02.015>): computed as the eigenvectors of a weighted matrix of distances; 2) RFsp (Hengl et al. <DOI:10.7717/peerj.5518>): columns of the distance matrix used as spatial predictors. Spatial predictors help minimize the spatial autocorrelation of the model residuals and facilitate an honest assessment of the importance scores of the non-spatial predictors. Additionally, functions to reduce multicollinearity, identify relevant variable interactions, tune random forest hyperparameters, assess model transferability via spatial cross-validation, and explore model results via partial dependence curves and interaction surfaces are included in the package. The modelling functions are built around the highly efficient ranger package (Wright and Ziegler 2017 <DOI:10.18637/jss.v077.i01>).
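The first of these two generators can be pictured in a few lines; the sketch below is a bare-bones base-R illustration of the MEM idea only (invented coordinates, a simple distance weighting; spatialRF's own implementation offers far more control):

    ## MEM idea: eigenvectors of a double-centred weighting of a distance matrix
    set.seed(1)
    xy <- matrix(runif(40), ncol = 2)           # 20 invented training locations
    D  <- as.matrix(dist(xy))                   # pairwise distances
    W  <- 1 - (D / max(D))^2                    # a simple distance weighting
    diag(W) <- 0
    C  <- sweep(sweep(W, 1, rowMeans(W)), 2, colMeans(W)) + mean(W)  # double-centre
    e  <- eigen(C, symmetric = TRUE)
    mem <- e$vectors[, e$values > 1e-9]         # keep positive-eigenvalue MEMs
    dim(mem)                                    # candidate spatial predictors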
Missing values often occur in financial data due to a variety of reasons (errors in the collection process or in the processing stage, lack of asset liquidity, lack of reporting of funds, etc.). However, most data analysis methods expect complete data and cannot be employed with missing values. One convenient way to deal with this issue without having to redesign the data analysis method is to impute the missing values. This package provides an efficient way to impute the missing values based on modeling the time series with a random walk or an autoregressive (AR) model, which are convenient models for log-prices and log-volumes in financial data. In the current version, the imputation is univariate-based (so no asset correlation is used). In addition, outliers can be detected and removed. The package is based on the paper: J. Liu, S. Kumar, and D. P. Palomar (2019). "Parameter Estimation of Heavy-Tailed AR Model With Missing Data Via Stochastic EM". IEEE Trans. on Signal Processing, vol. 67, no. 8, pp. 2159-2172. <doi:10.1109/TSP.2019.2899816>.
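To see why the random-walk model is convenient here: under a Gaussian random walk, the conditional mean of a missing log-price between two observed ones is the straight-line interpolation. The sketch below shows only that conditional-mean idea on synthetic data, not the package's stochastic-EM routine for the heavy-tailed AR model:

    ## Conditional-mean imputation under a random walk = linear interpolation
    set.seed(42)
    logp <- cumsum(rnorm(100, mean = 0.0005, sd = 0.01))  # synthetic log-prices
    logp[c(20:24, 60)] <- NA                              # inject missing blocks
    idx <- seq_along(logp)
    imputed <- approx(idx[!is.na(logp)], logp[!is.na(logp)], xout = idx)$y
    head(cbind(observed = logp, imputed), 25)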
Navigating the shift of clinical laboratory data from primary everyday clinical use to secondary research purposes presents a significant challenge. Given the substantial time and expertise required for lab data pre-processing and cleaning, and the lack of all-in-one tools tailored for this need, we developed our algorithm lab2clean as an open-source R package. lab2clean automates and standardizes the intricate process of cleaning clinical laboratory results. With a keen focus on improving the data quality of laboratory result values and units, our goal is to equip researchers with a straightforward, plug-and-play tool, making it easier for them to unlock the true potential of clinical laboratory data in clinical research and clinical machine learning (ML) model development. Functions to clean and validate result values (Version 1.0) are described in detail in Zayed et al. (2024) <doi:10.1186/s12911-024-02652-7>. Functions to standardize and harmonize result units (added in Version 2.0) are described in detail in Zayed et al. (2025) <doi:10.1016/j.ijmedinf.2025.106131>.
We provide a toolbox to estimate the time delay between the brightness time series of gravitationally lensed quasar images via Bayesian and profile likelihood approaches. The model is based on a state-space representation for irregularly observed time series data generated from a latent continuous-time Ornstein-Uhlenbeck process. Our Bayesian method adopts scientifically motivated hyper-prior distributions and a Metropolis-Hastings within Gibbs sampler, producing posterior samples of the model parameters that include the time delay. A profile likelihood of the time delay is a simple approximation to the marginal posterior distribution of the time delay. Both Bayesian and profile likelihood approaches complement each other, producing almost identical results; the Bayesian way is more principled but the profile likelihood is easier to implement. A new functionality is added in version 1.0.9 for estimating the time delay between doubly-lensed light curves observed in two bands. See also Tak et al. (2017) <doi:10.1214/17-AOAS1027>, Tak et al. (2018) <doi:10.1080/10618600.2017.1415911>, Hu and Tak (2020) <arXiv:2005.08049>.
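The latent process behind this state-space model can be simulated directly; the sketch below draws an Ornstein-Uhlenbeck process at irregular times using its exact discretisation (all parameter values are invented; the package itself estimates them, along with the time delay):

    ## Exact simulation of an OU process at irregular observation times
    set.seed(7)
    t_obs <- sort(runif(200, 0, 1000))       # irregular observation times
    mu <- 18; tau <- 100; sigma <- 0.1       # invented OU parameters
    x <- numeric(200)
    x[1] <- rnorm(1, mu, sigma * sqrt(tau / 2))            # stationary draw
    for (i in 2:200) {
      dt <- t_obs[i] - t_obs[i - 1]
      a  <- exp(-dt / tau)
      x[i] <- rnorm(1, mu + a * (x[i - 1] - mu),
                    sigma * sqrt(tau / 2 * (1 - a^2)))     # exact transition
    }
    plot(t_obs, x, type = "l", xlab = "time (days)", ylab = "latent magnitude")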
This package implements a variety of nonparametric and parametric methods that are commonly used when the data set is a mixture of paired observations and independent samples. The package also calculates and returns values of different tests with their corresponding p-values. Bhoj, D. S. (1991) <doi:10.1002/bimj.4710330108> "Testing equality of means in the presence of correlation and missing data". Dubnicka, S. R., Blair, R. C., and Hettmansperger, T. P. (2002) <doi:10.22237/jmasm/1020254460> "Rank-based procedures for mixed paired and two-sample designs". Einsporn, R. L. and Habtzghi, D. (2013) <https://pdfs.semanticscholar.org/89a3/90bafeb2bc41ed4414533cfd5ab84a6b54b6.pdf> "Combining paired and two-sample data using a permutation test". Ekbohm, G. (1976) <doi:10.1093/biomet/63.2.299> "On comparing means in the paired case with incomplete data on both responses". Lin, P. E. and Stivers, L. E. (1974) <doi:10.1093/biomet/61.2.325> "On difference of means with incomplete data". Maritz, J. S. (1995) <doi:10.1111/j.1467-842x.1995.tb00649.x> "A permutation paired test allowing for missing values".
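As a flavour of the permutation approach, here is a generic sketch on invented data: paired differences have their signs flipped and the independent observations have their group labels permuted, under the null of no mean difference (this illustrates the idea only, not the exact statistic of Einsporn and Habtzghi):

    ## Generic permutation test on mixed paired/unpaired data (invented data)
    set.seed(1)
    d  <- rnorm(15, 0.3)                    # paired differences (complete pairs)
    x1 <- rnorm(10, 0.3); x2 <- rnorm(12)   # unpaired observations per group
    stat <- function(d, x1, x2) mean(d) + (mean(x1) - mean(x2))
    obs  <- stat(d, x1, x2)
    perm <- replicate(5000, {
      ds <- d * sample(c(-1, 1), length(d), replace = TRUE)  # flip pair signs
      z  <- sample(c(x1, x2))                                # relabel groups
      stat(ds, z[seq_along(x1)], z[-seq_along(x1)])
    })
    mean(abs(perm) >= abs(obs))             # two-sided permutation p-value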
Life Table Response Experiments (LTREs) are a method of comparative demographic analysis. The purpose is to quantify how the difference or variance in vital rates (stage-specific survival, growth, and fertility) among populations contributes to the difference or variance in the population growth rate, "lambda." We provide functions for one-way fixed-design and random-design LTRE, using either the classical methods that have been in use for several decades, or an fANOVA-based exact method that directly calculates the impact on lambda of changes in individual matrix elements and their interactions. The equations and descriptions for the classical methods of LTRE analysis can be found in Caswell (2001, ISBN: 0878930965), and the fANOVA-based exact methods are described in Hernandez et al. (2023) <doi:10.1111/2041-210X.14065>. We also provide some demographic functions, including generation time from Bienvenu and Legendre (2015) <doi:10.1086/681104>. For implementation of exactLTRE where all possible interactions are calculated, we use an operator matrix presented in Poelwijk, Krishna, and Ranganathan (2016) <doi:10.1371/journal.pcbi.1004771>.
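The classical fixed-design decomposition is compact enough to sketch in base R: contributions are the element-wise differences between two projection matrices multiplied by the sensitivities evaluated at their mean (the matrices below are invented; the package's functions add the random design and the exact fANOVA method):

    ## Classical one-way fixed-design LTRE (Caswell 2001), invented matrices
    A_ref <- matrix(c(0.0, 1.5,
                      0.6, 0.8), 2, 2, byrow = TRUE)
    A_trt <- matrix(c(0.0, 1.2,
                      0.5, 0.8), 2, 2, byrow = TRUE)
    lambda <- function(A) Re(eigen(A)$values[1])
    sens <- function(A) {                      # sensitivities d(lambda)/d(a_ij)
      w <- Re(eigen(A)$vectors[, 1])           # right eigenvector (stable stage)
      v <- Re(eigen(t(A))$vectors[, 1])        # left eigenvector (reprod. value)
      outer(v, w) / sum(v * w)
    }
    A_mid <- (A_ref + A_trt) / 2
    contrib <- (A_trt - A_ref) * sens(A_mid)   # element-wise contributions
    sum(contrib); lambda(A_trt) - lambda(A_ref)  # approximately equal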
Carries out a two-level sample selection where the possibility of an initially selected site not wanting to participate is anticipated, and the site is optimally replaced. The procedure aims to reduce bias (and/or loss of external validity) with respect to the target population. In selecting units and sub-units, sitepickR uses the cube method developed by Deville & Tillé (2004) <http://www.math.helsinki.fi/msm/banocoss/Deville_Tille_2004.pdf> and described in Tillé (2011) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2011002/article/11609-eng.pdf?st=5-sx8Q8n>. The cube method is a probability sampling method that is designed to satisfy criteria for balance between the sample and the population. Recent research has shown that this method performs well in simulations for studies of educational programs (see Fay & Olsen, 2021, under review). To implement the cube method, sitepickR uses the sampling R package <https://cran.r-project.org/package=sampling>. To implement statistical matching, sitepickR uses the MatchIt R package <https://cran.r-project.org/package=MatchIt>.
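The balanced-sampling step underneath can be illustrated directly with the sampling package (a minimal sketch with an invented frame; sitepickR wraps this in its two-level, replacement-aware procedure):

    ## Cube-method sample balanced on inclusion probabilities + two covariates
    library(sampling)
    set.seed(3)
    N   <- 100
    pik <- rep(20 / N, N)                          # equal inclusion probabilities
    X   <- cbind(pik, runif(N), rnorm(N, 50, 10))  # balancing variables
    s   <- samplecube(X, pik, comment = FALSE)     # 0/1 selection indicator
    which(s == 1)                                  # the selected sites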
Generates the roster of turns for an outlet that supplies water 24x7 (i.e., 168 hours a week) to the area under its command, that is, the agricultural area to be irrigated. The area under command is owned in differing proportions by individual farmers, and the outlet runs free of cost to irrigate it. The outlet's flow time therefore has to be divided according to the area each farmer owns and the location of his land or farm. This roster is known as warabandi, and its generation in agricultural practice is a very tedious task: the time calculations are error-prone, especially when performed by hand. The warabandi package calculates this division of flow time for each farmer and generates a full publishable report for an outlet and all the farmers whose farms are to be irrigated. It reduces the risk of error and makes the roster more reproducible. For more details about the warabandi system, see Bandaragoda DJ (1995) <https://publications.iwmi.org/pdf/H_17571i.pdf>.
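The core allocation arithmetic is simple proportionality; a minimal sketch with invented holdings (the package additionally orders turns by farm location and formats the full report):

    ## Weekly flow time split in proportion to each farmer's area (invented data)
    area_ha <- c(F1 = 2.5, F2 = 1.0, F3 = 4.0, F4 = 0.5)   # holdings in hectares
    total_minutes <- 168 * 60                              # 24x7 weekly flow
    share <- total_minutes * area_ha / sum(area_ha)
    data.frame(farmer = names(area_ha), area_ha,
               hours = share %/% 60, minutes = round(share %% 60))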
This package implements sparse generalized factor models (sparseGFM) for dimension reduction and variable selection in high-dimensional data with automatic adaptation to weak factor scenarios. The package supports multiple data types (continuous, count, binary) through generalized linear model frameworks and handles missing values automatically. It provides 12 different penalty functions including Least Absolute Shrinkage and Selection Operator (Lasso), adaptive Lasso, Smoothly Clipped Absolute Deviation (SCAD), Minimax Concave Penalty (MCP), group Lasso, and their adaptive versions for inducing row-wise sparsity in factor loadings. Key features include cross-validation for regularization parameter selection using Sparsity Information Criterion (SIC), automatic determination of the number of factors via multiple information criteria, and specialized algorithms for row-sparse loading structures. The methodology employs alternating minimization with Singular Value Decomposition (SVD)-based identifiability constraints and is particularly effective for high-dimensional applications in genomics, economics, and social sciences where interpretable sparse dimension reduction is crucial. For penalty functions, see Tibshirani (1996) <doi:10.1111/j.2517-6161.1996.tb02080.x>, Fan and Li (2001) <doi:10.1198/016214501753382273>, and Zhang (2010) <doi:10.1214/09-AOS729>.
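Two of the penalty building blocks are easy to write down; the sketch below gives generic forms of the Lasso soft-threshold and a row-wise group-Lasso threshold of the kind used to zero out whole rows of a loading matrix (illustrative only, not sparseGFM's internal code):

    ## Generic proximal operators for Lasso and row-wise group Lasso
    soft <- function(z, lam) sign(z) * pmax(abs(z) - lam, 0)  # Lasso soft-threshold
    row_group_soft <- function(L, lam) {                      # row-wise group Lasso
      nrm <- sqrt(rowSums(L^2))                               # row norms of loadings
      L * pmax(1 - lam / pmax(nrm, 1e-12), 0)                 # shrink whole rows
    }
    set.seed(2)
    L <- matrix(rnorm(20), 10, 2)
    row_group_soft(L, lam = 1)      # some rows shrunk exactly to zero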
Bayesian dynamic borrowing is an approach to incorporating external data to supplement a randomized, controlled trial analysis in which external data are incorporated in a dynamic way (e.g., based on similarity of outcomes); see Viele 2013 <doi:10.1002/pst.1589> for an overview. This package implements the hierarchical commensurate prior approach to dynamic borrowing as described in Hobbs 2011 <doi:10.1111/j.1541-0420.2011.01564.x>. There are three main functionalities. First, psborrow2 provides a user-friendly interface for applying dynamic borrowing to the study results and handles the Markov chain Monte Carlo sampling on behalf of the user. Second, psborrow2 provides a simulation framework to compare different borrowing parameters (e.g. full borrowing, no borrowing, dynamic borrowing) and other trial and borrowing characteristics (e.g. sample size, covariates) in a unified way. Third, psborrow2 provides a set of functions to generate data for simulation studies, and also allows the user to specify their own data generation process. This package is designed to use the sampling functions from cmdstanr which can be installed from <https://stan-dev.r-universe.dev>.
In some phase I trials, the design goal is to find the dose associated with a certain target toxicity rate or the dose with a certain weighted sum of rates of various toxicity grades. TITEgBOIN provides the set-up and calculations needed to run a dose-finding trial using the Bayesian optimal interval (BOIN) (Yuan et al. (2016) <doi:10.1158/1078-0432.CCR-16-0592>), generalized Bayesian optimal interval (gBOIN) (Mu et al. (2019) <doi:10.1111/rssc.12263>), time-to-event Bayesian optimal interval (TITEBOIN) (Lin et al. (2020) <doi:10.1093/biostatistics/kxz007>) and time-to-event generalized Bayesian optimal interval (TITEgBOIN) (Takeda et al. (2022) <doi:10.1002/pst.2182>) designs. TITEgBOIN can perform three tasks: run simulations and obtain operating characteristics; determine the dose for the next cohort; and select the maximum tolerated dose (MTD). These functions allow customization of design characteristics to vary sample size, cohort sizes, target dose-limiting toxicity (DLT) rates or target normalized equivalent toxicity score (ETS) rates to account for a discrete toxicity score, and to incorporate safety and/or stopping rules.
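The interval at the heart of BOIN-type designs has a closed form; the sketch below computes the standard escalation/de-escalation boundaries from the target DLT rate phi and the conventional sub-/over-dosing limits 0.6*phi and 1.4*phi (generic published formulas, not code from the package):

    ## BOIN escalation/de-escalation boundaries (Liu & Yuan's standard formulas)
    boin_bounds <- function(phi, phi1 = 0.6 * phi, phi2 = 1.4 * phi) {
      lambda_e <- log((1 - phi1) / (1 - phi)) /
                  log(phi * (1 - phi1) / (phi1 * (1 - phi)))
      lambda_d <- log((1 - phi) / (1 - phi2)) /
                  log(phi2 * (1 - phi) / (phi * (1 - phi2)))
      c(escalate_if_below = lambda_e, deescalate_if_above = lambda_d)
    }
    boin_bounds(0.3)   # target 0.3: escalate below ~0.236, de-escalate above ~0.359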
Estimates and plots, as a heat map, the correlation coefficients obtained via the wavelet local multiple correlation (WLMC) (Fernández-Macho 2018), together with the dominant variable(s), i.e., the variable(s) that maximize the multiple correlation through time and scale (Polanco-Martínez et al. 2020, Polanco-Martínez 2022). We improve the graphical outputs of WLMC by proposing a didactic and useful way to visualize the dominant variable(s) for a set of time series. The WLMC was designed for financial time series, but other kinds of data (e.g., climatic, ecological, etc.) can be used. The functions contained in VisualDom are highly flexible, since they contain several parameters to personalize the time series under analysis and the heat maps. In addition, we have also included two data sets (named rdata_climate and rdata_Lorenz) to exemplify the use of the functions contained in VisualDom. Methods derived from Fernández-Macho (2018) <doi:10.1016/j.physa.2017.11.050>, Polanco-Martínez et al. (2020) <doi:10.1038/s41598-020-77767-8> and Polanco-Martínez (2023, in press).
This package implements efficient NumPy-like broadcasted operations for atomic and recursive arrays. In the context of operations involving 2 (or more) arrays, "broadcasting" refers to efficiently recycling array dimensions without allocating additional memory. Besides linking to Rcpp, broadcast does not use any external libraries in any way; broadcast was essentially made from scratch and can be installed out-of-the-box. The implementations available in broadcast include, but are not limited to, the following. 1) Broadcasted element-wise operations on any 2 arrays; they support a large set of relational, arithmetic, Boolean, string, and bit-wise operations. 2) A faster, more memory-efficient, and broadcasted abind-like function, for binding arrays along an arbitrary dimension. 3) Broadcasted ifelse-like and apply-like functions. 4) Casting functions, that cast subset-groups of an array to a new dimension, cast nested lists to dimensional lists, and vice-versa. 5) A few linear algebra functions for statistics. The functions in the broadcast package strive to minimize computation time and memory usage (which is not just better for efficient computing, but also for the environment).
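For readers unfamiliar with the term, the base-R sketch below shows the effect being described, recycling a length-n vector across the columns of an n x m matrix (base R materialises intermediate copies; a broadcasted implementation achieves the same result without the expanded allocation; broadcast's own function names are not shown here):

    ## Base-R illustration of the broadcasting concept (not broadcast's API)
    A <- matrix(1:12, nrow = 3)          # a 3 x 4 array
    v <- c(10, 20, 30)                   # a length-3 "column" to recycle over A
    sweep(A, 1, v, `+`)                  # v added to every column of A
    A + matrix(v, nrow = 3, ncol = 4)    # same result via an explicit expanded copy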
NONMEM has been a tool for running nonlinear mixed-effects models since the 80s and is still used today (Bauer 2019 <doi:10.1002/psp4.12404>). This tool allows you to convert NONMEM models to rxode2 (Wang, Hallow and James (2016) <doi:10.1002/psp4.12052>) and, for simple models, to nlmixr2 syntax (Fidler et al. (2019) <doi:10.1002/psp4.12445>). The nlmixr2 syntax requires the residual specification to be included, and it is not always translated. If available, the rxode2 model will read in the NONMEM data and compare the simulation for the population model ('PRED'), individual model ('IPRED') and residual model ('IWRES') to immediately show how well the translation is performing. This saves model development time for people who would otherwise create an rxode2 model manually. Additionally, this package reads in all the information needed to allow simulation with uncertainty (that is, the number of observations, the number of subjects, and the covariance matrix) with an rxode2 model. This is complementary to the babelmixr2 package, which translates nlmixr2 models to NONMEM and can convert the objects converted from nonmem2rx to a full nlmixr2 fit.
Allows clustering of incomplete observations by addressing missing values using multiple imputation. For achieving this goal, the methodology consists of three steps, following Audigier and Niang 2022 <doi:10.1007/s11634-022-00519-1>. I) Missing data imputation using dedicated models. Four multiple imputation methods are proposed; two are based on joint modelling and two are fully sequential methods, as discussed in Audigier et al. (2021) <doi:10.48550/arXiv.2106.04424>. II) Cluster analysis of imputed data sets. Six clustering methods are available (distance-based or model-based), but custom methods can also be easily used. III) Partition pooling. The set of partitions is aggregated using a non-negative matrix factorization based method. An associated instability measure is computed by bootstrap (see Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>). Among applications, this instability measure can be used to choose the number of clusters in the presence of missing values. The package also proposes several diagnostic tools to tune the number of imputed data sets, to tune the number of iterations in fully sequential imputation, to check the fit of imputation models, etc.
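A generic end-to-end sketch of steps I-III on invented data, using mice for imputation, k-means per imputed data set, and a simple co-clustering consensus in place of the package's NMF-based pooling (so, the shape of the workflow rather than clusterMI's own interface):

    ## Impute -> cluster each imputed set -> pool partitions by consensus
    library(mice)
    set.seed(4)
    dat <- as.data.frame(scale(iris[, 1:4]))
    dat[sample(nrow(dat), 15), 1] <- NA               # inject missing values
    imp <- mice(dat, m = 5, printFlag = FALSE)        # step I: multiple imputation
    parts <- lapply(1:5, function(i)                  # step II: cluster each set
      kmeans(complete(imp, i), centers = 3, nstart = 10)$cluster)
    cons <- Reduce(`+`, lapply(parts, function(p)     # step III: consensus pooling
      outer(p, p, `==`))) / 5
    pooled <- cutree(hclust(as.dist(1 - cons)), k = 3)
    table(pooled)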
An implementation of the multilevel (also known as mixed or random effects) hidden Markov model using Bayesian estimation in R. The multilevel hidden Markov model (HMM) is a generalization of the well-known hidden Markov model; for the latter see Rabiner (1989) <doi:10.1109/5.18626>. The multilevel HMM is tailored to accommodate (intense) longitudinal data of multiple individuals simultaneously; see e.g., de Haan-Rietdijk et al. <doi:10.1080/00273171.2017.1370364>. Using a multilevel framework, we allow for heterogeneity in the model parameters (transition probability matrix and conditional distribution), while estimating one overall HMM. The model can be fitted on multivariate data with either a categorical, normal, or Poisson distribution, and can include individual-level covariates (allowing for e.g., group comparisons on model parameters). Parameters are estimated using Bayesian estimation utilizing the forward-backward recursion within a hybrid Metropolis-within-Gibbs sampler. Missing data (NA) in the dependent variables are accommodated under the missing-at-random (MAR) assumption. The package also includes various visualization options, a function to simulate data, and a function to obtain the most likely hidden state sequence for each individual using the Viterbi algorithm.
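For reference, the Viterbi step mentioned last is small enough to sketch for a plain categorical-emission HMM (generic algorithm on invented parameters; mHMMbayes applies it per individual with the estimated multilevel parameters):

    ## Generic Viterbi decoding for a categorical-emission HMM
    viterbi <- function(obs, init, trans, emis) {
      m <- nrow(trans); Tn <- length(obs)
      lv <- matrix(-Inf, m, Tn); bt <- matrix(0L, m, Tn)
      lv[, 1] <- log(init) + log(emis[, obs[1]])
      for (t in 2:Tn) for (j in 1:m) {
        cand     <- lv[, t - 1] + log(trans[, j])   # best path into state j
        bt[j, t] <- which.max(cand)
        lv[j, t] <- max(cand) + log(emis[j, obs[t]])
      }
      path <- integer(Tn); path[Tn] <- which.max(lv[, Tn])
      for (t in (Tn - 1):1) path[t] <- bt[path[t + 1], t + 1]  # backtrack
      path
    }
    trans <- matrix(c(.9, .1, .2, .8), 2, byrow = TRUE)  # invented parameters
    emis  <- matrix(c(.7, .3, .2, .8), 2, byrow = TRUE)
    viterbi(c(1, 1, 2, 2, 2, 1), init = c(.5, .5), trans = trans, emis = emis)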
clustComp is a package that implements several techniques for the comparison and visualisation of relationships between different clustering results, either flat versus flat or hierarchical versus flat. These relationships among clusters are displayed using a weighted bi-graph, in which the nodes represent the clusters and the edges connect pairs of nodes with non-empty intersection; the weight of each edge is the number of elements in that intersection and is displayed through the edge thickness. The best layout of the bi-graph is provided by the barycentre algorithm, which minimises the weighted number of crossings. In the case of comparing a hierarchical and a non-hierarchical clustering, the dendrogram is pruned at different heights, selected by exploring the tree by depth-first search, starting at the root. Branches are split according to the value of a scoring function, which can be based either on the aesthetics of the bi-graph or on the mutual information between the hierarchical and the flat clusterings. A mapping between groups of clusters from each side is constructed with a greedy algorithm, and can be additionally visualised.
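One pass of the barycentre heuristic is easy to picture on a toy intersection matrix: each left-side cluster is placed at the weighted mean position of its right-side neighbours, and sorting by those barycentres tends to reduce crossings (a minimal sketch, not clustComp's code):

    ## One barycentre pass over a toy weighted bi-graph
    W <- matrix(c(5, 0, 2,
                  0, 4, 0,
                  1, 0, 6), 3, 3, byrow = TRUE,
                dimnames = list(paste0("flat", 1:3), paste0("hier", 1:3)))
    pos_right <- seq_len(ncol(W))                    # current right-side order
    bary <- as.vector(W %*% pos_right) / rowSums(W)  # weighted neighbour mean
    order(bary)                                      # new order for the left side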
Seeds germinate through the physical process of water uptake by the dry seed, driven by the difference in water potential between the seed and the water. There is seed-to-seed variability in the base seed water potential; hence the need for a distribution such that a viable seed germinates if and only if the soil water potential exceeds its base seed water potential. This package estimates the stress tolerance and uniformity parameters of the seed lot for germination under various temperatures by using the hydro-time model of counts of germinated seeds under various water potentials. The distribution of base seed water potential is taken to follow the Normal, Logistic or Extreme value distribution. The estimated proportion of germinated seeds, along with the estimates of the stress and uniformity parameters, is obtained using a generalised linear model. Significance tests of the above parameters within and between temperatures are also performed in the analysis. Details can be found in Kebreab and Murdoch (1999) <doi:10.1093/jxb/50.334.655> and Bradford (2002) <https://www.jstor.org/stable/4046371>.
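Under a Normal base-water-potential distribution, the GLM in question is a probit regression of germination proportion on water potential, and the stress-tolerance and uniformity parameters fall out of the coefficients; a minimal sketch on invented counts:

    ## Probit GLM: P(germinate) = Phi((psi - mu)/sigma); invented counts
    psi  <- c(0, -0.3, -0.6, -0.9, -1.2)            # water potentials (MPa)
    germ <- c(48, 44, 30, 12, 3); n <- rep(50, 5)   # germinated out of 50 seeds
    fit  <- glm(cbind(germ, n - germ) ~ psi, family = binomial("probit"))
    b    <- unname(coef(fit))
    c(mean_base_psi = -b[1] / b[2],                 # stress-tolerance parameter
      sd_base_psi   =  1 / b[2])                    # uniformity parameter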
Sample and cell filtering as well as visualisation of output metrics from Cell Ranger by Grace X.Y. Zheng et al. (2017) <doi:10.1038/ncomms14049>. CRMetrics allows for easy plotting of output metrics across multiple samples, as well as comparative plots with statistical assessments of these metrics. CRMetrics allows for easy removal of ambient RNA using SoupX by Matthew D Young and Sam Behjati (2020) <doi:10.1093/gigascience/giaa151> or CellBender by Stephen J Fleming et al. (2022) <doi:10.1101/791699>. Furthermore, it is possible to preprocess data using Pagoda2 by Nikolas Barkas et al. (2021) <https://github.com/kharchenkolab/pagoda2> or Seurat by Yuhan Hao et al. (2021) <doi:10.1016/j.cell.2021.04.048>, followed by embedding of cells using Conos by Nikolas Barkas et al. (2019) <doi:10.1038/s41592-019-0466-z>. Finally, doublets can be detected using scrublet by Samuel L. Wolock et al. (2019) <doi:10.1016/j.cels.2018.11.005> or DoubletDetection by Gayoso et al. (2020) <doi:10.5281/zenodo.2678041>. In the end, cells are filtered based on user input for use in downstream applications.