This package implements the algorithm described in Barron, M., Zhang, S. and Li, J. (2017), "A sparse differential clustering algorithm for tracing cell type changes via single-cell RNA-sequencing data", Nucleic Acids Research, gkx1113, <doi:10.1093/nar/gkx1113>. The algorithm clusters samples from two different populations, links the clusters across the conditions, and identifies marker genes for these changes. The package was designed for scRNA-Seq data but is applicable to many other data types: simply replace cells with samples and genes with variables. The package also contains functions for estimating the parameters for SparseDC as outlined in the paper. We recommend that users further select their marker genes using the magnitude of the cluster centers.
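A minimal call sketch follows; it assumes the exported names sparsedc_cluster(), lambda1_calculator() and lambda2_calculator() with illustrative arguments, so consult the package manual for the exact signatures.

    # Minimal sketch; function and argument names assumed, see the manual.
    library(SparseDC)
    set.seed(1)
    pdat1 <- matrix(rnorm(50 * 30), nrow = 50)  # genes x cells, condition 1
    pdat2 <- matrix(rnorm(50 * 30), nrow = 50)  # genes x cells, condition 2
    K <- 3
    lam1 <- lambda1_calculator(pdat1, pdat2, ncluster = K)
    lam2 <- lambda2_calculator(pdat1, pdat2, ncluster = K)
    fit <- sparsedc_cluster(pdat1, pdat2, ncluster = K,
                            lambda1 = lam1, lambda2 = lam2)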
Application of the PAM (Partitioning Around Medoids) algorithm to single-cell sequencing samples with a high number of cells (as many as the computer memory allows). The package uses a binary format to store matrices (full, sparse, or symmetric) in files on disk. These files can hold any data type (not just double), which allows matrices to be manipulated when memory suffices to load them as int or float, but not as double. The PAM implementation runs in parallel, using several or all of the machine's cores when available. This package shares much of its code with the jmatrix and parallelpam packages, but their functionality is included here, so there is no need to install them.
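As a conceptual illustration of the clustering step only, here is PAM run with the reference implementation from the cluster package on a small in-memory matrix; this package's own interface instead works on its disk-backed binary matrices and parallelizes the computation.

    # PAM on a small in-memory matrix with cluster::pam, for illustration.
    library(cluster)
    set.seed(1)
    x <- matrix(rnorm(200 * 10), nrow = 200)   # 200 "cells", 10 features
    fit <- pam(x, k = 4, metric = "manhattan")
    table(fit$clustering)                      # cluster sizes
    fit$medoids                                # medoid coordinates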
This is a collection of functions optimized for working with various kinds of text matrices. Focusing on the text matrix as the primary object - represented either as a base R dense matrix or a Matrix package sparse matrix - allows for a consistent and intuitive interface that stays close to the underlying mathematical foundation of computational text analysis. In particular, the package includes functions for working with word embeddings, text networks, and document-term matrices. It implements methods developed in Stoltz and Taylor (2019) <doi:10.1007/s42001-019-00048-6>, Taylor and Stoltz (2020) <doi:10.1007/s42001-020-00075-8>, Taylor and Stoltz (2020) <doi:10.15195/v7.a23>, and Stoltz and Taylor (2021) <doi:10.1016/j.poetic.2021.101567>.
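For illustration, a document-term matrix of the kind the package operates on can be built by hand as a Matrix sparse matrix (the package's own builders are documented in its manual):

    # A tiny sparse document-term matrix plus a cosine similarity, using
    # base R and Matrix only; text matrices like this are the package's input.
    library(Matrix)
    docs  <- list(c("text", "as", "data"), c("data", "as", "matrix"))
    vocab <- sort(unique(unlist(docs)))
    dtm <- sparseMatrix(
      i = rep(seq_along(docs), lengths(docs)),
      j = match(unlist(docs), vocab),
      x = 1, dims = c(length(docs), length(vocab)),
      dimnames = list(c("doc1", "doc2"), vocab)
    )
    v <- as.matrix(dtm)
    sum(v[1, ] * v[2, ]) / sqrt(sum(v[1, ]^2) * sum(v[2, ]^2))  # cosine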
Recent advances in single cell/nucleus transcriptomic technology have enabled collection of cohort-scale datasets to study cell type specific gene expression differences associated with disease state, stimulus, and genetic regulation. The scale of these data, complex study designs, and low read count per cell mean that characterizing cell type specific molecular mechanisms requires a user-friendly, purpose-built analytical framework. We have developed the dreamlet package, which applies a pseudobulk approach and fits a regression model for each gene and cell cluster to test differential expression across individuals associated with a trait of interest. Use of precision-weighted linear mixed models enables accounting for repeated measures study designs, high dimensional batch effects, and varying sequencing depth or observed cells per biosample.
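The pseudobulk step can be sketched in base R: raw counts are summed over all cells from the same biosample (and, in practice, cell cluster) before per-gene regression; dreamlet's own interface wraps this aggregation together with the precision-weighted mixed models.

    # Pseudobulk aggregation sketch: sum counts over cells per biosample.
    set.seed(1)
    counts    <- matrix(rpois(100 * 60, 2), nrow = 100)  # genes x cells
    sample_id <- rep(paste0("s", 1:6), each = 10)        # 6 biosamples
    pb <- t(rowsum(t(counts), group = sample_id))        # genes x samples
    dim(pb)                                              # 100 x 6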
This package provides a Bayesian nonparametric model for the study of time-evolving frequencies, which has become renowned in the study of population genetics. The model consists of a Hidden Markov Model (HMM) in which the latent signal is a distribution-valued stochastic process that takes the form of a finite mixture of Dirichlet Processes, indexed by vectors that count how many times each value is observed in the population. The package implements methodologies presented in Ascolani, Lijoi and Ruggiero (2021) <doi:10.1214/20-BA1206> and Ascolani, Lijoi and Ruggiero (2023) <doi:10.3150/22-BEJ1504> that make it possible to study the process at the time of data collection or to predict its evolution into the future or the past.
Allows the user to generate a list of features (gene, pseudo, RNA, CDS, and/or UTR) directly from the NCBI database for any species with a current build available. An option to save the downloaded and formatted files is available, and the user can prioritize the feature list based on the types and assembly builds present in the current build used. The user can then use the generated list of features, or provide their own, to map a set of markers (designed for SNP markers with a single base pair position available) to the closest feature based on the map build. This function requires the map positions of the markers to be provided, and the positions should be based on the build being queried through NCBI.
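The closest-feature step amounts to a nearest-position lookup; here is a minimal base R sketch with made-up positions (a single chromosome and a shared map build are assumed):

    # Hypothetical positions; the package fetches real features from NCBI.
    feature_pos <- c(1200, 5300, 9800)             # feature positions (bp)
    feature_id  <- c("geneA", "geneB", "geneC")
    markers     <- c(1500, 9000)                   # SNP positions (bp)
    nearest <- vapply(markers,
                      function(p) which.min(abs(feature_pos - p)), 1L)
    data.frame(marker = markers, feature = feature_id[nearest])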
Constructs principal surfaces: two-dimensional surfaces that pass through the middle of a p-dimensional data set. They minimise the distance from the data points and provide a nonlinear summary of the data. The surfaces are nonparametric and their shape is suggested by the data. A surface is formed by an iterative procedure which starts with a linear summary, typically the principal component plane. Each successive iteration is a local average of the p-dimensional points, where an average is based on a projection of a point onto the nonlinear surface of the previous iteration. For more information on principal surfaces, see Ganey, R. (2019) <https://open.uct.ac.za/items/4e655d7d-d10c-481b-9ccc-801903aebfc8>.
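The initialization can be sketched with base R: the starting surface is the plane spanned by the first two principal components, onto which the data are projected before the local-averaging iterations begin.

    # Initial surface = principal component plane (first two PCs).
    set.seed(1)
    X  <- matrix(rnorm(100 * 3), ncol = 3)            # p = 3 dimensional data
    pc <- prcomp(X)
    fitted0 <- pc$x[, 1:2] %*% t(pc$rotation[, 1:2])  # projections, centered
    fitted0 <- sweep(fitted0, 2, pc$center, "+")      # back to data scale
    # Subsequent iterations replace each fitted point by a local average of
    # the points that project nearby on the current surface.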
This package provides methods for estimation and hypothesis testing of proportions in group testing designs: methods for estimating a proportion in a single population (assuming sensitivity and specificity equal to 1 in designs with equal group sizes), as well as hypothesis tests and functions for experimental design for this situation. For estimating one proportion or the difference of proportions, a number of confidence interval methods are included, which can deal with various pool sizes. Further, regression methods are implemented for simple pooling and matrix pooling designs. Methods for identification of positive items in group testing designs are also provided: optimal testing configurations can be found for hierarchical and array-based algorithms, and operating characteristics can be calculated for testing configurations across a wide variety of situations.
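As a worked example of the single-population case under the stated assumptions (sensitivity and specificity equal to 1, equal group sizes): if T of n pools of size s test positive, the maximum likelihood estimate of the individual-level prevalence is p = 1 - (1 - T/n)^(1/s).

    # Group testing point estimate under perfect tests and equal pool sizes.
    n <- 40; s <- 5; T_pos <- 8
    p_hat <- 1 - (1 - T_pos / n)^(1 / s)
    p_hat   # ~0.044: 20% positive pools of size 5 imply ~4.4% prevalence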
This package provides tools for developing and validating prediction models, estimating the expected survival of patients, and visualizing it graphically. Most of the implemented methods are based on penalized regressions, such as: the lasso (Tibshirani R (1996)), the elastic net (Zou H et al. (2005) <doi:10.1111/j.1467-9868.2005.00503.x>), the adaptive lasso (Zou H (2006) <doi:10.1198/016214506000000735>), stability selection (Meinshausen N et al. (2010) <doi:10.1111/j.1467-9868.2010.00740.x>), some extensions of the lasso (Ternes et al. (2016) <doi:10.1002/sim.6927>), and some methods for the interaction setting (Ternes N et al. (2016) <doi:10.1002/bimj.201500234>), among others. A function for generating simulated survival data sets is also provided.
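The lasso building block can be illustrated with glmnet's Cox family; the package's own functions add the extensions listed above (adaptive weights, stability selection, interaction screening) on top of fits like this.

    # Cross-validated Cox lasso with glmnet, for illustration.
    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), nrow = 100)
    y <- cbind(time   = rexp(100, rate = exp(x[, 1])),
               status = rbinom(100, 1, 0.7))
    cvfit <- cv.glmnet(x, y, family = "cox")
    coef(cvfit, s = "lambda.min")   # sparse coefficient vector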
Given a likelihood provided by the user, this package applies it to a given matrix dataset in order to find change points in the data that maximize the sum of the likelihoods of all the segments. The package provides a handful of algorithms with different time complexities and assumption compromises, so the user is able to choose the best one for the problem at hand. The implementation of the segmentation algorithms in this package is based on the paper by Bruno M. de Castro and Florencia Leonardi (2018) <arXiv:1501.01756>. The Berlin weather sample dataset was provided by Deutscher Wetterdienst <https://dwd.de/>. You can find all the references in the Acknowledgments section of this package's repository via the URL below.
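The core idea can be sketched for a single change point with a Gaussian likelihood: choose the split that maximizes the summed segment log-likelihoods (the package generalizes this to user-supplied likelihoods, multiple change points, and algorithms with different time complexities).

    # Single change point by maximizing the sum of segment log-likelihoods.
    loglik <- function(z) {
      s <- sqrt(mean((z - mean(z))^2) + 1e-12)
      sum(dnorm(z, mean(z), s, log = TRUE))
    }
    set.seed(1)
    x  <- c(rnorm(50, 0), rnorm(50, 3))
    ks <- 2:(length(x) - 2)
    ll <- vapply(ks, function(k) loglik(x[1:k]) + loglik(x[-(1:k)]), 0)
    ks[which.max(ll)]   # estimated change point (true split at 50)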
Allows the user to run the Adaptive Correlated Spike and Slab (ACSS) algorithm, the corresponding INdependent Spike and Slab (INSS) algorithm, and the Giannone, Lenza and Primiceri (GLP) algorithm with adaptive burn-in. All three algorithms are used to fit high dimensional data sets with either a sparse structure, or a dense structure with smaller contributions from all predictors. The state-of-the-art GLP algorithm is described in Giannone, D., Lenza, M., & Primiceri, G. E. (2021, ISBN:978-92-899-4542-4) "Economic predictions with big data: The illusion of sparsity". The two new algorithms, ACSS and INSS, and the discussion of their performance can be found in Yang, Z., Khare, K., & Michailidis, G. (2024, preprint) "Bayesian methodology for adaptive sparsity and shrinkage in regression".
Statistical procedures to perform stability analysis in plant breeding and to identify stable genotypes under diverse environments. It is possible to calculate the coefficient of homeostaticity by Khangildin et al. (1979), the variance of specific adaptive ability by Kilchevsky & Khotyleva (1989), the weighted homeostaticity index by Martynov (1990), the steadiness of stability index by Udachin (1990), the superiority measure by Lin & Binns (1988) <doi:10.4141/cjps88-018>, regression on environmental index by Eberhart & Russell (1966) <doi:10.2135/cropsci1966.0011183X000600010011x>, Tai's (1971) stability parameters <doi:10.2135/cropsci1971.0011183X001100020006x>, the stability variance by Shukla (1972) <doi:10.1038/hdy.1972.87>, ecovalence by Wricke (1962), nonparametric stability parameters by Nassar & Huehn (1987) <doi:10.2307/2531947>, and Francis & Kannenberg's parameters of stability (1978) <doi:10.4141/cjps78-157>.
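As one concrete example of these measures, Wricke's (1962) ecovalence for genotype i over environments j is W_i = sum_j (x_ij - xbar_i. - xbar_.j + xbar_..)^2, computable directly from a genotype-by-environment table of means:

    # Wricke's ecovalence from a genotype x environment matrix of means.
    set.seed(1)
    x <- matrix(rnorm(4 * 5, mean = 10), nrow = 4,
                dimnames = list(paste0("G", 1:4), paste0("E", 1:5)))
    int <- x - rowMeans(x)[row(x)] - colMeans(x)[col(x)] + mean(x)
    rowSums(int^2)   # smaller ecovalence = more stable genotype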
This package provides a tool for matching ICD-10 codes to corresponding Clinical Classification Software Refined (CCSR) codes. The main function, CCSRfind(), identifies each CCSR code that applies to an individual given their diagnosis codes. It also provides a summary of the CCSR codes that are matched to a dataset. The package contains 3 datasets: DXCCSR (mapping of ICD-10 codes to CCSR codes), Legend (conversion of DXCCSR to CCSRfind-usable format for CCSR codes with less than or equal to 1000 ICD-10 diagnosis codes), and LegendExtend (conversion of DXCCSR to CCSRfind-usable format for CCSR codes with more than 1000 ICD-10 diagnosis codes). The disc() function applies grepl() ('base') to multiple columns and is used in CCSRfind().
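A hypothetical call sketch follows; the argument names are illustrative only, so consult the package documentation for the actual signature of CCSRfind().

    # Illustrative only: argument names below are assumed, not verified.
    library(CCSRfind)
    dx <- data.frame(id  = c(1, 1, 2),
                     dx1 = c("I10", "E119", "J449"))
    res <- CCSRfind(df = dx, idDX = "dx1")   # map codes to CCSR categories
    head(res)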
Estimation of the generalized beta distribution of the second kind (GB2) and related models using grouped data in the form of income shares. The GB2 family is a general class of distributions that provides an accurate fit to income data. GB2group includes functions to estimate the GB2, the Singh-Maddala, the Dagum, the Beta 2, the Lognormal, and the Fisk distributions. GB2group deploys two different econometric strategies to estimate these parametric distributions: the equally weighted minimum distance (EWMD) estimator and the optimally weighted minimum distance (OMD) estimator. Asymptotic standard errors are reported for the OMD estimates. Standard errors of the EWMD estimates are obtained by Monte Carlo simulation. See Jorda et al. (2018) <arXiv:1808.09831> for a detailed description of the estimation procedure.
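A hypothetical call sketch (the function name fitgroup.gb2() follows the package's naming pattern, but the argument names here are illustrative; see the manual):

    # Illustrative only: fitting the GB2 from quintile income shares.
    library(GB2group)
    shares <- c(0.05, 0.10, 0.15, 0.25, 0.45)       # quintile income shares
    fit <- fitgroup.gb2(y = shares, gini.e = 0.38)  # gini.e: survey Gini
    fit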
Corset plots are a visualization technique used strictly to visualize repeated measures at two time points (such as pre- and post- data). The distribution of measurements is visualized at each time point, whilst the trajectories of individual change are visualized by connecting the pre- and post- values linearly. These lines can be coloured to represent the magnitude of change or another user-defined value. This method of visualization is ideal for showing the heterogeneity of data, including differences by sub-groups. The package relies on ggplot2, allowing for easy integration so that users can customize their visualizations as required. Users can create corset plots from data in either wide or long format using the functions gg_corset() or gg_corset_elongated(), respectively.
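A minimal sketch with wide-format data, assuming the argument names shown here (two time point variables, a subject identifier, and a colouring variable); check the package documentation for the exact interface.

    # Illustrative call: argument names assumed, see the manual.
    library(ggcorset)
    set.seed(1)
    d <- data.frame(id   = 1:30,
                    pre  = rnorm(30, mean = 10),
                    post = rnorm(30, mean = 12))
    d$change <- d$post - d$pre
    gg_corset(d, y_var1 = "pre", y_var2 = "post",
              group = "id", c_var = "change")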
Computes statistical hypothesis tests for loadings in principal component analysis (PCA) (Yamamoto, H. et al. (2014) <doi:10.1186/1471-2105-15-51>), orthogonal smoothed PCA (OS-PCA) (Yamamoto, H. et al. (2021) <doi:10.3390/metabo11030149>), one-sided kernel PCA (Yamamoto, H. (2023) <doi:10.51094/jxiv.262>), partial least squares (PLS) and PLS discriminant analysis (PLS-DA) (Yamamoto, H. et al. (2009) <doi:10.1016/j.chemolab.2009.05.006>), PLS with rank order of groups (PLS-ROG) (Yamamoto, H. (2017) <doi:10.1002/cem.2883>), regularized canonical correlation analysis discriminant analysis (RCCA-DA) (Yamamoto, H. et al. (2008) <doi:10.1016/j.bej.2007.12.009>), and multiset PLS and PLS-ROG (Yamamoto, H. (2022) <doi:10.1101/2022.08.30.505949>).
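A minimal sketch of the PCA loading test, assuming the pca_loading() interface applied to a prcomp() fit (field names in the result are also assumed; see the package manual):

    # Illustrative only: p-values for PCA loadings via pca_loading().
    library(loadings)
    set.seed(1)
    X <- matrix(rnorm(30 * 10), nrow = 30)
    pca <- prcomp(X, scale. = TRUE)
    out <- pca_loading(pca)
    out$loading$p.value[, 1]   # p-values for PC1 loadings (assumed field)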
This package performs the O2PLS data integration method for two datasets, yielding joint and data-specific parts for each dataset. The algorithm automatically switches to a memory-efficient approach to fit O2PLS to high dimensional data. It provides both a rigorous and a faster, alternative cross-validation method to select the number of components, as well as functions to report proportions of explained variation and to construct plots of the results. See the software article by el Bouhaddani et al. (2018) <doi:10.1186/s12859-018-2371-3>, and Trygg and Wold (2003) <doi:10.1002/cem.775>. It also performs Sparse Group (Penalized) O2PLS, see Gu et al. (2020) <doi:10.1186/s12859-021-03958-3>, and cross-validation for the degree of sparsity.
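A minimal fit sketch, assuming the o2m() interface with n joint and nx/ny data-specific components:

    # O2PLS with 2 joint and 1 data-specific component per block.
    library(OmicsPLS)
    set.seed(1)
    X <- scale(matrix(rnorm(50 * 10), nrow = 50))
    Y <- scale(matrix(rnorm(50 * 8),  nrow = 50))
    fit <- o2m(X, Y, n = 2, nx = 1, ny = 1)
    summary(fit)   # proportions of explained variation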
This package provides tools to visualize the results of a classification of cases. The graphical displays include stacked plots, silhouette plots, quasi residual plots, and class maps. Implements the techniques described and illustrated in Raymaekers J., Rousseeuw P.J., Hubert M. (2021), "Class maps for visualizing classification results", Technometrics, 64(2), 151-165, <doi:10.1080/00401706.2021.1927849> (open access) and Raymaekers J., Rousseeuw P.J. (2021), "Silhouettes and quasi residual plots for neural nets and tree-based classifiers", Journal of Computational and Graphical Statistics, 31(4), 1332-1343, <doi:10.1080/10618600.2022.2050249>. Examples can be found in the vignettes: "Discriminant_analysis_examples", "K_nearest_neighbors_examples", "Support_vector_machine_examples", "Rpart_examples", "Random_forest_examples", and "Neural_net_examples".
The hybrid model is a highly effective forecasting approach that integrates decomposition techniques with machine learning to enhance time series prediction accuracy. Each decomposition technique breaks down a time series into multiple intrinsic mode functions (IMFs), which are then individually modeled and forecasted using machine learning algorithms. The final forecast is obtained by aggregating the predictions of all IMFs, producing an ensemble output for the time series. The performance of the developed models is evaluated using international monthly maize price data, assessed through metrics such as root mean squared error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). For method details see Choudhary, K. et al. (2023). <https://ssca.org.in/media/14_SA44052022_R3_SA_21032023_Girish_Jha_FINAL_Finally.pdf>.
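The decompose-forecast-aggregate structure can be sketched in base R; stl() stands in here for the EMD-type decompositions used in the paper, and the component forecaster is a simple AR model.

    # Decomposition-ensemble sketch: decompose, forecast parts, aggregate.
    set.seed(1)
    y <- ts(cumsum(rnorm(120)), frequency = 12)
    dec <- stl(y, s.window = "periodic")$time.series  # stand-in decomposition
    fcast_one <- function(z, h = 6) {                 # simple AR forecaster
      fit <- ar(as.numeric(z), order.max = 2)
      as.numeric(predict(fit, n.ahead = h)$pred)
    }
    comp_fc  <- sapply(seq_len(ncol(dec)), function(j) fcast_one(dec[, j]))
    ensemble <- rowSums(comp_fc)                      # aggregated forecast
    ensemble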
This package performs robust multiple testing for means in the presence of known and unknown latent factors, as presented in Fan et al. (2019) "FarmTest: Factor-Adjusted Robust Multiple Testing With Approximate False Discovery Control" <doi:10.1080/01621459.2018.1527700>. It implements a series of adaptive Huber methods combined with fast data-driven tuning schemes proposed in Ke et al. (2019) "User-Friendly Covariance Estimation for Heavy-Tailed Distributions" <doi:10.1214/19-STS711> to estimate model parameters and construct test statistics that are robust against heavy-tailed and/or asymmetric error distributions. Extensions to two-sample simultaneous mean comparison problems are also included. As by-products, this package contains functions that compute adaptive Huber mean, covariance, and regression estimators that are of independent interest.
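A minimal call sketch, assuming the main interface farm.test() applied to a samples-by-variables matrix (the output field name is also assumed; see the manual):

    # Multiple testing of column means with latent factor adjustment.
    library(FarmTest)
    set.seed(1)
    X <- matrix(rnorm(50 * 100), nrow = 50)
    X[, 1:5] <- X[, 1:5] + 1      # five truly nonzero means
    out <- farm.test(X)           # tests H0: mu_j = 0 for each column
    out$reject                    # rejected hypotheses (assumed field)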
Novel method to unbiasedly include studies with Non-statistically Significant Unreported Effects (NSUEs) in a meta-analysis. First, the package calculates the interval where the unreported effects (e.g., t-values) should be according to the threshold of statistical significance used in each study. Afterward, the method uses maximum likelihood techniques to impute the expected effect size of each study with NSUEs, accounting for between-study heterogeneity and potential covariates. Multiple imputations of the NSUEs are then randomly created based on the expected value, variance, and statistical significance bounds. Finally, it conducts a restricted-maximum likelihood random-effects meta-analysis separately for each set of imputations and combines the estimates obtained from these meta-analyses. Please read the reference in metansue for details of the procedure.
This package provides a collection of white noise hypothesis tests for functional time series and related visualizations. These include tests based on the norms of autocovariance operators that are built under both strong and weak white noise assumptions. Additionally, tests based on the spectral density operator and on principal component dimension reduction are included, which are built under strong white noise assumptions. The package also provides goodness-of-fit tests for functional autoregressive models of order 1. These methods are described in Kokoszka et al. (2017) <doi:10.1016/j.jmva.2017.08.004>, Characiejus and Rice (2019) <doi:10.1016/j.ecosta.2019.01.003>, Gabrys and Kokoszka (2007) <doi:10.1198/016214507000001111>, and Kim et al. (2023) <doi:10.1214/23-SS143>, respectively.
Simulates age-at-onset traits associated with a segregating major gene in family data obtained from population-based, clinic-based, or multi-stage designs. Appropriate ascertainment correction is utilized to estimate age-dependent penetrance functions either parametrically from the fitted model or nonparametrically from the data. The Expectation-Maximization algorithm can infer missing genotypes and carrier probabilities estimated from the family's genotype and phenotype information or from a fitted model. Plot functions include pedigrees of simulated families and predicted penetrance curves based on specified parameter values. For more information see Choi, Y.-H., Briollais, L., He, W. and Kopciuk, K. (2021) "FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs", Journal of Statistical Software, 97(7), 1-30.
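A minimal simulation sketch, assuming the simfam() interface described in the JSS article (argument values illustrative):

    # Simulate population-based families with a Weibull baseline hazard.
    library(FamEvent)
    set.seed(1)
    fam <- simfam(N.fam = 10, design = "pop", variation = "none",
                  base.dist = "Weibull")
    head(fam)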
This package provides functions for performing least-squares bilinear clustering of three-way data. The method uses the bilinear decomposition (or bi-additive model) to model two-way matrix slices while clustering over the third way. Up to four different types of clusters are included, one for each term of the bilinear decomposition. In this way, matrices are clustered simultaneously on (a subset of) their overall means, row margins, column margins and row-column interactions. The orthogonality of the bilinear model results in separability of the joint clustering problem into four separate ones. Three of these sub-problems are specific k-means problems, while a special algorithm is implemented for the interactions. Plotting methods are provided, including biplots for the low-rank approximations of the interactions.
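The bi-additive decomposition that underlies the four cluster types can be written out in base R for a single matrix slice: overall mean, row margin, column margin, and interaction, with the interaction term approximated at low rank as in the biplots.

    # One slice: x_ij = mu + a_i + b_j + g_ij, with a low-rank view of g.
    set.seed(1)
    M  <- matrix(rnorm(6 * 5, mean = 10), nrow = 6)
    mu <- mean(M)
    a  <- rowMeans(M) - mu                     # row margins
    b  <- colMeans(M) - mu                     # column margins
    G  <- M - mu - outer(a, rep(1, 5)) - outer(rep(1, 6), b)  # interactions
    sv <- svd(G)
    G2 <- sv$u[, 1:2] %*% diag(sv$d[1:2]) %*% t(sv$v[, 1:2])  # rank-2 view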