This package provides methods for cluster analysis. It is a much extended version of the original from Peter Rousseeuw, Anja Struyf and Mia Hubert, based on Kaufman and Rousseeuw (1990) "Finding Groups in Data".
This package provides Gaussian mixture models, k-means, mini-batch k-means, k-medoids and affinity propagation clustering with the option to plot, validate, predict (new data) and estimate the optimal number of clusters. The package takes advantage of RcppArmadillo to speed up the computationally intensive parts of the functions. For more information, see
"Clustering in an Object-Oriented Environment" by Anja Struyf, Mia Hubert, Peter Rousseeuw (1997), Journal of Statistical Software, https://doi.org/10.18637/jss.v001.i04;
"Web-scale k-means clustering" by D. Sculley (2010), ACM Digital Library, https://doi.org/10.1145/1772690.1772862;
"Armadillo: a template-based C++ library for linear algebra" by Sanderson et al (2016), The Journal of Open Source Software, https://doi.org/10.21105/joss.00026;
"Clustering by Passing Messages Between Data Points" by Brendan J. Frey and Delbert Dueck, Science 16 Feb 2007: Vol. 315, Issue 5814, pp. 972-976, https://doi.org/10.1126/science.1136800.
This package provides tools for clustering high-dimensional data. In particular, it contains the methods described in <doi:10.1093/bioinformatics/btaa243>, <arXiv:2010.00950>.
Evaluate arbitrary function calls using workers on HPC schedulers in a single line of code. All processing is done on the network without accessing the file system. Remote schedulers are supported via SSH.
Allows clustering of incomplete observations by addressing missing values using multiple imputation. To achieve this, the methodology consists of three steps, following Audigier and Niang 2022 <doi:10.1007/s11634-022-00519-1>. I) Missing data imputation using dedicated models. Four multiple imputation methods are proposed: two are based on joint modelling and two are fully sequential methods, as discussed in Audigier et al. (2021) <doi:10.48550/arXiv.2106.04424>. II) Cluster analysis of imputed data sets. Six clustering methods are available (distance-based or model-based), but custom methods can also be easily used. III) Partition pooling. The set of partitions is aggregated using a Non-negative Matrix Factorization based method. An associated instability measure is computed by bootstrap (see Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>). Among applications, this instability measure can be used to choose a number of clusters with missing values. The package also provides several diagnostic tools to tune the number of imputed data sets, to tune the number of iterations in fully sequential imputation, to check the fit of imputation models, etc.
Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.
Estimates latent class vector-autoregressive models via the EM algorithm on time-series data for model-based clustering and classification. Includes model selection criteria for selecting the number of lags and clusters.
Calculate p-values and confidence intervals using cluster-adjusted t-statistics (based on Ibragimov and Muller (2010) <DOI:10.1198/jbes.2009.08046>), pairs cluster bootstrapped t-statistics, and wild cluster bootstrapped t-statistics (the latter two techniques based on Cameron, Gelbach, and Miller (2008) <DOI:10.1162/rest.90.3.414>). Procedures are included for use with GLM, ivreg, plm (pooling or fixed effects), and mlogit models.
A haplotype is a combination of SNPs (Single Nucleotide Polymorphisms) within a QTL (Quantitative Trait Locus). clusterhap groups together all individuals of a population with the same haplotype. Each group contains individuals with the same allele at each SNP, whether or not some data are missing. Thus, clusterhap groups individuals that, once imputed, have a non-zero probability of having the same alleles across the entire sequence of SNPs. Moreover, clusterhap calculates this probability from relative frequencies.
The design of this package allows us to run different clustering packages and compare their results, to determine which algorithm behaves best on the data provided. See Martos, L.A.P., García-Vico, Á.M., González, P. et al. (2023) <doi:10.1007/s13748-022-00294-2> "Clustering: an R library to facilitate the analysis and comparison of cluster algorithms.", Martos, L.A.P., García-Vico, Á.M., González, P. et al. "A Multiclustering Evolutionary Hyperrectangle-Based Algorithm" <doi:10.1007/s44196-023-00341-3> and Martos, L.A.P., García-Vico, Á.M., González, P. et al. "An Evolutionary Fuzzy System for Multiclustering in Data Streaming" <doi:10.1016/j.procs.2023.12.058>.
Distance measures (GDM1, GDM2, Sokal-Michener, Bray-Curtis, for symbolic interval-valued data), cluster quality indices (Calinski-Harabasz, Baker-Hubert, Hubert-Levine, Silhouette, Krzanowski-Lai, Hartigan, Gap, Davies-Bouldin), data normalization formulas (metric data, interval-valued symbolic data), data generation (typical and non-typical data), HINoV method, replication analysis, linear ordering methods, spectral clustering, agreement indices between two partitions, plot functions (for categorical and symbolic interval-valued data). (MILLIGAN, G.W., COOPER, M.C. (1985) <doi:10.1007/BF02294245>, HUBERT, L., ARABIE, P. (1985) <doi:10.1007/BF01908075>, RAND, W.M. (1971) <doi:10.1080/01621459.1971.10482356>, JAJUGA, K., WALESIAK, M. (2000) <doi:10.1007/978-3-642-57280-7_11>, MILLIGAN, G.W., COOPER, M.C. (1988) <doi:10.1007/BF01897163>, JAJUGA, K., WALESIAK, M., BAK, A. (2003) <doi:10.1007/978-3-642-55721-7_12>, DAVIES, D.L., BOULDIN, D.W. (1979) <doi:10.1109/TPAMI.1979.4766909>, CALINSKI, T., HARABASZ, J. (1974) <doi:10.1080/03610927408827101>, HUBERT, L. (1974) <doi:10.1080/01621459.1974.10480191>, TIBSHIRANI, R., WALTHER, G., HASTIE, T. (2001) <doi:10.1111/1467-9868.00293>, BRECKENRIDGE, J.N. (2000) <doi:10.1207/S15327906MBR3502_5>, WALESIAK, M., DUDEK, A. (2008) <doi:10.1007/978-3-540-78246-9_11>).
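Among the cluster quality indices listed above, the Silhouette index is compact enough to sketch. The following is a minimal Python illustration of the textbook definition, not the package's own code: for each observation, a is the mean distance to the other members of its cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The function name `silhouette_widths`, the Euclidean distance, and the convention of assigning 0 to singleton clusters are assumptions made for the example.

```python
import math
from collections import defaultdict


def silhouette_widths(points, labels):
    """Return one silhouette width per observation (Euclidean distance)."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)
    if len(members) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            # Assumed convention: a singleton cluster gets width 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in members.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [1, 1, 2, 2]
widths = silhouette_widths(points, labels)
mean_width = sum(widths) / len(widths)
print(mean_width)
```

Averaging the widths gives a single score per partition; values close to 1 suggest compact, well-separated clusters.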
This package can be used to estimate the number of clusters in a set of microarray data, as well as test the stability of these clusters.
Calculate statistics that help analyze the clustering tendency of given data. In the first version, the Hopkins statistic is implemented. See Hopkins and Skellam (1954) <doi:10.1093/oxfordjournals.aob.a083391>.
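To make the idea of a clustering-tendency statistic concrete, here is a minimal Python sketch of one common formulation of the Hopkins statistic. It is an illustration under stated assumptions, not the package's implementation: the function names, the uniform sampling over the bounding box, and the convention H = sum(u) / (sum(u) + sum(w)) are choices made for the example. Other formulations raise the distances to a power or swap numerator and denominator.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Distance from `point` to its nearest neighbour in `data`."""
    return min(
        math.dist(point, other)
        for k, other in enumerate(data)
        if k != skip_index
    )


def hopkins(data, m, seed=0):
    """Hopkins statistic under the convention H = U / (U + W).

    W sums nearest-neighbour distances of m sampled data points;
    U sums nearest-data-point distances of m uniform random probes.
    """
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(row[d] for row in data) for d in range(dims)]
    highs = [max(row[d] for row in data) for d in range(dims)]

    sampled = rng.sample(range(len(data)), m)
    w = sum(nearest_distance(data[i], data, skip_index=i) for i in sampled)

    u = 0.0
    for _ in range(m):
        probe = tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        u += nearest_distance(probe, data)
    return u / (u + w)


# Two tight synthetic blobs (made-up data for illustration).
gen = random.Random(1)
blobs = [
    (cx + gen.gauss(0, 0.1), cy + gen.gauss(0, 0.1))
    for cx, cy in [(0.0, 0.0), (10.0, 10.0)]
    for _ in range(20)
]
h = hopkins(blobs, m=10)
print(round(h, 3))
```

Under this convention, values near 0.5 are what spatially random data would give, while values near 1 point to clustered data.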
The clusterCrit package defines several functions which compute internal quality indices or external comparison indices. It provides an implementation of the following external indices: Czekanowski-Dice, Fowlkes-Mallows, Hubert Γ, Jaccard, McNemar, Kulczynski, Phi, Rand, Rogers-Tanimoto, Russel-Rao and Sokal-Sneath. The partitions are specified as an integer vector giving the index of the cluster each observation belongs to.
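Several of the external indices named above are built from the same pair counts, and a short Python sketch shows how partitions given as integer vectors feed into them. This is an illustration under assumptions rather than the package's API: the function names `pair_counts`, `rand_index` and `jaccard_index` are hypothetical, and only the Rand and Jaccard indices are shown.

```python
from itertools import combinations


def pair_counts(p1, p2):
    """Classify every pair of observations by co-membership.

    Returns (yy, yn, ny, nn): pairs grouped together in both partitions,
    only in p1, only in p2, and in neither.
    """
    if len(p1) != len(p2):
        raise ValueError("partitions must label the same observations")
    yy = yn = ny = nn = 0
    for i, j in combinations(range(len(p1)), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            yy += 1
        elif same1:
            yn += 1
        elif same2:
            ny += 1
        else:
            nn += 1
    return yy, yn, ny, nn


def rand_index(p1, p2):
    """Share of pairs on which the two partitions agree."""
    yy, yn, ny, nn = pair_counts(p1, p2)
    return (yy + nn) / (yy + yn + ny + nn)


def jaccard_index(p1, p2):
    """Like the Rand index, but ignoring pairs separated in both."""
    yy, yn, ny, _ = pair_counts(p1, p2)
    denom = yy + yn + ny
    return yy / denom if denom else float("nan")


a = [1, 1, 2, 2, 3]
b = [1, 1, 1, 2, 2]
print(rand_index(a, b), jaccard_index(a, b))
```

Because only co-membership matters, relabelling the clusters of either partition leaves both indices unchanged.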
Streamlining the clustering and visualization of time-series gene expression data from RNA-Seq experiments, this tool supports fuzzy c-means and k-means clustering algorithms. It is compatible with outputs from widely used packages such as 'Seurat', 'Monocle', and 'WGCNA', enabling seamless downstream visualization and analysis. See Lokesh Kumar and Matthias E Futschik (2007) <doi:10.6026/97320630002005> for more details.
Assignment of cell type labels to single-cell RNA sequencing (scRNA-seq) clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search. This is especially challenging when unexpected or poorly described populations are present. The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.
This is a function for validating microarray clusters via reproducibility, based on the paper referenced below.
ClusterJudge implements the functions, examples and other software published as an algorithm by Gibbons, FD and Roth, FP. The article is called "Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation" and it appeared in Genome Research, vol. 12, pp. 1574-1581 (2002). See package?ClusterJudge for an overview.
Estimate and return the needed parameters for visualisations designed for OpenBudgets <http://openbudgets.eu/> data. Calculate cluster analysis measures in Budget data of municipalities across Europe, according to the OpenBudgets data model. It involves a set of techniques and algorithms used to find and divide the data into groups of similar observations. It can also be used more generally to extract visualisation parameters, convert them to JSON format, and use them as input in a different graphical interface.
Integrative context-dependent clustering for heterogeneous biomedical datasets. Identifies local clustering structures in related datasets, and global clusters that exist across the datasets.
This package implements methods to analyze and visualize functional profiles (GO and KEGG) of gene and gene clusters.
Nonparametric rank-based tests (rank-sum tests and signed-rank tests) for clustered data, especially useful for clusters having informative cluster size and intra-cluster group size.
This package provides a collection of data sets for teaching cluster analysis.
This package provides functionality for the analysis of clustered data using the cluster bootstrap.
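The cluster bootstrap mentioned above resamples whole clusters rather than individual observations, so that within-cluster dependence is preserved in each replicate. The following Python sketch illustrates that resampling scheme in its simplest form. It is not the package's interface: the function name `cluster_bootstrap`, its parameters, and the toy data are assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def cluster_bootstrap(values, cluster_ids, statistic, n_boot=1000, seed=0):
    """Return bootstrap replicates of `statistic`, resampling by cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for value, cid in zip(values, cluster_ids):
        by_cluster[cid].append(value)
    clusters = list(by_cluster.values())

    replicates = []
    for _ in range(n_boot):
        # Draw as many clusters as were observed, with replacement,
        # and keep every observation of each drawn cluster.
        resampled = rng.choices(clusters, k=len(clusters))
        pooled = [v for cluster in resampled for v in cluster]
        replicates.append(statistic(pooled))
    return replicates


# Toy data: seven observations in three clusters (made up for illustration).
values = [2.0, 2.5, 3.0, 7.0, 7.5, 1.0, 1.5]
cluster_ids = ["a", "a", "a", "b", "b", "c", "c"]
reps = cluster_bootstrap(values, cluster_ids, mean, n_boot=200)
se = stdev(reps)
print(se)
```

The standard deviation of the replicates serves as a cluster-robust standard error for the statistic. Percentile intervals can be read off the sorted replicates in the same way.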