Application of the PAM (Partitioning Around Medoids) algorithm to samples from single-cell sequencing techniques with a high number of cells (as many as the computer memory allows). The package uses a binary format to store matrices (full, sparse or symmetric) in files written to disk. These files can hold any data type (not just double), which allows the matrices to be manipulated when memory is sufficient to load them as int or float but not as double. The PAM implementation runs in parallel, using several or all of the machine's cores when available. This package shares a great part of its code with the packages jmatrix and parallelpam, but their functionality is included here, so there is no need to install them.
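As an illustration of the clustering step only, the following is a minimal, single-threaded sketch of PAM on a precomputed distance matrix. It is not the package's parallel implementation and does not use its binary matrix format; the function names `total_cost` and `pam` are hypothetical, and the greedy BUILD phase followed by a first-improvement SWAP phase is one textbook way to write the algorithm.

```python
from itertools import product


def total_cost(dist, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def pam(dist, k):
    """Naive PAM on a full distance matrix (list of lists).

    BUILD: greedily add the point that lowers the total cost the most.
    SWAP: replace a medoid by a non-medoid whenever that lowers the cost,
    and stop when no single swap helps.
    """
    n = len(dist)
    medoids = []
    for _ in range(k):
        best = min(
            (i for i in range(n) if i not in medoids),
            key=lambda i: total_cost(dist, medoids + [i]),
        )
        medoids.append(best)

    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(dist, trial)
            if cost < current:
                medoids, current, improved = trial, cost, True

    # Label each point with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: row[m]) for row in dist]
    return sorted(medoids), labels
```

The sketch recomputes the full cost for every candidate swap, so it is only practical for small inputs. Avoiding that cost on large cell counts is what a parallel, memory-aware implementation is for.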
This package implements the algorithm described in Barron, M., Zhang, S. and Li, J. 2017, "A sparse differential clustering algorithm for tracing cell type changes via single-cell RNA-sequencing data", Nucleic Acids Research, gkx1113, <doi:10.1093/nar/gkx1113>. This algorithm clusters samples from two different populations, links the clusters across the conditions and identifies marker genes for these changes. The package was designed for scRNA-Seq data but is also applicable to many other data types; simply replace cells with samples and genes with variables. The package also contains functions for estimating the parameters for SparseDC as outlined in the paper. We recommend that users further select their marker genes using the magnitude of the cluster centers.
This is a collection of functions optimized for working with various kinds of text matrices. Focusing on the text matrix as the primary object - represented either as a base R dense matrix or a Matrix package sparse matrix - allows for a consistent and intuitive interface that stays close to the underlying mathematical foundation of computational text analysis. In particular, the package includes functions for working with word embeddings, text networks, and document-term matrices. Methods developed in Stoltz and Taylor (2019) <doi:10.1007/s42001-019-00048-6>, Taylor and Stoltz (2020) <doi:10.1007/s42001-020-00075-8>, Taylor and Stoltz (2020) <doi:10.15195/v7.a23>, and Stoltz and Taylor (2021) <doi:10.1016/j.poetic.2021.101567>.
This package provides a Bayesian Nonparametric model for the study of time-evolving frequencies, which has become renowned in the study of population genetics. The model consists of a Hidden Markov Model (HMM) in which the latent signal is a distribution-valued stochastic process that takes the form of a finite mixture of Dirichlet Processes, indexed by vectors that count how many times each value is observed in the population. The package implements methodologies presented in Ascolani, Lijoi and Ruggiero (2021) <doi:10.1214/20-BA1206> and Ascolani, Lijoi and Ruggiero (2023) <doi:10.3150/22-BEJ1504> that make it possible to study the process at the time of data collection or to predict its evolution in the future or in the past.
Allows the user to generate a list of features (gene, pseudo, RNA, CDS, and/or UTR) directly from the NCBI database for any species with a current build available. An option to save the downloaded and formatted files is available, and the user can prioritize the feature list based on the type and assembly builds present in the current build used. The user can then use the generated list of features, or provide a list, to map a set of markers (designed for SNP markers with a single base pair position available) to the closest feature based on the map build. This function requires the map positions of the markers to be provided, and the positions should be based on the build being queried through NCBI.
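The marker-to-feature mapping step can be pictured with the short sketch below. It assumes features are already available as records with a chromosome, start, end and name, and markers as single base pair positions. The field names and the functions `distance_to_feature` and `map_markers` are hypothetical and are not the package's interface. Distance is taken as zero when the marker lies inside a feature.

```python
def distance_to_feature(pos, start, end):
    """Base pairs between a position and a feature interval (0 if inside)."""
    if start <= pos <= end:
        return 0
    return start - pos if pos < start else pos - end


def map_markers(markers, features):
    """Map each marker to the closest feature on the same chromosome.

    markers: dict of marker name -> (chromosome, position)
    features: list of dicts with keys "chrom", "start", "end", "name"
    Returns marker name -> (feature name, distance), or None when the
    chromosome carries no feature.
    """
    result = {}
    for marker, (chrom, pos) in markers.items():
        candidates = [f for f in features if f["chrom"] == chrom]
        if not candidates:
            result[marker] = None
            continue
        best = min(
            candidates,
            key=lambda f: distance_to_feature(pos, f["start"], f["end"]),
        )
        result[marker] = (
            best["name"],
            distance_to_feature(pos, best["start"], best["end"]),
        )
    return result
```

The positions of markers and features must refer to the same build for the distances to be meaningful, which is the reason for the requirement stated above.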
Constructs principal surfaces, which are two-dimensional surfaces that pass through the middle of a p-dimensional data set. They minimise the distance from the data points and provide a nonlinear summary of the data. The surfaces are nonparametric and their shape is suggested by the data. A surface is found using an iterative procedure that starts with a linear summary, typically a principal component plane. Each successive iteration is a local average of the p-dimensional points, where an average is based on a projection of a point onto the nonlinear surface of the previous iteration. For more information on principal surfaces, see Ganey, R. (2019, "https://open.uct.ac.za/items/4e655d7d-d10c-481b-9ccc-801903aebfc8").
The gradual release of active substances from packaging can enhance food preservation by maintaining high concentrations of polyphenols and antioxidants for a period of 72 hrs. To assess the effectiveness of packaging materials that serve as carriers for antioxidants, it is crucial to model the diffusivity of the active agents. Understanding this diffusivity helps evaluate the packaging's capacity to prolong the shelf life of food items. The process of migration, which encompasses diffusion, dissolution, and reaching equilibrium, facilitates the transfer of low molecular weight compounds from the packaging into food simulants. The rate at which these active compounds are released from the packaging is typically analysed using food simulants under conditions outlined in European food packaging regulations (Ramos et al., 2014).
The fossil record is a joint expression of ecological, taphonomic, evolutionary, and stratigraphic processes (Holland and Patzkowsky, 2012, ISBN:978-0226649382). This package allows users to simulate biological processes in the time domain (e.g., trait evolution, fossil abundance, phylogenetic trees) and to examine how their expression in the rock record (stratigraphic domain) is influenced by age-depth models, ecological niche models, and taphonomic effects. Functions simulating common processes used in modeling trait evolution, biostratigraphy or event type data such as first/last occurrences are provided and can be used standalone or as part of a pipeline. The package comes with example data sets and tutorials in several vignettes, which can be used as a template to set up one's own simulation.
This package provides tools for developing and validating prediction models, estimating the expected survival of patients, and visualizing the results graphically. Most of the implemented methods are based on penalized regressions such as: the lasso (Tibshirani R (1996)), the elastic net (Zou H et al. (2005) <doi:10.1111/j.1467-9868.2005.00503.x>), the adaptive lasso (Zou H (2006) <doi:10.1198/016214506000000735>), the stability selection (Meinshausen N et al. (2010) <doi:10.1111/j.1467-9868.2010.00740.x>), some extensions of the lasso (Ternes et al. (2016) <doi:10.1002/sim.6927>), some methods for the interaction setting (Ternes N et al. (2016) <doi:10.1002/bimj.201500234>), or others. A function for generating simulated survival data sets is also provided.
This package provides methods for estimation and hypothesis testing of proportions in group testing designs: methods for estimating a proportion in a single population (assuming sensitivity and specificity equal to 1 in designs with equal group sizes), as well as hypothesis tests and functions for experimental design for this situation. For estimating one proportion or the difference of proportions, a number of confidence interval methods are included, which can deal with various different pool sizes. Further, regression methods are implemented for simple pooling and matrix pooling designs. Methods for identification of positive items in group testing designs: Optimal testing configurations can be found for hierarchical and array-based algorithms. Operating characteristics can be calculated for testing configurations across a wide variety of situations.
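For the single-population case with equal group sizes and a perfect assay (sensitivity and specificity equal to 1), the standard maximum likelihood estimate of the individual-level proportion has a closed form: if y of n groups of size s test positive, the estimate is 1 - (1 - y/n)^(1/s). The sketch below computes only this point estimate. The function name `pooled_proportion_mle` is hypothetical and is not taken from the package, and no confidence interval is attempted.

```python
def pooled_proportion_mle(n_groups, n_positive, group_size):
    """MLE of the individual proportion from equally sized pooled tests.

    A group tests negative only if all of its members are negative, so
    P(group negative) = (1 - p) ** group_size. Setting this equal to the
    observed fraction of negative groups and solving for p gives the MLE.
    Assumes a perfect test (sensitivity = specificity = 1).
    """
    if n_groups <= 0 or group_size <= 0:
        raise ValueError("n_groups and group_size must be positive")
    if not 0 <= n_positive <= n_groups:
        raise ValueError("n_positive must lie between 0 and n_groups")
    negative_fraction = 1.0 - n_positive / n_groups
    return 1.0 - negative_fraction ** (1.0 / group_size)
```

With a group size of 1 the estimate reduces to the ordinary sample proportion y/n, which is a quick sanity check on the formula.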
This package provides tools for building Rescorla-Wagner Models for Two-Alternative Forced Choice tasks, commonly employed in psychological research. Most concepts and ideas within this R package are referenced from Sutton and Barto (2018) <ISBN:9780262039246>. The package allows for the intuitive definition of RL models using simple if-else statements and three basic models built into this R package are referenced from Niv et al. (2012) <doi:10.1523/JNEUROSCI.5498-10.2012>. Our approach to constructing and evaluating these computational models is informed by the guidelines proposed in Wilson & Collins (2019) <doi:10.7554/eLife.49547>. Example datasets included with the package are sourced from the work of Mason et al. (2024) <doi:10.3758/s13423-023-02415-x>.
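To make the modelling idea concrete, here is a minimal sketch of a Rescorla-Wagner learner in a two-alternative task. The names `rw_update`, `prob_choose_first` and `simulate`, the reward probabilities, and the softmax choice rule with inverse temperature `beta` are all assumptions made for illustration. They are not the package's API and do not reproduce its three built-in models.

```python
import math
import random


def rw_update(value, reward, alpha):
    """Rescorla-Wagner rule: move the value toward the reward by a fraction alpha."""
    return value + alpha * (reward - value)


def prob_choose_first(v_first, v_second, beta):
    """Two-option softmax, written as a logistic function of the value difference."""
    return 1.0 / (1.0 + math.exp(-beta * (v_first - v_second)))


def simulate(n_trials, alpha, beta, reward_probs=(0.8, 0.2), seed=1):
    """Simulate choices of an agent that learns the value of two options."""
    rng = random.Random(seed)
    values = [0.0, 0.0]
    choices = []
    for _ in range(n_trials):
        p_first = prob_choose_first(values[0], values[1], beta)
        choice = 0 if rng.random() < p_first else 1
        reward = 1.0 if rng.random() < reward_probs[choice] else 0.0
        # Only the chosen option is updated; the other keeps its value.
        values[choice] = rw_update(values[choice], reward, alpha)
        choices.append(choice)
    return values, choices
```

Fitting such a model to data would mean searching for the `alpha` and `beta` that make the observed choices most likely, which is the kind of workflow the Wilson & Collins (2019) guidelines address.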
Given a likelihood provided by the user, this package applies it to a given matrix dataset in order to find change points in the data that maximize the sum of the likelihoods of all the segments. This package provides a handful of algorithms with different time complexities and assumption compromises so the user is able to choose the best one for the problem at hand. The implementation of the segmentation algorithms in this package is based on the paper by Bruno M. de Castro, Florencia Leonardi (2018) <arXiv:1501.01756>. The Berlin weather sample dataset was provided by Deutscher Wetterdienst <https://dwd.de/>. You can find all the references in the Acknowledgments section of this package's repository via the URL below.
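One way to maximize a sum of per-segment likelihoods exactly is dynamic programming over the end position of the last segment, sketched below. This is a generic quadratic-time formulation and is not claimed to match any specific algorithm in the package. For brevity it works on a one-dimensional list rather than a matrix. The functions `segment` and `penalized_gaussian` are hypothetical names, and the per-segment penalty in the example likelihood is an assumption added so that splitting every point into its own segment is not optimal.

```python
def segment(data, likelihood):
    """Exact segmentation maximizing the sum of segment likelihoods.

    best[end] holds the best total score for data[:end]; prev[end] holds
    the start of the last segment in that optimal solution.
    """
    n = len(data)
    best = [0.0] + [float("-inf")] * n
    prev = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start] + likelihood(data[start:end])
            if score > best[end]:
                best[end], prev[end] = score, start

    # Walk back through the segment starts to recover the change points.
    change_points = []
    pos = n
    while pos > 0:
        pos = prev[pos]
        if pos > 0:
            change_points.append(pos)
    return sorted(change_points), best[n]


def penalized_gaussian(values, penalty=1.0):
    """Example likelihood: negative squared error around the mean, minus a penalty."""
    mean = sum(values) / len(values)
    return -sum((x - mean) ** 2 for x in values) - penalty
```

Each returned change point is the index at which a new segment starts. The double loop evaluates the likelihood on every contiguous slice, which is why faster, approximate alternatives are useful on long inputs.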
Statistical procedures to perform stability analysis in plant breeding and to identify stable genotypes under diverse environments. It is possible to calculate coefficient of homeostaticity by Khangildin et al. (1979), variance of specific adaptive ability by Kilchevsky & Khotyleva (1989), weighted homeostaticity index by Martynov (1990), steadiness of stability index by Udachin (1990), superiority measure by Lin & Binns (1988) <doi:10.4141/cjps88-018>, regression on environmental index by Eberhart & Russell (1966) <doi:10.2135/cropsci1966.0011183X000600010011x>, Tai's (1971) stability parameters <doi:10.2135/cropsci1971.0011183X001100020006x>, stability variance by Shukla (1972) <doi:10.1038/hdy.1972.87>, ecovalence by Wricke (1962), nonparametric stability parameters by Nassar & Huehn (1987) <doi:10.2307/2531947>, Francis & Kannenberg's parameters of stability (1978) <doi:10.4141/cjps78-157>.
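As one worked example from this list, Wricke's ecovalence of genotype i is the sum over environments j of the squared interaction residual (y_ij - mean_i - mean_j + grand mean)^2, where y_ij is the genotype's mean performance in that environment. The sketch below computes it from a complete genotype-by-environment table of means. The function name `ecovalence` is hypothetical and the code is not taken from the package.

```python
def ecovalence(table):
    """Wricke's ecovalence for each genotype.

    table: list of rows, one per genotype, each row holding the genotype's
    mean in every environment (complete table, no missing cells).
    A smaller value means a smaller contribution to genotype-by-environment
    interaction, i.e. a more stable genotype.
    """
    n_gen = len(table)
    n_env = len(table[0])
    grand = sum(sum(row) for row in table) / (n_gen * n_env)
    gen_means = [sum(row) / n_env for row in table]
    env_means = [
        sum(table[i][j] for i in range(n_gen)) / n_gen for j in range(n_env)
    ]
    return [
        sum(
            (table[i][j] - gen_means[i] - env_means[j] + grand) ** 2
            for j in range(n_env)
        )
        for i in range(n_gen)
    ]
```

A purely additive table, in which genotypes differ by a constant across all environments, yields an ecovalence of zero for every genotype.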
This package provides a tool for matching ICD-10 codes to corresponding Clinical Classification Software Refined (CCSR) codes. The main function, CCSRfind(), identifies each CCSR code that applies to an individual given their diagnosis codes. It also provides a summary of CCSR codes that are matched to a dataset. The package contains 3 datasets: DXCCSR (mapping of ICD-10 codes to CCSR codes), Legend (conversion of DXCCSR to CCSRfind-usable format for CCSR codes with less than or equal to 1000 ICD-10 diagnosis codes), and LegendExtend (conversion of DXCCSR to CCSRfind-usable format for CCSR codes with more than 1000 ICD-10 dx codes). The disc() function applies grepl() ('base') to multiple columns and is used in CCSRfind().
The futurize() function transpiles calls to sequential map-reduce functions such as base::lapply(), purrr::map(), foreach::foreach() %do% ... into concurrent alternatives, providing you with a simple, straightforward path to scalable parallel computing via the future ecosystem <doi:10.32614/RJ-2021-048>. By combining this function with R's native pipe operator, you have a convenient way of speeding up iterative computations with minimal refactoring, e.g. 'lapply(xs, fcn) |> futurize()', 'purrr::map(xs, fcn) |> futurize()', and 'foreach::foreach(x = xs) %do% fcn(x) |> futurize()'. Other map-reduce packages that can be "futurized" are the 'BiocParallel', 'plyr', and 'crossmap' packages. There is also support for a growing set of domain-specific packages, including 'boot', 'glmnet', 'mgcv', 'lme4', and 'tm'.
Estimation of the generalized beta distribution of the second kind (GB2) and related models using grouped data in form of income shares. The GB2 family is a general class of distributions that provides an accurate fit to income data. GB2group includes functions to estimate the GB2, the Singh-Maddala, the Dagum, the Beta 2, the Lognormal and the Fisk distributions. GB2group deploys two different econometric strategies to estimate these parametric distributions, the equally weighted minimum distance (EWMD) estimator and the optimally weighted minimum distance (OMD) estimator. Asymptotic standard errors are reported for the OMD estimates. Standard errors of the EWMD estimates are obtained by Monte Carlo simulation. See Jorda et al. (2018) <arXiv:1808.09831> for a detailed description of the estimation procedure.
Nonparametric methods for landmark prediction of long-term survival outcomes, incorporating covariate and short-term event information. The package supports the construction of flexible varying-coefficient models that use discrete covariates, as well as multiple continuous covariates. The goal is to improve prediction accuracy when censored short-term events are available as predictors, using robust nonparametric procedures that do not require correct model specification and avoid restrictive parametric assumptions found in alternative methods. More information on these methods can be found in Parast et al. 2012 <doi:10.1080/01621459.2012.721281>, Parast et al. 2011 <doi:10.1002/bimj.201000150>, and Parast and Cai 2013 <doi:10.1002/sim.5776>. A tutorial for this package is available here: <https://www.laylaparast.com/landpred>.
Computing statistical hypothesis testing for loading in principal component analysis (PCA) (Yamamoto, H. et al. (2014) <doi:10.1186/1471-2105-15-51>), orthogonal smoothed PCA (OS-PCA) (Yamamoto, H. et al. (2021) <doi:10.3390/metabo11030149>), one-sided kernel PCA (Yamamoto, H. (2023) <doi:10.51094/jxiv.262>), partial least squares (PLS) and PLS discriminant analysis (PLS-DA) (Yamamoto, H. et al. (2009) <doi:10.1016/j.chemolab.2009.05.006>), PLS with rank order of groups (PLS-ROG) (Yamamoto, H. (2017) <doi:10.1002/cem.2883>), regularized canonical correlation analysis discriminant analysis (RCCA-DA) (Yamamoto, H. et al. (2008) <doi:10.1016/j.bej.2007.12.009>), multiset PLS and PLS-ROG (Yamamoto, H. (2022) <doi:10.1101/2022.08.30.505949>).
This package performs the O2PLS data integration method for two datasets, yielding joint and data-specific parts for each dataset. The algorithm automatically switches to a memory-efficient approach to fit O2PLS to high dimensional data. It provides a rigorous and a faster alternative cross-validation method to select the number of components, as well as functions to report proportions of explained variation and to construct plots of the results. See the software article by el Bouhaddani et al (2018) <doi:10.1186/s12859-018-2371-3>, and Trygg and Wold (2003) <doi:10.1002/cem.775>. It also performs Sparse Group (Penalized) O2PLS, see Gu et al (2020) <doi:10.1186/s12859-021-03958-3> and cross-validation for the degree of sparsity.
This package provides a standardized and reproducible framework for characterizing and classifying discrete color classes from digital images of biological organisms. The package automatically determines the presence or absence of 10 human-visible color categories (black, blue, brown, green, grey, orange, purple, red, white, yellow) using a biologically-inspired Color Look-Up Table (CLUT) that partitions HSV color space. Supports both fully automated and semi-automated (interactive) workflows with complete provenance tracking for reproducibility. Pre-processes images using the recolorize package (Weller et al. 2024 <doi:10.1111/ele.14378>) for spatial-color binning, and integrates with pavo (Maia et al. 2019 <doi:10.1111/2041-210X.13174>) for color pattern geometry statistics. Designed for high-throughput analysis and seamless integration with downstream evolutionary analyses.
The hybrid model is a highly effective forecasting approach that integrates decomposition techniques with machine learning to enhance time series prediction accuracy. Each decomposition technique breaks down a time series into multiple intrinsic mode functions (IMFs), which are then individually modeled and forecasted using machine learning algorithms. The final forecast is obtained by aggregating the predictions of all IMFs, producing an ensemble output for the time series. The performance of the developed models is evaluated using international monthly maize price data, assessed through metrics such as root mean squared error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). For method details see Choudhary, K. et al. (2023). <https://ssca.org.in/media/14_SA44052022_R3_SA_21032023_Girish_Jha_FINAL_Finally.pdf>.
Novel method to unbiasedly include studies with Non-statistically Significant Unreported Effects (NSUEs) in a meta-analysis. First, the function calculates the interval where the unreported effects (e.g., t-values) should be according to the threshold of statistical significance used in each study. Afterward, the method uses maximum likelihood techniques to impute the expected effect size of each study with NSUEs, accounting for between-study heterogeneity and potential covariates. Multiple imputations of the NSUEs are then randomly created based on the expected value, variance, and statistical significance bounds. Finally, it conducts a restricted-maximum likelihood random-effects meta-analysis separately for each set of imputations, and it performs estimations from these meta-analyses. Please read the reference in metansue for details of the procedure.
This package provides a collection of white noise hypothesis tests for functional time series and related visualizations. These include tests based on the norms of autocovariance operators that are built under both strong and weak white noise assumptions. Additionally, tests based on the spectral density operator and on principal component dimensional reduction are included, which are built under strong white noise assumptions. Also, this package provides goodness-of-fit tests for functional autoregressive models of order 1. These methods are described in Kokoszka et al. (2017) <doi:10.1016/j.jmva.2017.08.004>, Characiejus and Rice (2019) <doi:10.1016/j.ecosta.2019.01.003>, Gabrys and Kokoszka (2007) <doi:10.1198/016214507000001111>, and Kim et al. (2023) <doi:10.1214/23-SS143> respectively.
This package offers extensive tools for phylogenetic analysis. It focuses on phylogenetic comparative biology but also includes methods for visualizing, analyzing, manipulating, reading, writing, and inferring phylogenetic trees. Functions for comparative biology include ancestral state reconstruction, model fitting, and phylogeny and trait data simulation. A broad range of plotting methods includes mapping trait evolution on trees, projecting trees into phenotype space or geographic maps, and visualizing correlated speciation between trees. Additional functions allow for reading, writing, analyzing, inferring, simulating, and manipulating phylogenetic trees and comparative data. Examples include computing consensus trees, simulating trees and data under various models, and attaching species or clades to a tree either randomly or non-randomly. This package provides numerous tools for tree manipulations and analyses that are valuable for phylogenetic research.