Check your R code for some of the most common layout flaws. Many have tried to teach us how to write less dreadful code, be it implicitly, as B. W. Kernighan and D. M. Ritchie (1988) <ISBN:0-13-110362-8> did in The C Programming Language, or explicitly, as R. C. Martin (2008) <ISBN:0-13-235088-2> did in Clean Code: A Handbook of Agile Software Craftsmanship. So we should check our code for files that are too long or wide, and for functions with too many lines, too-wide lines, too many arguments, or too many levels of nesting. Note: this is not a static code analyzer like pylint or the like. Check out <https://cran.r-project.org/package=lintr> instead.
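For actual static analysis, the suggested lintr package is invoked like this (a minimal sketch; the file name is a placeholder):

    library(lintr)
    lint("analysis.R")  # flags style, syntax, and layout issues line by line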
Utilities to read and write files in the FITS (Flexible Image Transport System) format, a standard format in astronomy (see e.g. <https://en.wikipedia.org/wiki/FITS> for more information). Present low-level routines allow: reading, parsing, and modifying FITS headers; reading FITS images (multi-dimensional arrays); reading FITS binary and ASCII tables; and writing FITS images (multi-dimensional arrays). Higher-level functions allow: reading files composed of one or more headers and a single (perhaps multidimensional) image or single table; reading tables into data frames; generating vectors for image array axes; and scaling and writing images as 16-bit integers. Known limitations are reading random group extensions, as well as complex and array descriptor data types in binary tables.
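A minimal read/write sketch (readFITS() and writeFITSim() are the package's core routines; the file names are placeholders, and the imDat component holding the image array follows the package documentation):

    library(FITSio)
    d <- readFITS("image.fits")                # parse the header and read the image
    dim(d$imDat)                               # the multi-dimensional image array
    writeFITSim(d$imDat, file = "copy.fits")   # write the array to a new FITS image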
Models for non-linear time series analysis and causality detection. The main functionalities of this package consist of an implementation of the classical Granger causality test (C.W.J. Granger 1980) <doi:10.1016/0165-1889(80)90069-X> and a non-linear version of it based on feed-forward neural networks. This package also contains an implementation of the Transfer Entropy <doi:10.1103/PhysRevLett.85.461>, and of the continuous Transfer Entropy using an approximation based on the k-nearest neighbors <doi:10.1103/PhysRevE.69.066138>. There are also some other useful tools, like the VARNN (Vector Auto-Regressive Neural Network) prediction model, the augmented Dickey-Fuller test of stationarity, and the discrete and continuous entropy and mutual information.
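A brief sketch of the classical test on synthetic data (the summary() accessor on the returned object is an assumption to check against the manual):

    library(NlinTS)
    x <- rnorm(200)
    y <- 0.5 * c(0, head(x, -1)) + rnorm(200)   # y depends on lagged x
    res <- causality.test(y, x, 2)              # classical Granger test, lag order 2
    res$summary()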
Fits, by approximate Bayesian computation (ABC), the parameters of a stochastic process modelling the phylogeny and evolution of a suite of traits following the tree. The user may define an arbitrary Markov process for the trait and phylogeny. Importantly, trait-dependent speciation models are handled and fitted to data. See K. Bartoszek, P. Lio (2019) <doi:10.5506/APhysPolBSupp.12.25>. The suggested geiger package can be obtained from CRAN's archive <https://cran.r-project.org/src/contrib/Archive/geiger/>; it is suggested to take the latest version. Otherwise, the code pcmabc requires from it is present in the pcmabc package. The suggested distory package can likewise be obtained from CRAN's archive <https://cran.r-project.org/src/contrib/Archive/distory/>, again taking the latest version.
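A hypothetical call sketch (PCM_ABC() and all argument names here are assumptions, to be checked against the package manual):

    library(pcmabc)
    # fit <- PCM_ABC(phyltree = tree, phenotypedata = traits, par0 = start_params,
    #                phenotype.model = simulate_traits, fbirth = birth_rate)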
The creation of effective visualizations is a fundamental component of data analysis. In biomedical research, new challenges are emerging to visualize multi-dimensional data in a 2D space, but current data visualization tools have limited capabilities. To address this problem, we leverage Gestalt principles to improve the design and interpretability of multi-dimensional data in 2D data visualizations, layering aesthetics to display multiple variables. The proposed visualization can be applied to spatially-resolved transcriptomics data, but also broadly to data visualized in 2D space, such as embedding visualizations. We provide this open-source R package escheR, which is built on the state-of-the-art ggplot2 visualization framework and can be seamlessly integrated into genomics toolboxes and workflows.
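A minimal layering sketch (make_escheR(), add_fill(), and add_ground() are escheR's layering verbs; spe stands for a SpatialExperiment object, and the variable names are placeholders):

    library(escheR)
    make_escheR(spe) |>
      add_fill(var = "gene_expr") |>   # fill color encodes a continuous variable
      add_ground(var = "cluster")      # outline color encodes a discrete variable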
Broadly useful, convenient, and efficient R functions that bring users concise and elegant R data analyses. This package includes easy-to-use functions for (1) basic R programming (e.g., set working directory to the path of the currently opened file; import/export data from/to files in any format; print tables to Microsoft Word); (2) multivariate computation (e.g., compute scale sums/means/... with reverse scoring); (3) reliability analyses and factor analyses; (4) descriptive statistics and correlation analyses; (5) t-test, multi-factor analysis of variance (ANOVA), simple-effect analysis, and post-hoc multiple comparison; (6) tidy report of statistical models (to R Console and Microsoft Word); (7) mediation and moderation analyses (PROCESS); and (8) an additional toolbox for statistics and graphics.
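A few illustrative calls (the data file and variable names are placeholders):

    library(bruceR)
    set.wd()                                   # set working directory to this file's path
    d <- import("data.sav")                    # import data from (almost) any format
    Describe(d)                                # descriptive statistics
    Corr(d[c("x", "m", "y")])                  # correlation analysis
    PROCESS(d, y = "y", x = "x", meds = "m")   # simple mediation analysis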
Supervised learning from a source distribution (with known segmentation into cell sub-populations) to fit a target distribution with unknown segmentation. It relies on regularized optimal transport to directly estimate the different cell population proportions from a biological sample characterized with flow cytometry measurements. It is based on the regularized Wasserstein metric to compare cytometry measurements from different samples, thus accounting for possible mis-alignment of a given cell population across samples (due to technical variability of the measurement technology). This supervised learning technique, based on the Wasserstein metric, is used to estimate an optimal re-weighting of class proportions in a mixture model. Details are presented in Freulon P, Bigot J and Hejblum BP (2023) <doi:10.1214/22-AOAS1660>.
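A hypothetical call sketch (the CytOpt() name and its arguments are assumptions; X_source/X_target stand for cytometry measurement matrices and lab_source for the known source segmentation):

    library(CytOpT)
    # est <- CytOpt(X_s = X_source, X_t = X_target, Lab_source = lab_source)
    # est$proportions   # estimated cell-population proportions in the target sample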
Inference using a class of Hidden Markov models (HMMs) called 'oHMMed' (ordered HMM with emission densities; <doi:10.1186/s12859-024-05751-4>): the oHMMed algorithms identify the number of comparably homogeneous regions within observed sequences with autocorrelation patterns. These are modelled as discrete hidden states; the observed data points are then realisations of continuous probability distributions with state-specific means that enable ordering of these distributions. The observed sequence is labelled according to the hidden states, permitting only neighbouring states that are also neighbours within the ordering of their associated distributions. The parameters that characterise these state-specific distributions are then inferred. Relevant for application to genomic sequences, time series, or any other sequence data with serial autocorrelation.
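A hypothetical sketch for normal emissions (hmm_mcmc_normal() matches the naming in the accompanying paper, but the arguments shown are assumptions):

    library(oHMMed)
    # res <- hmm_mcmc_normal(data = obs_seq, prior_T = trans_prior,
    #                        prior_means = c(-1, 0, 1), prior_sd = 1, iter = 5000)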
This package provides a high-level R interface to CoreArray Genomic Data Structure (GDS) data files, which are portable across platforms with a hierarchical structure to store multiple scalable array-oriented data sets with metadata information. It is suited for large-scale datasets, especially for data which are much larger than the available random-access memory. The gdsfmt package offers efficient operations specifically designed for integers of less than 8 bits, since a diploid genotype, like a single-nucleotide polymorphism (SNP), usually occupies fewer bits than a byte. Data compression and decompression are available with relatively efficient random access. A GDS file can also be read in parallel by multiple R processes, supported by the parallel package.
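A small, self-contained example of the core verbs:

    library(gdsfmt)
    f <- createfn.gds("test.gds")              # create a new GDS file
    add.gdsn(f, "genotype", val = matrix(0:3, nrow = 2),
             storage = "bit2")                 # 2 bits per entry, as for genotypes
    read.gdsn(index.gdsn(f, "genotype"))       # random-access read of the array
    closefn.gds(f)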
Estimating and analyzing autoregressive integrated moving average (ARIMA) models. The primary function in this package is arima(), which fits an ARIMA model to univariate time series data using a random restart algorithm. This approach frequently leads to models with a likelihood greater than or equal to that obtained by fitting the same model using the arima() function from the stats package. This package enables proper optimization of model likelihoods, which is a necessary condition for performing likelihood ratio tests. This package relies heavily on the source code of the arima() function of the stats package. For more information, please see Jesse Wheeler and Edward L. Ionides (2023) <doi:10.48550/arXiv.2310.01198>.
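A minimal sketch (this package's arima() masks stats::arima and keeps its interface):

    library(arima2)
    set.seed(1)
    x <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 200)
    fit <- arima(x, order = c(1, 0, 1))   # random-restart fit of an ARMA(1,1)
    logLik(fit)                           # typically >= the stats::arima likelihood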
This package provides the dose transition pathways (DTP) to project in advance the doses recommended by a model-based design for subsequent patients (stay, escalate, de-escalate, or stop early) using all the accumulated toxicity information; see Yap et al. (2017) <doi:10.1158/1078-0432.CCR-17-0582>. DTP can be used as a design and an operational tool and can be displayed as a table or flow diagram. The dtpcrm package also provides the modified continual reassessment method (CRM) and time-to-event CRM (TITE-CRM) with added practical considerations to allow stopping early when there is sufficient evidence that the lowest dose is too toxic and/or there is a sufficient number of patients dosed at the maximum tolerated dose.
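A hypothetical call sketch (calculate_dtps() and its arguments are assumptions, to be checked against the package manual):

    library(dtpcrm)
    # paths <- calculate_dtps(next_dose = 2, cohort_sizes = c(3, 3, 3),
    #                         prior = skeleton, target = 0.25)   # projected pathways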
Estimate Gaussian graphical models with nonconvex penalties <doi:10.31234/osf.io/ad57p>, including the atan penalty (Wang and Zhu, 2016) <doi:10.1155/2016/6495417>, seamless L0 (Dicker, Huang, and Lin, 2013) <doi:10.5705/ss.2011.074>, exponential (Wang, Fan, and Zhu) <doi:10.1007/s10463-016-0588-3>, smooth integration of counting and absolute deviation (Lv and Fan, 2009) <doi:10.1214/09-AOS683>, logarithm (Mazumder, Friedman, and Hastie, 2011) <doi:10.1198/jasa.2011.tm09738>, Lq, smoothly clipped absolute deviation (Fan and Li, 2001) <doi:10.1198/016214501753382273>, and the minimax concave penalty (Zhang, 2010) <doi:10.1214/09-AOS729>. There are also extensions for computing variable inclusion probabilities, multiple regression coefficients, and statistical inference <doi:10.1214/15-EJS1031>.
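A minimal sketch (ggmncv() is the main fitting function; my_data stands for an n x p data matrix, and the P element name for the partial-correlation matrix is an assumption):

    library(GGMncv)
    fit <- ggmncv(cor(my_data), n = nrow(my_data),
                  penalty = "atan")   # nonconvex-penalized precision matrix
    fit$P                             # sparsified partial-correlation network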
Researchers have been using simulated data from a multivariate linear model to compare and evaluate different methods, ideas and models. Additionally, teachers and educators have been using a simulation tool to demonstrate and teach various statistical and machine learning concepts. This package helps users to simulate linear model data with a wide range of properties by tuning a few parameters, such as the relevant latent components. In addition, a shiny app as an RStudio gadget gives users a simple interface for using the simulation function. See more in: Sæbø, S., Almøy, T., Helland, I.S. (2015) <doi:10.1016/j.chemolab.2015.05.012> and Rimal, R., Almøy, T., Sæbø, S. (2018) <doi:10.1016/j.chemolab.2018.02.009>.
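A minimal sketch (argument roles per the cited papers: q of the p predictors are relevant, relpos gives the relevant latent components, gamma controls eigenvalue decay; the X/Y element names are assumptions):

    library(simrel)
    sim <- simrel(n = 100, p = 10, q = 5, relpos = c(1, 2),
                  R2 = 0.8, gamma = 0.5)   # simulate responses and predictors
    dim(sim$X); length(sim$Y)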
Produces empirical best linear unbiased predictions (EBLUPs) for zero-inflated data and their relative standard errors. Small Area Estimation with Zero-Inflated Model (SAE-ZIP) is a model developed for zero-inflated data, which can otherwise lead to overdispersion; the model was created to handle this situation. The model in this package is based on the Small Area Estimation with Zero-Inflated Poisson model proposed by Dian Christien Arisona (2018) <https://repository.ipb.ac.id/handle/123456789/92308>. For the data sample itself, we use a combination of the methods of Roberto Benavent and Domingo Morales (2015) <doi:10.1016/j.csda.2015.07.013> and Sabine Krieg, Harm Jan Boonstra and Marc Smeets (2016) <doi:10.1515/jos-2016-0051>.
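For reference, the zero-inflated Poisson distribution underlying the model mixes a point mass at zero with a Poisson count; in base R (not this package's API):

    dzip <- function(y, p0, lambda) {          # P(Y=y) = p0*1{y=0} + (1-p0)*Pois(y; lambda)
      ifelse(y == 0, p0 + (1 - p0) * dpois(0, lambda), (1 - p0) * dpois(y, lambda))
    }
    dzip(0, p0 = 0.3, lambda = 2)              # inflated probability of zeros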
This package provides a differential abundance analysis for the comparison of two or more conditions. Useful for analyzing data from standard RNA-seq or meta-RNA-seq assays as well as selected and unselected values from in-vitro sequence selections. Uses a Dirichlet-multinomial model to infer abundance from counts, optimized for three or more experimental replicates. The method infers biological and sampling variation to calculate the expected false discovery rate, given the variation, based on a Wilcoxon Rank Sum test and Welch's t-test, a Kruskal-Wallis test, a generalized linear model, or a correlation test. All tests report p-values and Benjamini-Hochberg corrected p-values. ALDEx2 also calculates expected standardized effect sizes for paired or unpaired study designs.
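A minimal sketch (aldex() is the package's wrapper; reads stands for a feature-by-sample count matrix with seven samples per condition):

    library(ALDEx2)
    conds <- c(rep("A", 7), rep("B", 7))
    res <- aldex(reads, conds, test = "t", effect = TRUE)  # Welch + Wilcoxon module
    head(res[res$wi.eBH < 0.05, ])                         # BH-corrected Wilcoxon hits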
Used in testing if the indirect effect from linear regression mediation analysis is equal to 0. Includes established methods such as the Sobel test, the joint significance test (maxP), and tests based on the distribution of the product of normal random variables. Additionally, this package adds more powerful tests based on intersection-union theory. These tests are the S-test, the ps-test, and the ascending squares test. These new methods are uniformly more powerful than maxP, which is more powerful than Sobel and less anti-conservative than the product of normal random variables. These methods are explored in Kidd and Lin (2024) <doi:10.1007/s12561-023-09386-6> and Kidd et al. (2025) <doi:10.1007/s10260-024-00777-7>.
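For reference, the classical Sobel statistic that these tests improve upon can be computed by hand from the two regression stages (a base-R sketch on a placeholder data frame d, not this package's API):

    fit_a <- lm(m ~ x, data = d)                   # stage 1: x -> mediator
    fit_b <- lm(y ~ m + x, data = d)               # stage 2: mediator -> outcome
    a <- coef(fit_a)["x"]; sa <- coef(summary(fit_a))["x", "Std. Error"]
    b <- coef(fit_b)["m"]; sb <- coef(summary(fit_b))["m", "Std. Error"]
    z <- (a * b) / sqrt(a^2 * sb^2 + b^2 * sa^2)   # Sobel z for the indirect effect
    2 * pnorm(-abs(z))                             # two-sided p-value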
Computation of t-year survival probabilities and t-year risks with right censored survival data. The Kaplan-Meier estimator is used to provide estimates for data without competing risks and the Aalen-Johansen estimator is used when there are competing risks. Confidence intervals and p-values are obtained using either usual Wald-type inference or empirical likelihood inference, as described in Thomas and Grunkemeier (1975) <doi:10.1080/01621459.1975.10480315> and Blanche (2020) <doi:10.1007/s10985-018-09458-6>. Functions for both one-sample and two-sample inference are provided. Unlike Wald-type inference, empirical likelihood inference always leads to consistent conclusions, in terms of statistical significance, when comparing two risks (or survival probabilities) via either a ratio or a difference.
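For orientation, the quantity under test is a survival probability at a fixed time t; with the standard survival package (not this package's API), the Kaplan-Meier estimate with a Wald-type interval is:

    library(survival)
    fit <- survfit(Surv(time, status) ~ 1, data = lung)
    summary(fit, times = 365)   # 1-year survival probability with confidence interval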
This is a package for estimation and inference from generalized linear models based on various methods for bias reduction and maximum penalized likelihood with powers of the Jeffreys prior as penalty. The brglmFit fitting method can achieve reduction of estimation bias by solving either the mean bias-reducing adjusted score equations in Firth (1993) <doi:10.1093/biomet/80.1.27> and Kosmidis and Firth (2009) <doi:10.1093/biomet/asp055>, or the median bias-reducing adjusted score equations in Kenne et al. (2017) <doi:10.1093/biomet/asx046>, or through the direct subtraction of an estimate of the bias of the maximum likelihood estimator from the maximum likelihood estimates as in Cordeiro and McCullagh (1991) <https://www.jstor.org/stable/2345592>.
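A minimal sketch (brglmFit plugs into glm() through its method argument; the endometrial data set ships with brglm2):

    library(brglm2)
    data("endometrial", package = "brglm2")
    fit <- glm(HG ~ NV + PI + EH, family = binomial("logit"),
               data = endometrial, method = "brglmFit")   # mean bias-reduced fit
    summary(fit)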
Commonly used classification and regression tree methods like the CART algorithm are recursive partitioning methods that build the model in a forward stepwise search. Although this approach is known to be an efficient heuristic, the results of recursive tree methods are only locally optimal, as splits are chosen to maximize homogeneity at the next step only. An alternative way to search over the parameter space of trees is to use global optimization methods like evolutionary algorithms. The evtree package implements an evolutionary algorithm for learning globally optimal classification and regression trees in R. CPU and memory-intensive tasks are fully computed in C++ while the partykit package is leveraged to represent the resulting trees in R, providing unified infrastructure for summaries, visualizations, and predictions.
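A minimal sketch using the familiar formula interface:

    library(evtree)
    set.seed(1)
    fit <- evtree(Species ~ ., data = iris)   # evolutionary search for a global tree
    plot(fit)                                 # partykit-based visualization
    table(predict(fit), iris$Species)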
This package implements the Exploratory Graph Analysis (EGA) framework for dimensionality and psychometric assessment. EGA estimates the number of dimensions in psychological data using network estimation methods and community detection algorithms. A bootstrap method is provided to assess the stability of dimensions and items. Fit is evaluated using the Entropy Fit family of indices. Unique Variable Analysis evaluates the extent to which items are locally dependent (or redundant). Network loadings provide similar information to factor loadings and can be used to compute network scores. A bootstrap and permutation approach are available to assess configural and metric invariance. Hierarchical structures can be detected using Hierarchical EGA. Time series and intensive longitudinal data can be analyzed using Dynamic EGA, supporting individual, group, and population level assessments.
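A minimal sketch (items stands for a data frame of questionnaire responses):

    library(EGAnet)
    ega  <- EGA(data = items)                   # estimate dimensions via network + communities
    boot <- bootEGA(data = items, iter = 500)   # bootstrap stability of the solution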
Generates efficient balanced mixed-level k-circulant supersaturated designs by interchanging the elements of the generator vector. It attempts to generate a supersaturated design that has EfNOD efficiency greater than a user-specified efficiency level (mef), and displays the progress of the generation of an efficient mixed-level k-circulant design through a progress bar. Progress of 100 per cent means that one full round of interchange is completed; more than one full round (typically 4-5 rounds) of interchange may be required for larger designs. For more details, please see Mandal, B.N., Gupta, V.K. and Parsad, R. (2011), "Construction of Efficient Mixed-Level k-Circulant Supersaturated Designs", Journal of Statistical Theory and Practice, 5:4, 627-648, <doi:10.1080/15598608.2011.10483735>.
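A hypothetical call sketch (the function name and every argument shown are assumptions; consult the package manual for the real interface):

    library(mkssd)
    # d <- mkssd(k = 2, levels = c(2, 3), mef = 0.9)   # hypothetical: generate a design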
Efficient framework to estimate high-dimensional generalized matrix factorization models using penalized maximum likelihood under a dispersion exponential family specification. Both deterministic and stochastic methods are implemented for the numerical maximization. In particular, the package implements the stochastic gradient descent algorithm with a block-wise mini-batch strategy to speed up the computations and an efficient adaptive learning rate schedule to stabilize the convergence. All the theoretical details can be found in Castiglione et al. (2024) <doi:10.48550/arXiv.2412.20509>. Other methods considered for the optimization are the alternating iterative re-weighted least squares and the quasi-Newton method with diagonal approximation of the Fisher information matrix discussed in Kidzinski et al. (2022) <http://jmlr.org/papers/v23/20-1104.html>.
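A hypothetical call sketch (sgdgmf.fit() and its arguments are assumptions; Y stands for a high-dimensional response matrix):

    library(sgdGMF)
    # fit <- sgdgmf.fit(Y, ncomp = 5, family = poisson(),
    #                   method = "sgd")   # stochastic gradient descent with mini-batches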
The focus is on simulating and modeling families with founders drawn from a structured population (for example, with different ancestries or other potentially non-family relatedness), in contrast to traditional pedigree analysis that treats all founders as equally unrelated. The main function simulates a random pedigree over many generations, avoiding close relatives, pairing the closest individuals according to a 1D geography and their randomly drawn sex, and with variable numbers of children so as to achieve a target population size per generation. Auxiliary functions calculate kinship matrices, admixture matrices, and draw random genotypes across arbitrary pedigree structures starting from the corresponding founder values. The code is built around the plink FAM table format for pedigrees. Described in Yao and Ochoa (2022) <doi:10.1101/2022.03.25.485885>.
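A hypothetical call sketch (the function names follow the package's documented naming, but the arguments shown are assumptions):

    library(simfam)
    # ped <- sim_pedigree(n = 100, G = 10)          # 10 generations, ~100 per generation
    # kin <- kinship_fam(kinship_founders, ped$fam) # kinship through the FAM pedigree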
An extension to the individual claim simulator SynthETIC (on CRAN) that simulates the evolution of case estimates of incurred losses through the lifetime of an insurance claim. The transactional simulation output now comprises key dates, and both claim payments and revisions of estimated incurred losses. An initial set of test parameters, designed to mirror the experience of a real insurance portfolio, is set up and applied by default to generate a realistic test data set of incurred histories (see the vignette). However, the distributional assumptions used to generate this data set can be easily modified by users to match their own experience. Reference: Avanzi B, Taylor G, Wang M (2021) "SPLICE: A Synthetic Paid Loss and Incurred Cost Experience Simulator" <arXiv:2109.04058>.
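A hypothetical sketch (SPLICE operates on claims generated by SynthETIC; the function name shown is an assumption, with the vignette as the authoritative walk-through):

    library(SPLICE)
    # Starting from a SynthETIC claims object, simulate transactional incurred histories:
    # incurred <- generate_incurred(claims)   # hypothetical name; see the package vignette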