Generates blocked designs for mixed-level factorial experiments for a given block size. Internally, it uses finite-field based, collapsed, and heuristic methods to construct block structures that minimize confounding between block effects and factorial effects. The package creates the full treatment combination table, partitions runs into blocks, and computes detailed confounding diagnostics for main effects and two-factor interactions. It also checks orthogonal factorial structure (OFS) and computes efficiencies of factorial effects using the methods of Nair and Rao (1948) <doi:10.1111/j.2517-6161.1948.tb00005.x>. When OFS is not satisfied but the design has equal treatment replications and equal block sizes, a general method based on the C-matrix and custom contrast vectors is used to compute efficiencies. The output includes the generated design, finite-field metadata, confounding summaries, OFS diagnostics, and efficiency results.
This package provides a unified tidyverse-compatible interface to R's machine learning packages. Wraps established implementations from glmnet', randomForest', xgboost', e1071', rpart', gbm', nnet', cluster', dbscan', and others - providing consistent function signatures, tidy tibble output, unified ggplot2'-based visualization, and optional formatted gt tables via the tl_table() family of functions. The underlying algorithms are unchanged; tidylearn simply makes them easier to use together. Access raw model objects via the $fit slot for package-specific functionality. Methods include random forests Breiman (2001) <doi:10.1023/A:1010933404324>, LASSO regression Tibshirani (1996) <doi:10.1111/j.2517-6161.1996.tb02080.x>, elastic net Zou and Hastie (2005) <doi:10.1111/j.1467-9868.2005.00503.x>, support vector machines Cortes and Vapnik (1995) <doi:10.1007/BF00994018>, and gradient boosting Friedman (2001) <doi:10.1214/aos/1013203451>.
Implementation of the Generalized Pairwise Comparisons (GPC) as defined in Buyse (2010) <doi:10.1002/sim.3923> for complete observations, and extended in Peron (2018) <doi:10.1177/0962280216658320> to deal with right-censoring. GPC compare two groups of observations (intervention vs. control group) regarding several prioritized endpoints to estimate the probability that a random observation drawn from one group performs better/worse/equivalently than a random observation drawn from the other group. Summary statistics such as the net treatment benefit, win ratio, or win odds are then deduced from these probabilities. Confidence intervals and p-values are obtained based on asymptotic results (Ozenne 2021 <doi:10.1177/09622802211037067>), non-parametric bootstrap, or permutations. The software enables the use of thresholds of minimal importance difference, stratification, non-prioritized endpoints (O Brien test), and can handle right-censoring and competing-risks.
Data practitioners regularly use the R and Python programming languages to prepare data for analyses. Thus, they encode important data preprocessing decisions in R and Python code. The smallsets package subsequently decodes these decisions into a Smallset Timeline, a static, compact visualisation of data preprocessing decisions (Lucchesi et al. (2022) <doi:10.1145/3531146.3533175>). The visualisation consists of small data snapshots of different preprocessing steps. The smallsets package builds this visualisation from a user's dataset and preprocessing code located in an R', R Markdown', Python', or Jupyter Notebook file. Users simply add structured comments with snapshot instructions to the preprocessing code. One optional feature in smallsets requires installation of the Gurobi optimisation software and gurobi R package, available from <https://www.gurobi.com>. More information regarding the optional feature and gurobi installation can be found in the smallsets vignette.
This package provides functions to design and apply tests that are anytime valid. The functions can be used to design hypothesis tests in the prospective/randomised control trial setting or in the observational/retrospective setting. The resulting tests remain valid under both optional stopping and optional continuation. The current version includes safe t-tests and safe tests of two proportions. For details on the theory of safe tests, see Grunwald, de Heide and Koolen (2019) "Safe Testing" <arXiv:1906.07801>, for details on safe logrank tests see ter Schure, Perez-Ortiz, Ly and Grunwald (2020) "The Safe Logrank Test: Error Control under Continuous Monitoring with Unlimited Horizon" <arXiv:2011.06931v3> and Turner, Ly and Grunwald (2021) "Safe Tests and Always-Valid Confidence Intervals for contingency tables and beyond" <arXiv:2106.02693> for details on safe contingency table tests.
Efficient implementations of functions for the creation, modification and analysis of phylogenetic trees. Applications include: generation of trees with specified shapes; tree rearrangement; analysis of tree shape; rooting of trees and extraction of subtrees; calculation and depiction of split support; plotting the position of rogue taxa (Klopfstein & Spasojevic 2019) <doi:10.1371/journal.pone.0212942>; calculation of ancestor-descendant relationships, of stemwardness (Asher & Smith, 2022) <doi:10.1093/sysbio/syab072>, and of tree balance (Mir et al. 2013, Lemant et al. 2022) <doi:10.1016/j.mbs.2012.10.005>, <doi:10.1093/sysbio/syac027>; artificial extinction (Asher & Smith, 2022) <doi:10.1093/sysbio/syab072>; import and export of trees from Newick, Nexus (Maddison et al. 1997) <doi:10.1093/sysbio/46.4.590>, and TNT <https://www.lillo.org.ar/phylogeny/tnt/> formats; and analysis of splits and cladistic information.
GBScleanR is a package for quality check, filtering, and error correction of genotype data derived from next generation sequcener (NGS) based genotyping platforms. GBScleanR takes Variant Call Format (VCF) file as input. The main function of this package is `estGeno()` which estimates the true genotypes of samples from given read counts for genotype markers using a hidden Markov model with incorporating uneven observation ratio of allelic reads. This implementation gives robust genotype estimation even in noisy genotype data usually observed in Genotyping-By-Sequnencing (GBS) and similar methods, e.g. RADseq. The current implementation accepts genotype data of a diploid population at any generation of multi-parental cross, e.g. biparental F2 from inbred parents, biparental F2 from outbred parents, and 8-way recombinant inbred lines (8-way RILs) which can be refered to as MAGIC population.
MethylMix is an algorithm implemented to identify hyper and hypomethylated genes for a disease. MethylMix is based on a beta mixture model to identify methylation states and compares them with the normal DNA methylation state. MethylMix uses a novel statistic, the Differential Methylation value or DM-value defined as the difference of a methylation state with the normal methylation state. Finally, matched gene expression data is used to identify, besides differential, functional methylation states by focusing on methylation changes that effect gene expression. References: Gevaert 0. MethylMix: an R package for identifying DNA methylation-driven genes. Bioinformatics (Oxford, England). 2015;31(11):1839-41. doi:10.1093/bioinformatics/btv020. Gevaert O, Tibshirani R, Plevritis SK. Pancancer analysis of DNA methylation-driven genes using MethylMix. Genome Biology. 2015;16(1):17. doi:10.1186/s13059-014-0579-8.
This package provides convenient methods for accessing the data in dist objects with minimal memory and computational overhead. disttools can be used to extract the distance between any pair or combination of points encoded by a dist object using only the indices of those points. This is an improvement over existing functionality, which requires either coercing a dist object into a matrix or calculating the one dimensional index corresponding to a pair of observations. Coercion to a matrix is undesirable because doing so doubles the amount of memory required for storage. In contrast, there is no inherent downside to the latter solution. However, in part due to several edge cases, correctly and efficiently implementing such a solution can be challenging. disttools abstracts away these challenges and provides a simple interface to access the data in a dist object using the latter approach.
Quantitatively analyse depth time-series data from pop-up satellite archival tags (PSATs) through the application of continuous wavelet transformation (CWT) combined with Principal Component Analysis (PCA), and k-means clustering. Import, crop, and plot depth time-depth records (TDRs). Using CWT to detect important signals within the non-stationary data, we create daily wavelet statistics to summarise vertical movements on different wavelet periods and combine with daily and diel depth statistics. Classify depth time-series with unsupervised k-means clustering into 24-hour periods of vertical movement behaviour with distinct patterns of vertical movement. Plot example days from each behaviour cluster, and plot the TDR coloured by cluster. Based on principals of combining CWT with k-means first developed by Sakamoto (2009) <doi:10.1371/journal.pone.0005379> and redeveloped by Beale (2026) <doi:10.21203/rs.3.rs-6907076/v1>.
The Clutter model is a significant forest growth simulation tool. Grounded on individual trees and comprehensively considering factors such as competition among trees and the impact of environmental elements on growth, it can accurately reflect the growth process of forest stands. It can be applied in areas like forest resource management, harvesting planning, and ecological research. With the help of the Clutter model, people can better understand the dynamic changes of forests and provide a scientific basis for rational forest management and protecting the ecological environment. This R package can effectively realize the construction of forest growth and harvest models based on the Clutter model and achieve optimized forest management.References: Farias A, Soares C, Leite H et al(2021)<doi:10.1007/s10342-021-01380-1>. Guera O, Silva J, Ferreira R, et al(2019)<doi:10.1590/2179-8087.038117>.
This package provides functions to prepare time priors for MCMCtree analyses in the PAML software from Yang (2007)<doi:10.1093/molbev/msm088> and plot time-scaled phylogenies from any Bayesian divergence time analysis. Most time-calibrated node prior distributions require user-specified parameters. The package provides functions to refine these parameters, so that the resulting prior distributions accurately reflect confidence in known, usually fossil, time information. These functions also enable users to visualise distributions and write MCMCtree ready input files. Additionally, the package supplies flexible functions to visualise age uncertainty on a plotted tree with using node bars, using branch widths proportional to the age uncertainty, or by plotting the full posterior distributions on nodes. Time-scaled phylogenetic plots can be visualised with absolute and geological timescales . All plotting functions are applicable with output from any Bayesian software, not just MCMCtree'.
This package implements a Super Learner framework for right-censored survival data. The package fits convex combinations of parametric, semiparametric, and machine learning survival learners by minimizing cross-validated risk using inverse probability of censoring weighting (IPCW). It provides tools for automated hyperparameter grid search, high-dimensional variable screening, and evaluation of prediction performance using metrics such as the Brier score, Uno's C-index, and time-dependent area under the curve (AUC). Additional utilities support model interpretation for survival ensembles, including Shapley additive explanations (SHAP), and estimation of covariate-adjusted restricted mean survival time (RMST) contrasts. The methodology is related to treatment-specific survival curve estimation using machine learning described by Westling, Luedtke, Gilbert and Carone (2024) <doi:10.1080/01621459.2023.2205060>, and the unified ensemble framework described in Lyu et al. (2026) <doi:10.64898/2026.03.11.711010>.
This package provides a set of functions for applying a restricted linear algebra to the analysis of count-based data. See the accompanying preprint manuscript: "Normalizing need not be the norm: count-based math for analyzing single-cell data" Church et al (2022) <doi:10.1101/2022.06.01.494334> This tool is specifically designed to analyze count matrices from single cell RNA sequencing assays. The tools implement several count-based approaches for standard steps in single-cell RNA-seq analysis, including scoring genes and cells, comparing cells and clustering, calculating differential gene expression, and several methods for rank reduction. There are many opportunities for further optimization that may prove useful in the analysis of other data. We provide the source code freely available at <https://github.com/shchurch/countland> and encourage users and developers to fork the code for their own purposes.
Implement some models for correlation/covariance matrices including two approaches to model correlation matrices from a graphical structure. One use latent parent variables as proposed in Sterrantino et. al. (2024) <doi:10.1007/s10260-025-00788-y>. The other uses a graph to specify conditional relations between the variables. The graphical structure makes correlation matrices interpretable and avoids the quadratic increase of parameters as a function of the dimension. In the first approach a natural sequence of simpler models along with a complexity penalization is used. The second penalizes deviations from a base model. These can be used as prior for model parameters, considering C code through the cgeneric interface for the INLA package (<https://www.r-inla.org>). This allows one to use these models as building blocks combined and to other latent Gaussian models in order to build complex data models.
Simplified odds ratio calculation of GAM(M)s & GLM(M)s. Provides structured output (data frame) of all predictors and their corresponding odds ratios and confident intervals for further analyses. It helps to avoid false references of predictors and increments by specifying these parameters in a list instead of using exp(coef(model)) (standard approach of odds ratio calculation for GLMs) which just returns a plain numeric output. For GAM(M)s, odds ratio calculation is highly simplified with this package since it takes care of the multiple predict() calls of the chosen predictor while holding other predictors constant. Also, this package allows odds ratio calculation of percentage steps across the whole predictor distribution range for GAM(M)s. In both cases, confident intervals are returned additionally. Calculated odds ratio of GAM(M)s can be inserted into the smooth function plot.
Decision support tool for prioritizing sites for ecological surveys based on their potential to improve plans for conserving biodiversity (e.g. plans for establishing protected areas). Given a set of sites that could potentially be acquired for conservation management, it can be used to generate and evaluate plans for surveying additional sites. Specifically, plans for ecological surveys can be generated using various conventional approaches (e.g. maximizing expected species richness, geographic coverage, diversity of sampled environmental algorithms. After generating such survey plans, they can be evaluated using conditions) and maximizing value of information. Please note that several functions depend on the Gurobi optimization software (available from <https://www.gurobi.com>). Additionally, the JAGS software (available from <https://mcmc-jags.sourceforge.io/>) is required to fit hierarchical generalized linear models. For further details, see Hanson et al. (2023) <doi:10.1111/1365-2664.14309>.
The package contains methods to visualise the expression profile of genes from a microarray or RNA-seq experiment, and offers a supervised clustering approach to identify GO terms containing genes with expression levels that best classify two or more predefined groups of samples. Annotations for the genes present in the expression dataset may be obtained from Ensembl through the biomaRt package, if not provided by the user. The default random forest framework is used to evaluate the capacity of each gene to cluster samples according to the factor of interest. Finally, GO terms are scored by averaging the rank (alternatively, score) of their respective gene sets to cluster the samples. P-values may be computed to assess the significance of GO term ranking. Visualisation function include gene expression profile, gene ontology-based heatmaps, and hierarchical clustering of experimental samples using gene expression data.
The anomalize package enables a "tidy" workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it's quite simple to decompose time series, detect anomalies, and create bands separating the "normal" data from the anomalous data at scale (i.e. for multiple time series). Time series decomposition is used to remove trend and seasonal components via the time_decompose() function and methods include seasonal decomposition of time series by Loess ("stl") and seasonal decomposition by piecewise medians ("twitter"). The anomalize() function implements two methods for anomaly detection of residuals including using an inner quartile range ("iqr") and generalized extreme studentized deviation ("gesd"). These methods are based on those used in the forecast package and the Twitter AnomalyDetection package. Refer to the associated functions for specific references for these methods.
Extract data from Birdscan MR1 SQL vertical-looking radar databases, filter, and process them to Migration Traffic Rates (#objects per hour and km) or density (#objects per km3) of, for example birds, and insects. Object classifications in the Birdscan MR1 databases are based on the dataset of Haest et al. (2021) <doi:10.5281/zenodo.5734960>). Migration Traffic Rates and densities can be calculated separately for different height bins (with a height resolution of choice) as well as over time periods of choice (e.g., 1/2 hour, 1 hour, 1 day, day/night, the full time period of observation, and anything in between). Two plotting functions are also included to explore the data in the SQL databases and the resulting Migration Traffic Rate results. For details on the Migration Traffic Rate calculation procedures, see Schmid et al. (2019) <doi:10.1111/ecog.04025>.
Statistical tools for analyzing cognitive diagnosis (CD) data collected from small settings using the nonparametric classification (NPCD) framework. The core methods of the NPCD framework includes the nonparametric classification (NPC) method developed by Chiu and Douglas (2013) <DOI:10.1007/s00357-013-9132-9> and the general NPC (GNPC) method developed by Chiu, Sun, and Bian (2018) <DOI:10.1007/s11336-017-9595-4> and Chiu and Köhn (2019) <DOI:10.1007/s11336-019-09660-x>. An extension of the NPCD framework included in the package is the nonparametric method for multiple-choice items (MC-NPC) developed by Wang, Chiu, and Koehn (2023) <DOI:10.3102/10769986221133088>. Functions associated with various extensions concerning the evaluation, validation, and feasibility of the CD analysis are also provided. These topics include the completeness of Q-matrix, Q-matrix refinement method, as well as Q-matrix estimation.
This package implements Bayesian data analyses of balanced repeatability and reproducibility studies with ordinal measurements. Model fitting is based on MCMC posterior sampling with rjags'. Function ordinalRR() directly carries out the model fitting, and this function has the flexibility to allow the user to specify key aspects of the model, e.g., fixed versus random effects. Functions for preprocessing data and for the numerical and graphical display of a fitted model are also provided. There are also functions for displaying the model at fixed (user-specified) parameters and for simulating a hypothetical data set at a fixed (user-specified) set of parameters for a random-effects rater population. For additional technical details, refer to Culp, Ryan, Chen, and Hamada (2018) and cite this Technometrics paper when referencing any aspect of this work. The demo of this package reproduces results from the Technometrics paper.
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (GÃ E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear GÃ E interactions simultaneously (Ren et al. (2020) <doi:10.1002/sim.8434>). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main effects only case within Bayesian framework. Spike-and-slab priors are incorporated on both individual and group level to shrink coefficients corresponding to irrelevant main and interaction effects to zero exactly. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
This package contains functions to perform various models and methods for test equating (Kolen and Brennan, 2014 <doi:10.1007/978-1-4939-0317-7> ; Gonzalez and Wiberg, 2017 <doi:10.1007/978-3-319-51824-4> ; von Davier et. al, 2004 <doi:10.1007/b97446>). It currently implements the traditional mean, linear and equipercentile equating methods. Both IRT observed-score and true-score equating are also supported, as well as the mean-mean, mean-sigma, Haebara and Stocking-Lord IRT linking methods. It also supports newest methods such that local equating, kernel equating (using Gaussian, logistic, Epanechnikov, uniform and adaptive kernels) with presmoothing, and IRT parameter linking methods based on asymmetric item characteristic functions. Functions to obtain both standard error of equating (SEE) and standard error of equating differences between two equating functions (SEED) are also implemented for the kernel method of equating.