An object is called "outlier" if it remarkably deviates from the other objects in a data set. Outlier detection is the process to find outliers by using the methods that are based on distance measures, clustering and spatial methods (Ben-Gal, 2005 <ISBN 0-387-24435-2>). It is one of the intensively studied research topics for identification of novelties, frauds, anomalies, deviations or exceptions in addition to its use for outlier removing in data processing. This package provides the implementations of some novel approaches to detect the outliers based on typicality degrees that are obtained with the soft partitioning clustering algorithms such as Fuzzy C-means and its variants.
This package provides functions to estimate and interpret the alpha-NOMINATE ideal point model developed in Carroll et al. (2013, <doi:10.1111/ajps.12029>). alpha-NOMINATE extends traditional spatial voting frameworks by allowing for a mixture of Gaussian and quadratic utility functions, providing flexibility in modeling political actors preferences. The package uses Markov Chain Monte Carlo (MCMC) methods for parameter estimation, supporting robust inference about individuals ideological positions and the shape of their utility functions. It also contains functions to simulate data from the model and to calculate the probability of a vote passing given the ideal points of the legislators/voters and the estimated location of the choice alternatives.
This package contains Bayesian implementations of the Mixed-Effects Accelerated Failure Time (MEAFT) models for censored data. Those can be not only right-censored but also interval-censored, doubly-interval-censored or misclassified interval-censored. The methods implemented in the package have been published in Komárek and Lesaffre (2006, Stat. Modelling) <doi:10.1191/1471082X06st107oa>, Komárek, Lesaffre and Legrand (2007, Stat. in Medicine) <doi:10.1002/sim.3083>, Komárek and Lesaffre (2007, Stat. Sinica) <https://www3.stat.sinica.edu.tw/statistica/oldpdf/A17n27.pdf>, Komárek and Lesaffre (2008, JASA) <doi:10.1198/016214507000000563>, Garcà a-Zattera, Jara and Komárek (2016, Biometrics) <doi:10.1111/biom.12424>.
This package provides a class of Bayesian beta regression models for the analysis of continuous data with support restricted to an unknown finite support. The response variable is modeled using a four-parameter beta distribution with the mean or mode parameter depending linearly on covariates through a link function. When the response support is known to be (0,1), the above class of models reduce to traditional (0,1) supported beta regression models. Model choice is carried out via the logarithm of the pseudo marginal likelihood (LPML), the deviance information criterion (DIC), and the Watanabe-Akaike information criterion (WAIC). See Zhou and Huang (2022) <doi:10.1016/j.csda.2021.107345>.
Generic Machine Learning Inference on heterogeneous treatment effects in randomized experiments as proposed in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) <arXiv:1712.04802>. This package's workhorse is the mlr3 framework of Lang et al. (2019) <doi:10.21105/joss.01903>, which enables the specification of a wide variety of machine learners. The main functionality, GenericML(), runs Algorithm 1 in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) <arXiv:1712.04802> for a suite of user-specified machine learners. All steps in the algorithm are customizable via setup functions. Methods for printing and plotting are available for objects returned by GenericML(). Parallel computing is supported.
Determining potential output and the output gap - two inherently unobservable variables - is a major challenge for macroeconomists. sectorgap features a flexible modeling and estimation framework for a multivariate Bayesian state space model identifying economic output fluctuations consistent with subsectors of the economy. The proposed model is able to capture various correlations between output and a set of aggregate as well as subsector indicators. Estimation of the latent states and parameters is achieved using a simple Gibbs sampling procedure and various plotting options facilitate the assessment of the results. For details on the methodology and an illustrative example, see Streicher (2024) <https://www.research-collection.ethz.ch/handle/20.500.11850/653682>.
An interface for creating, registering, and resolving content-based identifiers for data management. Content-based identifiers rely on the cryptographic hashes to refer to the files they identify, thus, anyone possessing the file can compute the identifier using a well-known standard algorithm, such as SHA256'. By registering a URL at which the content is accessible to a public archive (such as Hash Archive) or depositing data in a scientific repository such Zenodo', DataONE or SoftwareHeritage', the content identifier can serve many functions typically associated with A Digital Object Identifier ('DOI'). Unlike location-based identifiers like DOIs', content-based identifiers permit the same content to be registered in many locations.
Efficient design matrix free lasso penalized estimation in large scale 2 and 3-dimensional generalized linear array model framework. The procedure is based on the gdpg algorithm from Lund et al. (2017) <doi:10.1080/10618600.2017.1279548>. Currently Lasso or Smoothly Clipped Absolute Deviation (SCAD) penalized estimation is possible for the following models: The Gaussian model with identity link, the Binomial model with logit link, the Poisson model with log link and the Gamma model with log link. It is also possible to include a component in the model with non-tensor design e.g an intercept. Also provided are functions, glamlassoRR() and glamlassoS(), fitting special cases of GLAMs.
This package provides functions for forest objects detection, structure metrics computation, model calibration and mapping with airborne laser scanning: co-registration of field plots (Monnet and Mermin (2014) <doi:10.3390/f5092307>); tree detection (method 1 in Eysn et al. (2015) <doi:10.3390/f6051721>) and segmentation; forest parameters estimation with the area-based approach: model calibration with ground reference, and maps export (Aussenac et al. (2023) <doi:10.12688/openreseurope.15373.2>); extraction of both physical (gaps, edges, trees) and statistical features useful for e.g. habitat suitability modeling (Glad et al. (2020) <doi:10.1002/rse2.117>) and forest maturity mapping (Fuhr et al. (2022) <doi:10.1002/rse2.274>).
Grey model is commonly used in time series forecasting when statistical assumptions are violated with a limited number of data points. The minimum number of data points required to fit a grey model is four observations. This package fits Grey model of First order and One Variable, i.e., GM (1,1) for multivariate time series data and returns the parameters of the model, model evaluation criteria and h-step ahead forecast values for each of the time series variables. For method details see, Akay, D. and Atak, M. (2007) <DOI:10.1016/j.energy.2006.11.014>, Hsu, L. and Wang, C. (2007).<DOI:10.1016/j.techfore.2006.02.005>.
The method m:Explorer associates a given list of target genes (e.g. those involved in a biological process) to gene regulators such as transcription factors. Transcription factors that bind DNA near significantly many target genes or correlate with target genes in transcriptional (microarray or RNAseq data) are selected. Selection of candidate master regulators is carried out using multinomial regression models, likelihood ratio tests and multiple testing correction. Reference: m:Explorer: multinomial regression models reveal positive and negative regulators of longevity in yeast quiescence. Juri Reimand, Anu Aun, Jaak Vilo, Juan M Vaquerizas, Juhan Sedman and Nicholas M Luscombe. Genome Biology (2012) 13:R55 <doi:10.1186/gb-2012-13-6-r55>.
Despite that several tests for normality in stationary processes have been proposed in the literature, consistent implementations of these tests in programming languages are limited. Seven normality test are implemented. The asymptotic Lobato & Velasco's, asymptotic Epps, Psaradakis and Vávra, Lobato & Velasco's and Epps sieve bootstrap approximations, El bouch et al., and the random projections tests for univariate stationary process. Some other diagnostics such as, unit root test for stationarity, seasonal tests for seasonality, and arch effect test for volatility; are also performed. Additionally, the El bouch test performs normality tests for bivariate time series. The package also offers residual diagnostic for linear time series models developed in several packages.
This package provides a random forest based implementation of the method described in Chapter 7.1.2 (Regression model based anomaly detection) of Chandola et al. (2009) <doi:10.1145/1541880.1541882>. It works as follows: Each numeric variable is regressed onto all other variables by a random forest. If the scaled absolute difference between observed value and out-of-bag prediction of the corresponding random forest is suspiciously large, then a value is considered an outlier. The package offers different options to replace such outliers, e.g. by realistic values found via predictive mean matching. Once the method is trained on a reference data, it can be applied to new data.
This package provides a collection of functions which (i) assess the quality of variable subsets as surrogates for a full data set, in either an exploratory data analysis or in the context of a multivariate linear model, and (ii) search for subsets which are optimal under various criteria. Theoretical support for the heuristic search methods and exploratory data analysis criteria is in Cadima, Cerdeira, Minhoto (2003, <doi:10.1016/j.csda.2003.11.001>). Theoretical support for the leap and bounds algorithm and the criteria for the general multivariate linear model is in Duarte Silva (2001, <doi:10.1006/jmva.2000.1920>). There is a package vignette "subselect", which includes additional references.
Herein, we provide a broad variety of functions which are useful for handling, manipulating, and visualizing satellite-based remote sensing data. These operations range from mere data import and layer handling (eg subsetting), over Raster* typical data wrangling (eg crop, extend), to more sophisticated (pre-)processing tasks typically applied to satellite imagery (eg atmospheric and topographic correction). This functionality is complemented by a full access to the satellite layers metadata at any stage and the documentation of performed actions in a separate log file. Currently available sensors include Landsat 4-5 (TM), 7 (ETM+), and 8 (OLI/TIRS Combined), and additional compatibility is ensured for the Landsat Global Land Survey data set.
Mixed models for repeated measures (MMRM) are a popular choice for analyzing longitudinal continuous outcomes in randomized clinical trials and beyond; see for example Cnaan, Laird and Slasor (1997) <doi:10.1002/(SICI)1097-0258(19971030)16:20%3C2349::AID-SIM667%3E3.0.CO;2-E>. This package provides an interface for fitting MMRM within the tern <https://cran.r-project.org/package=tern> framework by Zhu et al. (2023) and tabulate results easily using rtables <https://cran.r-project.org/package=rtables> by Becker et al. (2023). It builds on mmrm <https://cran.r-project.org/package=mmrm> by Sabanés Bové et al. (2023) for the actual MMRM computations.
The extrafont package makes it easier to use fonts other than the basic PostScript fonts that R uses. Fonts that are imported into extrafont can be used with PDF or PostScript output files. There are two hurdles for using fonts in PDF (or Postscript) output files:
- Making R aware of the font and the dimensions of the characters. 
- Embedding the fonts in the PDF file so that the PDF can be displayed properly on a device that doesn't have the font. This is usually needed if you want to print the PDF file or share it with others. 
The extrafont package makes both of these things easier.
Add mean comparison annotations to a ggplot'. This package provides an easy way to indicate if two or more groups are significantly different in a ggplot'. Usually you do not need to specify the test method, you only need to tell stat_compare() whether you want to perform a parametric test or a nonparametric test, and stat_compare() will automatically choose the appropriate test method based on your data. For comparisons between two groups, the p-value is calculated by t-test (parametric) or Wilcoxon rank sum test (nonparametric). For comparisons among more than two groups, the p-value is calculated by One-way ANOVA (parametric) or Kruskal-Wallis test (nonparametric).
This package provides a comprehensive analytics framework for building reproducible pipelines on T-cell and B-cell immune receptor repertoire data. Delivers multi-modal immune profiling (bulk, single-cell, CITE-seq/AbSeq, spatial, immunogenicity data), feature engineering (ML-ready feature tables and matrices), and biomarker discovery workflows (cohort comparisons, longitudinal tracking, repertoire similarity, enrichment). Provides a user-friendly interface to widely used AIRR methods â clonality/diversity, V(D)J usage, similarity, annotation, tracking, and many more. Think Scanpy or Seurat, but for AIRR data, a.k.a. Adaptive Immune Receptor Repertoire, VDJ-seq, RepSeq, or VDJ sequencing data. A successor to our previously published "tcR" R package (Nazarov 2015).
Offers an interactive RStudio gadget interface for communicating with OpenAI large language models (e.g., gpt-5', gpt-5-mini', gpt-5-nano') (<https://platform.openai.com/docs/api-reference>). Enables users to conduct multiple chat conversations simultaneously in separate tabs. Supports uploading local files (R, PDF, DOCX) to provide context for the models. Allows per-conversation configuration of system messages (where supported by the model). API interactions via the httr package are performed asynchronously using promises and future to avoid blocking the R console. Useful for tasks like code generation, text summarization, and document analysis directly within the RStudio environment. Requires an OpenAI API key set as an environment variable.
This package provides tools for profiling a user-supplied log-likelihood function to calculate confidence intervals for model parameters. Speed of computation can be improved by adjusting the step sizes in the profiling and/or starting the profiling from limits based on the approximate large sample normal distribution for the maximum likelihood estimator of a parameter. The accuracy of the limits can be set by the user. A plot method visualises the log-likelihood and confidence interval. Cases where the profile log-likelihood flattens above the value at which a confidence limit is defined can be handled, leading to a limit at plus or minus infinity. Disjoint confidence intervals will not be found.
Analysis tools to investigate changes in intercellular communication from scRNA-seq data. Using a Seurat object as input, the package infers which cell-cell interactions are present in the dataset and how these interactions change between two conditions of interest (e.g. young vs old). It relies on an internal database of ligand-receptor interactions (available for human, mouse and rat) that have been gathered from several published studies. Detection and differential analyses rely on permutation tests. The package also contains several tools to perform over-representation analysis and visualize the results. See Lagger, C. et al. (2023) <doi:10.1038/s43587-023-00514-x> for a full description of the methodology.
This is a data only package providing the algorithmic complexity of short strings, computed using the coding theorem method. For a given set of symbols in a string, all possible or a large number of random samples of Turing machines with a given number of states (e.g., 5) and number of symbols corresponding to the number of symbols in the strings were simulated until they reached a halting state or failed to end. This package contains data on 4.5 million strings from length 1 to 12 simulated on Turing machines with 2, 4, 5, 6, and 9 symbols. The complexity of the string corresponds to the distribution of the halting states.
Enables off-the-shelf functionality for fully Bayesian, nonstationary Gaussian process modeling. The approach to nonstationary modeling involves a closed-form, convolution-based covariance function with spatially-varying parameters; these parameter processes can be specified either deterministically (using covariates or basis functions) or stochastically (using approximate Gaussian processes). Stationary Gaussian processes are a special case of our methodology, and we furthermore implement approximate Gaussian process inference to account for very large spatial data sets (Finley, et al (2017) <arXiv:1702.00434v2>). Bayesian inference is carried out using Markov chain Monte Carlo methods via the nimble package, and posterior prediction for the Gaussian process at unobserved locations is provided as a post-processing step.