Statistical inference with non-probability samples when auxiliary information from external sources such as probability samples or population totals or means is available. The package implements various methods such as inverse probability (propensity score) weighting, mass imputation and the doubly robust approach. Details can be found in: Chen et al. (2020) <doi:10.1080/01621459.2019.1677241>, Yang et al. (2020) <doi:10.1111/rssb.12354>, Kim et al. (2021) <doi:10.1111/rssa.12696>, Yang et al. (2021) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm> and Wu (2022) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm>. For details on the package and its functionality see <doi:10.48550/arXiv.2504.04255>.
This package provides an integrated pipeline for the analysis of PAR-CLIP data. PAR-CLIP-induced transitions are first discriminated from sequencing errors, SNPs and additional non-experimental sources by a non-parametric mixture model. The protein binding sites (clusters) are then resolved at high resolution and cluster statistics are estimated using a rigorous Bayesian framework. Post-processing of the results, data export for UCSC genome browser visualization and motif search analysis are provided. In addition, the package integrates RNA-Seq data to estimate the False Discovery Rate of cluster detection. Key functions support parallel multicore computing. While wavClusteR was designed for PAR-CLIP data analysis, it can be applied to the analysis of other NGS data obtained from experimental procedures that induce nucleotide substitutions (e.g. BisSeq).
This package provides a graph community detection algorithm that aims to be performant on large graphs and robust, returning consistent results across runs. SpeakEasy 2 (SE2), the underlying algorithm, is described in Chris Gaiteri, David R. Connell & Faraz A. Sultan et al. (2023) <doi:10.1186/s13059-023-03062-0>. The core algorithm is written in C, providing speed and keeping the memory requirements low. This implementation can take advantage of multiple computing cores without increasing memory usage. SE2 can detect community structure across scales, making it a good choice for biological data, which often has hierarchical structure. Graphs can be passed to the algorithm as adjacency matrices using base R matrices, the Matrix library, igraph graphs, or any data that can be coerced into a matrix.
The Datasaurus Dozen is a set of datasets that share the same summary statistics despite having radically different distributions. The datasets represent a larger and quirkier object lesson than is typically taught via Anscombe's Quartet (available in the 'datasets' package). Anscombe's Quartet contains four very different distributions with the same summary statistics and as such highlights the value of visualisation in understanding data, over and above summary statistics. As well as being an engaging variant on the Quartet, the data is generated in a novel way. The simulated annealing process used to derive datasets from the original Datasaurus is detailed in "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" <doi:10.1145/3025453.3025912>.
This package provides a powerful and flexible tool for visualizing proportional data across spatially resolved contexts. By combining the concepts of scatter plots and stacked bar charts, scatterbar allows users to create scattered bar chart plots, which effectively display the proportions of different categories at each (x, y) location. This visualization is particularly useful for applications where understanding the distribution of categories across spatial coordinates is essential. This package features automatic determination of optimal scaling factors based on data, customizable scaling and padding options for both x and y axes, flexibility to specify custom colors for each category, options to customize the legend title, and integration with ggplot2 for robust and high-quality visualizations. For more details, see Velazquez et al. (2024) <doi:10.1101/2024.08.14.606810>.
Space-filling designs play an important role in computer experiments. The most widely used space-filling designs include Uniform designs (UDs) and Latin hypercube designs (LHDs). For further references, see McKay (1979) <DOI:10.1080/00401706.1979.10489755> and Fang (1980) <https://cir.nii.ac.jp/crid/1570291225616774784>. This package provides algorithms for generating efficient LHDs and UDs. The generated LHDs are efficient in that they attain low values of the Maxpro measure, the Phi_p value and the Maximum Absolute Correlation (MAC) value, according to the weight given to each criterion. The generated UDs have good space-filling properties, as they always attain the lower bound of the Discrete Discrepancy measure. The package also includes some additional utility functions.
This package provides a customisable R shiny app for immersively visualising, mapping and annotating panospheric (360 degree) imagery. The flexible interface allows annotation of any geocoded images using up to four user-specified dropdown menus. The app uses leaflet to render maps that display the geo-locations of images and Pannellum <https://pannellum.org/>, a lightweight panorama viewer for the web, to render images in virtual 360 degree viewing mode. Key functions include the ability to draw on and export parts of 360 degree images for downstream applications. Users can also draw polygons and points on map imagery related to the panoramic images and export them for further analysis. Downstream applications include using annotations to train Artificial Intelligence/Machine Learning (AI/ML) models and geospatial modelling and analysis of camera-based survey data.
Unleash the power of time-series data visualization with ease using our package. Designed with simplicity in mind, it offers three key features through a shiny app with three tabs. The first tab shows time-series charts with forecasts, allowing users to visualize trends and changes effortlessly. The second tab displays averages per country in tables with accompanying sparklines, providing a quick and attractive overview of the data. The last tab presents a customizable world map colored according to user-defined variables for any chosen number of countries, offering an advanced visual approach to understanding geographical data distributions. This package operates with just a few simple arguments, enabling users to conduct sophisticated analyses without the need for complex programming skills. Transform your time-series data analysis experience with our user-friendly tool.
The cyclotomic numbers are complex numbers that can be thought of as the rational numbers extended with the roots of unity. They are represented exactly, enabling exact computations. They contain the Gaussian rationals (complex numbers with rational real and imaginary parts) as well as the square roots of all rational numbers. They also contain the sine and cosine of all rational multiples of pi. The algorithms implemented in this package are taken from the Haskell package cyclotomic, whose algorithms are adapted from code by Martin Schoenert and Thomas Breuer in the GAP project (<https://www.gap-system.org/>). Cyclotomic numbers have applications in number theory, algebraic geometry, algebraic number theory, coding theory, and in the theory of graphs and combinatorics. They have connections to the theory of modular functions and modular curves.
Data analysis often requires coding, especially when data are collected through interviews, observations, or questionnaires. As a result, code counting and data preparation are essential steps in the analysis process. Analysts may need to count the codes in a text (tokenization, counting of pre-established codes, computing the co-occurrence matrix by line) and prepare the data (e.g., min-max normalization, Z-score, robust scaling, Box-Cox transformation, and non-parametric bootstrap). For the Box-Cox transformation (Box & Cox, 1964, <https://www.jstor.org/stable/2984418>), the optimal lambda is determined using the log-likelihood method. Non-parametric bootstrap involves randomly sampling data with replacement. Two random number generators are also integrated: a Lehmer congruential generator for the uniform distribution and a Box-Muller generator for the normal distribution. The package is intended for educational purposes.
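The data-preparation steps and generators named above can be illustrated with a minimal Python sketch. This is not the package's own interface: the function names (`lehmer`, `box_muller`, `min_max`, `z_score`) are hypothetical, and the Lehmer multiplier and modulus (48271 and 2^31 - 1, the common "minimal standard" choice) are assumptions rather than values stated in the description.

```python
import math


def lehmer(seed, n, a=48271, m=2**31 - 1):
    """Return n pseudo-random values in (0, 1) from a Lehmer generator.

    The multiplier a and modulus m are assumed defaults, not package values.
    """
    state = seed % m
    if state == 0:
        raise ValueError("seed must not be a multiple of the modulus")
    values = []
    for _ in range(n):
        state = (a * state) % m      # multiplicative congruential step
        values.append(state / m)     # scale to the open interval (0, 1)
    return values


def box_muller(uniforms):
    """Map consecutive pairs of uniforms to standard normal draws."""
    normals = []
    for u1, u2 in zip(uniforms[0::2], uniforms[1::2]):
        radius = math.sqrt(-2.0 * math.log(u1))
        angle = 2.0 * math.pi * u2
        normals.append(radius * math.cos(angle))
        normals.append(radius * math.sin(angle))
    return normals


def min_max(xs):
    """Rescale values linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Centre on the mean and divide by the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return [(x - mean) / sd for x in xs]
```

For instance, `box_muller(lehmer(42, 10))` yields ten approximately standard normal draws, and `min_max([2, 4, 6])` rescales the values to the unit interval.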
This package performs multi-omic differential network analysis by revealing differential interactions between molecular entities (genes, proteins, transcription factors, or other biomolecules) across the omic datasets provided. For each omic dataset, a differential network is constructed where links represent statistically significant differential interactions between entities. These networks are then integrated into a comprehensive visualization using distinct colors to distinguish interactions from different omic layers. This unified display allows interactive exploration of cross-omic patterns, such as differential interactions present at both transcript and protein levels. For each link, users can access differential statistical significance metrics (p values or adjusted p values, calculated via robust or traditional linear regression with interaction term) and differential regression plots. The methods implemented in this package are described in Sciacca et al. (2023) <doi:10.1093/bioinformatics/btad192>.
Calculates exact hypothesis tests to compare a treatment and a reference group with respect to multiple binary endpoints. The tested null hypothesis is an identical multidimensional distribution of successes and failures in both groups. The alternative hypothesis is a larger success proportion in the treatment group in at least one endpoint. The tests are based on the multivariate permutation distribution of subjects between the two groups. For this permutation distribution, rejection regions are calculated that satisfy one of different possible optimization criteria. In particular, regions with maximal exhaustion of the nominal significance level, maximal power under a specified alternative or maximal number of elements can be found. Optimization is achieved by a branch-and-bound algorithm. By application of the closed testing principle, the global hypothesis tests are extended to multiple testing procedures.
Fast randomization-based two-sample tests. Tests the hypothesis that two samples come from the same distribution, using randomization to compute p-values. Included tests are: Kolmogorov-Smirnov, Kuiper, Cramer-von Mises, Anderson-Darling, Wasserstein, and DTS. The default test (two_sample) is based on the DTS test statistic, as it is the most powerful, and thus most useful to most users. The DTS test statistic builds on the Wasserstein distance by using a weighting scheme like that of Anderson-Darling. See the companion paper at <arXiv:2007.01360> or <https://codowd.com/public/DTS.pdf> for details of that test statistic, and non-standard uses of the package (parallel for big N, weighted observations, one sample tests, etc). We also include the permutation scheme to make test building simple for others.
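The general idea of a randomization-based two-sample test can be sketched in Python as follows. This is a minimal illustration, not the package's implementation: it uses the Kolmogorov-Smirnov statistic rather than DTS, and the names `ks_statistic` and `permutation_p_value` and their parameters are hypothetical.

```python
import random


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    points = sorted(set(a) | set(b))

    def ecdf(sample, t):
        return sum(1 for v in sample if v <= t) / len(sample)

    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in points)


def permutation_p_value(a, b, statistic=ks_statistic, n_perm=999, seed=0):
    """Estimate a p-value by reshuffling the pooled observations.

    Under the null hypothesis the group labels are exchangeable, so the
    observed statistic is compared with statistics from random relabelings.
    """
    rng = random.Random(seed)
    observed = statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return (hits + 1) / (n_perm + 1)
```

Because the statistic is passed in as a function, any other two-sample statistic with the same signature can be substituted without changing the permutation scheme.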
In the context of paid research studies and clinical trials, budget considerations and patient sampling from available populations are subject to inherent constraints. We introduce the CDsampling package, which integrates optimal design theories within the framework of constrained sampling. This package offers the possibility to find both D-optimal approximate and exact allocations for sampling with or without constraints. Additionally, it provides functions to find constrained uniform sampling as a robust sampling strategy with limited model information. Our package offers functions for the computation of the Fisher information matrix under generalized linear models (including the regular linear regression model) and multinomial logistic models. To demonstrate the applications, we also provide a simulated dataset and a real dataset embedded in the package. See Yifei Huang, Liping Tong, and Jie Yang (2025) <doi:10.5705/ss.202022.0414>.
A decorator is a function that receives a function, extends its behaviour, and returns the altered function. Any caller that uses the decorated function uses the same interface as if it were the original, undecorated function. Decorators serve two primary uses: (1) Enhancing the response of a function as it sends data to a second component; (2) Supporting multiple optional behaviours. An example of the first use is a timer decorator that runs a function, outputs its execution time on the console, and returns the original function's result. An example of the second use is an input type validation decorator that tests at run time whether the caller has passed input arguments of a particular class. Decorators can reduce execution time, say by memoization, or reduce bugs by adding defensive programming routines.
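The two example decorators described above can be sketched in Python, which has built-in decorator syntax. This only illustrates the concept and is not the package's interface; the names `timer`, `validate_types` and `add` are hypothetical.

```python
import functools
import time


def timer(func):
    """Print the execution time of func and return its original result."""
    @functools.wraps(func)           # keep the wrapped function's name and docs
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} s")
        return result
    return wrapper


def validate_types(*expected):
    """Check at run time that positional arguments have the expected classes."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            for position, (arg, cls) in enumerate(zip(args, expected)):
                if not isinstance(arg, cls):
                    raise TypeError(
                        f"argument {position} must be {cls.__name__}, "
                        f"got {type(arg).__name__}"
                    )
            return func(*args)
        return wrapper
    return decorate


@timer
@validate_types(int, int)
def add(x, y):
    """Add two integers."""
    return x + y
```

Callers use `add(2, 3)` exactly as they would the undecorated function; the timing output and the type check are added without changing the call interface.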
An implementation of logistic normal multinomial (LNM) clustering. It is an extension of the LNM mixture model proposed by Fang and Subedi (2020) <arXiv:2011.06682>, and is designed for clustering compositional data. The package includes three extended models: LNM Factor Analyzer (LNM-FA), LNM Bicluster Mixture Model (LNM-BMM) and Penalized LNM Factor Analyzer (LNM-FA). LNM models have several advantages: 1. LNM provides a more flexible covariance structure; 2. The factor analyzer can reduce the number of parameters to estimate; 3. The bicluster model can simultaneously cluster subjects and taxa, and provides significant biological insights; 4. The penalty term allows sparse estimation of the covariance matrix. Details of the model assumptions and interpretation can be found in Tu and Subedi (2021) <arXiv:2101.01871> and Tu and Subedi (2022) <doi:10.1002/sam.11555>.
Systematic reviews should be described in a high degree of methodological detail. The PRISMA Statement calls for a high level of reporting detail in systematic reviews and meta-analyses. An integral part of the methodological description of a review is a flow diagram. This package produces an interactive flow diagram that conforms to the PRISMA2020 preprint. When made interactive, the reader/user can click on each box and be directed to another website or file online (e.g. a detailed description of the screening methods, or a list of excluded full texts), with a mouse-over tool tip that describes the information linked to in more detail. Interactive versions can be saved as HTML files, whilst static versions for inclusion in manuscripts can be saved as HTML, PDF, PNG, SVG, PS or WEBP files.
Implementation of a probabilistic method to calculate nicheROVER (_niche_ _r_egion and niche _over_lap) metrics using multidimensional niche indicator data (e.g., stable isotopes, environmental variables, etc.). The niche region is defined as the joint probability density function of the multidimensional niche indicators at a user-defined probability alpha (e.g., 95%). Uncertainty is accounted for in a Bayesian framework, and the method can be extended to three or more indicator dimensions. It provides directional estimates of niche overlap, accounts for species-specific distributions in multivariate niche space, and produces unique and consistent bivariate projections of the multivariate niche region. The article by Swanson et al. (2015) <doi:10.1890/14-0235.1> provides a detailed description of the methodology. See the package vignette for a worked example using fish stable isotope data.
Rdiff-backup backs up one directory to another, possibly over a network. The target directory ends up a copy of the source directory, but extra reverse diffs are stored in a special subdirectory of that target directory, so you can still recover files lost some time ago. The idea is to combine the best features of a mirror and an incremental backup. Rdiff-backup also preserves subdirectories, hard links, dev files, permissions, uid/gid ownership, modification times, extended attributes, ACLs, and resource forks. Also, rdiff-backup can operate in a bandwidth-efficient manner over a pipe, like rsync. Thus you can use rdiff-backup and ssh to securely back a hard drive up to a remote location, and only the differences will be transmitted. Finally, rdiff-backup is easy to use and settings have sensible defaults.
This package provides a function to calculate multiple performance metrics for actual and predicted values. In total, eight metrics are calculated for a given pair of actual and predicted series. These help to describe a statistical model's performance in predicting data and to compare the performance of different models. The metrics are Root Mean Squared Error (RMSE), Relative Root Mean Squared Error (RRMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Absolute Scaled Error (MASE), Nash-Sutcliffe Efficiency (NSE), Willmott's Index (WI), and Legates and McCabe Index (LME). For the first five, lower values are better, whereas for the last three, higher values are better. More details can be found in Garai and Paul (2023) <doi:10.1016/j.iswa.2023.200202> and Garai et al. (2024) <doi:10.1007/s11063-024-11552-w>.
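Four of the listed metrics have widely used textbook definitions and can be sketched in Python as below. This is an illustration under those standard definitions, not the package's own code; the function names are hypothetical, and the package may differ in details such as scaling or handling of zeros.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error: lower is better."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean Absolute Error: lower is better."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (actual values must be nonzero)."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def nse(actual, predicted):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, higher is better."""
    mean_actual = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - residual / total
```

An NSE of 0 means the predictions are no better than always predicting the mean of the actual series, which is why it is read in the opposite direction from the error metrics.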
Procedures for simulating biomes by equilibrium vegetation models, with a special focus on paleoenvironmental applications. Three widely used equilibrium biome models are currently implemented in the package: the Holdridge Life Zone (HLZ) system (Holdridge 1947, <doi:10.1126/science.105.2727.367>), the Köppen-Geiger classification (KGC) system (Köppen 1936, <https://koeppen-geiger.vu-wien.ac.at/pdf/Koppen_1936.pdf>) and the BIOME model (Prentice et al. 1992, <doi:10.2307/2845499>). Three climatic forest-steppe models are also implemented. An approach for estimating monthly time series of relative sunshine duration from temperature and precipitation data (Yin 1999, <doi:10.1007/s007040050111>) is also adapted, allowing process-based biome models to be combined with high-resolution paleoclimate simulation datasets (e.g., CHELSA-TraCE21k v1.0 dataset: <https://chelsa-climate.org/chelsa-trace21k/>).
It computes two frequently applied actuarial measures, the expected shortfall and the value at risk. Seven well-known classical distributions in connection to the Bell generalized family are used as follows: Bell-exponential distribution, Bell-extended exponential distribution, Bell-Weibull distribution, Bell-extended Weibull distribution, Bell-Lomax distribution, Bell-Burr-12 distribution, and Bell-Burr-X distribution. Related works include: a) Fayomi, A., Tahir, M. H., Algarni, A., Imran, M., & Jamal, F. (2022). "A new useful exponential model with applications to quality control and actuarial data". Computational Intelligence and Neuroscience, 2022. <doi:10.1155/2022/2489998>. b) Alsadat, N., Imran, M., Tahir, M. H., Jamal, F., Ahmad, H., & Elgarhy, M. (2023). "Compounded Bell-G class of statistical models with applications to COVID-19 and actuarial data". Open Physics, 21(1), 20220242. <doi:10.1515/phys-2022-0242>.
The Genie algorithm (Gagolewski, 2021 <DOI:10.1016/j.softx.2021.100722>) is a robust and outlier-resistant hierarchical clustering method (Gagolewski, Bartoszuk, Cena, 2016 <DOI:10.1016/j.ins.2016.05.003>). This package features its faster and more powerful version. It allows clustering with respect to mutual reachability distances, enabling it to act as a noise point detector or a version of HDBSCAN* that can identify a predefined number of clusters. The package also features an implementation of the Gini and Bonferroni inequality indices, external cluster validity measures (e.g., the normalised clustering accuracy, the adjusted Rand index, the Fowlkes-Mallows index, and normalised mutual information), and internal cluster validity indices (e.g., the Calinski-Harabasz, Davies-Bouldin, Ball-Hall, Silhouette, and generalised Dunn indices). The Python version of genieclust is available via PyPI.
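As an illustration of the inequality indices mentioned above, the following Python sketch computes one common normalised form of the Gini index, scaled so that the result lies between 0 and 1. The function name `gini_index` and the choice of normalisation are assumptions made for this sketch; the package's own implementation may differ in definition and is certainly faster than this quadratic-time version.

```python
def gini_index(xs):
    """Normalised Gini index of non-negative values.

    Returns 0 when all values are equal and 1 when a single value
    holds the entire total. Uses all pairwise absolute differences.
    """
    n = len(xs)
    total = sum(xs)
    if n < 2 or total <= 0:
        raise ValueError("need at least two values with a positive sum")
    pair_sum = sum(
        abs(xs[i] - xs[j]) for i in range(n) for j in range(i + 1, n)
    )
    return pair_sum / ((n - 1) * total)
```

Applied to cluster sizes, a value near 0 indicates clusters of similar size, while a value near 1 indicates that one cluster dominates.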
An updated implementation of R package ranger by Wright et al. (2017) <doi:10.18637/jss.v077.i01> for training and predicting from random forests, particularly suited to high-dimensional data, and for embedding in Multiple Imputation by Chained Equations (MICE) by van Buuren (2007) <doi:10.1177/0962280206074463>. Ensembles of classification and regression trees are currently supported. Sparse data of class dgCMatrix (R package Matrix) can be directly analyzed. Conventional bagged predictions are available alongside an efficient prediction for MICE via the algorithm proposed by Doove et al. (2014) <doi:10.1016/j.csda.2013.10.025>. Trained forests can be written to and read from storage. Survival and probability forests are not supported in the update, nor is data of class gwaa.data (R package GenABEL); use the original ranger package for these analyses.