Easily compute education inequality measures and the distribution of educational attainments for any group of countries, using the data set developed in Jorda, V. and Alonso, JM. (2017) <DOI:10.1016/j.worlddev.2016.10.005>. The package offers the possibility to compute not only the Gini index but also generalized entropy measures for different values of the sensitivity parameter. In particular, the package includes functions to compute the mean log deviation, which is more sensitive to the bottom part of the distribution; Theil's entropy measure, equally sensitive to all parts of the distribution; and the GE measure with the sensitivity parameter set equal to 2, which gives more weight to differences in higher education. The decomposition of these measures into between-country and within-country components is also provided, along with two graphical tools for analysing the evolution of the distribution of educational attainments: the cumulative distribution function and the Lorenz curve.
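The measures named above have simple closed forms; as a conceptual illustration in base R (not this package's interface), for a vector x of educational attainments:

```r
## Conceptual base-R versions of the measures above (not the package API).
gini  <- function(x) sum(abs(outer(x, x, "-"))) / (2 * length(x)^2 * mean(x))
mld   <- function(x) mean(log(mean(x) / x))                # GE(0), bottom-sensitive
theil <- function(x) mean(x / mean(x) * log(x / mean(x)))  # GE(1)
ge2   <- function(x) (mean((x / mean(x))^2) - 1) / 2       # GE(2), top-sensitive

x <- c(2, 4, 6, 8, 12, 16)  # e.g., years of schooling
c(gini = gini(x), mld = mld(x), theil = theil(x), ge2 = ge2(x))
```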
It implements a hybrid spatial model for improved spatial prediction, combining the variable selection capability of LASSO (Least Absolute Shrinkage and Selection Operator) with the Geographically Weighted Regression (GWR) model, which captures spatially varying relationships efficiently. For method details see Wheeler, D.C. (2009) <DOI:10.1068/a40256>. The hybrid model first uses LASSO to select the relevant variables; these selected variables are then incorporated into the GWR framework, allowing the estimation of spatially varying regression coefficients and the prediction of the response variable at unknown test locations while accounting for the spatial heterogeneity of the data. Integrating the LASSO and GWR models enhances prediction accuracy by considering spatial heterogeneity and capturing the local relationships between the predictors and the response variable. The hybrid spatial model can be useful for spatial modeling, especially in scenarios involving complex spatial patterns and large datasets with many predictor variables.
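The two-step idea can be sketched under stated assumptions, using glmnet for the LASSO screen and GWmodel for the GWR fit rather than this package's own exports:

```r
## Minimal two-step LASSO -> GWR sketch (glmnet + GWmodel, not this package's API).
library(glmnet)
library(GWmodel)
library(sp)

set.seed(1)
n <- 200
coords <- cbind(runif(n), runif(n))
X <- matrix(rnorm(n * 10), n, 10, dimnames = list(NULL, paste0("x", 1:10)))
y <- 2 * X[, 1] + coords[, 1] * X[, 2] + rnorm(n)  # spatially varying effect of x2

## Step 1: LASSO screens the 10 candidate predictors
cvfit <- cv.glmnet(X, y)
beta  <- coef(cvfit, s = "lambda.min")[-1, ]
keep  <- names(beta)[beta != 0]

## Step 2: GWR on the selected predictors only
dat <- SpatialPointsDataFrame(coords, data.frame(y, X[, keep, drop = FALSE]))
f   <- reformulate(keep, response = "y")
bw  <- bw.gwr(f, data = dat, approach = "CV", kernel = "bisquare")
gwr.basic(f, data = dat, bw = bw, kernel = "bisquare")
```

Prediction at unseen test locations would carry the LASSO-selected formula into a GWR prediction step (GWmodel offers gwr.predict() for this purpose).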
Designs plots in terms of core structure. See 'example(metaplot)'. Primary arguments are (unquoted) column names; their order and type (numeric or not) dictate the resulting plot. Specify any y variables, an x variable, any groups variable, and any conditioning variables to metaplot() to generate density plots, boxplots, mosaic plots, scatterplots, scatterplot matrices, or conditioned plots. Use multiplot() to arrange plots in grids. Wherever present, the scalar column attributes label and guide are honored, producing fully annotated plots with minimal effort. Attribute guide is typically units, but may be encoded() to provide interpretations of categorical values (see '?encode'). Utility unpack() transforms scalar column attributes to row values and pack() does the reverse, supporting tool-neutral storage of metadata along with primary data. The package supports customizable aesthetics such as reference lines, unity lines, smooths, log transformation, and linear fits. The user may choose between trellis and ggplot output. Compact syntax and integrated metadata promote workflow scalability.
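A hypothetical set of calls illustrating how argument order and type select the plot; the patterns follow the description above, and 'example(metaplot)' is the authoritative reference:

```r
## Hypothetical calls: column order/type selects the plot type.
library(metaplot)
library(magrittr)
Theoph %>% metaplot(conc)                  # single numeric column: density plot
Theoph %>% metaplot(conc, Subject)         # numeric y, categorical x: boxplot
Theoph %>% metaplot(conc, Time)            # numeric y, numeric x: scatterplot
Theoph %>% metaplot(conc, Time, Subject)   # trailing column: grouped scatterplot
```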
We propose a procedure for sample size calculation while controlling the false discovery rate for RNA-seq experimental design. Our procedure depends on the Voom method proposed for RNA-seq data analysis by Law et al. (2014) <DOI:10.1186/gb-2014-15-2-r29> and the sample size calculation method proposed for microarray experiments by Liu and Hwang (2007) <DOI:10.1093/bioinformatics/btl664>. We develop a set of functions that calculate appropriate sample sizes for the two-sample t-test for RNA-seq experiments with fixed or varying sets of parameters. The outputs also contain a plot of power versus sample size, a table of power at different sample sizes, and a table of critical test values at different sample sizes. To install this package, please use 'source("http://bioconductor.org/biocLite.R"); biocLite("ssizeRNA")'. For R version 3.5 or greater, please use 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("ssizeRNA")'.
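A hedged usage sketch; the function and argument names follow our reading of the package documentation and may differ by version:

```r
## Hedged sketch: sample size for a two-sample comparison at FDR 0.05.
library(ssizeRNA)
set.seed(2016)
res <- ssizeRNA_single(nGenes = 10000, pi0 = 0.8,  # 10000 genes, 80% non-DE
                       mu = 10, disp = 0.1,        # average counts and dispersion
                       fc = 2,                     # target fold change
                       fdr = 0.05, power = 0.8)    # FDR level and desired power
res$ssize  # smallest per-group sample size achieving the target power
```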
We provide a non-parametric and a parametric approach to investigate the equivalence (or non-inferiority) of two survival curves, obtained from two given datasets. The test is based on the construction of confidence intervals at pre-specified time points. For the non-parametric approach, the curves are given by Kaplan-Meier estimates and the variance for calculating the confidence intervals is obtained by Greenwood's formula. The parametric approach is based on estimating the underlying distribution, where the user can choose between a Weibull, Exponential, Gaussian, Logistic, Log-normal or Log-logistic distribution. Estimates of the variance for calculating the confidence bands are obtained by a (parametric) bootstrap approach. For this bootstrap, censoring is assumed to be exponentially distributed, with estimates obtained from the datasets under consideration. All details can be found in K. Moellenhoff and A. Tresch: Survival analysis under non-proportional hazards: investigating non-inferiority or equivalence in time-to-event data <arXiv:2009.06699>.
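The non-parametric building block can be illustrated with the survival package (a conceptual sketch, not this package's own interface): Kaplan-Meier estimates with Greenwood-based pointwise confidence intervals at pre-specified time points.

```r
## Conceptual sketch with the 'survival' package (not this package's API).
library(survival)
fit <- survfit(Surv(time, status) ~ trt, data = veteran)  # Greenwood variance by default
summary(fit, times = c(90, 180, 365))  # estimates and CIs at the pre-specified times
```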
esDesign implements the adaptive enrichment designs with sample size re-estimation presented in Lin et al. (2021) <doi:10.1016/j.cct.2020.106216>. In detail, three proposed trial designs are provided: the AED1-SSR (or ES1-SSR), AED2-SSR (or ES2-SSR) and AED3-SSR (or ES3-SSR). In addition, the package contains several widely used adaptive designs, such as the Marker Sequential Test (MaST) design proposed by Freidlin et al. (2014) <doi:10.1177/1740774513503739>, the adaptive enrichment designs without early stopping (AED or ES), and the sample size re-estimation procedure (SSR) based on the conditional power proposed by Proschan and Hunsberger (1995), along with several utility functions. These can calculate the futility and/or efficacy stopping boundaries and the required sample size, calibrate the threshold for the difference between subgroup-specific test statistics, and conduct simulation studies for AED, SSR, AED1-SSR, AED2-SSR and AED3-SSR.
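For intuition, conditional power under the current trend, in the spirit of Proschan and Hunsberger (1995), fits in a few lines of base R (a minimal sketch under stated assumptions, not the package's interface):

```r
## Conditional power under the current trend (conceptual sketch).
cond_power <- function(z1, t1, alpha = 0.025) {
  # z1: interim test statistic; t1: information fraction n1/n
  za <- qnorm(1 - alpha)
  1 - pnorm((za - sqrt(t1) * z1) / sqrt(1 - t1) - z1 * sqrt((1 - t1) / t1))
}
cond_power(z1 = 1.5, t1 = 0.5)  # about 0.59 with half the information observed
```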
Given independent and identically distributed observations X(1), ..., X(n) from a density f, provides five methods to perform a multiscale analysis of f as well as the necessary critical values. The first method, introduced in Duembgen and Walther (2008), provides simultaneous confidence statements for the existence and location of local increases (or decreases) of f, based on all intervals I(all) spanned by any two observations X(j), X(k). The second method approximates the latter approach using only a subset of I(all) and is therefore computationally much more efficient, yet asymptotically equivalent. Omitting the additive correction term Gamma in either method yields two further approaches that are more powerful on small scales and less powerful on large scales, but no longer asymptotically minimax optimal. Finally, the block procedure is a compromise between adding Gamma or not, with intermediate power properties. The block procedure is again asymptotically equivalent to the first method and was introduced in Rufibach and Walther (2010).
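To see why the approximating subset matters, note that I(all) contains one interval per pair of observations, so its size grows quadratically with n:

```r
## Size of the full interval family I(all) as a function of n.
n <- c(100, 1000, 10000)
choose(n, 2)  # 4950, 499500, 49995000 candidate intervals
```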
Detection of overdispersion in count data for multiple regression analysis. Log-linear count data regression is one of the most popular techniques for predictive modeling where there is a non-negative discrete quantitative dependent variable. To ensure that inferences from count data models are appropriate, researchers may choose between the estimation of a Poisson model and a negative binomial model; the correct decision depends directly on the existence of overdispersion in the dependent variable, conditional on the explanatory variables. Based on the studies of Cameron and Trivedi (1990) <doi:10.1016/0304-4076(90)90014-K> and Cameron and Trivedi (2013, ISBN:978-1107667273), the overdisp() command is a contribution to researchers, providing a fast and reliable test for the detection of overdispersion in count data. Another advantage is that installing other packages is unnecessary, since the command runs in base R.
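The underlying idea can be illustrated with Cameron and Trivedi's auxiliary OLS regression, which overdisp() automates (a conceptual sketch, not the package's output format):

```r
## Cameron-Trivedi (1990) auxiliary regression test, sketched in base R.
set.seed(1)
x <- rnorm(500)
y <- rnbinom(500, mu = exp(0.5 + x), size = 1)  # deliberately overdispersed counts
pois <- glm(y ~ x, family = poisson)
mu   <- fitted(pois)
aux  <- ((y - mu)^2 - y) / mu                   # auxiliary variable
summary(lm(aux ~ 0 + mu))  # significant slope => overdispersion => prefer neg. binomial
```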
Simulate multivariate correlated data given nonparametric marginals and a joint structure characterized by a Pearson or Spearman correlation matrix. The simulator approaches the problem from a purely computational perspective: it assumes no statistical models such as copulas or parametric distributions, and can approximate the target correlations regardless of theoretical feasibility. The algorithm integrates and advances the Iman-Conover (1982) approach <doi:10.1080/03610918208812265> and the Ruscio-Kaczetow (2008) iteration <doi:10.1080/00273170802285693>. Package functions are carefully implemented in C++ for speed and are suitable for large input in a manycore environment. Both the precision of the approximation and the computing speed substantially outperform various CRAN packages to date; benchmarks are detailed in the function examples. A simple heuristic algorithm is additionally provided to optimize the joint distribution in the post-simulation stage, and it has shown good potential for achieving the same precision of approximation without the enhanced Iman-Conover-Ruscio-Kaczetow step. The package contains a copy of the Permuted Congruential Generator.
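The core rank-matching step in the spirit of Iman-Conover can be sketched in base R; the package's C++ implementation integrates this with the Ruscio-Kaczetow iteration and further refinements:

```r
## Rank-matching in the spirit of Iman-Conover (1982): impose a target
## correlation on fixed marginals by reordering against a correlated
## normal reference (conceptual sketch, not the package's C++ code).
iman_conover <- function(X, R) {
  n  <- nrow(X)
  M  <- scale(matrix(rnorm(n * ncol(X)), n))  # independent normal scores
  Tm <- M %*% chol(R)                         # reference with correlation ~ R
  for (j in seq_len(ncol(X)))
    X[, j] <- sort(X[, j])[rank(Tm[, j], ties.method = "first")]
  X                                           # marginals intact, ranks match Tm
}

X <- cbind(rexp(1000), runif(1000), rlnorm(1000))
R <- matrix(c(1, .6, .3, .6, 1, .5, .3, .5, 1), 3)
cor(iman_conover(X, R), method = "spearman")  # close to the target R
```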
Simple and efficient access to Yahoo Finance's screener API <https://finance.yahoo.com/research-hub/screener/> for querying and retrieving financial data. The core functionality abstracts the complexities of interacting with Yahoo Finance APIs, such as session management, crumb and cookie handling, query construction, pagination, and JSON payload generation. This abstraction lets users focus on filtering and retrieving data rather than managing API details. Use cases include screening across a range of security types, including equities, mutual funds, ETFs, indices, and futures. The package supports advanced query capabilities, including logical operators, nested filters, and customizable payloads. It automatically handles pagination, fetching results in batches of up to 250 entries per request for efficient retrieval of large datasets. Filters can be defined dynamically to accommodate a wide range of screening needs. The implementation leverages standard HTTP libraries to handle API interactions efficiently and provides support for both R and Python to ensure accessibility for a broad audience.
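A hypothetical call pattern; the function names and filter structure below are illustrative assumptions, not confirmed exports of this package:

```r
## Hypothetical sketch: screen US equities with market cap above $10B.
library(yfscreen)
query   <- create_query(list(list("eq", list("region", "us")),
                             list("gt", list("intradaymarketcap", 1e10))))
payload <- create_payload("equity", query)
data    <- get_data(payload)  # pagination handled in batches of up to 250 rows
head(data)
```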
Cell surface proteins form a major fraction of the druggable proteome and can be used for tissue-specific delivery of oligonucleotide/cell-based therapeutics. Alternatively spliced surface protein isoforms have been shown to differ in their subcellular localization and/or their transmembrane (TM) topology. Surface proteins are hydrophobic and remain difficult to study, necessitating the use of TM topology prediction methods such as TMHMM and Phobius. However, there is a need for bioinformatic approaches to streamline batch processing of isoforms for comparing and visualizing topologies. To address this gap, we have developed an R package, surfaltr. It pairs input isoforms, either known alternatively spliced or novel, with their APPRIS-annotated principal counterparts, predicts their TM topologies using TMHMM or Phobius, and generates a customizable graphical output. Further, surfaltr facilitates the prioritization of biologically diverse isoform pairs through three different ranking metrics and through protein alignment functions. Citations for the programs mentioned here can be found in the vignette.
This package performs non-parametric bootstrap and permutation resampling-based multiple testing procedures (including empirical Bayes methods) for controlling the family-wise error rate (FWER), generalized family-wise error rate (gFWER), tail probability of the proportion of false positives (TPPFP), and false discovery rate (FDR). Several choices of bootstrap-based null distribution are implemented (centered, centered and scaled, quantile-transformed), and both single-step and step-wise methods are available. Tests based on a variety of T- and F-statistics (including T-statistics based on regression parameters from linear and survival models, as well as those based on correlation parameters) are included. When probing hypotheses with T-statistics, users may also select a potentially faster null distribution which is multivariate normal with mean zero and variance-covariance matrix derived from the vector influence function. Results are reported in terms of adjusted P-values, confidence regions and test statistic cutoffs. The procedures are directly applicable to identifying differentially expressed genes in DNA microarray experiments.
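A hedged sketch of a typical call; the argument names follow our reading of the MTP() help page, and the golub example data ships with the package:

```r
## Hedged sketch: FWER control with a centered-and-scaled bootstrap null.
library(multtest)
data(golub)                          # Golub leukemia data: golub, golub.cl
mtp <- MTP(X = golub, Y = golub.cl,  # genes x samples matrix, class labels
           test = "t.twosamp.unequalvar",  # Welch-type two-sample t
           typeone = "fwer", alpha = 0.05,
           nulldist = "boot.cs", B = 100)   # centered-and-scaled bootstrap
sum(mtp@adjp < 0.05)                 # genes significant after FWER control
```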
Offers calculation, visualization and comparison of algorithmic fairness metrics. Fair machine learning is an emerging topic with the overarching aim to critically assess whether ML algorithms reinforce existing social biases. Unfair algorithms can propagate such biases and produce predictions with a disparate impact on various sensitive groups of individuals (defined by sex, gender, ethnicity, religion, income, socioeconomic status, physical or mental disabilities). Fair algorithms possess the underlying foundation that these groups should be treated similarly or have similar prediction outcomes. The fairness R package offers the calculation and comparisons of commonly and less commonly used fairness metrics in population subgroups. These methods are described by Calders and Verwer (2010) <doi:10.1007/s10618-010-0190-x>, Chouldechova (2017) <doi:10.1089/big.2016.0047>, Feldman et al. (2015) <doi:10.1145/2783258.2783311>, Friedler et al. (2018) <doi:10.1145/3287560.3287589> and Zafar et al. (2017) <doi:10.1145/3038912.3052660>. The package also offers convenient visualizations to help understand fairness metrics.
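A hedged usage sketch; the argument names follow our reading of the package documentation and may differ by version, and the compas example data ships with the package:

```r
## Hedged sketch: equalized odds across ethnicity groups.
library(fairness)
data(compas)
equal_odds(data    = compas,
           outcome = 'Two_yr_Recidivism',
           group   = 'ethnicity',
           probs   = 'probability',  # predicted probabilities
           cutoff  = 0.5,
           base    = 'Caucasian')    # metrics reported relative to the base group
```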
The phenomis package provides methods to perform post-processing (i.e. quality control and normalization) as well as univariate statistical analysis of single and multi-omics data sets. These methods include quality control metrics, signal drift and batch effect correction, intensity transformation and univariate hypothesis testing, as well as clustering and annotation of metabolomics data. The data are handled in the standard Bioconductor formats (i.e. SummarizedExperiment and MultiAssayExperiment for single and multi-omics datasets, respectively; the alternative ExpressionSet and MultiDataSet formats are also supported for convenience). As a result, all methods can be readily chained as workflows. The pipeline can be further enriched by multivariate analysis and feature selection, using the ropls and biosigner packages, which support the same formats. Data can be conveniently imported from and exported to text files. Although the methods were initially targeted to metabolomics data, most of them can be applied to other types of omics data (e.g., transcriptomics, proteomics).
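A hypothetical chained workflow based on the description; the function names are assumptions, not confirmed exports:

```r
## Hypothetical sketch of a chained post-processing workflow.
library(phenomis)
data_dir <- "path/to/your/dataset"  # folder with data matrix and metadata text files
se <- reading(data_dir)             # import into a SummarizedExperiment
se <- correcting(se)                # signal drift and batch effect correction
se <- transforming(se)              # intensity transformation
se <- hypotesting(se, "ttest", "gender")  # univariate hypothesis testing
```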
Build graphs for landscape genetics analysis. This set of functions can be used to import and convert spatial and genetic data initially in different formats, import landscape graphs created with the GRAPHAB software (Foltete et al., 2012) <doi:10.1016/j.envsoft.2012.07.002>, make diagnostic plots of isolation-by-distance relationships in order to choose how to build genetic graphs, create graphs with a large range of pruning methods, weight their links with several genetic distances, plot and analyse graphs, and compare them with other graphs. It uses functions from other packages such as adegenet (Jombart, 2008) <doi:10.1093/bioinformatics/btn129> and igraph (Csardi and Nepusz, 2006) <https://igraph.org/>. It also implements methods commonly used in landscape genetics to create graphs, described by Dyer and Nason (2004) <doi:10.1111/j.1365-294X.2004.02177.x> and Greenbaum and Fefferman (2017) <doi:10.1111/mec.14059>, and to analyse distance data (van Strien et al., 2015) <doi:10.1038/hdy.2014.62>.
This package implements the Additive Logistic Transformation (alr) for Small Area Estimation under the Fay-Herriot model. Small Area Estimation borrows strength from auxiliary variables to improve the effectiveness of a domain sample size, and this package uses the Empirical Best Linear Unbiased Prediction (EBLUP). The Additive Logistic Transformation (alr) is based on the transformation of Aitchison (1986). The covariance matrix for the multivariate application is based on the covariance matrix used by Esteban M, Lombardía M, López-Vizcaíno E, Morales D, and Pérez A <doi:10.1007/s11749-019-00688-w>. The non-sampled models are modified area-level models based on the models proposed by Anisa R, Kurnia A, and Indahwati I <doi:10.9790/5728-10121519>, with the univariate model using model-3 and the multivariate model using model-1. The MSE is estimated using a parametric bootstrap approach. For non-sampled cases, the MSE is estimated using the modified approach proposed by Haris F and Ubaidillah A <doi:10.4108/eai.2-8-2019.2290339>.
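The alr transform itself is simple; in base R, for a composition p summing to 1 (a conceptual sketch, not the package interface):

```r
## Additive log-ratio transform of Aitchison (1986), last part as reference.
alr     <- function(p) log(p[-length(p)] / p[length(p)])
alr_inv <- function(z) { e <- exp(c(z, 0)); e / sum(e) }  # back-transform

p <- c(0.2, 0.5, 0.3)
alr_inv(alr(p))  # recovers 0.2 0.5 0.3
```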
Enables the analysis of spectroscopy data such as infrared ('IR'), Raman, and nuclear magnetic resonance ('NMR') using the tidy data framework from the 'tidyverse'. The tidyspec package provides functions for data transformation, normalization, baseline correction, smoothing, derivatives, and both interactive and static visualization. It promotes structured, reproducible workflows for spectral data exploration and preprocessing. Implemented methods include Savitzky and Golay (1964) "Smoothing and Differentiation of Data by Simplified Least Squares Procedures" <doi:10.1021/ac60214a047>, Sternberg (1983) "Biomedical Image Processing" <https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1654163>, Zimmermann and Kohler (1996) "Baseline correction using the rolling ball algorithm" <doi:10.1016/0168-583X(95)00908-6>, Beattie and Esmonde-White (2021) "Exploration of Principal Component Analysis: Deriving Principal Component Analysis Visually Using Spectra" <doi:10.1177/0003702820987847>, Wickham et al. (2019) "Welcome to the tidyverse" <doi:10.21105/joss.01686>, and Kuhn, Wickham and Hvitfeldt (2024) "recipes: Preprocessing and Feature Engineering Steps for Modeling" <https://CRAN.R-project.org/package=recipes>.
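For instance, the Savitzky-Golay smoothing cited above can be illustrated with signal::sgolayfilt() (a conceptual sketch; tidyspec wraps such steps in its own tidy interface):

```r
## Savitzky-Golay smoothing of a synthetic noisy spectrum (signal package).
library(signal)
x     <- seq(0, 10, by = 0.01)
noisy <- sin(x) + rnorm(length(x), sd = 0.1)
clean <- sgolayfilt(noisy, p = 3, n = 11)  # cubic fit over an 11-point window
```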
The centralized empirical cumulative average deviation function is utilized to develop both the Ada-plot and the Uda-plot as alternatives to the Ad-plot and Ud-plot introduced by the author. Analogous to the Ad-plot, the Ada-plot can identify symmetry, skewness, and outliers of the data distribution. The Uda-plot is as effective as the Ud-plot in assessing normality. The d-value, which quantifies the degree of proximity between the Uda-plot and the graph of the estimated normal density function, helps guide decisions on confirming normality. On demand, extreme values in the data can be eliminated using the 1.5IQR rule to create a robust version. A full description of the methodology can be found in the article by Wijesuriya (2025a) <doi:10.1080/03610926.2025.2558108>. Further, the development of the Ad-plot and Ud-plot is contained in both the article and the adplots R package by Wijesuriya (2025b & 2025c) <doi:10.1080/03610926.2024.2440583> and <doi:10.32614/CRAN.package.adplots>.
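The 1.5IQR elimination rule used for the robust versions, in base R (a conceptual sketch, not the package interface):

```r
## Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
trim_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  r <- 1.5 * diff(q)
  x[x >= q[1] - r & x <= q[2] + r]
}
trim_iqr(c(rnorm(100), 10))  # the extreme value 10 is removed
```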
In total the package has seven functions: three for machine calibration, which determine the application rate (L/ha), nozzle flow (L/min) and amount of product (L or kg) to be added to the tank with each sprayer filling; two for graphs of the flow distribution of the nozzles (L/min) along the application bar and of the temporal variability of the meteorological conditions (air temperature, relative humidity and wind speed); and two to determine the spray deposit (uL/cm2) by spectrophotometry, with the aid of brilliant blue (Palladini, L.A., Raetano, C.G., Velini, E.D. (2005) <doi:10.1590/S0103-90162005000500005>) or metallic markers (Chaim, A., Castro, V.L.S.S., Correles, F.M., Galvão, J.A.H., Cabral, O.M.R., Nicolella, G. (1999) <doi:10.1590/S0100-204X1999000500003>). The package supports the analysis and representation of this information using a single free software environment that serves the most diverse areas of application technology.
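The standard calibration identity behind an application rate calculation can be stated directly (a conceptual sketch in base R, not the package's own function): rate (L/ha) = 600 x nozzle flow (L/min) / (speed (km/h) x nozzle spacing (m)).

```r
## Application rate from nozzle flow, travel speed and nozzle spacing.
rate_l_ha <- function(flow_l_min, speed_km_h, spacing_m) {
  600 * flow_l_min / (speed_km_h * spacing_m)  # 600 converts min, km, m to ha
}
rate_l_ha(flow_l_min = 0.8, speed_km_h = 6, spacing_m = 0.5)  # 160 L/ha
```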
This package implements the non-iterative conditional expectation (NICE) algorithm of the g-formula (Robins (1986) <doi:10.1016/0270-0255(86)90088-6>; Hernán and Robins (2024, ISBN:9781420076165)). The g-formula can estimate an outcome's counterfactual mean or risk under hypothetical treatment strategies (interventions) when there is sufficient information on time-varying treatments and confounders. The package can be used for discrete or continuous time-varying treatments and for failure time outcomes or continuous/binary end-of-follow-up outcomes. It can handle a random measurement/visit process and a priori knowledge of the data structure, as well as censoring (e.g., by loss to follow-up) and two options for handling competing events for failure time outcomes. Interventions can be flexibly specified, both as interventions on a single treatment and as joint interventions on multiple treatments. See McGrath et al. (2020) <doi:10.1016/j.patter.2020.100008> for a guide on how to use the package.
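The single-time-point special case of the g-formula (standardization) shows the core idea in base R; this is a conceptual sketch, and the package's NICE algorithm generalizes it to time-varying treatments and confounders:

```r
## g-formula at a single time point: standardize over the confounder L.
set.seed(3)
L <- rnorm(1000)                        # confounder
A <- rbinom(1000, 1, plogis(L))         # treatment depends on L
Y <- rbinom(1000, 1, plogis(-1 + A + L))
fit <- glm(Y ~ A + L, family = binomial)
# counterfactual risks under "always treat" vs "never treat"
mean(predict(fit, data.frame(A = 1, L = L), type = "response"))
mean(predict(fit, data.frame(A = 0, L = L), type = "response"))
```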
S4 toolbox for capacity (or non-additive measure, fuzzy measure) and integral manipulation in a finite setting. It contains routines for handling various types of set functions such as games or capacities. It can be used to compute several non-additive integrals: the Choquet integral, the Sugeno integral, and the symmetric and asymmetric Choquet integrals. An analysis of capacities in terms of decision behavior can be performed through the computation of various indices such as the Shapley value, the interaction index, and the orness degree. The well-known Möbius transform, as well as other equivalent representations of set functions, can also be computed. Kappalab further contains seven capacity identification routines: three least-squares based approaches, a method based on linear programming, a maximum-entropy-like method based on variance minimization, a minimum distance approach and an unsupervised approach based on parametric entropies. The functions contained in Kappalab can, for instance, be used in the framework of multicriteria decision making or cooperative game theory.
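A hedged sketch of typical calls (the names follow our reading of the Kappalab documentation): a capacity on three criteria given in binary set order, aggregated with the Choquet integral and analyzed via the Shapley value and Möbius transform.

```r
## Hedged sketch: capacity on 3 criteria, coefficients in binary set order
## ({}, {1}, {2}, {1,2}, {3}, {1,3}, {2,3}, {1,2,3}), monotone, mu({})=0.
library(kappalab)
mu <- capacity(c(0, 0.1, 0.3, 0.5, 0.4, 0.8, 0.7, 1))
Choquet.integral(mu, c(0.4, 0.7, 0.2))  # aggregate a score profile
Shapley.value(mu)                       # mean contribution of each criterion
Mobius(mu)                              # equivalent Mobius representation
```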
This package provides a modeling tool dedicated to biological network modeling (Bertrand and others 2020, <doi:10.1093/bioinformatics/btaa855>). It allows for single or joint modeling of, for instance, genes and proteins. It starts with the selection of the actors that will be used in the subsequent reverse-engineering step. An actor can be included in that selection based on its differential measurement (for instance, gene expression or protein abundance) or on its time course profile. Wrappers for actor clustering functions and cluster analysis are provided. The package also allows reverse engineering of biological networks, taking into account the observed time course patterns of the actors. Many inference functions are provided, dedicated to obtaining specific features in the inferred network such as sparsity, robust links, high-confidence links, or links that are stable under resampling. Some simulation and prediction tools are also available for cascade networks (Jung and others 2014, <doi:10.1093/bioinformatics/btt705>). Examples of use with microarray or RNA-Seq data are provided.
Mutational signatures are carcinogenic exposures or aberrant cellular processes that can cause alterations to the genome. We created musicatk (MUtational SIgnature Comprehensive Analysis ToolKit) to address shortcomings in versatility and ease of use in pre-existing computational tools. Although many different types of mutational data have been generated, current software packages do not have a flexible framework that allows users to mix and match different types of mutations in the mutational signature inference process. Musicatk enables users to count and combine multiple mutation types, including SBS, DBS, and indels. It can compute replication strand, transcription strand and combinations of these features, and supports discovery from custom genomic features associated with any mutation type. Musicatk also implements several methods for the discovery of new signatures as well as methods to infer exposures given an existing set of signatures. Finally, it provides functions for visualization and downstream exploratory analysis, including the ability to compare signatures between cohorts and to find matching signatures in COSMIC V2 or COSMIC V3.
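A hedged sketch of a discovery workflow; the object and function names follow our reading of the vignette and may differ by version:

```r
## Hedged sketch: de novo SBS signature discovery from example data.
library(musicatk)
data(musica_sbs96)  # example object with SBS96 count tables
res <- discover_signatures(musica_sbs96, table_name = "SBS96",
                           num_signatures = 3)  # de novo discovery
plot_signatures(res)
```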
This package provides functions to develop simulated continuous data (e.g., gene expression) from a sigma covariance matrix derived from a graph structure in igraph objects. It is intended to extend mvtnorm to take igraph structures rather than sigma matrices as input, allowing the use of simulated data that correctly accounts for pathway relationships and correlations. We present a versatile statistical framework to simulate correlated gene expression data from biological pathways, by sampling from a multivariate normal distribution derived from a graph structure. The package allows the simulation of biological pathways from a graph structure based on a statistical model of gene expression. For example, methods to infer biological pathways and gene regulatory networks from gene expression data can be tested on simulated datasets using this framework. It also allows pathway structures to be considered as a confounding variable when simulating gene expression data to test the performance of genomic analyses.
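A hedged usage sketch; generate_expression() follows our reading of the package interface and may differ by version:

```r
## Hedged sketch: simulate expression consistent with a small pathway graph.
library(graphsim)
library(igraph)
g <- graph_from_literal(A -+ B, B -+ C, C -+ D)  # small directed pathway
expr <- generate_expression(100, g, cor = 0.8)   # 100 samples from the implied sigma
dim(expr)  # genes x samples, correlations consistent with the pathway structure
```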