This package provides a series of functions for performing differential expression analysis on RNA-seq count data using a robust normalization strategy called DEGES (DEG elimination strategy). The basic idea of DEGES is that potential differentially expressed genes or transcripts (DEGs) among the compared samples should be removed before data normalization, so as to obtain a well-ranked gene list in which true DEGs are top-ranked and non-DEGs are bottom-ranked. This is achieved through a multi-step normalization procedure. A major characteristic of TCC is that it provides robust normalization methods for several kinds of count data (two-group with or without replicates, multi-group/multi-factor, and so on) by combining functions from the packages it depends on.
Predicts individual race/ethnicity using surname, first name, middle name, geolocation, and other attributes, such as gender and age. The method utilizes Bayes Rule (with optional measurement error correction) to compute the posterior probability of each racial category for any given individual. The package implements methods described in Imai and Khanna (2016) "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records" Political Analysis <DOI:10.1093/pan/mpw001> and Imai, Olivella, and Rosenman (2022) "Addressing census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements" <DOI:10.1126/sciadv.adc9824>. The package also incorporates the data described in Rosenman, Olivella, and Imai (2023) "Race and ethnicity data for first, middle, and surnames" <DOI:10.1038/s41597-023-02202-2>.
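As a rough illustration of the Bayes-rule step (a hedged sketch with made-up numbers and categories, not this package's actual implementation), the posterior for each racial category is proportional to the surname-given-race probability times the race share of the local geography:
    # Hedged sketch, not the package's API: P(race | surname, location) is
    # proportional to P(surname | race) * P(race | location).
    p_surname_given_race <- c(white = 0.002, black = 0.010, hispanic = 0.0005)
    p_race_in_tract      <- c(white = 0.60,  black = 0.25,  hispanic = 0.15)
    unnormalized <- p_surname_given_race * p_race_in_tract
    posterior <- unnormalized / sum(unnormalized)   # normalize over categories
    round(posterior, 3)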
This package provides a two-step approach to imputing missing data in metabolomics. Step 1 uses a random forest classifier to classify missing values as either Missing Completely at Random/Missing At Random (MCAR/MAR) or Missing Not At Random (MNAR); MCAR and MAR are combined because the two are often difficult to distinguish in metabolomics data. Step 2 imputes the missing values with algorithms appropriate to the classified missing mechanisms. Imputation algorithms tested and available for MCAR/MAR include Bayesian Principal Component Analysis (BPCA), Multiple Imputation No-Skip K-Nearest Neighbors (Multi_nsKNN), and Random Forest. Imputation algorithms tested and available for MNAR include nsKNN and a single-imputation approach for metabolites where left-censoring is present.
This package provides a method for efficiently recovering the precision matrix of Gaussian graphical models. The approach has three parts. First, Hard Graphical Thresholding is used for the best-subset-selection problem of the Gaussian graphical model; the core concept of this method was proposed by Luo et al. (2014) <arXiv:1407.7819>. Second, a closed-form solution for the graphical lasso under an acyclic graph structure is implemented (Fattahi and Sojoudi (2019) <https://jmlr.org/papers/v20/17-501.html>). Third, a block coordinate descent algorithm is implemented to efficiently solve the covariance selection problem (Dempster (1972) <doi:10.2307/2528966>). The package is computationally efficient and can solve ultra-high-dimensional problems, e.g. p > 10,000, in a few minutes.
This package provides functions for the joint analysis of Q sets of p-values obtained for the same list of items. The joint analysis is performed by querying a composite hypothesis, i.e. an arbitrarily complex combination of simple hypotheses, as described in Mary-Huard et al. (2021) <doi:10.1093/bioinformatics/btab592> and De Walsche et al. (2023) <doi:10.1101/2024.03.17.585412>. In this approach, the Q-uplet of p-values associated with each item is distributed as a multivariate mixture, where each of the 2^Q components corresponds to a specific combination of simple hypotheses. The dependence between the p-value series is accounted for using a Gaussian copula function. A p-value for the composite hypothesis test is derived from the posterior probabilities.
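To make the 2^Q components concrete, a minimal base-R sketch (not this package's API) enumerates the possible combinations of simple hypotheses for Q = 3 p-value series:
    # Each item's Q-uplet of p-values falls under one of 2^Q configurations,
    # one per combination of H0/H1 across the Q series.
    Q <- 3
    configs <- expand.grid(rep(list(c("H0", "H1")), Q))
    names(configs) <- paste0("series", seq_len(Q))
    nrow(configs)   # 2^Q = 8 components of the multivariate mixture
    configs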
This package implements the Stable Balancing Weights of Zubizarreta (2015) <DOI:10.1080/01621459.2015.1023805>. These are the weights of minimum variance that approximately balance the empirical distribution of the observed covariates. For an overview, see Chattopadhyay, Hase and Zubizarreta (2020) <DOI:10.1002/sim.8659>. To solve the optimization problem in 'sbw', the default solver is 'quadprog', which is readily available through CRAN. The solver 'osqp' is also available on CRAN. To enhance the performance of 'sbw', users are encouraged to install other solvers such as 'gurobi' and 'Rmosek', which require special installation. For the installation of 'gurobi' and 'pogs', please follow the instructions at <https://www.gurobi.com/documentation/current/refman/r_ins_the_r_package.html> and <http://foges.github.io/pogs/stp/r>.
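The underlying optimization can be illustrated with a minimal 'quadprog' sketch (illustrative only, not the 'sbw' interface): find the minimum-variance weights whose weighted covariate mean stays within a tolerance of a target.
    # Hedged sketch, not sbw's API: minimum-variance weights that approximately
    # balance one covariate toward a target mean, solved with quadprog.
    library(quadprog)
    set.seed(1)
    n <- 50
    x <- rnorm(n)                       # observed covariate
    target <- 0                         # desired weighted mean
    delta <- 0.05                       # allowed imbalance
    Dmat <- diag(n)                     # minimize sum(w^2), i.e. weight variance
    dvec <- rep(0, n)
    Amat <- cbind(rep(1, n),            # sum(w) == 1  (equality constraint)
                  x,                    # sum(w * x) >= target - delta
                  -x,                   # sum(w * x) <= target + delta
                  diag(n))              # w >= 0
    bvec <- c(1, target - delta, -(target + delta), rep(0, n))
    w <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
    c(sum(w), sum(w * x))               # check that the constraints roughly hold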
This analytical framework is based on the shape of trait abundance distributions, in order to better understand community assembly processes and to predict community dynamics under environmental changes. The framework relies on the relationship between the moments describing the shape of the distributions, namely the skewness and the kurtosis (the SKR). The SKR allows the identification of commonalities in the shape of trait distributions across contrasting communities. Derived from the SKR, mathematical parameters summarise the complex pattern of distributions by assessing (i) the R², (ii) the Y-intercept, (iii) the slope, (iv) the functional stability of the community (TADstab), and (v) the distance from specific distribution families (i.e., the distance from the skew-uniform family, a limit to the highest degree of evenness: TADeve).
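A minimal base-R sketch of the SKR idea (not this framework's functions): compute skewness and kurtosis of trait abundance distributions across several communities and regress kurtosis on squared skewness to obtain the slope, the Y-intercept and the R².
    # Hedged sketch: skewness-kurtosis relationship (SKR) across communities.
    set.seed(1)
    moments <- t(sapply(1:10, function(i) {
      x <- rgamma(200, shape = runif(1, 1, 5))   # one community's trait values
      m <- mean(x); s <- sd(x)
      c(skew = mean(((x - m) / s)^3), kurt = mean(((x - m) / s)^4))
    }))
    fit <- lm(kurt ~ I(skew^2), data = as.data.frame(moments))
    coef(fit)                                    # Y-intercept and slope
    summary(fit)$r.squared                       # R^2 of the SKR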
Differential abundance testing in microbiome data challenges both parametric and non-parametric statistical methods, due to its sparsity, high variability and compositional nature. Microbiome-specific statistical methods often assume classical distribution models or take compositional specifics into account. Their results spread across the specificity-versus-sensitivity space in such a way that type I and type II errors are difficult to ascertain in real microbiome data when a single method is used. Recently, a consensus approach based on multiple differential abundance (DA) methods was suggested in order to increase robustness. With dar, you can use dplyr-like pipeable sequences of DA methods and then apply different consensus strategies, obtaining more reliable results in a fast, consistent and reproducible way.
Calculates the prevalence of users of a product based on the prevalence of triers in the population. Measuring triers is relatively easy: it is just a question of whether a person has tried a product even once in their life. Measuring the people who also adopt it as part of their life is more complicated, since adopting an innovative product is a subjective view of the individual. Mickey Kislev and Shira Kislev developed a formula to calculate the prevalence of a product's users to overcome this difficulty. The current package assists in calculating the prevalence of users of a product based on the prevalence of triers in the population. See Kislev, M. M., and S. Kislev (2020) <doi:10.5539/ijms.v12n4p63>.
An implementation of the Aligned Rank Transform technique for factorial analysis (see the references below for details), including models with missing terms (unsaturated factorial models). The function first computes a separate aligned ranked response variable for each effect of the user-specified model, and then runs a classic ANOVA on each of the aligned ranked responses. For further details, see Higgins, J. J. and Tashtoush, S. (1994). An aligned rank transform test for interaction. Nonlinear World 1 (2), pp. 201-211. Wobbrock, J. O., Findlater, L., Gergle, D. and Higgins, J. J. (2011). The Aligned Rank Transform for nonparametric factorial analyses using only ANOVA procedures. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '11). New York: ACM Press, pp. 143-146. <doi:10.1145/1978942.1978963>.
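A minimal base-R sketch of the align-then-rank idea for one main effect (illustrative only, not this package's interface):
    # Hedged sketch: align and rank the response for main effect A in a
    # two-factor design, then run a classic ANOVA on the aligned ranks.
    set.seed(1)
    d <- expand.grid(A = factor(1:2), B = factor(1:3), rep = 1:5)
    d$y <- rnorm(nrow(d))
    res <- residuals(aov(y ~ A * B, data = d))   # strip all model effects
    effA <- ave(d$y, d$A) - mean(d$y)            # add back only the A effect
    d$rank_A <- rank(res + effA)                 # rank the aligned response
    summary(aov(rank_A ~ A * B, data = d))       # ANOVA on the aligned ranks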
This package provides comprehensive functionalities for causal modeling with Coincidence Analysis (CNA), which is a configurational comparative method of causal data analysis that was first introduced in Baumgartner (2009) <doi:10.1177/0049124109339369>, and generalized in Baumgartner & Ambuehl (2020) <doi:10.1017/psrm.2018.45>. CNA is designed to recover INUS-causation from data, which is particularly relevant for analyzing processes featuring conjunctural causation (component causation) and equifinality (alternative causation). CNA is currently the only method for INUS-discovery that allows for multiple effects (outcomes/endogenous factors), meaning it can analyze common-cause and causal chain structures. Moreover, as of version 4.0, it is the only method of its kind that provides measures for model evaluation and selection that are custom-made for the problem of INUS-discovery.
This package provides functions for testing affine hypotheses on the regression coefficient vector in regression models with heteroskedastic errors: (i) a function for computing various test statistics (in particular using HC0-HC4 covariance estimators based on unrestricted or restricted residuals); (ii) a function for numerically approximating the size of a test based on such test statistics and a user-supplied critical value; and, most importantly, (iii) a function for determining size-controlling critical values for such test statistics and a user-supplied significance level (also incorporating a check of conditions under which such a size-controlling critical value exists). The three functions are based on results in Poetscher and Preinerstorfer (2021) "Valid Heteroskedasticity Robust Testing" <doi:10.48550/arXiv.2104.12597>, which will appear as <doi:10.1017/S0266466623000269>.
Supports Bayesian models with full and partial (hence arbitrary) dependencies between random variables. Discrete and continuous variables are supported, and conditional joint probabilities and probability densities are estimated using Kernel Density Estimation (KDE). The full general form, which implements an extension to Bayes' theorem, as well as the simple form, which is just a Bayesian network, both support regression through segmentation and KDE, and estimation of the probability or relative likelihood of discrete or continuous target random variables. This package also provides true statistical distance measures based on Bayesian models. These measures can be used in neighborhood searches and to estimate the similarity and distance between data points. Related work is by Bayes (1763) <doi:10.1098/rstl.1763.0053> and by Scutari (2010) <doi:10.18637/jss.v035.i03>.
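A toy base-R sketch of the general idea (not this package's interface): combine class priors with kernel density estimates to obtain posterior probabilities for a continuous predictor.
    # Hedged sketch: Bayes' theorem with KDE-based class-conditional densities.
    set.seed(1)
    x1 <- rnorm(100, 0); x2 <- rnorm(100, 2)        # samples from two classes
    kde1 <- density(x1); kde2 <- density(x2)        # one KDE per class
    lik <- function(kde, x) approx(kde$x, kde$y, x, rule = 2)$y
    posterior1 <- function(x, p1 = 0.5) {
      p1 * lik(kde1, x) / (p1 * lik(kde1, x) + (1 - p1) * lik(kde2, x))
    }
    posterior1(1)                                    # P(class 1 | x = 1)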
The generalised lambda distribution, or Tukey lambda distribution, provides a wide variety of shapes with one functional form. This package provides random numbers, quantiles, probabilities, densities and density quantiles for four different types of the distribution: the FKML (Freimer et al. 1988), RS (Ramberg and Schmeiser 1974), GPD (van Staden and Loots 2009) and FM5 - see the documentation for details. It provides the density function, distribution function, and Quantile-Quantile plots. It implements a variety of estimation methods for the distribution, including diagnostic plots. Estimation methods include the starship (all 4 types), the method of L-Moments for the GPD and FKML types, and a number of methods for only the FKML type. These include maximum likelihood, maximum product of spacings, Titterington's method, Moments, Trimmed L-Moments and Distributional Least Absolutes.
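To show the single functional form behind the package's quantile-style functions, a base-R sketch of the FKML quantile function (this is the published parameterisation, not the package's own code):
    # Hedged sketch: the FKML (Freimer et al. 1988) quantile function.
    q_fkml <- function(u, l1 = 0, l2 = 1, l3 = 0.1, l4 = 0.1) {
      l1 + ((u^l3 - 1) / l3 - ((1 - u)^l4 - 1) / l4) / l2
    }
    q_fkml(c(0.25, 0.5, 0.75))          # quartiles of one FKML member
    curve(q_fkml(x), 0.01, 0.99)        # quantile function over (0, 1)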
This package provides tools to download and manipulate the Permanent Household Survey from Argentina (EPH is the Spanish acronym for Permanent Household Survey), e.g. get_microdata() for downloading the datasets, get_poverty_lines() for downloading the official poverty baskets, calculate_poverty() for determining whether a household is in poverty following the official methodology, organize_panels() for concatenating observations from different periods, and organize_labels() for adding the official labels to the data. The implemented methods are based on INDEC (2016) <http://www.estadistica.ec.gba.gov.ar/dpe/images/SOCIEDAD/EPH_metodologia_22_pobreza.pdf>. As this package works with the Argentine Permanent Household Survey and its main audience is from Argentina, the documentation was written in Spanish.
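Based only on the function names listed above, a typical workflow might look like the following; the argument names are illustrative and may differ from the actual interface.
    # Hedged sketch of a workflow using the functions named above; argument
    # names are assumptions, not the documented signatures.
    library(eph)
    base <- get_microdata(year = 2018, period = 1)    # download one survey wave
    base <- organize_labels(base)                     # attach official labels
    canastas <- get_poverty_lines()                   # official poverty baskets
    base <- calculate_poverty(base, canastas)         # flag households in poverty
    panel <- organize_panels(list(base))              # link observations over time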
Performs algebraic operations on neural networks. We seek here to implement in R operations on neural networks and their resulting approximations. The operations derive their descriptions mainly from Rafi, S., Padgett, J. L., and Nakarmi, U. (2024), "Towards an Algebraic Framework For Approximating Functions Using Neural Network Polynomials" <doi:10.48550/arXiv.2402.01058>; Grohs, P., Hornung, F., Jentzen, A. et al. (2023), "Space-time error estimates for deep neural network approximations for differential equations" <doi:10.1007/s10444-022-09970-2>; and Jentzen, A., Kuckuck, B., von Wurstemberger, P. (2023), "Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory" <doi:10.48550/arXiv.2310.20360>. The implementation is meant mainly as a pedagogical tool and proof of concept; faster implementations with deeper vectorization may be made in future versions.
Construction of the Total Operating Characteristic (TOC) Curve and the Receiver (aka Relative) Operating Characteristic (ROC) Curve for spatial and non-spatial data. The TOC method is a modification of the ROC method which measures the ability of an index variable to diagnose either presence or absence of a characteristic. The diagnosis depends on whether the value of an index variable is above a threshold. Each threshold generates a two-by-two contingency table, which contains four entries: hits (H), misses (M), false alarms (FA), and correct rejections (CR). While ROC shows for each threshold only two ratios, H/(H + M) and FA/(FA + CR), TOC reveals the size of every entry in the contingency table for each threshold (Pontius Jr., R.G., Si, K. 2014. <doi:10.1080/13658816.2013.862623>).
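The two ROC ratios can be illustrated with a toy two-by-two contingency table in base R (illustrative arithmetic only, not this package's interface):
    # Hedged sketch: the four contingency-table entries for one threshold and
    # the two ratios that a ROC curve would plot at that threshold.
    H <- 40; M <- 10; FA <- 25; CR <- 125     # hits, misses, false alarms, correct rejections
    hit_rate         <- H  / (H + M)          # H / (H + M)   = 0.8
    false_alarm_rate <- FA / (FA + CR)        # FA / (FA + CR) ~ 0.167
    c(hit_rate, false_alarm_rate)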
The REUSE tool helps you achieve and confirm license compliance with the REUSE specification, a set of recommendations for licensing Free Software projects. REUSE makes it easy to declare the licenses under which your works are released, especially when reusing software from different projects released under different licenses. It avoids reliance on fuzzy heuristics and allows both legal experts and computers to understand how your project is licensed. This allows generating a "bill of materials" for software.
This tool downloads full license texts, adds copyright and license information to file headers, and contains a linter to identify problems. There are other tools that have a lot more features and functionality surrounding the analysis and inspection of copyright and licenses in software projects. This one is designed to be simple.
The Aligned Corpus Toolkit (act) is designed for linguists who work with time-aligned transcription data. It offers functions to import and export various annotation file formats ('ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files), create print transcripts in the style of conversation analysis, search transcripts (span searches across multiple annotations, search in normalized annotations, make concordances, etc.), export and re-import search results (.csv and Excel .xlsx format), create cuts for the search results (print transcripts, audio/video cuts using FFmpeg and video subtitles in SubRip title .srt format), modify the data in a corpus (search/replace, delete, filter, etc.), interact with 'Praat' using 'Praat' scripts, and exchange data with the 'rPraat' package. The package is itself written in R and may be expanded by other users.
This package implements a tree-based method specifically designed for personalized medicine applications. By using genomic and mutational data, ODT efficiently identifies optimal drug recommendations tailored to individual patient profiles. The ODT algorithm constructs decision trees that bifurcate at each node, selecting the most relevant markers (discrete or continuous) and corresponding treatments, thus ensuring that recommendations are both personalized and statistically robust. This iterative approach enhances therapeutic decision-making by refining treatment suggestions until a predefined group size is achieved. Moreover, the simplicity and interpretability of the resulting trees make the method accessible to healthcare professionals. Includes functions for training the decision tree, making predictions on new samples or patients, and visualizing the resulting tree. For detailed insights into the methodology, please refer to Gimeno et al. (2023) <doi:10.1093/bib/bbad200>.
This package provides functions that implement the Fieller's formula methodology for calculating a confidence interval for a ratio of (commonly, correlated) means. See Fieller (1954) <doi:10.1111/j.2517-6161.1954.tb00159.x>. Here, the application of primary interest is to studies of insect mortality response to increasing doses of a fumigant or, e.g., to time in coolstorage. The formula is used to calculate a confidence interval for the dose or time required to achieve a specified mortality proportion, commonly 0.5 or 0.99. Vignettes demonstrate link functions that may be considered, checks on fitted models, and alternative choices of error family. Note in particular the betabinomial error family. See also Maindonald, Waddell, and Petry (2001) <doi:10.1016/S0925-5214(01)00082-5>.
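For reference, a minimal base-R sketch of Fieller's interval for the ratio of two (possibly correlated) means, using made-up estimates; this is the general formula, not this package's functions.
    # Hedged sketch: Fieller's interval for the ratio m1/m2. v11, v22, v12 are
    # the variances and covariance of the estimated means m1 and m2.
    fieller_ci <- function(m1, m2, v11, v22, v12, tcrit) {
      a  <- m2^2 - tcrit^2 * v22
      b  <- -2 * (m1 * m2 - tcrit^2 * v12)
      cc <- m1^2 - tcrit^2 * v11
      sort((-b + c(-1, 1) * sqrt(b^2 - 4 * a * cc)) / (2 * a))  # roots of a*r^2 + b*r + cc
    }
    fieller_ci(m1 = 2.1, m2 = 1.4, v11 = 0.04, v22 = 0.02, v12 = 0.01,
               tcrit = qt(0.975, df = 30))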
This package implements the Interpolate, Truncate, Project (ITP) root-finding algorithm developed by Oliveira and Takahashi (2021) <doi:10.1145/3423597>. The user provides the function, from the real numbers to the real numbers, and an interval with the property that the values of the function at its endpoints have different signs. If the function is continuous over this interval then the ITP method estimates the value at which the function is equal to zero. If the function is discontinuous then a point of discontinuity at which the function changes sign may be found. The function can be supplied using either an R function or an external pointer to a C++ function. Tuning parameters of the ITP algorithm can be set by the user. Default values are set based on arguments in Oliveira and Takahashi (2021).
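Assuming the algorithm is exposed through a function named after it, taking the function and the bracketing interval as arguments (an assumption about the interface; names may differ), a usage sketch might look like:
    # Hedged sketch: find the root of f(x) = x^3 - x - 2 on [1, 2], where the
    # endpoint values have opposite signs; exact argument and element names
    # are assumptions.
    library(itp)
    f <- function(x) x^3 - x - 2
    res <- itp(f, interval = c(1, 2))
    res$root                     # compare with uniroot(f, c(1, 2))$root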
Conducts Bayesian hypothesis tests of a point null hypothesis against a two-sided alternative using a Non-local Alternative Prior (NAP) for one- and two-sample z- and t-tests (Pramanik and Johnson, 2022). Under the alternative, the NAP is assumed on the standardized effect size in one-sample tests and on their difference in two-sample tests. The package considers two types of NAP densities: (1) the normal moment prior, and (2) the composite alternative. In fixed-design tests, the functions calculate the Bayes factors and the expected weight of evidence for varied effect sizes and sample sizes. The package also provides a sequential testing framework using the Sequential Bayes Factor (SBF) design. The functions calculate the operating characteristics (OC) and the average sample number (ASN), and also conduct sequential tests for sequentially observed data.
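A hedged base-R sketch of the kind of computation involved (not this package's functions): the Bayes factor for a one-sample z-test with a normal moment prior on the standardized effect size, obtained by numerical integration.
    # Hedged sketch: BF10 for z ~ N(delta * sqrt(n), 1) under a normal moment
    # prior on the standardized effect size delta, against the point null.
    nm_prior <- function(d, tau = 0.3) (d^2 / tau^2) * dnorm(d, 0, tau)
    bf10 <- function(z, n, tau = 0.3) {
      m1 <- integrate(function(d) dnorm(z, d * sqrt(n), 1) * nm_prior(d, tau),
                      -Inf, Inf)$value               # marginal under H1
      m1 / dnorm(z, 0, 1)                            # divide by the null density
    }
    bf10(z = 2.5, n = 30)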
Biterm Topic Models find topics in collections of short texts. This is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns, called biterms. This is in contrast to traditional topic models like Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis, which are word-document co-occurrence topic models. A biterm consists of two words co-occurring in the same short text window. This context window can, for example, be a Twitter message, a short answer on a survey, a sentence of a text or a document identifier. The techniques are explained in detail in the paper 'A Biterm Topic Model For Short Text' by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng (2013) <https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf>.
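A biterm is simply an unordered pair of words co-occurring in the same short context; a base-R sketch (not this package's interface) makes that concrete:
    # Hedged sketch: enumerate the biterms (unordered word pairs) of one short text.
    text   <- "cheap flights to rome"
    tokens <- strsplit(text, " ")[[1]]
    biterms <- t(combn(tokens, 2))       # every unordered pair in the window
    biterms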