This package implements the Bayesian and likelihood methods proposed in Imai, Lu, and Strauss (2008 <doi:10.1093/pan/mpm017>) and (2011 <doi:10.18637/jss.v042.i05>) for ecological inference in 2 by 2 tables as well as the method of bounds introduced by Duncan and Davis (1953). The package fits both parametric and nonparametric models using either the Expectation-Maximization algorithms (for likelihood models) or the Markov chain Monte Carlo algorithms (for Bayesian models). For all models, the individual-level data can be directly incorporated into the estimation whenever such data are available. Along with in-sample and out-of-sample predictions, the package also provides a functionality which allows one to quantify the effect of data aggregation on parameter estimation and hypothesis testing under the parametric likelihood models.
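The method of bounds mentioned above can be illustrated with a short Python sketch. This is not the package's code: the function name `duncan_davis_bounds` and the symbols `x`, `y`, `w1`, `w2` are assumptions chosen for this example. For one 2 by 2 table with known margins, the accounting identity `y = w1 * x + w2 * (1 - x)` restricts the unknown cell proportions to deterministic intervals.

```python
def duncan_davis_bounds(x, y):
    """Deterministic bounds for the unknown cells of one 2 by 2 table.

    x: share of the unit's population in row group 1 (0 < x < 1).
    y: share of the unit's population with the column outcome (0 <= y <= 1).
    Returns ((w1_low, w1_high), (w2_low, w2_high)), where w1 and w2 are the
    unknown outcome rates within row group 1 and row group 2.
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must lie between 0 and 1")
    # Accounting identity: y = w1 * x + w2 * (1 - x), with 0 <= w1, w2 <= 1.
    # Setting the other rate to 1 or 0 gives the lower and upper bounds.
    w1 = (max(0.0, (x + y - 1.0) / x), min(1.0, y / x))
    w2 = (max(0.0, (y - x) / (1.0 - x)), min(1.0, y / (1.0 - x)))
    return w1, w2
```

For instance, with `x = 0.5` and `y = 0.75` both unknown rates are bounded to the interval from 0.5 to 1, because even if one group had a rate of 1 the other would need at least 0.5 to reach the observed margin.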
This package provides tools to download and manipulate the Permanent Household Survey from Argentina (EPH is the Spanish acronym for Permanent Household Survey). For example, get_microdata() downloads the datasets, get_poverty_lines() downloads the official poverty baskets, and calculate_poverty() determines whether a household is in poverty, following the official methodology. organize_panels() concatenates observations from different periods, and organize_labels() adds the official labels to the data. The implemented methods are based on INDEC (2016) <http://www.estadistica.ec.gba.gov.ar/dpe/images/SOCIEDAD/EPH_metodologia_22_pobreza.pdf>. As this package works with the Argentinian Permanent Household Survey and its main audience is from this country, the documentation was written in Spanish.
Performs algebraic operations on neural networks. We seek here to implement, in R, operations on neural networks and their resulting approximations. Our operations derive their descriptions mainly from Rafi S., Padgett, J.L., and Nakarmi, U. (2024), "Towards an Algebraic Framework For Approximating Functions Using Neural Network Polynomials", <doi:10.48550/arXiv.2402.01058>, Grohs P., Hornung, F., Jentzen, A. et al. (2023), "Space-time error estimates for deep neural network approximations for differential equations", <doi:10.1007/s10444-022-09970-2>, and Jentzen A., Kuckuck B., von Wurstemberger, P. (2023), "Mathematical Introduction to Deep Learning Methods, Implementations, and Theory" <doi:10.48550/arXiv.2310.20360>. Our implementation is meant mainly as a pedagogical tool and proof of concept. Faster implementations with deeper vectorizations may be made in future versions.
The 'TAD' package compiled an analytical framework based on an analysis of the shape of the trait abundance distributions to better understand community assembly processes, and predict community dynamics under environmental changes. This framework mobilized a study of the relationship between the moments describing the shape of the distributions: the skewness and the kurtosis (SKR). The SKR allows the identification of commonalities in the shape of trait distributions across contrasting communities. Derived from the SKR, we developed mathematical parameters that summarise the complex pattern of distributions by assessing (i) the R², (ii) the Y-intercept, (iii) the slope, (iv) the functional stability of community (TADstab), and (v) the distance from specific distribution families (i.e., the distance from the skew-uniform family, a limit to the highest degree of evenness: TADeve).
Construction of the Total Operating Characteristic (TOC) Curve and the Receiver (aka Relative) Operating Characteristic (ROC) Curve for spatial and non-spatial data. The TOC method is a modification of the ROC method which measures the ability of an index variable to diagnose either presence or absence of a characteristic. The diagnosis depends on whether the value of an index variable is above a threshold. Each threshold generates a two-by-two contingency table, which contains four entries: hits (H), misses (M), false alarms (FA), and correct rejections (CR). While ROC shows for each threshold only two ratios, H/(H + M) and FA/(FA + CR), TOC reveals the size of every entry in the contingency table for each threshold (Pontius Jr., R.G., Si, K. 2014. <doi:10.1080/13658816.2013.862623>).
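The per-threshold contingency table described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's implementation: the function names are invented for this example, a case is diagnosed as present when its index value is strictly above the threshold, and both presence and absence are assumed to occur in the data so the ROC ratios are defined.

```python
def contingency_table(index, presence, threshold):
    """Count hits, misses, false alarms and correct rejections at one threshold."""
    hits = misses = false_alarms = correct_rejections = 0
    for value, present in zip(index, presence):
        diagnosed = value > threshold  # assumption: strictly above the threshold
        if diagnosed and present:
            hits += 1
        elif diagnosed:
            false_alarms += 1
        elif present:
            misses += 1
        else:
            correct_rejections += 1
    return {"H": hits, "M": misses, "FA": false_alarms, "CR": correct_rejections}


def roc_point(table):
    """ROC keeps only two ratios: (false alarm rate, hit rate)."""
    hit_rate = table["H"] / (table["H"] + table["M"])
    false_alarm_rate = table["FA"] / (table["FA"] + table["CR"])
    return false_alarm_rate, hit_rate


def toc_point(table):
    """TOC keeps counts: (hits + false alarms, hits)."""
    return table["H"] + table["FA"], table["H"]
```

Because the TOC point is expressed in counts, the remaining entries of the table can be read off together with the total number of observations and the total number of presences, which the two ROC ratios alone do not allow.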
The Aligned Corpus Toolkit (act) is designed for linguists who work with time-aligned transcription data. It offers functions to import and export various annotation file formats (ELAN .eaf, EXMARaLDA .exb and Praat .TextGrid files), create print transcripts in the style of conversation analysis, search transcripts (span searches across multiple annotations, search in normalized annotations, make concordances etc.), export and re-import search results (.csv and Excel .xlsx format), create cuts for the search results (print transcripts, audio/video cuts using FFmpeg and video subtitles in SubRip .srt format), modify the data in a corpus (search/replace, delete, filter etc.), interact with Praat using Praat scripts, and exchange data with the rPraat package. The package is itself written in R and may be expanded by other users.
This package implements a tree-based method specifically designed for personalized medicine applications. By using genomic and mutational data, ODT efficiently identifies optimal drug recommendations tailored to individual patient profiles. The ODT algorithm constructs decision trees that bifurcate at each node, selecting the most relevant markers (discrete or continuous) and corresponding treatments, thus ensuring that recommendations are both personalized and statistically robust. This iterative approach enhances therapeutic decision-making by refining treatment suggestions until a predefined group size is achieved. Moreover, the simplicity and interpretability of the resulting trees make the method accessible to healthcare professionals. Includes functions for training the decision tree, making predictions on new samples or patients, and visualizing the resulting tree. For detailed insights into the methodology, please refer to Gimeno et al. (2023) <doi:10.1093/bib/bbad200>.
This package provides functions that implement Fieller's formula methodology for calculating a confidence interval for a ratio of (commonly, correlated) means. See Fieller (1954) <doi:10.1111/j.2517-6161.1954.tb00159.x>. Here, the application of primary interest is to studies of insect mortality response to increasing doses of a fumigant, or, e.g., to time in coolstorage. The formula is used to calculate a confidence interval for the dose or time required to achieve a specified mortality proportion, commonly 0.5 or 0.99. Vignettes demonstrate link functions that may be considered, checks on fitted models, and alternative choices of error family. Note in particular the betabinomial error family. See also Maindonald, Waddell, and Petry (2001) <doi:10.1016/S0925-5214(01)00082-5>.
Biterm Topic Models find topics in collections of short texts. The Biterm Topic Model is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns, which are called biterms. This is in contrast to traditional topic models like Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis, which are word-document co-occurrence topic models. A biterm consists of two words co-occurring in the same short text window. This context window can, for example, be a Twitter message, a short answer on a survey, a sentence of a text or a document identifier. The techniques are explained in detail in the paper 'A Biterm Topic Model For Short Text' by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng (2013) <https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf>.
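The notion of a biterm can be made concrete with a small Python sketch. It covers only the extraction of biterms, not the topic model itself, and the names `extract_biterms`, `count_biterms` and the `window` parameter are assumptions made for this example rather than the package's interface. Each text is taken to be an already tokenized list of words, and a biterm is stored as an alphabetically sorted pair so that word order does not matter.

```python
from collections import Counter


def extract_biterms(tokens, window=None):
    """Return the unordered word pairs co-occurring in one short text.

    window: maximum number of consecutive tokens a pair may span;
    None treats the whole text as a single context window.
    """
    n = len(tokens)
    span = n if window is None else window
    biterms = []
    for i in range(n):
        # Pair token i with the tokens that follow it inside the window.
        for j in range(i + 1, min(n, i + span)):
            biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


def count_biterms(texts, window=None):
    """Aggregate biterm counts over a collection of tokenized short texts."""
    counts = Counter()
    for tokens in texts:
        counts.update(extract_biterms(tokens, window))
    return counts
```

Counting biterms over the whole collection, rather than per document, is what distinguishes this input from the word-document counts used by the traditional models named above.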
This package implements the Interpolate, Truncate, Project (ITP) root-finding algorithm developed by Oliveira and Takahashi (2021) <doi:10.1145/3423597>. The user provides the function, from the real numbers to the real numbers, and an interval with the property that the values of the function at its endpoints have different signs. If the function is continuous over this interval then the ITP method estimates the value at which the function is equal to zero. If the function is discontinuous then a point of discontinuity at which the function changes sign may be found. The function can be supplied using either an R function or an external pointer to a C++ function. Tuning parameters of the ITP algorithm can be set by the user. Default values are set based on arguments in Oliveira and Takahashi (2021).
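The three steps of the algorithm can be sketched in Python as follows. This is a minimal illustration of the ITP iteration as commonly stated, not the package's code: the function name `itp_root`, the parameter names `eps`, `k1`, `k2`, `n0`, and the default values used here are assumptions for this example and should not be read as the package's defaults.

```python
import math


def itp_root(f, a, b, eps=1e-10, k1=None, k2=2.0, n0=1):
    """Estimate a root of f in [a, b], where f(a) and f(b) differ in sign."""
    if a >= b:
        raise ValueError("a must be smaller than b")
    ya, yb = f(a), f(b)
    if ya == 0.0:
        return a
    if yb == 0.0:
        return b
    if ya * yb > 0.0:
        raise ValueError("f(a) and f(b) must have different signs")
    # Work with sign * f so that the function is negative at a, positive at b.
    sign = 1.0 if ya < 0.0 else -1.0
    ya, yb = sign * ya, sign * yb
    if k1 is None:
        k1 = 0.2 / (b - a)  # illustrative choice, an assumption of this sketch
    n_half = max(0, math.ceil(math.log2((b - a) / (2.0 * eps))))
    n_max = n_half + n0
    j = 0
    while b - a > 2.0 * eps:
        x_half = 0.5 * (a + b)
        r = eps * 2.0 ** (n_max - j) - 0.5 * (b - a)
        delta = k1 * (b - a) ** k2
        # Interpolate: regula falsi estimate.
        x_f = (yb * a - ya * b) / (yb - ya)
        # Truncate: move the estimate towards the midpoint by delta.
        sigma = math.copysign(1.0, x_half - x_f)
        x_t = x_f + sigma * delta if delta <= abs(x_half - x_f) else x_half
        # Project: keep the estimate within distance r of the midpoint.
        x_itp = x_t if abs(x_t - x_half) <= r else x_half - sigma * r
        y_itp = sign * f(x_itp)
        if y_itp > 0.0:
            b, yb = x_itp, y_itp
        elif y_itp < 0.0:
            a, ya = x_itp, y_itp
        else:
            return x_itp
        j += 1
    return 0.5 * (a + b)
```

A call such as `itp_root(lambda x: x**3 - x - 2, 1.0, 2.0)` brackets the single real root of that cubic. The projection step is what keeps the worst-case number of function evaluations close to that of bisection.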
Conducts Bayesian hypothesis tests of a point null hypothesis against a two-sided alternative using the Non-local Alternative Prior (NAP) for one- and two-sample z- and t-tests (Pramanik and Johnson, 2022). Under the alternative, the NAP is assumed on the standardized effect size in one-sample tests and on the difference in standardized effect sizes in two-sample tests. The package considers two types of NAP densities: (1) the normal moment prior, and (2) the composite alternative. In fixed design tests, the functions calculate the Bayes factors and the expected weight of evidence for varied effect size and sample size. The package also provides a sequential testing framework using the Sequential Bayes Factor (SBF) design. The functions calculate the operating characteristics (OC) and the average sample number (ASN), and also conduct sequential tests for sequentially observed data.
This package provides a single, phenome-wide permutation of large-scale biobank data. When a large number of phenotypes are analyzed in parallel, a single permutation across all phenotypes followed by genetic association analyses of the permuted data enables estimation of false discovery rates (FDRs) across the phenome. These FDR estimates provide a significance criterion for interpreting genetic associations in a biobank context. For the basic permutation of unrelated samples, this package takes a sample-by-variable file with ID, genotypic covariates, phenotypic covariates, and phenotypes as input. For data with related samples, it also takes a file with sample pair-wise identity-by-descent information. The function outputs a permuted sample-by-variable file ready for genome-wide association analysis. See Annis et al. (2021) <doi:10.21203/rs.3.rs-873449/v1> for details.
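The idea of a single permutation shared by all phenotypes can be sketched in Python. This is a simplified illustration for unrelated samples only, not the package's procedure: the function name `permute_phenome`, the row-as-dictionary layout and the `phenotype_columns` argument are assumptions for this example. One shuffle of sample order is drawn, and every listed phenotypic column is reassigned according to that same shuffle, so correlations among phenotypes survive while the link to genotypes is broken.

```python
import random


def permute_phenome(rows, phenotype_columns, seed=None):
    """Reassign all phenotypic columns using one shared permutation.

    rows: list of dicts, one per sample (ID, covariates, phenotypes).
    phenotype_columns: columns that move together with the permutation;
    every other column (e.g. ID, genotypic covariates) stays in place.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # a single permutation for the whole phenome
    permuted = []
    for i, row in enumerate(rows):
        donor = rows[order[i]]
        new_row = dict(row)  # copy so the input is not modified
        for column in phenotype_columns:
            new_row[column] = donor[column]
        permuted.append(new_row)
    return permuted
```

Handling related samples, which the description says requires pair-wise identity-by-descent information, is outside the scope of this sketch.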
It performs interlaboratory studies (ILS) to detect those laboratories that provide results inconsistent with the others. It allows working simultaneously with various testing materials, from standard univariate and functional data analysis (FDA) perspectives. The univariate approach, based on ASTM E691-08, consists of estimating Mandel's h and k statistics to identify those laboratories that provide significantly different results, and also testing for the presence of outliers with the Cochran and Grubbs tests. Analysis of variance (ANOVA) techniques (F and Tukey tests) are provided to test differences in means between laboratories for each material. Taking into account the functional nature of data retrieved in analytical chemistry, applied physics and engineering (spectra, thermograms, etc.), the ILS package provides an FDA approach for finding the distribution of Mandel's k and h statistics by smoothing bootstrap resampling.
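The univariate Mandel statistics can be sketched in Python for a single material. This is a minimal illustration under the usual definitions (h compares each laboratory mean with the mean of laboratory means, k compares each within-laboratory standard deviation with the pooled repeatability standard deviation); the function name `mandel_h_k` and the input layout are assumptions for this example, and the critical values against which h and k are judged are not computed here.

```python
import math
from statistics import mean, stdev


def mandel_h_k(results):
    """Compute Mandel's h and k for one material.

    results: one list of replicate measurements per laboratory,
    each with at least two replicates.
    """
    lab_means = [mean(lab) for lab in results]
    lab_sds = [stdev(lab) for lab in results]
    grand_mean = mean(lab_means)
    sd_of_means = stdev(lab_means)
    # Pooled repeatability standard deviation across laboratories.
    repeatability_sd = math.sqrt(mean([s * s for s in lab_sds]))
    h = [(m - grand_mean) / sd_of_means for m in lab_means]
    k = [s / repeatability_sd for s in lab_sds]
    return h, k
```

A laboratory with a large absolute h reports a mean far from the others, while a large k points to poor within-laboratory repeatability.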
This package provides a single analysis path that includes distance-based ordination, global tests of any effect of the microbiome, and tests of the effects of individual taxa with false-discovery-rate (FDR) control. It accommodates both continuous and discrete covariates as well as interaction terms to be tested either singly or in combination, allows for adjustment of confounding covariates, and uses permutation-based p-values that can control for sample correlations. It can be applied to transformed data, and an omnibus test can combine results from analyses conducted on different transformation scales. It can also be used for testing presence-absence associations based on an infinite number of rarefaction replicates, testing mediation effects of the microbiome, analyzing censored time-to-event outcomes, and for compositional analysis by fitting linear models to centered-log-ratio taxa count data.
This package provides functions for creating ensembles of optimal trees for regression, classification (Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019) <doi:10.1007/s11634-019-00364-9>) and class membership probability estimation (Khan, Z, Gul, A, Mahmoud, O, Miftahuddin, M, Perperoglou, A, Adler, W & Lausen, B (2016) <doi:10.1007/978-3-319-25226-1_34>). A few trees are selected from an initial set of trees grown by random forest for the ensemble on the basis of their individual and collective performance. Three different methods of tree selection for the case of classification are given. The prediction functions return estimates of the test responses and their class membership probabilities. Unexplained variations, error rates, confusion matrix, Brier scores, etc. are also returned for the test data.
It brings together several aspects of biodiversity data-cleaning in one place. bdc is organized in thematic modules related to different biodiversity dimensions, including 1) merge datasets: standardization and integration of different datasets; 2) pre-filter: flagging and removal of invalid or non-interpretable information, followed by data amendments; 3) taxonomy: cleaning, parsing, and harmonization of scientific names from several taxonomic groups against taxonomic databases locally stored through the application of exact and partial matching algorithms; 4) space: flagging of erroneous, suspect, and low-precision geographic coordinates; and 5) time: flagging and, whenever possible, correction of inconsistent collection dates. In addition, it contains features to visualize, document, and report data quality, which is essential for making data quality assessment transparent and reproducible. The reference for the methodology is Ribeiro and colleagues (2022) <doi:10.1111/2041-210X.13868>.
Support for fuzzy spatial objects, their operations, and fuzzy spatial inference models based on Spatial Plateau Algebra. It employs fuzzy set theory and fuzzy logic as foundation to deal with spatial fuzziness. It mainly implements underlying concepts defined in the following research papers: (i) "Spatial Plateau Algebra: An Executable Type System for Fuzzy Spatial Data Types" <doi:10.1109/FUZZ-IEEE.2018.8491565>; (ii) "A Systematic Approach to Creating Fuzzy Region Objects from Real Spatial Data Sets" <doi:10.1109/FUZZ-IEEE.2019.8858878>; (iii) "Spatial Data Types for Heterogeneously Structured Fuzzy Spatial Collections and Compositions" <doi:10.1109/FUZZ48607.2020.9177620>; (iv) "Fuzzy Inference on Fuzzy Spatial Objects (FIFUS) for Spatial Decision Support Systems" <doi:10.1109/FUZZ-IEEE.2017.8015707>; (v) "Evaluating Region Inference Methods by Using Fuzzy Spatial Inference Models" <doi:10.1109/FUZZ-IEEE55066.2022.9882658>.
Graph signals residing on the vertices of a graph have recently gained prominence in research in various fields. Many methodologies have been proposed to analyze graph signals by adapting classical signal processing tools. Recently, several notable graph signal decomposition methods have been proposed, which include graph Fourier decomposition based on graph Fourier transform, graph empirical mode decomposition, and statistical graph empirical mode decomposition. This package efficiently implements multiscale analysis applicable to various fields, and offers an effective tool for visualizing and decomposing graph signals. For the detailed methodology, see Ortega et al. (2018) <doi:10.1109/JPROC.2018.2820126>, Shuman et al. (2013) <doi:10.1109/MSP.2012.2235192>, Tremblay et al. (2014) <https://www.eurasip.org/Proceedings/Eusipco/Eusipco2014/HTML/papers/1569922141.pdf>, and Cho et al. (2024) "Statistical graph empirical mode decomposition by graph denoising and boundary treatment".
Tensor Composition Analysis (TCA) allows the deconvolution of two-dimensional data (features by observations) coming from a mixture of heterogeneous sources into a three-dimensional matrix of signals (features by observations by sources). The TCA framework further allows testing the features in the data for different statistical relations with an outcome of interest while modeling source-specific effects; in particular, it allows looking for statistical relations between source-specific signals and an outcome. For example, TCA can deconvolve bulk tissue-level DNA methylation data (methylation sites by individuals) into a three-dimensional tensor of cell-type-specific methylation levels for each individual (i.e. methylation sites by individuals by cell types), and it allows detecting cell-type-specific statistical relations (associations) with phenotypes. For more details see Rahmani et al. (2019) <DOI:10.1038/s41467-019-11052-9>.
This package provides a set of estimators for models and (robust) covariance matrices, and tests for panel data econometrics, including within/fixed effects, random effects, between, first-difference, nested random effects as well as instrumental-variable (IV) and Hausman-Taylor-style models, panel generalized method of moments (GMM) and general FGLS models, mean groups (MG), demeaned MG, and common correlated effects (CCEMG) and pooled (CCEP) estimators with common factors, variable coefficients and limited dependent variables models. Test functions include model specification, serial correlation, cross-sectional dependence, panel unit root and panel Granger (non-)causality. Typical references are general econometrics text books such as Baltagi (2021), Econometric Analysis of Panel Data (<doi:10.1007/978-3-030-53953-5>), Hsiao (2014), Analysis of Panel Data (<doi:10.1017/CBO9781139839327>), and Croissant and Millo (2018), Panel Data Econometrics with R (<doi:10.1002/9781119504641>).
Applies Beta Control Charts to defined values. The Beta Chart presents control limits based on the Beta probability distribution, making it suitable for monitoring fraction data from a Binomial distribution as a replacement for p-Charts. The Beta Chart has been applied in three real studies and compared with control limits from three different schemes. The comparative analysis showed that: (i) the Beta approximation to the Binomial distribution is more appropriate for values confined within the [0, 1] interval; and (ii) the proposed charts are more sensitive to the average run length (ARL) in both in-control and out-of-control process monitoring. Overall, the Beta Charts outperform the Shewhart control charts in monitoring fraction data. For more details, see Ângelo Márcio Oliveira Sant'Anna and Carla Schwengber ten Caten (2012) <doi:10.1016/j.eswa.2012.02.146>.
ANOVA and REML estimation of linear mixed models is implemented, once following Searle et al. (1991, ANOVA for unbalanced data), once making use of the lme4 package. The primary objective of this package is to perform a variance component analysis (VCA) according to CLSI EP05-A3 guideline "Evaluation of Precision of Quantitative Measurement Procedures" (2014). There are plotting methods for visualization of an experimental design, plotting random effects and residuals. For ANOVA type estimation two methods for computing ANOVA mean squares are implemented (SWEEP and quadratic forms). The covariance matrix of variance components can be derived, which is used in estimating confidence intervals. Linear hypotheses of fixed effects and LS means can be computed. LS means can be computed at specific values of covariables and with custom weighting schemes for factor variables. See ?VCA for a more comprehensive description of the features.
Implementation of the Dual Feature Reduction (DFR) approach for the Sparse Group Lasso (SGL) and the Adaptive Sparse Group Lasso (aSGL) (Feser and Evangelou (2024) <doi:10.48550/arXiv.2405.17094>). The DFR approach is a feature reduction approach that applies strong screening to reduce the feature space before optimisation, leading to speed-up improvements for fitting SGL (Simon et al. (2013) <doi:10.1080/10618600.2012.681250>) and aSGL (Mendez-Civieta et al. (2020) <doi:10.1007/s11634-020-00413-8> and Poignard (2020) <doi:10.1007/s10463-018-0692-7>) models. DFR is implemented using the Adaptive Three Operator Splitting (ATOS) (Pedregosa and Gidel (2018) <doi:10.48550/arXiv.1804.02339>) algorithm, with linear and logistic SGL models supported, both of which can be fit using k-fold cross-validation. Dense and sparse input matrices are supported.
Maximum likelihood estimation for the semi-parametric joint modeling of competing risks and longitudinal data in the presence of heterogeneous within-subject variability, proposed by Li and colleagues (2023) <doi:10.48550/arXiv.2506.12741>. The proposed method models the within-subject variability of the biomarker and associates it with the risk of the competing risks event. The time-to-event data is modeled using a (cause-specific) Cox proportional hazards regression model with time-fixed covariates. The longitudinal outcome is modeled using a mixed-effects location and scale model. The association is captured by shared random effects. The model is estimated using an Expectation Maximization algorithm. This is the final release of the JMH package. Active development has been moved to the FastJM package, which provides improved functionality and ongoing support. Users are strongly encouraged to transition to FastJM.