The Structural Topic and Sentiment-Discourse (STS) model allows researchers to estimate topic models with document-level metadata that determines both topic prevalence and sentiment-discourse. The sentiment-discourse is modeled as a document-level latent variable for each topic that modulates the word frequency within a topic. These latent topic sentiment-discourse variables are controlled by the document-level metadata. The STS model can be useful for regression analysis with text data in addition to topic modeling's traditional use of descriptive analysis. The method was developed in Chen and Mankad (2024) <doi:10.1287/mnsc.2022.00261>.
This package provides a collection of integrated tools designed to seamlessly interact with each other for the analysis of biogenic silica (bSi) in inland and marine sediments. These tools share common data representations and follow a consistent API design. The primary goal of the bSi package is to simplify the installation process, facilitate data loading, and enable the analysis of multiple samples for biogenic silica fluxes. This package is designed to enhance the efficiency and coherence of the entire bSi analytic workflow, from data loading to model construction and visualization, tailored towards reconstructing productivity in aquatic ecosystems.
This package produces statistical indicators of the impact of migration on the socio-demographic composition of an area. Three measures can be used: ratios, percentages and the Duncan index of dissimilarity. The input data files are assumed to be in an origin-destination matrix format, with each cell representing a flow count between an origin and a destination area. Columns are expected to represent origins, and rows are expected to represent destinations. The first row and column are assumed to contain labels for each area. See Rodriguez-Vignoli and Rowe (2018) <doi:10.1080/00324728.2017.1416155> for technical details.
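To illustrate the third measure, here is a minimal base-R sketch of the Duncan index of dissimilarity, computed from hypothetical counts of two groups across four areas (the package's own functions instead operate on the origin-destination matrices described above):

    group_a <- c(120, 300, 80, 50)   # counts of group A in four areas
    group_b <- c(200, 150, 90, 160)  # counts of group B in the same areas
    duncan <- 0.5 * sum(abs(group_a / sum(group_a) - group_b / sum(group_b)))
    duncan  # 0 = identical distributions, 1 = complete segregation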
The aim of the package is two-fold: (i) To implement the MMD method for attribution of individuals to sources using the Hamming distance between multilocus genotypes. (ii) To select informative genetic markers based on information theory concepts (entropy, mutual information and redundancy). The package implements the functions introduced by Perez-Reche, F. J., Rotariu, O., Lopes, B. S., Forbes, K. J. and Strachan, N. J. C. Mining whole genome sequence data to efficiently attribute individuals to source populations. Scientific Reports 10, 12124 (2020) <doi:10.1038/s41598-020-68740-6>. See more details and examples in the README file.
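As a minimal base-R sketch of the dissimilarity underlying the MMD method, the Hamming distance between two hypothetical multilocus genotypes:

    g1 <- c("A", "A", "G", "T", "C")   # genotype 1, five loci
    g2 <- c("A", "G", "G", "T", "T")   # genotype 2
    sum(g1 != g2)    # Hamming distance: number of mismatching loci
    mean(g1 != g2)   # normalized by the number of loci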
This package provides a collection of privacy-preserving distributed algorithms (PDAs) for conducting federated statistical learning across multiple data sites. The PDA framework includes models for various tasks such as regression, trial emulation, causal inference, design-specific analysis, and clustering. The PDA algorithms run on a lead site and only require summary statistics from collaborating sites, with one or a few iterations. The package can be used together with the online data transfer system (<https://pda-ota.pdamethods.org/>) for safe and convenient collaboration. For more information, please visit the software websites: <https://github.com/Penncil/pda> and <https://pdamethods.org/>.
This package provides methods for decomposing seasonal data: STR (a Seasonal-Trend time series decomposition procedure based on Regression) and Robust STR. In some ways, STR is similar to Ridge Regression, and Robust STR can be related to LASSO. Both allow for multiple seasonal components and multiple linear covariates with constant, flexible or seasonal influence. Seasonal patterns (for both seasonal components and seasonal covariates) can be fractional and flexible over time; moreover, they can be either strictly periodic or have a more complex topology. The methods provide confidence intervals for the estimated components and can also be used for forecasting.
An implementation of the RuleFit algorithm as described in Friedman & Popescu (2008) <doi:10.1214/07-AOAS148>. eXtreme Gradient Boosting ('XGBoost') is used to build rules, and 'glmnet' is used to fit a sparse linear model on the raw and rule features. The result is a model that learns similarly to a tree ensemble, while often offering improved interpretability and faster scoring in live applications. Several algorithms for reducing rule complexity are provided, most notably hyperrectangle de-overlapping. All algorithms scale to several million rows and support sparse representations to handle tens of thousands of dimensions.
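A hedged sketch of the two-stage RuleFit idea on simulated data, assuming the classic xgboost interface: boosted-tree leaf indicators stand in for rule features, and glmnet fits a sparse linear model on raw plus rule features. This illustrates the general approach, not this package's own API:

    library(xgboost)
    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(500 * 5), ncol = 5)
    y <- rbinom(500, 1, plogis(x[, 1] - x[, 2]))
    bst <- xgboost(data = x, label = y, nrounds = 20, max_depth = 3,
                   objective = "binary:logistic", verbose = 0)
    leaves <- predict(bst, x, predleaf = TRUE)   # leaf index per tree
    leaf_df <- as.data.frame(lapply(as.data.frame(leaves), factor))
    rules <- model.matrix(~ . - 1, leaf_df)      # indicator ("rule") features
    fit <- cv.glmnet(cbind(x, rules), y, family = "binomial", alpha = 1)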
Bayesian network analysis is a form of probabilistic graphical modelling that derives from empirical data a directed acyclic graph (DAG) describing the dependency structure between random variables. An additive Bayesian network model consists of a DAG in which each node comprises a generalized linear model (GLM). Additive Bayesian network models are equivalent to Bayesian multivariate regression using graphical modelling: they generalise the usual multivariable regression (GLM) to multiple dependent variables. This package provides routines to help determine optimal Bayesian network models for a given data set, where these models are used to identify statistical dependencies in messy, complex data.
The Molecular Degree of Perturbation webtool quantifies the heterogeneity of samples. It takes a data.frame of omic data that contains at least two classes (control and test) and assigns a score to all samples based on how perturbed they are compared to the controls. It is based on the Molecular Distance to Health (Pankla et al., 2009), and expands on this algorithm by adding options to calculate the z-score using the modified z-score (based on the median absolute deviation), change the z-score zeroing threshold, and look at genes that are most perturbed in the test versus control classes.
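A minimal base-R sketch of the modified z-score option mentioned above, using hypothetical expression values for one gene across samples (R's mad() already applies the consistency constant):

    expr <- c(5.1, 4.8, 5.3, 5.0, 9.7)   # last sample looks perturbed
    (expr - median(expr)) / mad(expr)    # modified z-scores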
This package implements Meng's data defect index (ddi), which represents the degree of sample bias relative to an iid sample. The data defect correlation (ddc) represents the correlation between the outcome of interest and the selection into the sample; when the sample selection is independent across the population, the ddc is zero. Details are in Meng (2018) <doi:10.1214/18-AOAS1161SF>, "Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election." Survey estimates from the Cooperative Congressional Election Study (CCES) are included to replicate the article's results.
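A hedged base-R sketch of the ddc as described above, i.e. the finite-population correlation between the outcome and the sample-inclusion indicator (simulated data; not the package's API):

    set.seed(1)
    N <- 100000
    y <- rnorm(N)                            # outcome of interest
    r <- rbinom(N, 1, plogis(-4 + 0.3 * y))  # selection depends on outcome
    cor(y, r)                                # ddc; zero under independent selection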
This package provides a collection of psychometric methods to process item metadata and use target assessment and measurement blueprint constraints to assemble a test form. Currently, two automatic test assembly (ata) approaches are enabled: the weighted (positive) deviations method, wdm(), proposed by Swanson and Stocking (1993) <doi:10.1177/014662169301700205>, implemented in its full specification allowing for both item selection and test form refinement; and the linear constraint programming approach, atalp(), which uses the linear equation solver by Berkelaar et al. (2014) <http://lpsolve.sourceforge.net/5.5/> to enable a variety of approaches to select items.
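A minimal base-R illustration of the quantity the weighted (positive) deviations method minimizes, with hypothetical blueprint targets, current item counts, and weights (the package's wdm() handles the full specification):

    target  <- c(algebra = 10, geometry = 8, data = 6)  # blueprint targets
    current <- c(algebra = 7,  geometry = 9, data = 4)  # items selected so far
    weights <- c(1, 2, 1)                               # constraint importance
    sum(weights * pmax(target - current, 0))            # weighted positive deviations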
Computations for approximations and alternatives for the DPQ (Density (pdf), Probability (cdf) and Quantile) functions for probability distributions in R. The primary focus is on (central and non-central) beta, gamma and related distributions such as the chi-squared, F, and t. For several distribution functions, the package provides implementations of formulas from Johnson, Kotz, and Kemp (1992) <doi:10.1002/bimj.4710360207> and Johnson, Kotz, and Balakrishnan (1995), for discrete and continuous distributions respectively. These numerical approximation implementations are intended for researchers in the area, notably for the author's own use in improving the standard R 'dpq' functions such as pbeta() and qgamma().
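For orientation, two of the standard R functions the package provides approximations and alternatives for, e.g. the noncentral beta cdf and a gamma quantile:

    pbeta(0.8, shape1 = 2, shape2 = 3, ncp = 1.5)  # noncentral beta cdf
    qgamma(0.975, shape = 2, rate = 1)             # gamma quantile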
Opinionated functions that enable easier and faster analysis of Viva Insights data. There are three main types of functions in 'wpa': (i) standard functions create a ggplot visual or a summary table based on a specific Viva Insights metric; (ii) report generation functions generate HTML reports on a specific analysis area, e.g. Collaboration; (iii) other miscellaneous functions cover more specific applications (e.g. Subject Line text mining) of Viva Insights data. This package adheres to tidyverse principles and works well with the pipe syntax. 'wpa' is built with beginner-to-intermediate R users in mind, and is optimised for simplicity.
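A hedged sketch of the standard-function pattern described above, using what is, if memory serves, the package's bundled sample data set and one of its scan functions; check the package documentation before relying on these names:

    library(wpa)
    # keymetrics_scan(sq_data, hrvar = "Organization")  # heatmap of key metrics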
This package provides a distance density clustering (DDC) algorithm in R. DDC uses dynamic time warping (DTW) to compute a similarity matrix, based on which cluster centers and cluster assignments are found; it inherits DTW arguments and constraints. The cluster centers are centroid points calculated using the DTW Barycenter Averaging (DBA) algorithm. The clustering process is divisive: at each iteration, cluster centers are updated and data is reassigned to them, and early stopping is possible. The output includes cluster centers and the clustering assignment, as described in Ma et al. (2017) <doi:10.1109/ICDMW.2017.11>.
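A hedged sketch of the first DDC step described above, building a DTW distance matrix, here with the general-purpose dtw package rather than this package's own interface:

    library(dtw)
    set.seed(1)
    series <- replicate(5, cumsum(rnorm(50)), simplify = FALSE)  # toy series
    n <- length(series)
    D <- matrix(0, n, n)
    for (i in 1:n) for (j in 1:n)
      D[i, j] <- dtw(series[[i]], series[[j]])$distance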
Various tools for the analysis of univariate, multivariate and functional extremes. Exact simulation from max-stable processes (Dombry, Engelke and Oesting, 2016, <doi:10.1093/biomet/asw008>) and R-Pareto processes for various parametric models, including Brown-Resnick (Wadsworth and Tawn, 2014, <doi:10.1093/biomet/ast042>) and Extremal Student (Thibaud and Opitz, 2015, <doi:10.1093/biomet/asv045>). Threshold selection methods, including Wadsworth (2016) <doi:10.1080/00401706.2014.998345> and Northrop and Coleman (2014) <doi:10.1007/s10687-014-0183-z>. Multivariate extreme diagnostics. Estimation and likelihoods for univariate extremes, e.g., Coles (2001) <doi:10.1007/978-1-4471-3675-0>.
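A hedged sketch of exact max-stable simulation with this package; the rmev() call and its arguments (a model string such as "br" for Brown-Resnick plus a dependence parameter) are from memory, so check ?rmev before relying on them:

    library(mev)
    # samp <- rmev(n = 100, d = 5, param = 1.5, model = "br")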
This package implements maximum likelihood and bootstrap methods based on the diversity-dependent birth-death process to test whether speciation or extinction is diversity-dependent, under a range of models including various types of key innovations. See Etienne et al. 2012, Proc. Roy. Soc. B 279: 1300-1309, <DOI:10.1098/rspb.2011.1439>, Etienne & Haegeman 2012, Am. Nat. 180: E75-E89, <DOI:10.1086/667574>, Etienne et al. 2016, Meth. Ecol. Evol. 7: 1092-1099, <DOI:10.1111/2041-210X.12565> and Laudanno et al. 2021, Syst. Biol. 70: 389-407, <DOI:10.1093/sysbio/syaa048>. The package also contains functions to simulate the diversity-dependent process.
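A hedged sketch of simulating the diversity-dependent process with this package; the argument layout of dd_sim() (intrinsic speciation rate, extinction rate, carrying capacity K, then crown age) is from memory, so check ?dd_sim:

    library(DDD)
    # sim <- dd_sim(pars = c(0.8, 0.1, 40), age = 10)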
This package provides a collection of process capability index functions, such as C_p(), C_pk(), C_pm(), and others, along with metadata about each, like LaTeX equations and R expressions. Its primary purpose is to form a foundation for other quality control packages to build on top of, by providing basic resources and functions. The indices belong to the field of statistical quality control, and quantify the degree to which a manufacturing process is able to create items that adhere to a certain standard of quality. For details see Montgomery, D. C. (2019, ISBN:978-1-119-39930-8).
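The classical formulas behind two of the indices named above, computed directly in base R for hypothetical measurements and specification limits (the package wraps these in C_p(), C_pk(), and friends):

    x <- rnorm(200, mean = 10, sd = 0.5)  # process measurements
    usl <- 12; lsl <- 8                   # upper/lower specification limits
    cp  <- (usl - lsl) / (6 * sd(x))
    cpk <- min(usl - mean(x), mean(x) - lsl) / (3 * sd(x))
    c(cp = cp, cpk = cpk)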
In panel data settings, this package specifies a set of candidate models, fits them to data from pre-treatment validation periods, and selects the model as an average over the candidate models, weighting each by the posterior probability of being most robust given its differential average prediction errors in the pre-treatment validation periods. Subsequent estimation and inference of the causal effect's bounds accounts for both model and sampling uncertainty, and calculates the robustness changepoint value at which the bounds go from excluding to including 0. The package also includes a range of diagnostic plots, such as those illustrating the models' differential average prediction errors and the posterior distribution of which model is most robust.
Sequential and batch change detection for univariate data streams, using the change point model framework. Functions are provided to allow nonparametric distribution-free change detection in the mean, variance, or general distribution of a given sequence of observations. Parametric change detection methods are also provided for Gaussian, Bernoulli and Exponential sequences. Both the batch (Phase I) and sequential (Phase II) settings are supported, and the sequences may contain either a single or multiple change points. A full description of this package is available in Ross, G. J. (2015), "Parametric and nonparametric sequential change detection in R", available at <https://www.jstatsoft.org/article/view/v066i03>.
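A hedged sketch of sequential (Phase II) detection following the usage in the article above; the detectChangePoint() call and its cpmType/ARL0 arguments should be checked against the package documentation:

    library(cpm)
    set.seed(1)
    x <- c(rnorm(100, 0), rnorm(100, 2))  # mean change at t = 100
    res <- detectChangePoint(x, cpmType = "Mann-Whitney", ARL0 = 500)
    res$changeDetected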
Maximum likelihood estimation for the semi-parametric joint modeling of competing risks and longitudinal data in the presence of heterogeneous within-subject variability, proposed by Li and colleagues (2023) <arXiv:2301.06584>. The proposed method models the within-subject variability of the biomarker and associates it with the risk of the competing risks event. The time-to-event data is modeled using a (cause-specific) Cox proportional hazards regression model with time-fixed covariates. The longitudinal outcome is modeled using a mixed-effects location and scale model. The association is captured by shared random effects. The model is estimated using an Expectation Maximization algorithm.
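A hedged LaTeX sketch of the mixed-effects location and scale submodel described above, in its standard MELS form (symbols are generic stand-ins, not necessarily the paper's exact notation):

    y_{ij} = x_{ij}^\top \beta + b_i + \epsilon_{ij}, \quad
      \epsilon_{ij} \sim N(0, \sigma_{ij}^2), \qquad
    \log \sigma_{ij}^2 = z_{ij}^\top \tau + v_i,

where b_i is the random location effect and v_i the random scale effect capturing within-subject variability; such random effects are what the description says are shared with the cause-specific Cox hazards.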
Time series decomposition for univariate time series using the "Verallgemeinerte Berliner Verfahren" (Generalized Berlin Method, VBV) as described in 'Kontinuierliche Messgrößen und Stichprobenstrategien in Raum und Zeit mit Anwendungen in den Natur-, Umwelt-, Wirtschafts- und Finanzwissenschaften' by Hebbel and Steuer, Springer Berlin Heidelberg, 2022 <doi:10.1007/978-3-662-65638-9>, or 'Decomposition of Time Series using the Generalised Berlin Method (VBV)' by Hebbel and Steuer, in Jan Beran, Yuanhua Feng, Hartmut Hebbel (Eds.): Empirical Economic and Financial Research - Theory, Methods and Practice, Festschrift in Honour of Prof. Siegfried Heiler. Series: Advanced Studies in Theoretical and Applied Econometrics. Springer 2014, pp. 9-40.
Computes Weighted Topological Overlap networks with positive and negative signs (wTO) given a data frame containing the mRNA count/expression/abundance per sample and a vector containing the nodes of interest (a subset of the elements of the full data frame). It also computes the cut-off threshold or p-value based on bootstrapping individuals or reshuffling values within individuals, and allows the construction of a consensus network from multiple wTO networks. The package includes a visualization tool for the networks. More about the methodology can be found at <doi:10.1186/s12859-018-2351-7>.
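A minimal base-R sketch of the signed weighted topological overlap for one pair of nodes, from a correlation-based signed adjacency (the package computes this for all pairs and adds the significance machinery described above):

    set.seed(1)
    expr <- matrix(rnorm(20 * 6), ncol = 6)  # 20 samples x 6 genes
    a <- cor(expr)                           # signed adjacency
    k <- rowSums(abs(a)) - 1                 # connectivities, excluding self
    i <- 1; j <- 2
    num <- sum(a[i, -c(i, j)] * a[-c(i, j), j]) + a[i, j]
    num / (min(k[i], k[j]) + 1 - abs(a[i, j]))  # wTO for nodes i, j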
Compare two classifications or clustering solutions that may or may not have the same number of classes, and that might have hard or soft (fuzzy, probabilistic) membership. Calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. This package is not geared towards traditional accuracy assessment for classification/mapping applications - the motivating use case is for comparing a probabilistic clustering solution to a set of reference or existing class labels that could have any number of classes (that is, without having to degrade the probabilistic clustering to hard classes).
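A minimal base-R sketch of the core comparison, cross-tabulating a soft membership matrix against hard reference labels by matrix multiplication (all values hypothetical):

    soft <- matrix(c(0.9, 0.1, 0.2, 0.8, 0.6, 0.4), ncol = 2, byrow = TRUE)
    hard <- matrix(c(1, 0, 0, 1, 1, 0), ncol = 2, byrow = TRUE)  # reference
    crossprod(soft, hard)  # class-by-class agreement matrix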
Spatio-temporal Fixation Pattern Analysis (FPA) is a new method of analyzing eye movement data, developed by Mr. Jinlu Cao under the supervision of Prof. Chen Hsuan-Chih at The Chinese University of Hong Kong and Prof. Wang Suiping at the South China Normal University. The "fpa" package is an R implementation which makes FPA analysis much easier. There are four major functions in the package: ft2fp(), get_pattern(), plot_pattern(), and lineplot(). The function ft2fp() is the core function, which can complete all the preprocessing within moments. The other three functions are supportive functions which visualize the eye fixation patterns.
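A hedged sketch of the four-step workflow named above; the input object and any arguments are hypothetical placeholders, so consult each function's help page:

    library(fpa)
    # fp  <- ft2fp(fixation_data)  # preprocess raw fixation data
    # pat <- get_pattern(fp)       # extract fixation patterns
    # plot_pattern(pat)            # visualize the patterns
    # lineplot(pat)                # supporting line plot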