Implementation of double machine learning (DML) algorithms in R, based on Emmenegger and Buehlmann (2021) "Regularizing Double Machine Learning in Partially Linear Endogenous Models" <arXiv:2101.12525> and Emmenegger and Buehlmann (2021) "Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements" <arXiv:2108.13657>. First part: our goal is to perform inference for the linear parameter in partially linear models with confounding variables. The standard DML estimator of the linear parameter has a two-stage least squares interpretation, which can lead to a large variance and overwide confidence intervals. We apply regularization to reduce the variance of the estimator, which produces narrower confidence intervals that are approximately valid. Nuisance terms can be flexibly estimated with machine learning algorithms. Second part: our goal is to estimate and perform inference for the linear coefficient in a partially linear mixed-effects model with DML. Machine learning algorithms allow us to incorporate more complex interaction structures and high-dimensional variables.
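A minimal usage sketch, assuming the package loads as dmlalg and exports regsdml() for the regularized estimator of the first paper (argument names are illustrative):

    library(dmlalg)
    set.seed(1); n <- 200
    w <- matrix(rnorm(3 * n), ncol = 3)         # confounders entering nonparametrically
    a <- rnorm(n, mean = sin(w[, 1]))           # instrument
    x <- rnorm(n, mean = a + w[, 2])            # endogenous regressor
    y <- 0.5 * x + cos(w[, 3])^2 + rnorm(n)     # true linear effect: 0.5
    fit <- regsdml(a = a, w = w, x = x, y = y)  # regularized DML; signature assumed
    summary(fit)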
Tool for easy and efficient discretization of continuous and categorical data. The package calculates an optimal binning of a given explanatory variable with respect to a user-specified target variable. The purpose is to assign a unique Weight-of-Evidence value to each of the calculated binpoints in order to recode the original variable. The package allows users to impose certain restrictions on the functional form of the resulting binning while maximizing the overall information value in the original data. The package is well suited for logistic scoring models where input variables may be subject to restrictions such as linearity by, e.g., regulatory authorities. An excellent source describing in detail the development of scorecards, and the role of Weight-of-Evidence coding in credit scoring, is Siddiqi (2006, ISBN: 978-0-471-75451-0). The package utilizes the discrete nature of decision trees and isotonic regression to accommodate the trade-off between flexible functional forms and maximum information value.
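For a single bin, the Weight-of-Evidence is the log ratio of the bin's share of events to its share of non-events; a minimal base-R illustration (the sign convention varies across references, and the package API is not shown):

    goods <- c(180, 300, 520)   # non-events per bin
    bads  <- c( 60,  45,  15)   # events per bin
    woe <- log((bads / sum(bads)) / (goods / sum(goods)))
    iv  <- sum((bads / sum(bads) - goods / sum(goods)) * woe)  # information value
    round(woe, 3); round(iv, 3)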
The at-Risk (aR) approach is based on a two-step parametric estimation procedure that allows forecasting the full conditional distribution of an economic variable at a given horizon, as a function of a set of factors. These density forecasts are then used to produce coherent forecasts for any downside risk measure, e.g., value-at-risk, expected shortfall, downside entropy. Initially introduced by Adrian et al. (2019) <doi:10.1257/aer.20161923> to reveal the vulnerability of economic growth to financial conditions, the aR approach is currently extensively used by international financial institutions to provide Value-at-Risk (VaR) type forecasts for GDP growth (Growth-at-Risk) or inflation (Inflation-at-Risk). This package provides methods for estimating these models. Datasets for the US and the Eurozone are available to allow testing of the Adrian et al. (2019) model. This package constitutes a useful toolbox (data and functions) for private practitioners, scholars, and policymakers.
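The two-step logic can be sketched with quantreg (an illustration of the approach, not this package's interface): quantile regressions of future growth on current factors, followed by a parametric density fit to the estimated quantiles:

    library(quantreg)
    df <- data.frame(y = rnorm(200), nfci = rnorm(200))    # placeholder data
    taus <- c(0.05, 0.25, 0.50, 0.75, 0.95)
    qfit <- rq(y ~ nfci, tau = taus, data = df)            # step 1: conditional quantiles
    qhat <- predict(qfit, newdata = data.frame(nfci = 1))
    ## step 2: fit a parametric (e.g., skew-t) density to qhat, then read off
    ## VaR, expected shortfall, or other downside risk measures from it.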
This package provides graph-constrained regression methods in which regularization parameters are selected automatically via estimation of an equivalent Linear Mixed Model formulation. The riPEER (ridgified Partially Empirical Eigenvectors for Regression) method employs a penalty term that is a linear combination of graph-originated and ridge-originated penalty terms, whose two regularization parameters are ML estimators from the corresponding Linear Mixed Model solution; the graph-originated penalty term allows imposing similarity between coefficients based on the given graph information, whereas the additional ridge-originated penalty term facilitates parameter estimation: it reduces computational issues arising from singularity in the graph-originated penalty matrix and yields plausible results when the graph information is not informative. The riPEERc (ridgified Partially Empirical Eigenvectors for Regression with constant) method adds a diagonal matrix multiplied by a predefined (small) scalar to handle the non-invertibility of a graph Laplacian matrix. The vrPEER (variable-reduced PEER) method performs a variable-reduction procedure to handle the non-invertibility of a graph Laplacian matrix.
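A sketch of the penalized criterion in base R (illustrative of the penalty structure, not the package interface):

    ## minimize ||y - Z b||^2 + lambda_Q * t(b) %*% Q %*% b + lambda_R * ||b||^2,
    ## where Q is the graph Laplacian; the ridge-type closed form is:
    ripeer_coef <- function(y, Z, Q, lambda_Q, lambda_R) {
      p <- ncol(Z)
      solve(crossprod(Z) + lambda_Q * Q + lambda_R * diag(p), crossprod(Z, y))
    }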
Create short sprint acceleration-velocity (AVP) and force-velocity (FVP) profiles and predict kinematic and kinetic variables using timing-gate split times, laser or radar gun data, tether device data, as well as data provided by GPS and LPS monitoring systems. The modeling method utilized in this package is based on the works of Furusawa K, Hill AV, Parkinson JL (1927) <doi:10.1098/rspb.1927.0035>, Greene PR (1986) <doi:10.1016/0025-5564(86)90063-5>, Chelly SM, Denis C (2001) <doi:10.1097/00005768-200102000-00024>, Clark KP, Rieger RH, Bruno RF, Stearne DJ (2017) <doi:10.1519/JSC.0000000000002081>, Samozino P (2018) <doi:10.1007/978-3-319-05633-3_11>, Samozino P, Peyrot N, et al (2022) <doi:10.1111/sms.14097>, Clavel P, et al (2023) <doi:10.1016/j.jbiomech.2023.111602>, Jovanovic M (2023) <doi:10.1080/10255842.2023.2170713>, and Jovanovic M, et al (2024) <doi:10.3390/s24092894>.
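The underlying mono-exponential sprint model is v(t) = MSS * (1 - exp(-t / TAU)); a base-R sketch fitting it to simulated radar-gun samples with nls() (the package itself provides higher-level fitting functions):

    t <- seq(0.2, 6, by = 0.2)
    v <- 9 * (1 - exp(-t / 1.3)) + rnorm(length(t), sd = 0.15)
    fit <- nls(v ~ MSS * (1 - exp(-t / TAU)), start = list(MSS = 8, TAU = 1))
    coef(fit)  # maximal sprinting speed and relative acceleration (time constant)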
Fit Hawkes and log-Gaussian Cox process models with extensions. Introduced in Hawkes (1971) <doi:10.2307/2334319>, a Hawkes process is a self-exciting temporal point process where the occurrence of an event immediately increases the chance of another. We extend this to consider self-inhibiting processes and a non-homogeneous background rate. A log-Gaussian Cox process is a Poisson point process where the log-intensity is given by a Gaussian random field. We extend this to a joint likelihood formulation fitting a marked log-Gaussian Cox model. In addition, the package offers functionality to fit self-exciting spatiotemporal point processes. Models are fitted via maximum likelihood using TMB (Template Model Builder). Where included, (1) random fields are assumed to be Gaussian and are integrated over using the Laplace approximation, and (2) a stochastic partial differential equation model, introduced by Lindgren, Rue, and Lindström (2011) <doi:10.1111/j.1467-9868.2011.00777.x>, is defined for the field(s).
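With an exponential decay kernel, the Hawkes conditional intensity is lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i)); a base-R evaluation of that quantity (fitting in the package goes through TMB):

    hawkes_intensity <- function(t, times, mu, alpha, beta) {
      past <- times[times < t]
      mu + alpha * sum(exp(-beta * (t - past)))
    }
    hawkes_intensity(5, times = c(1, 2, 4.5), mu = 0.5, alpha = 0.8, beta = 1.2)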
Correspondence analysis (CA) is a matrix factorization method, and is similar to principal components analysis (PCA). Whereas PCA is designed for application to continuous, approximately normally distributed data, CA is appropriate for non-negative, count-based data that are on the same additive scale. The corral package implements CA for dimensionality reduction of a single matrix of single-cell data, as well as a multi-table adaptation of CA that leverages data-optimized scaling to align data generated from different sequencing platforms by projecting into a shared latent space. corral utilizes sparse matrices and a fast implementation of SVD, and can be called directly on Bioconductor objects (e.g., SingleCellExperiment) for easy pipeline integration. The package also includes additional options, including variations of CA to address overdispersion in count data (e.g., Freeman-Tukey chi-squared residual), as well as the option to apply CA-style processing to continuous data (e.g., proteomic TOF intensities) with the Hellinger distance adaptation of CA.
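Classical CA factorizes the matrix of standardized (Pearson) residuals; a compact base-R sketch of the decomposition corral builds on:

    ca_svd <- function(M) {
      P <- M / sum(M)                             # correspondence matrix
      r <- rowSums(P); c <- colSums(P)            # row and column masses
      S <- (P - outer(r, c)) / sqrt(outer(r, c))  # standardized (Pearson) residuals
      svd(S)                                      # corral uses a fast, sparse-aware SVD here
    }
    ca_svd(matrix(rpois(200, 5), nrow = 20))$d[1:3]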
Computing and plotting the distance covariance and correlation function of a univariate or a multivariate time series. Both biased and unbiased estimators of distance covariance and correlation are provided. Test statistics for testing pairwise independence are also implemented. Some data sets are also included. References include: a) Edelmann Dominic, Fokianos Konstantinos and Pitsillou Maria (2019). 'An Updated Literature Review of Distance Correlation and Its Applications to Time Series'. International Statistical Review, 87(2): 237--262. <doi:10.1111/insr.12294>. b) Fokianos Konstantinos and Pitsillou Maria (2018). 'Testing independence for multivariate time series via the auto-distance correlation matrix'. Biometrika, 105(2): 337--352. <doi:10.1093/biomet/asx082>. c) Fokianos Konstantinos and Pitsillou Maria (2017). 'Consistent testing for pairwise dependence in time series'. Technometrics, 59(2): 262--270. <doi:10.1080/00401706.2016.1156024>. d) Pitsillou Maria and Fokianos Konstantinos (2016). 'dCovTS: Distance Covariance/Correlation for Time Series'. R Journal, 8(2): 324--340. <doi:10.32614/RJ-2016-049>.
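A typical call, assuming the exported estimator is ADCF() as in the R Journal paper (argument names may differ across versions):

    library(dCovTS)
    x <- as.numeric(AirPassengers)
    ADCF(x, MaxLag = 12)   # auto-distance correlation function up to lag 12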
An implementation of sparsity-ranked lasso and related methods for time series data. This methodology is especially useful for large time series with exogenous features and/or complex seasonality. Originally described in Peterson and Cavanaugh (2022) <doi:10.1007/s10182-021-00431-7> in the context of variable selection with interactions and/or polynomials, ranked sparsity is a philosophy with methods useful for variable selection in the presence of prior informational asymmetry. This situation exists for time series data with complex seasonality, as shown in Peterson and Cavanaugh (2024) <doi:10.1177/1471082X231225307>, which also describes this package in greater detail. The sparsity-ranked penalization methods for time series implemented in fastTS can fit large/complex/high-frequency time series quickly, even with a high-dimensional exogenous feature set. The method is considerably faster than its competitors, while often producing more accurate predictions. Also included is a long hourly series of arrivals into the University of Iowa Emergency Department with concurrent local temperature.
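A hedged usage sketch, assuming the main entry point is fastTS() with an n_lags_max argument for the candidate lag set:

    library(fastTS)
    y <- as.numeric(AirPassengers)
    fit <- fastTS(y, n_lags_max = 24)  # sparsity-ranked lasso over 24 candidate lags
    summary(fit)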
This package contains the functions for testing the spatial patterns (of segregation, spatial symmetry, association, disease clustering, species correspondence, and reflexivity) based on nearest neighbor relations, especially using contingency tables such as nearest neighbor contingency tables (Ceyhan (2010) <doi:10.1007/s10651-008-0104-x> and Ceyhan (2017) <doi:10.1016/j.jkss.2016.10.002> and references therein), nearest neighbor symmetry contingency tables (Ceyhan (2014) <doi:10.1155/2014/698296>), and species correspondence and reflexivity contingency tables (Ceyhan (2018) <doi:10.2436/20.8080.02.72>) for two- (or higher-) dimensional data. The package also contains functions for generating patterns of segregation, association, and uniformity in a multi-class setting (Ceyhan (2014) <doi:10.1007/s00477-013-0824-9>), and various non-random labeling patterns for disease clustering in two-dimensional cases (Ceyhan (2014) <doi:10.1002/sim.6053>), and for visualization of all these patterns for two-dimensional data. The tests are usually (asymptotic) normal z-tests or chi-square tests.
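The core object is a nearest neighbor contingency table; a base-R construction for simulated two-class data (the package's tests build on tables like this):

    set.seed(7)
    xy  <- matrix(runif(2 * 60), ncol = 2)  # 60 points in the unit square
    cls <- factor(sample(c("A", "B"), 60, replace = TRUE))
    D   <- as.matrix(dist(xy)); diag(D) <- Inf
    nn  <- apply(D, 1, which.min)           # each point's nearest neighbor
    table(base = cls, neighbor = cls[nn])   # 2 x 2 NN contingency table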
mistyR is an implementation of the Multiview Intercellular SpaTial modeling framework (MISTy). MISTy is an explainable machine learning framework for knowledge extraction and analysis of single-cell, highly multiplexed, spatially resolved data. MISTy facilitates an in-depth understanding of marker interactions by profiling the intra- and intercellular relationships. MISTy is a flexible framework able to process a custom number of views. Each of these views can describe a different spatial context, i.e., define a relationship among the observed expressions of the markers, such as intracellular regulation or paracrine regulation; the views can also capture cell-type-specific relationships, relations between functional footprints, or relations between different anatomical regions. Each MISTy view is considered as a potential source of variability in the measured marker expressions. Each view is then analyzed for its contribution to the total expression of each marker and is explained in terms of the interactions with other measurements that led to the observed contribution.
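A sketch of a typical workflow (function names as in the mistyR vignettes; the expression matrix and coordinates are simulated placeholders):

    library(mistyR)
    expr <- as.data.frame(matrix(abs(rnorm(400)), 100, 4,
                                 dimnames = list(NULL, paste0("m", 1:4))))
    pos  <- data.frame(x = runif(100), y = runif(100))
    views <- create_initial_view(expr)          # intraview: intracellular context
    views <- add_paraview(views, pos, l = 10)   # paraview: broader spatial context
    run_misty(views, results.folder = "results")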
Log-multiplicative association models (LMA) are models for cross-classifications of categorical variables where interactions are represented by products of category scale values and an association parameter. Maximum likelihood estimation (MLE) fails for moderate to large numbers of categorical variables. The pleLMA package overcomes this limitation of MLE by using pseudo-likelihood estimation to fit the models to small or large cross-classifications of dichotomous or multi-category variables. Originally proposed by Besag (1974, <doi:10.1111/j.2517-6161.1974.tb00999.x>), pseudo-likelihood estimation takes large, complex models and breaks them down into smaller ones. Rather than maximizing the likelihood of the joint distribution of all the variables, a pseudo-likelihood function, which is the product of likelihoods from conditional distributions, is maximized. LMA models can be derived from a number of different frameworks including (but not limited to) graphical models and uni-dimensional and multi-dimensional item response theory models. More details about the models and estimation can be found in the vignette.
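In symbols, for observations i = 1, ..., n on variables x_1, ..., x_J, the log pseudo-likelihood replaces the joint likelihood by a sum of full conditionals:

    \ell_{PL}(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{J} \log p_\theta\left( x_{ij} \mid x_{i,-j} \right)

so each term only requires the (tractable) conditional distribution of one variable given the rest.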
Allows the user to carry out GLM estimation on very large data sets. Data can be created using the data_frame() function and appended to the object with object$append(data); data_frame and data_matrix objects are available that allow the user to store large data on disk. The data is stored as doubles in binary format, and any character columns are transformed to factors and then stored as numeric (binary) data, while a look-up table is stored in a separate .meta_data file in the same folder. The data is stored in blocks, and the GLM regression algorithm is modified to carry out a MapReduce-like algorithm to fit the model. The functions bglm(), summary(), and bglm_predict() are available for creating and post-processing of models. The library requires Armadillo to be installed on your system. It may not function on Windows, since multi-core processing is done using mclapply(), which forks R on Unix/Linux-type operating systems.
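A usage sketch built from the functions named above (exact signatures are assumptions and may differ):

    chunk <- data.frame(y = rbinom(1e5, 1, 0.3), x = rnorm(1e5))
    df <- data_frame(chunk)        # create an on-disk data_frame; signature assumed
    df$append(chunk)               # append a further block of rows
    fit <- bglm(y ~ x, family = binomial(), data = df)  # block-wise fit; signature assumed
    summary(fit)
    pred <- bglm_predict(fit, df)  # signature assumed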
This package provides a system for writing hierarchical statistical models largely compatible with 'BUGS' and 'JAGS', writing nimbleFunctions to operate models and do basic R-style math, and compiling both models and nimbleFunctions via custom-generated C++. NIMBLE includes default methods for MCMC, Laplace approximation, Monte Carlo expectation maximization, and some other tools. The nimbleFunction system makes it easy to do things like implement new MCMC samplers from R, customize the assignment of samplers to different parts of a model from R, and compile the new samplers automatically via C++ alongside the samplers NIMBLE provides. NIMBLE extends the 'BUGS'/'JAGS' language by making it extensible: new distributions and functions can be added, including as calls to external compiled code. Although most people think of MCMC as the main goal of the 'BUGS'/'JAGS' language for writing models, one can use NIMBLE for writing arbitrary other kinds of model-generic algorithms as well. A full User Manual is available at <https://r-nimble.org>.
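A minimal workflow sketch using the documented one-line MCMC driver:

    library(nimble)
    code <- nimbleCode({
      mu ~ dnorm(0, sd = 10)
      sigma ~ dunif(0, 10)
      for (i in 1:N) y[i] ~ dnorm(mu, sd = sigma)
    })
    samples <- nimbleMCMC(code, constants = list(N = 50),
                          data = list(y = rnorm(50, 2, 1)),
                          inits = list(mu = 0, sigma = 1),
                          niter = 2000, nburnin = 500)
    colMeans(samples)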
This package provides the vcd2df function, which loads an IEEE 1364-1995/2001 VCD (.vcd) file, specified as a string file path, and returns an R dataframe containing values over time. A VCD file captures the register values at discrete timepoints from a simulated trace of execution of a hardware design in Verilog or VHDL. The returned dataframe contains a row for each register, by name, and a column for each time point, specified VCD-style using octothorpe-prefixed multiples of the timescale as strings. The only non-trivial implementation details are that (1) VCD x and z non-numerical values are encoded as -1 (as otherwise all bit values are non-negative) and (2) registers with repeated names in distinct modules are ignored, rather than duplicated, as we anticipate these registers to have the same values. Read more in the arXiv preprint: vcd2df -- Leveraging Data Science Insights for Hardware Security Research <doi:10.48550/arXiv.2505.06470>.
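A usage sketch (the file path and register name are placeholders):

    library(vcd2df)
    df <- vcd2df("trace.vcd")     # path to a VCD file from a Verilog/VHDL simulation
    df["top.alu.result", ]        # one register's value at each '#'-prefixed timepoint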
The TRONCO (TRanslational ONCOlogy) R package collects algorithms to infer progression models via the approach of Suppes-Bayes Causal Networks, both from an ensemble of tumors (cross-sectional samples) and within an individual patient (multi-region or single-cell samples). The package provides parallel implementations of algorithms that process binary matrices where each row represents a tumor sample and each column a single-nucleotide or a structural variant driving the progression; a 0/1 value models the absence/presence of that alteration in the sample. The tool can import data from plain, MAF or GISTIC format files, and can fetch it from the cBioPortal for cancer genomics. Functions for data manipulation and visualization are provided, as well as functions to import/export such data to other bioinformatics tools for, e.g., clustering or detection of mutually exclusive alterations. Inferred models can be visualized and tested for their confidence via bootstrap and cross-validation. TRONCO is used for the implementation of the Pipeline for Cancer Inference (PICNIC).
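A hedged sketch of a typical reconstruction (function names follow the TRONCO vignettes; the genotype matrix is simulated):

    library(TRONCO)
    geno <- matrix(rbinom(200, 1, 0.3), nrow = 20,
                   dimnames = list(paste0("sample", 1:20), paste0("gene", 1:10)))
    d   <- import.genotypes(geno)  # 0/1 alteration matrix -> TRONCO object
    mod <- tronco.capri(d)         # infer a Suppes-Bayes causal network (CAPRI)
    oncoprint(mod)                 # visualize the data and fitted model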
The tools for MicroRNA Set Enrichment Analysis can identify risk pathways (or prior gene sets) regulated by a microRNA set in the context of microRNA expression data. (1) This package constructs a correlation profile of microRNAs and pathways by the hypergeometric statistic test. The gene sets of pathways are derived from three public databases ('KEGG' (Kyoto Encyclopedia of Genes and Genomes), 'Reactome', and 'Biocarta'), and the target gene sets of microRNAs are provided by four databases ('TarBaseV6.0', 'mir2Disease', 'miRecords', and 'miRTarBase'). (2) This package can quantify the change of correlation between microRNAs for each pathway (or prior gene set) based on microRNA expression data with cases and controls. (3) This package uses the weighted Kolmogorov-Smirnov statistic to calculate an enrichment score (ES) of a microRNA set that co-regulates a pathway, which reflects the degree to which a given pathway is associated with the specific phenotype. (4) This package can provide visualization of the results.
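The ES is the maximum deviation of a weighted Kolmogorov-Smirnov running sum; a base-R sketch of that statistic (GSEA-style, shown generically rather than as this package's API):

    es_score <- function(stats, in_set, p = 1) {
      ord <- order(stats, decreasing = TRUE)
      hit <- in_set[ord]
      w   <- abs(stats[ord])^p
      run <- cumsum(ifelse(hit, w / sum(w[hit]), -1 / sum(!hit)))
      run[which.max(abs(run))]  # signed maximum deviation from zero
    }
    es_score(rnorm(100), sample(c(TRUE, FALSE), 100, TRUE))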
Designed for estimating variants of hidden (latent) Markov models (HMMs), mixture HMMs, and non-homogeneous HMMs (NHMMs) for social sequence data and other categorical time series. Special cases include feedback-augmented NHMMs, Markov models without a latent layer, mixture Markov models, and latent class models. The package supports models for one or multiple subjects with one or multiple parallel sequences (channels). External covariates can be added to explain cluster membership in mixture models as well as initial, transition, and emission probabilities in NHMMs. The package provides functions for evaluating and comparing models, as well as functions for visualizing multichannel sequence data and HMMs. For NHMMs, methods for computing average causal effects and marginal state and emission probabilities are available. Models are estimated using maximum likelihood via the EM algorithm or direct numerical maximization with analytical gradients. Documentation is available via several vignettes and Helske and Helske (2019, <doi:10.18637/jss.v088.i03>). For the methodology behind the NHMMs, see Helske (2025, <doi:10.48550/arXiv.2503.16014>).
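A sketch of fitting a basic HMM (build_hmm() and fit_model() as in the JSS paper; the n_states shortcut for random starting values is assumed):

    library(seqHMM)
    data("mvad", package = "TraMineR")
    mvad_seq <- TraMineR::seqdef(mvad[, 17:86])  # monthly activity sequences
    init <- build_hmm(observations = mvad_seq, n_states = 3)
    fit  <- fit_model(init)
    fit$logLik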
Interactive visualizations of graphs created with the igraph package using an htmlwidgets wrapper for the sigma.js network visualization library v2.4.0 <https://www.sigmajs.org/>, enabling the display of several thousand nodes. While several R packages have been developed to interface sigma.js, all were developed for v1.x.x and none have migrated to v2.4.0 nor are they planning to. This package builds upon the sigmaNet package, and users familiar with it will recognize the similar design approach. Two extensions have been added to the classic sigma.js visualizations by overriding the underlying JavaScript code: drawing a frame around node labels, and displaying labels on multiple lines by parsing line breaks. Other additional functionalities that did not require overriding sigma.js code include toggling node visibility on click based on a node attribute, and highlighting specific edges. sigma.js is currently preparing a stable release v3.0.0, and this package plans to update to it when it is available.
An integrated set of tools for thermodynamic calculations in aqueous geochemistry and geobiochemistry. Functions are provided for writing balanced reactions to form species from user-selected basis species and for calculating the standard molal properties of species and reactions, including the standard Gibbs energy and equilibrium constant. Calculations of the non-equilibrium chemical affinity and equilibrium chemical activity of species can be portrayed on diagrams as a function of temperature, pressure, or activity of basis species; in two dimensions, this gives a maximum affinity or predominance diagram. The diagrams have formatted chemical formulas and axis labels, and water stability limits can be added to Eh-pH, oxygen fugacity-temperature, and other diagrams with a redox variable. The package has been developed to handle common calculations in aqueous geochemistry, such as solubility due to complexation of metal ions, mineral buffers of redox or pH, and changing the basis species across a diagram ("mosaic diagrams"). CHNOSZ also implements a group additivity algorithm for the standard thermodynamic properties of proteins.
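A short example of the core workflow (these functions and the Eh-pH example follow the package documentation):

    library(CHNOSZ)
    basis(c("H2S", "H2O", "H+", "e-"))           # basis species with a redox variable
    species(c("H2S", "HS-", "HSO4-", "SO4-2"))   # aqueous sulfur species
    a <- affinity(pH = c(0, 12), Eh = c(-1, 1))  # affinities over a pH-Eh grid
    diagram(a)                                   # maximum affinity (predominance) diagram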
Functions are provided for estimation, testing, diagnostic checking, and forecasting of generalized linear autoregressive moving average (GLARMA) models for discrete-valued time series with regression variables. These are a class of observation-driven non-linear non-Gaussian state space models. The state vector consists of a linear regression component plus an observation-driven component consisting of an autoregressive moving average (ARMA) filter of past predictive residuals. Currently three distributions (Poisson, negative binomial, and binomial) can be used for the response series. Three options (Pearson, score-type, and unscaled) for the residuals in the observation-driven component are available. Estimation is via maximum likelihood (conditional on initializing values for the ARMA process) optimized using Fisher scoring or Newton-Raphson iterative methods. Likelihood ratio and Wald tests for the observation-driven component allow testing for serial dependence in generalized linear model settings. Graphical diagnostics including model fits, autocorrelation functions, and probability integral transform residuals are included in the package, as are several standard data sets.
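A hedged sketch of a Poisson GLARMA fit (the glarma() interface follows the package documentation; the data are simulated):

    library(glarma)
    set.seed(1); n <- 200
    X <- cbind(Intercept = 1, x = rnorm(n))
    y <- rpois(n, exp(0.5 + 0.3 * X[, "x"]))
    fit <- glarma(y, X, phiLags = 1, type = "Poi",
                  method = "FS", residuals = "Pearson")
    summary(fit)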
Real-time quantitative polymerase chain reaction (qPCR) data sets by Karlen et al. (2007) <doi:10.1186/1471-2105-8-131>. Provides a single tidy tabular data set in long format, encompassing 32 dilution series, for seven PCR targets and four biological samples. The targeted amplicons are within the murine genes Cav1, Ccn2, Eln, Fn1, Rpl27, Hspg2, and Serpine1. Dilution series: scheme 1 (Cav1, Eln, Hspg2, Serpine1): 1-fold, 10-fold, 50-fold, and 100-fold; scheme 2 (Ccn2, Rpl27, Fn1): 1-fold, 10-fold, 50-fold, 100-fold, and 1000-fold. For each concentration there are five replicates, except for the 1000-fold concentration, where only two replicates were performed. Each amplification curve is 40 cycles long. The original raw data file is Additional file 2 from "Statistical significance of quantitative PCR" by Y. Karlen, A. McNair, S. Perseguers, C. Mazza, and N. Mermod (2007) <https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-8-131/MediaObjects/12859_2006_1503_MOESM2_ESM.ZIP>.
Flexible functions that use lme4 as the computational engine for fitting models used in Genomic Selection (GS). GS is a technology used for genetic improvement, and it has many advantages over phenotype-based selection. There are several statistical models that adequately approach the statistical challenges in GS, such as linear mixed models (LMMs). lme4 is the standard R package for fitting linear and generalized LMMs, but its use for genetic analysis is limited because it does not allow the correlation between individuals or groups of individuals to be defined. The lme4GS package is focused on fitting LMMs with covariance structures defined by the user, bandwidth selection, and genomic prediction. The new package is focused on genomic prediction for the models used in GS and can fit LMMs using different variance-covariance matrices. Several examples of GS models are presented using this package, as well as analysis using real data. For more details see Caamal-Pat et al. (2021) <doi:10.3389/fgene.2021.680569>.
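A hedged sketch of a genomic prediction fit with a user-defined covariance (the lmerUvcov() name and Uvcov argument follow the accompanying paper, but exact signatures are assumptions):

    library(lme4GS)
    ## pheno: data frame with phenotype y and individual id;
    ## K: genomic relationship matrix with rows/columns named by id
    fit <- lmerUvcov(y ~ (1 | id), data = pheno,
                     Uvcov = list(id = list(K = K)))  # signature assumed
    summary(fit)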
To meet the needs of statistical power calculation for stepped wedge cluster randomized trials, we developed this software. Different parameters can be specified by users for different scenarios, including: cross-sectional and cohort designs, binary and continuous outcomes, marginal (GEE) and conditional (mixed effects) models, three link functions (identity, log, and logit), and designs with and without time effects (the default specification assumes no time effect) under exchangeable, nested exchangeable, and block exchangeable correlation structures. Unequal numbers of clusters per sequence are also allowed. The methods included in this package are from Zhou et al. (2020) <doi:10.1093/biostatistics/kxy031> and Li et al. (2018) <doi:10.1111/biom.12918>. Supplementary documents can be found at <https://ysph.yale.edu/cmips/research/software/study-design-power-calculation/swdpwr/>. The Shiny app for swdpwr can be accessed at <https://jiachenchen322.shinyapps.io/swdpwr_shinyapp/>. The package also includes functions that perform calculations of the intra-cluster correlation coefficients based on the random effects variances as input variables, for continuous and binary outcomes respectively.
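The design enters as a cluster-by-period treatment indicator matrix; a sketch (the swdpower() call is left commented out because its argument names are assumptions):

    ## rows = cluster sequences, columns = periods; 0 = control, 1 = intervention
    design <- matrix(c(0, 1, 1, 1,
                       0, 0, 1, 1,
                       0, 0, 0, 1), nrow = 3, byrow = TRUE)
    ## power <- swdpower(K = 30, design = design, family = "binomial", ...)  # names assumed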