Companion R package for the course "Statistical analysis of correlated and repeated measurements for health science researchers" taught by the Section of Biostatistics of the University of Copenhagen. It implements linear mixed models where the model for the variance-covariance of the residuals is specified via patterns (compound symmetry, Toeplitz, unstructured, ...). Statistical inference for mean, variance, and correlation parameters is performed based on the observed information and a Satterthwaite approximation of the degrees of freedom. Normalized residuals are provided to assess model misspecification. Statistical inference can be performed for arbitrary linear or non-linear combination(s) of model coefficients. Predictions can be computed conditional on covariates only or also on outcome values.
Machine learning method specifically designed for pre-miRNA prediction. It takes advantage of unlabeled sequences to improve the prediction rates even when there are just a few positive examples, or when the negative examples are unreliable or are not good representatives of their class. Furthermore, the method can automatically search for negative examples if the user is unable to provide them. MiRNAss can find a good boundary to divide the pre-miRNAs from other groups of sequences; it automatically optimizes the threshold that defines the class boundaries, and thus it is robust to high class imbalance. Each step of the method is scalable and can handle large volumes of data.
Transfer learning, a prevailing technique in computer science, aims to improve the performance of a target model by leveraging auxiliary information from heterogeneous source data. We provide novel tools for multi-source transfer learning under statistical models based on model averaging strategies, including linear regression models and partially linear models. Unlike existing transfer learning approaches, this method integrates the auxiliary information through data-driven weight assignments to avoid negative transfer. This is the first package for transfer learning based on the optimal model averaging frameworks, providing efficient implementations for practitioners in multi-source data modeling. The details are described in Hu and Zhang (2023) <https://jmlr.org/papers/v24/23-0030.html>.
This package provides a function for estimating the transition probabilities in an illness-death model. The transition probabilities can be estimated from the unsmoothed landmark estimators developed by de Una-Alvarez and Meira-Machado (2015) <doi:10.1111/biom.12288>. Presmoothed estimates can also be obtained through the use of a parametric family of binary regression curves, such as logit, probit or cauchit. The additive logistic regression model and nonparametric regression are also alternatives which have been implemented. The idea behind the presmoothed landmark estimators is to use the presmoothing techniques developed by Cao et al. (2005) <doi:10.1007/s00180-007-0076-6> in the landmark estimation of the transition probabilities.
Penalized and non-penalized maximum likelihood estimation of smooth transition vector autoregressive models with various types of transition weight functions, conditional distributions, and identification methods. Constrained estimation with various types of constraints is available. Residual based model diagnostics, forecasting, simulations, counterfactual analysis, and computation of impulse response functions, generalized impulse response functions, generalized forecast error variance decompositions, as well as historical decompositions. See Heather Anderson, Farshid Vahid (1998) <doi:10.1016/S0304-4076(97)00076-6>, Helmut Lütkepohl, Aleksei Netšunajev (2017) <doi:10.1016/j.jedc.2017.09.001>, Markku Lanne, Savi Virolainen (2025) <doi:10.48550/arXiv.2403.14216>, Savi Virolainen (2025) <doi:10.48550/arXiv.2404.19707>.
An introduction to a couple of novel predictive variable selection methods for generalised boosted regression modeling (gbm). They are based on various variable influence methods, namely relative variable influence (RVI) and knowledge informed RVI (KIRVI and KIRVI2), which adopt ideas similar to AVI, KIAVI and KIAVI2 in the steprf package, and also on predictive accuracy in stepwise algorithms. For details of the variable selection methods, please see: Li, J., Siwabessy, J., Huang, Z. and Nichol, S. (2019) <doi:10.3390/geosciences9040180>. Li, J., Alvarez, B., Siwabessy, J., Tran, M., Huang, Z., Przeslawski, R., Radke, L., Howard, F., Nichol, S. (2017) <doi:10.13140/RG.2.2.27686.22085>.
Computation of stopping boundaries for a single-arm trial using a Bayesian criterion; i.e., for each m<=n (n = total number of patients in the trial), the smallest number of observed toxicities that leads to termination of the trial/accrual according to the specified criteria is calculated. The probabilities of stopping the trial/accrual at and up until (resp.) the m-th patient (m<=n) are also calculated. This design is more conservative than the frequentist approach (using Clopper-Pearson CIs), which might be preferred as it concerns safety. See also Aamot et al. (2010) "Continuous monitoring of toxicity in clinical trials - simulating the risk of stopping prematurely" <doi:10.5414/cpp48476>.
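The boundary computation described above can be illustrated with a minimal Python sketch. It is not the package's interface: the function names, the uniform Beta(1, 1) prior, and the specific criterion (stop once the posterior probability that the toxicity rate exceeds a reference rate `p0` reaches `threshold`) are all assumptions made for illustration.

```python
from math import comb


def posterior_prob_exceeds(k, m, p0):
    """P(p > p0 | k toxicities among m patients), assuming a uniform Beta(1, 1) prior.

    The posterior is Beta(1 + k, 1 + m - k). For integer parameters its upper
    tail equals P(X <= k) with X ~ Binomial(m + 1, p0), so no special
    functions are needed.
    """
    n = m + 1
    return sum(comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(k + 1))


def stopping_boundaries(n, p0, threshold):
    """For each m = 1..n, return the smallest toxicity count that triggers stopping.

    An entry is None when even m toxicities out of m patients do not reach
    the threshold.
    """
    boundaries = []
    for m in range(1, n + 1):
        k_stop = next(
            (k for k in range(m + 1) if posterior_prob_exceeds(k, m, p0) >= threshold),
            None,
        )
        boundaries.append(k_stop)
    return boundaries


# Hypothetical design: 10 patients, reference toxicity rate 0.2, threshold 0.9.
boundaries = stopping_boundaries(n=10, p0=0.2, threshold=0.9)
```

Because an additional patient without toxicity can only lower the posterior probability, the boundaries returned by this sketch never decrease as m grows.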
Supports systematic scrutiny, modification, and integration of data. The function status() counts rows that have missing values in grouping columns (returned by na()), have non-unique combinations of grouping columns (returned by dup()), and that are not locally sorted (returned by unsorted()). Functions enumerate() and itemize() give sorted unique combinations of columns, with or without occurrence counts, respectively. Function ignore() drops columns in x that are present in y, and informative() drops columns in x that are entirely NA; constant() returns values that are constant, given a key. Data that have defined unique combinations of grouping values behave more predictably during merge operations.
Intuitive framework for identifying spatially variable genes (SVGs) and differential spatial variable patterns (DSP) between conditions via edgeR, a popular method for performing differential expression analyses. Based on pre-annotated spatial clusters as summarized spatial information, DESpace models gene expression using a negative binomial (NB) model, via edgeR, with spatial clusters as covariates. SVGs are then identified by testing the significance of spatial clusters. For multi-sample, multi-condition datasets, we again fit a NB model via edgeR, incorporating spatial clusters, conditions and their interactions as covariates. DSP genes, which represent differences in spatial gene expression patterns across experimental conditions, are identified by testing the interaction between spatial clusters and conditions.
We provide tools to estimate two prediction accuracy metrics for risk scores: the average positive predictive value (AP) and the well-known AUC (the area under the receiver operating characteristic curve). The outcome of interest is either binary or a censored event time. Note that for censored event times, the estimates of the AP and the AUC are time-dependent for pre-specified time interval(s). A function that compares the APs of two risk scores/markers is also included. Optional outputs include positive predictive values and true positive fractions at the specified marker cut-off values, and a plot of the time-dependent AP versus time (available for event time data).
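For the binary-outcome case, the two metrics can be sketched in a few lines of Python. This is an illustration only, using the standard empirical estimators and assuming no tied scores when computing the AP. The function names are hypothetical, and the time-dependent estimators for censored outcomes are not covered.

```python
def auc(scores, labels):
    """Empirical AUC: fraction of (case, control) pairs ranked correctly, ties count 1/2."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > d else 0.5 if c == d else 0.0 for c in cases for d in controls
    )
    return wins / (len(cases) * len(controls))


def average_precision(scores, labels):
    """Empirical AP: mean positive predictive value at the cut-offs set by each case."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    cases_seen, ppv_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            cases_seen += 1
            ppv_sum += cases_seen / rank  # PPV among the top `rank` scores
    return ppv_sum / cases_seen


# Toy data: four subjects, two of whom are cases (label 1).
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
toy_ap = average_precision(toy_scores, toy_labels)
```

Unlike the AUC, the AP depends on the event rate, which is one reason the two metrics can rank risk scores differently when events are rare.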
This package provides a specialized tool for assessing contextual bandit algorithms, particularly those aimed at handling overdispersed and zero-inflated count data. It offers a simulated testing environment that includes various models like Poisson, Overdispersed Poisson, Zero-inflated Poisson, and Zero-inflated Overdispersed Poisson. The package is capable of executing five specific algorithms: Linear Thompson sampling with log transformation on the outcome, Thompson sampling Poisson, Thompson sampling Negative Binomial, Thompson sampling Zero-inflated Poisson, and Thompson sampling Zero-inflated Negative Binomial. Additionally, it can generate regret plots to evaluate the performance of contextual bandit algorithms. This package is based on the algorithms by Liu et al. (2023) <arXiv:2311.14359>.
Computations of Fisher's z-tests concerning different kinds of correlation differences. The diffpwr family entails approaches to estimating statistical power via Monte Carlo simulations. Importantly, the Pearson correlation coefficient is sensitive not only to linear association but also to a host of statistical issues such as univariate and bivariate outliers, range restrictions, and heteroscedasticity (e.g., Duncan & Layard, 1973 <doi:10.1093/BIOMET/60.3.551>; Wilcox, 2013 <doi:10.1016/C2010-0-67044-1>). Thus, every power analysis requires that specific statistical prerequisites are fulfilled and can be invalid if the prerequisites do not hold. To this end, the bootcor family provides bootstrapping confidence intervals for the incorporated correlation difference tests.
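As an illustration of the simplest such test, the following Python sketch compares two correlations from independent samples using Fisher's z transformation. The function name is hypothetical and this is not the package's interface. Tests for dependent correlations need additional covariance terms that are not shown here.

```python
from math import atanh, erfc, sqrt


def fisher_z_independent(r1, n1, r2, n2):
    """Fisher's z-test for the difference between two independent Pearson correlations.

    Returns the z statistic and the two-sided p-value under the standard
    normal reference distribution.
    """
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (atanh(r1) - atanh(r2)) / standard_error
    p_value = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical inputs: r = 0.5 versus r = 0.2, each from 103 observations.
z_stat, p_val = fisher_z_independent(0.5, 103, 0.2, 103)
```

The normal reference distribution relies on bivariate normality, which is the kind of prerequisite the description above warns about.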
Doubly censored data, as described in Chang and Yang (1987) <doi:10.1214/aos/1176350608>, are commonly seen in many fields. We use the EM algorithm to compute the non-parametric MLE (NPMLE) of the cumulative probability function/survival function and the two censoring distributions. One can also specify a constraint F(T)=C; the function then returns the constrained NPMLE and the -2 log empirical likelihood ratio for this constraint. This can be used to test the hypothesis about the constraint and, by inverting the test, find confidence intervals for a probability or quantile via the empirical likelihood ratio theorem. Influence functions of hat F may also be calculated, but currently this may be slow.
Simulation and estimation of Exponential Random Graph Models (ERGMs) for small networks using exact statistics as shown in Vega Yon et al. (2020) <DOI:10.1016/j.socnet.2020.07.005>. Unlike the ergm package, ergmito circumvents using the Markov-Chain Maximum Likelihood Estimator (MC-MLE) and instead uses the Maximum Likelihood Estimator (MLE) to fit ERGMs for small networks. Because exhaustive enumeration is computationally feasible for small networks, this R package provides tools for calculating likelihood functions, and other relevant functions, directly, meaning that in many cases both estimation and simulation of ERGMs for small networks can be faster and more accurate than simulation-based algorithms.
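The idea of computing the likelihood by exhaustive enumeration can be sketched in Python for a small undirected network. This is an illustrative assumption-laden sketch, not ergmito's implementation: the function names, the restriction to undirected graphs, and the choice of edge and triangle counts as sufficient statistics are all chosen here for brevity.

```python
from itertools import combinations, product
from math import exp, log


def edges_and_triangles(edges, n_nodes):
    """Sufficient statistics of an undirected graph: (edge count, triangle count)."""
    triangles = sum(
        1
        for a, b, c in combinations(range(n_nodes), 3)
        if {(a, b), (a, c), (b, c)} <= edges
    )
    return (len(edges), triangles)


def exact_loglik(observed_edges, n_nodes, theta, stats=edges_and_triangles):
    """Exact ERGM log-likelihood: theta . s(y_obs) - log sum_y exp(theta . s(y)).

    The normalizing constant is summed over all 2^(n choose 2) undirected
    graphs, which is only feasible for very small n.
    """
    dyads = list(combinations(range(n_nodes), 2))
    normalizer = 0.0
    for mask in product((0, 1), repeat=len(dyads)):
        graph = frozenset(d for d, present in zip(dyads, mask) if present)
        normalizer += exp(sum(t * s for t, s in zip(theta, stats(graph, n_nodes))))
    observed = frozenset(tuple(sorted(e)) for e in observed_edges)
    linear_term = sum(t * s for t, s in zip(theta, stats(observed, n_nodes)))
    return linear_term - log(normalizer)


# Hypothetical 4-node network with two edges; edges-only model (triangle weight 0).
loglik = exact_loglik([(0, 1), (1, 2)], n_nodes=4, theta=(0.5, 0.0))
```

With the triangle weight set to zero the model reduces to independent edges, so the enumerated value can be checked against the closed form 2 * 0.5 - 6 * log(1 + exp(0.5)).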
This package provides a set of simplified functions for creating funnel plots for proportion data. This package supports user-defined benchmarks, confidence limits and estimation methods (i.e., exact or approximate) based on Spiegelhalter (2005) <doi:10.1002/sim.1970>. Additional routines for returning scored unit-level data according to a set of specifications are also implemented for convenience. Specifically, both a categorical and a continuous score variable are added to the sample data frame, identifying which observations are deemed extreme or in control. Typically, such variables are useful as stratifications or covariates in further exploratory analyses. Lastly, the plotting routine returns a base funnel plot ('ggplot2'), which can also be tailored.
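A minimal Python sketch of the approximate (normal-theory) control limits for a proportion is given below. The function name and arguments are hypothetical, and the exact binomial limits are not shown.

```python
from math import sqrt
from statistics import NormalDist


def funnel_limits(benchmark, n, level=0.95):
    """Approximate control limits around a benchmark proportion for a unit of size n.

    Uses the normal approximation benchmark +/- z * sqrt(benchmark * (1 - benchmark) / n),
    truncated to the interval [0, 1].
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(benchmark * (1.0 - benchmark) / n)
    return max(0.0, benchmark - half_width), min(1.0, benchmark + half_width)


# Hypothetical benchmark of 10% evaluated for a unit with 100 observations.
lower, upper = funnel_limits(0.10, 100)
```

Evaluating the limits over a grid of unit sizes traces out the funnel shape: the limits narrow toward the benchmark as n increases.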
Generalized competing event model based on Cox PH model and Fine-Gray model. This function is designed to develop optimized risk-stratification methods for competing risks data, such as described in: 1. Carmona R, Gulaya S, Murphy JD, Rose BS, Wu J, Noticewala S, McHale MT, Yashar CM, Vaida F, and Mell LK (2014) <DOI:10.1016/j.ijrobp.2014.03.047>. 2. Carmona R, Zakeri K, Green G, Hwang L, Gulaya S, Xu B, Verma R, Williamson CW, Triplett DP, Rose BS, Shen H, Vaida F, Murphy JD, and Mell LK (2016) <DOI:10.1200/JCO.2015.65.0739>. 3. Lunn, Mary, and Don McNeil (1995) <DOI:10.2307/2532940>.
This package provides a tool for Hierarchical Climate Regionalization applicable to any correlation-based clustering. It adds several features and a new clustering method (called regional linkage) to hierarchical clustering in R (the 'hclust' function in the 'stats' library): data regridding, coarsening spatial resolution, geographic masking, contiguity-constrained clustering, data filtering by mean and/or variance thresholds, data preprocessing (detrending, standardization, and PCA), faster correlation function with preliminary big data support, different clustering methods, hybrid hierarchical clustering, multivariate clustering (MVC), cluster validation, visualization of regionalization results, and exporting region map and mean timeseries into a NetCDF-4 file. The technical details are described in Badr et al. (2015) <doi:10.1007/s12145-015-0221-7>.
This package provides access to the Idea Data Center (IDC) application for conducting nonresponse bias analysis (NRBA). The IDC NRBA app is an interactive, browser-based Shiny application that can be used to analyze survey data with respect to response rates, representativeness, and nonresponse bias. This app provides a user-friendly interface to statistical methods implemented by the nrba package. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the statistical methods implemented in the application.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the premier technology for profiling genome-wide localization of chromatin-binding proteins, including transcription factors and histones with various modifications. This package provides a robust method for normalizing ChIP-seq signals across individual samples or groups of samples. It also designs a self-contained system of statistical models for calling differential ChIP-seq signals between two or more biological conditions as well as for calling hypervariable ChIP-seq signals across samples. Refer to Tu et al. (2021) <doi:10.1101/gr.262675.120> and Chen et al. (2022) <doi:10.1186/s13059-022-02627-9> for associated statistical details.
Read depth data from genotyping-by-sequencing (GBS) or restriction site-associated DNA sequencing (RAD-seq) are imported and used to make Bayesian probability estimates of genotypes in polyploids or diploids. The genotype probabilities, posterior mean genotypes, or most probable genotypes can then be exported for downstream analysis. polyRAD is described by Clark et al. (2019) <doi:10.1534/g3.118.200913>, and the Hind/He statistic for marker filtering is described by Clark et al. (2022) <doi:10.1186/s12859-022-04635-9>. A variant calling pipeline for highly duplicated genomes is also included and is described by Clark et al. (2020, Version 1) <doi:10.1101/2020.01.11.902890>.
Evaluates moments of ratios (and products) of quadratic forms in normal variables, specifically using recursive algorithms developed by Bao and Kan (2013) <doi:10.1016/j.jmva.2013.03.002> and Hillier et al. (2014) <doi:10.1017/S0266466613000364>. Also provides distribution, quantile, and probability density functions of simple ratios of quadratic forms in normal variables with several algorithms. Originally developed as a supplement to Watanabe (2023) <doi:10.1007/s00285-023-01930-8> for evaluating average evolvability measures in evolutionary quantitative genetics, but can be used for a broader class of statistics. Generating functions for these moments are also closely related to the top-order zonal and invariant polynomials of matrix arguments.
Fits time trend models for routine disease surveillance tasks and returns probability distributions for a variety of quantities of interest, including age-standardized rates, period and cumulative percent change, and measures of health inequality. The models are appropriate for count data such as disease incidence and mortality data, employing a Poisson or binomial likelihood and the first-difference (random-walk) prior for unknown risk. Optionally add a covariance matrix for multiple, correlated time series models. Inference is completed using Markov chain Monte Carlo via the Stan modeling language. References: Donegan, Hughes, and Lee (2022) <doi:10.2196/34589>; Stan Development Team (2021) <https://mc-stan.org>; Theil (1972, ISBN:0-444-10378-3).
This package provides functions that solve initial value problems of a system of first-order ordinary differential equations (ODE), of partial differential equations (PDE), of differential algebraic equations (DAE), and of delay differential equations. The functions provide an interface to the FORTRAN functions lsoda, lsodar, lsode, lsodes of the ODEPACK collection, to the FORTRAN functions dvode and daspk and a C-implementation of solvers of the Runge-Kutta family with fixed or variable time steps. The package contains routines designed for solving ODEs resulting from 1-D, 2-D and 3-D partial differential equations that have been converted to ODEs by numerical differencing.
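To make the fixed-step Runge-Kutta idea concrete, here is a minimal Python sketch of the classical fourth-order scheme for an initial value problem y' = f(t, y). It is a generic textbook illustration with a hypothetical function name, not the package's interface or its C implementation.

```python
def rk4(f, t0, y0, t1, n_steps):
    """Integrate y' = f(t, y) from t0 to t1 with n_steps classical RK4 steps.

    `y0` is a list of state values and `f` returns a list of derivatives
    of the same length. Returns the state at t1.
    """
    h = (t1 - t0) / n_steps
    t, y = t0, list(y0)
    for step in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [
            yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
        ]
        t = t0 + (step + 1) * h  # recompute t to avoid accumulating rounding error
    return y


# Exponential decay y' = -y with y(0) = 1, integrated to t = 1.
decay_at_one = rk4(lambda t, y: [-y[0]], 0.0, [1.0], 1.0, 100)
```

A fixed step size is adequate for smooth, non-stiff problems such as this one. Stiff systems generally call for implicit or adaptive methods such as those in the ODEPACK family.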
This package implements a recent method proposed by Yi and Chen (2023) <doi:10.1177/09622802221146308> to estimate the average treatment effects using noisy data containing both measurement error and spurious variables. The package AteMeVs contains a set of functions that provide a step-by-step estimation procedure, including the correction of the measurement error effects, variable selection for building the model used to estimate the propensity scores, and estimation of the average treatment effects. The functions contain multiple options for users to implement, including different ways to correct for the measurement error effects, distinct choices of penalty functions to do variable selection, and various regression models to characterize propensity scores.