The MARSS package provides maximum-likelihood parameter estimation for constrained and unconstrained linear multivariate autoregressive state-space (MARSS) models, including partially deterministic models. MARSS models are a class of dynamic linear model (DLM) and vector autoregressive (VAR) model. Fitting is available via Expectation-Maximization (EM), BFGS (using optim), and TMB (using the marssTMB companion package). Functions are provided for parametric and innovations bootstrapping, Kalman filtering and smoothing, model selection criteria including bootstrap AICb, confidence intervals via the Hessian approximation or bootstrapping, and all conditional residual types. See the user guide for examples of dynamic factor analysis, dynamic linear models, outlier and shock detection, and multivariate AR-p models. Online workshops (lectures, eBook, and computer labs) are available at <https://atsa-es.github.io/>.
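A minimal fitting sketch, assuming a data matrix with one row per observed time series; the simulated data and the particular model constraints shown are illustrative choices, not part of the package description:

library(MARSS)
set.seed(1)
x <- cumsum(rnorm(50))                                           # shared random-walk state (illustrative)
dat <- rbind(x + rnorm(50, sd = 0.3), x + rnorm(50, sd = 0.3))   # two noisy observation series
fit <- MARSS(dat,
             model = list(Q = "diagonal and equal", R = "diagonal and equal"),
             method = "kem")                                     # EM fit; method = "BFGS" is the optim-based alternative
MARSSparamCIs(fit)                                               # approximate confidence intervals via the Hessian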
Implementations of several methods for principal component analysis using the L1 norm. The package depends on COIN-OR Clp version >= 1.17.4. The methods implemented are PCA-L1 (Kwak 2008) <DOI:10.1109/TPAMI.2008.114>, L1-PCA (Ke and Kanade 2003, 2005) <DOI:10.1109/CVPR.2005.309>, L1-PCA* (Brooks, Dula, and Boone 2013) <DOI:10.1016/j.csda.2012.11.007>, L1-PCAhp (Visentin, Prestwich and Armagan 2016) <DOI:10.1007/978-3-319-46227-1_37>, wPCA (Park and Klabjan 2016) <DOI:10.1109/ICDM.2016.0054>, awPCA (Park and Klabjan 2016) <DOI:10.1109/ICDM.2016.0054>, PCA-Lp (Kwak 2014) <DOI:10.1109/TCYB.2013.2262936>, and SharpEl1-PCA (Brooks and Dula, submitted).
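A brief usage sketch; the function name pcal1(), the projDim argument, and the returned component names are assumptions about the package interface rather than details stated above:

library(pcaL1)
X <- matrix(rnorm(100 * 5), ncol = 5)    # toy data: 100 observations, 5 variables
fit <- pcal1(X, projDim = 2)             # PCA-L1 (Kwak 2008); function and argument names assumed
fit$loadings                             # estimated projection directions (assumed component name)
fit$scores                               # projected observations (assumed component name)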
This package generates a spatial population with different levels of relationship between the dependent and auxiliary variables, along with spatially varying model parameters. The spatial layout is a [0,k-1]x[0,k-1] square region on which observations are collected at (k x k) lattice points with unit distance between any two neighbouring points along the horizontal and vertical axes. For method details, see Chao Liu, Chuanhua Wei, and Yunan Su (2018) <doi:10.1080/10485252.2018.1499907>. The generated spatial population can be used in Geographically Weighted Regression analysis to study spatially varying relationships among variables, and various other statistical analyses can be performed on the generated spatial data.
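A generic illustration of the lattice layout described above, using base R only (it does not call this package's own functions); the particular form of the spatially varying coefficient is an arbitrary assumption for demonstration:

k <- 10
grid <- expand.grid(u = 0:(k - 1), v = 0:(k - 1))      # k x k lattice with unit spacing
beta <- 1 + (grid$u + grid$v) / (2 * (k - 1))          # spatially varying coefficient (illustrative form)
x <- rnorm(k^2)                                        # auxiliary variable
y <- beta * x + rnorm(k^2, sd = 0.5)                   # dependent variable with a spatially varying relationship
pop <- data.frame(grid, x, y, beta)                    # generated spatial population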
Implementation of functions for fitting taper curves (a semiparametric linear mixed effects taper model) to diameter measurements along stems. Further functions are provided to estimate the uncertainty around the predicted curves, to calculate timber volume (also by sections) and marginal (e.g., upper) diameters. For cases where tree heights are not measured, methods for estimating additional variance in volume predictions resulting from uncertainties in tree height models (tariffs) are provided. The example data include the taper curve parameters for Norway spruce used in the 3rd German NFI fitted to 380 trees and a subset of section-wise diameter measurements of these trees. The functions implemented here are detailed in Kublin, E., Breidenbach, J., Kaendler, G. (2013) <doi:10.1007/s10342-013-0715-0>.
The _CAGEr_ package identifies transcription start sites (TSS) and their usage frequency from CAGE (Cap Analysis Gene Expression) sequencing data. It normalises raw CAGE tag counts, clusters TSSs into tag clusters (TC) and aggregates them across multiple CAGE experiments to construct consensus clusters (CC) representing the promoterome. CAGEr provides functions to profile expression levels of these clusters by cumulative expression and rarefaction analysis, and outputs the plots in ggplot2 format for further facetting and customisation. After clustering, CAGEr performs analyses of promoter width and detects differential usage of TSSs (promoter shifting) between samples. CAGEr also exports its data as genome browser tracks, and as R objects for downstream expression analysis by other Bioconductor packages such as DESeq2, CAGEfightR, or seqArchR.
Calculates concentration and dispersion in ordered rating scales. It implements various measures of concentration and dispersion to describe what researchers variably call agreement, concentration, consensus, dispersion, or polarization among respondents in ordered data. It also implements other related measures to classify distributions. In addition to a generic city-block based concentration measure and a generic dispersion measure, the package implements various measures, including van der Eijk's (2001) <DOI:10.1023/A:1010374114305> measure of agreement A, measures of concentration by Leik, Tastle and Wierman, Blair and Lacy, Kvalseth, Berry and Mielke, Reardon, and Garcia-Montalvo and Reynal-Querol. Furthermore, the package provides an implementation of Galtung's AJUS system to classify distributions, as well as a function to identify the position of multiple modes.
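A small sketch of the intended input: the measures are assumed to take a frequency vector over the ordered categories; the ajus() function name is an assumption, agreement() corresponds to van der Eijk's A mentioned above:

library(agrmt)
freq <- c(10, 20, 40, 20, 10)   # frequencies over five ordered categories (e.g. one Likert item)
agreement(freq)                 # van der Eijk's measure of agreement A
ajus(freq)                      # AJUS classification of the distribution (assumed function name)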
This package provides a set of control charts for batch processes based on the VAR model. The package contains implementations of the T2.var and W.var control charts based on VAR model coefficients using the couple vectors theory. At each time instant the VAR coefficients are estimated from a historical in-control dataset and a decision rule is applied for online classification of a new batch. These charts allow efficient online monitoring from the very first time instant. An offline version is available too. To evaluate the charts' performance, the package contains functions to generate batch data for offline and online monitoring. See Danilo Marcondes Filho and Marcio Valk (2020) <doi:10.1016/j.ejor.2019.12.038>.
This package provides a way to apply Distance-Based Common Spatial Patterns (DB-CSP) techniques in different fields, covering both classical Common Spatial Patterns (CSP) and DB-CSP. The method is composed of two phases: applying the DB-CSP algorithm and performing a classification. The main idea behind CSP is to use a linear transform to project the data into a low-dimensional subspace with a projection matrix, in such a way that each row consists of weights for signals. This transformation maximizes the variance of the two-class signal matrices. The dbcsp object is created to compute the projection vectors. For exploratory and descriptive purposes, plot and boxplot functions can be used. Functions train, predict and selectQ are implemented for the classification step.
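A rough workflow sketch: only the train, predict, selectQ, plot, and boxplot names come from the description above; the S4 constructor call, its slot names, and the input format (lists of channel-by-time matrices per class) are assumptions:

library(dbcsp)
x1 <- lapply(1:20, function(i) matrix(rnorm(10 * 50), nrow = 10))   # class 1: 20 trials, 10 channels x 50 samples
x2 <- lapply(1:20, function(i) matrix(rnorm(10 * 50), nrow = 10))   # class 2
obj <- new("dbcsp", X1 = x1, X2 = x2, q = 5)   # assumed S4 constructor and slot names
plot(obj)                                       # exploratory plot of the projected signals
boxplot(obj)                                    # descriptive view of the projections
obj <- train(obj)                               # classification step
selectQ(obj)                                    # choose the number of projection vectors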
Approximate marginal maximum likelihood estimation of multidimensional latent variable models via adaptive quadrature or Laplace approximations to the integrals in the likelihood function, as presented for confirmatory factor analysis models in Jin, S., Noh, M., and Lee, Y. (2018) <doi:10.1080/10705511.2017.1403287>, for item response theory models in Andersson, B., and Xin, T. (2021) <doi:10.3102/1076998620945199>, and for generalized linear latent variable models in Andersson, B., Jin, S., and Zhang, M. (2023) <doi:10.1016/j.csda.2023.107710>. Models implemented include the generalized partial credit model, the graded response model, and generalized linear latent variable models for Poisson, negative-binomial and normal distributions. Supports a combination of binary, ordinal, count and continuous observed variables and multiple group models.
Provides a data-driven estimation framework for the multi-threshold accelerated failure time (MTAFT) model. The MTAFT model features different linear forms in different subdomains, and one of the major challenges is determining the number of threshold effects. The package introduces a data-driven approach that utilizes a Schwarz information criterion, which demonstrates consistency under mild conditions. Additionally, a cross-validation (CV) criterion with an order-preserving sample-splitting scheme is proposed to achieve consistent estimation without the need for additional parameters. The package establishes the asymptotic properties of the parameter estimates and includes an efficient score-type test to examine the existence of threshold effects. The methodologies are supported by numerical experiments and theoretical results, showcasing their reliable performance in finite-sample cases.
This package implements empirical Bayes approaches to genotype polyploids from next generation sequencing data while accounting for allele bias, overdispersion, and sequencing error. The main functions are flexdog() and multidog(), which allow the specification of many different genotype distributions. Also provided are functions to simulate genotypes, rgeno(), and read-counts, rflexdog(), as well as functions to calculate oracle genotyping error rates, oracle_mis(), and correlation with the true genotypes, oracle_cor(). These latter two functions are useful for read depth calculations. Run browseVignettes(package = "updog") in R for example usage. See Gerard et al. (2018) <doi:10.1534/genetics.118.301468> and Gerard and Ferrao (2020) <doi:10.1093/bioinformatics/btz852> for details on the implemented methods.
dinoR tests for significant differences in NOMe-seq footprints between two conditions, using genomic regions of interest (ROI) centered around a landmark, for example a transcription factor (TF) motif. The package takes NOMe-seq data (GCH methylation/protection) in the form of a RangedSummarizedExperiment as input. dinoR can be used to group sequencing fragments into 3 or 5 categories representing characteristic footprints (TF bound, nucleosome bound, open chromatin), and to plot the percentage of fragments in each category in a heatmap, or averaged across different ROI groups, for example those containing a common TF motif. It is designed to compare footprints between two sample groups, using edgeR's quasi-likelihood methods on the total fragment counts per ROI, sample, and footprint category.
Implements integrative analysis methods based on a two-part penalization, which perform dimension reduction and mine the heterogeneity and association of multiple studies with compatible designs. The package provides integrative analysis methods including integrative sparse principal component analysis (Fang et al., 2018), integrative sparse partial least squares (Liang et al., 2021) and integrative sparse canonical correlation analysis, as well as the corresponding individual analysis and meta-analysis versions. References: (1) Fang, K., Fan, X., Zhang, Q., and Ma, S. (2018). Integrative sparse principal component analysis. Journal of Multivariate Analysis, <doi:10.1016/j.jmva.2018.02.002>. (2) Liang, W., Ma, S., Zhang, Q., and Zhu, T. (2021). Integrative sparse partial least squares. Statistics in Medicine, <doi:10.1002/sim.8900>.
Implements a method (MinED) for mining probability distributions using deterministic sampling, proposed by Joseph, Wang, Gu, Lv, and Tuo (2019) <DOI:10.1080/00401706.2018.1552203>. The MinED samples can be used for approximating the target distribution. They can be generated from a density function that is known only up to a proportionality constant, and thus the method might find applications in Bayesian computation. Moreover, the MinED samples are generated with far fewer evaluations of the density function compared to random sampling-based methods such as MCMC, and therefore this method will be especially useful when the unnormalized posterior is expensive or time-consuming to evaluate. This research is supported by a U.S. National Science Foundation grant DMS-1712642.
Bayesian supervised predictive classifiers, hypothesis testing, and parametric estimation under Partition Exchangeability are implemented. The two classifiers provided are the marginal classifier (which assumes the test data are i.i.d.) and a more computationally costly but more accurate simultaneous classifier (which finds a labelling for the entire test dataset at once, using all test data simultaneously to predict each label). The package also provides the Maximum Likelihood Estimation (MLE) of the single underlying parameter of the partition exchangeability generative model, as well as hypothesis testing statistics for equality of this parameter with a single value, an alternative sample, or multiple samples. Functions are included to simulate sequences from the Ewens Sampling Formula as realisations of the Poisson-Dirichlet distribution and to compute their respective probabilities.
This package provides a sensitivity analysis approach for unmeasured confounding in observational data with multiple treatments and a binary outcome. The approach derives the general bias formula and provides adjusted causal effect estimates in response to various assumptions about the degree of unmeasured confounding. Nested multiple imputation is embedded within the Bayesian framework to integrate uncertainty about the sensitivity parameters and sampling variability. Bayesian Additive Regression Trees (BART) are used for outcome modeling. The causal estimands are the conditional average treatment effects (CATE) based on the risk difference. For more details, see the paper: Hu L et al. (2020), A flexible sensitivity analysis approach for unmeasured confounding with multiple treatments and a binary outcome with application to SEER-Medicare lung cancer data, <arXiv:2012.06093>.
Genome-wide studies of translational control are emerging as a tool to study various biological conditions. The output from such analyses is both the mRNA level (e.g. cytosolic mRNA level) and the level of mRNA actively involved in translation (the actively translating mRNA level) for each mRNA. The standard analysis of such data strives towards identifying differential translation between two or more sample classes, i.e., differences in actively translated mRNA levels that are independent of underlying differences in cytosolic mRNA levels. This package allows for such analysis using partial variances and the random variance model. As tens of thousands of mRNAs are analyzed in parallel, the package performs a number of tests to assure that the data set is suitable for such analysis.
This package provides a set of psychometric tools for cognitive diagnosis modeling based on the generalized deterministic inputs, noisy and gate (G-DINA) model by de la Torre (2011) <doi:10.1007/s11336-011-9207-7> and its extensions, including the sequential G-DINA model by Ma and de la Torre (2016) <doi:10.1111/bmsp.12070> for polytomous responses, and the polytomous G-DINA model by Chen and de la Torre <doi:10.1177/0146621613479818> for polytomous attributes. Joint attribute distribution can be independent, saturated, higher-order, loglinear smoothed or structured. Q-matrix validation, item and model fit statistics, model comparison at test and item level and differential item functioning can also be conducted. A graphical user interface is also provided.
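A minimal fitting sketch, assuming a binary response matrix and a Q-matrix; the bundled example object sim10GDINA used here is an assumption about the package's shipped data:

library(GDINA)
dat <- sim10GDINA$simdat   # simulated responses bundled with the package (assumed object name)
Q   <- sim10GDINA$simQ     # corresponding Q-matrix
fit <- GDINA(dat = dat, Q = Q, model = "GDINA")
summary(fit)
Qval(fit)                  # Q-matrix validation
itemfit(fit)               # item-level fit statistics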
This package provides a computational toolbox for recursive partitioning. The core of the package is ctree(), an implementation of conditional inference trees which embed tree-structured regression models into a well defined theory of conditional inference procedures. This non-parametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates. Based on conditional inference trees, cforest() provides an implementation of Breiman's random forests. The function mob() implements an algorithm for recursive partitioning based on parametric models (e.g. linear models, GLMs or survival regression) employing parameter instability tests for split selection. Extensible functionality for visualizing tree-structured regression models is available.
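A brief sketch of the three entry points named above on standard datasets; lmtree() is used here as the linear-model convenience interface to mob()-style partitioning:

library(partykit)
ct <- ctree(Species ~ ., data = iris)                       # conditional inference tree
plot(ct)
cf <- cforest(Species ~ ., data = iris, ntree = 100)        # random forest of conditional inference trees
predict(cf, newdata = iris[1:5, ])
aq <- na.omit(airquality)
lm_tree <- lmtree(Ozone ~ Wind | Temp + Month, data = aq)   # model-based partitioning with linear models in the nodes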
Provides a method based on the EM algorithm to estimate the parameters of a mixture model, the Sigmoid-Normal Model, where the samples come from several normal distributions (also called subgroups) whose means are determined by a covariate Z and coefficient alpha while the variances are homogeneous. The subgroup each item belongs to is determined by covariates X and coefficient eta through a sigmoid link function, which is an extension of the logistic link function. Bootstrap is used to estimate the standard errors of the parameters. When the sample is indeed separable, after removing estimates with abnormal sigma, the estimation of alpha performs quite well. This method was used to explore the subgroup structure of HIV patients and can be applied in other domains where a subgroup structure exists.
Maximum likelihood estimation, random values generation, density computation and other functions for the exponential-Poisson, generalised exponential-Poisson and Poisson-exponential distributions. References include: Rodrigues G. C., Louzada F. and Ramos P. L. (2018). "Poisson-exponential distribution: different methods of estimation". Journal of Applied Statistics, 45(1): 128--144. <doi:10.1080/02664763.2016.1268571>. Louzada F., Ramos, P. L. and Ferreira, H. P. (2020). "Exponential-Poisson distribution: estimation and applications to rainfall and aircraft data with zero occurrence". Communications in Statistics--Simulation and Computation, 49(4): 1024--1043. <doi:10.1080/03610918.2018.1491988>. Barreto-Souza W. and Cribari-Neto F. (2009). "A generalization of the exponential-Poisson distribution". Statistics and Probability Letters, 79(24): 2493--2500. <doi:10.1016/j.spl.2009.09.003>.
NanoString nCounter data are gene expression assays that require no enzymes or amplification protocols and work with fluorescent barcodes (Geiss et al. (2008) <doi:10.1038/nbt1385>). Each barcode is assigned a messenger-RNA/micro-RNA (mRNA/miRNA) which, after bonding with its target, can be counted. As a result, each count of a specific barcode represents the presence of its target mRNA/miRNA. NACHO (NAnoString quality Control dasHbOard) is able to analyse the exported NanoString nCounter data and facilitates the user in performing a quality control. NACHO does this by visualising quality control metrics, expression of control genes, principal components and sample specific size factors in an interactive web application.
The Proteomics Eye ('ProtE') offers a comprehensive and intuitive framework for the univariate analysis of label-free proteomics data. By integrating essential data wrangling and processing steps into a single function, ProtE streamlines pairwise statistical comparisons for categorical variables. It provides quality checks and generates publication-ready visualizations, enabling efficient and robust data analysis. ProtE is compatible with proteomics data outputs from MaxQuant (Cox & Mann, (2008) <doi:10.1038/nbt.1511>), DIA-NN (Demichev et al., (2020) <doi:10.1038/s41592-019-0638-x>), and Proteome Discoverer (Thermo Fisher Scientific, version 2.5). The package leverages ggplot2 for visualization (Wickham, (2016) <doi:10.1007/978-3-319-24277-4>) and limma for statistical analysis (Ritchie et al., (2015) <doi:10.1093/nar/gkv007>).
Makes it easy to build panel data in wide format from raw data delivered by the Panel Study of Income Dynamics (PSID). Downloads data directly from the PSID server using the SAScii package. psidR takes care of merging data from each wave onto a cross-period index file, so that individuals can be followed over time. The user must specify which years they are interested in, and the PSID variable names (e.g. ER21003) for each year (they differ in each year). The package offers helper functions to retrieve variable names from different waves. Different panel data designs and sample subsetting criteria are implemented ("SRC", "SEO", "immigrant" and "latino" samples). More information about the PSID can be obtained at <https://simba.isr.umich.edu/data/data.aspx>.
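A condensed sketch of the wide-panel build, assuming build.panel() is the main entry point and that fam.vars/ind.vars are tables mapping years to the wave-specific PSID variable names; the variable codes and directory below are placeholders, not real PSID codes:

library(psidR)
library(data.table)
fam.vars <- data.table(year = c(2013, 2015), faminc = c("ER00001", "ER00002"))   # placeholder codes
ind.vars <- data.table(year = c(2013, 2015), age    = c("ER00003", "ER00004"))   # placeholder codes
d <- build.panel(datadir = "~/psid_raw",              # folder holding the downloaded raw PSID files
                 fam.vars = fam.vars, ind.vars = ind.vars,
                 sample = "SRC", design = "all")      # subset to the SRC sample, keep all individuals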