Genome-wide studies of translational control are emerging as a tool to study various biological conditions. The output from such analyses is, for each mRNA, both the mRNA level (e.g. the cytosolic mRNA level) and the level of mRNA actively involved in translation (the actively translating mRNA level). The standard analysis of such data strives to identify differential translation between two or more sample classes, i.e., differences in actively translated mRNA levels that are independent of underlying differences in cytosolic mRNA levels. This package allows for such analysis using partial variances and the random variance model. As tens of thousands of mRNAs are analyzed in parallel, the library performs a number of tests to assure that the data set is suitable for such analysis.
This package provides a computational toolbox for recursive partitioning. The core of the package is ctree(), an implementation of conditional inference trees which embed tree-structured regression models into a well-defined theory of conditional inference procedures. This non-parametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates. Based on conditional inference trees, cforest() provides an implementation of Breiman's random forests. The function mob() implements an algorithm for recursive partitioning based on parametric models (e.g. linear models, GLMs or survival regression), employing parameter instability tests for split selection. Extensible functionality for visualizing tree-structured regression models is available.
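As a brief illustration (a minimal sketch, assuming partykit is installed; the iris data ships with R):

    library("partykit")
    # Fit a conditional inference tree to a nominal response
    ct <- ctree(Species ~ ., data = iris)
    plot(ct)  # visualize the tree-structured model
    # Conditional random forest built on the same infrastructure
    cf <- cforest(Species ~ ., data = iris, ntree = 50)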
This package provides a set of psychometric tools for cognitive diagnosis modeling based on the generalized deterministic inputs, noisy and gate (G-DINA) model by de la Torre (2011) <doi:10.1007/s11336-011-9207-7> and its extensions, including the sequential G-DINA model by Ma and de la Torre (2016) <doi:10.1111/bmsp.12070> for polytomous responses, and the polytomous G-DINA model by Chen and de la Torre (2013) <doi:10.1177/0146621613479818> for polytomous attributes. The joint attribute distribution can be independent, saturated, higher-order, loglinear smoothed or structured. Q-matrix validation, item and model fit statistics, model comparison at test and item level, and differential item functioning can also be conducted. A graphical user interface is also provided.
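A minimal sketch of a typical call, assuming the simulated example dataset sim10gdina shipped with the package:

    library("GDINA")
    dat <- sim10gdina$simdat  # item responses (assumed example data)
    Q   <- sim10gdina$simQ    # Q-matrix
    fit <- GDINA(dat = dat, Q = Q, model = "GDINA")
    itemfit(fit)              # item-level fit statistics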
This package provides an EM-algorithm-based method to estimate the parameters of a mixture model, the Sigmoid-Normal model, in which the samples come from several normal distributions (also called subgroups) whose means are determined by the covariate Z and coefficient alpha, while the variances are homogeneous. The subgroup to which each item belongs is determined by the covariates X and coefficient eta through a sigmoid link function, an extension of the logistic link function. The package uses the bootstrap to estimate the standard errors of the parameters. When the sample is indeed separable, after removing estimates with abnormal sigma, the estimate of alpha is quite accurate. The method was developed to explore the subgroup structure of HIV patients, and it can be used in other domains where a subgroup structure exists.
Maximum likelihood estimation, random value generation, density computation and other functions for the exponential-Poisson, generalised exponential-Poisson and Poisson-exponential distributions. References include: Rodrigues G. C., Louzada F. and Ramos P. L. (2018). "Poisson-exponential distribution: different methods of estimation". Journal of Applied Statistics, 45(1): 128--144. <doi:10.1080/02664763.2016.1268571>. Louzada F., Ramos, P. L. and Ferreira, H. P. (2020). "Exponential-Poisson distribution: estimation and applications to rainfall and aircraft data with zero occurrence". Communications in Statistics--Simulation and Computation, 49(4): 1024--1043. <doi:10.1080/03610918.2018.1491988>. Barreto-Souza W. and Cribari-Neto F. (2009). "A generalization of the exponential-Poisson distribution". Statistics and Probability Letters, 79(24): 2493--2500. <doi:10.1016/j.spl.2009.09.003>.
NanoString nCounter data are gene expression assays that require no enzymes or amplification protocols and work with fluorescent barcodes (Geiss et al. (2008) <doi:10.1038/nbt1385>). Each barcode is assigned a messenger RNA or micro RNA (mRNA/miRNA) which, after bonding with its target, can be counted. As a result, each count of a specific barcode represents the presence of its target mRNA/miRNA. NACHO (NAnoString quality Control dasHbOard) can analyse exported NanoString nCounter data and helps the user perform quality control. NACHO does this by visualising quality control metrics, the expression of control genes, principal components and sample-specific size factors in an interactive web application.
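A hedged sketch of the workflow; the directory and sample-sheet paths below are hypothetical:

    library("NACHO")
    nacho <- load_rcc(
      data_directory = "path/to/rcc_files",    # hypothetical directory of exported RCC files
      ssheet_csv = "path/to/samplesheet.csv",  # hypothetical sample sheet
      id_colname = "IDFILE"
    )
    visualise(nacho)  # launches the interactive QC web application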
Makes it easy to build panel data in wide format from Panel Study of Income Dynamics (PSID) raw data. Downloads data directly from the PSID server using the SAScii package. psidR takes care of merging data from each wave onto a cross-period index file, so that individuals can be followed over time. The user must specify the years of interest and the PSID variable names (e.g. ER21003) for each year, as they differ from year to year. The package offers helper functions to retrieve variable names from different waves. Different panel data designs and sample subsetting criteria are implemented ("SRC", "SEO", "immigrant" and "latino" samples). More information about the PSID is available at <https://simba.isr.umich.edu/data/data.aspx>.
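A hedged sketch of the intended use; the variable codes below are hypothetical and must be looked up for the waves of interest:

    library("psidR")
    famvars <- data.frame(year = c(2001, 2003),
                          age  = c("ER17013", "ER21017"))  # hypothetical PSID variable names
    d <- build.panel(datadir = "~/psid", fam.vars = famvars, design = "balanced")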
This package provides a non-parametric framework based on the estimation statistics principle. Its main purpose is to infer the ordering of empirical distributions from different categories, based on the probability of finding a value in one distribution that is greater than the expectation of another distribution. Given a set of ordered pairs of real-category values, the framework can 1) infer the dominance ordering of categories and represent the ordering as a graph; 2) estimate the magnitude of the difference between a pair of categories in the form of mean-difference confidence intervals; and 3) visualize the dominance ordering and the magnitudes of difference between categories. The package is described in Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2020) <doi:10.1016/j.heliyon.2020.e05435>.
Runs the eDITH (environmental DNA Integrating Transport and Hydrology) model, which implements a mass balance of environmental DNA (eDNA) transport at the river network scale, coupled with a species distribution model, to obtain maps of species distribution. eDITH can work with both eDNA concentration data (e.g., obtained via quantitative polymerase chain reaction) and metabarcoding (read count) data. Parameter estimation can be performed via Bayesian techniques (via the BayesianTools package) or optimization algorithms. An interface to the DHARMa package for posterior predictive checks is provided. See Carraro and Altermatt (2024) <doi:10.1111/2041-210X.14317> for a package introduction; Carraro et al. (2018) <doi:10.1073/pnas.1813843115> and Carraro et al. (2020) <doi:10.1038/s41467-020-17337-8> for methodological details.
Full Consistency Method (FUCOM) for multi-criteria decision-making (MCDM), developed by Dragan Pamucar in 2018 (<doi:10.3390/sym10090393>). The goal of the method is to determine the weights of criteria such that the deviation from full consistency is minimized. Users provide a character vector specifying the ranking of each criterion according to its significance, starting from the criterion expected to have the highest weight down to the least significant one. Additionally, users provide a numeric vector specifying the priority values for each criterion. The comparison is made with respect to the first-ranked (most significant) criterion. The function returns the optimized weights for each criterion (summing to 1), the comparative priority (Phi) values, the mathematical transitivity condition (w) value, and the minimum deviation from full consistency (DFC).
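Not the package's API, but a from-scratch illustration of the fully consistent case, where the optimal weights are inversely proportional to the supplied priority values and the DFC is zero:

    criteria <- c("quality", "price", "delivery")        # ranked, most significant first (hypothetical)
    priority <- c(1, 2.5, 4)                             # hypothetical priority values
    phi <- priority[-1] / priority[-length(priority)]    # comparative priorities (Phi)
    w <- (1 / priority) / sum(1 / priority)              # weights summing to 1
    names(w) <- criteria
    round(w, 4)                                          # w1/w2 equals phi[1], satisfying consistency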
Fits a generalized linear density ratio model (GLDRM). A GLDRM is a semiparametric generalized linear model. In contrast to a GLM, which assumes a particular exponential family distribution, the GLDRM uses a semiparametric likelihood to estimate the reference distribution. The reference distribution may be any discrete, continuous, or mixed exponential family distribution. The model parameters, which include both the regression coefficients and the cdf of the unspecified reference distribution, are estimated by maximizing a semiparametric likelihood. Regression coefficients are estimated with no loss of efficiency, i.e. the asymptotic variance is the same as if the true exponential family distribution were known. References: Huang (2014) <doi:10.1080/01621459.2013.824892>; Huang and Rathouz (2012) <doi:10.1093/biomet/asr075>; Rathouz and Gao (2008) <doi:10.1093/biostatistics/kxn030>.
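A minimal sketch, assuming gldrm() accepts a formula and a link name as in its documented interface:

    library("gldrm")
    set.seed(1)
    x <- rnorm(100)
    y <- rexp(100, rate = exp(-0.5 * x))  # mean increases log-linearly in x
    fit <- gldrm(y ~ x, link = "log")     # semiparametric GLM via a density ratio model
    fit$beta                              # estimated coefficients (component name assumed from docs)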
Estimates statistically significant marker combination values within which one immunologically distinctive group (i.e., disease cases) is more associated than another group (i.e., healthy controls), successively, using various combinations (i.e., "gates") of markers to examine features of cells that may differ between groups. For a two-group comparison, the gateR package uses the spatial relative risk function estimated with the sparr package. Details about the sparr package methods can be found in the tutorial: Davies et al. (2018) <doi:10.1002/sim.7577>. Details about kernel density estimation can be found in J. F. Bithell (1990) <doi:10.1002/sim.4780090616>. More information about relative risk functions using kernel density estimation can be found in J. F. Bithell (1991) <doi:10.1002/sim.4780101112>.
An implementation of several machine learning algorithms for multivariate time series. The package includes functions for clustering, classification and outlier detection, among others. It also incorporates a collection of multivariate time series datasets which can be used to analyse the performance of newly proposed algorithms. Some of these datasets are stored in the GitHub data packages 'ueadata1' to 'ueadata8'. To access these data packages, run install.packages(c('ueadata1', 'ueadata2', 'ueadata3', 'ueadata4', 'ueadata5', 'ueadata6', 'ueadata7', 'ueadata8'), repos = 'https://anloor7.github.io/drat/'). The installation takes a couple of minutes, but we strongly encourage users to do it if they want all the datasets of 'mlmts' available. Practitioners from a broad variety of fields could benefit from the general framework provided by 'mlmts'.
A toolbox for Sequential Probability Ratio Tests (SPRT), Wald (1945) <doi:10.2134/agronj1947.00021962003900070011x>. SPRTs are applied to the data during the sampling process, ideally after each observation. At any stage, the test returns a decision to either continue sampling or terminate and accept one of the specified hypotheses. The seq_ttest() function performs one-sample, two-sample, and paired t-tests for testing one- and two-sided hypotheses (Schnuerch & Erdfelder (2019) <doi:10.1037/met0000234>). The seq_anova() function performs a sequential one-way fixed-effects ANOVA (Steinhilber et al. (2023) <doi:10.31234/osf.io/m64ne>). Learn more about the package in the vignettes, browseVignettes(package = "sprtt"), or on the website <https://meikesteinhilber.github.io/sprtt/>.
An extremely efficient toolkit for solving the best subset selection problem <https://www.jmlr.org/papers/v23/21-1060.html>; this package is its R interface. The package implements and generalizes algorithms designed in <doi:10.1073/pnas.2014241117> that exploit a novel sequencing-and-splicing technique to guarantee exact support recovery and a globally optimal solution in polynomial time for linear models. It also supports best subset selection for logistic regression, Poisson regression, the Cox proportional hazards model, Gamma regression, multiple-response regression, multinomial logistic regression, ordinal regression, (sequential) principal component analysis, and robust principal component analysis. Other valuable features, such as the best subset of group selection <doi:10.1287/ijoc.2022.1241> and sure independence screening <doi:10.1111/j.1467-9868.2008.00674.x>, are also provided.
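A minimal sketch for the linear model case, with a sparse true coefficient vector:

    library("abess")
    set.seed(1)
    n <- 100; p <- 20
    x <- matrix(rnorm(n * p), n, p)
    beta <- c(3, -2, 0.5, rep(0, p - 3))  # only the first three predictors matter
    y <- drop(x %*% beta) + rnorm(n)
    fit <- abess(x, y, family = "gaussian")
    extract(fit)$support.vars             # variables selected at the tuned support size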
Co-clustering of the rows and columns of a contingency or binary matrix, or of double binary matrices, with model selection for the number of row and column clusters. Three models are considered: the Poisson latent block model for contingency matrices, the binary latent block model for binary matrices, and a newly developed multiple latent block model for double binary matrices. A new procedure named bikm1 is implemented to investigate the grid of numbers of clusters more efficiently. The model selection criteria studied are the integrated completed likelihood (ICL) and the Bayesian integrated likelihood (BIC). Finally, the co-clustering adjusted Rand index (CARI), which measures agreement between co-clustering partitions, is implemented. Robert Valerie, Vasseur Yann, Brault Vincent (2021) <doi:10.1007/s00357-020-09379-w>.
This package implements the Bayesian Augmented Control (BAC, a.k.a. Bayesian historical data borrowing) method in the clinical trial setting by calling the Just Another Gibbs Sampler ('JAGS') software. In addition, the BACCT package evaluates user-specified decision rules by computing the type-I error/power, or the probability of a correct go/no-go decision at an interim look. The evaluation can be presented numerically or graphically. Users need JAGS 4.0.0 or newer installed due to a compatibility issue with the rjags package. Currently, the package implements the BAC method for binary outcomes only. Support for continuous and survival endpoints will be added in future releases. We would like to thank AbbVie's Statistical Innovation group and Clinical Statistics group for their support in developing the BACCT package.
Identifying important factors from a large number of potentially important factors of a highly nonlinear and computationally expensive black-box model is a difficult problem. Xiao, Joseph, and Ray (2022) <doi:10.1080/00401706.2022.2141897> proposed Maximum One-Factor-at-a-Time (MOFAT) designs for this purpose. A MOFAT design can be viewed as an improvement to the random one-factor-at-a-time (OFAT) design proposed by Morris (1991) <doi:10.1080/00401706.1991.10484804>. The improvement is achieved by exploiting the connection between Morris screening designs and Monte Carlo-based Sobol designs, and by optimizing the design using a space-filling criterion. This work is supported by U.S. National Science Foundation (NSF) grant CMMI-1921646 <https://www.nsf.gov/awardsearch/showAward?AWD_ID=1921646>.
Interactions between different biological entities are crucial for the function of biological systems. In such networks, nodes represent biological elements, such as genes, proteins and microbes, and their interactions are defined by edges, which can be either binary or weighted. The dysregulation of these networks can be associated with different clinical conditions, such as diseases and responses to treatment. However, such variations often occur locally and do not concern the whole network. To capture local variations of such networks, we propose multiplex network differential analysis (MNDA). MNDA quantifies the variation in the local neighborhood of each node (e.g. gene) between two given clinical states, and tests the statistical significance of that variation. Yousefi et al. (2023) <doi:10.1101/2023.01.22.525058>.
Provides new Germany-wide TapeR models and functions for their evaluation. Included are the most common tree species in Germany (Norway spruce, Scots pine, European larch, Douglas fir, Silver fir as well as European beech, Common/Sessile oak and Red oak). Many other species are mapped to these, so that 36 tree species/groups can be processed. Single trees are defined by species code, one or multiple diameters at arbitrary measuring heights, and tree height. The functions then provide information on diameters along the stem, bark thickness, heights of given diameters, volume of the total trunk or parts of it, and total and component above-ground biomass. It is also possible to calculate assortments from the taper curves. Uncertainty information is provided for diameter, volume and component biomass estimates.
MAPFX is an end-to-end toolbox that pre-processes the raw data from MPC experiments (e.g., BioLegend's LEGENDScreen and BD Lyoplates assays), and further imputes the ‘missing’ infinity markers in the wells without those measurements. The pipeline starts by performing background correction on raw intensities to remove the noise from electronic baseline restoration and fluorescence compensation, by adapting a normal-exponential convolution model. Unwanted technical variation, from sources such as well effects, is then removed using a log-normal model with plate, column, and row factors, after which infinity markers are imputed using the informative backbone markers as predictors. The completed dataset can then be used for clustering and other statistical analyses. Additionally, MAPFX can be used to normalise data from FFC assays as well.
There are three main goals to the vctrs package: 1) to propose vec_size() and vec_type() as alternatives to length() and class(); these definitions are paired with a framework for type-coercion and size-recycling; 2) to define type- and size-stability as desirable function properties, use them to analyse existing base functions, and propose better alternatives; this work has been particularly motivated by thinking about the ideal properties of c(), ifelse(), and rbind(); 3) to provide a new vctr base class that makes it easy to create new S3 vectors; vctrs provides methods for many base generics in terms of a few new vctrs generics, making implementation considerably simpler and more robust.
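A minimal sketch of the third goal, using new_vctr() to build a toy percent class:

    library("vctrs")
    # Constructor: store proportions as doubles with a "percent" class
    percent <- function(x = double()) new_vctr(vec_cast(x, double()), class = "percent")
    # Format method used when the vector is printed
    format.percent <- function(x, ...) paste0(format(vec_data(x) * 100), "%")
    p <- percent(c(0.1, 0.25))
    p             # prints as 10%, 25%
    vec_size(p)   # size, as distinct from length()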
This package provides a dynamic time warping (DTW) algorithm for stratigraphic alignment, translated into R from the original published MATLAB code by Hay et al. (2019) <doi:10.1130/G46019.1>. The DTW algorithm incorporates two geologically relevant parameters (g and edge) for augmenting the typical DTW cost matrix, allowing for a range of sedimentologic and chronologic conditions to be explored, as well as the generation of an alignment library (as opposed to a single alignment solution). The g parameter relates to the relative sediment accumulation rate between the two time series records, while the edge parameter relates to the amount of total shared time between the records. Note that this algorithm is used for all DTW alignments in the Align Shiny application, detailed in Hagen et al. (in review).
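For orientation, a generic DTW cost computation in R (not this package's implementation; the g and edge augmentations described above are omitted):

    dtw_cost <- function(a, b) {
      n <- length(a); m <- length(b)
      D <- matrix(Inf, n + 1, m + 1)  # accumulated cost matrix
      D[1, 1] <- 0
      for (i in 1:n) for (j in 1:m) {
        d <- abs(a[i] - b[j])         # local mismatch cost
        D[i + 1, j + 1] <- d + min(D[i, j], D[i, j + 1], D[i + 1, j])
      }
      D[n + 1, m + 1]                 # total alignment cost
    }
    dtw_cost(sin(seq(0, 6, 0.1)), sin(seq(0.5, 6.5, 0.1)))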
This package provides a new method for identifying clusters of genomic regions within chromosomes. Primarily, it is used for calling clusters of cis-regulatory elements (COREs). CREAM uses genome-wide maps of genomic regions in the tissue or cell type of interest, such as those generated from chromatin-based assays including DNaseI, ATAC or ChIP-Seq. CREAM considers the proximity of the elements within chromosomes of a given sample to identify COREs in the following steps: 1) it identifies the window size, i.e. the maximum allowed distance between the elements within each CORE; 2) it identifies the number of elements which should be clustered as a CORE; 3) it calls COREs; 4) it filters out the COREs of lowest order which do not pass the threshold considered in the approach.