This package provides functions to generate and analyze spatially-explicit individual-based multistate movements in rivers and in heterogeneous and homogeneous spaces. This is done by incorporating landscape bias on local behaviour, based on resistance rasters. Although originally conceived and designed to simulate trajectories of species constrained to linear habitats/dendritic ecological networks (e.g. river networks), the simulation algorithm is built to be highly flexible and can be applied to any (aquatic, semi-aquatic or terrestrial) organism, independently of the landscape in which it moves. Thus, the package can be used to simulate movements in homogeneous landscapes, heterogeneous landscapes (e.g. a semi-aquatic animal moving mainly along rivers but also using the matrix), or even highly contrasted landscapes (e.g. fish in a river network). The algorithm and its input parameters are the same in all cases, so that results are comparable. Simulated trajectories can then be used as mechanistic null models (Potts & Lewis 2014, <DOI:10.1098/rspb.2014.0231>) to test a variety of Movement Ecology hypotheses (Nathan et al. 2008, <DOI:10.1073/pnas.0800375105>), including landscape effects (e.g. resources, infrastructures) on animal movement and species site fidelity, or for predictive purposes (e.g. road mortality risk, dispersal/connectivity). The package should be relevant for exploring a broad spectrum of ecological phenomena, such as those at the interface of animal behaviour, management, landscape and movement ecology, disease and invasive species spread, and population dynamics.
In silico experimental evolution offers a cost- and time-effective means to test evolutionary hypotheses. Existing evolutionary simulation tools focus on simulations in a limited experimental framework and tend to report only the results presumed to be of interest by the tool's designer. The R package for Simulated Haploid Asexual Population Evolution ('rSHAPE') addresses these concerns by implementing a robust simulation framework that outputs complete population demographic and genomic information for in silico evolving communities. Allowing more than 60 parameters to be specified, 'rSHAPE' simulates evolution across discrete time-steps for an evolving community of haploid asexual populations with binary-state genomes. These settings reflect the current state of 'rSHAPE'; future steps will be to increase the breadth of evolutionary conditions permitted. At present, most effort was placed into permitting varied growth models to be simulated (such as constant size, exponential growth, and logistic growth) as well as various fitness landscape models to reflect the evolutionary landscape (e.g.: Additive; House of Cards - Stuart Kauffman and Simon Levin (1987) <doi:10.1016/S0022-5193(87)80029-2>; NK - Stuart A. Kauffman and Edward D. Weinberger (1989) <doi:10.1016/S0022-5193(89)80019-0>; Rough Mount Fuji - Neidhart, Johannes and Szendro, Ivan G and Krug, Joachim (2014) <doi:10.1534/genetics.114.167668>). This package includes numerous functions, though users will only need defineSHAPE(), runSHAPE(), shapeExperiment() and summariseExperiment(). All other functions are called by these main functions and are likely only of interest to someone wishing to develop 'rSHAPE'. Simulation results will be stored in files which are exported to the directory referenced by the shape_workDir option (this defaults to tempdir(); do change it by passing a folder path for the workDir argument when calling defineSHAPE() if you plan to make use of your results beyond your current session). 'rSHAPE' will generate numerous replicate simulations for your defined range of experimental parameters. The experiment will be built under the experimental working directory (i.e. referenced by the option shape_workDir set using defineSHAPE()), where individual replicate simulation results will be stored, as well as processed results, which I have included in an effort to facilitate analyses by automating collection and processing of the potentially thousands of files that will be created. On that note, 'rSHAPE' implements a robust and flexible framework with highly detailed output at the cost of computational efficiency and potentially significant disk space (generally gigabytes, but up to terabytes for very large simulation efforts). So, while 'rSHAPE' offers a single framework in which we can simulate evolution and directly compare the impacts of a wide range of parameters, it is not as quick to run as other in silico simulation tools that focus on a single scenario with limited output. There you have it: 'rSHAPE' offers you a less restrictive in silico evolutionary playground than other tools, and I hope you enjoy testing your hypotheses.
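A minimal sketch of this workflow, keeping defaults for all remaining parameters; the working-directory argument name is assumed from the shape_workDir option described above:

```r
library(rSHAPE)

# Define the experiment's parameters; all but the working directory keep their
# defaults (more than 60 options can be set here).  Argument name assumed from
# the shape_workDir option noted above.
defineSHAPE(shape_workDir = "~/rSHAPE_results/")

# Run one replicate simulation with the parameters just defined; use
# shapeExperiment() and summariseExperiment() for full multi-replicate
# experiments and their post-processing.
runSHAPE()
```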
Collective matrix factorization (a.k.a. multi-view or multi-way factorization; Singh, Gordon, (2008) <doi:10.1145/1401890.1401969>) tries to approximate a (potentially very sparse or having many missing values) matrix 'X' as the product of two low-dimensional matrices, optionally aided with secondary information matrices about the rows and/or columns of 'X', which are also factorized using the same latent components. The intended usage is for recommender systems, dimensionality reduction, and missing value imputation. Implements extensions of the original model (Cortes, (2018) <arXiv:1809.00366>) and can produce different factorizations such as the weighted implicit-feedback model (Hu, Koren, Volinsky, (2008) <doi:10.1109/ICDM.2008.22>), the weighted-lambda-regularization model (Zhou, Wilkinson, Schreiber, Pan, (2008) <doi:10.1007/978-3-540-68880-8_32>), or the enhanced model with implicit features (Rendle, Zhang, Koren, (2019) <arXiv:1905.01395>), with or without side information. Can use gradient-based procedures or alternating-least-squares procedures (Koren, Bell, Volinsky, (2009) <doi:10.1109/MC.2009.263>), with either a Cholesky solver, a faster conjugate gradient solver (Takacs, Pilaszy, Tikk, (2011) <doi:10.1145/2043932.2043987>), or a non-negative coordinate descent solver (Franc, Hlavac, Navara, (2005) <doi:10.1007/11556121_50>), providing efficient methods for sparse and dense data, and mixtures thereof. Supports L1 and L2 regularization in the main models, offers alternative most-popular and content-based models, and implements functionality for cold-start recommendations and imputation of 2D data.
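A minimal sketch, assuming this description corresponds to the cmfrec package and its CMF()/imputeX() interface (the matrix sizes and side-information matrix U are illustrative):

```r
library(cmfrec)

# Toy ratings matrix with missing entries plus a user side-information matrix
set.seed(1)
X <- matrix(sample(c(NA, 1:5), 50 * 20, replace = TRUE), nrow = 50)
U <- matrix(rnorm(50 * 4), nrow = 50)

# Collective factorization: X ~ A %*% t(B), with U factorized on the same
# latent components as the rows of X
model <- CMF(X, U = U, k = 5, lambda = 0.1)

# Impute the missing entries of X from the fitted low-rank model
X_hat <- imputeX(model, X)
```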
Linear dynamic panel data modeling based on linear and nonlinear moment conditions as proposed by Holtz-Eakin, Newey, and Rosen (1988) <doi:10.2307/1913103>, Ahn and Schmidt (1995) <doi:10.1016/0304-4076(94)01641-C>, and Arellano and Bover (1995) <doi:10.1016/0304-4076(94)01642-D>. Estimation of the model parameters relies on the Generalized Method of Moments (GMM) and instrumental variables (IV) estimation, numerical optimization (when nonlinear moment conditions are employed) and the computation of closed-form solutions (when estimation is based on linear moment conditions). One-step, two-step and iterated estimation are available. For inference and specification testing, corrected standard errors following Windmeijer (2005) <doi:10.1016/j.jeconom.2004.02.005> and doubly corrected standard errors (Hwang, Kang, Lee, 2021 <doi:10.1016/j.jeconom.2020.09.010>) are available. Additionally, serial correlation tests, tests for overidentification, and Wald tests are provided. Functions for visualizing panel data structures and modeling results obtained from GMM estimation are also available. The plot methods include functions to plot the unbalanced panel structure, coefficient ranges and coefficient paths across GMM iterations (the latter is implemented according to the plot shown in Hansen and Lee, 2021 <doi:10.3982/ECTA16274>). For a more detailed description of the GMM-based functionality, please see Fritsch, Pua, Schnurbus (2021) <doi:10.32614/RJ-2021-035>. For more details on the IV-based estimation routines, see Fritsch, Pua, and Schnurbus (WP, 2024) and Han and Phillips (2010) <doi:10.1017/S026646660909063X>.
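In the usual notation, the model and the two families of moment conditions these references build on can be sketched as follows (the exact sets of conditions used in estimation follow the cited papers):

```latex
% Linear dynamic panel data model with unobserved unit effects \eta_i
y_{it} = \rho\, y_{i,t-1} + x_{it}'\beta + \eta_i + \varepsilon_{it},
\qquad i = 1,\dots,N,\quad t = 2,\dots,T.

% Linear moment conditions in first differences
% (Holtz-Eakin, Newey, Rosen 1988; Arellano, Bover 1995)
E\!\left[\, y_{i,t-s}\, \Delta\varepsilon_{it} \,\right] = 0, \qquad s \ge 2.

% Additional nonlinear moment conditions in the untransformed errors
% u_{it} = \eta_i + \varepsilon_{it} (Ahn, Schmidt 1995)
E\!\left[\, u_{iT}\, \Delta u_{it} \,\right] = 0 \quad \text{for the remaining periods.}
```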
This collection of diverse functions facilitates the efficient treatment and convenient analysis of experimental high-throughput (omics) data. Several functions address advanced object conversions, like manipulating lists of lists or lists of arrays, reorganizing lists into arrays or into separate vectors, merging of multiple entries, etc. Another set of functions provides speed-optimized calculation of the standard deviation (sd), coefficient of variation (CV) or standard error of the mean (SEM) for data in matrices, or of means per line with respect to additional grouping (e.g. n groups of replicates). A group of functions facilitates dealing with non-redundant information, by indexing unique entries, adding counters to redundant ones or eliminating lines with respect to redundancy in a given reference column, etc. Help is provided to identify very closely matching numeric values, to generate (partial) distance matrices for very big data in a memory-efficient manner, or to reduce the complexity of large datasets by combining very close values. Other functions help align a matrix or data.frame to a reference using partial matching, or mine an experimental setup to extract patterns of replicate samples. Since large experimental datasets often need additional filtering, adequate functions are provided. Convenient data normalization is supported in various different modes; parameter estimation via permutations or bootstrap, as well as flexible testing of multiple pair-wise combinations using the framework of limma, is provided, too. Batch reading (or writing) of sets of files and combining the data into arrays is also supported.
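The row-wise summary statistics mentioned above follow the usual definitions (CV = sd/mean, SEM = sd/sqrt(n)); a plain base-R sketch of the quantities being computed, not the package's own speed-optimized functions:

```r
# Toy matrix: 100 features x 6 samples, in 2 groups of 3 replicates
set.seed(1)
m   <- matrix(rnorm(100 * 6, mean = 10), nrow = 100)
grp <- rep(c("A", "B"), each = 3)

rowCV  <- apply(m, 1, function(x) sd(x) / mean(x))          # coefficient of variation
rowSEM <- apply(m, 1, function(x) sd(x) / sqrt(length(x)))  # standard error of the mean

# Means per line with respect to grouping (e.g. n groups of replicates)
grpMeans <- t(apply(m, 1, function(x) tapply(x, grp, mean)))
```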
Optimization of conditional inference trees from the package party for classification and regression. For optimization, the model space is searched for the best tree on the full sample by means of repeated subsampling. Restrictions are allowed so that only trees are accepted which do not include pre-specified uninterpretable split results (cf. Weihs & Buschfeld, 2021a). The function PrInDT() represents the basic resampling loop for 2-class classification (cf. Weihs & Buschfeld, 2021a). The function RePrInDT() (repeated PrInDT()) allows for repeated applications of PrInDT() for different percentages of the observations of the large and the small classes (cf. Weihs & Buschfeld, 2021c). The function NesPrInDT() (nested PrInDT()) allows for an extra layer of subsampling for a specific factor variable (cf. Weihs & Buschfeld, 2021b). The functions PrInDTMulev() and PrInDTMulab() deal with multilevel and multilabel classification. In addition to these PrInDT() variants for classification, the function PrInDTreg() has been developed for regression problems. Finally, the function PostPrInDT() allows for a posterior analysis of the distribution of a specified variable in the terminal nodes of a given tree. References are: -- Weihs, C., Buschfeld, S. (2021a) "Combining Prediction and Interpretation in Decision Trees (PrInDT) - a Linguistic Example" <arXiv:2103.02336>; -- Weihs, C., Buschfeld, S. (2021b) "NesPrInDT: Nested undersampling in PrInDT" <arXiv:2103.14931>; -- Weihs, C., Buschfeld, S. (2021c) "Repeated undersampling in PrInDT (RePrInDT): Variation in undersampling and prediction, and ranking of predictors in ensembles" <arXiv:2108.05129>.
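A minimal usage sketch of the basic resampling loop; the argument names below (data, class-variable name, number of subsamples, class percentages) are hypothetical placeholders for illustration, not the package's documented signature:

```r
library(PrInDT)

# Two-class subset of iris as toy data
mydata <- droplevels(subset(iris, Species != "setosa"))

# Hypothetical call: search for the best interpretable 2-class tree by
# repeated subsampling of the large and small classes.  'datain', 'classname',
# 'N', 'percl' and 'percs' are illustrative argument names only -- consult
# ?PrInDT for the actual interface.
fit <- PrInDT(datain = mydata, classname = "Species",
              N = 99, percl = 0.5, percs = 0.9)
print(fit)
```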
A tetra-allele cross, often referred to as a four-way cross, double cross or four-line cross, is a type of mating design in which every cross is obtained by mating amongst four inbred lines. A tetra-allele cross can be obtained by crossing the resultants of two unrelated single crosses. A common tetra-allele cross involving four inbred lines A, B, C and D can be symbolically represented as (A X B) X (C X D), (A, B, C, D), (A B C D), etc. Tetra-allele crosses can be broadly categorized into Complete Tetra-allele Crosses (CTaC) and Partial Tetra-allele Crosses (PTaC). Rawlings and Cockerham (1962) <doi:10.2307/2527461> first introduced tetra-allele cross hybrids and gave a method for their analysis based on the analysis of single cross hybrids under the assumption of no linkage. The set of all possible four-way matings between several genotypes (individuals, clones, homozygous lines, etc.) leads to a CTaC. If N inbred lines are involved in a CTaC, the total number of crosses is T = N*(N-1)*(N-2)*(N-3)/8. As the number of lines increases, the total number of crosses in the CTaC grows rapidly, so it becomes almost impossible for the investigator to carry out the experiment with limited available resource material. The solution lies in taking a fraction of the CTaC with certain underlying properties, known as a PTaC.
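For example, with N = 10 inbred lines the complete set already contains T = 10*9*8*7/8 = 630 crosses; a one-line check in R:

```r
N <- 10
T_crosses <- N * (N - 1) * (N - 2) * (N - 3) / 8  # crosses in the CTaC
T_crosses                                          # 630
```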
The actfts package provides tools for performing autocorrelation analysis of time series data. It includes functions to compute and visualize the autocorrelation function (ACF) and the partial autocorrelation function (PACF). Additionally, it performs the Dickey-Fuller and Phillips-Perron unit root tests and the KPSS stationarity test to assess the stationarity of time series. Theoretical foundations are based on Box and Cox (1964) <doi:10.1111/j.2517-6161.1964.tb00553.x>, Box and Jenkins (1976) <isbn:978-0-8162-1234-2>, and Box and Pierce (1970) <doi:10.1080/01621459.1970.10481180>. Statistical methods are also drawn from Kolmogorov (1933) <doi:10.1007/BF00993594>, Kwiatkowski et al. (1992) <doi:10.1016/0304-4076(92)90104-Y>, and Ljung and Box (1978) <doi:10.1093/biomet/65.2.297>. The package integrates functions from forecast (Hyndman & Khandakar, 2008) <https://CRAN.R-project.org/package=forecast>, tseries (Trapletti & Hornik, 2020) <https://CRAN.R-project.org/package=tseries>, xts (Ryan & Ulrich, 2020) <https://CRAN.R-project.org/package=xts>, and stats (R Core Team, 2023) <https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html>. Additionally, it provides visualization tools via plotly (Sievert, 2020) <https://CRAN.R-project.org/package=plotly> and reactable (Glaz, 2023) <https://CRAN.R-project.org/package=reactable>. The package also incorporates macroeconomic datasets from the U.S. Bureau of Economic Analysis: Disposable Personal Income (DPI) <https://fred.stlouisfed.org/series/DPI>, Gross Domestic Product (GDP) <https://fred.stlouisfed.org/series/GDP>, and Personal Consumption Expenditures (PCEC) <https://fred.stlouisfed.org/series/PCEC>.
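The underlying diagnostics come from the stats and tseries functions the package builds on; a small stand-alone illustration of those tests (not the actfts interface itself):

```r
library(tseries)

set.seed(1)
x <- cumsum(rnorm(200))  # a random walk, i.e. a non-stationary series

acf(x)                   # autocorrelation function
pacf(x)                  # partial autocorrelation function

adf.test(x)              # (Augmented) Dickey-Fuller unit root test
pp.test(x)               # Phillips-Perron unit root test
kpss.test(x)             # KPSS test (null hypothesis: stationarity)
```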
Knowledge graphs enable to efficiently visualize and gain insights into large-scale data analysis results, as p-values from multiple studies or embedding data matrices. The usual workflow is a user providing a data frame of association studies results and specifying target nodes, e.g. phenotypes, to visualize. The knowledge graph then shows all the features which are significantly associated with the phenotype, with the edges being proportional to the association scores. As the user adds several target nodes and grouping information about the nodes such as biological pathways, the construction of such graphs soon becomes complex. The kgraph package aims to enable users to easily build such knowledge graphs, and provides two main features: first, to enable building a knowledge graph based on a data frame of concepts relationships, be it p-values or cosine similarities; second, to enable determining an appropriate cut-off on cosine similarities from a complete embedding matrix, to enable the building of a knowledge graph directly from an embedding matrix. The kgraph package provides several display, layout and cut-off options, and has already proven useful to researchers to enable them to visualize large sets of p-value associations with various phenotypes, and to quickly be able to visualize embedding results. Two example datasets are provided to demonstrate these behaviors, and several live shiny applications are hosted by the CELEHS laboratory and Parse Health, as the KESER Mental Health application <https://keser-mental-health.parse-health.org/> based on Hong C. (2021) <doi:10.1038/s41746-021-00519-z>.
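The cosine similarities that such a cut-off is applied to can be obtained from an embedding matrix as follows (a plain base-R illustration of the computation, not the kgraph functions themselves):

```r
# Toy embedding matrix: 5 concepts x 10 latent dimensions
set.seed(1)
emb <- matrix(rnorm(5 * 10), nrow = 5,
              dimnames = list(paste0("concept_", 1:5), NULL))

# Pairwise cosine similarities between concepts
norms   <- sqrt(rowSums(emb^2))
cos_sim <- (emb %*% t(emb)) / (norms %o% norms)

# Keep only pairs above a chosen cut-off as candidate graph edges
edges <- which(cos_sim > 0.3 & upper.tri(cos_sim), arr.ind = TRUE)
```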
Implementation of default Bayes factors for testing statistical hypotheses under various statistical models. The package is intended for applied quantitative researchers in the social and behavioral sciences, medical research, and related fields. The Bayes factor tests can be executed for statistical models such as univariate and multivariate normal linear models, correlation analysis, generalized linear models, special cases of linear mixed models, survival models, and relational event models. Parameters that can be tested are location parameters (e.g., group means, regression coefficients), variances (e.g., group variances), and measures of association (e.g., polychoric/polyserial/biserial/tetrachoric/product-moment correlations), among others. The statistical underpinnings are described in O'Hagan (1995) <DOI:10.1111/j.2517-6161.1995.tb02017.x>, De Santis and Spezzaferri (2001) <DOI:10.1016/S0378-3758(00)00240-8>, Mulder and Xin (2022) <DOI:10.1080/00273171.2021.1904809>, Mulder and Gelissen (2019) <DOI:10.1080/02664763.2021.1992360>, Mulder (2016) <DOI:10.1016/j.jmp.2014.09.004>, Mulder and Fox (2019) <DOI:10.1214/18-BA1115>, Mulder and Fox (2013) <DOI:10.1007/s11222-011-9295-3>, Boeing-Messing, van Assen, Hofman, Hoijtink, and Mulder (2017) <DOI:10.1037/met0000116>, Hoijtink, Mulder, van Lissa, and Gu (2018) <DOI:10.1037/met0000201>, Gu, Mulder, and Hoijtink (2018) <DOI:10.1111/bmsp.12110>, Hoijtink, Gu, and Mulder (2018) <DOI:10.1111/bmsp.12145>, and Hoijtink, Gu, Mulder, and Rosseel (2018) <DOI:10.1037/met0000187>. When using the package, please refer to Mulder et al. (2021) <DOI:10.18637/jss.v100.i18> and the relevant methodological papers.
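A short sketch, assuming this describes the BFpack package and its BF() generic (the model and hypotheses are illustrative):

```r
library(BFpack)

# Fit a univariate normal linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Bayes factor test of constrained hypotheses on a regression coefficient;
# competing hypotheses are separated by ';' (the unconstrained complement is
# included by default)
res <- BF(fit, hypothesis = "wt < 0; wt = 0")
summary(res)
```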
The Super Imposition by Translation and Rotation (SITAR) model is a shape-invariant nonlinear mixed-effects model that fits a natural cubic spline mean curve to growth data and aligns individual-specific growth curves to the underlying mean curve via a set of random effects (see Cole, 2010 <doi:10.1093/ije/dyq115> for details). The non-Bayesian version of the SITAR model can be fit using the already available R package 'sitar'. While the 'sitar' package allows modelling of a single outcome only, the bsitar package offers great flexibility in fitting models of varying complexity, including joint modelling of multiple outcomes such as height and weight (multivariate model). Additionally, the bsitar package allows for the simultaneous analysis of an outcome separately for subgroups defined by a factor variable such as gender. This is achieved by fitting separate models for each subgroup (for example, males and females for the gender variable). An advantage of this approach is that posterior draws for each subgroup are part of a single model object, making it possible to compare coefficients across subgroups and test hypotheses. Since the bsitar package is a front-end to the R package 'brms', it offers excellent support for post-processing of posterior draws via various functions that are directly available from the 'brms' package. In addition, the bsitar package includes various customized functions that allow for the visualization of distance (increase in size with age) and velocity (change in growth rate as a function of age), as well as the estimation of growth-spurt parameters such as age at peak growth velocity and peak growth velocity.
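A minimal sketch of a univariate fit, assuming the bsitar() interface with x, y, id, data and df arguments; the data frame growthdata (with age, height and id columns) and the sampler settings are illustrative placeholders:

```r
library(bsitar)

# Hypothetical data frame 'growthdata' with columns age, height and id;
# spline degrees of freedom and MCMC settings are illustrative.
fit <- bsitar(x = age, y = height, id = id, data = growthdata,
              df = 4, chains = 2, cores = 2)

# Post-processing via brms-style methods, e.g. a summary of posterior draws
summary(fit)
```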
Gene lists derived from the results of genomic analyses are rich in biological information. For instance, differentially expressed genes (DEGs) from a microarray or RNA-Seq analysis are related functionally in terms of their response to a treatment or condition. Gene lists can vary in size, up to several thousand genes, depending on the robustness of the perturbations or on how biologically different the conditions are. Systematically assessing the biological relatedness of hundreds to thousands of genes is impractical if it requires manually curating the annotation and function of each gene. Over-representation analysis (ORA) of genes was developed to identify biological themes. Given a Gene Ontology (GO) and an annotation of genes indicating the categories each one fits into, the significance of the over-representation of the genes within the ontological categories is determined by a Fisher's exact test or by modeling according to a hypergeometric distribution. Comparing a small number of enriched biological categories for a few samples is manageable using Venn diagrams or other means of assessing overlaps. However, with hundreds of enriched categories and many samples, the comparisons become laborious. Furthermore, if there are enriched categories shared between samples, trying to represent a common theme across them is highly subjective. goSTAG uses GO subtrees to tag and annotate genes within a set. goSTAG visualizes the similarities between the over-representation of DEGs by clustering the p-values from the enrichment statistical tests and labels clusters with the GO term that has the most paths to the root within the subtree generated from all the GO terms in the cluster.
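The over-representation test described above reduces to a hypergeometric tail probability (equivalently, a one-sided Fisher's exact test); a stand-alone illustration of that calculation, independent of the goSTAG interface:

```r
# Suppose 40 of 500 DEGs fall in a GO category that contains 300 of the
# 20000 annotated genes.  Over-representation p-value:
k <- 40     # DEGs in the category
n <- 500    # DEGs in the list
K <- 300    # annotated genes in the category
N <- 20000  # annotated genes overall

p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Same test via a one-sided Fisher's exact test on the 2x2 table
# (rows: in/out of category; columns: DEG / not DEG)
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2)
fisher.test(tab, alternative = "greater")$p.value
```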
Statistical classification and regression have been popular in various fields and have stayed in the limelight for scientists in those fields. Examples include clinical trials, where the statistical classification of patients is indispensable for predicting the clinical course of diseases. Considering the negative impact of diseases on performing daily tasks, correctly classifying patients based on clinical information is vital in that we need to identify patients at high risk of developing a severe state and arrange medical treatment for them at an opportune moment. Deep learning - a part of artificial intelligence - has gained much attention, and research on it has burgeoned during the past decades: see, e.g., Kazemi and Mirroshandel (2018) <DOI:10.1016/j.artmed.2017.12.001>. It is a veritable technique originally designed for classification, and hence the Buddle package can provide sublime solutions to various challenging classification and regression problems encountered in clinical trials. The Buddle package is based on the back-propagation algorithm - together with various powerful techniques such as batch normalization and dropout - which performs a multi-layer feed-forward neural network: see Krizhevsky et al. (2017) <DOI:10.1145/3065386>, Schmidhuber (2015) <DOI:10.1016/j.neunet.2014.09.003> and LeCun et al. (1998) <DOI:10.1109/5.726791> for more details. This package contains two main functions: TrainBuddle() and FetchBuddle(). TrainBuddle() builds a feed-forward neural network model and trains the model. FetchBuddle() recalls the trained model which is the output of TrainBuddle(), classifies or regresses given data, and makes a final prediction for the data.
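A minimal usage sketch; the argument names below (formula, data, hidden-layer sizes, iterations, new data) are hypothetical placeholders for illustration, not the documented signatures of TrainBuddle() and FetchBuddle():

```r
library(Buddle)

# Hypothetical call: train a feed-forward network for 3-class classification
# of the iris species.  Argument names are illustrative only -- see
# ?TrainBuddle and ?FetchBuddle for the actual interfaces.
model <- TrainBuddle(formula = Species ~ ., data = iris,
                     hiddenlayer = c(10, 10), total.iter = 500)

# Recall the trained model and predict classes for new observations
pred <- FetchBuddle(model, newdata = iris[1:5, -5])
```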
Generates internet memes that optionally include a superimposed inset plot and other atypical features, combining the visual impact of an attention-grabbing meme with graphic results of data analysis. The package differs from related packages that focus on imitating and reproducing standard memes. Some packages do this by interfacing with online meme generators whereas others achieve this natively. This package takes the latter approach. It does not interface with online meme generators or require any authentication with external websites. It reads images directly from local files or via URL and meme generation is done by the package. While this is similar to the meme package available on CRAN, it differs in that the focus is on allowing for non-standard meme layouts and hybrids of memes mixed with graphs. While this package can be used to make basic memes like an online meme generator would produce, it caters primarily to hybrid graph-meme plots where the meme presentation can be seen as a backdrop highlighting foreground graphs of data analysis results. The package also provides support for an arbitrary number of meme text labels with arbitrary size, position and other attributes rather than restricting to the standard top and/or bottom text placement. This is useful for proper aesthetic interleaving of plots of data between meme image backgrounds and overlain text labels. The package offers a selection of templates for graph placement and appearance with respect to the underlying meme. Graph templates also permit additional template-specific customization. Animated gif support is provided but this is optional and functional only if the magick package is installed. magick is not required unless gif functionality is desired.
The package provides functionality for kernel-based analysis of DNA, RNA, and amino acid sequences via SVM-based methods. As core functionality, kebabs implements the following sequence kernels: spectrum kernel, mismatch kernel, gappy pair kernel, and motif kernel. Apart from an efficient implementation of standard position-independent functionality, the kernels are extended in a novel way to take the position of patterns into account for the similarity measure. Because of the flexibility of the kernel formulation, other kernels like the weighted degree kernel or the shifted weighted degree kernel with constant weighting of positions are included as special cases. An annotation-specific variant of the kernels uses annotation information placed along the sequence together with the patterns in the sequence. The package allows for the generation of a kernel matrix or an explicit feature representation in dense or sparse format for all available kernels, which can be used with methods implemented in other R packages. With a focus on SVM-based methods, kebabs provides a framework which simplifies the usage of existing SVM implementations in kernlab, e1071, and LiblineaR. Binary and multi-class classification as well as regression tasks can be handled in a unified way without having to deal with the different functions, parameters, and formats of the selected SVM. As support for choosing hyperparameters, the package provides cross validation (including grouped cross validation), grid search and model selection functions. For easier biological interpretation of the results, the package computes feature weights for all SVMs and prediction profiles, which show the contribution of individual sequence positions to the prediction result and indicate the relevance of sequence sections for the learning result and the underlying biological functions.
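A short sketch of the intended usage (kernel definition plus SVM training through one of the supported back-ends); the toy sequences and parameter values are illustrative:

```r
library(kebabs)
library(Biostrings)

# Toy DNA sequences with binary labels
seqs <- DNAStringSet(c("ACGTACGTACGTAACC", "TTGGCCAATTGGCCAA",
                       "ACGTTTTACGTAACCC", "TTGGAAAATTGGCCTT"))
y <- factor(c("pos", "neg", "pos", "neg"))

# Spectrum kernel of order 3 (position-independent variant)
specK <- spectrumKernel(k = 3)

# Explicit kernel matrix, usable with methods from other packages
km <- getKernelMatrix(specK, seqs)

# SVM training through a supported back-end (here e1071), unified interface
model <- kbsvm(x = seqs, y = y, kernel = specK,
               pkg = "e1071", svm = "C-svc", cost = 10)
pred <- predict(model, seqs)
```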
Can be used for paternity and maternity assignment and outperforms conventional methods where closely related individuals occur in the pool of possible parents. The method compares the genotypes of offspring with any combination of potential parents and scores the number of mismatches of these individuals at bi-allelic genetic markers (e.g. Single Nucleotide Polymorphisms). It elaborates on a prior exclusion method based on the Homozygous Opposite Test (HOT; Huisman 2017 <doi:10.1111/1755-0998.12665>) by introducing the additional exclusion criterion HIPHOP (Homozygous Identical Parents, Heterozygous Offspring are Precluded; Cockburn et al., in revision). Potential parents are excluded if they have more mismatches than can be expected due to genotyping error and mutation; thereby one can identify the true genetic parents and detect situations where one (or both) of the true parents is not sampled. The package hiphop can deal with (a) the case where there is contextual information about the parentage of the mother (i.e. a female has been seen to be involved in reproductive tasks such as nest building), but paternity is unknown (e.g. due to promiscuity), and (b) the case where both parents need to be assigned, because there is no contextual information on which female laid the eggs and which male fertilized them (e.g. a polygynandrous mating system where multiple females and males deposit young in a common nest, or organisms with external fertilisation that breed in aggregations). For details: Cockburn, A., Penalba, J.V., Jaccoud, D., Kilian, A., Brouwer, L., Double, M.C., Margraf, N., Osmond, H.L., van de Pol, M. and Kruuk, L.E.B. (in revision). HIPHOP: improved paternity assignment among close relatives using a simple exclusion method for bi-allelic markers. Molecular Ecology Resources, DOI to be added upon acceptance.
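The two exclusion criteria can be expressed as simple genotype comparisons at bi-allelic loci (genotypes coded as 0, 1 or 2 copies of the reference allele); a plain base-R illustration of the scoring logic described above, not the hiphop package's own functions:

```r
# Genotypes at 6 bi-allelic loci, coded 0/1/2 (copies of the reference allele)
offspring <- c(0, 2, 1, 0, 2, 1)
dam       <- c(0, 2, 1, 2, 0, 0)
sire      <- c(2, 2, 0, 2, 0, 0)

# HOT: offspring and a candidate parent are homozygous for opposite alleles
hot <- sum((offspring == 0 & sire == 2) | (offspring == 2 & sire == 0))

# HIPHOP: both candidate parents are homozygous and identical,
# yet the offspring is heterozygous
hiphop <- sum(dam == sire & dam %in% c(0, 2) & offspring == 1)

c(HOT = hot, HIPHOP = hiphop)
```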
This package performs functional regression, and some related approaches, for intensive longitudinal data (see the book by Walls & Schafer, 2006, Models for Intensive Longitudinal Data, Oxford) when such data are not necessarily observed on an equally spaced grid of times. The approach generally follows the ideas of Goldsmith, Bobb, Crainiceanu, Caffo, and Reich (2011) <DOI:10.1198/jcgs.2010.10007> and the approach taken in their sample code, but with some modifications to make it more feasible to use with long rather than wide, non-rectangular longitudinal datasets with unequal and potentially random measurement times. It also allows easy plotting of the correlation between the smoothed covariate and the outcome as a function of time, which can add additional insight into how to interpret a functional regression. Additionally, it provides several permutation tests for the significance of the functional predictor. The heuristic interpretation of "time" is used to describe the index of the functional predictor, but the same methods can equally be used for another unidimensional continuous index, such as space along a north-south axis. Note that most of the functionality of this package has been superseded by features added after 2016 to the pfr function by Jonathan Gellar, Mathew W. McLean, Jeff Goldsmith, and Fabian Scheipl, in the refund package built by Jeff Goldsmith and co-authors and maintained by Julia Wrobel. The development of the funreg package in 2015 and 2016 was part of a research project supported by Award R03 CA171809-01 from the National Cancer Institute and Award P50 DA010075 from the National Institute on Drug Abuse. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Drug Abuse, the National Cancer Institute, or the National Institutes of Health.
Computes average and TPX power under various BH-FDR-type sequential procedures. All of these procedures involve control of some summary of the distribution of the FDP, i.e. the proportion of discoveries which are false in a given experiment. The most widely known of these, the BH-FDR procedure, controls the FDR, which is the mean of the FDP. A lesser known procedure, due to Lehmann and Romano, controls the FDX, or the probability that the FDP exceeds a user-provided threshold. This is less conservative than FWE control procedures but much more conservative than the BH-FDR procedure. This package and the references supporting it introduce a new procedure for controlling the FDX which we call the BH-FDX procedure. This procedure iteratively identifies, given alpha and a lower threshold delta, an alpha* less than alpha at which BH-FDR guarantees FDX control. It uses asymptotic approximation and is only slightly more conservative than the BH-FDR procedure. Likewise, we can think of the power in multiple testing experiments in terms of a summary of the distribution of the True Positive Proportion (TPP), the proportion of truly non-null tests that are called significant. The package will compute power, sample size or any other missing parameter required for power defined as (i) the mean of the TPP (average power) or (ii) the probability that the TPP exceeds a given value, lambda (TPX power), via asymptotic approximation. All supplied theoretical results are also obtainable via simulation. The suggested approach is to narrow in on a design via the theoretical approaches and then make final adjustments/verify the results by simulation. The theoretical results are described in Izmirlian, G (2020), Statistics and Probability Letters, <doi:10.1016/j.spl.2020.108713>, and an applied paper describing the methodology with a simulation study is in preparation. See citation("pwrFDR").
For QTL mapping, this package comprises several functions designed to execute diverse tasks, such as simulating or analyzing data, calculating significance thresholds, and visualizing QTL mapping results. The single-QTL or multiple-QTL method, which enables the fitting and comparison of various statistical models, is employed to analyze the data for estimating QTL parameters. The models encompass linear regression, permutation tests, normal mixture models, and truncated normal mixture models. The Gaussian stochastic process is utilized to compute significance thresholds for QTL detection on a genetic linkage map within experimental populations. Two types of data, complete genotyping data and selective genotyping data, from various experimental populations, including backcross, F2, recombinant inbred (RI) populations, and advanced intercrossed (AI) populations, are considered in the QTL mapping analysis. For QTL hotspot detection, statistical methods can be developed based on either individual-level data or summarized data. We have proposed a statistical framework capable of handling both individual-level data and summarized QTL data for QTL hotspot detection. Our statistical framework can overcome the underestimation of thresholds resulting from ignoring the correlation structure among traits. Additionally, it can identify different types of hotspots with minimal computational cost during the detection process. Here, we endeavor to furnish the R codes for our QTL mapping and hotspot detection methods, intended for general use in genes, genomics, and genetics studies. The QTL mapping methods for the complete and selective genotyping designs are based on the multiple interval mapping (MIM) model proposed by Kao, C.-H., Z.-B. Zeng and R. D. Teasdale (1999) <doi:10.1534/genetics.103.021642> and H.-I Lee, H.-A. Ho and C.-H. Kao (2014) <doi:10.1534/genetics.114.168385>, respectively. The QTL hotspot detection analysis is based on the method by Wu, P.-Y., M.-H. Yang, and C.-H. Kao (2021) <doi:10.1093/g3journal/jkab056>.
Estimation of interaction (i.e., moderation) effects between latent variables in structural equation models (SEM). The supported methods are: the constrained approach (Algina & Moulder, 2001); the unconstrained approach (Marsh et al., 2004); the residual centering approach (Little et al., 2006); the double centering approach (Lin et al., 2010); the latent moderated structural equations (LMS) approach (Klein & Moosbrugger, 2000); and the quasi-maximum likelihood (QML) approach (Klein & Muthén, 2007) (temporarily unavailable). The constrained, unconstrained, residual centering, and double centering approaches are estimated via lavaan (Rosseel, 2012), whilst the LMS and QML approaches are estimated via modsem itself. Alternatively, the model can be estimated via Mplus (Muthén & Muthén, 1998-2017). References: Algina, J., & Moulder, B. C. (2001). <doi:10.1207/S15328007SEM0801_3>. "A note on estimating the Jöreskog-Yang model for latent variable interaction using LISREL 8.3." Klein, A., & Moosbrugger, H. (2000). <doi:10.1007/BF02296338>. "Maximum likelihood estimation of latent interaction effects with the LMS method." Klein, A. G., & Muthén, B. O. (2007). <doi:10.1080/00273170701710205>. "Quasi-maximum likelihood estimation of structural equation models with multiple interaction and quadratic effects." Lin, G. C., Wen, Z., Marsh, H. W., & Lin, H. S. (2010). <doi:10.1080/10705511.2010.488999>. "Structural equation models of latent interactions: Clarification of orthogonalizing and double-mean-centering strategies." Little, T. D., Bovaird, J. A., & Widaman, K. F. (2006). <doi:10.1207/s15328007sem1304_1>. "On the merits of orthogonalizing powered and product terms: Implications for modeling interactions among latent variables." Marsh, H. W., Wen, Z., & Hau, K. T. (2004). <doi:10.1037/1082-989X.9.3.275>. "Structural equation models of latent interactions: evaluation of alternative estimation strategies and indicator construction." Muthén, L.K. and Muthén, B.O. (1998-2017). "Mplus User's Guide. Eighth Edition." <https://www.statmodel.com/>. Rosseel, Y. (2012). <doi:10.18637/jss.v048.i02>. "'lavaan': An R Package for Structural Equation Modeling."
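A brief sketch of specifying a latent interaction with modsem, using lavaan-style syntax with ':' marking the interaction term; the indicator names are illustrative, and oneInt is assumed to be one of the package's example datasets:

```r
library(modsem)

# Measurement model plus a structural model with a latent interaction (X:Z)
model <- '
  X =~ x1 + x2 + x3
  Z =~ z1 + z2 + z3
  Y =~ y1 + y2 + y3
  Y ~ X + Z + X:Z
'

# Product-indicator approach estimated via lavaan (double centering) ...
fit_dblcent <- modsem(model, data = oneInt, method = "dblcent")

# ... or the distribution-analytic LMS approach estimated by modsem itself
fit_lms <- modsem(model, data = oneInt, method = "lms")
summary(fit_lms)
```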
Forced-choice (FC) response formats have gained increasing popularity and interest for their resistance to faking when well designed (Cao & Drasgow, 2019 <doi:10.1037/apl0000414>). To establish well-designed FC scales, typically each item within a block should measure a different trait and have a similar level of social desirability (Zhang et al., 2020 <doi:10.1177/1094428119836486>). A recent study also suggests the importance of high inter-item agreement of social desirability between items within a block (Pavlov et al., 2021 <doi:10.31234/osf.io/hmnrc>). In addition, FC developers may also need to maximize factor loading differences (Brown & Maydeu-Olivares, 2011 <doi:10.1177/0013164410375112>) or minimize item location differences (Cao & Drasgow, 2019 <doi:10.1037/apl0000414>), depending on the scoring model. Deciding which items should be assigned to the same block, termed item pairing, is thus critical to the quality of an FC test. This pairing process is essentially an optimization process which is currently carried out manually. However, given that multiple objectives often need to be met simultaneously, manual pairing becomes impractical or even infeasible once the number of latent traits and/or the number of items per trait is relatively large. To address these problems, autoFC is developed as a practical tool for facilitating the automatic construction of FC tests (Li et al., 2022 <doi:10.1177/01466216211051726>), essentially exempting users from the burden of manual item pairing and reducing the computational costs and biases induced by simple ranking methods. Given the characteristics of each item (and item responses), FC measures can be constructed either automatically based on user-defined pairing criteria and weights, or based on exact specifications of each block (i.e., a blueprint; see Li et al., 2024 <doi:10.1177/10944281241229784>). Users can also generate simulated responses based on the Thurstonian Item Response Theory model (Brown & Maydeu-Olivares, 2011 <doi:10.1177/0013164410375112>) and predict trait scores of simulated/actual respondents based on an estimated model.
In shotgun proteomics, shared peptides (i.e., peptides that might originate from different proteins sharing homology, or from different proteoforms due to alternative mRNA splicing, post-translational modifications, proteolytic cleavages, and/or allelic variants) represent a major source of ambiguity in protein identifications. The net4pg package allows users to assess and handle the ambiguity of protein identifications. It implements methods for two main applications. First, it allows representing and quantifying the ambiguity of protein identifications by means of graph connected components (CCs). In graph theory, CCs are defined as the largest subgraphs in which any two vertices are connected to each other by a path and not connected to any other vertex in the supergraph. Here, proteins sharing one or more peptides are thus gathered in the same CC (multi-protein CC), while unambiguous protein identifications constitute CCs with a single protein vertex (single-protein CCs). Therefore, the proportion of single-protein CCs and the size of multi-protein CCs can be used to measure the level of ambiguity of protein identifications. The package implements a strategy to efficiently calculate graph connected components on large datasets and allows users to visually inspect them. Second, the net4pg package allows exploiting the increasing availability of matched transcriptomic and proteomic datasets to reduce the ambiguity of protein identifications. More precisely, it implements a transcriptome-based filtering strategy fundamentally consisting of the removal of those proteins whose corresponding transcript is not expressed in the sample-matched transcriptome. The underlying assumption is that, according to the central dogma of biology, there can be no protein without the corresponding transcript. Most importantly, the package allows users to visually inspect the effect of the filtering on protein identifications and to quantify ambiguity before and after filtering by means of graph connected components. As such, it constitutes a reproducible and transparent method to exploit transcriptome information to enhance protein identifications. All methods implemented in the net4pg package are fully described in Fancello and Burger (2022) <doi:10.1186/s13059-022-02701-2>.
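The connected-component representation of shared peptides can be illustrated with a small protein-peptide graph in igraph (a generic illustration of the concept, not the net4pg interface):

```r
library(igraph)

# Protein-to-peptide incidence: pep1 is shared by proteins A and B,
# while pep3 is specific to protein C
edges <- data.frame(protein = c("A", "A", "B", "C"),
                    peptide = c("pep1", "pep2", "pep1", "pep3"))

g  <- graph_from_data_frame(edges, directed = FALSE)
cc <- components(g)

# Proteins A and B fall into the same (multi-protein) connected component
# together with their peptides, while protein C forms a single-protein CC
split(names(cc$membership), cc$membership)
```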
Inflammation can affect many micronutrient biomarkers and can thus lead to incorrect diagnosis of individuals and to over- or under-estimation of the prevalence of deficiency in a population. Biomarkers Reflecting Inflammation and Nutritional Determinants of Anemia (BRINDA) is a multi-agency and multi-country partnership designed to improve the interpretation of nutrient biomarkers in settings of inflammation and to generate context-specific estimates of risk factors for anemia (Suchdev (2016) <doi:10.3945/an.115.010215>). In the past few years, BRINDA published a series of papers providing guidance on how to adjust micronutrient biomarkers - retinol binding protein, serum retinol, serum ferritin, soluble transferrin receptor (sTfR), serum zinc, serum and Red Blood Cell (RBC) folate, and serum B-12 - using the inflammation markers alpha-1-acid glycoprotein (AGP) and/or C-Reactive Protein (CRP): see Namaste (2020) <doi:10.1093/ajcn/nqaa141>, Rohner (2017) <doi:10.3945/ajcn.116.142232>, McDonald (2020) <doi:10.1093/ajcn/nqz304>, and Young (2020) <doi:10.1093/ajcn/nqz303>. The BRINDA inflammation adjustment method mainly focuses on Women of Reproductive Age (WRA) and Preschool-age Children (PSC); however, the general principle of the BRINDA method might apply to other population groups. The BRINDA R package is a user-friendly all-in-one R package that uses a series of functions to implement the BRINDA adjustment method described above. The package first carries out rigorous checks and provides users guidance to correct data or input errors (if they occur) prior to inflammation adjustment. Once no errors are detected, the package implements the BRINDA inflammation adjustment for up to five micronutrient biomarkers, namely retinol binding protein, serum retinol, serum ferritin, sTfR, and serum zinc (when appropriate), using the inflammation indicators AGP and/or CRP for various population groups. Of note, adjustment for serum and RBC folate and serum B-12 is not included in the package, since evidence shows that no adjustment is needed for these micronutrient biomarkers in either the WRA or PSC groups (Young (2020) <doi:10.1093/ajcn/nqz303>).
Sometimes data for analysis are obtained using more convenient or less expensive means yielding "surrogate" variables for what could be obtained more accurately, albeit with less convenience; or less conveniently or at more expense yielding "reference" variables, thought of as being measured without error. Analysis of the surrogate variables measured with error generally yields biased estimates when the objective is to make inference about the reference variables. Often it is thought that ignoring the measurement error in surrogate variables only biases effects toward the null hypothesis, but this need not be the case. Measurement errors may bias parameter estimates either toward or away from the null hypothesis. If one has a data set with surrogate variable data from the full sample, and also reference variable data from a randomly selected subsample, then one can assess the bias introduced by measurement error in parameter estimation and use this information to derive improved estimates based upon all available data. Formulaically, these estimates based upon the reference variables from the validation subsample combined with the surrogate variables from the whole sample can be interpreted as starting with the estimate from the reference variables in the validation subsample and "augmenting" this with additional information from the surrogate variables. This suggests the term "augmented" estimate. The meerva package calculates these augmented estimates in the regression setting when there is a randomly selected subsample with both surrogate and reference variables. Measurement errors may be differential or non-differential, in any or all predictors (simultaneously) as well as in the outcome. The augmented estimates derive, in part, from the multivariate correlation between regression model parameter estimates from the reference variables and from the surrogate variables, both from the validation subset. Because the validation subsample is chosen at random, any biases imposed by measurement error, whether non-differential or differential, are reflected in this correlation, and these correlations can be used to derive estimates for the reference variables using data from the whole sample. The main functions in the package are meerva.fit, which calculates estimates for a dataset, and meerva.sim.block, which simulates multiple datasets as described by the user, analyzes these datasets, and stores the regression coefficient estimates for inspection. The augmented estimates, as well as how measurement error may arise in practice, are described in more detail by Kremers WK (2021) <arXiv:2106.14063>, which is an extension of the works by Chen Y-H, Chen H. (2000) <doi:10.1111/1467-9868.00243>, Chen Y-H. (2002) <doi:10.1111/1467-9868.00324>, Wang X, Wang Q (2015) <doi:10.1016/j.jmva.2015.05.017> and Tong J, Huang J, Chubak J, et al. (2020) <doi:10.1093/jamia/ocz180>.
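A usage sketch of the main fitting function; the argument names below (reference and surrogate outcomes/predictors for the validation subsample, surrogates for the remainder) are hypothetical placeholders, not the documented signature of meerva.fit:

```r
library(meerva)

set.seed(1)
n <- 1000; n_val <- 200                   # full sample and validation subsample
x  <- rnorm(n)                            # reference predictor
xs <- x + rnorm(n, sd = 0.5)              # surrogate predictor measured with error
y  <- rbinom(n, 1, plogis(0.5 * x))       # reference outcome
ys <- ifelse(runif(n) < 0.95, y, 1 - y)   # surrogate outcome with misclassification
val <- sample(n, n_val)                   # randomly selected validation subsample

# Hypothetical call -- argument names are illustrative only (see ?meerva.fit):
# reference data from the validation subsample, surrogate data from everyone.
fit <- meerva.fit(x_val  = cbind(x1 = x[val]),   y_val  = y[val],
                  xs_val = cbind(x1 = xs[val]),  ys_val = ys[val],
                  xs_non = cbind(x1 = xs[-val]), ys_non = ys[-val])
summary(fit)
```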