Top-Down mass spectrometry aims to identify entire proteins as well as their (post-translational) modifications or bound ions (e.g., Chen et al. (2018) <doi:10.1021/acs.analchem.7b04747>). The pattern of internal fragments (Haverland et al. (2017) <doi:10.1007/s13361-017-1635-x>) may reveal important information about the original structure of the proteins studied (Skinner et al. (2018) <doi:10.1038/nchembio.2515> and Li et al. (2018) <doi:10.1038/nchem.2908>). However, the number of possible internal fragments grows rapidly with protein length, and the subsequent identification of internal fragments remains challenging, in particular since the accuracy of measurements with current mass spectrometers represents a limiting factor. This package attempts to deal with the complexity of internal fragments and allows identification of terminal and internal fragments from deconvoluted mass-spectrometry data.
Multi-state models are essential tools in longitudinal data analysis. One primary goal of these models is the estimation of transition probabilities, a critical metric for predicting clinical prognosis across various stages of diseases or medical conditions. Traditionally, inference in multi-state models relies on the Aalen-Johansen (AJ) estimator, which is consistent under the Markov assumption. However, in many practical applications, the Markovian nature of the process is not guaranteed, limiting the applicability of the AJ estimator in more complex scenarios. This package extends the landmark Aalen-Johansen estimator (Putter, H., Spitoni, C. (2018) <doi:10.1177/0962280216674497>) by incorporating presmoothing techniques described by Soutinho, Meira-Machado and Oliveira (2020) <doi:10.1080/03610918.2020.1762895>, offering a robust alternative for estimating transition probabilities in non-Markovian multi-state models with multiple states and potentially reversible transitions.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. This is another exporter for Org mode that translates an Org mode file into a nicely formatted PDF file. Example Org file header:
#+title: Readme ox-notes
#+author: Matthias David
#+options: toc:nil
#+ou: Zoom
#+quand: 20/2/2021
#+projet: ox-minutes
#+absent: C. Robert, T. Tartanpion
#+present: K. Soulet, I. Payet
#+excuse: Sophie Fonsec, Karine Soulet
#+logo: logo.png
In self-reported or anonymised data the user often encounters heaped data, i.e., data which are rounded (to a possibly different degree of coarseness). While this is mostly a minor problem in parametric density estimation, the bias can be very large for non-parametric methods such as kernel density estimation. This package implements a partly Bayesian algorithm that treats the true unknown values as additional parameters and estimates the rounding parameters to give a corrected kernel density estimate. It supports various standard bandwidth selection methods. Varying rounding probabilities (depending on the true value) and asymmetric rounding are estimable as well: Gross, M. and Rendtel, U. (2016) <doi:10.1093/jssam/smw011>. Additionally, bivariate non-parametric density estimation for rounded data, Gross, M. et al. (2016) <doi:10.1111/rssa.12179>, as well as data aggregated on areas, is supported.
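A minimal base-R illustration of the heaping problem the package addresses (the data and rounding grid here are made up for demonstration): rounding the observations creates spikes that bias a naive kernel density estimate.

    set.seed(1)
    x       <- rnorm(500, mean = 50, sd = 10)  # true, unobserved values
    xheaped <- round(x / 5) * 5                # reported values, heaped to multiples of 5
    plot(density(x), main = "Naive KDE: true vs. heaped data")
    lines(density(xheaped), lty = 2)           # spiky, biased estimate of the same density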
Lag-sequential analysis is a method of assessing patterns (what tends to follow what?) in sequences of codes. The codes are typically for discrete behaviors or states. The functions in this package read a stream of codes, or a frequency transition matrix, and produce a variety of lag-sequential statistics, including transitional frequencies, expected transitional frequencies, transitional probabilities, z values, adjusted residuals, Yule's Q values, likelihood ratio tests of stationarity across time and homogeneity across groups or segments, transformed kappas for unidirectional dependence, bidirectional dependence, parallel and nonparallel dominance, and significance levels based on both parametric and randomization tests. The methods are described in Bakeman & Quera (2011) <doi:10.1017/CBO9781139017343>, O'Connor (1999) <doi:10.3758/BF03200753>, Wampold & Margolin (1982) <doi:10.1037/0033-2909.92.3.755>, and Wampold (1995, ISBN:0-89391-919-5).
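As an illustration of the basic quantities involved (plain base R, not this package's interface), a lag-1 transition frequency matrix and the corresponding transitional probabilities can be computed from a stream of codes as follows:

    codes <- c("A", "B", "A", "A", "C", "B", "A", "C", "A", "B")
    freq  <- table(head(codes, -1), tail(codes, -1))  # transitional frequencies (lag 1)
    prop.table(freq, 1)                               # transitional probabilities by row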
This package provides a unified method, called the M statistic, for detecting phylogenetic signals in continuous traits, discrete traits, and multi-trait combinations. Blomberg and Garland (2002) <doi:10.1046/j.1420-9101.2002.00472.x> provided a widely accepted statistical definition of the phylogenetic signal: the "tendency for related species to resemble each other more than they resemble species drawn at random from the tree". The M statistic strictly adheres to this definition, formulating an index and a testing procedure directly from it instead of relying on correlation analysis or evolutionary models. The method expresses the textual definition of the phylogenetic signal equivalently as an inequality relating phylogenetic and trait distances, from which the M statistic is constructed. More distance-based methods are under development.
The document converter pandoc <https://pandoc.org/> is widely used in the R community. One feature of pandoc is that it can produce and consume JSON-formatted abstract syntax trees (ASTs). This makes it possible to transform a given source document into a JSON-formatted AST, alter it via so-called filters, and pass the altered JSON-formatted AST back to pandoc. This package provides functions which allow users to write such filters in native R code. Although this package is inspired by the Python package pandocfilters <https://github.com/jgm/pandocfilters/>, it provides additional convenience functions which make it simple to use pandocfilters as a report generator. Since pandocfilters inherits most of its functionality from pandoc, it can create documents in many formats (for more information see <https://pandoc.org/>) but is also bound to the same limitations as pandoc.
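A sketch of such a filter, upper-casing every plain string in the AST. The callback signature (key, value, ...) mirrors the Python pandocfilters design; treating it as this package's exact interface is an assumption.

    library(pandocfilters)
    caps <- function(key, value, ...) {
      if (key == "Str") Str(toupper(value))  # replace Str nodes; NULL keeps all others
    }
    # ast <- astrapply(ast, caps)  # apply to a JSON-formatted AST obtained from pandoc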
Allows the detection of spatial clusters of abnormal values in multivariate or functional data, implementing methods from Martin Kulldorff, Lan Huang and Kevin Konty (2009) <doi:10.1186/1476-072X-8-58>, Inkyung Jung and Ho Jin Cho (2015) <doi:10.1186/s12942-015-0024-6>, Lionel Cucala, Michael Genin, Caroline Lanier and Florent Occelli (2017) <doi:10.1016/j.spasta.2017.06.001>, Lionel Cucala, Michael Genin, Florent Occelli and Julien Soula (2019) <doi:10.1016/j.spasta.2018.10.002>, Camille Frevent, Mohamed-Salem Ahmed, Matthieu Marbac and Michael Genin (2021) <doi:10.1016/j.spasta.2021.100550>, Zaineb Smida, Lionel Cucala, Ali Gannoun and Ghislain Durif (2022) <doi:10.1016/j.csda.2021.107378>, and Camille Frevent, Mohamed-Salem Ahmed, Sophie Dabo-Niang and Michael Genin (2023) <doi:10.1093/jrsssc/qlad017>.
An extensive set of data (pre-)processing and analysis methods and tools for metabolomics and other omics, with a strong emphasis on statistics and machine learning. This toolbox allows the user to build extensive and standardised workflows for data analysis. The methods and tools have been implemented using class-based templates provided by the struct (Statistics in R Using Class-based Templates) package. The toolbox includes pre-processing methods (e.g. signal drift and batch correction, normalisation, missing value imputation and scaling), univariate statistical methods (e.g. the t-test, various forms of ANOVA, the Kruskal–Wallis test and more), and multivariate statistical methods (e.g. PCA and PLS, including cross-validation and permutation testing) as well as machine learning methods (e.g. Support Vector Machines). The STATistics Ontology (STATO) has been integrated and implemented to provide standardised definitions for the different methods, inputs and outputs.
Estimate a suite of normalizing transformations, including a new adaptation of a technique based on ranks which can guarantee normally distributed transformed data if there are no ties: ordered quantile normalization (ORQ). ORQ normalization combines a rank-mapping approach with a shifted logit approximation that allows the transformation to work on data outside the original domain. It is also able to handle new data within the original domain via linear interpolation. The package is built to estimate the best normalizing transformation for a vector consistently and accurately. It implements the Box-Cox transformation, the Yeo-Johnson transformation, three types of Lambert WxF transformations, and the ordered quantile normalization transformation. It estimates the normalization efficacy of other commonly used transformations, and it allows users to specify custom transformations or normalization statistics. Finally, functionality can be integrated into a machine learning workflow via recipes.
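A short usage sketch (the input vector here is made up for illustration): bestNormalize() compares the candidate transformations and returns the one with the best normalization statistic, and predict() applies or inverts it.

    library(bestNormalize)
    x  <- rgamma(100, shape = 1)       # skewed example data
    bn <- bestNormalize(x)             # estimate and compare transformations
    xt <- predict(bn)                  # transformed (approximately normal) values
    x0 <- predict(bn, newdata = xt, inverse = TRUE)  # back to the original scale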
Includes bases for litholog generation: graphical functions based on R base graphics, interval management functions, and SVG import functions, among others. Also includes stereographic projection functions and other functions made to deal with large datasets while keeping options to get into the details of the data. When using for publication please cite Sebastien Wouters, Anne-Christine Da Silva, Frederic Boulvain and Xavier Devleeschouwer, 2021. The R Journal 13:2, 153-178. The palaeomagnetism functions are based on: Tauxe, L., 2010. Essentials of Paleomagnetism. University of California Press. <https://earthref.org/MagIC/books/Tauxe/Essentials/>; Allmendinger, R. W., Cardozo, N. C., and Fisher, D., 2013, Structural Geology Algorithms: Vectors & Tensors: Cambridge, England, Cambridge University Press, 289 pp.; Cardozo, N., and Allmendinger, R. W., 2013, Spherical projections with OSXStereonet: Computers & Geosciences, v. 51, no. 0, p. 193-205, <doi:10.1016/j.cageo.2012.07.021>.
doubletrouble aims to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. The duplication modes are: i. segmental duplication (SD); ii. tandem duplication (TD); iii. proximal duplication (PD); iv. transposed duplication (TRD); and v. dispersed duplication (DD). Transposon-derived duplicates (TRD) can be further subdivided into rTRD (retrotransposon-derived duplication) and dTRD (DNA transposon-derived duplication). If users want a simpler classification scheme, duplicates can also be classified into SD- and SSD-derived (small-scale duplication) gene pairs. Besides classifying gene pairs, users can also classify genes, so that each gene is assigned a unique mode of duplication. Users can also calculate substitution rates per substitution site (i.e., Ka and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian mixture models (GMMs), and classify gene pairs into age groups based on Ks peaks.
Some tools to assist with converting International Organization for Standardization (ISO) standard 11784 (ISO11784) animal ID codes between four recognised formats commonly displayed on Passive Integrated Transponder (PIT) tag readers. The most common formats are the 15-digit decimal format, e.g., 999123456789012, and the 13-character hexadecimal dot format, e.g., 3E7.1CBE991A14; these are referred to in this package as isodecimal and isodothex. The other two formats are raw hexadecimal representations of the ISO11784 binary structure (see <https://en.wikipedia.org/wiki/ISO_11784_and_ISO_11785>). There are two flavours of this format, a left and a right variation; which flavour a reader outputs depends on whether its developers reversed the binary number before converting to hexadecimal, a choice stemming from the fact that PIT tags transmit their binary code Least Significant Bit (LSB) first, i.e., backwards.
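The decimal-to-dot-hex conversion itself is simple arithmetic: the first 3 digits (the country/manufacturer code) and the remaining 12 digits (the national ID) are each converted to hexadecimal and joined with a dot. A minimal sketch reproducing the example above (the helper names are made up and this is not the package's interface):

    to_hex <- function(x) {             # base-16 conversion for numbers beyond 2^31
      out <- ""
      repeat {
        out <- paste0(sprintf("%X", x %% 16), out)
        x <- x %/% 16
        if (x == 0) break
      }
      out
    }
    pad <- function(s, w) paste0(strrep("0", w - nchar(s)), s)
    isodecimal_to_isodothex <- function(id) {
      country <- as.numeric(substr(id, 1, 3))   # 3-digit country/manufacturer code
      animal  <- as.numeric(substr(id, 4, 15))  # 12-digit national ID
      paste0(pad(to_hex(country), 3), ".", pad(to_hex(animal), 10))
    }
    isodecimal_to_isodothex("999123456789012")  # "3E7.1CBE991A14"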
This package provides functions to find edges for bibliometric networks such as bibliographic coupling networks, co-citation networks and co-authorship networks. The weights of network edges can be calculated according to different methods, depending on the type of network, the type of nodes, and what you want to analyse. These functions are optimized to be used on large datasets. The package contains functions inspired by: Leydesdorff, Loet and Park, Han Woo (2017) <doi:10.1016/j.joi.2016.11.007>; Perianes-Rodriguez, Antonio, Ludo Waltman, and Nees Jan Van Eck (2016) <doi:10.1016/j.joi.2016.10.006>; Sen, Subir K. and Shymal K. Gan (1983) <http://nopr.niscair.res.in/handle/123456789/28008>; Shen, Si, Zhu, Danhao, Rousseau, Ronald, Su, Xinning and Wang, Dongbo (2019) <doi:10.1016/j.joi.2019.01.012>; Zhao, Dangzhi and Strotmann, Andreas (2008) <doi:10.1002/meet.2008.1450450292>.
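For instance, bibliographic coupling weights can be derived from a document-by-reference incidence matrix, where two documents are coupled by the number of references they share. A toy base-R illustration (not this package's functions):

    m <- matrix(c(1, 1, 0,
                  1, 0, 1,
                  0, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc", 1:3), paste0("ref", 1:3)))
    coupling <- m %*% t(m)  # off-diagonal entries count shared references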
The Genetic Algorithm (GA) is used to perform changepoint analysis in time series data. The package also includes an extended island version of GA, as described in Lu, Lund, and Lee (2010, <doi:10.1214/09-AOAS289>). By mimicking the principles of natural selection and evolution, GA provides a powerful stochastic search technique for solving combinatorial optimization problems. In changepointGA, each chromosome represents a changepoint configuration, including the number and locations of changepoints, hyperparameters, and model parameters. The package employs genetic operators (selection, crossover, and mutation) to iteratively improve solutions based on the given fitness (objective) function. Key features of changepointGA include encoding changepoint configurations in an integer format, enabling dynamic and simultaneous estimation of model hyperparameters, changepoint configurations, and associated parameters. The detailed algorithmic implementation can be found in the package vignettes and in the paper of Li (2024, <doi:10.48550/arXiv.2410.15571>).
Cancer cells accumulate DNA mutations as a result of DNA damage and DNA repair processes. This computational framework is aimed at deciphering DNA mutational signatures operating in cancer. The framework includes modules that support raw data import and processing, mutational signature extraction, and results interpretation and visualization. The framework accepts widely used file formats storing information about DNA variants, such as Variant Call Format files. The framework performs Non-Negative Matrix Factorization to extract mutational signatures explaining the observed set of DNA mutations, with bootstrapping performed as part of the analysis. The framework supports parallelization and is optimized for use on multi-core systems. The software was described by Fantini D et al (2020) <doi:10.1038/s41598-020-75062-0> and is based on a custom R-based implementation of the original MATLAB WTSI framework by Alexandrov LB et al (2013) <doi:10.1016/j.celrep.2012.12.008>.
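The kind of factorization involved can be sketched in a few lines of base R using the classic multiplicative updates of Lee and Seung (this toy code illustrates the technique only and is not this framework's implementation):

    set.seed(1)
    V <- matrix(runif(96 * 10), 96, 10)  # mutation catalog: 96 mutation types x 10 samples
    k <- 3                               # number of signatures to extract
    W <- matrix(runif(96 * k), 96, k)    # signatures (to be estimated)
    H <- matrix(runif(k * 10), k, 10)    # exposures  (to be estimated)
    for (i in 1:200) {                   # multiplicative updates minimizing ||V - WH||
      H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
      W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
    }
    # columns of W approximate signatures; rows of H their exposures per sample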
Quantifies systematic heterogeneity in meta-analysis using R. The M statistic aggregates heterogeneity information across multiple variants to identify systematic heterogeneity patterns and their direction of effect in meta-analysis. Its primary use is to identify outlier studies, which either show "null" effects or consistently show stronger or weaker genetic effects than average across the panel of variants examined in a GWAS meta-analysis. In contrast to conventional heterogeneity metrics (Q-statistic, I-squared and tau-squared), which measure random heterogeneity at individual variants, M measures systematic (non-random) heterogeneity across multiple independently associated variants. Systematic heterogeneity can arise in a meta-analysis due to differences in the characteristics of participating studies, including ancestry, allele frequencies, phenotype definition, age of disease onset, family history, gender, linkage disequilibrium and quality control thresholds. See <https://magosil86.github.io/getmstatistic/> for statistical theory, documentation and examples.
Pool dilution is an isotope tracer technique wherein a biogeochemical pool is artificially enriched with its heavy isotopologue, and the gross productive and consumptive fluxes of that pool are quantified from the change in pool size and isotopic composition over time. This package calculates gross production and consumption rates from closed-system isotopic pool dilution time series data. Pool size concentrations and heavy isotope (e.g., 15N) content are measured over time, and the model optimizes the production rate (P) and the first-order rate constant (k) by minimizing error in the model-predicted total pool size as well as the isotopic signature. The model weights information by the signal:noise ratio of the concentration and heavy-isotope signatures, using measurement precision as well as the magnitude of change over time. The calculations used here are based on von Fischer and Hedin (2002) <doi:10.1029/2001GB001448> with some modifications.
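The closed-system model this implies is dC/dt = P - k*C, whose solution is C(t) = P/k + (C0 - P/k) * exp(-k*t); the optimizer chooses P and k so that this curve (and the corresponding isotope signature) matches the measurements. A sketch of the predicted pool size (parameter values are arbitrary, and this is not the package's interface):

    pool_size <- function(t, C0, P, k) P / k + (C0 - P / k) * exp(-k * t)
    t <- 0:48  # hypothetical time points, e.g. hours
    plot(t, pool_size(t, C0 = 10, P = 0.5, k = 0.1), type = "l",
         xlab = "time", ylab = "predicted pool size")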
Fit and apply ComBat, linear mixed-effects models (LMM), or prescaling to harmonize magnetic resonance imaging (MRI) data from different sites. Briefly, these methods remove differences between sites due to the use of different scanning devices, and LMM additionally tests linear hypotheses. As detailed in the manual, the original ComBat function was first modified for the harmonization of MRI data (Fortin et al. (2017) <doi:10.1016/j.neuroimage.2017.11.024>) and then modified again to create separate functions for fitting and applying the harmonization and to allow missing values and constant rows for its use within the Enhancing Neuro Imaging Genetics through Meta-Analysis (ENIGMA) Consortium (Radua et al. (2020) <doi:10.1016/j.neuroimage.2020.116956>); this package includes the latter version. The LMM approach calls "lme" massively (i.e., across imaging measures), taking specific brain-imaging details into account. Finally, prescaling is a good option for fMRI, where different devices can have varying units of measurement.
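A workflow sketch of the fit/apply split described above; the function and argument names below are hypothetical placeholders, not necessarily this package's documented interface:

    # dat: sites-by-measures imaging data; site: batch labels; covariates: design matrix
    fit        <- combat_fit(dat, batch = site, mod = covariates)  # estimate site effects
    harmonized <- combat_apply(fit, dat, batch = site)             # remove them, reusing the fit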
This package performs unadjusted Bayesian survival analysis for right-censored time-to-event data. The main function, BayesSurv(), computes the posterior mean and a credible band for the survival function and for the cumulative hazard, as well as the posterior mean for the hazard, starting from a piecewise exponential (histogram) prior with Gamma-distributed heights that are either independent or have a Markovian dependence structure. A function, PlotBayesSurv(), is provided to easily create plots of the posterior means of the hazard, cumulative hazard and survival function, with a credible band accompanying the latter two. The priors and samplers are described in more detail in Castillo and Van der Pas (2020), "Multiscale Bayesian survival analysis", <arXiv:2005.02889>. In that paper it is also shown that the credible bands for the survival function and the cumulative hazard can be considered confidence bands (under mild conditions) and thus offer reliable uncertainty quantification.
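A hypothetical usage sketch based on the function names above (the data frame and argument names are assumptions, not the documented signatures):

    # df: data frame with right-censored observations
    fit <- BayesSurv(df, time = "time", event = "event")
    PlotBayesSurv(fit, object = "survival")  # posterior mean with credible band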
Translates several CSV files with ontological terms and corresponding data into RDF triples. These RDF triples are stored in OWL and JSON-LD files, facilitating data accessibility, interoperability, and knowledge unification. The triples are also visualized in a graph saved as an SVG. The input CSVs must be formatted with a template from a public Google Sheet; see the README or vignette for more information. This tool is used by the SDLE Research Center at Case Western Reserve University to create and visualize materials science ontologies, and it includes example ontologies to demonstrate its capabilities. This work was supported by the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy (EERE) under Solar Energy Technologies Office (SETO) Agreement Numbers DE-EE0009353 and DE-EE0009347, the Department of Energy (National Nuclear Security Administration) under Award Number DE-NA0004104 and Contract Number B647887, and the U.S. National Science Foundation under Award Number 2133576.
Simple and user-friendly wrappers to the saemix package for performing linear and non-linear mixed-effects regression modeling of growth data, accounting for clustering or for longitudinal analysis via repeated measurements. The package allows users to fit a variety of growth models, including linear, exponential, logistic, and Gompertz functions. For non-linear models, starting values are automatically calculated using initial least-squares estimates. The package includes functions for summarizing models, visualizing data and results, calculating doubling time and other key statistics, and generating model diagnostic plots and residual summary statistics. It also provides functions for generating publication-ready summary tables for reports. Additionally, users can fit linear and non-linear least-squares regression models if clustering is not applicable. The mixed-effects modeling methods in this package are based on Comets, Lavenu, and Lavielle (2017) <doi:10.18637/jss.v080.i03> as implemented in the saemix package. Please contact us at models@dfci.harvard.edu with any questions.
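For reference, one widely used Gompertz parameterization (Zwietering et al.) is y(t) = A * exp(-exp(mu * e / A * (lambda - t) + 1)), with asymptote A, maximum growth rate mu, and lag time lambda; whether this matches the package's internal parameterization is an assumption. A quick base-R look at its shape:

    gompertz <- function(t, A, mu, lambda) A * exp(-exp(mu * exp(1) / A * (lambda - t) + 1))
    curve(gompertz(x, A = 10, mu = 1, lambda = 2), from = 0, to = 15,
          xlab = "time", ylab = "modelled growth")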
The 2-D spatial and temporal Epidemic Type Aftershock Sequence ('ETAS') model is widely used to decluster earthquake data catalogs. Usually, the calculation of standard errors of the ETAS model parameter estimates is based on the Hessian matrix derived from the log-likelihood function of the fitted model. However, when an ETAS model is fitted to a local data set over a limited or short time period, the standard errors based on the Hessian matrix may be inaccurate, and it follows that the asymptotic confidence intervals for parameters may not always be reliable. As an alternative, this package allows for the construction of bootstrap confidence intervals based on empirical quantiles for the parameters of the 2-D spatial and temporal ETAS model. This version improves on Version 0.1.0 of the package by enabling the study space window (renamed 'study region') to be polygonal rather than merely rectangular. A Japan earthquake data catalog is used in a second example to illustrate this new feature.
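The empirical-quantile (percentile) bootstrap idea itself is simple; a generic base-R sketch for an arbitrary statistic (the ETAS-specific refitting on resampled catalogs is what the package automates):

    x   <- rnorm(100)                                        # placeholder data
    est <- replicate(1000, mean(sample(x, replace = TRUE)))  # bootstrap replicates
    quantile(est, c(0.025, 0.975))                           # empirical 95% interval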
Statistical or cognitive modeling usually requires a number of more or less arbitrary choices creating one specific path through a 'garden of forking paths'. The multiverse approach (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016, <doi:10.1177/1745691616658637>) offers a principled alternative in which results for all possible combinations of reasonable modeling choices are reported. MPTmultiverse performs a multiverse analysis for multinomial processing tree (MPT; Riefer & Batchelder, 1988, <doi:10.1037/0033-295X.95.3.318>) models, combining maximum-likelihood/frequentist and Bayesian estimation approaches with different levels of pooling (i.e., data aggregation). For the frequentist approaches, no pooling (with and without parametric or nonparametric bootstrap) and complete pooling are implemented using MPTinR <https://cran.r-project.org/package=MPTinR>. For the Bayesian approaches, no pooling, complete pooling, and three different variants of partial pooling are implemented using TreeBUGS <https://cran.r-project.org/package=TreeBUGS>. The main function is fit_mpt(), which performs the multiverse analysis in one call.
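A hypothetical call sketch (the argument names are assumptions rather than the documented signature of fit_mpt()):

    library(MPTmultiverse)
    # dat: data frame of response frequencies; "2htm.eqn": an MPT model file
    results <- fit_mpt(model = "2htm.eqn", data = dat)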