Combines taxonomic classifications of high-throughput 16S rRNA
gene sequences with reference proteomes of archaeal and bacterial taxa to generate amino acid compositions of community reference proteomes. Calculates chemical metrics including carbon oxidation state ('Zc'), stoichiometric oxidation and hydration state ('nO2' and 'nH2O'), H/C, N/C, O/C, and S/C ratios, grand average of hydropathicity ('GRAVY'), isoelectric point ('pI'), protein length, and average molecular weight of amino acid residues. Uses precomputed reference proteomes for archaea and bacteria derived from the Genome Taxonomy Database ('GTDB'). Also includes reference proteomes derived from the NCBI Reference Sequence ('RefSeq') database and manual mapping from the RDP Classifier training set to 'RefSeq' taxonomy as described by Dick and Tan (2023) <doi:10.1007/s00248-022-01988-9>. Processes taxonomic classifications in RDP Classifier format or OTU tables in phyloseq-class objects from the Bioconductor package 'phyloseq'.
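As a point of reference for the carbon oxidation state metric, a generic helper (written here for illustration only, not the package's own function) can compute 'Zc' for a single molecule from its elemental composition:

# Carbon oxidation state of a molecule C_c H_h N_n O_o S_s with net charge Z:
# Zc = (Z - h + 3*n + 2*o + 2*s) / c
Zc <- function(C, H, N = 0, O = 0, S = 0, Z = 0) {
  (Z - H + 3 * N + 2 * O + 2 * S) / C
}

Zc(C = 2, H = 5, N = 1, O = 2)           # glycine (C2H5NO2): 1.0
Zc(C = 3, H = 7, N = 1, O = 2, S = 1)    # cysteine (C3H7NO2S): ~0.67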
Network analysis plays an important role in numerous application domains, including biomedicine. Estimating the number of communities is a fundamental and critical issue in network analysis. Most existing studies assume that the number of communities is known a priori, or lack a rigorous theoretical guarantee of estimation consistency. This package implements a regularized network embedding model that simultaneously estimates the community structure and the number of communities in a unified formulation. The model equips network embedding with a novel composite regularization term, which pushes each embedding vector towards its center and collapses similar community centers into each other. A rigorous theoretical analysis establishes asymptotic consistency in terms of community detection and estimation of the number of communities. Reference: Ren, M., Zhang, S. and Wang, J. (2022). "Consistent Estimation of the Number of Communities via Regularized Network Embedding". Biometrics, <doi:10.1111/biom.13815>.
Learn vector representations of sentences, paragraphs or documents by using the Paragraph Vector algorithms, namely the distributed bag of words (PV-DBOW) and the distributed memory (PV-DM) models. Top2vec finds clusters in text documents by combining document and word embedding with density-based clustering. It does this by embedding documents in the semantic space as defined by the doc2vec algorithm. Next, it maps these document embeddings to a lower-dimensional space using the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction algorithm and finds dense areas in that space using a hierarchical density-based clustering technique (HDBSCAN). These dense areas are the topic clusters; each can be represented by a topic vector, which is an aggregate of the document embeddings of the documents belonging to that topic cluster. Similar words in the same semantic space can then be found which are representative of the topic.
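A minimal sketch of that pipeline, assuming the paragraph2vec() interface of this package together with the 'uwot' and 'dbscan' packages for the UMAP and HDBSCAN steps (the input corpus and settings are placeholders):

library(doc2vec)   # paragraph2vec()
library(uwot)      # umap()
library(dbscan)    # hdbscan()

# corpus: a character vector with one document per element (placeholder input)
x <- data.frame(doc_id = paste0("doc_", seq_along(corpus)), text = corpus)

# 1. Embed documents with PV-DBOW
model <- paragraph2vec(x, type = "PV-DBOW", dim = 50, iter = 20)
emb   <- as.matrix(model, which = "docs")

# 2. Map the document embeddings to a lower-dimensional space with UMAP
emb_low <- umap(emb, n_neighbors = 15, n_components = 5, metric = "cosine")

# 3. Find dense areas (topic clusters) with HDBSCAN
cl <- hdbscan(emb_low, minPts = 10)

# 4. Topic vectors: aggregate the document embeddings per cluster
topic_vectors <- rowsum(emb, cl$cluster) / as.vector(table(cl$cluster))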
This package provides methods for simultaneous clustering and dimensionality reduction, such as Double k-means, Reduced k-means, Factorial k-means, and Clustering with Disjoint PCA, as well as methods for dimensionality reduction only: Disjoint PCA and Disjoint FA. The statistical methods implemented refer to the following articles: de Soete G., Carroll J. (1994) "K-means clustering in a low-dimensional Euclidean space" <doi:10.1007/978-3-642-51175-2_24> ; Vichi M. (2001) "Double k-means Clustering for Simultaneous Classification of Objects and Variables" <doi:10.1007/978-3-642-59471-7_6> ; Vichi M., Kiers H.A.L. (2001) "Factorial k-means analysis for two-way data" <doi:10.1016/S0167-9473(00)00064-5> ; Vichi M., Saporta G. (2009) "Clustering and disjoint principal component analysis" <doi:10.1016/j.csda.2008.05.028> ; Vichi M. (2017) "Disjoint factor analysis with cross-loadings" <doi:10.1007/s11634-016-0263-9>.
Runs ecological niche models over all combinations of user-defined settings (i.e., tuning), performs cross-validation to evaluate models, and returns data tables to aid in the selection of optimal model settings that balance goodness-of-fit and model complexity. Also has functions to partition data spatially (or not) for cross-validation, to plot multiple visualizations of results, to run null models to estimate the significance and effect sizes of performance metrics, and to calculate range overlap between model predictions, among others. The package was originally built for Maxent models (Phillips et al. 2006, Phillips et al. 2017), but the current version allows extensions for other modeling algorithms. The extensive vignette, which guides users through most package functionality but is too large to host on CRAN, can be found on the package's GitHub Pages website: <https://jamiemkass.github.io/ENMeval/articles/ENMeval-2.0-vignette.html>.
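A rough sketch of a typical tuning run, assuming the main entry point ENMevaluate() with raster predictors; file names and settings below are placeholders, and argument names may differ between versions:

library(ENMeval)
library(raster)

occs <- read.csv("occurrences.csv")                              # longitude/latitude records (placeholder)
envs <- raster::stack(list.files("rasters/", full.names = TRUE)) # environmental predictors (placeholder)

# Tune feature classes (fc) and regularization multipliers (rm) with spatial block partitions
e <- ENMevaluate(occs = occs, envs = envs,
                 algorithm = "maxnet", partitions = "block",
                 tune.args = list(fc = c("L", "LQ", "LQH"), rm = 1:3))

# Model selection table: balance goodness-of-fit and complexity (e.g., minimum AICc)
res <- eval.results(e)
res[which.min(res$AICc), ]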
Easily explore data by plotting graphs with a few lines of code. Use these ggplot() wrappers to quickly draw graphs of scatter/dots with box-whiskers, violins or SD error bars, data distributions, before-after graphs, factorial ANOVA and more. Customise graphs in many ways, for example by choosing from colour-blind-friendly palettes (12 discrete, 3 continuous and 2 divergent palettes). Use the simple code for ANOVA as ordinary (lm()) or mixed-effects linear models (lmer()), including randomised-block or repeated-measures designs, and fit non-linear outcomes as a generalised additive model (gam) using 'mgcv'. Obtain estimated marginal means and perform post-hoc comparisons on fitted models (via emmeans()). Also includes small datasets for practising code and teaching basics before users move on to more complex designs. See the vignettes for details on usage <https://grafify.shenoylab.com/>. Citation: <doi:10.5281/zenodo.5136508>.
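The kind of scatter-with-box-whiskers graph these wrappers draw can be approximated in a few lines of plain ggplot2; this is only an illustration of the underlying plot, not the package's one-line wrapper:

library(ggplot2)

# Toy one-way design: three groups, 20 observations each
d <- data.frame(group = rep(c("A", "B", "C"), each = 20),
                value = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 15)))

ggplot(d, aes(x = group, y = value, fill = group)) +
  geom_boxplot(outlier.shape = NA, width = 0.5, alpha = 0.6) +
  geom_jitter(width = 0.1, shape = 21, size = 2) +
  theme_classic()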
This package provides functions to compute p-values based on permutation tests. Regression, ANOVA and ANCOVA, omnibus F-tests, marginal unilateral and bilateral t-tests are available. Several methods to handle nuisance variables are implemented (Kherad-Pajouh, S., & Renaud, O. (2010) <doi:10.1016/j.csda.2010.02.015> ; Kherad-Pajouh, S., & Renaud, O. (2014) <doi:10.1007/s00362-014-0617-3> ; Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014) <doi:10.1016/j.neuroimage.2014.01.060>). An extension for the comparison of signals issued from experimental conditions (e.g. EEG/ERP signals) is provided. Several corrections for multiple testing are possible, including the cluster-mass statistic (Maris, E., & Oostenveld, R. (2007) <doi:10.1016/j.jneumeth.2007.03.024>) and the threshold-free cluster enhancement (Smith, S. M., & Nichols, T. E. (2009) <doi:10.1016/j.neuroimage.2008.03.061>).
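As a minimal illustration of the underlying idea (not the package's interface), a permutation p-value for one regression coefficient can be computed in base R; the package implements more refined schemes for handling the nuisance variables:

set.seed(1)
n <- 50
x <- rnorm(n); z <- rnorm(n)                  # z plays the role of a nuisance variable
y <- 0.4 * x + 0.3 * z + rnorm(n)

obs_t <- summary(lm(y ~ x + z))$coefficients["x", "t value"]

# Naive scheme: permute the regressor of interest, keep y and z fixed
perm_t <- replicate(2000, {
  xp <- sample(x)
  summary(lm(y ~ xp + z))$coefficients["xp", "t value"]
})

mean(abs(perm_t) >= abs(obs_t))               # bilateral permutation p-value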
This package contains performance analysis metrics for track records, including entropy-based correlation and dynamic beta based on a state-space algorithm. The normalized sample entropy method has been implemented, which produces accurate entropy estimation even on smaller datasets. On a separate stream, the package covers trades from the five major asset classes and also provides functionality to use pricing curves, rating tables, Credit Support Annex and add-on tables. The implementation follows an object-oriented logic whereby each trade inherits from more abstract classes, while the curves/tables are also objects. Furthermore, odds calculators and P&L back-testing functionality have been implemented for the most widely used betting/trading strategies, including martingale, D'Alembert, Labouchere and Fibonacci. Back-testing has also been included for the 'EuroMillions', 'EuroJackpot', UK Lotto, Set For Life and UK 'ThunderBall' lotteries. Finally, some basic functionality for climate risk has been included.
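As a sketch of what a P&L back-test of the simplest such strategy involves, the base R loop below simulates a martingale staking plan on near-even-money bets; it is purely illustrative and not the package's implementation:

set.seed(42)
p_win    <- 18 / 37        # e.g., an even-money bet on a European roulette wheel
n_bets   <- 200
base_bet <- 1
bank     <- 100
stake    <- base_bet
bankroll <- numeric(n_bets)

for (i in seq_len(n_bets)) {
  win  <- runif(1) < p_win
  bank <- bank + if (win) stake else -stake
  stake <- if (win) base_bet else 2 * stake   # double after a loss, reset after a win
  bankroll[i] <- bank
}

plot(bankroll, type = "l", xlab = "bet number", ylab = "bankroll")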
Automatic fixed rank kriging for (irregularly located) spatial data using a class of basis functions with multi-resolution features, ordered in terms of their resolutions. The model parameters are estimated by maximum likelihood (ML) and the number of basis functions is determined by Akaike's information criterion (AIC). For spatial data with either one realization or independent replicates, the ML estimates and AIC are efficiently computed using their closed-form expressions when no missing value occurs. Details regarding the basis function construction, parameter estimation, and AIC calculation can be found in Tzeng and Huang (2018) <doi:10.1080/00401706.2017.1345701>. For data with missing values, the ML estimates are obtained using the expectation-maximization algorithm. Apart from the number of basis functions, there are no other tuning parameters, making the method fully automatic. Users can also include a stationary structure in the spatial covariance, which utilizes the 'LatticeKrig' package.
Three global value chain (GVC) decompositions are implemented. The Leontief decomposition derives the value-added origin of exports by country and industry as in Hummels, Ishii and Yi (2001). The Koopman, Wang and Wei (2014) decomposition splits country-level exports into 9 value-added components, and the Wang, Wei and Zhu (2013) decomposition splits bilateral exports into 16 value-added components. Various GVC indicators based on these decompositions are computed in the complementary 'gvc' package. --- References: --- Hummels, D., Ishii, J., & Yi, K. M. (2001). The nature and growth of vertical specialization in world trade. Journal of International Economics, 54(1), 75-96. Koopman, R., Wang, Z., & Wei, S. J. (2014). Tracing value-added and double counting in gross exports. American Economic Review, 104(2), 459-494. Wang, Z., Wei, S. J., & Zhu, K. (2013). Quantifying international production sharing at the bilateral and sector levels (No. w19677). National Bureau of Economic Research.
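In matrix form, the Leontief decomposition traces the value-added origin of exports as V (I - A)^{-1} E, where A is the matrix of technical coefficients, V the diagonal matrix of value-added shares, and E the diagonal matrix of gross exports. A toy base R illustration (not the package interface):

# Toy single-country, 3-sector economy
A <- matrix(c(0.10, 0.05, 0.02,
              0.20, 0.15, 0.10,
              0.05, 0.10, 0.08), nrow = 3, byrow = TRUE)   # technical coefficients
v <- 1 - colSums(A)                                         # value-added share per unit of output
e <- c(30, 50, 20)                                          # gross exports by sector

B  <- solve(diag(3) - A)        # Leontief inverse (I - A)^-1
VA <- diag(v) %*% B %*% diag(e) # VA[i, j]: value added from sector i embodied in exports of sector j

colSums(VA)                     # value-added content of each sector's exports
sum(VA); sum(e)                 # equal here because this toy economy has no imported inputs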
Fits high dimensional penalized generalized linear mixed models using the Monte Carlo Expectation Conditional Minimization (MCECM) algorithm. The purpose of the package is to perform variable selection on both the fixed and random effects simultaneously for generalized linear mixed models. The package supports fitting of Binomial, Gaussian, and Poisson data with canonical links, and supports penalization using the MCP, SCAD, or LASSO penalties. The MCECM algorithm is described in Rashid et al. (2020) <doi:10.1080/01621459.2019.1671197>. The techniques used in the minimization portion of the procedure (the M-step) are derived from the procedures of the ncvreg package (Breheny and Huang (2011) <doi:10.1214/10-AOAS388>) and grpreg package (Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2>), with appropriate modifications to account for the estimation and penalization of the random effects. The ncvreg and grpreg packages also describe the MCP, SCAD, and LASSO penalties.
Row-column designs are widely recommended for experimental situations in which there are two well-identified, cross-classified factors representing known sources of variability. These designs are expected to result in a gain in accuracy when estimating treatment comparisons, as they eliminate the effects of the row and column factors. However, such designs are not readily available when the number of treatments exceeds the number of levels of the row and column blocking factors. This package, named iRoCoDe, generates row-column designs with incomplete rows and columns by amalgamating two incomplete block designs (D1 and D2). The input designs D1 and D2 can be selected from the available incomplete block designs, viz., balanced incomplete block designs, partially balanced incomplete block designs, or t-designs (McSorley, J.P., Phillips, N.C., Wallis, W.D. and Yucas, J.L. (2005) <doi:10.1007/s10623-003-6149-9>).
BioNERO aims to integrate all aspects of biological network inference in a single package, including data preprocessing, exploratory analyses, network inference, and analyses for biological interpretation. BioNERO can be used to infer gene coexpression networks (GCNs) and gene regulatory networks (GRNs) from gene expression data. Additionally, it can be used to explore topological properties of protein-protein interaction (PPI) networks. GCN inference relies on the popular WGCNA algorithm. GRN inference is based on the "wisdom of the crowds" principle, which consists of inferring GRNs with multiple algorithms (here, CLR, GENIE3 and ARACNE) and calculating the average rank for each interaction pair. As all steps of network analysis are included in this package, BioNERO spares users from having to learn the syntax of several packages and how to pass data between them. Finally, users can also identify consensus modules across independent expression sets and calculate intra- and interspecies module preservation statistics between different networks.
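The "wisdom of the crowds" aggregation itself is easy to express outside the package: rank the edge scores from each algorithm and average the ranks per regulator-target pair. A generic base R sketch with toy scores (not BioNERO's internal code):

set.seed(1)
pairs <- expand.grid(regulator = paste0("TF", 1:3), target = paste0("g", 1:5))
scores_clr    <- transform(pairs, score = runif(nrow(pairs)))   # stand-ins for CLR,
scores_genie3 <- transform(pairs, score = runif(nrow(pairs)))   # GENIE3 and
scores_aracne <- transform(pairs, score = runif(nrow(pairs)))   # ARACNE edge scores

rank_edges <- function(d) {
  d$rank <- rank(-d$score, ties.method = "average")   # best score -> rank 1
  d[, c("regulator", "target", "rank")]
}

ranked <- Reduce(function(a, b) merge(a, b, by = c("regulator", "target")),
                 lapply(list(scores_clr, scores_genie3, scores_aracne), rank_edges))
ranked$avg_rank <- rowMeans(ranked[, grep("^rank", names(ranked))])

head(ranked[order(ranked$avg_rank), c("regulator", "target", "avg_rank")])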
Translates a BNF (Backus-Naur Form) specification of a context-free language into an R grammar object which consists of the start symbol, the symbol table, the production table, and a short production table. The short production table is non-recursive. The grammar object contains the file name from which it was generated (without a path). In addition, it provides functions to determine the type of a symbol (isTerminal() and isNonterminal()) and functions to access the production table (rules() and derives()). For the BNF specification, see Backus, John et al. (1962) "Revised Report on the Algorithmic Language ALGOL 60" (ALGOL 60 standards page <http://www.algol60.org/2standards.htm>, HTML edition <https://www.masswerk.at/algol60/report.htm>). A preprocessor for macros which expand to standard BNF is included. The grammar compiler is an extension of the APL2 implementation in Geyer-Schulz, Andreas (1997, ISBN:978-3-7908-0830-X).
The cfTools R package provides methods for cell-free DNA (cfDNA) methylation data analysis to facilitate cfDNA-based studies. Given the methylation sequencing data of a cfDNA sample, for each cancer marker or tissue marker, we deconvolve the tumor-derived or tissue-specific reads from all reads falling in the marker region. Our read-based deconvolution algorithm exploits the pervasiveness of DNA methylation for signal enhancement, and can therefore sensitively identify a trace amount of tumor-specific or tissue-specific cfDNA in plasma. cfTools provides functions for (1) cancer detection: sensitively detect tumor-derived cfDNA and estimate the tumor-derived cfDNA fraction (tumor burden); and (2) tissue deconvolution: infer the tissue type composition and the cfDNA fraction of multiple tissue types for a plasma cfDNA sample. These functions can serve as foundations for more advanced cfDNA-based studies, including cancer diagnosis and disease monitoring.
This package performs (simultaneous) inferences for ratios of linear combinations of coefficients in the general linear model, linear mixed model, and for quantiles in a one-way layout. Multiple comparisons and simultaneous confidence interval estimations can be performed for ratios of treatment means in the normal one-way layout with homogeneous and heterogeneous treatment variances, according to Dilba et al. (2007) <https://cran.r-project.org/doc/Rnews/Rnews_2007-1.pdf> and Hasler and Hothorn (2008) <doi:10.1002/bimj.200710466>. Confidence interval estimations for ratios of linear combinations of linear model parameters like in (multiple) slope ratio and parallel line assays can be carried out. Moreover, it is possible to calculate the sample sizes required in comparisons with a control based on relative margins. For the simple two-sample problem, functions for a t-test for ratio-formatted hypotheses and the corresponding confidence interval are provided assuming homogeneous or heterogeneous group variances.
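A small usage sketch for the two-sample case, assuming the ratio t-test function is ttestratio() with a formula interface; the data and the relative margin below are placeholders:

library(mratios)

d <- data.frame(group = rep(c("control", "treatment"), each = 10),
                y = c(rnorm(10, mean = 50, sd = 5), rnorm(10, mean = 45, sd = 5)))

# Test the ratio of the two group means against the relative margin rho = 0.8
# (numerator/denominator follow the factor level order), allowing heterogeneous variances
ttestratio(y ~ group, data = d, alternative = "greater", rho = 0.8, var.equal = FALSE)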
This package provides an extensible formula system to quickly and easily create production-quality tables. The processing steps are a formula parser, statistical content generation from data as defined by the formula, followed by rendering into a table. Each step of the processing is separate and user-definable, creating a set of composable building blocks for highly customizable table generation. A user is not limited by any of the choices of the package creator other than the formula grammar. For example, one could choose to add a different S3 rendering function and output a format not provided in the default package, or one might rather have Gini coefficients as the statistical content in a resulting table. Routines to achieve New England Journal of Medicine style, Lancet style and Hmisc::summaryM() statistics are provided. The package contains rendering for HTML5 and Rmarkdown, as well as an indexing format for use in tracing and tracking.
This package provides a modular and extendable approach to extract (micro)saccades from gaze samples via an ensemble of methods. Although there is agreement on a general definition of a saccade, the more specific details are harder to agree upon. Therefore, there are numerous algorithms that extract saccades based on various heuristics, which differ in their assumptions about velocity, acceleration, etc. The package uses three methods (Engbert and Kliegl (2003) <doi:10.1016/S0042-6989(03)00084-1>, Otero-Millan et al. (2014) <doi:10.1167/14.2.18>, and Nyström and Holmqvist (2010) <doi:10.3758/BRM.42.1.188>) to label individual samples and then applies a majority vote approach to identify saccades. The package includes three methods but can be extended via custom functions. It also uses a modular approach to compute velocity and acceleration from noisy samples. Finally, you can obtain the methods' votes per gaze sample instead of saccades.
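To make the velocity-based heuristics concrete, the core of a velocity-threshold labeller in the spirit of Engbert and Kliegl (2003) can be sketched in base R; the package wraps several such methods and then takes a majority vote over the per-sample labels:

# x, y: gaze position in degrees, sampled at a fixed rate (e.g., 500 Hz)
label_saccade_samples <- function(x, y, sample_rate = 500, lambda = 6) {
  # velocity via central differences (a smoothed estimator is used in practice)
  vx <- c(NA, diff(x, lag = 2) * sample_rate / 2, NA)
  vy <- c(NA, diff(y, lag = 2) * sample_rate / 2, NA)
  # median-based (robust) velocity SD per axis
  sdx <- sqrt(median(vx^2, na.rm = TRUE) - median(vx, na.rm = TRUE)^2)
  sdy <- sqrt(median(vy^2, na.rm = TRUE) - median(vy, na.rm = TRUE)^2)
  # a sample is labelled saccadic when the elliptic threshold (lambda * SD) is exceeded
  (vx / (lambda * sdx))^2 + (vy / (lambda * sdy))^2 > 1
}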
An interface to WordNet using the Jawbone Java API to WordNet. WordNet (<https://wordnet.princeton.edu/>) is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. Please note that WordNet(R) is a registered tradename. Princeton University makes WordNet available to research and commercial users free of charge provided the terms of their license (<https://wordnet.princeton.edu/license-and-commercial-use>) are followed, and proper reference is made to the project using an appropriate citation (<https://wordnet.princeton.edu/citing-wordnet>). The WordNet database files need to be made available separately, either via the package wordnetDicts from <https://datacube.wu.ac.at>, by installing system packages where available, or by direct download from <https://wordnetcode.princeton.edu/3.0/WNdb-3.0.tar.gz>.
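A short usage sketch, assuming the WordNet database files are installed where initDict() can find them (the example terms are arbitrary):

library(wordnet)

if (initDict()) {
  # Nouns starting with "car", via a Jawbone term filter
  filter <- getTermFilter("StartsWithFilter", "car", TRUE)
  terms  <- getIndexTerms("NOUN", 5, filter)
  print(sapply(terms, getLemma))

  # Synonyms of a term within one part of speech
  print(synonyms("company", "NOUN"))
}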
Computes the posterior model probabilities for standard meta-analysis models (null model vs. alternative model assuming either fixed or random effects, respectively). These posterior probabilities are used to estimate the overall mean effect size as the weighted average of the mean effect size estimates of the random- and fixed-effect models, as proposed by Gronau, Van Erp, Heck, Cesario, Jonas, & Wagenmakers (2017, <doi:10.1080/23743603.2017.1326760>). The user can define a wide range of non-informative or informative priors for the mean effect size and the heterogeneity coefficient. Moreover, using pre-compiled Stan models, meta-analyses with continuous and discrete moderators with Jeffreys-Zellner-Siow (JZS) priors can be fitted and tested. This makes it possible to compute Bayes factors and perform Bayesian model averaging across random- and fixed-effects meta-analysis with and without moderators. For a primer on Bayesian model-averaged meta-analysis, see Gronau, Heck, Berkhout, Haaf, & Wagenmakers (2021, <doi:10.1177/25152459211031256>).
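A minimal call, assuming the main entry point is meta_bma() and that the packaged 'towels' example data set (log odds ratios with standard errors, one row per study) is available; the argument names follow that assumed interface:

library(metaBMA)

data(towels)
fit <- meta_bma(y = logOR, SE = SE, labels = study, data = towels)
fit   # posterior model probabilities and the model-averaged mean effect size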
This package provides a collection of functions to pre-process amplification curve data from polymerase chain reaction (PCR) or isothermal amplification reactions. It contains functions to normalize and baseline amplification curves, to detect both the start and end of an amplification reaction, several smoothers (e.g., LOWESS, moving average, cubic splines, Savitzky-Golay), a function to detect false-positive amplification reactions, and a function to determine the amplification efficiency. Quantification point (Cq) methods include the first (FDM) and second approximate derivative maximum (SDM) methods (calculated by a 5-point stencil) and the cycle threshold method. Data sets from experimental nucleic acid amplification systems ('VideoScan HCU', capillary convective PCR ('ccPCR')) and commercial systems are included. Amplification curves were generated by helicase-dependent amplification (HDA), ccPCR or PCR. Intercalating dyes ('EvaGreen', SYBR Green) and hydrolysis probes ('TaqMan') were used as detection systems. For more information see Roediger et al. (2015) <doi:10.1093/bioinformatics/btv205>.
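The second derivative maximum (SDM) idea can be illustrated directly: approximate the second derivative of the fluorescence curve with a 5-point stencil and take the cycle where it peaks (a generic base R sketch, not the package's implementation):

# cyc: cycle numbers; fluo: background-corrected fluorescence per cycle
sdm_cq <- function(cyc, fluo) {
  n  <- length(fluo)
  d2 <- rep(NA_real_, n)
  for (i in 3:(n - 2)) {      # 5-point central stencil, unit cycle spacing
    d2[i] <- (-fluo[i - 2] + 16 * fluo[i - 1] - 30 * fluo[i] +
               16 * fluo[i + 1] - fluo[i + 2]) / 12
  }
  cyc[which.max(d2)]          # cycle of the second derivative maximum (approximate Cq)
}

cyc  <- 1:40
fluo <- 1 / (1 + exp(-(cyc - 25) / 2)) + rnorm(40, sd = 0.005)   # toy sigmoid curve
sdm_cq(cyc, fluo)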
Perform multivariate modeling of evolved traits, with special attention to understanding the interplay of the multi-factorial determinants of their origins in complex ecological settings (Stephens, 2007 <doi:10.1016/j.tree.2006.12.003>). This software primarily concentrates on phylogenetic regression analysis, enabling the implementation of tree-transformation averaging and visualization functionality. Functions additionally support information-theoretic approaches (Grueber, 2011 <doi:10.1111/j.1420-9101.2010.02210.x>; Garamszegi, 2011 <doi:10.1007/s00265-010-1028-7>) such as model averaging and selection of phylogenetic models. Accessory functions are also implemented for coefficient standardization (Cade 2015), selection uncertainty, and variable importance (Burnham & Anderson 2000). There are numerous other functions for visualizing confounded variables, plotting phylogenetic trees, and reporting and exporting modeling results. Lastly, as challenges in ecology are inherently multifarious, and therefore often multi-dataset, this package features several functions to support the identification, interpolation, merging, and updating of missing data and outdated nomenclature.
Implements a hybrid spatial model that combines the strengths of two widely used regression models, MARS (Multivariate Adaptive Regression Splines) and GWR (Geographically Weighted Regression), to provide an effective approach for predicting a response variable at unknown locations. The MARS model is used in the first step of the development of the hybrid model to identify the most important predictor variables that assist in predicting the response variable; for method details see Friedman, J.H. (1991) <DOI:10.1214/aos/1176347963>. The GWR model is then used to predict the response variable at testing locations based on these selected variables, accounting for spatial variations in the relationships between the variables. This hybrid model can improve the accuracy of the predictions compared to using an individual model alone. The developed hybrid spatial model can be particularly useful in cases where the relationship between the response variable and the predictor variables is complex and non-linear, and varies across locations.
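A rough sketch of the two-step idea using the 'earth' and 'GWmodel' packages; the data frame, coordinate columns and settings are placeholders, and the package's own interface may differ:

library(earth)     # MARS
library(GWmodel)   # GWR
library(sp)

# d: data.frame with response y, candidate predictors, and coordinates lon/lat (placeholder)
# Step 1: MARS screens for the most important predictors
mars_fit <- earth(y ~ ., data = d[, setdiff(names(d), c("lon", "lat"))])
keep <- rownames(evimp(mars_fit))                  # variables retained by MARS
f    <- reformulate(keep, response = "y")

# Step 2: GWR on the selected variables to capture spatially varying relationships
sp_d <- SpatialPointsDataFrame(coords = d[, c("lon", "lat")], data = d)
bw   <- bw.gwr(f, data = sp_d, approach = "CV", kernel = "bisquare", adaptive = TRUE)
gwr.basic(f, data = sp_d, bw = bw, kernel = "bisquare", adaptive = TRUE)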
This package provides a compilation of more than 80 functions designed to quantitatively and visually evaluate prediction performance of regression (continuous variables) and classification (categorical variables) of point-forecast models (e.g. APSIM, DSSAT, DNDC, supervised Machine Learning). For regression, it includes functions to generate plots (scatter, tiles, density, & Bland-Altman plot), and to estimate error metrics (e.g. MBE, MAE, RMSE), error decomposition (e.g. lack of accuracy-precision), model efficiency (e.g. NSE, E1, KGE), indices of agreement (e.g. d, RAC), goodness of fit (e.g. r, R2), adjusted correlation coefficients (e.g. CCC, dcorr), symmetric regression coefficients (intercept, slope), and mean absolute scaled error (MASE) for time series predictions. For classification (binomial and multinomial), it offers functions to generate and plot confusion matrices, and to estimate performance metrics such as accuracy, precision, recall, specificity, F-score, Cohen's Kappa, G-mean, and many more. For more details visit the vignettes <https://adriancorrendo.github.io/metrica/>.
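For reference, the basic error metrics named above reduce to one-liners; a base R sketch (the package adds many more metrics plus the plotting functions):

MBE  <- function(obs, pred) mean(pred - obs)              # mean bias error
MAE  <- function(obs, pred) mean(abs(pred - obs))         # mean absolute error
RMSE <- function(obs, pred) sqrt(mean((pred - obs)^2))    # root mean square error

obs  <- c(2.1, 3.4, 4.8, 5.0, 6.2)
pred <- c(2.4, 3.1, 5.1, 4.6, 6.8)
c(MBE = MBE(obs, pred), MAE = MAE(obs, pred), RMSE = RMSE(obs, pred))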