The permubiome R package was created to perform a permutation-based non-parametric analysis on microbiome data for biomarker discovery. This test executes thousands of pairwise comparisons after randomly shuffling the data into the different study groups, with a prior selection of the microbiome features with the largest variation among groups. Prior to the permutation test itself, data can be normalized according to different methods proposed to handle microbiome data ('proportions' or 'Anders'). The median-based differences between groups resulting from the multiple simulations are fitted to a normal distribution in order to calculate their significance. A multiple testing correction based on the Benjamini-Hochberg method (fdr) is finally applied to extract the differentially presented features between the groups of your dataset. LATEST UPDATES: v1.1 and later incorporate a function to parse COLUMN format; v1.2 and later incorporate the -optimize- function to maximize evaluation of features with the largest inter-class variation; v1.3 and later include the -size.effect- function to perform estimation statistics using the bootstrap-coupled approach implemented in the dabestr (>=0.3.0) R package. Current v1.3.2 fixed a bug with "Class" recognition and updated dabestr functions.
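The permutation idea described above can be illustrated with a minimal Python sketch. It is not the permubiome implementation: the function names, the default number of permutations, and the two-sided p-value are assumptions made for illustration. It shuffles group labels, fits a normal distribution to the permuted median differences, and applies a Benjamini-Hochberg adjustment across features.

```python
import random
import statistics


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided p-value for the median difference, using a normal fit
    to the differences obtained under random label shuffling."""
    rng = random.Random(seed)
    observed = statistics.median(group_a) - statistics.median(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null_diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of samples to groups
        null_diffs.append(
            statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:])
        )
    mu = statistics.fmean(null_diffs)
    sigma = statistics.stdev(null_diffs)
    if sigma == 0.0:  # degenerate case: all permuted differences identical
        return 1.0
    z = (observed - mu) / sigma
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(z)))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In practice one p-value would be computed per microbiome feature and the whole vector passed to `benjamini_hochberg`.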
This package provides an all-in-one solution for automatic classification of sound events using convolutional neural networks (CNN). The main purpose is to provide a sound classification workflow, from annotating sound events in recordings to training and automating model usage in real-life situations. Using the package requires a pre-compiled collection of recordings with sound events of interest and it can be employed for: 1) Annotation: create a database of annotated recordings, 2) Training: prepare train data from annotated recordings and fit CNN models, 3) Classification: automate the use of the fitted model for classifying new recordings. By using automatic feature selection and a user-friendly GUI for managing data and training/deploying models, this package is intended to be used by a broad audience as it does not require specific expertise in statistics, programming or sound analysis. Please refer to the vignette for further information. Gibb, R., et al. (2019) <doi:10.1111/2041-210X.13101> Mac Aodha, O., et al. (2018) <doi:10.1371/journal.pcbi.1005995> Stowell, D., et al. (2019) <doi:10.1111/2041-210X.13103> LeCun, Y., et al. (2012) <doi:10.1007/978-3-642-35289-8_3>.
Systematic conservation prioritization using mixed integer linear programming (MILP). It provides a flexible interface for building and solving conservation planning problems. Once built, conservation planning problems can be solved using a variety of commercial and open-source exact algorithm solvers. By using exact algorithm solvers, solutions can be generated that are guaranteed to be optimal (or within a pre-specified optimality gap). Furthermore, conservation problems can be constructed to optimize the spatial allocation of different management actions or zones, meaning that conservation practitioners can identify solutions that benefit multiple stakeholders. To solve large-scale or complex conservation planning problems, users should install the Gurobi optimization software (available from <https://www.gurobi.com/>) and the gurobi R package (see Gurobi Installation Guide vignette for details). Users can also install the IBM CPLEX software (<https://www.ibm.com/products/ilog-cplex-optimization-studio/cplex-optimizer>) and the cplexAPI R package (available at <https://github.com/cran/cplexAPI>). Additionally, the rcbc R package (available at <https://github.com/dirkschumacher/rcbc>) can be used to generate solutions using the CBC optimization software (<https://github.com/coin-or/Cbc>). For further details, see Hanson et al. (2025) <doi:10.1111/cobi.14376>.
pathwayPCA is an integrative analysis tool that implements the principal component analysis (PCA) based pathway analysis approaches described in Chen et al. (2008), Chen et al. (2010), and Chen (2011). pathwayPCA allows users to: (1) Test pathway association with binary, continuous, or survival phenotypes. (2) Extract relevant genes in the pathways using the SuperPCA and AES-PCA approaches. (3) Compute principal components (PCs) based on the selected genes. These estimated latent variables represent pathway activities for individual subjects, which can then be used to perform integrative pathway analysis, such as multi-omics analysis. (4) Extract relevant genes that drive pathway significance as well as data corresponding to these relevant genes for additional in-depth analysis. (5) Perform analyses with enhanced computational efficiency with parallel computing and enhanced data safety with S4-class data objects. (6) Analyze studies with complex experimental designs, with multiple covariates, and with interaction effects, e.g., testing whether pathway association with clinical phenotype is different between male and female subjects. Citations: Chen et al. (2008) <https://doi.org/10.1093/bioinformatics/btn458>; Chen et al. (2010) <https://doi.org/10.1002/gepi.20532>; and Chen (2011) <https://doi.org/10.2202/1544-6115.1697>.
Non-Domestic VAERS vaccine data for 01/01/2016 - 06/14/2016. If you want to explore the full VAERS data for 1990 - Present (data, symptoms, and vaccines), then check out the vaersND package from the URL below. The URL and BugReports below correspond to the vaersND package, of which vaersNDvax is a small subset (2016 only). vaersND is not hosted on CRAN due to the large size of the data set. To install the Suggested vaers and vaersND packages, use the following R code: devtools::install_git("https://gitlab.com/iembry/vaers.git", build_vignettes = TRUE) and devtools::install_git("https://gitlab.com/iembry/vaersND.git", build_vignettes = TRUE). "VAERS is a national vaccine safety surveillance program co-sponsored by the US Centers for Disease Control and Prevention (CDC) and the US Food and Drug Administration (FDA). VAERS is a post-marketing safety surveillance program, collecting information about adverse events (possible side effects) that occur after the administration of vaccines licensed for use in the United States." For more information about the data, visit <https://vaers.hhs.gov/index>. For information about vaccination/immunization hazards, visit <http://www.questionuniverse.com/rethink.html/#vaccine>.
JASPAR (https://jaspar.elixir.no/) is a widely-used open-access database presenting manually curated high-quality and non-redundant DNA-binding profiles for transcription factors (TFs) across taxa. In this 10th release and 20th-anniversary update, the CORE collection has expanded with 329 new profiles. We updated three existing profiles and provided orthogonal support for 72 profiles from the previous release UNVALIDATED collection. Altogether, the JASPAR 2024 update provides a 20 percent increase in CORE profiles from the previous release. A trimming algorithm enhanced profiles by removing low information content flanking base pairs, which were likely uninformative (within the capacity of the PFM models) for TFBS predictions and modelling TF-DNA interactions. This release includes enhanced metadata, featuring a refined classification for plant TFs structural DNA-binding domains. The new JASPAR collections prompt updates to the genomic tracks of predicted TF-binding sites in 8 organisms, with human and mouse tracks available as native tracks in the UCSC Genome browser. All data are available through the JASPAR web interface and programmatically through its API and the updated Bioconductor and pyJASPAR packages. Finally, a new TFBS extraction tool enables users to retrieve predicted JASPAR TFBSs intersecting their genomic regions of interest.
This package provides a system designed for detecting concept drift in streaming datasets. It offers a comprehensive suite of statistical methods to detect concept drift, including methods for monitoring changes in data distributions over time. The package supports several tests, such as Drift Detection Method (DDM), Early Drift Detection Method (EDDM), Hoeffding Drift Detection Methods (HDDM_A, HDDM_W), Kolmogorov-Smirnov test-based Windowing (KSWIN) and Page Hinkley (PH) tests. The methods implemented in this package are based on established research and have been demonstrated to be effective in real-time data analysis. For more details on the methods, please refer to the following sources: Kobylińska et al. (2023) <doi:10.48550/arXiv.2308.11446>, S. Kullback & R.A. Leibler (1951) <doi:10.1214/aoms/1177729694>, Gama et al. (2004) <doi:10.1007/978-3-540-28645-5_29>, Baena-Garcia et al. (2006) <https://www.researchgate.net/publication/245999704_Early_Drift_Detection_Method>, Frías-Blanco et al. (2014) <https://ieeexplore.ieee.org/document/6871418>, Raab et al. (2020) <doi:10.1016/j.neucom.2019.11.111>, Page (1954) <doi:10.1093/biomet/41.1-2.100>, Montiel et al. (2018) <https://jmlr.org/papers/volume19/18-251/18-251.pdf>.
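To make one of the listed tests concrete, here is a minimal Python sketch of a Page-Hinkley detector for upward shifts in the mean of a stream. The class name, the parameter names `delta` and `threshold`, and their default values are assumptions for illustration and do not reflect the package's interface.

```python
class PageHinkley:
    """Sketch of the Page-Hinkley test for detecting an increase in the mean."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # detection threshold (lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0              # cumulative deviation m_t
        self.min_cum = 0.0          # running minimum M_t

    def update(self, x):
        """Feed one observation; return True when drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n   # incremental running mean
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.threshold
```

A typical use is to call `update` once per incoming value and reset the detector after a drift is signalled.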
State-of-the-art Multi-Objective Particle Swarm Optimiser (MOPSO), based on the algorithm developed by Lin et al. (2018) <doi:10.1109/TEVC.2016.2631279> with improvements described by Marinao-Rivas & Zambrano-Bigiarini (2020) <doi:10.1109/LA-CCI48322.2021.9769844>. This package is inspired by and closely follows the philosophy of the single objective hydroPSO R package (Zambrano-Bigiarini & Rojas (2013) <doi:10.1016/j.envsoft.2013.01.004>), and can be used for global optimisation of non-smooth and non-linear R functions and R-based models (e.g., 'TUWmodel', 'GR4J', 'GR6J'). However, the main focus of hydroMOPSO is optimising environmental and other real-world models that need to be run from the system console (e.g., 'SWAT+'). hydroMOPSO communicates with the model to be optimised through its input and output files, without requiring modifying its source code. Thanks to its flexible design and the availability of several fine-tuning options, hydroMOPSO can tackle a wide range of multi-objective optimisation problems (e.g., multi-objective functions, multiple model variables, multiple periods). Finally, hydroMOPSO is designed to run on multi-core machines or network clusters, to alleviate the computational burden of complex models with long execution time.
Hierarchical and partitioning algorithms to cluster blocks of variables. The partitioning algorithm includes an option called noise cluster to set aside atypical blocks of variables. Different thresholds per cluster can be set. The CLUSTATIS method (for quantitative blocks) (Llobell, Cariou, Vigneau, Labenne & Qannari (2020) <doi:10.1016/j.foodqual.2018.05.013>, Llobell, Vigneau & Qannari (2019) <doi:10.1016/j.foodqual.2019.02.017>) and the CLUSCATA method (for Check-All-That-Apply data) (Llobell, Cariou, Vigneau, Labenne & Qannari (2019) <doi:10.1016/j.foodqual.2018.09.006>, Llobell, Giacalone, Labenne & Qannari (2019) <doi:10.1016/j.foodqual.2019.05.017>) are the core of this package. The CATATIS method allows computing some indices and tests to control the quality of CATA data (Llobell, Bonnet & Giacalone (2024) <doi:10.1111/joss.12941>). Multivariate analysis and clustering of subjects for quantitative multiblock data, CATA, RATA, Free Sorting and JAR experiments are available. Clustering of observations (products in sensory analysis) in a multi-block context (notably with the ClusMB strategy) is also included (Llobell & Giacalone (2025) <doi:10.1111/joss.70024>). Performing clustering based on CATA and liking at the same time is possible thanks to the cluscata_liking function (Llobell & Giacalone (2025) <doi:10.1016/j.foodqual.2021.104358>).
Distance measures (GDM1, GDM2, Sokal-Michener, Bray-Curtis, for symbolic interval-valued data), cluster quality indices (Calinski-Harabasz, Baker-Hubert, Hubert-Levine, Silhouette, Krzanowski-Lai, Hartigan, Gap, Davies-Bouldin), data normalization formulas (metric data, interval-valued symbolic data), data generation (typical and non-typical data), HINoV method, replication analysis, linear ordering methods, spectral clustering, agreement indices between two partitions, plot functions (for categorical and symbolic interval-valued data). (MILLIGAN, G.W., COOPER, M.C. (1985) <doi:10.1007/BF02294245>, HUBERT, L., ARABIE, P. (1985) <doi:10.1007/BF01908075>, RAND, W.M. (1971) <doi:10.1080/01621459.1971.10482356>, JAJUGA, K., WALESIAK, M. (2000) <doi:10.1007/978-3-642-57280-7_11>, MILLIGAN, G.W., COOPER, M.C. (1988) <doi:10.1007/BF01897163>, JAJUGA, K., WALESIAK, M., BAK, A. (2003) <doi:10.1007/978-3-642-55721-7_12>, DAVIES, D.L., BOULDIN, D.W. (1979) <doi:10.1109/TPAMI.1979.4766909>, CALINSKI, T., HARABASZ, J. (1974) <doi:10.1080/03610927408827101>, HUBERT, L. (1974) <doi:10.1080/01621459.1974.10480191>, TIBSHIRANI, R., WALTHER, G., HASTIE, T. (2001) <doi:10.1111/1467-9868.00293>, BRECKENRIDGE, J.N. (2000) <doi:10.1207/S15327906MBR3502_5>, WALESIAK, M., DUDEK, A. (2008) <doi:10.1007/978-3-540-78246-9_11>).
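As an illustration of an agreement index between two partitions, the following Python sketch computes the unadjusted Rand (1971) index by counting object pairs on which two labelings agree. The function name is hypothetical and this is not the package's code.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated consistently by both partitions
    (both together or both apart). Requires at least two objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

The index equals 1 when the partitions are identical up to a relabeling of the clusters.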
This package provides a comprehensive interface to access diverse public data about Colombia through multiple APIs and curated datasets. The package integrates four different APIs: API-Colombia for Colombian-specific data including geography, culture, tourism, and government information; World Bank API for economic and demographic indicators; Nager.Date for public holidays; and REST Countries API for general country information. The package enables users to explore various aspects of Colombia such as geographic locations, cultural attractions, economic indicators, demographic data, and public holidays. Additionally, ColombiAPI includes curated datasets covering Bogota air stations, business and holiday dates, public schools, Colombian coffee exports, cannabis licenses, Medellin rainfall, malls in Bogota, as well as datasets on indigenous languages, student admissions and school statistics, forest liana mortality, municipal and regional data, connectivity and digital infrastructure, program graduates, vehicle counts, international visitors, and GDP projections. These datasets provide users with a rich and multifaceted view of Colombian social, economic, environmental, and technological information, making ColombiAPI a comprehensive tool for exploring Colombia's diverse data landscape. For more information on the APIs, see: API-Colombia <https://api-colombia.com/>, Nager.Date <https://date.nager.at/Api>, World Bank API <https://datahelpdesk.worldbank.org/knowledgebase/articles/889392>, and REST Countries API <https://restcountries.com/>.
We developed an inference tool based on approximate Bayesian computation to decipher network data and assess the strength of the inferred links between the network's actors. It is a new multi-level approximate Bayesian computation (ABC) approach. At the first level, the method captures the global properties of the network, such as a scale-free structure and clustering coefficients, whereas the second level is targeted to capture local properties, including the probability of each pair of genes being linked. Up to now, ABC algorithms have been scarcely used in that setting and, due to the computational overhead, their application was limited to a small number of genes. In contrast, our algorithm was made to cope with that issue and has low computational cost. It can be used, for instance, for elucidating gene regulatory networks, which is an important step towards understanding normal cell physiology and complex pathological phenotypes. Reverse-engineering consists of using gene expressions over time or over different experimental conditions to discover the structure of the gene network in a targeted cellular process. The fact that gene expression data are usually noisy, highly correlated, and have high dimensionality explains the need for specific statistical methods to reverse engineer the underlying network.
Infer constant and stochastic, time-dependent parameters to consider intrinsic stochasticity of a dynamic model and/or to analyze model structure modifications that could reduce model deficits. The concept is based on inferring time-dependent parameters as stochastic processes in the form of Ornstein-Uhlenbeck processes jointly with inferring constant model parameters and parameters of the Ornstein-Uhlenbeck processes. The package also contains functions to sample from and calculate densities of Ornstein-Uhlenbeck processes. References: Tomassini, L., Reichert, P., Kuensch, H.-R., Buser, C., Knutti, R. and Borsuk, M.E. (2009), A smoothing algorithm for estimating stochastic, continuous-time model parameters and its application to a simple climate model, Journal of the Royal Statistical Society: Series C (Applied Statistics) 58, 679-704, <doi:10.1111/j.1467-9876.2009.00678.x>; Reichert, P., and Mieleitner, J. (2009), Analyzing input and structural uncertainty of nonlinear dynamic models with stochastic, time-dependent parameters. Water Resources Research, 45, W10402, <doi:10.1029/2009WR007814>; Reichert, P., Ammann, L. and Fenicia, F. (2021), Potential and challenges of investigating intrinsic uncertainty of hydrological models with time-dependent, stochastic parameters. Water Resources Research 57(8), e2020WR028311, <doi:10.1029/2020WR028311>; Reichert, P. (2022), timedeppar: An R package for inferring stochastic, time-dependent model parameters, in preparation.
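Sampling from an Ornstein-Uhlenbeck process can be sketched in Python with the exact one-step transition on a regular time grid. The parameterization by asymptotic mean, asymptotic standard deviation `sd`, and correlation time `tau` is an assumption made here for illustration, as is the function name; the package's own interface may differ.

```python
import math
import random


def simulate_ou(x0, mean, sd, tau, dt, n_steps, seed=0):
    """Simulate an Ornstein-Uhlenbeck path with the exact Gaussian transition.

    Over one step the deviation from the mean decays by exp(-dt / tau) and
    fresh Gaussian noise is added so that the stationary standard deviation
    stays equal to sd.
    """
    rng = random.Random(seed)
    decay = math.exp(-dt / tau)
    noise_sd = sd * math.sqrt(1.0 - decay * decay)
    path = [x0]
    for _ in range(n_steps):
        step = mean + (path[-1] - mean) * decay + noise_sd * rng.gauss(0.0, 1.0)
        path.append(step)
    return path
```

Because the transition is exact, the step size `dt` affects only the resolution of the path, not its distribution at the grid points.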
Efficiently implementing two complementary methodologies for discovering motifs in functional data: ProbKMA and FunBIalign. Cremona and Chiaromonte (2023) "Probabilistic K-means with Local Alignment for Clustering and Motif Discovery in Functional Data" <doi:10.1080/10618600.2022.2156522> is a probabilistic K-means algorithm that leverages local alignment and fuzzy clustering to identify recurring patterns (candidate functional motifs) across and within curves, allowing different portions of the same curve to belong to different clusters. It includes a family of distances and a normalization to discover various motif types and learns motif lengths in a data-driven manner. It can also be used for local clustering of misaligned data. Di Iorio, Cremona, and Chiaromonte (2023) "funBIalign: A Hierarchical Algorithm for Functional Motif Discovery Based on Mean Squared Residue Scores" <doi:10.48550/arXiv.2306.04254> applies hierarchical agglomerative clustering with a functional generalization of the Mean Squared Residue Score to identify motifs of a specified length in curves. This deterministic method includes a small set of user-tunable parameters. Both algorithms are suitable for single curves or sets of curves. The package also includes a flexible function to simulate functional data with embedded motifs, allowing users to generate benchmark datasets for validating and comparing motif discovery methods.
'Monolix' is a tool for running mixed effects models using 'saem'. This tool allows you to convert Monolix models to rxode2 (Wang, Hallow and James (2016) <doi:10.1002/psp4.12052>) using the form compatible with nlmixr2 (Fidler et al (2019) <doi:10.1002/psp4.12445>). If available, the rxode2 model will read in the Monolix data and compare the simulation for the population model, individual model and residual model to immediately show how well the translation is performing. This saves model development time for people who are creating an rxode2 model manually. Additionally, this package reads in all the information to allow simulation with uncertainty (that is, the number of observations, the number of subjects, and the covariance matrix) with an rxode2 model. This is complementary to the babelmixr2 package that translates nlmixr2 models to Monolix and can convert the objects converted from monolix2rx to a full nlmixr2 fit. While not required, you can get/install the lixoftConnectors package in the Monolix installation, as described at the following url <https://monolixsuite.slp-software.com/r-functions/2024R1/installation-and-initialization>. When lixoftConnectors is available, Monolix can be used to load its model library instead of manually setting up text files (which only works with old versions of 'Monolix').
Ports the Stata ado package tost which provides a suite of commands to perform two one-sided tests for equivalence following the approach by Schuirmann (1987) <doi:10.1007/BF01068419>. Commands are provided for t tests on means, z tests on proportions, McNemar's test (1947) <doi:10.1007/BF02295996> on proportions and related tests, tests on the regression coefficients from OLS linear regression (not yet implementing all of the current regression options from the Stata tostregress command, e.g., survey regression options, estimation options, etc.), Wilcoxon's (1945) <doi:10.2307/3001968> signed rank tests, Wilcoxon-Mann-Whitney (1947) <doi:10.1214/aoms/1177730491> rank sum tests, supporting inference about equivalence for a number of paired and unpaired, parametric and nonparametric study designs and data types. Each command tests a null hypothesis that samples were drawn from populations different by at least plus or minus some researcher-defined level of tolerance, which can be defined in terms of units of the data or rank units (Delta), or in units of the test statistic's distribution (epsilon) except for tost.rrp() and tost.rrpi(). Enough evidence rejects this null hypothesis in favor of equivalence within the tolerance. Equivalence intervals for all tests may be defined symmetrically or asymmetrically.
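The logic of two one-sided tests can be sketched in Python as follows. This is a simplified normal-approximation version with a tolerance expressed in data units. The function name and arguments are hypothetical, and the package's t-based, rank-based, and regression commands are not reproduced here.

```python
from statistics import NormalDist


def tost_z(estimate, se, lower, upper, alpha=0.05):
    """Two one-sided z tests for equivalence within [lower, upper].

    The null hypothesis is that the true difference lies at or beyond one of
    the bounds; equivalence is concluded only if BOTH one-sided tests reject.
    Returns the overall p-value (the larger of the two) and the decision.
    """
    std_normal = NormalDist()
    # H0: difference <= lower, rejected when the estimate is well above lower.
    p_lower = 1.0 - std_normal.cdf((estimate - lower) / se)
    # H0: difference >= upper, rejected when the estimate is well below upper.
    p_upper = std_normal.cdf((estimate - upper) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha
```

Passing bounds such as `lower=-1.0, upper=2.0` gives an asymmetric equivalence interval, while `lower=-d, upper=d` gives the symmetric case.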
This package contains functions for hidden Markov models with observations having extra zeros as defined in the following two publications, Wang, T., Zhuang, J., Obara, K. and Tsuruoka, H. (2016) <doi:10.1111/rssc.12194>; Wang, T., Zhuang, J., Buckby, J., Obara, K. and Tsuruoka, H. (2018) <doi:10.1029/2017JB015360>. The observed response variable is either univariate or bivariate Gaussian conditioning on presence of events, and extra zeros mean that the response variable takes on the value zero if nothing is happening. Hence the response is modelled as a mixture distribution of a Bernoulli variable and a continuous variable. That is, if the Bernoulli variable takes on the value 1, then the response variable is Gaussian, and if the Bernoulli variable takes on the value 0, then the response is zero too. This package includes functions for simulation, parameter estimation, goodness-of-fit, the Viterbi algorithm, and plotting the classified 2-D data. Some of the functions in the package are based on those of the R package HiddenMarkov by David Harte. This updated version has included an example dataset and R code examples to show how to transform the data into the objects needed in the main functions. We have also made changes to increase the speed of some of the functions.
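The mixture structure of the response can be illustrated by a small Python simulation of a univariate hidden Markov model with zero-inflated Gaussian emissions. All names and the parameter layout (`trans`, `p_event`, `means`, `sds`) are assumptions for this sketch and are unrelated to the package's functions.

```python
import random


def simulate_zero_inflated_hmm(n, trans, p_event, means, sds, seed=0):
    """Simulate hidden states and zero-inflated Gaussian observations.

    trans[i][j] is the probability of moving from state i to state j;
    p_event[i] is the Bernoulli probability that an event occurs in state i.
    When no event occurs the observation is exactly zero.
    """
    rng = random.Random(seed)
    n_states = len(trans)
    state = 0  # assumption: the chain starts in state 0
    states, observations = [], []
    for _ in range(n):
        states.append(state)
        if rng.random() < p_event[state]:
            observations.append(rng.gauss(means[state], sds[state]))
        else:
            observations.append(0.0)
        state = rng.choices(range(n_states), weights=trans[state])[0]
    return states, observations
```

With two states that differ mainly in `p_event`, the simulated series alternates between quiet stretches of zeros and active stretches of Gaussian values.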
This package provides functions for recalling AQL (Acceptable Quality Level or Acceptance Quality Level) based single, double, and multiple attribute sampling plans from the Military Standard (MIL-STD-105E) - American National Standards Institute/American Society for Quality (ANSI/ASQ Z1.4) tables and for retrieving variable sampling plans from Military Standard (MIL-STD-414) - American National Standards Institute/American Society for Quality (ANSI/ASQ Z1.9) tables. The sources for these tables are listed in the URL field. Also included are functions for computing the OC (Operating Characteristic) and ASN (Average Sample Number) coordinates for the attribute plans it recalls, and functions for computing the estimated proportion nonconforming and the maximum allowable proportion nonconforming for variable sampling plans. The MIL-STD AQL Sampling schemes were the most used and copied set of standards in the world. They are intended to be used for sampling a stream of lots, and were used in contract agreements between supplier and customer companies. When the US military dropped support of MIL-STD 105E and 414, the American National Standards Institute (ANSI) and the International Standards Organization (ISO) adopted the standard with few or no changes to the central tables. This package is useful because its computer implementation of these tables duplicates that available in other commercial software and subscription online calculators.
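For a single attribute sampling plan, an OC coordinate is the probability of accepting a lot at a given proportion nonconforming. The Python sketch below computes it under a binomial model. The function name is hypothetical, and the plan values in the usage note are illustrative rather than taken from the standard's tables.

```python
from math import comb


def acceptance_probability(n, c, p):
    """Probability of accepting a lot under a single sampling plan.

    n: sample size, c: acceptance number, p: proportion nonconforming.
    The lot is accepted when the sample contains at most c nonconforming items.
    """
    return sum(comb(n, d) * p**d * (1 - p) ** (n - d) for d in range(c + 1))
```

Evaluating `acceptance_probability(80, 2, p)` over a grid of `p` values traces the OC curve of a plan with sample size 80 and acceptance number 2.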
Motivation: The understanding of cancer mechanism requires the identification of genes playing a role in the development of the pathology and the characterization of their role (notably oncogenes and tumor suppressors). Results: We present an R/bioconductor package called MoonlightR which returns a list of candidate driver genes for specific cancer types on the basis of TCGA expression data. The method first infers gene regulatory networks and then carries out a functional enrichment analysis (FEA) (implementing an upstream regulator analysis, URA) to score the importance of well-known biological processes with respect to the studied cancer type. Eventually, by means of random forests, MoonlightR predicts two specific roles for the candidate driver genes: i) tumor suppressor genes (TSGs) and ii) oncogenes (OCGs). As a consequence, this methodology does not only identify genes playing a dual role (e.g. TSG in one cancer type and OCG in another) but also helps in elucidating the biological processes underlying their specific roles. In particular, MoonlightR can be used to discover OCGs and TSGs in the same cancer type. This may help in answering the question whether some genes change role between early stages (I, II) and late stages (III, IV) in breast cancer. In the future, this analysis could be useful to determine the causes of different resistances to chemotherapeutic treatments.
Supervised, multivariate, and non-parametric discretization algorithm based on tree ensembles learning and moment matching optimization. This version of the algorithm relies on the random forest algorithm to learn a large set of split points that conserves the relationship between attributes and the target class, and on moment matching optimization to transform this set into a reduced number of cut points matching as well as possible the statistical properties of the initial set of split points. For each attribute to be discretized, the set S of its related split points extracted through random forest is mapped to a reduced set C of cut points of size k. This mapping relies on minimizing, for each continuous attribute to be discretized, the distance between the first four moments of S and the first four moments of C subject to some constraints. This non-linear optimization problem is solved using k values ranging from 2 to 'max_splits', and the best solution returned corresponds to the value of k whose optimum is the lowest over the different realizations. ForestDisc is a generalization of the RFDisc discretization method initially proposed by Berrado and Runger (2009) <doi:10.1109/AICCSA.2009.5069327>, and improved by Berrado et al. in 2012 by adopting the idea of moment matching optimization described by Hoyland and Wallace (2001) <doi:10.1287/mnsc.47.2.295.9834>.
This package implements the softmax aggregation method for calculating Plant Stress Response Index (PSRI) from time-series germination data under environmental stressors including prions, xenobiotics, osmotic stress, heavy metals, and chemical contaminants. Provides zero-robust PSRI computation through adaptive softmax weighting of germination components (Maximum Stress-adjusted Germination, Maximum Rate of Germination, complementary Mean Time to Germination, and Radicle Vigor Score), eliminating the zero-collapse failure mode of the geometric mean approach implemented in 'PSRICalc'. Includes perplexity-based temperature parameter calibration and modular component functions for transparent germination analysis. Built on the methodological foundation of the Osmotic Stress Response Index (OSRI) framework developed by Walne et al. (2020) <doi:10.1002/agg2.20087>. Note: This package implements methodology currently under peer review. Please contact the author before publication using this approach. Development followed an iterative human-machine collaboration where all algorithmic design, statistical methodologies, and biological validation logic were conceptualized, tested, and iteratively refined by Richard A. Feiss through repeated cycles of running experimental data, evaluating analytical outputs, and selecting among candidate algorithms and approaches. AI systems (Anthropic Claude and OpenAI GPT) served as coding assistants and analytical sounding boards under continuous human direction. The selection of statistical methods, evaluation of biological plausibility, and all final methodology decisions were made by the human author. AI systems did not independently originate algorithms, statistical approaches, or scientific methodologies.
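The zero-collapse problem can be seen in a generic Python sketch that contrasts a geometric mean with a softmax-weighted aggregate. The weighting formula, the `temperature` parameter, and the function names are assumptions chosen for illustration. This is not the package's PSRI formula, which the description does not spell out.

```python
import math


def geometric_mean(components):
    """Geometric mean; collapses to zero if any component is zero."""
    return math.prod(components) ** (1.0 / len(components))


def softmax_aggregate(components, temperature=0.5):
    """Weighted sum of components with softmax weights exp(c / T) / sum.

    A zero component lowers the result but cannot force it to zero.
    """
    peak = max(components)
    # Subtract the maximum before exponentiating for numerical stability.
    raw = [math.exp((c - peak) / temperature) for c in components]
    total = sum(raw)
    return sum((w / total) * c for w, c in zip(raw, components))
```

For components such as `[0.8, 0.0, 0.6]` the geometric mean is exactly zero, whereas the softmax aggregate remains positive.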
Creating and refining data nuggets. Data nuggets reduce a large dataset into a small collection of nuggets of data, each containing a center (location), weight (importance), and scale (variability) parameter. Data nugget centers are created by choosing observations in the dataset which are as equally spaced apart as possible. Data nugget weights are created by counting the number of observations closest to a given data nugget center. We then say the data nugget contains these observations and the data nugget center is recalculated as the mean of these observations. Data nugget scales are created by calculating the trace of the covariance matrix of the observations contained within a data nugget divided by the dimension of the dataset. Data nuggets are refined by splitting data nuggets which have scales or shapes (defined as the ratio of the two largest eigenvalues of the covariance matrix of the observations contained within the data nugget) Reference papers: [1] Beavers, T. E., Cheng, G., Duan, Y., Cabrera, J., Lubomirski, M., Amaratunga, D., & Teigler, J. E. (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure. Journal of Computational and Graphical Statistics, 1-21. [2] Cherasia, K. E., Cabrera, J., Fernholz, L. T., & Fernholz, R. (2022). Data Nuggets in Supervised Learning. In Robust and Multivariate Statistical Methods: Festschrift in Honor of David E. Tyler (pp. 429-449). Cham: Springer International Publishing.
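The center, weight, and scale definitions above can be sketched in Python as follows. The function name, the use of squared Euclidean distance, and the use of the sample (n - 1) covariance are assumptions for this illustration. The selection of the initial centers and the refinement step are not shown.

```python
import statistics


def build_nuggets(data, initial_centers):
    """Assign observations to their nearest initial center, then summarize.

    Each nugget gets: weight (number of observations it contains), center
    (mean of those observations) and scale (trace of their covariance
    matrix divided by the data dimension).
    """
    dim = len(data[0])
    members = [[] for _ in initial_centers]
    for x in data:
        nearest = min(
            range(len(initial_centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, initial_centers[k])),
        )
        members[nearest].append(x)
    nuggets = []
    for points in members:
        if not points:
            continue  # no observation was closest to this center
        weight = len(points)
        center = [sum(column) / weight for column in zip(*points)]
        # The trace of the covariance matrix is the sum of per-dimension variances.
        if weight > 1:
            trace = sum(statistics.variance(column) for column in zip(*points))
        else:
            trace = 0.0
        nuggets.append({"center": center, "weight": weight, "scale": trace / dim})
    return nuggets
```

The weights of the resulting nuggets always sum to the number of observations in the original dataset.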
Genomic signatures represent unique features within a species' DNA, enabling the differentiation of species and offering broad applications across various fields. This package provides essential tools for calculating these specific signatures, streamlining the process for researchers and offering a comprehensive and time-saving solution for genomic analysis. The amino acid contents are identified based on the work published by Sandberg et al. (2003) <doi:10.1016/s0378-1119(03)00581-x> and Xiao et al. (2015) <doi:10.1093/bioinformatics/btv042>. The Average Mutual Information Profiles (AMIP) values are calculated based on the work of Bauer et al. (2008) <doi:10.1186/1471-2105-9-48>. The Chaos Game Representation (CGR) plot visualization is based on the work of Deschavanne et al. (1999) <doi:10.1093/oxfordjournals.molbev.a026048> and Jeffrey et al. (1990) <doi:10.1093/nar/18.8.2163>. The GC content is calculated based on the work published by Nakabachi et al. (2006) <doi:10.1126/science.1134196> and Barbu et al. (1956) <https://pubmed.ncbi.nlm.nih.gov/13363015>. The Oligonucleotide Frequency Derived Error Gradient (OFDEG) values are computed based on the work published by Saeed et al. (2009) <doi:10.1186/1471-2164-10-S3-S10>. The Relative Synonymous Codon Usage (RSCU) values are calculated based on the work published by Elek (2018) <https://urn.nsk.hr/urn:nbn:hr:217:686131>.
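Two of the signatures mentioned above are simple enough to sketch in Python: GC content and the point sequence of a Chaos Game Representation. The function names and the corner layout (A bottom-left, C top-left, G top-right, T bottom-right) are assumptions for this illustration. The package's own functions and conventions may differ.

```python
def gc_content(sequence):
    """Fraction of G and C bases among the A, C, G, T bases of a sequence."""
    seq = sequence.upper()
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


# Assumed corner layout of the unit square for the chaos game.
CGR_CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Chaos Game Representation: start at the center of the unit square and
    move halfway toward the corner of each successive nucleotide."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        if base not in CGR_CORNERS:
            continue  # skip ambiguous symbols such as N
        corner_x, corner_y = CGR_CORNERS[base]
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points
```

Plotting the returned points as a scatter plot gives the familiar fractal-like CGR image for long sequences.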
Assists in automating the selection of terms to include in mixed models when asreml is used to fit the models. Procedures are available for choosing models that conform to the hierarchy or marginality principle, for fitting and choosing between two-dimensional spatial models using correlation, natural cubic smoothing spline and P-spline models. A history of the fitting of a sequence of models is kept in a data frame. Also used to compute functions and contrasts of, to investigate differences between and to plot predictions obtained using any model fitting function. The content falls into the following natural groupings: (i) Data, (ii) Model modification functions, (iii) Model selection and description functions, (iv) Model diagnostics and simulation functions, (v) Prediction production and presentation functions, (vi) Response transformation functions, (vii) Object manipulation functions, and (viii) Miscellaneous functions (for further details see asremlPlus-package in help). The asreml package provides a computationally efficient algorithm for fitting a wide range of linear mixed models using Residual Maximum Likelihood. It is a commercial package and a license for it can be purchased from VSNi <https://vsni.co.uk/> as 'asreml-R', who will supply a zip file for local installation/updating (see <https://asreml.kb.vsni.co.uk/>). It is not needed for functions that are methods for alldiffs and data.frame objects. The package asremlPlus can also be installed from <http://chris.brien.name/rpackages/>.