An experiment data package to supplement the preciseTAD package, containing pre-trained models and the variable importances of each genomic annotation used to build the models, parsed into list objects and available through ExperimentHub. In total, preciseTADhub provides access to n=84 random forest classification models optimized to predict TAD/chromatin loop boundary regions and stored as .RDS files. The value n comes from the fact that we considered l=2 cell lines (GM12878, K562), g=2 ground truth boundaries (Arrowhead, Peakachu), and c=21 autosomal chromosomes (CHR1, CHR2, ..., CHR22, omitting CHR9). Furthermore, each object is itself a two-item list containing: (1) the model object, and (2) the variable importances for CTCF, RAD21, SMC3, and ZNF143 used to predict boundary regions. Each model is trained via a "holdout" strategy, in which data from chromosomes CHR1, CHR2, ..., CHRi-1, CHRi+1, ..., CHR22 were used to build the model and the ith chromosome was reserved for testing. See https://doi.org/10.1101/2020.09.03.282186 for more detail on the model building strategy.
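A minimal sketch of retrieving one of these resources via ExperimentHub and unpacking its two elements; the query string is taken from the package name above, while the indexing shown is illustrative rather than a specific record ID.
  # Sketch: pull one preciseTADhub resource from ExperimentHub (requires internet access)
  library(ExperimentHub)
  eh <- ExperimentHub()
  hits <- query(eh, "preciseTADhub")   # lists the stored .RDS resources
  obj <- hits[[1]]                     # download/load the first matching resource (illustrative index)
  model <- obj[[1]]                    # (1) the random forest model object
  importances <- obj[[2]]              # (2) variable importances for CTCF, RAD21, SMC3, ZNF143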
Package providing a number of functions for working with Two- and Four-Parameter Beta and closely related distributions (i.e., the Gamma, Binomial, and Beta-Binomial distributions). Includes, among other things: - d/p/q/r functions for Four-Parameter Beta distributions and Generalized "Binomial" (continuous) distributions, and d/p/r functions for Beta-Binomial distributions. - d/p/q/r functions for Two- and Four-Parameter Beta distributions parameterized in terms of their means and variances rather than their shape parameters. - Moment generating functions for Binomial distributions, Beta-Binomial distributions, and observed-value distributions. - Functions for estimating classification accuracy and consistency, making use of the Classical Test Theory based Livingston and Lewis (L&L) and Hanson and Brennan approaches. A shiny app is available, providing a GUI for the L&L approach when used for binary classifications; for the URL to the app, see the documentation for the LL.CA() function. Livingston and Lewis (1995) <doi:10.1111/j.1745-3984.1995.tb00462.x>. Lord (1965) <doi:10.1007/BF02289490>. Hanson (1991) <https://files.eric.ed.gov/fulltext/ED344945.pdf>.
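As a point of reference for what "Four-Parameter Beta" means here, a minimal base-R sketch of its density via a location-scale shift of the standard Beta onto the interval [l, u]; the helper name dBeta4P_sketch is purely illustrative, and the package's own d/p/q/r functions should be used in practice.
  # Four-parameter Beta density on [l, u] from the standard Beta on [0, 1]
  dBeta4P_sketch <- function(x, l, u, alpha, beta) {
    dbeta((x - l) / (u - l), alpha, beta) / (u - l)   # Jacobian of the rescaling is 1/(u - l)
  }
  dBeta4P_sketch(0.55, l = 0, u = 1.25, alpha = 5, beta = 3)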
Analyzes non-normal data via the Multiple Comparison Procedures and Modeling approach (MCP-Mod). Many functions rely on the DoseFinding package. This package removes the need for the user to provide or calculate the mu vector and S matrix: the user typically supplies the data in raw form, and this package calculates the needed objects and passes them to the DoseFinding functions. If the user wishes to primarily use the functions provided in the DoseFinding package, a single function, prepareGen(), will provide mu and S. The package currently handles power analysis and the MCP-Mod procedure for negative binomial, Poisson, and binomial data. The MCP-Mod procedure can also be applied to survival data, but power analysis is not available. Bretz, F., Pinheiro, J. C., and Branson, M. (2005) <doi:10.1111/j.1541-0420.2005.00344.x>. Buckland, S. T., Burnham, K. P. and Augustin, N. H. (1997) <doi:10.2307/2533961>. Pinheiro, J. C., Bornkamp, B., Glimm, E. and Bretz, F. (2014) <doi:10.1002/sim.6052>.
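A minimal sketch of the mu/S idea that the package automates for generalized MCP-Mod: fit a dose-as-factor generalized linear model and take the estimated dose-group coefficients and their covariance. The data frame dat with columns resp and dose is hypothetical; prepareGen() wraps this kind of computation.
  # Hypothetical count data: one coefficient per dose level (log scale for Poisson)
  fit <- glm(resp ~ factor(dose) + 0, family = poisson, data = dat)
  mu <- coef(fit)   # estimated dose-group means (on the link scale)
  S  <- vcov(fit)   # their estimated covariance matrix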
Often, data are not available for every date: after 1st January 2011, the next observation may not appear until 20th January 2011, and the available dates themselves may carry zero values. Gather all such series in different sheets of a single Excel file; every sheet must contain two columns (dates in the first, values in the second). After loading all the sheets into different elements of a list, this package fills the date gaps for all the sheets and marks the corresponding values as zeros (daily data is assumed here). Finally, it combines all the filled results into one data frame (the first column is the date and the other columns hold the corresponding values of the sheets), so the number of columns in the data frame is the number of sheets plus one. Imputation is then performed. Conversion from daily to monthly or weekly data is also possible. More details can be found in Garai and others (2023) <doi:10.13140/RG.2.2.11977.42087>.
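A minimal base-R sketch of the gap-filling step for a single two-column sheet (date, value), filling missing days with zeros; the package applies this idea across all sheets before merging and imputing, and the object names here are illustrative.
  # sheet: data frame with columns date (Date class) and value, possibly with gaps
  all_days <- data.frame(date = seq(min(sheet$date), max(sheet$date), by = "day"))
  filled <- merge(all_days, sheet, by = "date", all.x = TRUE)
  filled$value[is.na(filled$value)] <- 0   # mark the gap dates as zeros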
The ordinal forest (OF) method allows ordinal regression with high-dimensional and low-dimensional data. After an OF prediction rule has been constructed using a training dataset, it can be used to predict the values of the ordinal target variable for new observations. Moreover, by means of the (permutation-based) variable importance measure of OF, it is also possible to rank the covariates with respect to their importance in the prediction of the values of the ordinal target variable. OF is presented in Hornung (2020). NOTE: Starting with package version 2.4, it is also possible to obtain class probability predictions in addition to the class point predictions. Moreover, the variable importance values can also be based on the class probability predictions. Preliminary results indicate that this might lead to a better discrimination between influential and non-influential covariates. The main functions of the package are: ordfor() (construction of OF) and predict.ordfor() (prediction of the target variable values of new observations). References: Hornung R. (2020) Ordinal Forests. Journal of Classification 37, 4-17. <doi:10.1007/s00357-018-9302-x>.
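A minimal usage sketch; the argument name depvar, the returned element varimp, and the data frame names are assumptions based on the documented interface, so check ?ordfor before relying on them.
  library(ordinalForest)
  # train_df: ordered factor 'grade' plus covariates; test_df: new observations (both hypothetical)
  of <- ordfor(depvar = "grade", data = train_df)   # construct the ordinal forest
  pr <- predict(of, newdata = test_df)              # class point predictions (class probabilities from v2.4)
  sort(of$varimp, decreasing = TRUE)                # permutation-based variable importance ranking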
Merging data from multiple sources is a relevant approach for comprehensively evaluating complex systems. However, the inherent problems encountered when analyzing single tables are amplified with the generation of multi-block datasets, and finding the relationships between data layers of increasing complexity constitutes a challenging task. For that purpose, a generic methodology is proposed by combining the strengths of established data analysis strategies, i.e. multi-block approaches and the Orthogonal Partial Least Squares (OPLS) framework, to provide an efficient tool for the fusion of data obtained from multiple sources. The package enables quick and efficient implementation of the consensus OPLS model for any horizontal multi-block data structure (observation-based matching). Moreover, it offers an interesting range of metrics and graphics to help determine the optimal number of components and check the validity of the model through permutation tests. Interpretation tools include score and loading plots, Variable Importance in Projection (VIP), a predict functionality for SHAP computation, and performance coefficients such as R2, Q2, and DQ2. J. Boccard and D.N. Rutledge (2013) <doi:10.1016/j.aca.2013.01.022>.
An approach to identify metabolic biomarker signatures for metabolic data by discovering predictive metabolites for predicting survival and classifying patients into risk groups. Classifiers are constructed as a linear combination of predictive/important metabolites, prognostic factors, and treatment effects if necessary. Several methods were implemented to reduce the metabolomics matrix, such as the principal component analysis of Wold Svante et al. (1987) <doi:10.1016/0169-7439(87)80084-9>, the LASSO method by Robert Tibshirani (1998) <doi:10.1002/(SICI)1097-0258(19970228)16:4%3C385::AID-SIM380%3E3.0.CO;2-3>, and the elastic net approach by Hui Zou and Trevor Hastie (2005) <doi:10.1111/j.1467-9868.2005.00503.x>. Sensitivity analysis on the quantile used for the classification can also be performed to check the deviation of the classification groups based on the quantile specified. Large-scale cross-validation can be performed in order to investigate the most frequently selected predictive metabolites and for internal validation. During the evaluation process, validation is assessed using the hazard ratio (HR) distribution of the test set, and inference is mainly based on resampling and permutation techniques.
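A conceptual base-R sketch of the classification step described above (risk score as a linear combination of selected metabolites, dichotomized at a chosen quantile); this is not the package's API, and X and beta are hypothetical objects.
  # X: matrix of selected metabolite levels; beta: their estimated coefficients
  risk_score <- as.vector(X %*% beta)
  cutoff <- quantile(risk_score, probs = 0.5)                # median split; vary the quantile for sensitivity analysis
  risk_group <- ifelse(risk_score >= cutoff, "high", "low")  # patient risk groups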
The Taylor Russell model is a widely used method for assessing test validity in personnel selection tasks. The three functions in this package extend this model in a number of notable ways. TR() estimates test validity for a single selection test via the original Taylor Russell model. It extends this model by allowing users greater flexibility in argument choice. For example, users can specify any three of the four parameters (base rate, selection ratio, criterion validity, and positive predictive value) of the Taylor Russell model and estimate the remaining parameter (see the help file for examples). The TaylorRussell() function generalizes the original Taylor Russell model to allow for multiple selection tests (predictors). To our knowledge, this is the first generalization of the Taylor Russell model to allow for three or more selection tests (it is also the first to correctly handle models with two selection tests). TRDemo() is a shiny program for illustrating the underlying logic of the Taylor Russell model. Taylor, HC and Russell, JT (1939) "The relationship of validity coefficients to the practical effectiveness of tests in selection: Discussion and tables" <doi:10.1037/h0057079>.
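For intuition, a base-R sketch of the underlying single-test model: with a bivariate-normal predictor/criterion of correlation r, the positive predictive value follows from the selection ratio and base rate (uses the mvtnorm package; the parameter values are illustrative, and TR() itself should be used for estimation).
  library(mvtnorm)
  r <- 0.5; SR <- 0.2; BR <- 0.3                     # criterion validity, selection ratio, base rate
  x_cut <- qnorm(1 - SR); y_cut <- qnorm(1 - BR)     # cut scores on predictor and criterion
  joint <- pmvnorm(lower = c(x_cut, y_cut), upper = c(Inf, Inf),
                   corr = matrix(c(1, r, r, 1), 2))  # P(selected and successful)
  PPV <- as.numeric(joint) / SR                      # proportion of selected applicants who succeed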
Hybrid control design is a way to borrow information from external controls to augment concurrent controls in a randomized controlled trial and is expected to overcome the feasibility issue when adequate randomized controlled trials cannot be conducted. A major challenge in the hybrid control design is its inability to eliminate a prior-data conflict caused by systematic imbalances in measured or unmeasured confounding factors between patients in the concurrent treatment/control group and external controls. To prevent the prior-data conflict, a combined use of propensity score matching and Bayesian commensurate prior has been proposed in the context of hybrid control design. The propensity score matching is first performed to guarantee the balance in baseline characteristics, and then the Bayesian commensurate prior is constructed while discounting the information based on the similarity in outcomes between the concurrent and external controls. psBayesborrow is a package to implement the propensity score matching and the Bayesian analysis with commensurate prior, as well as to conduct a simulation study to assess operating characteristics of the hybrid control design, where users can choose design parameters in flexible and straightforward ways depending on their own application.
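A conceptual base-R sketch of the propensity-score step described above (logistic regression for concurrent vs. external controls, then greedy nearest-neighbour matching); this is not psBayesborrow's API, and the data frame dat with columns concurrent, x1, and x2 is hypothetical.
  # dat: controls only, with indicator concurrent (1 = concurrent, 0 = external) and baseline covariates x1, x2
  ps <- fitted(glm(concurrent ~ x1 + x2, family = binomial, data = dat))
  conc <- which(dat$concurrent == 1); ext <- which(dat$concurrent == 0)
  matched_ext <- sapply(conc, function(i) ext[which.min(abs(ps[ext] - ps[i]))])  # 1:1 greedy match (with replacement)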
The harmonic mean p-value (HMP) test combines p-values and corrects for multiple testing while controlling the strong-sense family-wise error rate. It is more powerful than common alternatives, including the Bonferroni and Simes procedures, when combining large proportions of all the p-values, at the cost of slightly lower power when combining small proportions of all the p-values. It is more stringent than controlling the false discovery rate, and possesses theoretical robustness to positive correlations between tests and unequal weights. It is a multi-level test in the sense that a superset of one or more significant tests is certain to be significant and, conversely, when the superset is non-significant, the constituent tests are certain to be non-significant. It is based on MAMML (model averaging by mean maximum likelihood), a frequentist analogue to Bayesian model averaging, and is theoretically grounded in the generalized central limit theorem. For detailed examples, type vignette("harmonicmeanp") after installation. Version 3.0 addresses errors in versions 1.0 and 2.0 that led function p.hmp to control the familywise error rate only in the weak sense, rather than the strong sense as intended.
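For orientation, the raw (unweighted) HMP statistic is simply the harmonic mean of the p-values; a base-R sketch follows, with the caveat that a valid significance assessment should use the package's asymptotically exact p.hmp() rather than comparing this raw statistic directly to alpha.
  p <- c(0.03, 0.20, 0.45, 0.004)        # illustrative p-values
  hmp_raw <- length(p) / sum(1 / p)      # harmonic mean of the p-values
  # library(harmonicmeanp); p.hmp(p, L = length(p))   # recommended: asymptotically exact combined test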
This data package contains the Item Response Theory (IRT) parameters for the National Center for Education Statistics (NCES) items used on the National Assessment of Education Progress (NAEP) from 1990 to 2015. The values in these tables are used along with NAEP data to turn student item responses into scores and include information about item difficulty, discrimination, and guessing parameters for 3-parameter logit (3PL) items. Parameters for Generalized Partial Credit Model (GPCM) items are also included. The adjustments table contains the information regarding the treatment of items (e.g., deletion of an item or a collapsing of response categories) when these items did not appear to fit the item response models used to describe the NAEP data. Transformation constants change the score estimates that are obtained from the IRT scaling program to the NAEP reporting metric. Values from the years 2000-2013 were taken from the NCES website <https://nces.ed.gov/nationsreportcard/> and values from 1990-1998 and 2015 were extracted from their NAEP data files. All subtest names were reduced and homogenized to one word (e.g. "Reading to gain information" became "information"). The various subtest names for univariate transformation constants were all homogenized to "univariate".
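To show how the stored parameters are used, a sketch of the 3PL item response function with discrimination a, difficulty b, and guessing c; the 1.7 scaling constant is the conventional normal-ogive approximation, so consult the NAEP technical documentation for the exact form used in scaling.
  # Probability of a correct response at ability theta under the 3PL model
  p_3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-1.7 * a * (theta - b)))
  p_3pl(theta = 0, a = 1.1, b = -0.3, c = 0.2)   # illustrative parameter values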
Facilitates the performance of several analyses, including simple and sequential path coefficient analysis, correlation estimation, and the drawing of correlograms, heatmaps, and path diagrams. When raw data that include one or more dependent variables along with one or more independent variables are available, path coefficient analysis can be conducted. It allows for testing direct effects, which can be a vital indicator in path coefficient analysis. The rules for preparing the dataset are explained in detail in the vignette file "Path.Analysis_manual.Rmd", which can be found in the folders labelled "data" and "~/inst/extdata". Also see: 1) the lavaan package, 2) a sample of sequential path analysis in metan suggested by Olivoto and Lúcio (2020) <doi:10.1111/2041-210X.13384>, 3) the simple PATHSAS macro written in SAS by Cramer et al. (1999) <doi:10.1093/jhered/90.1.260>, and 4) the semPlot() function of OpenMx as initial tools for conducting path coefficient analyses and SEM (Structural Equation Modeling). To gain a comprehensive understanding of path coefficient analysis, both in theory and practice, see the Minitab macro developed by Arminian, A. in the paper by Arminian et al. (2008) <doi:10.1080/15427520802043182>.
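A conceptual base-R sketch of simple path coefficient analysis: path coefficients are the standardized partial regression coefficients of the dependent variable on the predictors (the package's own functions additionally handle sequential paths, diagrams, and tests of direct effects). The data frame dat and its variable names are illustrative.
  # yield as dependent variable, traits t1..t3 as predictors (all standardized)
  d <- data.frame(scale(dat[, c("yield", "t1", "t2", "t3")]))
  path_coefs <- coef(lm(yield ~ t1 + t2 + t3, data = d))[-1]   # direct effects (path coefficients)
  cor(d)["yield", c("t1", "t2", "t3")]                         # total correlations to decompose into direct/indirect effects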
ChunkyPNG is a pure Ruby library that can read and write Portable Network Graphics (PNG) images without depending on an external image library. It tries to be memory efficient and reasonably fast. It has features such as:
Decoding support for any image that the PNG standard allows. This includes all standard color modes, all bit depths, all transparency, and interlacing and filtering options.
Encoding support for images of all color modes (true color, grayscale, and indexed) and transparency for all these color modes. The best color mode is chosen automatically, based on the amount of used colors.
Read/write access to the image's pixels.
Read/write access to all image metadata that is stored in chunks.
Memory efficiency: only fixnums are used, i.e. 4 or 8 bytes of memory per pixel, depending on the hardware.
Performance: ChunkyPNG is reasonably fast for Ruby standards, by only using integer math and a highly optimized saving routine.
Interoperability with RMagick.
ChunkyPNG is vulnerable to decompression bombs and can run out of memory when loading a specifically crafted PNG file. This is hard to fix in pure Ruby. Deal with untrusted images in a separate process, e.g., by using fork or a background processing library.
This uses a mixed integer mathematical programming (MIP) approach for building and solving multi-action planning problems, where the goal is to find an optimal combination of management actions that abate threats in an efficient way while accounting for spatial aspects, thus optimizing the connectivity and conservation effectiveness of the prioritized units and of the deployed actions. The package is capable of handling different commercial (gurobi, CPLEX) and non-commercial (symphony, CBC) MIP solvers. The gurobi optimization solver can be installed using the comprehensive instructions in the gurobi installation vignette of the prioritizr package (available at <https://prioritizr.net/articles/gurobi_installation_guide.html>). Alternatively, the CPLEX optimization solver can be obtained from the IBM CPLEX web page (<https://www.ibm.com/es-es/products/ilog-cplex-optimization-studio>). Additionally, the rcbc R package (available at <https://github.com/dirkschumacher/rcbc>) can be used to obtain solutions using the CBC optimization software (<https://github.com/coin-or/Cbc>). The methods used in the package refer to Salgado-Rojas et al. (2020) <doi:10.1016/j.ecolmodel.2019.108901>, Beyer et al. (2016) <doi:10.1016/j.ecolmodel.2016.02.005>, Cattarino et al. (2015) <doi:10.1371/journal.pone.0128027> and Watts et al. (2009) <doi:10.1016/j.envsoft.2009.06.005>. See the prioriactions website for more information, documentation, and examples.
This package provides tools to calculate stability indices with parametric, non-parametric, and probabilistic approaches. The basic data format required by toolStability is a data frame with 3 columns containing numeric trait values, genotype labels, and environment labels. The output of each function is a data frame with the chosen stability index for each genotype. The function "table_stability" offers a summary table of all stability indices in this package. This R package toolStability is part of the main publication: Wang, Casadebaig and Chen (2023) <doi:10.1007/s00122-023-04264-7>. The analysis pipeline for the main publication can be found on GitHub: <https://github.com/Illustratien/Wang_2023_TAAG>. The sample dataset in this package is derived from another publication: Casadebaig P, Zheng B, Chapman S et al. (2016) <doi:10.1371/journal.pone.0146385>. For detailed documentation of the dataset, please see Zenodo <doi:10.5281/zenodo.4729636>. Indices used in this package are from: Döring TF, Reckling M (2018) <doi:10.1016/j.eja.2018.06.007>. Eberhart SA, Russell WA (1966) <doi:10.2135/cropsci1966.0011183X000600010011x>. Eskridge KM (1990) <doi:10.2135/cropsci1990.0011183X003000020025x>. Finlay KW, Wilkinson GN (1963) <doi:10.1071/AR9630742>. Hanson WD (1970) Genotypic stability. <doi:10.1007/BF00285245>. Lin CS, Binns MR (1988). Nassar R, Hühn M (1987). Pinthus MJ (1973) <doi:10.1007/BF00021563>. Römer T (1917). Shukla GK (1972). Wricke G (1962).
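A sketch of the required input format and a call to table_stability(); the argument names for the trait, genotype, and environment columns are assumptions based on the description above, so check ?table_stability for the exact interface and any additional required arguments.
  library(toolStability)
  # Minimal data frame in the required 3-column format (values are illustrative)
  dat <- data.frame(Yield = rnorm(12, 5),
                    Genotype = rep(c("G1", "G2", "G3"), each = 4),
                    Environment = rep(c("E1", "E2", "E3", "E4"), times = 3))
  # Assumed argument names; verify against the package documentation
  table_stability(data = dat, trait = "Yield", genotype = "Genotype", environment = "Environment")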
Understanding morphological variation is an important task in many applications. Recent studies in computational biology have focused on developing computational tools for the task of sub-image selection, which aims at identifying structural features that best describe the variation between classes of shapes. A major part of assessing the utility of these approaches is to demonstrate their performance on both simulated and real datasets. However, when creating a model for shape statistics, real data can be difficult to access, and the sample sizes for these data are often small because they are expensive to collect. Meanwhile, the landscape of current shape simulation methods has been mostly limited to approaches that use black-box inference, making it difficult to systematically assess the power and calibration of sub-image models. In this R package, we introduce the alpha-shape sampler: a probabilistic framework for simulating realistic 2D and 3D shapes based on probability distributions which can be learned from real data or explicitly stated by the user. The ashapesampler package supports two mechanisms for sampling shapes in two and three dimensions. The first, empirically sampling based on an existing data set, was highlighted in the original main text of the paper. The second, probabilistic sampling from a known distribution, is the computational implementation of the theory derived in that paper. Work based on Winn-Nunez et al. (2024) <doi:10.1101/2024.01.09.574919>.
Tree-based classification and soft-clustering methods for preference rankings, with tools for external validation of fuzzy clustering and Kemeny-equivalent augmented unfolding. It contains the recursive partitioning algorithm for preference rankings, a non-parametric tree-based method for a matrix of preference rankings as the response variable. It also contains the distribution-free soft clustering method for preference rankings, namely the K-median cluster component analysis (CCA). The package depends on the ConsRank R package. Options for validating the tree-based method are both a test-set procedure and V-fold cross-validation. The package contains the routines to compute the adjusted concordance index (a fuzzy version of the adjusted Rand index) and the normalized degree of concordance (the corresponding fuzzy version of the Rand index). The package also contains routines to perform the Kemeny-equivalent augmented unfolding. The MDS engine is the function smacofSym from the package smacof. Essential references: D'Ambrosio, A., Vera, J.F., and Heiser, W.J. (2021) <doi:10.1080/00273171.2021.1899892>; D'Ambrosio, A., Amodio, S., Iorio, C., Pandolfo, G., and Siciliano, R. (2021) <doi:10.1007/s00357-020-09367-0>; D'Ambrosio, A., and Heiser, W.J. (2019) <doi:10.1007/s41237-018-0069-5>; D'Ambrosio, A., and Heiser W.J. (2016) <doi:10.1007/s11336-016-9505-1>; Hüllermeier, E., Rifqi, M., Henzgen, S., and Senge, R. (2012) <doi:10.1109/TFUZZ.2011.2179303>; Marden, J.J. <ISBN:0412995212>.
Current layout algorithms such as Kamada-Kawai do not take into consideration disjoint clusters in a network, often resulting in a high overlap among the clusters and a visual "hairball" that is often uninterpretable. The ExplodeLayout algorithm takes as input (1) an edge list of a unipartite or bipartite network, (2) node layout coordinates (x, y) generated by a layout algorithm such as Kamada-Kawai, (3) node cluster membership generated from a clustering algorithm such as modularity maximization, and (4) a radius to enable the node clusters to be "exploded" to reduce their overlap. The algorithm uses these inputs to generate new layout coordinates of the nodes which "explode" the clusters apart, such that the edge lengths within the clusters are preserved, while the edge lengths between clusters are recalculated. The modified network layout with nodes and edges is displayed in two dimensions. The user can experiment with different explode radii to generate a layout which has sufficient separation of clusters, while reducing the overall layout size of the network. This package is a basic version of an earlier package called epl (<https://github.com/UTMB-DIVA-Lab/epl>) that searched for an optimal explode radius and offered multiple ways to separate clusters in a network (Bhavnani et al. (2017) <https://pmc.ncbi.nlm.nih.gov/articles/PMC5543384/>). The example dataset is for a bipartite network, but the algorithm can also work for unipartite networks.
This package performs analyses and estimations of environmental covariates and genetic parameters related to selection strategies and the development of superior genotypes. It has two main functionalities: the first concerns prediction models of covariates and environmental processes, while the second deals with the estimation of genetic parameters and selection strategies. Designed for researchers and professionals in genetics and environmental sciences, the package combines statistical methods for modeling and data analysis. This includes the plastochron estimate proposed by Porta et al. (2024) <doi:10.1590/1807-1929/agriambi.v28n10e278299>, stress indices for genotype selection referenced by Ghazvini et al. (2024) <doi:10.1007/s10343-024-00981-1>, the Environmental Stress Index described by Tazzo et al. (2024) <https://revistas.ufg.br/vet/article/view/77035>, industrial quality indices of wheat genotypes (Szareski et al., 2019) <doi:10.4238/gmr18223>, ear index estimation (Rigotti et al., 2024) <doi:10.13083/reveng.v32i1.17394>, a selection index for protein and grain yield (de Pelegrin et al., 2017) <doi:10.4236/ajps.2017.813224>, estimation of the ISGR (Genetic Selection Index for Resilience) for environmental resilience (Bandeira et al., 2024) <https://www.cropj.com/Carvalho_18_12_2024_825_830.pdf>, estimation of the Leaf Area Index (Meira et al., 2015) <https://www.fag.edu.br/upload/revista/cultivando_o_saber/55d1ef202e494.pdf>, restriction of control variability (Carvalho et al., 2023) <doi:10.4025/actasciagron.v45i1.56156>, risk of disease occurrence in soybeans described by Engers et al. (2024) <doi:10.1007/s40858-024-00649-1>, and estimation of genetic parameters for selection based on balanced experiments (Yadav et al., 2024) <doi:10.1155/2024/9946332>.
This package provides tools to teach students elementary statistics. The main topics covered are descriptive statistics, probability models (discrete and continuous variables) and statistical inference (confidence intervals and hypothesis tests). One of the main advantages of this package is that it allows the user to read quite a variety of types of data files with one single command. Moreover, it includes shortcuts to simple but until now unavailable descriptive features in R, such as a complete frequency table or a histogram with the optimal number of intervals. Related to model distributions (both discrete and continuous), the package allows the student to easily plot the mass/density function, distribution function and quantile function just by detailing the known population parameters as input arguments. The inference-related tools are basically confidence intervals and hypothesis testing. Having defined independent commands for these two tools makes it easier for the student to understand what the software is performing, and it also helps the student to have a better knowledge of which specific tool they need to use in each situation. Moreover, the hypothesis testing commands provide not only the numeric result on the screen but also a very intuitive graph (which includes the statistic distribution, the observed value of the statistic, the rejection area and the p-value) that is very useful for the student to visualise the process. The regression section includes, up to now, a simple linear model; with one single command the student can obtain the numeric summary as well as the corresponding diagram with the adjusted regression model and a legend with basic information (formula of the adjusted model and R-squared).
This package provides a set of functions to help clinical trial researchers calculate power and sample size for two-arm Bayesian randomized clinical trials that do or do not incorporate historical control data. At some point during the design process, a clinical trial researcher who is designing a basic two-arm Bayesian randomized clinical trial needs to make decisions about power and sample size within the context of hypothesized treatment effects. Through simulation, the simple_sim() function will estimate power and other user-specified clinical trial characteristics at user-specified sample sizes given user-defined scenarios about treatment effect, control group characteristics, and outcome. If the clinical trial researcher has access to historical control data, then the researcher can design a two-arm Bayesian randomized clinical trial that incorporates the historical data. In such a case, the researcher needs to work through the potential consequences of historical and randomized control differences on trial characteristics, in addition to working through issues regarding power in the context of sample size, treatment effect size, and outcome. If a researcher designs a clinical trial that will incorporate historical control data, the researcher needs the randomized controls to be from the same population as the historical controls. What if this is not the case when the designed trial is implemented? During the design phase, the researcher needs to investigate the negative effects of possible historic/randomized control differences on power, type I error, and other trial characteristics. Using this information, the researcher should design the trial to mitigate these negative effects. Through simulation, the historic_sim() function will estimate power and other user-specified clinical trial characteristics at user-specified sample sizes given user-defined scenarios about historical and randomized control differences as well as treatment effects and outcomes. The results from historic_sim() and simple_sim() can be printed with print_table() and graphed with plot_table() methods. Outcomes considered are Gaussian, Poisson, Bernoulli, Lognormal, Weibull, and Piecewise Exponential. The methods are described in Eggleston et al. (2021) <doi:10.18637/jss.v100.i21>.
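For intuition about what "power by simulation" means here, a self-contained base-R sketch for a two-arm Gaussian trial without historical borrowing; this frequentist toy is not the package's method, whose simple_sim() and historic_sim() perform the Bayesian analogue and add borrowing, multiple outcome types, and reporting.
  # Toy version: proportion of simulated trials that reject H0 at a given sample size and effect
  power_sketch <- function(n_per_arm, effect, reps = 1000, alpha = 0.05) {
    rejections <- replicate(reps, {
      ctrl <- rnorm(n_per_arm, 0); trt <- rnorm(n_per_arm, effect)
      t.test(trt, ctrl)$p.value < alpha
    })
    mean(rejections)
  }
  power_sketch(n_per_arm = 50, effect = 0.5)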
This package provides functions that facilitate the use of accepted taxonomic nomenclature, collection of functional trait data, and assignment of functional group classifications to phytoplankton species. Possible classifications include Morpho-functional group (MFG; Salmaso et al. 2015 <doi:10.1111/fwb.12520>) and CSR (Reynolds 1988; Functional morphology and the adaptive strategies of phytoplankton. In C.D. Sandgren (ed). Growth and reproductive strategies of freshwater phytoplankton, 388-433. Cambridge University Press, New York). Versions 2.0.0 and later include new functions for querying the algaebase online taxonomic database (www.algaebase.org); however, these functions require a valid API key that must be acquired from the algaebase administrators. Note that none of the algaeClassify authors are affiliated with algaebase in any way. Taxonomic names can also be checked against a variety of taxonomic databases using the Global Names Resolver service via its API (<https://resolver.globalnames.org/api>). In addition, currently accepted and outdated synonyms, and higher taxonomy, can be extracted for lists of species from the ITIS database using wrapper functions for the ritis package. The algaeClassify package is a product of the GEISHA project (Global Evaluation of the Impacts of Storms on freshwater Habitat and Structure of phytoplankton Assemblages), funded by CESAB (Centre for Synthesis and Analysis of Biodiversity) and the U.S. Geological Survey John Wesley Powell Center for Synthesis and Analysis, with data and other support provided by members of GLEON (Global Lake Ecology Observation Network). DISCLAIMER: This software has been approved for release by the U.S. Geological Survey (USGS). Although the software has been subjected to rigorous review, the USGS reserves the right to update the software as needed pursuant to further analysis and review. No warranty, expressed or implied, is made by the USGS or the U.S. Government as to the functionality of the software and related material, nor shall the fact of release constitute any such warranty. Furthermore, the software is released on condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from its authorized or unauthorized use.
There are 4 possible methods: "ExhaustiveSearch"; "ExhaustivePhi"; "ClusteringSearch"; and "ClusteringPhi". "ExhaustiveSearch" --> gives you the best phage cocktail from a phage-bacteria infection network. It checks different phage cocktail sizes from 1 to 7 and stops earlier only if all bacteria are lysed. Another stopping option is when the user has decided not to obtain a phage cocktail larger than a limit value. "ExhaustivePhi" --> first, it computes Phi, a formula indicating the necessary phage cocktail size. Phi needs nestedness temperature and fill, which are internally calculated. This function will only look for the best combination (phage cocktail) of size Phi. "ClusteringSearch" --> first, an agglomerative hierarchical clustering using Ward's algorithm is calculated for the phages, which are clustered according to the bacteria they lyse. PhageCocktail() chooses how many clusters are needed in order to select 1 phage per cluster. Using the phages selected during the clustering, it checks different phage cocktail sizes from 1 to 7 and stops earlier only if all bacteria are lysed. Another stopping option is when the user has decided not to obtain a phage cocktail larger than a limit value. "ClusteringPhi" --> first, an agglomerative hierarchical clustering using Ward's algorithm is calculated for the phages, which are clustered according to the bacteria they lyse. PhageCocktail() chooses how many clusters are needed in order to select 1 phage per cluster. Once the function has one phage per cluster, it calculates Phi. If the number of clusters is less than Phi, it is increased so that at least Phi candidate phages are obtained. Then, it calculates the best combination of Phi phages using those selected during the clustering with Ward's algorithm. If you use PhageCocktail, please cite it as: "PhageCocktail: An R Package to Design Phage Cocktails from Experimental Phage-Bacteria Infection Networks". María Victoria Díaz-Galián, Miguel A. Vega-Rodríguez, Felipe Molina. Computer Methods and Programs in Biomedicine, 221, 106865, Elsevier Ireland, Clare, Ireland, 2022, pp. 1-9, ISSN: 0169-2607. <doi:10.1016/j.cmpb.2022.106865>.
Automatically selects and visualises statistical hypothesis tests between two vectors, based on their class, distribution, sample size, and a user-defined confidence level (conf.level). Visual outputs - including box plots, bar charts, regression lines with confidence bands, mosaic plots, residual plots, and Q-Q plots - are annotated with relevant test statistics, assumption checks, and post-hoc analyses where applicable. The algorithmic workflow helps the user focus on the interpretation of test results rather than test selection. It is particularly suited for quick data analysis, e.g., in statistical consulting projects or educational settings. The test selection algorithm proceeds as follows: Input vectors of class numeric or integer are considered numerical; those of class factor are considered categorical. Assumptions of residual normality and homogeneity of variances are considered met if the corresponding test yields a p-value greater than the significance level alpha = 1 - conf.level. (1) When the response vector is numerical and the predictor vector is categorical, a test of central tendencies is selected. If the categorical predictor has exactly two levels, t.test() is applied when group sizes exceed 30 (Lumley et al. (2002) <doi:10.1146/annurev.publhealth.23.100901.140546>). For smaller samples, normality of residuals is tested using shapiro.test(); if met, t.test() is used; otherwise, wilcox.test(). If the predictor is categorical with more than two levels, an aov() is initially fitted. Residual normality is evaluated using both shapiro.test() and ad.test(); residuals are considered approximately normal if at least one test yields a p-value above alpha. If this assumption is met, bartlett.test() assesses variance homogeneity. If variances are homogeneous, aov() is used; otherwise oneway.test(). Both tests are followed by TukeyHSD(). If residual normality cannot be assumed, kruskal.test() is followed by pairwise.wilcox.test(). (2) When both the response and predictor vectors are numerical, a simple linear regression model is fitted using lm(). (3) When both vectors are categorical, Cochran's rule (Cochran (1954) <doi:10.2307/3001666>) is applied to test independence either by chisq.test() or fisher.test().
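A condensed base-R sketch of branch (1) of this workflow for a two-level categorical predictor, using the same stats functions named above; the package wraps this logic, extends it to the other branches, and adds the annotated plots.
  # y: numeric response; g: factor with two levels (illustrative names)
  two_group_test <- function(y, g, conf.level = 0.95) {
    alpha <- 1 - conf.level
    if (min(table(g)) > 30) return(t.test(y ~ g, conf.level = conf.level))  # large groups: t-test directly
    res <- residuals(lm(y ~ g))                                             # small groups: check residual normality
    if (shapiro.test(res)$p.value > alpha) t.test(y ~ g, conf.level = conf.level)
    else wilcox.test(y ~ g, conf.level = conf.level)
  }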