SCANVIS is a set of annotation-dependent tools for analyzing splice junctions and their read support as predetermined by an alignment tool of choice (for example, the STAR aligner). SCANVIS assesses each junction's relative read support (RRS) by relating it to the context of local split reads aligning to annotated transcripts. SCANVIS also annotates each splice junction by indicating whether the junction is supported by annotation or not, and if not, what type of junction it is (e.g. exon skipping, alternative 5' or 3' events, novel exons). Unannotated junctions are further annotated by indicating whether they induce a frame shift or not. SCANVIS includes a visualization function to generate static sashimi-style plots depicting relative read support and number of split reads using arc thickness and arc heights, making it easy for users to spot well-supported junctions. These plots also clearly delineate unannotated junctions from annotated ones using designated color schemes, and users can also highlight splice junctions of choice. Variants and/or a read profile are also incorporated into the plot if the user supplies variants in BED format and/or the BAM file. One further feature of the visualization function is that users can submit multiple samples of a certain disease or cohort to generate a single plot. This occurs via a "merge" function wherein junction details over multiple samples are merged to generate a single sashimi plot, which is useful when contrasting cohorts (e.g. disease vs control).
To date, thousands of single nucleotide polymorphisms (SNPs) have been found to be associated with complex traits and diseases. However, the vast majority of these disease-associated SNPs lie in the non-coding part of the genome, and are likely to affect regulatory elements, such as enhancers and promoters, rather than the function of a protein. Thus, to understand the molecular mechanisms underlying genetic traits and diseases, it becomes increasingly important to study the effect of a SNP on nearby molecular traits such as chromatin environment or transcription factor (TF) binding. Towards this aim, we developed SNPhood, a user-friendly *Bioconductor* R package to investigate and visualize the local neighborhood of a set of SNPs of interest for NGS data such as chromatin marks or transcription factor binding sites from ChIP-Seq or RNA-Seq experiments. SNPhood comprises a set of easy-to-use functions to extract, normalize and summarize reads for a genomic region, perform various data quality checks, normalize read counts using additional input files, and cluster and visualize the regions according to the binding pattern. The regions around each SNP can be binned in a user-defined fashion to allow for analysis of very broad patterns as well as a detailed investigation of specific binding shapes. Furthermore, SNPhood supports the integration with genotype information to investigate and visualize genotype-specific binding patterns. Finally, SNPhood can be employed for determining, investigating, and visualizing allele-specific binding patterns around the SNPs of interest.
This package provides tools for transport planning with an emphasis on spatial transport data and non-motorized modes. The package was originally developed to support the 'Propensity to Cycle Tool', a publicly available strategic cycle network planning tool (Lovelace et al. 2017) <doi:10.5198/jtlu.2016.862>, but has since been extended to support public transport routing and accessibility analysis (Moreno-Monroy et al. 2017) <doi:10.1016/j.jtrangeo.2017.08.012> and routing with locally hosted routing engines such as OSRM (Lowans et al. 2023) <doi:10.1016/j.enconman.2023.117337>. The main functions are for creating and manipulating geographic "desire lines" from origin-destination (OD) data (building on the od package); calculating routes on the transport network locally and via interfaces to routing services such as <https://cyclestreets.net/> (Desjardins et al. 2021) <doi:10.1007/s11116-021-10197-1>; and calculating route segment attributes such as bearing. The package implements the travel flow aggregation method described in Morgan and Lovelace (2020) <doi:10.1177/2399808320942779> and the OD jittering method described in Lovelace et al. (2022) <doi:10.32866/001c.33873>. Further information on the package's aim and scope can be found in the vignettes, in a paper in the R Journal (Lovelace and Ellison 2018) <doi:10.32614/RJ-2018-053>, and in a paper outlining the landscape of open source software for geographic methods in transport planning (Lovelace, 2021) <doi:10.1007/s10109-020-00342-2>.
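As an illustration of a route segment attribute such as bearing, the following minimal Python sketch computes the initial great-circle bearing between the two ends of a segment given in longitude/latitude. The function name `initial_bearing` and the choice of the great-circle formula are assumptions made for this example; the package's own bearing calculation may differ.

```python
import math

def initial_bearing(lon1, lat1, lon2, lat2):
    """Initial great-circle bearing in degrees clockwise from north.

    Hypothetical helper for illustration; coordinates are in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # Standard forward-azimuth formula on a sphere.
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    # atan2 returns a value in (-180, 180]; fold it into [0, 360).
    return math.degrees(math.atan2(y, x)) % 360.0

# A straight "desire line" from an origin to a destination is just a pair of points.
origin = (0.0, 0.0)        # (lon, lat)
destination = (1.0, 0.0)   # due east along the equator
bearing_east = initial_bearing(*origin, *destination)
```

A segment running due east along the equator gives a bearing of 90 degrees, and one running due west gives 270 degrees.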
This package provides tools for exploring the topography of 3d triangle meshes. The functions were developed with dental surfaces in mind, but could be applied to any triangle mesh of class 'mesh3d'. More specifically, doolkit allows users to isolate the border of a mesh, or a subpart of the mesh using the polygon networks method; crop a mesh; compute basic descriptors (elevation, orientation, footprint area); compute slope, angularity and relief index (Ungar and Williamson (2000) <https://palaeo-electronica.org/2000_1/gorilla/issue1_00.htm>; Boyer (2008) <doi:10.1016/j.jhevol.2008.08.002>), inclination and occlusal relief index or gamma (Guy et al. (2013) <doi:10.1371/journal.pone.0066142>), OPC (Evans et al. (2007) <doi:10.1038/nature05433>), OPCR (Wilson et al. (2012) <doi:10.1038/nature10880>), DNE (Bunn et al. (2011) <doi:10.1002/ajpa.21489>; Pampush et al. (2016) <doi:10.1007/s10914-016-9326-0>), form factor (Horton (1932) <doi:10.1029/TR013i001p00350>), basin elongation (Schum (1956) <doi:10.1130/0016-7606(1956)67[597:EODSAS]2.0.CO;2>), lemniscate ratio (Chorley et al. (1957) <doi:10.2475/ajs.255.2.138>), enamel-dentine distance (Guy et al. (2015) <doi:10.1371/journal.pone.0138802>; Thiery et al. (2017) <doi:10.3389/fphys.2017.00524>), absolute crown strength (Schwartz et al. (2020) <doi:10.1098/rsbl.2019.0671>), relief rate (Thiery et al. (2019) <doi:10.1002/ajpa.23916>) and area-relative curvature; draw cumulative profiles of a topographic variable; and map a variable over a 3d triangle mesh.
The R package bayespm implements Bayesian Statistical Process Control and Monitoring (SPC/M) methodology. These methods utilize available prior information and/or historical data, providing efficient online quality monitoring of a process, in terms of identifying moderate/large transient shifts (i.e., outliers) or persistent shifts of medium/small size in the process. These self-starting, sequentially updated tools can also run in the complete absence of any prior information. The Predictive Control Charts (PCC) are introduced for the quality monitoring of data from any discrete or continuous distribution that is a member of the regular exponential family. The Predictive Ratio CUSUMs (PRC) are introduced for Binomial, Poisson and Normal data (a later version of the library will cover all the remaining distributions from the regular exponential family). The PCC targets transient process shifts of typically large size (a.k.a. outliers), while PRC is focused on detecting persistent (structural) shifts that might be of medium or even small size. Apart from monitoring, both PCC and PRC provide the sequentially updated posterior inference for the monitored parameter. Bourazas K., Kiagias D. and Tsiamyrtzis P. (2022) "Predictive Control Charts (PCC): A Bayesian approach in online monitoring of short runs" <doi:10.1080/00224065.2021.1916413>, Bourazas K., Sobas F. and Tsiamyrtzis P. (2023) "Predictive ratio CUSUM (PRC): A Bayesian approach in online change point detection of short runs" <doi:10.1080/00224065.2022.2161434>, Bourazas K., Sobas F. and Tsiamyrtzis P. (2023) "Design and properties of the predictive ratio cusum (PRC) control charts" <doi:10.1080/00224065.2022.2161435>.
Consider ambiguity in probabilistic descriptions by replacing a parametric probabilistic description of uncertainty by a non-parametric set of probability distributions in the form of a Density Ratio Class. This is of particular interest in Bayesian inference. The Density Ratio Class is particularly suited for this purpose as it is invariant under Bayesian inference, marginalization, and propagation through a deterministic model. Here, invariant means that the result of the operation applied to a Density Ratio Class is again a Density Ratio Class. In particular, the invariance under Bayesian inference enables iterative learning within the same framework of Density Ratio Classes. The use of imprecise probabilities in general, and Density Ratio Classes in particular, leads to intervals of characteristics of probability distributions, such as cumulative distribution functions, quantiles, and means. The package is based on a sample of the distribution proportional to the upper bound of the class. Typically this will be a sample from the posterior in Bayesian inference. Based on such a sample, the package provides functions to calculate lower and upper class boundaries and lower and upper bounds of cumulative distribution functions and quantiles. Rinderknecht, S.L., Albert, C., Borsuk, M.E., Schuwirth, N., Kuensch, H.R. and Reichert, P. (2014) "The effect of ambiguous prior knowledge on Bayesian model parameter inference and prediction." Environmental Modelling & Software, 62, 300-315. <doi:10.1016/j.envsoft.2014.08.020>. Sriwastava, A. and Reichert, P. "Robust Bayesian Estimation of Value Function Parameters using Imprecise Priors." Submitted. <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4973574>.
Epistasis, commonly defined as the interaction between genetic loci, is known to play an important role in the phenotypic variation of complex traits. As a result, many statistical methods have been developed to identify genetic variants that are involved in epistasis, and nearly all of these approaches carry out this task by focusing on analyzing one trait at a time. Previous studies have shown that jointly modeling multiple phenotypes can often dramatically increase statistical power for association mapping. In this package, we present the multivariate MArginal ePIstasis Test ('mvMAPIT'), a multi-outcome generalization of a recently proposed epistatic detection method which seeks to detect marginal epistasis or the combined pairwise interaction effects between a given variant and all other variants. By searching for marginal epistatic effects, one can identify genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with conventional explicit search based methods. Our proposed mvMAPIT builds upon this strategy by taking advantage of correlation structure between traits to improve the identification of variants involved in epistasis. We formulate mvMAPIT as a multivariate linear mixed model and develop a multi-trait variance component estimation algorithm for efficient parameter inference and P-value computation. Together with reasonable model approximations, our proposed approach is scalable to moderately sized genome-wide association studies. Crawford et al. (2017) <doi:10.1371/journal.pgen.1006869>. Stamp et al. (2023) <doi:10.1093/g3journal/jkad118>.
Cross validation informed Relaxed LASSO, Artificial Neural Network (ANN), gradient boosting machine ('xgboost'), Random Forest ('RandomForestSRC'), Oblique Random Forest ('aorsf'), Recursive Partitioning ('RPART') or stepwise regression models are fit. Cross validation leave out samples (leading to nested cross validation) or bootstrap out-of-bag samples are used to evaluate and compare performances between these models, with results presented in tabular or graphical form. Calibration plots can also be generated, again based upon (outer nested) cross validation or bootstrap leave out (out of bag) samples. For some datasets, for example when the design matrix is not of full rank, glmnet may have very long run times when fitting the relaxed lasso model. In our experience this happens when fitting Cox models on data with many predictors and many patients, making it difficult to get solutions from either glmnet() or cv.glmnet(). This may be remedied by using the path=TRUE option when calling glmnet() and cv.glmnet(). Within the glmnetr package the approach of path=TRUE is taken by default. When fitting an elastic-net model rather than a relaxed lasso model, the R packages nestedcv <https://cran.r-project.org/package=nestedcv>, glmnetSE <https://cran.r-project.org/package=glmnetSE> or others may provide greater functionality when performing a nested CV. Use of glmnetr has many similarities to the glmnet package, and it is recommended that the user of glmnetr also become familiar with the glmnet package <https://cran.r-project.org/package=glmnet>, with the "An Introduction to glmnet" and "The Relaxed Lasso" being especially useful in this regard.
Processing and analysis of field collected or simulated sprinkler system catch data (depths) to characterize irrigation uniformity and efficiency using standard and other measures. Standard measures include the Christiansen coefficient of uniformity (CU) as found in Christiansen, J.E. (1942, ISBN:0138779295, "Irrigation by Sprinkling"); and distribution uniformity (DU), potential efficiency of the low quarter (PELQ), and application efficiency of the low quarter (AELQ) that are implementations of measures of the same notation in Keller, J. and Merriam, J.L. (1978) "Farm Irrigation System Evaluation: A Guide for Management" <https://pdf.usaid.gov/pdf_docs/PNAAG745.pdf>. spreval::DU.lh is similar to spreval::DU but is the distribution uniformity of the low half instead of the low quarter as in DU. spreval::PELQT is a version of spreval::PELQ adapted for traveling systems instead of lateral move or solid-set sprinkler systems. The function spreval::eff is analogous to the method used to compute application efficiency for furrow irrigation presented in Walker, W. and Skogerboe, G.V. (1987, ISBN:0138779295, "Surface Irrigation: Theory and Practice"), which uses piecewise integration of infiltrated depth compared against soil-moisture deficit (SMD), when the argument "target" is set equal to SMD. The other functions contained in the package provide graphical representation of sprinkler system uniformity, and other standard univariate parametric and non-parametric statistical measures as applied to sprinkler system catch depths. A sample data set of field test data spreval::catchcan (catch depths) is provided and is used in examples and vignettes. Agricultural systems are emphasized, but this package can be used for landscape irrigation evaluation, and a landscape (turf) vignette is included as an example application.
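To make the uniformity measures concrete, the sketch below computes the Christiansen coefficient of uniformity and a low-quarter or low-half distribution uniformity from a list of catch depths, using the textbook definitions (CU as one minus the mean absolute deviation over the mean; DU as the mean of the lowest fraction of catches over the overall mean). The function names, the sample depths, and the rounding rule for the size of the low fraction are assumptions for this example and are not taken from spreval.

```python
from statistics import mean

def christiansen_cu(depths):
    """Christiansen coefficient of uniformity, in percent."""
    m = mean(depths)
    abs_dev = sum(abs(d - m) for d in depths)
    return 100.0 * (1.0 - abs_dev / (len(depths) * m))

def distribution_uniformity(depths, fraction=0.25):
    """Mean of the lowest `fraction` of catches over the overall mean, in percent.

    fraction=0.25 gives the low-quarter DU, fraction=0.5 the low-half variant.
    The rounding of the group size is an assumption of this sketch.
    """
    ordered = sorted(depths)
    n_low = max(1, round(len(ordered) * fraction))
    return 100.0 * mean(ordered[:n_low]) / mean(ordered)

# Hypothetical catch depths (e.g. in mm) from eight catch cans.
catch_depths = [10.0, 12.0, 14.0, 16.0, 8.0, 12.0, 12.0, 12.0]
cu = christiansen_cu(catch_depths)                              # 87.5
du_lq = distribution_uniformity(catch_depths)                   # 75.0
du_lh = distribution_uniformity(catch_depths, fraction=0.5)     # 87.5
```

For these depths the mean is 12, the absolute deviations sum to 12, and the two lowest catches average 9, which gives CU = 87.5 and a low-quarter DU of 75.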
Fits (excess) hazard, relative mortality ratio or marginal intensity models with multidimensional penalized splines allowing for time-dependent effects, non-linear effects and interactions between several continuous covariates. In survival and net survival analysis, in addition to modelling the effect of time (via the baseline hazard), one often has to deal with several continuous covariates and model their functional forms, their time-dependent effects, and their interactions. Model specification therefore becomes a complex problem, and penalized regression splines represent an appealing solution to that problem as splines offer the required flexibility while penalization limits overfitting issues. Current implementations of penalized survival models can be slow or unstable and sometimes lack key features like taking into account expected mortality to provide net survival and excess hazard estimates. In contrast, survPen provides an automated, fast, and stable implementation (thanks to explicit calculation of the derivatives of the likelihood) and offers a unified framework for multidimensional penalized hazard and excess hazard models. Later versions (>2.0.0) include penalized models for relative mortality ratio, and marginal intensity in the recurrent event setting. survPen may be of interest to those who 1) analyse any kind of time-to-event data: mortality, disease relapse, machinery breakdown, unemployment, etc.; 2) wish to describe the associated hazard and to understand which predictors impact its dynamics; 3) wish to model the relative mortality ratio between a cohort and a reference population; 4) wish to describe the marginal intensity for recurrent event data. See Fauvernier et al. (2019a) <doi:10.21105/joss.01434> for an overview of the package and Fauvernier et al. (2019b) <doi:10.1111/rssc.12368> for the method.
The single cell mapper (scMappR) R package contains a suite of bioinformatic tools that provide experimentally relevant cell-type specific information to a list of differentially expressed genes (DEG). The function "scMappR_and_pathway_analysis" reranks DEGs to generate cell-type specificity scores called cell-weighted fold-changes (cwFold-changes). Users input a list of DEGs, normalized counts, and a signature matrix into this function. scMappR then re-weights bulk DEGs by cell-type specific expression from the signature matrix, cell-type proportions from RNA-seq deconvolution and the ratio of cell-type proportions between the two conditions to account for changes in cell-type proportion. With cwFold-changes calculated, scMappR uses two approaches to utilize cwFold-changes to complete cell-type specific pathway analysis. The "process_dgTMatrix_lists" function in the scMappR package contains an automated scRNA-seq processing pipeline where users input scRNA-seq count data, which is made compatible for scMappR and other R packages that analyze scRNA-seq data. We further used this to store hundreds of regularly updated signature matrices. The functions "tissue_by_celltype_enrichment", "tissue_scMappR_internal", and "tissue_scMappR_custom" combine these consistently processed scRNA-seq count data with gene-set enrichment tools to allow for cell-type marker enrichment of a generic gene list (e.g. GWAS hits). Reference: Sokolowski,D.J., Faykoo-Martinez,M., Erdman,L., Hou,H., Chan,C., Zhu,H., Holmes,M.M., Goldenberg,A. and Wilson,M.D. (2021) Single-cell mapper (scMappR): using scRNA-seq to infer cell-type specificities of differentially expressed genes. NAR Genomics and Bioinformatics. 3(1). lqab011. <doi:10.1093/nargab/lqab011>.
Prepares data for statistical analysis (e.g., analysis of variance; ANOVA) by enabling the user to easily and quickly merge (using the file_merge() function) raw data files into one merged table and then aggregate the merged table (using the prep() function) into a finalized table while keeping track of and summarizing every step of the preparation. The finalized table contains several possibilities for dependent measures of the dependent variable. Most suitable when measuring variables in an interval or ratio scale (e.g., reaction-times) and/or discrete values such as accuracy. Main functions included are file_merge() and prep(). The file_merge() function vertically merges individual data files (in a long format), in which each line is a single observation, into one single dataset. The prep() function aggregates the single dataset according to any combination of grouping variables (i.e., between-subjects and within-subjects independent variables), and returns a data frame with a number of dependent measures for further analysis for each cell according to the combination of provided grouping variables. Dependent measures for each cell include, among others, means before and after rejecting all values according to a flexible standard deviation criterion, number of rejected values according to the flexible standard deviation criterion, proportions of rejected values according to the flexible standard deviation criterion, number of values before rejection, means after rejecting values according to procedures described in Van Selst & Jolicoeur (1994; suitable when measuring reaction-times), standard deviations, medians, means according to any percentile (e.g., 0.05, 0.25, 0.75, 0.95) and harmonic means. The data frame prep() returns can also be exported as a txt file to be used for statistical analysis in other statistical programs.
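The per-cell dependent measures based on a standard deviation criterion can be sketched as follows. This is a minimal illustration for a single cell of reaction times, assuming a symmetric cut-off around the mean and the sample standard deviation; the function name `reject_by_sd`, the data, and these choices are assumptions for the example and may not match what prep() does internally.

```python
from statistics import mean, stdev, harmonic_mean

def reject_by_sd(values, criterion=2.0):
    """Keep values within `criterion` sample SDs of the mean.

    Returns the kept values and the number of rejected values.
    """
    m, s = mean(values), stdev(values)
    kept = [v for v in values if abs(v - m) <= criterion * s]
    return kept, len(values) - len(kept)

# Hypothetical reaction times (ms) for one participant in one condition.
rts = [400, 420, 410, 430, 415, 405, 425, 1500]

kept, n_rejected = reject_by_sd(rts, criterion=2.0)
mean_before = mean(rts)                    # mean before rejection
mean_after = mean(kept)                    # mean after rejection
prop_rejected = n_rejected / len(rts)      # proportion of rejected values
hmean_after = harmonic_mean(kept)          # harmonic mean of the kept values
```

Here the single slow response of 1500 ms lies more than two standard deviations from the cell mean and is rejected, so the mean drops from 550.625 to 415.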
Calculate multiple biotic indices using diatoms from environmental samples. Diatom species are recognized by their species name using a heuristic search, and their ecological data is retrieved from multiple sources. It includes number/shape of chloroplasts, diversity indices, size classes, ecological guilds, and multiple biotic indices. It outputs both a dataframe with all the results and plots of all the obtained data in a defined output folder. - Sample data was taken from Nicolosi Gelis, Cochero & Gómez (2020, <doi:10.1016/j.ecolind.2019.105951>). - The package uses the Diat.Barcode database to calculate morphological and ecological information by Rimet & Couchez (2012, <doi:10.1051/kmae/2012018>), and the combined classification of guilds and size classes established by B-Béres et al. (2017, <doi:10.1016/j.ecolind.2017.07.007>). - Current diatom-based biotic indices include the DES index by Descy (1979) - EPID index by Dell'Uomo (1996, ISBN: 3950009002) - IDAP index by Prygiel & Coste (1993, <doi:10.1007/BF00028033>) - ID-CH index by Hürlimann & Niederhauser (2007) - IDP index by Gómez & Licursi (2001, <doi:10.1023/A:1011415209445>) - ILM index by Leclercq & Maquet (1987) - IPS index by Coste (1982) - LOBO index by Lobo, Callegaro, & Bender (2002, ISBN:9788585869908) - SLA by Sládeček (1986, <doi:10.1002/aheh.19860140519>) - TDI index by Kelly & Whitton (1995, <doi:10.1007/BF00003802>) - SPEAR(herbicide) index by Wood, Mitrovic, Lim, Warne, Dunlop, & Kefford (2019, <doi:10.1016/j.ecolind.2018.12.035>) - PBIDW index by Castro-Roa & Pinilla-Agudelo (2014) - DISP index by Stenger-Kovács et al. (2018, <doi:10.1016/j.ecolind.2018.07.026>) - EDI index by Chamorro et al. (2024, <doi:10.1021/acsestwater.4c00126>) - DDI index by Álvarez-Blanco et al. (2013, <doi:10.1007/s10661-012-2607-z>) - PDISE index by Kahlert et al. (2023, <doi:10.1007/s10661-023-11378-4>).
Analysis of task-related functional magnetic resonance imaging (fMRI) activity at the level of individual participants is commonly based on general linear modelling (GLM), which allows us to estimate to what extent the blood oxygenation level dependent (BOLD) signal can be explained by task response predictors specified in the GLM model. The predictors are constructed by convolving the hypothesised timecourse of neural activity with an assumed hemodynamic response function (HRF). To get valid and precise estimates of task response, it is important to construct a model of neural activity that best matches actual neuronal activity. The construction of models is most often driven by predefined assumptions on the components of brain activity and their duration based on the task design and specific aims of the study. However, our assumptions about the onset and duration of component processes might be wrong and can also differ across brain regions. This can result in inappropriate or suboptimal models, poor fit of the model to the actual data and invalid estimates of brain activity. Here we present an approach in which theoretically driven models of task response are used to define constraints based on which the final model is derived computationally using the actual data. Specifically, we developed autohrf, a package for the R programming language that allows for data-driven estimation of HRF models. The package uses genetic algorithms to efficiently search for models that fit the underlying data well. The package uses automated parameter search to find the onset and duration of task predictors which result in the highest fitness of the resulting GLM based on the fMRI signal under predefined restrictions. We evaluate the usefulness of the autohrf package on publicly available datasets of task-related fMRI activity. Our results suggest that by using autohrf users can find better task related brain activity models in a quick and efficient manner.
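The construction of a task predictor by convolving a hypothesised timecourse of neural activity with an assumed HRF can be sketched in a few lines of Python. The double-gamma HRF shape, its parameter values, and the function names below are assumptions chosen for illustration; they are not autohrf's defaults, and the genetic-algorithm search over onset and duration is not shown.

```python
import math

def gamma_pdf(t, shape, scale=1.0):
    """Density of a gamma distribution; zero for non-positive t."""
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)) / (math.gamma(shape) * scale ** shape)

def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1 / 6):
    """Assumed canonical-style HRF: a peak gamma minus a scaled undershoot gamma."""
    return gamma_pdf(t, peak_shape) - ratio * gamma_pdf(t, undershoot_shape)

def build_predictor(onset, duration, n_scans, tr=1.0, hrf_length=32.0):
    """Convolve a boxcar (onset, duration in seconds) with the HRF, sampled every `tr` seconds."""
    boxcar = [1.0 if onset <= i * tr < onset + duration else 0.0 for i in range(n_scans)]
    kernel = [double_gamma_hrf(k * tr) for k in range(int(hrf_length / tr))]
    predictor = []
    for i in range(n_scans):
        # Discrete convolution: sum of past neural activity weighted by the HRF.
        total = sum(boxcar[i - k] * h for k, h in enumerate(kernel) if i - k >= 0)
        predictor.append(total * tr)
    return predictor

predictor = build_predictor(onset=2.0, duration=4.0, n_scans=40)
peak_index = max(range(len(predictor)), key=predictor.__getitem__)
```

Changing `onset` and `duration` changes the shape of the predictor, which is the pair of parameters a data-driven search would tune for each task event.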
Simulation methods for phylogenetic trees where (i) all tips are sampled at one time point or (ii) tips are sampled sequentially through time. (i) For sampling at one time point, simulations are performed under a constant rate birth-death process, conditioned on having a fixed number of final tips (sim.bd.taxa()), or a fixed age (sim.bd.age()), or a fixed age and number of tips (sim.bd.taxa.age()). When conditioning on the number of final tips, the method allows for shifts in rates and mass extinction events during the birth-death process (sim.rateshift.taxa()). The function sim.bd.age() (and sim.rateshift.taxa() without extinction) allow the speciation rate to change in a density-dependent way. The LTT plots of the simulations can be displayed using LTT.plot(), LTT.plot.gen() and LTT.average.root(). TreeSim further samples trees with n final tips from a set of trees generated by the common sampling algorithm stopping when a fixed number m>>n of tips is first reached (sim.gsa.taxa()). This latter method is appropriate for m-tip trees generated under a big class of models (details in the sim.gsa.taxa() man page). For incomplete phylogenies, the missing speciation events can be added through simulations (corsim()). (ii) sim.rateshift.taxa() is generalized to sim.bdsky.stt() for serially sampled trees, where the trees are conditioned on either the number of sampled tips or the age. Furthermore, for a multitype-branching process with sequential sampling, trees on a fixed number of tips can be simulated using sim.bdtypes.stt.taxa(). This function further allows simulation under epidemiological models with an exposed class. The function sim.genespeciestree() simulates coalescent gene trees within birth-death species trees, and sim.genetree() simulates coalescent gene trees.
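For intuition about the underlying process, the following Python sketch runs a plain forward simulation of a constant rate birth-death process up to a fixed age and records the number of lineages through time. It tracks lineage counts only, builds no tree, and applies no conditioning on the number of tips, so it is not the algorithm used by sim.bd.age() or sim.bd.taxa(); the function name and parameters are assumptions for this example.

```python
import random

def simulate_bd_lineages(birth_rate, death_rate, age, seed=None):
    """Forward-simulate lineage counts of a constant rate birth-death process.

    Starts from one lineage at time 0 and stops at time `age` or at extinction.
    Returns the final number of lineages and a list of (time, count) events.
    """
    rng = random.Random(seed)
    t, n = 0.0, 1
    history = [(0.0, 1)]
    while n > 0:
        # Waiting time to the next event across all n lineages.
        t += rng.expovariate(n * (birth_rate + death_rate))
        if t >= age:
            break
        # The event is a speciation with probability birth / (birth + death).
        if rng.random() < birth_rate / (birth_rate + death_rate):
            n += 1
        else:
            n -= 1
        history.append((t, n))
    return n, history

n_tips, history = simulate_bd_lineages(birth_rate=1.0, death_rate=0.0, age=2.0, seed=1)
```

The `history` list is the raw material for a lineages-through-time curve of the full process; with `death_rate=0.0` the counts can only increase.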
Flexible multidimensional scaling (MDS) methods and extensions to the package 'smacof'. This package contains various functions, wrappers, methods and classes for fitting, plotting and displaying a large number of different flexible MDS models (some as of yet unpublished). These are: Torgerson scaling (Torgerson, 1958, ISBN:978-0471879459) with powers, Sammon mapping (Sammon, 1969, <doi:10.1109/T-C.1969.222678>) with ratio and interval optimal scaling, Multiscale MDS (Ramsay, 1977, <doi:10.1007/BF02294052>) with ratio and interval optimal scaling, S-stress MDS (ALSCAL; Takane, Young & De Leeuw, 1977, <doi:10.1007/BF02293745>) with ratio and interval optimal scaling, elastic scaling (McGee, 1966, <doi:10.1111/j.2044-8317.1966.tb00367.x>) with ratio and interval optimal scaling, r-stress MDS (De Leeuw, Groenen & Mair, 2016, <https://rpubs.com/deleeuw/142619>) with ratio, interval and non-metric optimal scaling, power-stress MDS (POST-MDS; Buja & Swayne, 2002 <doi:10.1007/s00357-001-0031-0>) with ratio and interval optimal scaling, restricted power-stress (Rusch, Mair & Hornik, 2021, <doi:10.1080/10618600.2020.1869027>) with ratio and interval optimal scaling, approximate power-stress with ratio optimal scaling (Rusch, Mair & Hornik, 2021, <doi:10.1080/10618600.2020.1869027>), Box-Cox MDS (Chen & Buja, 2013, <https://jmlr.org/papers/v14/chen13a.html>), local MDS (Chen & Buja, 2009, <doi:10.1198/jasa.2009.0111>), curvilinear component analysis (Demartines & Herault, 1997, <doi:10.1109/72.554199>) and curvilinear distance analysis (Lee, Lendasse & Verleysen, 2004, <doi:10.1016/j.neucom.2004.01.007>). There are also experimental models (e.g., sparsified MDS and sparsified POST-MDS). Some functions are suitably flexible to allow any other sensible combination of explicit power transformations for weights, distances and input proximities with implicit ratio, interval or non-metric optimal scaling of the input proximities. Most functions use a Majorization-Minimization algorithm. Currently the methods are only available for one-mode data (symmetric dissimilarity matrices).
Measures morphological diversity from discrete character data and estimates evolutionary tempo on phylogenetic trees. Imports morphological data from #NEXUS (Maddison et al. (1997) <doi:10.1093/sysbio/46.4.590>) format with read_nexus_matrix(), and writes to both #NEXUS and TNT format (Goloboff et al. (2008) <doi:10.1111/j.1096-0031.2008.00217.x>). Main functions are test_rates(), which implements AIC and likelihood ratio tests for discrete character rates introduced across Lloyd et al. (2012) <doi:10.1111/j.1558-5646.2011.01460.x>, Brusatte et al. (2014) <doi:10.1016/j.cub.2014.08.034>, Close et al. (2015) <doi:10.1016/j.cub.2015.06.047>, and Lloyd (2016) <doi:10.1111/bij.12746>, and calculate_morphological_distances(), which implements multiple discrete character distance metrics from Gower (1971) <doi:10.2307/2528823>, Wills (1998) <doi:10.1006/bijl.1998.0255>, Lloyd (2016) <doi:10.1111/bij.12746>, and Hopkins and St John (2018) <doi:10.1098/rspb.2018.1784>. This also includes the GED correction from Lehmann et al. (2019) <doi:10.1111/pala.12430>. Multiple functions implement morphospace plots: plot_chronophylomorphospace() implements Sakamoto and Ruta (2012) <doi:10.1371/journal.pone.0039752>, plot_morphospace() implements Wills et al. (1994) <doi:10.1017/S009483730001263X>, plot_changes_on_tree() implements Wang and Lloyd (2016) <doi:10.1098/rspb.2016.0214>, and plot_morphospace_stack() implements Foote (1993) <doi:10.1017/S0094837300015864>. Other functions include safe_taxonomic_reduction(), which implements Wilkinson (1995) <doi:10.1093/sysbio/44.4.501>, map_dollo_changes(), which implements the Dollo stochastic character mapping of Tarver et al. (2018) <doi:10.1093/gbe/evy096>, and estimate_ancestral_states(), which implements the ancestral state options of Lloyd (2018) <doi:10.1111/pala.12380>. calculate_tree_length() and reconstruct_ancestral_states() implement the generalised algorithms from Swofford and Maddison (1992; no doi).
This package provides a comprehensive set of functions implementing frequentist methods for network meta-analysis (Balduzzi et al., 2023) <doi:10.18637/jss.v106.i02> and supporting Schwarzer et al. (2015) <doi:10.1007/978-3-319-21416-0>, Chapter 8 "Network Meta-Analysis": - frequentist network meta-analysis following Rücker (2012) <doi:10.1002/jrsm.1058>; - additive network meta-analysis for combinations of treatments (Rücker et al., 2020) <doi:10.1002/bimj.201800167>; - network meta-analysis of binary data using the Mantel-Haenszel or non-central hypergeometric distribution method (Efthimiou et al., 2019) <doi:10.1002/sim.8158>, or penalised logistic regression (Evrenoglou et al., 2022) <doi:10.1002/sim.9562>; - rankograms and ranking of treatments by the Surface under the cumulative ranking curve (SUCRA) (Salanti et al., 2013) <doi:10.1016/j.jclinepi.2010.03.016>; - ranking of treatments using P-scores (frequentist analogue of SUCRAs without resampling) according to Rücker & Schwarzer (2015) <doi:10.1186/s12874-015-0060-8>; - split direct and indirect evidence to check consistency (Dias et al., 2010) <doi:10.1002/sim.3767>, (Efthimiou et al., 2019) <doi:10.1002/sim.8158>; - league table with network meta-analysis results; - comparison-adjusted funnel plot (Chaimani & Salanti, 2012) <doi:10.1002/jrsm.57>; - net heat plot and design-based decomposition of Cochran's Q according to Krahn et al. (2013) <doi:10.1186/1471-2288-13-35>; - measures characterizing the flow of evidence between two treatments by König et al. (2013) <doi:10.1002/sim.6001>; - automated drawing of network graphs described in Rücker & Schwarzer (2016) <doi:10.1002/jrsm.1143>; - partial order of treatment rankings ('poset') and Hasse diagram for poset (Carlsen & Bruggemann, 2014) <doi:10.1002/cem.2569>; (Rücker & Schwarzer, 2017) <doi:10.1002/jrsm.1270>; - contribution matrix as described in Papakonstantinou et al. (2018) <doi:10.12688/f1000research.14770.3> and Davies et al. (2022) <doi:10.1002/sim.9346>; - network meta-regression with a single continuous or binary covariate; - subgroup network meta-analysis.
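The P-score ranking listed above can be illustrated with a short sketch. As defined by Rücker & Schwarzer (2015), the P-score of a treatment is the mean, over all competing treatments, of the one-sided probability that it is the better of the two, computed from the estimated pairwise difference and its standard error under a normal approximation. The Python code below is a minimal illustration of that definition, not the package's implementation: the function names, treatment labels, and all numbers are hypothetical, and it assumes that larger effects are better.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_scores(treatments, diff, se):
    """Compute P-scores from pairwise estimates.

    diff[i][j] is the estimated effect of treatment i minus treatment j,
    se[i][j] is its standard error (diagonal entries are never used).
    """
    n = len(treatments)
    scores = {}
    for i, name in enumerate(treatments):
        # Probability that treatment i is better than each competitor j.
        probs = [normal_cdf(diff[i][j] / se[i][j]) for j in range(n) if j != i]
        scores[name] = sum(probs) / (n - 1)
    return scores

# Hypothetical network estimates for three treatments.
treatments = ["A", "B", "C"]
diff = [[0.0, 0.5, 1.0],
        [-0.5, 0.0, 0.5],
        [-1.0, -0.5, 0.0]]
se = [[0.4, 0.4, 0.4],
      [0.4, 0.4, 0.4],
      [0.4, 0.4, 0.4]]

scores = p_scores(treatments, diff, se)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Because the pairwise probabilities for (i, j) and (j, i) sum to one, the P-scores of a network always average to 0.5, which is a convenient sanity check on any implementation.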
Analyzing the performance of artificial intelligence (AI) systems/algorithms characterized by a search-and-report strategy. Historically, observer performance analysis has dealt with measuring radiologists' performance in search tasks, e.g., searching for lesions in medical images and reporting them, but the implicit location information has been ignored. The implemented methods apply to analyzing the absolute and relative performances of AI systems, comparing AI performance to a group of human readers, or optimizing the reporting threshold of an AI system. In addition to performing historical receiver operating characteristic (ROC) analysis (localization information ignored), the software also performs free-response receiver operating characteristic (FROC) analysis, where lesion localization information is used. A book using the software has been published: Chakraborty DP: Observer Performance Methods for Diagnostic Imaging - Foundations, Modeling, and Applications with R-Based Examples, Taylor-Francis LLC; 2017: <https://www.routledge.com/Observer-Performance-Methods-for-Diagnostic-Imaging-Foundations-Modeling/Chakraborty/p/book/9781482214840>. Online updates to this book, which use the software, are at <https://dpc10ster.github.io/RJafrocQuickStart/>, <https://dpc10ster.github.io/RJafrocRocBook/> and at <https://dpc10ster.github.io/RJafrocFrocBook/>. Supported data collection paradigms are the ROC, FROC and the location ROC (LROC). ROC data consists of a single rating per image, where a rating is the perceived confidence level that the image is that of a diseased patient. An ROC curve is a plot of true positive fraction vs. false positive fraction. FROC data consists of a variable number (zero or more) of mark-rating pairs per image, where a mark is the location of a reported suspicious region and the rating is the confidence level that it is a real lesion. LROC data consists of a rating and a location of the most suspicious region, for every image. Four models of observer performance, and curve-fitting software, are implemented: the binormal model (BM), the contaminated binormal model (CBM), the correlated contaminated binormal model (CORCBM), and the radiological search model (RSM). Unlike the binormal model, CBM, CORCBM and RSM predict proper ROC curves that do not inappropriately cross the chance diagonal. Additionally, RSM parameters are related to search performance (not measured in conventional ROC analysis) and classification performance. Search performance refers to finding lesions, i.e., true positives, while simultaneously not finding false positive locations. Classification performance measures the ability to distinguish between true and false positive locations. Knowing these separate performances allows principled optimization of reader or AI system performance. This package supersedes Windows JAFROC (jackknife alternative FROC) software V4.2.1, <https://github.com/dpc10ster/WindowsJafroc>. Package functions are organized as follows. Data file related function names are prefixed with 'Df', curve fitting functions with 'Fit', included data sets with 'dataset', plotting functions with 'Plot', significance testing functions with 'St', sample size related functions with 'Ss', data simulation functions with 'Simulate' and utility functions with 'Util'. Implemented are figures of merit (FOMs) for quantifying performance and functions for visualizing empirical or fitted operating characteristics: e.g., ROC, FROC, alternative FROC (AFROC) and weighted AFROC (wAFROC) curves. For fully crossed study designs, significance testing of reader-averaged FOM differences between modalities is implemented via either the Dorfman-Berbaum-Metz or the Obuchowski-Rockette method. Also implemented is single treatment analysis, which allows comparison of performance of a group of radiologists to a specified value, or comparison of AI to a group of radiologists interpreting the same cases. Crossed-modality analysis is implemented wherein there are two crossed treatment factors and the aim is to determine performance in each treatment factor averaged over all levels of the second factor. Sample size estimation tools are provided for ROC and FROC studies; these use estimates of the relevant variances from a pilot study to predict required numbers of readers and cases in a pivotal study to achieve the desired power. Utility and data file manipulation functions allow data to be read in any of the currently used input formats, including Excel, and the results of the analysis can be viewed in text or Excel output files. The methods are illustrated with several included datasets from the author's collaborations. This update includes improvements to the code, some as a result of user-reported bugs and new feature requests, and others discovered during ongoing testing and code simplification.
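To make the ROC definitions above concrete, here is a minimal Python sketch of an empirical ROC curve (true positive fraction against false positive fraction) and its trapezoidal area, computed from one rating per image. It illustrates the conventional ROC paradigm only, with location information ignored, and does not reproduce any function of the package: the function names and the ratings are hypothetical.

```python
def empirical_roc(non_diseased, diseased):
    """Return (FPF, TPF) operating points, from the strictest threshold down.

    An image is called positive when its rating is >= the threshold.
    """
    thresholds = sorted(set(non_diseased) | set(diseased), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every rating: nothing is called positive
    for t in thresholds:
        fpf = sum(r >= t for r in non_diseased) / len(non_diseased)
        tpf = sum(r >= t for r in diseased) / len(diseased)
        points.append((fpf, tpf))
    return points

def trapezoidal_auc(points):
    """Area under the piecewise-linear curve through the operating points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical confidence ratings on a 1-5 scale, one per image.
non_diseased = [1, 2, 2, 3, 1]
diseased = [3, 4, 2, 5, 4]

roc_points = empirical_roc(non_diseased, diseased)
auc = trapezoidal_auc(roc_points)
print(roc_points)
print(round(auc, 3))
```

The trapezoidal area equals the Wilcoxon statistic: the fraction of (diseased, non-diseased) image pairs in which the diseased image receives the higher rating, with ties counted as one half.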
This package performs variable selection based on subsampling ranking forward selection (SuRF). Details of the method are published in Lihui Liu, Hong Gu, Johan Van Limbergen, Toby Kenney (2020) SuRF: A new method for sparse variable selection, with application in microbiome data analysis. Statistics in Medicine 40 897-919 <doi:10.1002/sim.8809>. Xo is the matrix of predictor variables. y is the response variable. Currently only binary responses using logistic regression are supported. X is a matrix of additional predictors which should be scaled to have sum 1 prior to analysis. fold is the number of folds for cross-validation. Alpha is the parameter for the elastic net method used in the subsampling procedure: the default value of 1 corresponds to LASSO. prop is the proportion of variables to remove in each subsample. weights indicates whether observations should be weighted by class size. When the class sizes are unbalanced, weighting observations can improve results. B is the number of subsamples to use for ranking the variables. C is the number of permutations to use for estimating the critical value of the null distribution. If the doParallel package is installed, the function can be run in parallel by setting ncores to the number of threads to use. If the default value of 1 is used, or if the doParallel package is not installed, the function does not run in parallel. display.progress indicates whether the function should display messages indicating its progress. family is a family variable for the glm() fitting. Note that the glmnet package does not permit the use of nonstandard link functions, so it will always use the default link function. However, the glm() fitting will use the specified link. The default is binomial with logistic regression, because this is a common use case. pval is the p-value for inclusion of a variable in the model. Under the null case, the number of false positives will be geometrically distributed with this as probability of success, so if this parameter is set to p, the expected number of false positives should be p/(1-p).
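The closing statement about pval can be checked numerically. The Python sketch below assumes, as an interpretation of the sentence above rather than something taken from the package, that under the null each further false inclusion happens independently with probability p, so the number of false positives N satisfies P(N = k) = p^k (1 - p). The function name is hypothetical.

```python
def expected_false_positives(p, max_terms=200):
    """Numerically sum k * P(N = k), with P(N = k) = p**k * (1 - p).

    The series is truncated at max_terms, which is ample for small p.
    """
    return sum(k * (p ** k) * (1.0 - p) for k in range(max_terms))

p = 0.05  # hypothetical inclusion threshold
numeric = expected_false_positives(p)
closed_form = p / (1.0 - p)
print(numeric, closed_form)
```

Under this assumption the truncated sum agrees with p/(1-p) to floating-point precision, so a threshold of 0.05 corresponds to roughly 0.053 expected false positives.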
This R package introduces Weighted Mean SHapley Additive exPlanations (WMSHAP), an innovative method for calculating SHAP values for a grid of fine-tuned base-learner machine learning models as well as stacked ensembles, a method not previously available due to the common reliance on single best-performing models. By integrating the weighted mean SHAP values from individual base-learners comprising the ensemble or individual base-learners in a tuning grid search, the package weights SHAP contributions according to each model's performance, assessed by R squared (for both regression and classification models). Alternatively, the software also offers weighting of SHAP values based on the area under the precision-recall curve (AUCPR), the area under the curve (AUC), and F2 measures for binary classifiers. It further extends this framework to implement weighted confidence intervals for weighted mean SHAP values, offering a more comprehensive and robust feature importance evaluation over a grid of machine learning models, instead of solely computing SHAP values for the best model. This methodology is particularly beneficial for addressing the severe class imbalance (class rarity) problem by providing a transparent, generalized measure of feature importance that mitigates the risk of reporting SHAP values for an overfitted or biased model and maintains robustness under severe class imbalance, where there is no universal criterion for identifying the absolute best model. Furthermore, the package implements hypothesis testing to ascertain the statistical significance of SHAP values for individual features, as well as comparative significance testing of SHAP contributions between features. Additionally, it tackles a critical gap in the feature selection literature by presenting criteria for the automatic selection of the most important features across a grid of models or stacked ensembles, eliminating the need for arbitrary determination of the number of top features to be extracted. This utility is invaluable for researchers analyzing feature significance, particularly within severely imbalanced outcomes where conventional methods fall short. Moreover, it is also expected to report democratic feature importance across a grid of models, resulting in a more comprehensive and generalizable feature selection. The package further implements a novel method for visualizing SHAP values at both subject level and feature level, as well as a plot for feature selection based on the weighted mean SHAP ratios.
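The core idea of weighting SHAP contributions by model performance can be sketched as a plain weighted average. The Python code below is an illustration under stated assumptions, not the package's algorithm: it assumes each model in the grid has already been reduced to one summary SHAP value per feature and one performance weight (for example AUCPR), and it does not attempt the package's confidence intervals or hypothesis tests. All names and numbers are hypothetical.

```python
import math

def weighted_mean_shap(shap_by_model, weights):
    """Performance-weighted mean of per-feature SHAP summaries across models.

    shap_by_model[m][f] is the summary SHAP value of feature f in model m;
    weights[m] is the performance metric used to weight model m.
    """
    total = sum(weights)
    n_features = len(shap_by_model[0])
    return [
        sum(w * model[f] for w, model in zip(weights, shap_by_model)) / total
        for f in range(n_features)
    ]

def weighted_sd(values, weights, mean):
    """Weighted standard deviation of values around a given weighted mean."""
    total = sum(weights)
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return math.sqrt(variance)

# Hypothetical grid of three models and two features.
shap_by_model = [[0.4, 0.1],
                 [0.2, 0.3],
                 [0.3, 0.2]]
weights = [0.5, 0.25, 0.25]  # hypothetical performance scores

means = weighted_mean_shap(shap_by_model, weights)
spread = [
    weighted_sd([model[f] for model in shap_by_model], weights, means[f])
    for f in range(len(means))
]
print(means, spread)
```

A feature that ranks highly only in a poorly performing model contributes little to the weighted mean, which is the motivation for averaging over the grid rather than reporting a single model.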
This package implements a suite of semiparametric and nonparametric kernel-smoothed estimation and testing procedures for continuous mark-specific stratified hazard ratio (treatment/placebo) models in a randomized treatment efficacy trial with a time-to-event endpoint. Semiparametric methods, allowing multivariate marks, are described in Juraska M and Gilbert PB (2013), Mark-specific hazard ratio model with multivariate continuous marks: an application to vaccine efficacy. Biometrics 69(2):328-337 <doi:10.1111/biom.12016>, and in Juraska M and Gilbert PB (2016), Mark-specific hazard ratio model with missing multivariate marks. Lifetime Data Analysis 22(4):606-25 <doi:10.1007/s10985-015-9353-9>. Nonparametric kernel-smoothed methods, allowing univariate marks only, are described in Sun Y and Gilbert PB (2012), Estimation of stratified mark-specific proportional hazards models with missing marks. Scandinavian Journal of Statistics
Many methods have been developed to deal with two major statistical problems: image segmentation and nonparametric estimation in various regression models. Image segmentation is gaining a lot of attention from various scientific subfields. In particular, it has been popular in medical research such as magnetic resonance imaging (MRI) analysis. When a patient suffers from a brain disease such as dementia or Parkinson's disease, the disease can be easily diagnosed in brain MRI: the affected area appears bright in MRI and is called a white lesion. For the purpose of medical research, locating and segmenting those white lesions in MRI is a critical issue; it can be done manually. However, manual segmentation is very expensive in that it is error-prone and demands a huge amount of time. Therefore, supervised machine learning has emerged as an alternative solution. Despite its powerful performance in classification problems such as hand-written digits, supervised machine learning has not shown the same satisfactory result in MRI analysis. Setting aside all other issues of supervised machine learning, it exposed a critical problem when employed for MRI analysis: it requires time-consuming data labeling. Thus, there is a strong demand for an unsupervised approach, and this package - based on Hira L. Koul (1986) <DOI:10.1214/aos/1176350059> - proposes an efficient method for simple image segmentation - here, "simple" means that an image is black-and-white - which can easily be applied to MRI analysis. This package includes a function GetSegImage(): when a black-and-white image is given as an input, GetSegImage() separates an area of white pixels - which corresponds to a white lesion in MRI - from the given image. For the second problem, consider a linear regression model and an autoregressive model of order q, where errors in the linear regression model and innovations in the autoregressive model are independent and symmetrically distributed. Hira L. Koul (1986) <DOI:10.1214/aos/1176350059> proposed a nonparametric minimum distance estimation method by minimizing an L2-type distance between certain weighted residual empirical processes. He also proposed a simpler version of the loss function by using symmetry of the integrating measure in the distance. Kim (2018) <DOI:10.1080/00949655.2017.1392527> proposed a fast computational method which enables practitioners to compute the minimum distance estimator of the vector of general multiple regression parameters for several integrating measures. This package contains three functions: KoulLrMde(), KoulArMde(), and Koul2StageMde(). The first two provide minimum distance estimators for the linear regression model and the autoregressive model, respectively, where both are based on Koul's method. These two functions take much less time for the computation than those based on parametric minimum distance estimation methods. Koul2StageMde() provides estimators for regression and autoregressive coefficients of a linear regression model with autoregressive errors through a two-stage minimum distance method. The new version is written in Rcpp and dramatically reduces computational time.
Docco in Ruby