Functionality for spatio-temporal modeling of large data sets is provided. A Gaussian process in space and time is defined through a stochastic partial differential equation (SPDE). The SPDE is solved in the spectral space, and after discretizing in time and space, a linear Gaussian state space model is obtained. When doing inference, the main computational difficulty consists in evaluating the likelihood and in sampling from the full conditional of the spectral coefficients, or equivalently, the latent space-time process. In comparison to the traditional approach of using a spatio-temporal covariance function, the spectral SPDE approach is computationally advantageous. See Sigrist, Kuensch, and Stahel (2015) <doi:10.1111/rssb.12061> for more information on the methodology. This package aims at providing tools for two different modeling approaches. First, the SPDE based spatio-temporal model can be used as a component in a customized hierarchical Bayesian model (HBM). The functions of the package then provide parameterizations of the process part of the model as well as computationally efficient algorithms needed for doing inference with the HBM. Alternatively, the adaptive MCMC algorithm implemented in the package can be used as an algorithm for doing inference without any additional modeling. The MCMC algorithm supports data that follow a Gaussian or a censored distribution with point mass at zero. Covariates can be included in the model through a regression term.
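To fix notation, the description above implies (after the spectral solution of the SPDE and discretization in time) a model of the following schematic linear Gaussian state space form, where alpha_t collects the spectral coefficients, Phi maps them to the observation locations, and x_t'beta is the optional regression term; this is illustrative notation, and the exact parameterization used in the package may differ:

```latex
% Schematic linear Gaussian state space form (illustrative notation)
\alpha_{t+1} = G\,\alpha_t + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0,\Sigma),
\qquad
y_t = x_t^{\top}\beta + \Phi\,\alpha_t + \nu_t, \quad \nu_t \sim \mathcal{N}(0,\tau^2 I).
```

In this form, likelihood evaluation and sampling from the full conditional of the spectral coefficients reduce to standard recursions for linear Gaussian state space models, which is what makes the spectral SPDE approach computationally attractive.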
Implementation of a collection of MCMC methods for Bayesian structure learning of directed acyclic graphs (DAGs), from both continuous and discrete data. For efficient inference on larger DAGs, the space of DAGs is pruned according to the data. To filter the search space, the algorithm employs a hybrid approach, combining constraint-based learning with search and score. A reduced search space is initially defined on the basis of a skeleton obtained by means of the PC-algorithm, and then iteratively improved with search and score. Search and score is then performed following one of two approaches: Order MCMC or Partition MCMC. The BGe score is implemented for continuous data and the BDe score for binary or categorical data. The algorithms may provide the maximum a posteriori (MAP) graph or a sample (a collection of DAGs) from the posterior distribution given the data. All algorithms are also applicable to structure learning and sampling for dynamic Bayesian networks. References: J. Kuipers, P. Suter, G. Moffa (2022) <doi:10.1080/10618600.2021.2020127>, N. Friedman and D. Koller (2003) <doi:10.1023/A:1020249912095>, J. Kuipers and G. Moffa (2017) <doi:10.1080/01621459.2015.1133426>, M. Kalisch et al. (2012) <doi:10.18637/jss.v047.i11>, D. Geiger and D. Heckerman (2002) <doi:10.1214/aos/1035844981>, P. Suter, J. Kuipers, G. Moffa, N. Beerenwinkel (2023) <doi:10.18637/jss.v105.i09>.
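A hedged sketch of the hybrid workflow described above (define a score, prune the search space iteratively, then run Order or Partition MCMC). The function and argument names below are assumptions used purely for illustration and may not match the package's actual interface:

```r
## Illustrative only: function and argument names are assumptions, not a verified API.
# myData: data matrix with one column per node; continuous data -> BGe score
score <- scoreparameters(scoretype = "bge", data = myData)  # assumed score constructor
fit   <- iterativeMCMC(scorepar = score)  # PC-based skeleton, iteratively expanded by search-and-score
map   <- orderMCMC(scorepar = score)      # MAP DAG via Order MCMC (assumed interface)
post  <- partitionMCMC(scorepar = score)  # posterior sample of DAGs via Partition MCMC (assumed)
```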
This package performs frequentist inference for the extremal index of a stationary time series. Two types of methodology are used. One type is based on a model that relates the distribution of block maxima to the marginal distribution of the series and leads to the semiparametric maxima estimators described in Northrop (2015) <doi:10.1007/s10687-015-0221-5> and Berghaus and Bucher (2018) <doi:10.1214/17-AOS1621>. Sliding block maxima are used to increase the precision of estimation. A graphical block size diagnostic is provided. The other type of methodology uses a model for the distribution of threshold inter-exceedance times (Ferro and Segers (2003) <doi:10.1111/1467-9868.00401>). Three versions of this type of approach are provided: the iterated weighted least squares approach of Suveges (2007) <doi:10.1007/s10687-007-0034-2>, the K-gaps model of Suveges and Davison (2010) <doi:10.1214/09-AOAS292>, and a similar approach of Holesovsky and Fusek (2020) <doi:10.1007/s10687-020-00374-3> that we refer to as D-gaps. For the K-gaps and D-gaps models this package allows missing values in the data, can accommodate independent subsets of data, such as monthly or seasonal time series from different years, and can incorporate information from right-censored inter-exceedance times. Graphical diagnostics for the threshold level and the respective tuning parameters K and D are provided.
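For orientation, the extremal index theta targeted by all of these estimators links the distribution of the block maximum M_n to the marginal distribution F of the stationary series via the classical approximation

```latex
P(M_n \le u_n) \;\approx\; F(u_n)^{\,n\theta}, \qquad 0 < \theta \le 1,
```

so theta = 1 corresponds to near-independence of extremes and smaller values of theta to stronger extremal clustering.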
This package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame. E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata. EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with epubr. Text is read as is for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like tm or qdap.
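A minimal usage sketch, assuming an EPUB file on disk; epub() is the package's core reader, and the file path shown is a placeholder:

```r
library(epubr)
# Parse metadata and text content into a tidy, nested tibble (one row per e-book)
book <- epub("path/to/book.epub")   # placeholder path
book$data[[1]]                      # nested tibble of text content by section
```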
Fresh biomass determination is the key to evaluating crop genotypes' response to diverse input and stress conditions and forms the basis for calculating net primary production. However, as conventional phenotyping approaches for measuring fresh biomass are time-consuming, laborious and destructive, image-based phenotyping methods are now widely used. In the image-based approach, the fresh weight of the above-ground part of the plant depends on the projected area. For determining the projected area, the visual image of the plant is converted into a grayscale image by simply averaging the Red (R), Green (G) and Blue (B) pixel values. The grayscale image is then converted into a binary image using Otsu's thresholding method (Otsu, N. (1979) <doi:10.1109/TSMC.1979.4310076>) to separate the plant area from the background (image segmentation). Segmentation is accomplished by assigning pixels with values above the threshold to the plant region and the remaining pixels to the background region. The resulting binary image consists of white and black pixels representing the plant and background regions. Finally, the number of pixels inside the plant region is counted and converted to square centimetres (cm2) using a reference object (any object whose actual area is known beforehand) to obtain the projected area. The projected area is then used as input to a machine learning model (Linear Model, Artificial Neural Network, or Support Vector Regression) to determine the plant's fresh weight.
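A compact base-R sketch of the pipeline described above, for illustration only (the image is assumed to be a numeric array with R, G, B channels scaled to [0, 1]; ref_pixel_count and ref_area_cm2 are placeholders for the reference object of known area; Otsu's threshold is computed by maximizing the between-class variance of the gray-level histogram):

```r
# img: numeric array [rows, cols, 3] with values in [0, 1]
gray <- (img[, , 1] + img[, , 2] + img[, , 3]) / 3      # average the R, G, B channels

# Otsu's threshold: maximize between-class variance over a 256-bin histogram
otsu <- function(g, nbins = 256) {
  h <- tabulate(findInterval(g, seq(0, 1, length.out = nbins + 1),
                             all.inside = TRUE), nbins)
  p  <- h / sum(h)
  mids <- (seq_len(nbins) - 0.5) / nbins                 # bin centers
  w0 <- cumsum(p); mu <- cumsum(p * mids); mu_t <- mu[nbins]
  sigma_b <- (mu_t * w0 - mu)^2 / (w0 * (1 - w0))        # between-class variance
  mids[which.max(sigma_b)]
}

plant <- gray > otsu(gray)                               # binary image: TRUE = plant pixels
px_per_cm2    <- ref_pixel_count / ref_area_cm2          # calibration from the reference object
projected_cm2 <- sum(plant) / px_per_cm2                 # projected area in cm2
```

The projected area computed this way is then the single predictor fed to the regression models named above.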
This package performs stability analysis of multi-environment trial data using parametric and non-parametric methods. Parametric methods include Additive Main Effects and Multiplicative Interaction (AMMI) analysis by Gauch (2013) <doi:10.2135/cropsci2013.04.0241>, Ecovalence by Wricke (1965), Genotype plus Genotype-Environment (GGE) biplot analysis by Yan & Kang (2003) <doi:10.1201/9781420040371>, geometric adaptability index by Mohammadi & Amri (2008) <doi:10.1007/s10681-007-9600-6>, joint regression analysis by Eberhart & Russell (1966) <doi:10.2135/cropsci1966.0011183X000600010011x>, genotypic confidence index by Annicchiarico (1992), Murakami & Cruz's (2004) method, power law residuals (POLAR) statistics by Doring et al. (2015) <doi:10.1016/j.fcr.2015.08.005>, scale-adjusted coefficient of variation by Doring & Reckling (2018) <doi:10.1016/j.eja.2018.06.007>, stability variance by Shukla (1972) <doi:10.1038/hdy.1972.87>, weighted average of absolute scores by Olivoto et al. (2019a) <doi:10.2134/agronj2019.03.0220>, and multi-trait stability index by Olivoto et al. (2019b) <doi:10.2134/agronj2019.03.0221>. Non-parametric methods include the superiority index by Lin & Binns (1988) <doi:10.4141/cjps88-018>, nonparametric measures of phenotypic stability by Huehn (1990) <doi:10.1007/BF00024241>, and the TOP third statistic by Fox et al. (1990) <doi:10.1007/BF00040364>. Functions for biometrical analyses such as path analysis, canonical correlation, partial correlation, and clustering analysis, as well as tools for inspecting, manipulating, summarizing and plotting typical multi-environment trial data, are also provided.
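As one concrete example from the parametric methods listed above, Wricke's (1965) ecovalence for genotype i is its contribution to the genotype-by-environment interaction sum of squares, with x_ij the value of genotype i in environment j and bars denoting marginal and grand means:

```latex
W_i \;=\; \sum_{j}\bigl(x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}\bigr)^{2}.
```

Smaller ecovalence indicates a genotype that contributes little to the interaction, i.e., greater stability across environments.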
A variety of multivariable data summary statistics and constructions have been proposed, either to generalize univariable analogs or to exploit multivariable properties. Notable among these are the bivariate peelings surveyed by Green (1981, ISBN:978-0-471-28039-2), the bag-and-bolster plots proposed by Rousseeuw et al. (1999) <doi:10.1080/00031305.1999.10474494>, and the minimum spanning trees used by Jolliffe (2002) <doi:10.1007/b98835> to represent high-dimensional relationships among data in a low-dimensional plot. Additionally, biplots of singular-value-decomposed tabular data, such as from principal components analysis, make use of vectors, calibrated axes, and other representations of variable elements to complement point markers for case elements; see Gabriel (1971) <doi:10.1093/biomet/58.3.453> and Gower & Harding (1988) <doi:10.1093/biomet/75.3.445> for the original proposals. Because they treat the abscissa and ordinate as commensurate, or the data elements themselves as point masses or unit vectors, these multivariable tools can be thought of as belonging to geometric data analysis; see Podani (2000, ISBN:90-5782-067-6) for techniques and applications and Le Roux & Rouanet (2005) <doi:10.1007/1-4020-2236-0> for foundations. gggda extends Wickham's (2010) <doi:10.1198/jcgs.2009.07098> layered grammar of graphics with statistical transformation ("stat") and geometric construction ("geom") layers for many of these tools, as well as convenience coordinate systems that emphasize the intrinsic geometry of the data.
This package provides functions for estimation and inference in Bayesian quantile regression with ordinal outcomes. An ordinal model with 3 or more outcomes (labeled the OR1 model) is estimated by a combination of Gibbs sampling and the Metropolis-Hastings (MH) algorithm, whereas an ordinal model with exactly 3 outcomes (labeled the OR2 model) is estimated using a Gibbs sampling algorithm. The summary output presents the posterior mean, posterior standard deviation, 95% credible intervals, and the inefficiency factors, along with two model comparison measures: the logarithm of the marginal likelihood and the deviance information criterion (DIC). The package also provides functions for computing covariate effects and other functions that aid estimation or inference in quantile ordinal models. Rahman, M. A. (2016). "Bayesian Quantile Regression for Ordinal Models." Bayesian Analysis, 11(1): 1-24 <doi:10.1214/15-BA939>. Yu, K., and Moyeed, R. A. (2001). "Bayesian Quantile Regression." Statistics and Probability Letters, 54(4): 437-447 <doi:10.1016/S0167-7152(01)00124-9>. Koenker, R., and Bassett, G. (1978). "Regression Quantiles." Econometrica, 46(1): 33-50 <doi:10.2307/1913643>. Chib, S. (1995). "Marginal Likelihood from the Gibbs Output." Journal of the American Statistical Association, 90(432): 1313-1321 <doi:10.1080/01621459.1995.10476635>. Chib, S., and Jeliazkov, I. (2001). "Marginal Likelihood from the Metropolis-Hastings Output." Journal of the American Statistical Association, 96(453): 270-281 <doi:10.1198/016214501750332848>.
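For reference, the frequentist building block behind these models (Koenker and Bassett, 1978) estimates the p-th regression quantile by minimizing the check loss; in the Bayesian formulation of Yu and Moyeed (2001) this corresponds to an asymmetric Laplace working likelihood:

```latex
\rho_p(u) = u\,\bigl(p - \mathbf{1}\{u < 0\}\bigr), \qquad
\hat{\beta}(p) = \arg\min_{\beta}\; \sum_{i=1}^{n} \rho_p\!\bigl(y_i - x_i^{\top}\beta\bigr).
```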
Specialized solvers for combinatorial optimization problems in the Subset Sum family. The solvers differ from the mainstream in the options of (i) restricting subset size, (ii) bounding subset elements, (iii) mining real-value multisets with predefined subset sum errors, and (iv) finding one or more subsets in limited time. A novel algorithm for mining the one-dimensional Subset Sum induced the algorithms for the multi-Subset Sum and the multidimensional Subset Sum. The multi-threaded framework for the latter offers exact algorithms for the multidimensional Knapsack and the Generalized Assignment problems. Historical updates include (a) renewed implementation of the multi-Subset Sum, multidimensional Knapsack and Generalized Assignment solvers; (b) availability of bounding the solution space in the multidimensional Subset Sum; (c) fundamental data structure and architectural changes for enhanced cache locality and a better chance of SIMD vectorization; (d) the option of mapping a floating-point instance to a compressed 64-bit integer instance with user-controlled precision loss, which can yield substantial speedup due to the dimension reduction and efficient compressed integer arithmetic via bit manipulations; (e) distributed computing infrastructure for the multidimensional Subset Sum; (f) arbitrary-precision, zero-margin-of-error multidimensional Subset Sum accelerated by a simplified Bloom filter. The package contains a copy of xxHash from <https://github.com/Cyan4973/xxHash>. The package vignette (<doi:10.48550/arXiv.1612.04484>) details a few historical updates. Functions prefixed with aux (auxiliary) are independent implementations of published algorithms for solving optimization problems less relevant to Subset Sum.
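Schematically, the core one-dimensional problem addressed by the solvers is: given a real-valued multiset x_1, ..., x_n, a target t, an error tolerance epsilon, and optional bounds l, u on the subset size, find one or more index sets S such that

```latex
\Bigl|\,\sum_{i \in S} x_i - t\,\Bigr| \;\le\; \varepsilon,
\qquad l \;\le\; |S| \;\le\; u;
```

the multidimensional variants impose this condition componentwise on vector-valued x_i and t.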
In various domains, many datasets exhibit both high variable dependency and group structures, which necessitates their simultaneous estimation. This package provides functions for two subgroup identification methods based on penalized functions, both of which utilize factor model structures to adapt to data with cross-sectional dependency. The first method is the Subgroup Identification with Latent Factor Structure Method (SILFSM) that we proposed. By employing Center-Augmented Regularization and factor structures, the SILFSM effectively eliminates data dependencies while identifying subgroups within datasets. For this model, we offer optimization functions based on two different methods: Coordinate Descent and our newly developed Difference of Convex-Alternating Direction Method of Multipliers (DC-ADMM) algorithms; the latter can be applied to cases where the distance function in Center-Augmented Regularization takes either the L1 or the L2 form. The other method is the Factor-Adjusted Pairwise Fusion Penalty (FA-PFP) model, which incorporates factor augmentation into the Pairwise Fusion Penalty (PFP) developed by Ma, S. and Huang, J. (2017) <doi:10.1080/01621459.2016.1148039>. Additionally, we provide a function for the Standard CAR (S-CAR) method, which does not account for the dependency and is included for comparison with the other approaches. Furthermore, functions based on the Bayesian Information Criterion (BIC) of the SILFSM and the FA-PFP method are also included in SILFS for selecting tuning parameters. For more details of the Subgroup Identification with Latent Factor Structure Method, please refer to He et al. (2024) <doi:10.48550/arXiv.2407.00882>.
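As a schematic of the pairwise-fusion idea (following Ma and Huang, 2017; the factor-adjusted objective actually used by FA-PFP, and the Center-Augmented Regularization in SILFSM, may differ in detail), each unit carries its own intercept mu_i, and the intercepts are shrunk pairwise toward common subgroup values:

```latex
y_i = \mu_i + x_i^{\top}\beta + e_i, \qquad
\min_{\mu,\beta}\;\tfrac{1}{2}\sum_{i=1}^{n}\bigl(y_i-\mu_i-x_i^{\top}\beta\bigr)^{2}
\;+\;\sum_{i<j} p\bigl(\lvert \mu_i-\mu_j\rvert;\lambda\bigr).
```

In the factor-adjusted versions, the covariates and errors are first purged of estimated latent factors so that the cross-sectional dependency does not distort the recovered grouping.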
In many binary classification applications, such as disease diagnosis and spam detection, practitioners commonly face the need to limit type I error (i.e., the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes type II error (i.e., the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, alpha, on the type I error. Although the NP paradigm has a century-long history in hypothesis testing, it has not been well recognized and implemented in classification schemes. Common practices that directly limit the empirical type I error to no more than alpha do not satisfy the type I error control objective because the resulting classifiers are still likely to have type I errors much larger than alpha. As a result, the NP paradigm has not been properly implemented for many classification scenarios in practice. In this work, we develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, including popular methods such as logistic regression, support vector machines, and random forests. Powered by this umbrella algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands, motivated by the popular receiver operating characteristic (ROC) curves. NP-ROC bands help to choose the type I error bound alpha in a data-adaptive way and to compare different NP classifiers.
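In symbols, writing R_0(phi) = P(phi(X) = 1 | Y = 0) for the type I error and R_1(phi) = P(phi(X) = 0 | Y = 1) for the type II error of a classifier phi, the NP paradigm seeks

```latex
\phi_{\alpha}^{*} \;=\; \underset{\phi:\,R_0(\phi)\le\alpha}{\arg\min}\; R_1(\phi),
```

whereas simply constraining the empirical type I error to be at most alpha does not guarantee R_0(phi) <= alpha with high probability, which is the gap the umbrella algorithm addresses.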
Background - Traditional gene set enrichment analyses are typically limited to a few ontologies and do not account for the interdependence of gene sets or terms, resulting in overcorrected p-values. To address these challenges, we introduce mulea, an R package offering comprehensive overrepresentation and functional enrichment analysis. Results - mulea employs a progressive empirical false discovery rate (eFDR) method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies. mulea expands beyond traditional tools by incorporating a wide range of ontologies, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains. This flexibility enables researchers to tailor enrichment analysis to their specific questions, such as identifying enriched transcriptional regulators in gene expression data or overrepresented protein domains in protein sets. To facilitate seamless analysis, mulea provides gene sets (in standardised GMT format) for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in almost 900 files. Additionally, the muleaData ExperimentData Bioconductor package simplifies access to these pre-defined ontologies. Finally, mulea's architecture allows for easy integration of user-defined ontologies, or GMT files from external sources (e.g., MSigDB or Enrichr), expanding its applicability across diverse research areas. Conclusions - mulea is distributed as a CRAN R package. It offers researchers a powerful and flexible toolkit for functional enrichment analysis, addressing limitations of traditional tools with its progressive eFDR and by supporting a variety of ontologies. Overall, mulea fosters the exploration of diverse biological questions across various model organisms.
Collection of R functions to do purely presence-only species distribution modeling with isolation forest (iForest) and its variations such as Extended isolation forest and SCiForest. See the details of these methods in references: Liu, F.T., Ting, K.M. and Zhou, Z.H. (2008) <doi:10.1109/ICDM.2008.17>, Hariri, S., Kind, M.C. and Brunner, R.J. (2019) <doi:10.1109/TKDE.2019.2947676>, Liu, F.T., Ting, K.M. and Zhou, Z.H. (2010) <doi:10.1007/978-3-642-15883-4_18>, Guha, S., Mishra, N., Roy, G. and Schrijvers, O. (2016) <https://proceedings.mlr.press/v48/guha16.html>, Cortes, D. (2021) <doi:10.48550/arXiv.2110.13402>. Additionally, Shapley values are used to explain model inputs and outputs. See details in references: Shapley, L.S. (1953) <doi:10.1515/9781400881970-018>, Lundberg, S.M. and Lee, S.I. (2017) <https://dm-gatech.github.io/CS8803-Fall2018-DML-Papers/shapley.pdf>, Molnar, C. (2020) <ISBN:978-0-244-76852-2>, Štrumbelj, E. and Kononenko, I. (2014) <doi:10.1007/s10115-013-0679-x>. itsdm also provides functions to diagnose variable response, analyze variable importance, draw spatial dependence of variables and examine variable contribution. As utilities, the package includes a few functions to download bioclimatic variables including WorldClim version 2.0 (see Fick, S.E. and Hijmans, R.J. (2017) <doi:10.1002/joc.5086>) and CMCC-BioClimInd (see Noce, S., Caporaso, L. and Santini, M. (2020) <doi:10.1038/s41597-020-00726-5>).
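For context, the Shapley value used for these explanations attributes a prediction to variable i (from the full variable set N) as its average marginal contribution over subsets S of the remaining variables:

```latex
\phi_i \;=\; \sum_{S \subseteq N\setminus\{i\}}
\frac{\lvert S\rvert!\,\bigl(\lvert N\rvert-\lvert S\rvert-1\bigr)!}{\lvert N\rvert!}
\bigl[v(S\cup\{i\}) - v(S)\bigr],
```

where v(S) is the model output with only the variables in S present; in practice it is approximated by sampling (Štrumbelj and Kononenko, 2014).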
Stochastic block model used for dynamic graphs represented by Poisson processes. To model recurrent interaction events in continuous time, an extension of the stochastic block model is proposed where every individual belongs to a latent group and interactions between two individuals follow a conditional inhomogeneous Poisson process with intensity driven by the individuals' latent groups. The model is shown to be identifiable and its estimation is based on a semiparametric variational expectation-maximization algorithm. Two versions of the method are developed, using either a nonparametric histogram approach (with an adaptive choice of the partition size) or kernel intensity estimators. The number of latent groups can be selected by an integrated classification likelihood criterion. Y. Baraud and L. Birgé (2009). <doi:10.1007/s00440-007-0126-6>. C. Biernacki, G. Celeux and G. Govaert (2000). <doi:10.1109/34.865189>. M. Corneli, P. Latouche and F. Rossi (2016). <doi:10.1016/j.neucom.2016.02.031>. J.-J. Daudin, F. Picard and S. Robin (2008). <doi:10.1007/s11222-007-9046-7>. A. P. Dempster, N. M. Laird and D. B. Rubin (1977). <http://www.jstor.org/stable/2984875>. G. Grégoire (1993). <http://www.jstor.org/stable/4616289>. L. Hubert and P. Arabie (1985). <doi:10.1007/BF01908075>. M. Jordan, Z. Ghahramani, T. Jaakkola and L. Saul (1999). <doi:10.1023/A:1007665907178>. C. Matias, T. Rebafka and F. Villers (2018). <doi:10.1093/biomet/asy016>. C. Matias and S. Robin (2014). <doi:10.1051/proc/201447004>. H. Ramlau-Hansen (1983). <doi:10.1214/aos/1176346152>. P. Reynaud-Bouret (2006). <doi:10.3150/bj/1155735930>.
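Schematically, with Z_i denoting the latent group of individual i, the interaction events from i to j form a conditional inhomogeneous Poisson process whose intensity depends only on the pair of latent groups:

```latex
N_{ij}(\cdot)\;\Big|\;\{Z_i = q,\ Z_j = \ell\}
\;\sim\; \text{Poisson process with intensity } \alpha_{q\ell}(t),
```

and the intensities alpha_{q l}(.) are estimated nonparametrically, either by adaptive histograms or by kernel estimators, within the variational EM algorithm described above.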
Estimation, based on conditional maximum likelihood, of the quadratic exponential model proposed by Bartolucci, F. & Nigro, V. (2010, Econometrica) <DOI:10.3982/ECTA7531> and of a simplified and a modified version of this model. The quadratic exponential model is suitable for the analysis of binary longitudinal data when state dependence (further to the effect of the covariates and a time-fixed individual intercept) has to be taken into account. Therefore, this is an alternative to the dynamic logit model, having the advantage of easily allowing conditional inference in order to eliminate the individual intercepts and thereby obtain consistent estimates of the parameters of main interest (for the covariates and the lagged response). The simplified version of this model does not distinguish, as the original model does, between the last time occasion and the previous occasions. The modified version formulates the interaction terms in a different way and may be used to test for state dependence in an easy way, as shown in Bartolucci, F., Nigro, V. & Pigini, C. (2018, Econometric Reviews) <DOI:10.1080/07474938.2015.1060039>. The package also includes estimation of the dynamic logit model by a pseudo conditional estimator based on the quadratic exponential model, as proposed by Bartolucci, F. & Nigro, V. (2012, Journal of Econometrics) <DOI:10.1016/j.jeconom.2012.03.004>. For large time dimensions of the panel, the computation of the proposed models involves a recursive function from Krailo, M. D. & Pike, M. C. (1984, Journal of the Royal Statistical Society. Series C (Applied Statistics)) and Bartolucci, F., Valentini, F. & Pigini, C. (2021, Computational Economics) <DOI:10.1007/s10614-021-10218-2>.
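For reference, the dynamic logit model to which the quadratic exponential model offers an alternative specifies, for individual i at occasion t,

```latex
\operatorname{logit} P\bigl(y_{it}=1 \mid \alpha_i, x_{it}, y_{i,t-1}\bigr)
\;=\; \alpha_i + x_{it}^{\top}\beta + \gamma\, y_{i,t-1},
```

where alpha_i is the time-fixed individual intercept and gamma measures state dependence; conditioning on suitable sufficient statistics in the quadratic exponential family eliminates the alpha_i, as described above.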
This package provides methods that use flexible variants of multidimensional scaling (MDS) which incorporate parametric nonlinear distance transformations and trade off goodness of fit against structure considerations to find optimal hyperparameters, also known as structure optimized proximity scaling (STOPS) (Rusch, Mair & Hornik, 2023, <doi:10.1007/s11222-022-10197-w>). The package contains various functions, wrappers, methods and classes for fitting, plotting and displaying different 1-way MDS models with ratio, interval, or ordinal optimal scaling in a STOPS framework. These cover essentially the functionality of the package smacofx, including Torgerson (classical) scaling with power transformations of dissimilarities, SMACOF MDS with powers of dissimilarities, Sammon mapping with powers of dissimilarities, elastic scaling with powers of dissimilarities, spherical SMACOF with powers of dissimilarities, (ALSCAL) s-stress MDS with powers of dissimilarities, r-stress MDS, MDS with powers of dissimilarities and configuration distances, elastic scaling with powers of dissimilarities and configuration distances, Sammon mapping with powers of dissimilarities and configuration distances, power stress MDS (POST-MDS), approximate power stress, Box-Cox MDS, local MDS, Isomap, curvilinear component analysis (CLCA), curvilinear distance analysis (CLDA) and sparsified (power) multidimensional scaling and (power) multidimensional distance analysis (experimental models from smacofx influenced by CLCA). All of these models can also be fit by optimizing over hyperparameters based on goodness of fit only (i.e., with no structure considerations). The package further contains functions for optimization, specifically the adaptive Luus-Jaakola algorithm and a wrapper for Bayesian optimization with a treed Gaussian process with jumps to linear models, as well as functions for various c-structuredness indices. Hyperparameter optimization can be done with a number of techniques, but we recommend either Bayesian optimization or particle swarm. For using "Kriging", users need to install a version of the archived DiceOptim R package.
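In schematic form (our shorthand, not the article's exact notation), STOPS selects the transformation hyperparameters theta by minimizing a weighted aggregate of the badness of fit of the resulting configuration X*(theta) and one or more c-structuredness indices I_k:

```latex
\mathrm{STOPS}(\theta) \;=\; v_0\,\sigma\bigl(X^{*}(\theta)\bigr)
\;+\; \sum_{k} v_k\, I_k\bigl(X^{*}(\theta)\bigr),
```

with the minimization over theta carried out by the optimizers mentioned above (adaptive Luus-Jaakola, Bayesian optimization, particle swarm, and so on).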
This package provides tools for semantic segmentation of geospatial data using convolutional neural network-based deep learning. Utility functions allow for creating masks, image chips, data frames listing image chips in a directory, and DataSets for use within DataLoaders. Additional functions are provided to serve as checks during the data preparation and training process. A UNet architecture can be defined with 4 blocks in the encoder, a bottleneck block, and 4 blocks in the decoder. The UNet can accept a variable number of input channels, and the user can define the number of feature maps produced in each encoder and decoder block and the bottleneck. Users can also choose to (1) replace all rectified linear unit (ReLU) activation functions with leaky ReLU or swish, (2) implement attention gates along the skip connections, (3) implement squeeze and excitation modules within the encoder blocks, (4) add residual connections within all blocks, (5) replace the bottleneck with a modified atrous spatial pyramid pooling (ASPP) module, and/or (6) implement deep supervision using predictions generated at each stage in the decoder. A unified focal loss framework is implemented after Yeung et al. (2022) <doi:10.1016/j.compmedimag.2021.102026>. We have also implemented assessment metrics using the luz package, including F1-score, recall, and precision. Trained models can be used to generate predictions for spatial data without the need to produce chips from larger spatial extents. Functions are available for performing accuracy assessment. The package relies on torch for implementing deep learning, which does not require the installation of a Python environment. Raster geospatial data are handled with terra. Models can be trained using a Compute Unified Device Architecture (CUDA)-enabled graphics processing unit (GPU); however, multi-GPU training is not supported by torch in R.
This package provides a variety of methods to estimate and visualize distributional differences in terms of effect sizes. Particular emphasis is placed on evaluating differences between two or more distributions across the entire scale, rather than at a single point (e.g., differences in means). For example, Probability-Probability (PP) plots display the difference between two or more distributions, matched by their empirical CDFs (see Ho and Reardon, 2012; <doi:10.3102/1076998611411918>), allowing for examinations of where on the scale distributional differences are largest or smallest. The area under the PP curve (AUC) is an effect-size metric, corresponding to the probability that a randomly selected observation from the x-axis distribution will have a higher value than a randomly selected observation from the y-axis distribution. Binned effect size plots are also available, in which the distributions are split into bins (set by the user) and separate effect sizes (Cohen's d) are produced for each bin - again providing a means to evaluate the consistency (or lack thereof) of the difference between two or more distributions at different points on the scale. Evaluation of empirical CDFs is also provided, with built-in arguments for providing annotations to help evaluate distributional differences at specific points (e.g., semi-transparent shading). All functions take a consistent argument structure. Calculation of specific effect sizes is also possible. The following effect sizes are estimable: (a) Cohen's d, (b) Hedges' g, (c) percentage above a cut, (d) transformed (normalized) percentage above a cut, (e) area under the PP curve, and (f) the V statistic (see Ho, 2009; <doi:10.3102/1076998609332755>), which essentially transforms the area under the curve to standard deviation units. By default, effect sizes are calculated for all possible pairwise comparisons, but a reference group (distribution) can be specified.
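Concretely, for a focal distribution with values X and a reference distribution with values Y, the area under the PP curve is the probability described above, and Ho's (2009) V statistic re-expresses it in standard deviation units via the standard normal quantile function (our summary of the transformation):

```latex
\mathrm{AUC} \;=\; P(X > Y), \qquad V \;=\; \sqrt{2}\,\Phi^{-1}(\mathrm{AUC}),
```

so that when both distributions are normal with equal variances, V coincides with Cohen's d.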
Pancreatic ductal adenocarcinoma (PDA) has a relatively poor prognosis and is one of the most lethal cancers. Molecular classification of gene expression profiles holds the potential to identify meaningful subtypes which can inform therapeutic strategy in the clinical setting. The Pancreatic Cancer Adenocarcinoma Tool-Kit (PDATK) provides an S4 class-based interface for performing unsupervised subtype discovery, cross-cohort meta-clustering, gene-expression-based classification, and subsequent survival analysis to identify prognostically useful subtypes in pancreatic cancer and beyond. Two novel methods, Consensus Subtypes in Pancreatic Cancer (CSPC) and Pancreatic Cancer Overall Survival Predictor (PCOSP) are included for consensus-based meta-clustering and overall-survival prediction, respectively. Additionally, four published subtype classifiers and three published prognostic gene signatures are included to allow users to easily recreate published results, apply existing classifiers to new data, and benchmark the relative performance of new methods. The use of existing Bioconductor classes as input to all PDATK classes and methods enables integration with existing Bioconductor datasets, including the 21 pancreatic cancer patient cohorts available in the MetaGxPancreas data package. PDATK has been used to replicate results from Sandhu et al (2019) [https://doi.org/10.1200/cci.18.00102] and an additional paper is in the works using CSPC to validate subtypes from the included published classifiers, both of which use the data available in MetaGxPancreas. The inclusion of subtype centroids and prognostic gene signatures from these and other publications will enable researchers and clinicians to classify novel patient gene expression data, allowing the direct clinical application of the classifiers included in PDATK. Overall, PDATK provides a rich set of tools to identify and validate useful prognostic and molecular subtypes based on gene-expression data, benchmark new classifiers against existing ones, and apply discovered classifiers on novel patient data to inform clinical decision making.
This package serves diverse purposes such as biomarker confirmation, novel biomarker discovery, construction of predictive models, model-based prediction, and validation. It handles binary, continuous, and time-to-event outcomes at the sample or patient level. - Biomarker confirmation utilizes established functions like glm() from stats, coxph() from survival, and surv_fit() and ggsurvplot() from survminer. - Biomarker discovery and variable selection are facilitated by three LASSO-related functions, LASSO2(), LASSO_plus(), and LASSO2plus(), leveraging the glmnet R package with additional steps. - Eight versatile modeling functions are offered, each designed for predictive models across various outcomes and data types. 1) LASSO2(), LASSO_plus(), LASSO2plus(), and LASSO2_reg() perform variable selection using LASSO methods and construct predictive models based on the selected variables. 2) XGBtraining() employs XGBoost for model building and is the only function not involving variable selection. 3) Functions like LASSO2_XGBtraining(), LASSOplus_XGBtraining(), and LASSO2plus_XGBtraining() combine LASSO-related variable selection with XGBoost for model construction. - All models support prediction and validation, requiring a testing dataset comparable to the training dataset. Additionally, the package introduces XGpred() for risk prediction based on survival data, with the XGpred_predict() function available for predicting risk groups in new datasets. The methodology is based on our new algorithms and various references: - Hastie et al. (1992, ISBN 0-534-16765-9), - Therneau et al. (2000, ISBN 0-387-98784-3), - Kassambara et al. (2021) <https://CRAN.R-project.org/package=survminer>, - Friedman et al. (2010) <doi:10.18637/jss.v033.i01>, - Simon et al. (2011) <doi:10.18637/jss.v039.i05>, - Harrell (2023) <https://CRAN.R-project.org/package=rms>, - Harrell (2023) <https://CRAN.R-project.org/package=Hmisc>, - Chen and Guestrin (2016) <arXiv:1603.02754>, - Aoki et al. (2023) <doi:10.1200/JCO.23.01115>.
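A hedged end-to-end sketch using the function names listed above, with the package attached; the argument names (data, outcome, biomarkers, time, event, newdata) are illustrative assumptions rather than the documented signatures:

```r
## Workflow sketch only: argument names are assumptions, not a verified API.
fit  <- LASSO2(data = train, outcome = "status", biomarkers = vars)      # variable selection + model
xgb  <- XGBtraining(data = train, outcome = "status", biomarkers = vars) # XGBoost model, no selection
risk <- XGpred(data = train, time = "time", event = "event")             # risk groups from survival data
grp  <- XGpred_predict(object = risk, newdata = test)                    # risk groups for new samples
```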
First, we provide functions to calculate the partial derivative of the first-passage time diffusion probability density function (PDF) and cumulative distribution function (CDF) with respect to the first-passage time t (only for the PDF), the upper barrier a, the drift rate v, the relative starting point w, the non-decision time t0, the inter-trial variability of the drift rate sv, the inter-trial variability of the relative starting point sw, and the inter-trial variability of the non-decision time st0. In addition, the PDF and CDF themselves are provided. Most calculations are done on the logarithmic scale to make them more stable. Since the PDF, CDF, and their derivatives are represented as infinite series, we give the user the option to control the approximation errors with the argument precision. For the numerical integration we used the C library cubature by Johnson, S. G. (2005-2013) <https://github.com/stevengj/cubature>. Numerical integration is required whenever sv, sw, and/or st0 is not zero. Note that numerical integration reduces the speed of the computation and the precision can no longer be guaranteed. Therefore, whenever numerical integration is used, an estimate of the approximation error is provided in the output list. Note: The large number of contributors (ctb) is due to copying a lot of C/C++ code chunks from the GNU Scientific Library (GSL). Second, we provide methods to sample from the first-passage time distribution with or without user-defined truncation from above. The first method is a new adaptive rejection sampler building on the works of Gilks and Wild (1992; <doi:10.2307/2347565>) and Hartmann and Klauer (in press). The second method is a rejection sampler provided by Drugowitsch (2016; <doi:10.1038/srep20490>). The third method is an inverse transformation sampler. The fourth method is a "pseudo" adaptive rejection sampler that builds on the first method. For more details see the corresponding help files.
Self-reported health, happiness, attitudes, and other statuses or perceptions are often subject to biases that may come from different sources. For example, the evaluation of an individual's own health may depend on previous medical diagnoses, functional status, and symptoms and signs of illness, as well as on life-style behaviors, including contextual social, gender, age-specific, linguistic and other cultural factors (Jylha 2009 <doi:10.1016/j.socscimed.2009.05.013>; Oksuzyan et al. 2019 <doi:10.1016/j.socscimed.2019.03.002>). The hopit package offers versatile functions for analyzing different self-reported ordinal variables and for helping to estimate their biases. Specifically, the package provides a function to fit a generalized ordered probit model that regresses original self-reported status measures on two sets of independent variables (King et al. 2004 <doi:10.1017/S0003055403000881>; Jurges 2007 <doi:10.1002/hec.1134>; Oksuzyan et al. 2019 <doi:10.1016/j.socscimed.2019.03.002>). The first set of variables (e.g., health variables) included in the regression are individual statuses and characteristics that are directly related to the self-reported variable. In the case of self-reported health, these could be chronic conditions, mobility level, difficulties with daily activities, performance on grip strength tests, anthropometric measures, and lifestyle behaviors. The second set of independent variables (threshold variables) is used to model cut-points between adjacent self-reported response categories as functions of individual characteristics, such as gender, age group, education, and country (Oksuzyan et al. 2019 <doi:10.1016/j.socscimed.2019.03.002>). The model helps to adjust for specific socio-demographic and cultural differences in how the continuous latent health is projected onto the ordinal self-rated measure. The fitted model can be used to calculate an individual predicted latent status variable, a latent index, and standardized latent coefficients, and makes it possible to reclassify a categorical status measure that has been adjusted for inter-individual differences in reporting behavior.
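Schematically, in standard generalized ordered probit notation (the package's exact parameterization may differ), a latent continuous status H*_i is driven by the health variables x_i while the cut-points between adjacent reported categories vary with the threshold variables z_i:

```latex
H_i^{*} = x_i^{\top}\beta + \varepsilon_i,\quad \varepsilon_i \sim \mathcal{N}(0,1);
\qquad
y_i = k \;\Longleftrightarrow\; \alpha_{i,k-1} < H_i^{*} \le \alpha_{i,k},
\quad \alpha_{i,k} = g_k\bigl(z_i^{\top}\gamma_k\bigr),
```

so reporting-style differences across, say, countries or age groups are absorbed by the individual-specific thresholds alpha_{i,k} rather than by the latent index itself.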
Nuclear magnetic resonance (NMR) is a highly versatile analytical technique for studying molecular configuration, conformation, and dynamics, especially those of biomacromolecules such as proteins. The Biological Magnetic Resonance Data Bank ('BMRB') is a repository for data from NMR spectroscopy on proteins, peptides, nucleic acids, and other biomolecules. BMRB currently offers an R package, RBMRB, to fetch data; however, it does not easily support downloading individual data files and storing them in a local directory. When using RBMRB, the data are stored as an R object, which fundamentally hinders NMR researchers from accessing the rich information in the raw data, for example, the metadata. Here, the BMRBr File Downloader ('BMRBr') offers a more fundamental, low-level downloader that retrieves the originally deposited .str format files. This type of file contains information such as the entry title, authors, citation, protein sequences, and so on. Many factors affect NMR experiment outputs, such as temperature and resonance sensitivity; approximately 40% of the entries in the BMRB have chemical shift accuracy problems [1,2]. Unfortunately, current reference correction methods depend heavily on the availability of assigned protein chemical shifts or a protein structure. This is the problem my current research project aims to solve, and the solution will be included in a future release of the package. The current version of the package is sufficient and robust enough for downloading individual BMRB data files from the BMRB database <http://www.bmrb.wisc.edu>. The functionality of this package includes, but is not limited to: * simplifying NMR research by combining data downloading and results analysis; * allowing NMR data to reach a broader audience that can utilize not only chemical shifts but also metadata; * offering reference-corrected data for entries without assignment or structure information (future release). References: [1] E.L. Ulrich, H. Akutsu, J.F. Doreleijers, Y. Harano, Y.E. Ioannidis, J. Lin, et al., BioMagResBank, Nucl. Acids Res. 36 (2008) D402-8. <doi:10.1093/nar/gkm957>. [2] L. Wang, H.R. Eghbalnia, A. Bahrami, J.L. Markley, Linear analysis of carbon-13 chemical shift differences and its application to the detection and correction of errors in referencing and spin system identifications, J. Biomol. NMR. 32 (2005) 13-22. <doi:10.1007/s10858-005-1717-0>.
This package implements the fence method, a new class of model selection strategies for mixed model selection, which includes linear and generalized linear mixed models. The idea involves a procedure to isolate a subgroup of what are known as correct models (of which the optimal model is a member). This is accomplished by constructing a statistical fence, or barrier, to carefully eliminate incorrect models. Once the fence is constructed, the optimal model is selected from among those within the fence according to a criterion which can be made flexible. References: 1. Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692. <DOI:10.1214/07-AOS517> <https://projecteuclid.org/euclid.aos/1216237296>. 2. Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629. <DOI:10.1016/j.spl.2008.10.014> <https://www.researchgate.net/publication/23991417_A_simplified_adaptive_fence_procedure>. 3. Jiang J., Nguyen T., Rao J.S. (2010), Fence Method for Nonparametric Small Area Estimation. Survey Methodology, 36(1), 3-11. <http://publications.gc.ca/collections/collection_2010/statcan/12-001-X/12-001-x2010001-eng.pdf>. 4. Jiming Jiang, Thuan Nguyen and J. Sunil Rao (2011), Invisible Fence Methods and the Identification of Differentially Expressed Gene Sets. Statistics and Its Interface, 4, 403-415. <http://www.intlpress.com/site/pub/files/_fulltext/journals/sii/2011/0004/0003/SII-2011-0004-0003-a014.pdf>. 5. Thuan Nguyen and Jiming Jiang (2012), Restricted Fence Method for Covariate Selection in Longitudinal Data Analysis. Biostatistics, 13(2), 303-314. <DOI:10.1093/biostatistics/kxr046> <https://academic.oup.com/biostatistics/article/13/2/303/263903/Restricted-fence-method-for-covariate-selection-in>. 6. Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662. <DOI:10.1080/00949655.2012.721885> <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3891925/>. 7. Jiang, J. (2014), The Fence Methods, in Advances in Statistics, Hindawi Publishing Corp., Cairo. <DOI:10.1155/2014/830821>. 8. Jiming Jiang and Thuan Nguyen (2015), The Fence Methods, World Scientific, Singapore. <https://www.abebooks.com/9789814596060/Fence-Methods-Jiming-Jiang-981459606X/plp>.
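In schematic form (Jiang et al., 2008; notation simplified), with Q-hat_M a measure of lack of fit for candidate model M and M-tilde the candidate minimizing it, the fence admits exactly the models satisfying

```latex
\hat{Q}_M \;\le\; \hat{Q}_{\tilde{M}} \;+\; c\,\hat{\sigma}_{M,\tilde{M}},
```

where sigma-hat_{M,M-tilde} estimates the variability of the difference in lack-of-fit measures and the constant c can be fixed or chosen adaptively; the optimal model is then picked from the admitted set by a flexible criterion, such as minimal dimension.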