Similarity of dissolution profiles is assessed using the similarity factor f2 according to the EMA guideline "On the investigation of bioequivalence" (European Medicines Agency 2010). Dissolution profiles are regarded as similar if the f2 value lies between 50 and 100. For the similarity factor f2 to be applicable, the variability between profiles needs to be within certain limits; often, this constraint is violated. One possibility in this situation is to resample the measured profiles in order to obtain a bootstrap estimate of f2 (Shah et al. (1998) <doi:10.1023/A:1011976615750>). Other alternatives are the model-independent non-parametric multivariate confidence region (MCR) procedure (Tsong et al. (1996) <doi:10.1177/009286159603000427>) and the T2-test for equivalence procedure (Hoffelder (2016) <https://www.ecv.de/suse_item.php?suseId=Z|pi|8430>). Functions for the estimation of f1, f2, bootstrap f2, and the MCR / T2-test for equivalence procedures are implemented.
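For reference, the f2 statistic itself reduces to a one-line computation for two mean profiles measured at the same time points; the following is a minimal base-R sketch, not the package's own implementation (which additionally covers the bootstrap and the MCR / T2 procedures):

```r
## Toy f2 computation for two mean dissolution profiles measured at
## the same time points.
f2 <- function(ref, test) {
  stopifnot(length(ref) == length(test))
  50 * log10(100 / sqrt(1 + mean((ref - test)^2)))
}

ref  <- c(15, 40, 65, 85, 95)   # % dissolved, reference batch
test <- c(12, 38, 60, 80, 93)   # % dissolved, test batch
f2(ref, test)                   # ~71; values in [50, 100] indicate similarity
```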
Allows ATA (Automatic Time series analysis using the Ata method) models from the 'ATAforecasting' package to be used in a tidy workflow with the modeling interface of 'fabletools'. This extends 'ATAforecasting' to provide enhanced model specification and management, performance evaluation methods, and model combination tools. The Ata method (Yapar et al. (2019) <doi:10.15672/hujms.461032>), an alternative to exponential smoothing (described in Yapar (2016) <doi:10.15672/HJMS.201614320580> and Yapar et al. (2017) <doi:10.15672/HJMS.2017.493>), is a univariate time series forecasting method that provides innovative solutions to issues faced during the initialization and optimization stages of existing forecasting methods. The Ata method is superior to existing methods in both ease of implementation and forecast accuracy. It can be applied to non-seasonal or seasonal time series that can be decomposed into four components (remainder, level, trend and seasonal).
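As a rough intuition for the method, the sketch below implements only the simple, level-only Ata recursion as we understand it from Yapar et al. (2019), where the smoothing weight p/t decays with the observation index t instead of staying fixed as in simple exponential smoothing; the actual package optimizes p and also handles trend and seasonality, so treat this as an illustrative assumption, not the package's code:

```r
## Level-only Ata recursion (toy sketch): weight p/t shrinks over time.
ata_level <- function(x, p = 1) {
  s <- numeric(length(x))
  s[1] <- x[1]
  for (t in 2:length(x)) {
    w <- min(p / t, 1)                    # time-varying smoothing weight
    s[t] <- w * x[t] + (1 - w) * s[t - 1]
  }
  s
}

set.seed(1)
x <- cumsum(rnorm(50)) + 20
tail(ata_level(x, p = 3), 1)              # one-step-ahead level forecast
```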
This package provides a collection of functions designed to streamline the retrieval of data from Brazilian lottery games operated by Caixa Econômica Federal, accessible through the official website at <https://loterias.caixa.gov.br/Paginas/default.aspx/>. Datasets for each game are conveniently stored on the GitHub page at <https://github.com/tomasbp2/LotteryBrasilDATA/>. Each game within this repository consists of two primary datasets: the winners dataset and the numbers dataset. The winners dataset includes crucial information such as the draw date, game type, potential matches, winners for each match, and corresponding prize amounts. Meanwhile, the numbers dataset provides essential details including the draw date, game type, and the numbers drawn during the respective lottery event. By offering easy access to these datasets, the package facilitates efficient data retrieval and analysis for researchers, analysts, and enthusiasts interested in exploring the dynamics and outcomes of Brazilian lottery games.
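Since the datasets live in a public GitHub repository, one can in principle read them directly; the file name below is purely a hypothetical placeholder (the repository layout is not documented here), and the package's own accessor functions are the supported route:

```r
## Hypothetical direct read from the GitHub repository; the file name
## "megasena_winners.csv" is an assumption, not a documented path.
base <- "https://raw.githubusercontent.com/tomasbp2/LotteryBrasilDATA/main"
winners <- read.csv(file.path(base, "megasena_winners.csv"))
head(winners)
```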
Statistical methods to match feature vectors between multiple datasets in a one-to-one fashion. Given a fixed number of classes/distributions, for each unit exactly one vector of each class is observed without label. The goal is to label the feature vectors, using each label exactly once, so as to produce the best match across datasets, e.g. by minimizing the variability within classes. Statistical solutions based on empirical loss functions and probabilistic modeling are provided. The Gurobi software and its R interface package are required for one of the package functions (match.2x()) and can be obtained at <https://www.gurobi.com/> (free academic license). For more details, refer to Degras (2022) <doi:10.1080/10618600.2022.2074429> "Scalable feature matching for large data collections" and Bandelt, Maas, and Spieksma (2004) <doi:10.1057/palgrave.jors.2601723> "Local search heuristics for multi-index assignment problems with decomposable costs".
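For the two-dataset case, the underlying one-to-one assignment can be illustrated with the Hungarian algorithm from the 'clue' package; this is a sketch of the idea, not this package's own routines:

```r
## Match each row of Y to exactly one row of X, minimizing total
## squared distance between matched feature vectors.
library(clue)

set.seed(42)
X <- matrix(rnorm(5 * 3), nrow = 5)                           # 5 classes, 3 features
Y <- X[sample(5), ] + matrix(rnorm(15, sd = 0.1), nrow = 5)   # shuffled, noisy copy

cost <- as.matrix(dist(rbind(X, Y)))[1:5, 6:10]^2             # pairwise squared distances
perm <- solve_LSAP(cost)                                      # optimal one-to-one labeling
cbind(row_of_X = 1:5, matched_row_of_Y = as.integer(perm))
```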
The first stand-alone R package for computation of latent correlation that takes into account all variable types (continuous/binary/ordinal/zero-inflated), comes with an optimized memory footprint, and is computationally efficient, essentially making latent correlation estimation almost as fast as rank-based correlation estimation. The estimation is based on latent Gaussian copula models. For continuous/binary types, see Fan, J., Liu, H., Ning, Y., and Zou, H. (2017). For ternary type, see Quan X., Booth J.G. and Wells M.T. (2018) <arXiv:1809.06255>. For truncated type or zero-inflated type, see Yoon G., Carroll R.J. and Gaynanova I. (2020) <doi:10.1093/biomet/asaa007>. For the approximation method of computation, see Yoon G., Müller C.L. and Gaynanova I. (2021) <doi:10.1080/10618600.2021.1882468>. The latter method uses multi-linear interpolation originally implemented in the R package 'chebpol' (<https://cran.r-project.org/package=chebpol>).
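A short usage sketch follows, assuming the latentcor() and gen_data() entry points with a types argument as documented for the package; argument details should be checked against the package documentation:

```r
## Hedged usage sketch: estimate latent correlation on mixed-type data
## and contrast with a plain rank-based (Kendall) estimate.
library(latentcor)

X   <- gen_data(n = 300, types = c("con", "bin", "ter"))$X  # simulated mixed types
est <- latentcor(X, types = c("con", "bin", "ter"))         # latent correlation
est$R                                                        # estimated matrix
cor(X, method = "kendall")                                   # rank-based, for contrast
```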
Easily analyze and visualize differences between samples (e.g., benchmark comparisons, nonresponse comparisons in surveys) on three levels. The comparisons can be univariate, bivariate or multivariate. On the univariate level, the variables of interest of a survey and a comparison survey (i.e. benchmark) are compared by calculating one of several difference measures (e.g., relative difference in mean), together with an average difference between the surveys. On the bivariate level, a function can calculate significant differences in correlations between the surveys. On the multivariate level, a function can calculate significant differences in model coefficients between the surveys of comparison. All of these differences can easily be plotted and output as a table. For more detailed information on the methods and example use, see Rohr, B., Silber, H., & Felderer, B. (2024), "Comparing the Accuracy of Univariate, Bivariate, and Multivariate Estimates across Probability and Nonprobability Surveys with Population Benchmarks", Sociological Methodology <doi:10.1177/00811750241280963>.
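To make the univariate level concrete, here is a toy base-R computation of the relative difference in means and a naive average across variables; the package's measures, weighting, and significance tests are more elaborate:

```r
## Relative difference in means between a survey and a benchmark,
## computed per variable, then averaged in absolute value.
rel_diff <- function(survey, benchmark) {
  (mean(survey) - mean(benchmark)) / mean(benchmark)
}

set.seed(7)
survey    <- data.frame(age = rnorm(500, 46, 15),  income = rlnorm(500, 10, 0.6))
benchmark <- data.frame(age = rnorm(5000, 44, 16), income = rlnorm(5000, 10.1, 0.6))

d <- mapply(rel_diff, survey, benchmark)
d                # per-variable relative difference
mean(abs(d))     # average absolute relative difference
```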
This package provides interactive, configurable graphics visualization of the chromosome regions of any living organism, allowing users to map chromosome elements (like genes, SNPs, etc.) onto the chromosome plot. It introduces a special plot, the "chromosome heatmap", that, in addition to mapping elements, can visualize the data associated with chromosome elements (like gene expression) in the form of heat colors. Users can investigate detailed information about the mappings (like gene names or total genes mapped at a location) or can view a magnified single- or double-stranded view of the chromosome at a location, showing each mapped element in sequential order. The package provides multiple features like visualizing multiple sets, chromosome heatmaps, group annotations, adding hyperlinks, and labelling. The plots can be saved as HTML documents that can be customized and shared easily. In addition, you can include them in R Markdown or in R Shiny applications.
An extensible repository of accurate, up-to-date functions to score commonly used patient-reported outcome (PRO), quality of life (QOL), and other psychometric and psychological measures. 'PROscorer', together with the 'PROscorerTools' package, is a system to facilitate the incorporation of PRO measures into research studies and clinical settings in a scientifically rigorous and reproducible manner. These packages and their vignettes are intended to help establish and promote best practices for scoring PRO and PRO-like measures in research. The PROscorer Instrument Descriptions vignette contains descriptions of each instrument scored by 'PROscorer', complete with references. These instrument descriptions are suitable for inclusion in formal study protocol documents, grant proposals, and manuscript Method sections. Each 'PROscorer' function is composed of helper functions from the 'PROscorerTools' package, and users are encouraged to contribute new functions to 'PROscorer'. More scoring functions are currently in development and will be added in future updates.
The implementation of a forecasting-specific tree-based model that is particularly suitable for global time series forecasting, as proposed in Godahewa et al. (2022) <arXiv:2211.08661v1>. The model uses the concept of Self Exciting Threshold Autoregressive (SETAR) models to define the node splits and is thus named SETAR-Tree. The SETAR-Tree uses some time-series-specific splitting and stopping procedures. It trains global pooled regression models in the leaves, allowing the models to learn cross-series information. The depth of the tree is controlled by conducting a statistical linearity test as well as measuring the error reduction percentage at each node split. Thus, the SETAR-Tree requires minimal external hyperparameter tuning and provides competitive results under its default configuration. A forest is developed by extending the SETAR-Tree: the SETAR-Forest combines the forecasts provided by a collection of diverse SETAR-Trees during the forecasting process.
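The SETAR idea behind the node splits can be illustrated in base R: observations are split into two regimes by comparing a lagged value against a threshold, and a separate linear autoregression is fitted in each regime. This is a sketch of the split concept only, not the package's tree:

```r
## Two-regime SETAR illustration: split on a lagged value, fit a
## linear AR model per regime.
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 300))
d <- data.frame(y = x[-1], lag1 = x[-length(x)])

threshold <- 0                                         # candidate split point
lo <- lm(y ~ lag1, data = d[d$lag1 <= threshold, ])    # regime 1 model
hi <- lm(y ~ lag1, data = d[d$lag1 >  threshold, ])    # regime 2 model
c(coef(lo)[2], coef(hi)[2])                            # per-regime AR coefficients
```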
This package provides new tools for analyzing discrete trait data by integrating bio-ontologies and phylogenetics. It expands on the previous work of Tarasov et al. (2019) <doi:10.1093/isd/ixz009>. The PARAMO pipeline allows users to reconstruct ancestral phenomes, treating groups of morphological traits as a single complex character. The pipeline incorporates knowledge from ontologies during the amalgamation of individual character stochastic maps. Here we expand the original PARAMO functionality by adding new statistical methods for inferring evolutionary phenome dynamics using a non-homogeneous Poisson process (NHPP). The new functionalities include: (1) reconstruction of evolutionary rate shifts of phenomes across lineages and time; (2) reconstruction of morphospace dynamics through time; and (3) estimation of rates of phenome evolution at different levels of the anatomical hierarchy (e.g., entire body or specific regions only). The package also includes user-friendly tools for visualizing evolutionary rates of different anatomical regions using vector images of the organisms of interest.
An easy-to-use tool for working with presence/absence tests on pooled or grouped samples. The primary application is estimating the prevalence of a marker in a population based on the results of tests on pooled specimens. This sampling method is often employed in surveillance of rare conditions in humans or animals (e.g. molecular xenomonitoring). The package was initially conceived as an R-based alternative to the molecular xenomonitoring software PoolScreen (<https://sites.uab.edu/statgenetics/software/>). However, it goes further, allowing estimates of prevalence to be adjusted for hierarchical sampling frames and supporting flexible mixed-effect regression analyses (McLure et al., Environmental Modelling and Software, <DOI:10.1016/j.envsoft.2021.105158>). The package is still in its early stages; more features are planned or in the works, e.g. adjustments for imperfect test specificity/sensitivity, functions for helping with optimal experimental design, and functions for spatial modelling.
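The basic estimation problem is easy to state when all pools have the same size k: a pool tests negative with probability (1 - p)^k, so the maximum-likelihood prevalence estimate inverts the observed negative fraction. The back-of-envelope sketch below covers only this equal-pool case; the package handles unequal pools, hierarchical sampling, and regression adjustments:

```r
## MLE of marker prevalence from equal-sized pools:
## P(pool negative) = (1 - p)^k  =>  p_hat = 1 - (fraction negative)^(1/k).
pool_size <- 10
n_pools   <- 200
n_pos     <- 23                                    # pools testing positive

p_hat <- 1 - ((n_pools - n_pos) / n_pools)^(1 / pool_size)
p_hat                                              # ~0.012, i.e. about 1.2%
```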
Allows clinicians and researchers to compute daily dose (and subsequently days' supply) for prescription refills using the following methods: fixed window, fixed tablet, defined daily dose (DDD), and Random Effects Warfarin Days Supply (REWarDS). Daily dose is the computed dose that the patient takes every day. For medications with fixed dosing (e.g. direct oral anticoagulants) this is known and does not need to be estimated. For medications with varying dose such as warfarin, however, the daily dose should be assumed or estimated to allow measurement of drug exposure. Days' supply is the number of days that a patient's supply of medication will last after each prescription fill. Estimating days' supply is necessary to calculate drug exposure. The package computes days' supply and daily dose at both the prescription and patient levels. Results at the prescription level are denoted with '-Rx-' and those at patient level are denoted with '-Pt-'.
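As a hedged sketch of the fixed-window idea for a varying-dose drug: if each fill is assumed to last until the next fill, the daily dose is the total dose dispensed divided by the days to the next fill. The column names below are illustrative, not the package's interface:

```r
## Fixed-window daily dose: total mg dispensed / days between fills.
fills <- data.frame(
  fill_date = as.Date(c("2023-01-01", "2023-02-15", "2023-04-01")),
  total_mg  = c(225, 270, 240)                 # tablets dispensed x strength
)

window_days <- as.numeric(diff(fills$fill_date))           # days between fills
daily_dose  <- fills$total_mg[-nrow(fills)] / window_days  # mg/day per fill
daily_dose                                                 # 5 and 6 mg/day
```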
This package provides functions for model fitting and selection of generalised hypergeometric ensembles of random graphs (gHypEG). To learn how to use it, check the vignettes for a quick tutorial. Please reference its use as Casiraghi, G., Nanumyan, V. (2019) <doi:10.5281/zenodo.2555300>, together with the relevant references from those listed below. The package is based on research developed at the Chair of Systems Design, ETH Zurich. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2016) <arXiv:1607.02441>. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2017) <doi:10.1007/978-3-319-67256-4_11>. Casiraghi, G. (2017) <arXiv:1702.02048>. Brandenberger, L., Casiraghi, G., Nanumyan, V., Schweitzer, F. (2019) <doi:10.1145/3341161.3342926>. Casiraghi, G. (2019) <doi:10.1007/s41109-019-0241-1>. Casiraghi, G., Nanumyan, V. (2021) <doi:10.1038/s41598-021-92519-y>. Casiraghi, G. (2021) <doi:10.1088/2632-072X/ac0493>.
Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and materials science. Machine learning has enabled us to generate useful protein sequences on a variety of scales. Generative models are machine learning methods which seek to model the distribution underlying the data, allowing for the generation of novel samples with similar properties to those on which the model was trained. Generative models of proteins can learn biologically meaningful representations helpful for a variety of downstream tasks. Furthermore, they can learn to generate protein sequences that have not been observed before and to assign higher probability to protein sequences that satisfy desired criteria. In this package, common deep generative models for protein sequences, such as variational autoencoders (VAE), generative adversarial networks (GAN), and autoregressive models, are available. In the VAE and GAN, Word2vec is used for embedding. The transformer encoder is applied to protein sequences for the autoregressive model.
Processes noble gas mass spectrometer data to determine the isotopic composition of argon (comprising Ar36, Ar37, Ar38, Ar39 and Ar40) released from neutron-irradiated potassium-bearing minerals. It then uses these compositions to calculate precise and accurate geochronological ages for multiple samples, as well as the covariances between them. Error propagation is done in matrix form, which jointly treats all samples and all isotopes simultaneously at every step of the data reduction process. Includes methods for regression of the time-resolved mass spectrometer signals to t=0 ('time zero') for both single- and multi-collector instruments, blank correction, mass fractionation correction, detector intercalibration, decay corrections, interference corrections, interpolation of the irradiation parameter between neutron fluence monitors, and (weighted mean) age calculation. All operations are performed on the logs of the ratios between the different argon isotopes so as to properly treat them as 'compositional data', sensu Aitchison [1986, The Statistics of Compositional Data, Chapman and Hall].
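The log-ratio treatment can be illustrated with a minimal additive log-ratio transform using Ar40 as the common denominator; the signal values below are invented, and the package itself propagates the full covariance matrix alongside:

```r
## Additive log-ratio (alr) transform of argon isotope signals
## (after Aitchison 1986), with Ar40 as the denominator.
signals <- c(Ar36 = 0.0021, Ar37 = 0.010, Ar38 = 0.0045,
             Ar39 = 0.52,   Ar40 = 1.00)

logratios <- log(signals[-5] / signals["Ar40"])  # alr coordinates
logratios
exp(logratios)                                   # back to ratios, e.g. Ar39/Ar40
```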
An important environmental impact on running-water ecosystems is caused by hydropeaking: the discontinuous release of turbine water in response to peaks in energy demand. An event-based algorithm is implemented to detect flow fluctuations, referred to as increase events (IC) and decrease events (DC). For each event, a set of parameters related to the fluctuation intensity is calculated. The framework is introduced in Greimel et al. (2016), "A method to detect and characterize sub-daily flow fluctuations" <doi:10.1002/hyp.10773>, and can be used to identify different fluctuation types according to the potential source: e.g., sub-daily flow fluctuations caused by hydropeaking, rainfall, or snow and glacier melt. This is a companion to the package 'hydroroute', which is used to detect and follow hydropower plant-specific hydropeaking waves at the sub-catchment scale and to describe how hydropeaking flow parameters change along the longitudinal flow path, as proposed and validated in Greimel et al. (2022).
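A toy event-based classification in the spirit of the algorithm is sketched below: runs of consecutive increases form an increase event (IC) and runs of decreases form a decrease event (DC). The series and metrics are placeholders, not the published parameter set:

```r
## Classify runs of rising/falling discharge as IC/DC events.
Q  <- c(10, 10, 14, 22, 30, 29, 24, 18, 17, 17)   # discharge series
dq <- diff(Q)
state  <- ifelse(dq > 0, "IC", ifelse(dq < 0, "DC", "flat"))
events <- rle(state)                               # each run = one event
data.frame(type = events$values, length_steps = events$lengths)
```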
An index is created using a mathematical model that transforms multi-dimensional variables into a single value. These variables are often correlated, and while PCA-based indices can address the issue of multicollinearity, they typically do not account for survey weights, which can lead to inaccurate rankings of survey units such as households, districts, or states. To resolve this, the current package facilitates the development of a principal component analysis-based composite index by incorporating survey weights for each sample observation, ensuring the generation of a survey-weighted principal component-based normalized composite index. Additionally, the package provides a normalized principal component-based composite index and ranks the sample observations based on the values of the composite indices. For method details, see Skinner, C. J., Holmes, D. J. and Smith, T. M. F. (1986) <DOI:10.1080/01621459.1986.10478336> and Singh, D., Basak, P., Kumar, R. and Ahmad, T. (2023) <DOI:10.3389/fams.2023.1274530>.
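The general idea can be sketched in base R (weighted standardization, weighted covariance, first eigenvector as loadings, scores rescaled to [0, 100] and ranked); this mirrors the approach, not the package's exact estimator:

```r
## Survey-weighted PCA composite index, toy version.
set.seed(123)
X <- matrix(rnorm(200 * 4), ncol = 4)        # 4 indicators for 200 units
w <- runif(200, 0.5, 2); w <- w / sum(w)     # normalized survey weights

mu <- colSums(w * X)                         # weighted means
Xc <- sweep(X, 2, mu)
S  <- crossprod(Xc * sqrt(w))                # weighted covariance matrix
v  <- eigen(S)$vectors[, 1]                  # first principal component

idx <- as.numeric(Xc %*% v)
idx <- 100 * (idx - min(idx)) / (max(idx) - min(idx))  # normalized index
head(rank(-idx))                             # ranks, best first
```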
This package performs non-parametric tests of parametric specifications. Five tests are available. Specific bandwidth and kernel methods can be chosen along with many other options. Allows parallel computing to quickly compute p-values based on the bootstrap. Methods implemented in the package are H.J. Bierens (1982) <doi:10.1016/0304-4076(82)90105-1>, J.C. Escanciano (2006) <doi:10.1017/S0266466606060506>, P.L. Gozalo (1997) <doi:10.1016/S0304-4076(97)86571-2>, P. Lavergne and V. Patilea (2008) <doi:10.1016/j.jeconom.2007.08.014>, P. Lavergne and V. Patilea (2012) <doi:10.1198/jbes.2011.07152>, J.H. Stock and M.W. Watson (2006) <doi:10.1111/j.1538-4616.2007.00014.x>, C.F.J. Wu (1986) <doi:10.1214/aos/1176350142>, J. Yin, Z. Geng, R. Li, H. Wang (2010) <https://www.jstor.org/stable/24309002> and J.X. Zheng (1996) <doi:10.1016/0304-4076(95)01760-7>.
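The bootstrap p-value computation can be sketched generically: resample under the null, recompute a statistic, and compare to the observed value, distributing the replications across cores. The statistic below is a toy; the package's test statistics and wild-bootstrap schemes are more refined:

```r
## Generic parallel wild-bootstrap p-value sketch.
library(parallel)

set.seed(9)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
stat <- function(res) max(abs(cumsum(res))) / sqrt(length(res))  # toy statistic

fit  <- lm(y ~ x)
res0 <- resid(fit)
obs  <- stat(res0)

boot <- mclapply(1:999, function(i) {
  ystar <- fitted(fit) + res0 * sample(c(-1, 1), 100, replace = TRUE)  # wild bootstrap
  stat(resid(lm(ystar ~ x)))
}, mc.cores = 2)   # >1 core not supported on Windows; use parLapply there

mean(unlist(boot) >= obs)    # bootstrap p-value
```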
This package contains functions for retrieving, managing and analysing air quality and weather data from the Regione Lombardia open database (<https://www.dati.lombardia.it/>). Data are collected by ARPA Lombardia (Lombardia Environmental Protection Agency), Italy, through its ground monitoring network (<https://www.dati.lombardia.it/stories/s/auv9-c2sj>). See the webpage <https://www.arpalombardia.it/> for further information on ARPA Lombardia's activities and history. Data quality (e.g. missing values, exported values, graphical mapping) has been checked with members of ARPA Lombardia's office for air quality control. The package makes available observations since 1989 (for weather) and 1968 (for air quality), which are updated with daily frequency by the regional agency. A full description of the package can be found in the companion paper Maranzano & Algieri (2024), "ARPALData: an R package for retrieving and analyzing air quality and weather data from ARPA Lombardia (Italy)", Environmental and Ecological Statistics, <doi:10.1007/s10651-024-00599-6>.
This package provides a suite of machine learning algorithms written in C++ with an R interface, containing several learning techniques for classification and regression. Predictive models include, e.g., classification and regression trees with optional constructive induction and models in the leaves, random forests, kNN, naive Bayes, and locally weighted regression. All predictions obtained with these models can be explained and visualized with the 'ExplainPrediction' package. This package is especially strong in feature evaluation, where it contains several variants of the Relief algorithm and many impurity-based attribute evaluation functions, e.g., Gini, information gain, MDL, and DKM. These methods can be used for feature selection or discretization of numeric attributes. The OrdEval algorithm and its visualization are used for evaluation of data sets with ordinal features and class, enabling analysis according to the Kano model of customer satisfaction. Several algorithms support parallel multithreaded execution via OpenMP. The top-level documentation is reachable through ?CORElearn.
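A short usage sketch with the package's documented entry points, fitting a random forest with CoreModel() and scoring attributes with a Relief variant via attrEval():

```r
## Fit, predict, and evaluate features with CORElearn.
library(CORElearn)

model <- CoreModel(Species ~ ., iris, model = "rf")        # random forest
pred  <- predict(model, iris, type = "class")
mean(pred == iris$Species)                                 # training accuracy

attrEval(Species ~ ., iris, estimator = "ReliefFequalK")   # feature evaluation
```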
This package provides tools for shoreline dating coastal Stone Age sites. The implemented method was developed in Roalkvam (2023) <doi:10.1016/j.quascirev.2022.107880> for the Norwegian Skagerrak coast. Although it can be extended to other areas, this also forms the core area for application of the package. Shoreline dating is based on the present-day elevation of a site, a reconstruction of past relative sea-level change, and empirically derived estimates of the likely elevation of the sites above the contemporaneous sea-level when they were in use. The geographical and temporal coverage of the method thus follows from the availability of local geological reconstructions of shoreline displacement and the degree to which the settlements to be dated have been located on or close to the shoreline when they were in use. Methods for numerical treatment and visualisation of the dates are provided, along with basic tools for visualising and evaluating the location of sites.
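The core idea can be illustrated by inverting a shoreline displacement curve to translate a site's elevation into an approximate age; the curve values below are invented, and the package's actual machinery additionally models the sites' likely elevation above the contemporaneous shoreline:

```r
## Toy shoreline dating: interpolate age from site elevation using an
## invented displacement curve (shoreline elevation vs. age).
displacement <- data.frame(
  age_bp  = seq(11000, 0, by = -1000),                        # years before present
  shore_m = c(60, 52, 45, 38, 32, 26, 21, 16, 11, 7, 3, 0)    # shoreline above today
)

site_elevation <- 30   # metres above present sea level
approx(displacement$shore_m, displacement$age_bp, xout = site_elevation)$y
```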
Genome-wide association studies (GWAS) are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. The R package SNPRelate provides a binary format for single-nucleotide polymorphism (SNP) data in GWAS utilizing CoreArray Genomic Data Structure (GDS) data files. The GDS format offers efficient operations specifically designed for two-bit integers, since a SNP genotype can be stored in only two bits. SNPRelate is also designed to accelerate two key computations on SNP data using parallel computing for multi-core symmetric multiprocessing computer architectures: Principal Component Analysis (PCA) and relatedness analysis using Identity-By-Descent measures. The SNP GDS format is also used by the GWASTools package with the support of S4 classes and generic functions. The extended GDS format is implemented in the SeqArray package to support the storage of single nucleotide variations (SNVs), insertion/deletion polymorphisms (indels) and structural variation calls in whole-genome and whole-exome variant data.
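A quick-start sketch using the package's bundled example file, opening the GDS file and running the multithreaded PCA:

```r
## Open an example GDS file and run PCA with SNPRelate.
library(SNPRelate)

genofile <- snpgdsOpen(snpgdsExampleFileName())
pca <- snpgdsPCA(genofile, num.thread = 2)
head(pca$eigenvect[, 1:2])     # first two principal components per sample
snpgdsClose(genofile)
```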
We present a rank-based Mercer kernel to compute a pair-wise similarity metric corresponding to an informative representation of the data. We tailor the development of the kernel to encode our prior knowledge about the data distribution over a probability space. The philosophical concept behind our construction is that objects whose feature values fall on the extremes of that feature's probability mass distribution are more similar to each other than objects whose feature values lie closer to the mean. Semblance emphasizes features whose values lie far away from the mean of their probability distribution. The kernel relies on properties empirically determined from the data and does not assume an underlying distribution. The use of feature ranks on a probability space ensures that Semblance is computationally efficient, robust to outliers, and statistically stable, making it a widely applicable algorithm for pattern analysis. The output from the kernel is a square, symmetric matrix that gives proximity values between pairs of observations.
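To convey the flavour, here is a toy rank-based score in the same spirit, explicitly not the published Semblance formula: per feature, a pair scores highly only when both values sit close together in the same tail of the empirical distribution:

```r
## Toy tail-favouring rank similarity (not the Semblance kernel).
toy_kernel <- function(X) {
  n <- nrow(X); K <- matrix(0, n, n)
  for (g in seq_len(ncol(X))) {
    u <- rank(X[, g]) / (n + 1)     # empirical CDF positions
    f <- abs(u - 0.5)               # distance from the median
    # f_i + f_j - |u_i - u_j|: high when both points are extreme on
    # the same side and close to each other, zero across sides.
    K <- K + outer(f, f, "+") - abs(outer(u, u, "-"))
  }
  K / ncol(X)                       # square, symmetric similarity matrix
}

set.seed(5)
X <- matrix(rnorm(8 * 2), nrow = 8)
round(toy_kernel(X), 2)
```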
The goal of this package is to provide a user-friendly implementation of Gaussian graphical model-based heterogeneity analysis. Recently, several Gaussian graphical model-based heterogeneity analysis techniques have been developed. A common methodological limitation is that the number of subgroups is assumed to be known a priori, which is not realistic. In a recent study (Ren et al., 2022), a novel approach based on the penalized fusion technique was developed to determine the number and structure of subgroups in Gaussian graphical model-based heterogeneity analysis in a fully data-dependent manner. It opens the door to utilizing the Gaussian graphical model technique in more practical settings. Beyond Ren et al. (2022), more estimations and functions are added, so that the package is self-contained, more comprehensive, and can provide "more direct insights to practitioners" (with the visualization function). Reference: Ren, M., Zhang, S., Zhang, Q. and Ma, S. (2022). Gaussian Graphical Model-based Heterogeneity Analysis via Penalized Fusion. Biometrics, 78 (2), 524-535.
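As a building-block sketch only: a single subgroup's conditional dependence network can be estimated with the graphical lasso via the 'glasso' package, whereas this package goes further by fusing penalized estimates across an unknown number of subgroups:

```r
## Single-group graphical lasso as a building block (not the
## penalized-fusion estimator itself).
library(glasso)

set.seed(11)
X <- matrix(rnorm(100 * 6), ncol = 6)
S <- cov(X)

fit <- glasso(S, rho = 0.1)     # L1-penalized precision matrix
round(fit$wi, 2)                # sparse inverse covariance = network edges
```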