This package provides an R interface for SSW (Striped Smith-Waterman) via its Python binding ssw-py. SSW is a fast C and C++ implementation of the Smith-Waterman algorithm for pairwise sequence alignment using Single-Instruction-Multiple-Data (SIMD) instructions. SSW enhances the standard algorithm by efficiently returning alignment information and suboptimal alignment scores. The core SSW library offers performance improvements for various bioinformatics tasks, including protein database searches, short-read alignments, primary and split-read mapping, structural variant detection, and read-overlap graph generation. These features make SSW particularly useful for genomic applications. Zhao et al. (2013) <doi:10.1371/journal.pone.0082138> developed the original C and C++ implementation.
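For readers unfamiliar with the underlying algorithm, the following is a minimal sketch of the plain Smith-Waterman scoring recurrence in Python. It is not the SSW or ssw-py code: it uses no SIMD striping, assumes a linear gap penalty rather than affine gaps, and the function name and scoring values are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty).

    Hypothetical helper for illustration only; scoring values are assumptions.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap         # gap in b
            left = curr[j - 1] + gap   # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

The sketch keeps only two rows of the matrix, so it returns the score but not the alignment itself; recovering the alignment would require storing traceback information.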
This is a new version of the userfriendlyscience package, which has grown a bit unwieldy. Therefore, distinct functionalities are being consciously uncoupled into different packages. This package contains the general-purpose tools and utilities (see the behaviorchange package, the rosetta package, and the soon-to-be-released scd package for other functionality), and is the most direct successor of the original userfriendlyscience package. For example, this package contains a number of basic functions to create higher level plots, such as diamond plots, to easily plot sampling distributions, to generate confidence intervals, to plan study sample sizes for confidence intervals, and to do some basic operations such as (dis)attenuate effect size estimates.
Density, distribution, quantile function, random number generation for the BMT (Bezier-Montenegro-Torres) distribution. Torres-Jimenez C.J. and Montenegro-Diaz A.M. (2017) <doi:10.48550/arXiv.1709.05534>. Moments, descriptive measures and parameter conversion for different parameterizations of the BMT distribution. Fit of the BMT distribution to non-censored data by maximum likelihood, moment matching, quantile matching, maximum goodness-of-fit, also known as minimum distance, maximum product of spacing, also called maximum spacing, and minimum quantile distance, which can also be called maximum quantile goodness-of-fit. Fit of univariate distributions for non-censored data using maximum product of spacing estimation and minimum quantile distance estimation is also included.
This package provides tools for working with a new versatile discrete distribution, the db ("discretised Beta") distribution. This package provides density (probability), distribution, inverse distribution (quantile) and random data generation functions for the db family. It provides functions for convenient maximum likelihood estimation of parameters, and a variety of useful plotting functions. It provides goodness of fit tests and functions to calculate the Fisher information, different estimates of the hessian of the log likelihood and Monte Carlo estimation of the covariance matrix of the maximum likelihood parameter estimates. In addition, it provides analogous tools for working with the beta-binomial distribution, which has been proposed as a competitor to the db distribution.
This package provides a set of tools to perform multiple versions of the Mobility-Oriented Parity metric. This multivariate analysis helps to characterize levels of dissimilarity between a set of conditions of reference and another set of conditions of interest. If predictive models are transferred to conditions different from those over which models were calibrated (trained), this metric helps to identify transfer conditions that differ substantially from those of calibration. These tools are implemented following principles proposed in Owens et al. (2013) <doi:10.1016/j.ecolmodel.2013.04.011>, and expanded to obtain more detailed results that aid in interpretation as in Cobos et al. (2024) <doi:10.21425/fob.17.132916>.
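One common way to express this kind of dissimilarity is the mean Euclidean distance from a condition of interest to its nearest fraction of the reference conditions. The Python sketch below illustrates that idea only; the function name, the default fraction, and the absence of variable scaling are assumptions and do not reproduce the package's implementation.

```python
import math


def mop_distance(reference, point, fraction=0.1):
    """Mean distance from `point` to the nearest `fraction` of `reference`.

    Hypothetical helper: `reference` is a list of coordinate tuples for the
    calibration conditions, `point` is one condition of interest.
    """
    dists = sorted(math.dist(point, r) for r in reference)
    k = max(1, math.ceil(fraction * len(dists)))  # use at least one neighbour
    return sum(dists[:k]) / k


if __name__ == "__main__":
    calibration = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    print(mop_distance(calibration, (0.0, 0.0), fraction=0.5))
    print(mop_distance(calibration, (20.0, 20.0), fraction=0.5))
```

Larger values indicate transfer conditions that lie further from the calibration conditions. In practice the variables would usually be put on a comparable scale before distances are computed.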
The main goal is to make descriptive evaluations easier, so that bigger and more complex outputs can be created in less time and with less code. The package introduces format containers with multilabels <https://documentation.sas.com/doc/en/pgmsascdc/v_067/proc/p06ciqes4eaqo6n0zyqtz9p21nfb.htm>, a more powerful summarise that can output every possible combination of the provided grouping variables in one go <https://documentation.sas.com/doc/en/pgmsascdc/v_067/proc/p0jvbbqkt0gs2cn1lo4zndbqs1pe.htm>, tabulation functions that can create any table in different styles <https://documentation.sas.com/doc/en/pgmsascdc/v_067/proc/n1ql5xnu0k3kdtn11gwa5hc7u435.htm>, and other more readable functions. The code is optimized to work fast even with datasets of over a million observations.
Using frequency matrices, very low frequency variants (VLFs) are assessed for amino acid and nucleotide sequences. The VLFs are then compared to see if they occur in only one member of a species, singleton VLFs, or if they occur in multiple members of a species, shared VLFs. The amino acid and nucleotide VLFs are then compared to see if they are concordant with one another. Amino acid VLFs are also assessed to determine if they lead to a change in amino acid residue type, and potential changes to protein structures. Based on Stoeckle and Kerr (2012) <doi:10.1371/journal.pone.0043992> and Phillips et al. (2023) <doi:10.3897/BDJ.11.e96480>.
This package provides a general framework for constructing variable importance plots from various types of machine learning models in R. Aside from some standard model-specific variable importance measures, this package also provides model-agnostic approaches that can be applied to any supervised learning algorithm. These include 1) an efficient permutation-based variable importance measure, 2) variable importance based on Shapley values (Strumbelj and Kononenko, 2014) <doi:10.1007/s10115-013-0679-x>, and 3) the variance-based approach described in Greenwell et al. (2018) <doi:10.48550/arXiv.1805.04755>. A variance-based method for quantifying the relative strength of interaction effects is also included (see the previous reference for details).
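The permutation-based idea can be illustrated with a short Python sketch: shuffle one feature, re-score the model, and record how much the error grows. This is a generic illustration, not the package's code; the function name, the use of mean squared error, and the list-of-lists data layout are assumptions.

```python
import random


def permutation_importance(predict, X, y, column, n_repeats=10, seed=0):
    """Mean increase in mean squared error after shuffling one column.

    Hypothetical helper: `predict` maps one row (a list) to a number,
    `X` is a list of rows, `y` the list of observed responses.
    """
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    baseline = mse(X)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)  # break the link between this feature and y
        permuted = [row[:column] + [v] + row[column + 1:]
                    for row, v in zip(X, shuffled)]
        increases.append(mse(permuted) - baseline)
    return sum(increases) / n_repeats


if __name__ == "__main__":
    X = [[float(i), float((i * 7) % 5)] for i in range(20)]
    y = [3.0 * row[0] for row in X]
    model = lambda row: 3.0 * row[0]  # uses only the first feature
    print(permutation_importance(model, X, y, column=0))
    print(permutation_importance(model, X, y, column=1))
```

Because the approach only needs a prediction function, it applies to any supervised learner; a feature the model ignores receives an importance of zero.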
This package provides a collection of functions for analyzing data typically collected or used by behavioral scientists. Examples of the functions include a function that compares groups in a factorial experimental design, a function that conducts two-way analysis of variance (ANOVA), and a function that cleans a data set generated by Qualtrics surveys. Some of the functions will require installing additional package(s). Such packages and other references are cited within the section describing the relevant functions. Many functions in this package rely heavily on these two popular R packages: Dowle et al. (2021) <https://CRAN.R-project.org/package=data.table>. Wickham et al. (2021) <https://CRAN.R-project.org/package=ggplot2>.
This package provides statistical procedures for linear regression in the general context where the errors are assumed to be correlated. Different ways to estimate the asymptotic covariance matrix of the least squares estimators are available. Starting from this estimation of the covariance matrix, the confidence intervals and the usual tests on the parameters are modified. The functions of this package are very similar to those of lm(): the package contains methods such as summary(), plot(), confint() and predict(). The slm package is described in the paper by E. Caron, J. Dedecker and B. Michel (2019), "Linear regression with stationary errors: the R package slm", arXiv preprint <arXiv:1906.06583>.
Electronic health records (EHR) linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. Towards that end, we developed an automated high-throughput phenotyping method integrating International Classification of Diseases (ICD) codes and narrative data extracted using natural language processing (NLP). Specifically, our proposed method, called MAP (Map Automated Phenotyping algorithm), fits an ensemble of latent mixture models on aggregated ICD and NLP counts along with healthcare utilization. The MAP algorithm yields a predicted probability of phenotype for each patient and a threshold for classifying subjects with phenotype yes/no (see Katherine P. Liao et al. (2019) <doi:10.1093/jamia/ocz066>).
This package provides a collection of self-labeled techniques for semi-supervised classification. In semi-supervised classification, both labeled and unlabeled data are used to train a classifier. This learning paradigm has obtained promising results, specifically in the presence of a reduced set of labeled examples. This package implements a collection of self-labeled techniques to construct a classification model. This family of techniques enlarges the original labeled set using the most confident predictions to classify unlabeled data. The techniques implemented can be applied to classification problems in several domains by the specification of a supervised base classifier. At low ratios of labeled data, it can be shown to perform better than classical supervised classifiers.
The package implements a method for normalising microarray intensities, and works for single- and multiple-color arrays. It can also be used for data from other technologies, as long as they have a similar format. The method uses a robust variant of the maximum-likelihood estimator for an additive-multiplicative error model and affine calibration. The model incorporates a data calibration step (a.k.a. normalization), a model for the dependence of the variance on the mean intensity and a variance stabilizing data transformation. Differences between transformed intensities are analogous to "normalized log-ratios". However, in contrast to the latter, their variance is independent of the mean, and they are usually more sensitive and specific in detecting differential transcription.
Regression methods to quantify the relation between two measurement methods are provided by this package. In particular, it addresses regression problems with errors in both variables and without repeated measurements. It implements the CLSI recommendations (see J. A. Budd et al. (2018, <https://clsi.org/standards/products/method-evaluation/documents/ep09/>)) for analytical method comparison and bias estimation using patient samples. Furthermore, algorithms for Theil-Sen and equivariant Passing-Bablok estimators are implemented, see F. Dufey (2020, <doi:10.1515/ijb-2019-0157>) and J. Raymaekers and F. Dufey (2022, <arXiv:2202.08060>). A comprehensive overview of the implemented methods and references can be found in the manual pages mcr-package and mcreg.
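As a rough illustration of the Theil-Sen idea mentioned above, the following Python sketch computes the classical estimator: the slope is the median of all pairwise slopes, and the intercept is the median residual offset. It is a textbook version with assumed function names, not the algorithm implemented in the package, and it does not cover the equivariant Passing-Bablok variant.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Classical Theil-Sen line: (slope, intercept).

    Hypothetical helper; pairs with identical x values are skipped.
    """
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


if __name__ == "__main__":
    # The last observation is a gross outlier; the fit still follows y = 2x.
    print(theil_sen([1, 2, 3, 4, 5], [2, 4, 6, 8, 100]))
```

Using medians makes the fit robust to a moderate share of outlying measurements, which is one reason such estimators are attractive for method comparison data.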
K Quantiles Medoids (KQM) clustering applies quantiles to divide data of each dimension into K mean intervals. A pool of candidate initial cluster centers is determined by combining the quantiles of all dimensions of the data and fully permuting the quantiles on each dimension. To find the best initial cluster centers from this pool, two methods are proposed, based on a quantile strategy and a PAM strategy respectively. During the clustering process, medoids of clusters are used to update cluster centers in each iteration. Comparison between KQM and the method of randomly selecting initial cluster centers shows that KQM almost always yields clustering results with a smaller total sum of squared distances.
This package provides functions for optimal policy learning in socioeconomic applications, helping users to learn the most effective policies based on data in order to maximize empirical welfare. Specifically, OPL allows users to find "treatment assignment rules" that maximize the overall welfare, defined as the sum of the policy effects estimated over all the policy beneficiaries. Documentation about OPL is provided in several international articles: Athey et al (2021, <doi:10.3982/ECTA15732>), Kitagawa et al (2018, <doi:10.3982/ECTA13288>), Cerulli (2022, <doi:10.1080/13504851.2022.2032577>), the paper by Cerulli (2021, <doi:10.1080/13504851.2020.1820939>) and the book by Gareth et al (2013, <doi:10.1007/978-1-4614-7138-7>).
This package provides API access to data from the U.S. Energy Information Administration ('EIA') <https://www.eia.gov/>. Use of the EIA's API and this package requires a free API key obtainable at <https://www.eia.gov/opendata/register.php>. This package includes functions for searching the EIA data directory and returning time series and geoset time series datasets. Datasets returned by these functions are provided by default in a tidy format, or alternatively, in more raw formats. It also offers helper functions for working with EIA date strings and time formats and for inspecting different summaries of series metadata. The package also provides control over API key storage and caching of API request results.
Offers the Generalized Berk-Jones (GBJ) test for set-based inference in genetic association studies. The GBJ is designed as an alternative to tests such as Berk-Jones (BJ), Higher Criticism (HC), Generalized Higher Criticism (GHC), Minimum p-value (minP), and Sequence Kernel Association Test (SKAT). All of these other methods (except for SKAT) are also implemented in this package, and we additionally provide an omnibus test (OMNI) which integrates information from each of the tests. The GBJ has been shown to outperform other tests in genetic association studies when signals are correlated and moderately sparse. Please see the vignette for a quickstart guide or Sun and Lin (2017) <arXiv:1710.02469> for more details.
The Integro-Difference Equation model is a linear, dynamical model used to model phenomena that evolve in space and in time; see, for example, Cressie and Wikle (2011, ISBN:978-0-471-69274-4) or Dewar et al. (2009) <doi:10.1109/TSP.2008.2005091>. At the heart of the model is the kernel, which dictates how the process evolves from one time point to the next. Both process and parameter reduction are used to facilitate computation, and spatially-varying kernels are allowed. Data used to estimate the parameters are assumed to be readings of the process corrupted by Gaussian measurement error. Parameters are fitted by maximum likelihood, and estimation is carried out using an evolution algorithm.
Given constraints for right censored data, we use a recursive computational algorithm to calculate the "constrained" Kaplan-Meier estimator. The constraint is assumed given in linear estimating equations or mean functions. We also illustrate how this leads to the empirical likelihood ratio test with right censored data and accelerated failure time model with given coefficients. The EM algorithm from the emplik package is used to get the initial value. The properties and performance of the EM algorithm are discussed in Mai Zhou and Yifan Yang (2015) <doi:10.1007/s00180-015-0567-9> and Mai Zhou and Yifan Yang (2017) <doi:10.1002/wics.1400>. More applications can be found in Mai Zhou (2015) <doi:10.1201/b18598>.
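For orientation, the ordinary (unconstrained) Kaplan-Meier estimator that the constrained version builds on can be sketched in a few lines of Python. This is only the standard product-limit formula with assumed names; it does not implement the package's recursive constrained algorithm or the empirical likelihood ratio test.

```python
def kaplan_meier(times, events):
    """Unconstrained Kaplan-Meier curve as a list of (time, survival) pairs.

    Hypothetical helper: events[i] is 1 for an observed event and 0 for a
    right censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk  # product-limit step
            curve.append((t, survival))
        at_risk -= removed  # events and censored cases both leave the risk set
    return curve


if __name__ == "__main__":
    print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Censored observations do not create a step in the curve, but they do shrink the risk set used at later event times.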
Time-Temperature Superposition (TTS) analysis is often applied to frequency modulated data obtained by Dynamic Mechanical Analysis (DMA) and Rheometry in the analytical chemistry and physics areas. These techniques provide estimates of material mechanical properties (such as moduli) at different temperatures over a wider range of time. This package provides the Time-Temperature superposition Master Curve at a reference temperature using three methods: the two more widely used ones, the Arrhenius-based method and WLF, and a newer methodology based on a derivatives procedure. The Master Curve is smoothed using a B-spline basis. The package output is composed of plots of experimental data, horizontal and vertical shifts, TTS data, and TTS data fitted using B-splines with bootstrap confidence intervals.
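The two classical horizontal shift factors can be written down directly. The Python sketch below evaluates the Arrhenius and Williams-Landel-Ferry (WLF) expressions for the base-10 logarithm of the shift factor; the function names are hypothetical, the WLF defaults are the commonly quoted "universal" constants, and nothing here reproduces the package's fitting or its derivative-based method.

```python
import math

R_GAS = 8.314462618  # molar gas constant in J/(mol K)


def log10_shift_arrhenius(temp_k, ref_k, activation_energy):
    """log10(a_T) from the Arrhenius relation.

    Temperatures in kelvin, activation energy in J/mol (assumed inputs).
    """
    return activation_energy / (math.log(10) * R_GAS) * (1.0 / temp_k - 1.0 / ref_k)


def log10_shift_wlf(temp, ref, c1=17.44, c2=51.6):
    """log10(a_T) from the WLF equation.

    c1 and c2 default to the often cited "universal" constants, which are
    an assumption here; real materials need fitted values.
    """
    return -c1 * (temp - ref) / (c2 + (temp - ref))


if __name__ == "__main__":
    print(log10_shift_arrhenius(350.0, 300.0, 100e3))
    print(log10_shift_wlf(350.0, 300.0))
```

Both expressions are zero at the reference temperature and negative above it, which corresponds to shifting higher-temperature data toward shorter times on the master curve.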
This package provides a method to identify differentially expressed (DE) genes in the same or different species. Given that non-DE genes share some similarities in their features, a scaling-free minimum enclosing ball (SFMEB) model is built to cover the non-DE genes in feature space; DE genes, which differ enormously from non-DE genes, are then regarded as outliers and fall outside the ball. The method in this package is described in the article "A minimum enclosing ball method to detect differential expression genes for RNA-seq data". The SFMEB method is extended to the scMEB method, which identifies DE genes in scRNA-seq datasets with two or more potential cell types or with unknown labels.
Process and analyze electronic health record (EHR) data. The EHR package provides modules to perform diverse medication-related studies using data from EHR databases. In particular, the package includes modules to perform pharmacokinetic/pharmacodynamic (PK/PD) analyses using EHRs, as outlined in Choi, Beck, McNeer, Weeks, Williams, James, Niu, Abou-Khalil, Birdwell, Roden, Stein, Bejan, Denny, and Van Driest (2020) <doi:10.1002/cpt.1787>. Additional modules will be added in the future. In addition, this package provides various functions useful for performing a Phenome Wide Association Study (PheWAS) to explore associations between drug exposure and phenotypes obtained from EHR data, as outlined in Choi, Carroll, Beck, Mosley, Roden, Denny, and Van Driest (2018) <doi:10.1093/bioinformatics/bty306>.
This package provides simple statistics from instruments and observations at sites in the NEON network, and acts as a simple interface for v0 of the National Ecological Observatory Network (NEON) API. Statistics are generated for meteorologic and soil-based observations, and are presented for daily, annual, and one-time observations at all available NEON sites. Users can also retrieve any dataset publicly hosted by NEON. Metadata for NEON sites and data products can be returned, as well as information on data product availability by site and date. For more information on NEON, please visit <https://www.neonscience.org>. For detailed data product information, please see the NEON data product catalog at <https://data.neonscience.org/data-product-catalog>.