This package provides a set of tools to perform multiple versions of the Mobility Oriented-Parity metric. This multivariate analysis helps to characterize levels of dissimilarity between a set of conditions of reference and another set of conditions of interest. If predictive models are transferred to conditions different from those over which models were calibrated (trained), this metric helps to identify transfer conditions that differ substantially from those of calibration. These tools are implemented following principles proposed in Owens et al. (2013) <doi:10.1016/j.ecolmodel.2013.04.011>, and expanded to obtain more detailed results that aid in interpretation as in Cobos et al. (2024) <doi:10.21425/fob.17.132916>.
The main goal is to make descriptive evaluations easier to create bigger and more complex outputs in less time with less code. Introducing format containers with multilabels <https://documentation.sas.com/doc/en/pgmsascdc/v_067/proc/p06ciqes4eaqo6n0zyqtz9p21nfb.htm>, a more powerful summarise which is capable to output every possible combination of the provided grouping variables in one go <https://documentation.sas.com/doc/en/pgmsascdc/v_067/proc/p0jvbbqkt0gs2cn1lo4zndbqs1pe.htm>, tabulation functions which can create any table in different styles <https://documentation.sas.com/doc/en/pgmsascdc/v_067/proc/n1ql5xnu0k3kdtn11gwa5hc7u435.htm> and other more readable functions. The code is optimized to work fast even with datasets of over a million observations.
Using frequency matrices, very low frequency variants (VLFs) are assessed for amino acid and nucleotide sequences. The VLFs are then compared to see if they occur in only one member of a species, singleton VLFs, or if they occur in multiple members of a species, shared VLFs. The amino acid and nucleotide VLFs are then compared to see if they are concordant with one another. Amino acid VLFs are also assessed to determine if they lead to a change in amino acid residue type, and potential changes to protein structures. Based on Stoeckle and Kerr (2012) <doi:10.1371/journal.pone.0043992> and Phillips et al. (2023) <doi:10.3897/BDJ.11.e96480>.
This package provides a collection of functions for analyzing data typically collected or used by behavioral scientists. Examples of the functions include a function that compares groups in a factorial experimental design, a function that conducts two-way analysis of variance (ANOVA), and a function that cleans a data set generated by Qualtrics surveys. Some of the functions will require installing additional package(s). Such packages and other references are cited within the section describing the relevant functions. Many functions in this package rely heavily on these two popular R packages: Dowle et al. (2021) <https://CRAN.R-project.org/package=data.table>. Wickham et al. (2021) <https://CRAN.R-project.org/package=ggplot2>.
This package provides statistical procedures for linear regression in the general context where the errors are assumed to be correlated. Different ways to estimate the asymptotic covariance matrix of the least squares estimators are available. Starting from this estimation of the covariance matrix, the confidence intervals and the usual tests on the parameters are modified. The functions of this package are very similar to those of lm': it contains methods such as summary(), plot(), confint() and predict(). The slm package is described in the paper by E. Caron, J. Dedecker and B. Michel (2019), "Linear regression with stationary errors: the R package slm", arXiv preprint <arXiv:1906.06583>.
Electronic health records (EHR) linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. Towards that end, we developed an automated high-throughput phenotyping method integrating International Classification of Diseases (ICD) codes and narrative data extracted using natural language processing (NLP). Specifically, our proposed method, called MAP (Map Automated Phenotyping algorithm), fits an ensemble of latent mixture models on aggregated ICD and NLP counts along with healthcare utilization. The MAP algorithm yields a predicted probability of phenotype for each patient and a threshold for classifying subjects with phenotype yes/no (See Katherine P. Liao, et al. (2019) <doi:10.1093/jamia/ocz066>.).
This package provides a collection of self-labeled techniques for semi-supervised classification. In semi-supervised classification, both labeled and unlabeled data are used to train a classifier. This learning paradigm has obtained promising results, specifically in the presence of a reduced set of labeled examples. This package implements a collection of self-labeled techniques to construct a classification model. This family of techniques enlarges the original labeled set using the most confident predictions to classify unlabeled data. The techniques implemented can be applied to classification problems in several domains by the specification of a supervised base classifier. At low ratios of labeled data, it can be shown to perform better than classical supervised classifiers.
K Quantiles Medoids (KQM) clustering applies quantiles to divide data of each dimension into K mean intervals. Combining quantiles of all the dimensions of the data and fully permuting quantiles on each dimension is the strategy to determine a pool of candidate initial cluster centers. To find the best initial cluster centers from the pool of candidate initial cluster centers, two methods based on quantile strategy and PAM strategy respectively are proposed. During a clustering process, medoids of clusters are used to update cluster centers in each iteration. Comparison between KQM and the method of randomly selecting initial cluster centers shows that KQM is almost always getting clustering results with smaller total sum squares of distances.
This package provides functions for optimal policy learning in socioeconomic applications helping users to learn the most effective policies based on data in order to maximize empirical welfare. Specifically, OPL allows to find "treatment assignment rules" that maximize the overall welfare, defined as the sum of the policy effects estimated over all the policy beneficiaries. Documentation about OPL is provided by several international articles via Athey et al (2021, <doi:10.3982/ECTA15732>), Kitagawa et al (2018, <doi:10.3982/ECTA13288>), Cerulli (2022, <doi:10.1080/13504851.2022.2032577>), the paper by Cerulli (2021, <doi:10.1080/13504851.2020.1820939>) and the book by Gareth et al (2013, <doi:10.1007/978-1-4614-7138-7>).
The package implements a method for normalising microarray intensities, and works for single- and multiple-color arrays. It can also be used for data from other technologies, as long as they have similar format. The method uses a robust variant of the maximum-likelihood estimator for an additive-multiplicative error model and affine calibration. The model incorporates data calibration step (a.k.a. normalization), a model for the dependence of the variance on the mean intensity and a variance stabilizing data transformation. Differences between transformed intensities are analogous to "normalized log-ratios". However, in contrast to the latter, their variance is independent of the mean, and they are usually more sensitive and specific in detecting differential transcription.
Regression methods to quantify the relation between two measurement methods are provided by this package. In particular it addresses regression problems with errors in both variables and without repeated measurements. It implements the CLSI recommendations (see J. A. Budd et al. (2018, https://clsi.org/standards/products/method-evaluation/documents/ep09/) for analytical method comparison and bias estimation using patient samples. Furthermore, algorithms for Theil-Sen and equivariant Passing-Bablok estimators are implemented, see F. Dufey (2020, <doi:10.1515/ijb-2019-0157>) and J. Raymaekers and F. Dufey (2022, <arXiv:2202:08060>). A comprehensive overview over the implemented methods and references can be found in the manual pages mcr-package and mcreg.
This package provides API access to data from the U.S. Energy Information Administration ('EIA') <https://www.eia.gov/>. Use of the EIA's API and this package requires a free API key obtainable at <https://www.eia.gov/opendata/register.php>. This package includes functions for searching the EIA data directory and returning time series and geoset time series datasets. Datasets returned by these functions are provided by default in a tidy format, or alternatively, in more raw formats. It also offers helper functions for working with EIA date strings and time formats and for inspecting different summaries of series metadata. The package also provides control over API key storage and caching of API request results.
Offers the Generalized Berk-Jones (GBJ) test for set-based inference in genetic association studies. The GBJ is designed as an alternative to tests such as Berk-Jones (BJ), Higher Criticism (HC), Generalized Higher Criticism (GHC), Minimum p-value (minP), and Sequence Kernel Association Test (SKAT). All of these other methods (except for SKAT) are also implemented in this package, and we additionally provide an omnibus test (OMNI) which integrates information from each of the tests. The GBJ has been shown to outperform other tests in genetic association studies when signals are correlated and moderately sparse. Please see the vignette for a quickstart guide or Sun and Lin (2017) <arXiv:1710.02469> for more details.
The Integro-Difference Equation model is a linear, dynamical model used to model phenomena that evolve in space and in time; see, for example, Cressie and Wikle (2011, ISBN:978-0-471-69274-4) or Dewar et al. (2009) <doi:10.1109/TSP.2008.2005091>. At the heart of the model is the kernel, which dictates how the process evolves from one time point to the next. Both process and parameter reduction are used to facilitate computation, and spatially-varying kernels are allowed. Data used to estimate the parameters are assumed to be readings of the process corrupted by Gaussian measurement error. Parameters are fitted by maximum likelihood, and estimation is carried out using an evolution algorithm.
Given constraints for right censored data, we use a recursive computational algorithm to calculate the the "constrained" Kaplan-Meier estimator. The constraint is assumed given in linear estimating equations or mean functions. We also illustrate how this leads to the empirical likelihood ratio test with right censored data and accelerated failure time model with given coefficients. EM algorithm from emplik package is used to get the initial value. The properties and performance of the EM algorithm is discussed in Mai Zhou and Yifan Yang (2015)<doi: 10.1007/s00180-015-0567-9> and Mai Zhou and Yifan Yang (2017) <doi: 10.1002/wics.1400>. More applications could be found in Mai Zhou (2015) <doi: 10.1201/b18598>.
Time-Temperature Superposition analysis is often applied to frequency modulated data obtained by Dynamic Mechanic Analysis (DMA) and Rheometry in the analytical chemistry and physics areas. These techniques provide estimates of material mechanical properties (such as moduli) at different temperatures in a wider range of time. This package provides the Time-Temperature superposition Master Curve at a referred temperature by the three methods: the two wider used methods, Arrhenius based methods and WLF, and the newer methodology based on derivatives procedure. The Master Curve is smoothed by B-splines basis. The package output is composed of plots of experimental data, horizontal and vertical shifts, TTS data, and TTS data fitted using B-splines with bootstrap confidence intervals.
Process and analyze electronic health record (EHR) data. The EHR package provides modules to perform diverse medication-related studies using data from EHR databases. Especially, the package includes modules to perform pharmacokinetic/pharmacodynamic (PK/PD) analyses using EHRs, as outlined in Choi, Beck, McNeer, Weeks, Williams, James, Niu, Abou-Khalil, Birdwell, Roden, Stein, Bejan, Denny, and Van Driest (2020) <doi:10.1002/cpt.1787>. Additional modules will be added in future. In addition, this package provides various functions useful to perform Phenome Wide Association Study (PheWAS) to explore associations between drug exposure and phenotypes obtained from EHR data, as outlined in Choi, Carroll, Beck, Mosley, Roden, Denny, and Van Driest (2018) <doi:10.1093/bioinformatics/bty306>.
This package provides simple statistics from instruments and observations at sites in the NEON network, and acts as a simple interface for v0 of the National Ecological Observatory Network (NEON) API. Statistics are generated for meteorologic and soil-based observations, and are presented for daily, annual, and one-time observations at all available NEON sites. Users can also retrieve any dataset publicly hosted by NEON. Metadata for NEON sites and data products can be returned, as well as information on data product availability by site and date. For more information on NEON, please visit <https://www.neonscience.org>. For detailed data product information, please see the NEON data product catalog at <https://data.neonscience.org/data-product-catalog>.
This package provides a method to identify differential expression genes in the same or different species. Given that non-DE genes have some similarities in features, a scaling-free minimum enclosing ball (SFMEB) model is built to cover those non-DE genes in feature space, then those DE genes, which are enormously different from non-DE genes, being regarded as outliers and rejected outside the ball. The method on this package is described in the article A minimum enclosing ball method to detect differential expression genes for RNA-seq data'. The SFMEB method is extended to the scMEB method that considering two or more potential types of cells or unknown labels scRNA-seq dataset DEGs identification.
LEA is an R package dedicated to population genomics, landscape genomics and genotype-environment association tests. LEA can run analyses of population structure and genome-wide tests for local adaptation, and also performs imputation of missing genotypes. The package includes statistical methods for estimating ancestry coefficients from large genotypic matrices and for evaluating the number of ancestral populations (snmf). It performs statistical tests using latent factor mixed models for identifying genetic polymorphisms that exhibit association with environmental gradients or phenotypic traits (lfmm2). In addition, LEA computes values of genetic offset statistics based on new or predicted environments (genetic.gap, genetic.offset). LEA is mainly based on optimized programs that can scale with the dimensions of large data sets.
Regression splines that handle a mix of continuous and categorical (discrete) data often encountered in applied settings. I would like to gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC, <https://www.nserc-crsng.gc.ca>), the Social Sciences and Humanities Research Council of Canada (SSHRC, <https://www.sshrc-crsh.gc.ca>), and the Shared Hierarchical Academic Research Computing Network (SHARCNET, <https://www.sharcnet.ca>). We would also like to acknowledge the contributions of the GNU GSL authors. In particular, we adapt the GNU GSL B-spline routine gsl_bspline.c adding automated support for quantile knots (in addition to uniform knots), providing missing functionality for derivatives, and for extending the splines beyond their endpoints.
This package provides several methods for computing the Quantile Treatment Effect (QTE) and Quantile Treatment Effect on the Treated (QTT). The main cases covered are (i) Treatment is randomly assigned, (ii) Treatment is as good as randomly assigned after conditioning on some covariates (also called conditional independence or selection on observables) using the methods developed in Firpo (2007) <doi:10.1111/j.1468-0262.2007.00738.x>, (iii) Identification is based on a Difference in Differences assumption (several varieties are available in the package e.g. Athey and Imbens (2006) <doi:10.1111/j.1468-0262.2006.00668.x> Callaway and Li (2019) <doi:10.3982/QE935>, Callaway, Li, and Oka (2018) <doi:10.1016/j.jeconom.2018.06.008>).
This package provides interpretability methods to analyze the behavior and predictions of any machine learning model. Implemented methods are:
Feature importance described by Fisher et al. (2018),
accumulated local effects plots described by Apley (2018),
partial dependence plots described by Friedman (2001),
individual conditional expectation ('ice') plots described by Goldstein et al. (2013) https://doi.org/10.1080/10618600.2014.907095,
local models (variant of 'lime') described by Ribeiro et. al (2016),
the Shapley Value described by Strumbelj et. al (2014) https://doi.org/10.1007/s10115-013-0679-x,
feature interactions described by Friedman et. al https://doi.org/10.1214/07-AOAS148 and tree surrogate models.
The wavelet-based quantile mapping (WQM) technique is designed to correct biases in spatio-temporal precipitation forecasts across multiple time scales. The WQM method effectively enhances forecast accuracy by generating an ensemble of precipitation forecasts that account for uncertainties in the prediction process. For a comprehensive overview of the methodologies employed in this package, please refer to Jiang, Z., and Johnson, F. (2023) <doi:10.1029/2022EF003350>. The package relies on two packages for continuous wavelet transforms: WaveletComp', which can be installed automatically, and wmtsa', which is optional and available from the CRAN archive <https://cran.r-project.org/src/contrib/Archive/wmtsa/>. Users need to manually install wmtsa from this archive if they prefer to use wmtsa based decomposition.