It can be used to create/encode molecular "license-plates" from sequences and also to decode the "license-plates" back to sequences. While initially created for transfer RNA-derived small fragments (tRFs), this tool can be used for any genomic sequences including but not limited to tRFs, microRNAs, etc. For details, see Pliatsika V, Loher P, Telonis AG, Rigoutsos I (2016) <doi:10.1093/bioinformatics/btw194>. It can also be used to annotate tRFs. For details, see Loher P, Telonis AG, Rigoutsos I (2017) <doi:10.1038/srep41184>.
An open source software package written in the R statistical language. It consists of a set of decision-making tools for conducting missing person searches. In particular, it allows computing the optimal likelihood ratio (LR) threshold for declaring potential matches in DNA-based database searches. More recently, mispitools has incorporated LRs based on preliminary investigation data. The statistical weight of different traces of evidence, such as biological sex, age and hair color, is presented. To cite mispitools, please use the following references: Marsico and Caridi, 2023 <doi:10.1016/j.fsigen.2023.102891> and Marsico, Vigeland et al. 2021 <doi:10.1016/j.fsigen.2021.102519>.
Implement and enhance the performance of spatial fuzzy clustering using Fuzzy Geographically Weighted Clustering with various optimization algorithms, mainly from the book Nature-Inspired Optimization Algorithms by Xin She Yang (2014) <ISBN:9780124167438>. The optimization algorithms help to address the clustering inconsistency of the traditional approach. A choice of distance measures is also provided in order to improve the quality of clustering results. Fuzzy Geographically Weighted Clustering with a nature-inspired optimization algorithm was first developed by Arie Wahyu Wijayanto and Ayu Purwarianti (2014) <doi:10.1109/CITSM.2014.7042178> using the Artificial Bee Colony algorithm.
This package implements multi-study learning algorithms such as merging, the study-specific ensemble (trained-on-observed-studies ensemble), the study strap, the covariate-matched study strap, covariate-profile similarity weighting, and stacking weights. Embedded within the caret framework, this package allows for a wide range of single-study learners (e.g., neural networks, lasso, random forests). The package offers over 20 default similarity measures and allows for specification of custom similarity measures for covariate-profile similarity weighting and an accept/reject step. This implements methods described in Loewinger, Kishida, Patil, and Parmigiani (2019) <doi:10.1101/856385>.
Reconstruct phylogenetic trees from discrete data. Inapplicable character states are handled using the algorithm of Brazeau, Guillerme and Smith (2019) <doi:10.1093/sysbio/syy083> with the "Morphy" library, under equal or implied step weights. Contains a "shiny" user interface for interactive tree search and exploration of results, including character visualization, rogue taxon detection, tree space mapping, and cluster consensus trees (Smith 2022a, b) <doi:10.1093/sysbio/syab099>, <doi:10.1093/sysbio/syab100>. Profile Parsimony (Faith and Trueman, 2001) <doi:10.1080/10635150118627>, Successive Approximations (Farris, 1969) <doi:10.2307/2412182> and custom optimality criteria are implemented.
This package implements functions to retrieve the nearest genes around the peak, annotate the genomic region of the peak, estimate the statistical significance of overlap among ChIP peak data sets, and incorporate the GEO database so that users can compare their own dataset with those deposited in the database. The comparison can be used to infer cooperative regulation and thus can be used to generate hypotheses. Several visualization functions are implemented to summarize the coverage of the peak experiment, average profile and heatmap of peaks binding to TSS regions, genomic annotation, distance to TSS, and overlap of peaks or genes.
Utilities for working with hourly air quality monitoring data with a focus on small particulates (PM2.5). A compact data model is structured as a list with two dataframes. A meta dataframe contains spatial and measuring device metadata associated with deployments at known locations. A data dataframe contains a datetime column followed by columns of measurements associated with each "device-deployment". Algorithms to calculate NowCast and the associated Air Quality Index (AQI) are defined at the US Environmental Protection Agency AirNow program: <https://document.airnow.gov/technical-assistance-document-for-the-reporting-of-daily-air-quailty.pdf>.
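As a rough illustration of the NowCast idea mentioned above, the following Python sketch follows the commonly described PM procedure: weight the most recent 12 hourly values by a factor derived from the ratio of the minimum to the maximum concentration, floored at 0.5. The function name `nowcast_pm`, the list-based input and the handling of missing values are assumptions made for this example; they are not the package's interface, and the AirNow document remains the authoritative definition.

```python
def nowcast_pm(hourly, min_weight=0.5):
    """Sketch of a NowCast-style weighted average for PM.

    hourly: up to 12 concentrations, most recent first; None marks a
    missing hour. Returns None when fewer than two of the three most
    recent hours are valid (assumed validity rule).
    """
    window = hourly[:12]
    if sum(v is not None for v in window[:3]) < 2:
        return None

    valid = [v for v in window if v is not None]
    c_max, c_min = max(valid), min(valid)
    if c_max == 0:
        return 0.0

    # Weight factor: stable air (min close to max) gives w near 1,
    # rapidly changing air is floored at min_weight.
    w = max(c_min / c_max, min_weight)

    numerator = 0.0
    denominator = 0.0
    for age, value in enumerate(window):
        if value is None:
            continue  # missing hours contribute nothing
        numerator += (w ** age) * value
        denominator += w ** age
    return numerator / denominator
```

With a constant series the weight factor is 1 and the result equals the constant; with rapidly changing values the most recent hours dominate.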
A dual wavelet-based Nonlinear Autoregressive Distributed Lag (NARDL) model has been developed for noisy time series analysis. This package is designed to capture both short-run and long-run relationships in time series data, while incorporating wavelet transformations. The methodology combines the NARDL model with wavelet decomposition to better capture the nonlinear dynamics of the series and exogenous variables. The package is useful for analyzing economic and financial time series data that exhibit both long-term trends and short-term fluctuations. This package has been developed using the algorithm of Jammazi et al. <doi:10.1016/j.intfin.2014.11.011>.
The kernel of this Rcpp based package is an efficient implementation of the generalized gradient projection method for spline function based constrained maximum likelihood estimator for interval censored survival data (Wu, Yuan; Zhang, Ying. Partially monotone tensor spline estimation of the joint distribution function with bivariate current status data. Ann. Statist. 40, 2012, 1609-1636 <doi:10.1214/12-AOS1016>). The key function computes the density function of the joint distribution of event time and the marker and returns the receiver operating characteristic (ROC) curve for the interval censored survival data as well as area under the curve (AUC).
The function converts KML (Keyhole Markup Language) files to Shapefiles while preserving attribute values. It provides a straightforward interface for users to import KML data, extract relevant attributes, and export them into the widely compatible Shapefile format. The package ensures accurate representation of spatial data while maintaining the integrity of associated attribute information. For details, see Flores, G. (2021) <DOI:10.1007/978-3-030-63665-4_15>. Whether for spatial analysis, visualization, or data interoperability, it simplifies the conversion process for working with geospatial datasets.
Convenient wrapper functions for the analysis of matrix-assisted laser desorption/ionization-time-of-flight (MALDI-TOF) spectra data in order to select only representative spectra (also called cherry-pick). The package covers the preprocessing and dereplication steps (based on Strejcek, Smrhova, Junkova and Uhlik (2018) <doi:10.3389/fmicb.2018.01294>) needed to cluster MALDI-TOF spectra before the final cherry-picking step. It enables the easy exclusion of spectra and/or clusters to accommodate complex cherry-picking strategies. Alternatively, cherry-picking using taxonomic identification MALDI-TOF data is made easy with functions to import inconsistently formatted reports.
The goal of snpsettest is to provide simple tools that perform set-based association tests (e.g., gene-based association tests) using GWAS (genome-wide association study) summary statistics. A set-based association test in this package is based on the statistical model described in VEGAS (versatile gene-based association study), which combines the effects of a set of SNPs accounting for linkage disequilibrium between markers. This package uses a different approach from the original VEGAS implementation to compute set-level p values more efficiently, as described in <https://github.com/HimesGroup/snpsettest/wiki/Statistical-test-in-snpsettest>.
CellTrails is an unsupervised algorithm for the de novo chronological ordering, visualization and analysis of single-cell expression data. CellTrails makes use of a geometrically motivated concept of lower-dimensional manifold learning, which exhibits a multitude of virtues that counteract intrinsic noise of single cell data caused by drop-outs, technical variance, and redundancy of predictive variables. CellTrails enables the reconstruction of branching trajectories and provides an intuitive graphical representation of expression patterns along all branches simultaneously. It allows the user to define and infer the expression dynamics of individual and multiple pathways towards distinct phenotypes.
This package provides tools for the calculation of common biodiversity indices from count data. Additionally, it incorporates bootstrapping techniques to generate multiple samples, facilitating the estimation of confidence intervals around these indices. Furthermore, the package allows for the exploration of how variation in these indices changes with differing numbers of sites, making it a useful tool with which to begin an ecological analysis. Methods are based on the following references: Chao et al. (2014) <doi:10.1890/13-0133.1>, Chao and Colwell (2022) <doi:10.1002/9781119902911.ch2>, Hsieh, Ma, and Chao (2016) <doi:10.1111/2041-210X.12613>.
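To make the combination of a diversity index and a bootstrap confidence interval concrete, here is a minimal Python sketch. It computes the Shannon index from species counts and a percentile interval by resampling individuals with replacement. The choice of index, the resampling unit (individuals rather than sites) and the names `shannon_index` and `bootstrap_ci` are assumptions for illustration only, not the package's implementation.

```python
import math
import random


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over species with count > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bootstrap_ci(counts, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the Shannon index.

    Individuals are resampled with replacement (an assumed scheme);
    each resample is turned back into species counts and re-scored.
    """
    rng = random.Random(seed)
    # One entry per observed individual, labelled by species position.
    individuals = [i for i, c in enumerate(counts) for _ in range(c)]
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(individuals, k=len(individuals))
        resampled_counts = [0] * len(counts)
        for species in resample:
            resampled_counts[species] += 1
        estimates.append(shannon_index(resampled_counts))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper
```

For example, `bootstrap_ci([30, 12, 5, 1])` returns a lower and an upper bound around `shannon_index([30, 12, 5, 1])`; fixing `seed` keeps the interval reproducible.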
This package provides a collection of functions for outlier detection in functional data analysis. Methods implemented include directional outlyingness by Dai and Genton (2019) <doi:10.1016/j.csda.2018.03.017>, MS-plot by Dai and Genton (2018) <doi:10.1080/10618600.2018.1473781>, total variation depth and modified shape similarity index by Huang and Sun (2019) <doi:10.1080/00401706.2019.1574241>, and sequential transformations by Dai et al. (2020) <doi:10.1016/j.csda.2020.106960>, among others. Additional outlier detection tools and depths for functional data, such as the functional boxplot and (modified) band depth, are also available.
Two distinct but related statistical approaches to the problem of identifying the combinations of medication error characteristics that are more likely to result in harm are implemented in this package: 1) a Bayesian hierarchical model with optimal Bayesian ranking on the log odds of harm, and 2) an empirical Bayes model that estimates the ratio of the observed count of harm to the count that would be expected if error characteristics and harm were independent. In addition, for the Bayesian hierarchical model, the package provides functions to assess the sensitivity of results to different specifications of the random effects distributions.
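The quantity at the heart of the second approach, the ratio of the observed harm count to the count expected under independence, can be sketched in a few lines of Python. This shows only the raw ratio; the empirical Bayes model described above would additionally shrink these ratios, which is not shown here. The function name and the dictionary layout are hypothetical.

```python
def observed_expected_ratio(table):
    """Raw observed/expected harm ratio per error profile.

    table maps an error-characteristic profile to a pair
    (harm_count, total_reports). Under independence, the expected harm
    count for a profile is total_reports * overall harm rate.
    """
    total_harm = sum(harm for harm, _ in table.values())
    total_reports = sum(n for _, n in table.values())
    overall_rate = total_harm / total_reports

    ratios = {}
    for profile, (harm, n) in table.items():
        expected = n * overall_rate
        ratios[profile] = harm / expected
    return ratios
```

A ratio above 1 marks a profile with more harm than independence would predict. Raw ratios are unstable for profiles with few reports, which is the usual motivation for shrinkage.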
General implementation of core functions from phase-type theory. PhaseTypeR can be used to model continuous and discrete phase-type distributions, both univariate and multivariate. The package includes functions for outputting the mean and (co)variance of phase-type distributions; their density, probability and quantile functions; functions for random draws; functions for reward-transformation; and functions for plotting the distributions as networks. For more information on these functions please refer to Bladt and Nielsen (2017, ISBN: 978-1-4939-8377-3) and Campillo Navarro (2019) <https://orbit.dtu.dk/en/publications/order-statistics-and-multivariate-discrete-phase-type-distributio>.
A spatial population can be generated based on a spatially varying regression model under the assumption that observations are collected from a uniform two-dimensional grid consisting of (m * m) lattice points with unit distance between any two neighbouring points. For method details see Chao, Liu., Chuanhua, Wei. and Yunan, Su. (2018) <DOI:10.1080/10485252.2018.1499907>. The generated spatial data can be used to test different issues related to the statistical analysis of spatial data, and in geographically weighted regression analysis for studying spatially varying relationships among variables.
Social risks are increasingly becoming a critical component of health care research. One of the most common ways to identify social needs is by using ICD-10-CM "Z-codes." This package identifies social risks using varying taxonomies of ICD-10-CM Z-codes from administrative health care data. The conceptual taxonomies come from: Centers for Medicare and Medicaid Services (2021) <https://www.cms.gov/files/document/zcodes-infographic.pdf>, Reidhead (2018) <https://web.mhanet.com/>, and A Arons, S DeSilvey, C Fichtenberg, L Gottlieb (2018) <https://sirenetwork.ucsf.edu/tools-resources/resources/compendium-medical-terminology-codes-social-risk-factors>.
The CNVMetrics package calculates similarity metrics to facilitate copy number variant comparison among samples and/or methods. Similarity metrics can be employed to compare CNV profiles of genetically unrelated samples as well as those with a common genetic background. Some metrics are based on the shared amplified/deleted regions while other metrics rely on the level of amplification/deletion. The data type used as input is a plain text file containing the genomic position of the copy number variations, as well as the status and/or the log2 ratio values. Finally, a visualization tool is provided to explore resulting metrics.
Peptide Set Test (PepSetTest) is a peptide-centric strategy to infer differentially expressed proteins in LC-MS/MS proteomics data. This test detects coordinated changes in the expression of peptides originating from the same protein and compares these changes against the rest of the peptidome. Compared to traditional aggregation-based approaches, the peptide set test demonstrates improved statistical power while still controlling the Type I error rate correctly in most cases. This test can be valuable for discovering novel biomarkers and prioritizing drug targets, especially when the direct application of statistical analysis to protein data fails to provide substantial insights.
Prognostic Enrichment is a strategy of enriching a clinical trial for testing an intervention intended to prevent or delay an unwanted clinical event. A prognostically enriched trial enrolls only patients who are more likely to experience the unwanted clinical event than the broader patient population (R. Temple (2010) <doi:10.1038/clpt.2010.233>). By testing the intervention in an enriched study population, the trial may be adequately powered with a smaller sample size, which can have both practical and ethical advantages. This package provides tools to evaluate biomarkers for prognostic enrichment of clinical trials with survival/time-to-event outcomes.
The function takes a DNA sequence, a start point, an end point in the sequence, dot size and dot color and draws a fractal image of the sequence. The fractal starts in the center of the canvas. The image is drawn by moving base by base along the sequence and placing a point midway between the current point and the corner assigned to the current base. For more details see Jeffrey (1990) <doi:10.1093/nar/18.8.2163>, Hill, Schisler, and Singh (1992) <doi:10.1007/BF00178602>, and Löchel and Heider (2021) <doi:10.1016/j.csbj.2021.11.008>.
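The midpoint rule described above can be sketched in Python as follows. The sketch only computes the point coordinates on a unit square and leaves drawing aside. The corner assignment (A bottom-left, C top-left, G top-right, T bottom-right), the function name `chaos_game` and the decision to skip ambiguous bases are assumptions for this example; the package may use a different layout.

```python
# Assumed corner layout on the unit square.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def chaos_game(sequence, start=0, end=None):
    """Return the chaos game points for sequence[start:end].

    Starting from the center of the square, each base moves the current
    point halfway towards the corner assigned to that base.
    """
    x, y = 0.5, 0.5
    points = []
    for base in sequence[start:end].upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N (assumed policy)
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points
```

Each returned point would be plotted as one dot. Because every step halves the distance to a corner, subsequences ending in the same bases land in the same sub-square, which is what gives the image its fractal structure.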
We provide three distance metrics for measuring the separation between two clusters in high-dimensional spaces. The first metric is the centroid distance, which calculates the Euclidean distance between the centers of the two groups. The second is a ridge Mahalanobis distance, which incorporates a ridge correction constant, alpha, to ensure that the covariance matrix is invertible. The third metric is the maximal data piling distance, which computes the orthogonal distance between the affine spaces spanned by each class. These three distances are asymptotically interconnected and are applicable in tasks such as discrimination, clustering, and outlier detection in high-dimensional settings.
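The first two distances can be illustrated with a short pure-Python sketch. It assumes the ridge Mahalanobis distance takes the form sqrt(d' (S + alpha I)^-1 d), where d is the difference of the group means and S is the pooled sample covariance; that form, the pooling and all function names are assumptions for this example and may differ from the package. The maximal data piling distance is not shown.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length observation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def centroid_distance(x, y):
    """Euclidean distance between the two group centroids."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(centroid(x), centroid(y))))


def pooled_covariance(x, y):
    """Pooled sample covariance of the two groups (assumed estimator)."""
    dim = len(x[0])
    s = [[0.0] * dim for _ in range(dim)]
    for rows in (x, y):
        center = centroid(rows)
        for row in rows:
            dev = [a - b for a, b in zip(row, center)]
            for i in range(dim):
                for j in range(dim):
                    s[i][j] += dev[i] * dev[j]
    dof = len(x) + len(y) - 2
    return [[v / dof for v in row] for row in s]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def ridge_mahalanobis(x, y, alpha):
    """sqrt(d' (S + alpha I)^-1 d); alpha > 0 keeps the matrix invertible."""
    diff = [a - b for a, b in zip(centroid(x), centroid(y))]
    s = pooled_covariance(x, y)
    for i in range(len(diff)):
        s[i][i] += alpha
    z = solve(s, diff)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))
```

With `x = [[0, 0], [2, 0]]` and `y = [[4, 3], [6, 3]]` the pooled covariance is singular because neither group varies in the second coordinate, so an ordinary Mahalanobis distance is undefined, while `ridge_mahalanobis(x, y, 1.0)` is still computable.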