This package provides functions to prepare time priors for MCMCtree analyses in the PAML software from Yang (2007) <doi:10.1093/molbev/msm088> and to plot time-scaled phylogenies from any Bayesian divergence time analysis. Most time-calibrated node prior distributions require user-specified parameters. The package provides functions to refine these parameters so that the resulting prior distributions accurately reflect confidence in known, usually fossil, time information. These functions also enable users to visualise distributions and write MCMCtree-ready input files. Additionally, the package supplies flexible functions to visualise age uncertainty on a plotted tree using node bars, branch widths proportional to the age uncertainty, or the full posterior distributions plotted on nodes. Time-scaled phylogenetic plots can be visualised with absolute and geological timescales. All plotting functions are applicable to output from any Bayesian software, not just MCMCtree.
GBScleanR is a package for quality checking, filtering, and error correction of genotype data derived from next-generation sequencing (NGS) based genotyping platforms. GBScleanR takes a Variant Call Format (VCF) file as input. The main function of this package is `estGeno()`, which estimates the true genotypes of samples from given read counts for genotype markers using a hidden Markov model that incorporates the uneven observation ratio of allelic reads. This implementation gives robust genotype estimation even for the noisy genotype data usually observed in Genotyping-By-Sequencing (GBS) and similar methods, e.g. RADseq. The current implementation accepts genotype data of a diploid population at any generation of a multi-parental cross, e.g. a biparental F2 from inbred parents, a biparental F2 from outbred parents, or 8-way recombinant inbred lines (8-way RILs), which can be referred to as a MAGIC population.
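A minimal sketch of the workflow described above (not verbatim package usage: only estGeno() is named in the description; the loading and parent-declaration helpers, their arguments, and the file name below are assumptions made for illustration):

```r
# Hedged sketch of a GBScleanR-style run; helper names other than estGeno()
# are assumed, not taken from the description.
library(GBScleanR)
gds <- loadGDS("biparental_f2.gds")               # assumed loader for a VCF converted to GDS
gds <- setParents(gds, parents = c("P1", "P2"))   # assumed helper to declare the parental samples
gds <- estGeno(gds)                               # HMM-based estimation of true genotypes (named above)
```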
MethylMix is an algorithm implemented to identify hyper- and hypomethylated genes for a disease. MethylMix is based on a beta mixture model to identify methylation states and compares them with the normal DNA methylation state. MethylMix uses a novel statistic, the Differential Methylation value or DM-value, defined as the difference of a methylation state with the normal methylation state. Finally, matched gene expression data are used to identify, beyond differential methylation, functional methylation states by focusing on methylation changes that affect gene expression. References: Gevaert O. MethylMix: an R package for identifying DNA methylation-driven genes. Bioinformatics (Oxford, England). 2015;31(11):1839-41. doi:10.1093/bioinformatics/btv020. Gevaert O, Tibshirani R, Plevritis SK. Pancancer analysis of DNA methylation-driven genes using MethylMix. Genome Biology. 2015;16(1):17. doi:10.1186/s13059-014-0579-8.
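As a small numerical illustration of the DM-value defined above (base R only; the values are hypothetical, not package output):

```r
# DM-value = methylation state minus normal methylation state, for one gene.
meth_state  <- 0.72   # hypothetical mean methylation of a mixture component in disease samples
meth_normal <- 0.31   # hypothetical normal DNA methylation state for the same gene
dm_value <- meth_state - meth_normal
dm_value              # positive DM-value: hypermethylated relative to normal
```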
This package provides a set of functions for applying a restricted linear algebra to the analysis of count-based data. See the accompanying preprint manuscript: "Normalizing need not be the norm: count-based math for analyzing single-cell data", Church et al. (2022) <doi:10.1101/2022.06.01.494334>. This tool is specifically designed to analyze count matrices from single-cell RNA sequencing assays. It implements several count-based approaches for standard steps in single-cell RNA-seq analysis, including scoring genes and cells, comparing cells and clustering, calculating differential gene expression, and several methods for rank reduction. There are many opportunities for further optimization that may prove useful in the analysis of other data. The source code is freely available at <https://github.com/shchurch/countland>, and users and developers are encouraged to fork the code for their own purposes.
Implements models for correlation/covariance matrices, including two approaches to model correlation matrices from a graphical structure. One uses latent parent variables as proposed in Sterrantino et al. (2024) <doi:10.48550/arXiv.2312.06289>. The other uses a graph to specify conditional relations between the variables. The graphical structure makes correlation matrices interpretable and avoids the quadratic increase in the number of parameters as a function of the dimension. In the first approach, a natural sequence of simpler models along with a complexity penalization is used. The second penalizes deviations from a base model. These can be used as priors for model parameters, implemented as C code through the cgeneric interface for the INLA package (<https://www.r-inla.org>). This allows one to use these models as building blocks that can be combined with other latent Gaussian models in order to build complex data models.
Simplified odds ratio calculation for GAM(M)s & GLM(M)s. Provides structured output (data frame) of all predictors and their corresponding odds ratios and confidence intervals for further analyses. It helps to avoid false references of predictors and increments by specifying these parameters in a list, instead of using exp(coef(model)) (the standard approach to odds ratio calculation for GLMs), which just returns a plain numeric output. For GAM(M)s, odds ratio calculation is highly simplified with this package since it takes care of the multiple predict() calls for the chosen predictor while holding the other predictors constant. Also, this package allows odds ratio calculation in percentage steps across the whole distribution range of a predictor for GAM(M)s. In both cases, confidence intervals are returned additionally. Calculated odds ratios of GAM(M)s can be inserted into the smooth function plot.
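For contrast, the plain base-R approach mentioned above, which this package is designed to improve upon (base R / stats only; mtcars is used as a stand-in data set):

```r
# Standard odds ratio calculation for a GLM via exp(coef(model)),
# with exponentiated Wald confidence limits; returns plain numeric output.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
odds_ratios <- exp(coef(fit))
conf_int    <- exp(confint.default(fit))
cbind(OR = odds_ratios, conf_int)
```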
Decision support tool for prioritizing sites for ecological surveys based on their potential to improve plans for conserving biodiversity (e.g. plans for establishing protected areas). Given a set of sites that could potentially be acquired for conservation management, it can be used to generate and evaluate plans for surveying additional sites. Specifically, plans for ecological surveys can be generated using various conventional approaches (e.g. maximizing expected species richness, geographic coverage, diversity of sampled environmental conditions) and by maximizing value of information. After generating such survey plans, they can be evaluated using value of information analysis. Please note that several functions depend on the Gurobi optimization software (available from <https://www.gurobi.com>). Additionally, the JAGS software (available from <https://mcmc-jags.sourceforge.io/>) is required to fit hierarchical generalized linear models. For further details, see Hanson et al. (2023) <doi:10.1111/1365-2664.14309>.
The anomalize package enables a "tidy" workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it's quite simple to decompose time series, detect anomalies, and create bands separating the "normal" data from the anomalous data at scale (i.e. for multiple time series). Time series decomposition is used to remove trend and seasonal components via the time_decompose() function and methods include seasonal decomposition of time series by Loess ("stl") and seasonal decomposition by piecewise medians ("twitter"). The anomalize() function implements two methods for anomaly detection of residuals including using an inner quartile range ("iqr") and generalized extreme studentized deviation ("gesd"). These methods are based on those used in the forecast package and the Twitter AnomalyDetection package. Refer to the associated functions for specific references for these methods.
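A minimal sketch of the tidy workflow described above, assuming the tidyverse_cran_downloads example data set shipped with the package (a grouped tibble of daily download counts):

```r
library(dplyr)
library(anomalize)

tidyverse_cran_downloads %>%
  time_decompose(count, method = "stl") %>%   # split each series into trend, season, remainder
  anomalize(remainder, method = "iqr") %>%    # flag anomalous remainder values
  time_recompose()                            # rebuild bands separating "normal" from anomalous data
```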
Extract data from Birdscan MR1 SQL vertical-looking radar databases, filter them, and process them to Migration Traffic Rates (number of objects per hour and km) or densities (number of objects per km^3) of, for example, birds and insects. Object classifications in the Birdscan MR1 databases are based on the dataset of Haest et al. (2021) <doi:10.5281/zenodo.5734960>. Migration Traffic Rates and densities can be calculated separately for different height bins (with a height resolution of choice) as well as over time periods of choice (e.g., 1/2 hour, 1 hour, 1 day, day/night, the full time period of observation, and anything in between). Two plotting functions are also included to explore the data in the SQL databases and the resulting Migration Traffic Rate results. For details on the Migration Traffic Rate calculation procedures, see Schmid et al. (2019) <doi:10.1111/ecog.04025>.
Statistical tools for analyzing cognitive diagnosis (CD) data collected from small settings using the nonparametric classification (NPCD) framework. The core methods of the NPCD framework include the nonparametric classification (NPC) method developed by Chiu and Douglas (2013) <DOI:10.1007/s00357-013-9132-9> and the general NPC (GNPC) method developed by Chiu, Sun, and Bian (2018) <DOI:10.1007/s11336-017-9595-4> and Chiu and Köhn (2019) <DOI:10.1007/s11336-019-09660-x>. An extension of the NPCD framework included in the package is the nonparametric method for multiple-choice items (MC-NPC) developed by Wang, Chiu, and Koehn (2023) <DOI:10.3102/10769986221133088>. Functions associated with various extensions concerning the evaluation, validation, and feasibility of CD analysis are also provided. These topics include the completeness of the Q-matrix, Q-matrix refinement methods, and Q-matrix estimation.
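To illustrate the idea behind the NPC method referenced above (illustration only, not the package's API): an examinee is assigned the attribute profile whose ideal response pattern is closest in Hamming distance to the observed responses; all values below are hypothetical.

```r
observed <- c(1, 0, 1, 1, 0)           # hypothetical responses to 5 items
ideal <- rbind(                        # hypothetical ideal response patterns per attribute profile
  profile_00 = c(0, 0, 0, 0, 0),
  profile_10 = c(1, 0, 1, 0, 0),
  profile_01 = c(0, 0, 0, 1, 0),
  profile_11 = c(1, 1, 1, 1, 1)
)
hamming <- apply(ideal, 1, function(eta) sum(eta != observed))
names(which.min(hamming))              # profile with the smallest Hamming distance
```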
This package implements Bayesian data analyses of balanced repeatability and reproducibility studies with ordinal measurements. Model fitting is based on MCMC posterior sampling with rjags. Function ordinalRR() directly carries out the model fitting, and this function has the flexibility to allow the user to specify key aspects of the model, e.g., fixed versus random effects. Functions for preprocessing data and for the numerical and graphical display of a fitted model are also provided. There are also functions for displaying the model at fixed (user-specified) parameters and for simulating a hypothetical data set at a fixed (user-specified) set of parameters for a random-effects rater population. For additional technical details, refer to Culp, Ryan, Chen, and Hamada (2018) and cite this Technometrics paper when referencing any aspect of this work. The demo of this package reproduces results from the Technometrics paper.
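A minimal sketch, not verbatim package usage: only ordinalRR() is named in the description; the data object, the random argument, and the plot() method below are assumptions made for illustration.

```r
library(ordinalRR)
# gauge_data: ordinal ratings from a balanced repeatability/reproducibility study (assumed object)
fit <- ordinalRR(gauge_data, random = TRUE)   # random-effects rater population (argument assumed)
plot(fit)                                     # graphical display of the fitted model (method assumed)
```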
This package contains functions to perform various models and methods for test equating (Kolen and Brennan, 2014 <doi:10.1007/978-1-4939-0317-7>; Gonzalez and Wiberg, 2017 <doi:10.1007/978-3-319-51824-4>; von Davier et al., 2004 <doi:10.1007/b97446>). It currently implements the traditional mean, linear and equipercentile equating methods. Both IRT observed-score and true-score equating are also supported, as well as the mean-mean, mean-sigma, Haebara and Stocking-Lord IRT linking methods. It also supports newer methods such as local equating, kernel equating (using Gaussian, logistic, Epanechnikov, uniform and adaptive kernels) with presmoothing, and IRT parameter linking methods based on asymmetric item characteristic functions. Functions to obtain both the standard error of equating (SEE) and the standard error of equating differences between two equating functions (SEED) are also implemented for the kernel method of equating.
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (G×E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear G×E interactions simultaneously (Ren et al. (2020) <doi:10.1002/sim.8434>). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated at both the individual and group levels to shrink coefficients corresponding to irrelevant main and interaction effects to exactly zero. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
The package contains methods to visualise the expression profile of genes from a microarray or RNA-seq experiment, and offers a supervised clustering approach to identify GO terms containing genes with expression levels that best classify two or more predefined groups of samples. Annotations for the genes present in the expression dataset may be obtained from Ensembl through the biomaRt package, if not provided by the user. By default, a random forest framework is used to evaluate the capacity of each gene to cluster samples according to the factor of interest. Finally, GO terms are scored by averaging the rank (alternatively, the score) of their respective gene sets to cluster the samples. P-values may be computed to assess the significance of the GO term ranking. Visualisation functions include gene expression profiles, gene ontology-based heatmaps, and hierarchical clustering of experimental samples using gene expression data.
We consider the problem of estimating two isotonic regression curves g1* and g2* under the constraint that they are ordered, i.e. g1* <= g2*. Given two sets of n data points y_1, ..., y_n and z_1, ..., z_n observed at (the same) deterministic design points x_1, ..., x_n, the estimates are obtained by minimizing the Least Squares criterion L(a, b) = sum_{i=1}^n (y_i - a_i)^2 w1(x_i) + sum_{i=1}^n (z_i - b_i)^2 w2(x_i) over the class of pairs of vectors (a, b) such that a and b are isotonic and a_i <= b_i for all i = 1, ..., n. We offer two different approaches to compute the estimates: a projected subgradient algorithm, where the projection is calculated using a PAVA, as well as Dykstra's cyclical projection algorithm.
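Written out, the criterion from the description is

```latex
L(a, b) \;=\; \sum_{i=1}^{n} (y_i - a_i)^2\, w_1(x_i) \;+\; \sum_{i=1}^{n} (z_i - b_i)^2\, w_2(x_i),
```

minimized over pairs (a, b) with a and b isotonic and a_i <= b_i for all i.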
NOTE: PARAMLINK HAS BEEN SUPERSEDED BY THE PEDSUITE PACKAGES (<https://magnusdv.github.io/pedsuite/>). PARAMLINK IS MAINTAINED ONLY FOR LEGACY PURPOSES AND SHOULD NOT BE USED IN NEW PROJECTS. A suite of tools for analysing pedigrees with marker data, including parametric linkage analysis, forensic computations, relatedness analysis and marker simulations. The core of the package is an implementation of the Elston-Stewart algorithm for pedigree likelihoods, extended to allow mutations as well as complex inbreeding. Features for linkage analysis include singlepoint LOD scores, power analysis, and multipoint analysis (the latter through a wrapper to the MERLIN software). Forensic applications include exclusion probabilities, genotype distributions and conditional simulations. Data from the Familias software can be imported and analysed in paramlink. Finally, paramlink offers many utility functions for creating, manipulating and plotting pedigrees with or without marker data (the actual plotting is done by the kinship2 package).
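For reference, the singlepoint LOD score mentioned above follows the standard definition (a general formula, not one taken from the package documentation): for a recombination fraction theta,

```latex
\mathrm{LOD}(\theta) \;=\; \log_{10} \frac{L(\theta)}{L(\theta = 1/2)}.
```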
For fitting N-mixture models using either FFT or asymptotic approaches. FFT N-mixture models extend the work of Cowen et al. (2017) <doi:10.1111/biom.12701>. Asymptotic N-mixture models extend the work of Dail and Madsen (2011) <doi:10.1111/j.1541-0420.2010.01465.x>, to consider asymptotic solutions to the open population N-mixture models. The FFT models are derived and described in "Parker, M.R.P., Elliott, L., Cowen, L.L.E. (2022). Computational efficiency and precision for replicated-count and batch-marked hidden population models [Manuscript in preparation]. Department of Statistics and Actuarial Sciences, Simon Fraser University.". The asymptotic models are derived and described in: "Parker, M.R.P., Elliott, L., Cowen, L.L.E., Cao, J. (2022). Fast asymptotic solutions for N-mixtures on large populations [Manuscript in preparation]. Department of Statistics and Actuarial Sciences, Simon Fraser University.".
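For context, a sketch of the standard closed-population N-mixture formulation that these models build on (the usual Poisson-binomial form is assumed here; the open-population extensions add dynamics for the latent abundances): with latent abundance N_i at site i and counts y_it on visit t,

```latex
N_i \sim \mathrm{Poisson}(\lambda), \qquad y_{it} \mid N_i \sim \mathrm{Binomial}(N_i,\, p).
```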
In a clinical trial, it frequently occurs that the most credible outcome to evaluate the effectiveness of a new therapy (the true endpoint) is difficult to measure. In such a situation, it can be an effective strategy to replace the true endpoint by a (bio)marker that is easier to measure and that allows for a prediction of the treatment effect on the true endpoint (a surrogate endpoint). The package Surrogate allows for an evaluation of the appropriateness of a candidate surrogate endpoint based on the meta-analytic, information-theoretic, and causal-inference frameworks. Part of this software has been developed using funding provided from the European Union's Seventh Framework Programme for research, technological development and demonstration (Grant Agreement no 602552), the Special Research Fund (BOF) of Hasselt University (BOF-number: BOF2OCPO3), GlaxoSmithKline Biologicals, Baekeland Mandaat (HBC.2022.0145), and Johnson & Johnson Innovative Medicine.
pipeFrame is an R package for building a componentized bioinformatics pipeline. Each step in this pipeline is wrapped in the framework, so the connections among steps are created seamlessly and automatically. Users can focus on fine-tuning arguments rather than spending a lot of time on transforming file formats, passing task outputs to task inputs, or installing dependencies. Componentized step elements can be flexibly customized into other new pipelines as well. A pipeline can be split into several important functional steps, so it is much easier for users to understand the complex arguments of each step than the parameter combinations of the whole pipeline. At the same time, a componentized pipeline can restart at a breakpoint and avoid rerunning the whole pipeline, which may save users a lot of time during pipeline tuning or after interruptions such as power failures or killed processes.
Drafting an epidemiological report in Microsoft Word format for a given disease, similar to the Annual Epidemiological Reports published by the European Centre for Disease Prevention and Control. Through standalone functions, it is specifically designed to generate each disease-specific output presented in these reports and includes: - Table with the distribution of cases by Member State over the last five years; - Seasonality plot with the distribution of cases at the European Union / European Economic Area level, by month, over the past five years; - Trend plot with the trend and number of cases at the European Union / European Economic Area level, by month, over the past five years; - Age and gender bar graph with the distribution of cases at the European Union / European Economic Area level. Two types of datasets can be used: - The default dataset of dengue 2015-2019 data; - Any dataset specified as described in the vignette.
Allows calculating global scores for characteristics of visual stimuli as assessed by human raters. Stimuli are presented as a sequence of pairwise comparisons ('contests'), during each of which a rater expresses a preference for one stimulus over the other (forced choice). The algorithm for calculating global scores is based on Elo rating, which updates individual scores after each single pairwise contest. Elo rating is widely used to rank chess players according to their performance. Its core feature is that dyadic contests with expected outcomes lead to smaller changes of participants' scores than outcomes that were unexpected. As such, Elo rating is an efficient tool to rate individual stimuli when a large number of such stimuli are paired against each other in the context of experiments where the goal is to rank stimuli according to some characteristic of interest. Clark et al. (2018) <doi:10.1371/journal.pone.0190393> provide details.
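For reference, a standard Elo update (the general form; the package's exact parameterization, such as the value of k, is not specified in this description): after a contest between stimuli A and B, with S_A = 1 if A is preferred and 0 otherwise,

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + k\,(S_A - E_A).
```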
Most common exact, asymptotic and resampling-based tests are provided for testing the homogeneity of variances of k normal distributions. These tests are Bartlett, Bhandary & Dai, Brown & Forsythe, Chang et al., Gokpinar & Gokpinar, Levene, Liu and Xu, and Gokpinar. Also, a function to generate data from multiple normal distributions with any specified parameters is provided. Bartlett, M. S. (1937) <doi:10.1098/rspa.1937.0109>; Bhandary, M., & Dai, H. (2008) <doi:10.1080/03610910802431011>; Brown, M. B., & Forsythe, A. B. (1974) <doi:10.1080/01621459.1974.10482955>; Chang, C. H., Pal, N., & Lin, J. J. (2017) <doi:10.1080/03610918.2016.1202277>; Gokpinar E. & Gokpinar F. (2017) <doi:10.1080/03610918.2014.955110>; Liu, X., & Xu, X. (2010) <doi:10.1016/j.spl.2010.05.017>; Levene, H. (1960) <https://cir.nii.ac.jp/crid/1573950400526848896>; Gökpınar, E. (2020) <doi:10.1080/03610918.2020.1800037>.
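A base-R illustration of the kind of data and test involved (not this package's own functions): simulate samples from k = 3 normal distributions and apply Bartlett's classical test of equal variances.

```r
set.seed(1)
x <- c(rnorm(20, mean = 0, sd = 1.0),
       rnorm(25, mean = 0, sd = 1.0),
       rnorm(30, mean = 0, sd = 1.5))           # third group has a larger variance
group <- factor(rep(1:3, times = c(20, 25, 30)))
bartlett.test(x, group)                         # classical homogeneity-of-variances test
```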
Simply and efficiently simulates (i) variants from reference genomes and (ii) reads from both Illumina <https://www.illumina.com/> and Pacific Biosciences (PacBio) <https://www.pacb.com/> platforms. It can either read reference genomes from FASTA files or simulate new ones. Genomic variants can be simulated using summary statistics, phylogenies, Variant Call Format (VCF) files, and coalescent simulations, the latter of which can include selection, recombination, and demographic fluctuations. jackalope can simulate single, paired-end, or mate-pair Illumina reads, as well as PacBio reads. These simulations include sequencing errors, mapping qualities, multiplexing, and optical/polymerase chain reaction (PCR) duplicates. Simulating Illumina sequencing is based on ART by Huang et al. (2012) <doi:10.1093/bioinformatics/btr708>. PacBio sequencing simulation is based on SimLoRD by Stöcker et al. (2016) <doi:10.1093/bioinformatics/btw286>. All outputs can be written to standard file formats.
The inference in multi-state models is traditionally performed under a Markov assumption, which states that the past and future of the process are independent given the present state. In this package, we consider tests of the Markov assumption that are applicable to general multi-state models. Three approaches using existing methodology are considered: a simple method based on including covariates depending on the history in Cox models for the transition intensities; methods based on measuring the discrepancy of the non-Markov estimators of the transition probabilities to the Markov Aalen-Johansen estimators; and, finally, methods developed by considering summaries from families of log-rank statistics where patients are grouped by the state occupied by the process at a particular time point (see Soutinho G, Meira-Machado L (2021) <doi:10.1007/s00180-021-01139-7> and Titman AC, Putter H (2020) <doi:10.1093/biostatistics/kxaa030>).
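The Markov assumption being tested can be written as follows (a standard statement of the property, not notation specific to this package): for times s < t and states i, j,

```latex
P\bigl(X(t) = j \mid X(s) = i,\ \{X(u) : u < s\}\bigr) \;=\; P\bigl(X(t) = j \mid X(s) = i\bigr).
```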