Processing collections of Earth observation images as on-demand multispectral, multitemporal raster data cubes. Users define cubes by spatiotemporal extent, resolution, and spatial reference system and let gdalcubes automatically apply cropping, reprojection, and resampling using the Geospatial Data Abstraction Library ('GDAL'). Implemented functions on data cubes include reduction over space and time, applying arithmetic expressions on pixel band values, moving window aggregates over time, filtering by space, time, bands, and predicates on pixel values, exporting data cubes as netCDF or GeoTIFF files, plotting, and extraction from spatial or spatiotemporal features. All computational parts are implemented in C++, linking to the GDAL, netCDF, CURL, and SQLite libraries. See Appel and Pebesma (2019) <doi:10.3390/data4030092> for further details.
This package provides an extension to ggplot2 (Wickham, 2016, <doi:10.1007/978-3-319-24277-4>) for creating two types of continuous confidence interval plots (Violin CI and Gradient CI plots), typically for the sample mean. These plots contain multiple user-defined confidence areas with varying colours, defined by the underlying t-distribution used to compute standard confidence intervals for the mean of the normal distribution when the variance is unknown. Two types of plots are available, a gradient plot with rectangular areas, and a violin plot where the shape (horizontal width) is defined by the probability density function of the t-distribution. These visualizations are studied in (Helske, Helske, Cooper, Ynnerman, and Besancon, 2021) <doi:10.1109/TVCG.2021.3073466>.
In omics association studies, it is common to apply p-value corrections to control false positives. Beyond p-value corrections, the E-value has recently been studied as a way to facilitate multiple testing correction, following V. Vovk and R. Wang (2021) <doi:10.1214/20-AOS2020>. This package provides E-value calculation for DNA methylation data and RNA-seq data. Currently, five data formats are supported: DNA methylation levels from DMR detection tools (BiSeq, DMRfinder, MethylKit, Metilene, and other DNA methylation tools) and RNA-seq data. The relevant references are listed below: Katja Hebestreit and Hans-Ulrich Klein (2022) <doi:10.18129/B9.bioc.BiSeq>; Altuna Akalin et al. (2012) <doi:10.18129/B9.bioc.methylKit>.
Generates LaTeX code for drawing well-formatted neural network diagrams with TikZ. Users define the number of neurons in each layer and can optionally specify neuron connections to keep or omit, layers they consider oversized, and neurons to draw in a lighter color. They can also specify the diagram title, color, figure opacity, and labels for layers and for input and output neurons. In addition, this package helps produce LaTeX code for drawing activation functions, which are crucial in neural network analysis. To make the code work in a LaTeX editor, users need to install and import some TeX packages, including TikZ, in the settings of the TeX file.
The Poverty Probability Index (PPI) is a poverty measurement tool for organizations and businesses with a mission to serve the poor. The PPI is statistically-sound, yet simple to use: the answers to 10 questions about a household's characteristics and asset ownership are scored to compute the likelihood that the household is living below the poverty line - or above by only a narrow margin. This package contains country-specific lookup data tables used as reference to determine the poverty likelihood of a household based on their score from the country-specific PPI questionnaire. These lookup tables have been extracted from documentation of the PPI found at <https://www.povertyindex.org> and managed by Innovations for Poverty Action <https://poverty-action.org/>.
Survey sampling using permanent random numbers (PRNs). A solution to the problem of unknown overlap between survey samples, which leads to low precision in estimates when the survey is repeated or combined with other surveys. The PRN solution is to supply the U(0, 1) random numbers to the sampling procedure, instead of having the sampling procedure generate them. Lindblom (2014) <doi:10.2478/jos-2014-0047>, and the papers cited therein, show how this is carried out and how it improves the estimates. This package supports two common fixed-size sampling procedures (simple random sampling and probability-proportional-to-size sampling) and includes a function for transforming the PRNs in order to control the sample overlap.
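The PRN idea can be sketched in a few lines. This is a minimal illustration, not the package's API: it assumes sequential simple random sampling (take the n units with the smallest PRNs) and assumes that overlap is controlled by rotating the unit interval. The function names, the frame, and the PRN values are all hypothetical.

```python
def srs_with_prns(prns, n):
    """Return the n units with the smallest PRNs (sequential simple random sampling)."""
    ranked = sorted(prns, key=prns.get)
    return sorted(ranked[:n])


def shift_prns(prns, start):
    """Rotate the unit interval so that `start` maps to 0 (assumed overlap-control transform)."""
    return {unit: (u - start) % 1.0 for unit, u in prns.items()}


# Hypothetical frame: every unit keeps the same PRN across surveys.
prns = {"A": 0.05, "B": 0.62, "C": 0.31, "D": 0.88,
        "E": 0.47, "F": 0.15, "G": 0.93, "H": 0.70}

survey_1 = srs_with_prns(prns, 3)
survey_2 = srs_with_prns(prns, 3)                   # same PRNs: the samples coincide
survey_3 = srs_with_prns(shift_prns(prns, 0.5), 3)  # shifted PRNs: a different region of (0, 1)
```

Because the PRNs are supplied rather than generated inside the sampler, reusing them reproduces the sample, while shifting them moves the selection to another part of the unit interval and so reduces the overlap.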
This package provides a toolkit for analysis and visualization of data from fluorophore-assisted seed amplification assays, such as Real-Time Quaking-Induced Conversion (RT-QuIC) and Fluorophore-Assisted Protein Misfolding Cyclic Amplification (PMCA). QuICSeedR addresses limitations in existing software by automating data processing, supporting large-scale analysis, and enabling comparative studies of analysis methods. It incorporates methods described in Henderson et al. (2015) <doi:10.1099/vir.0.069906-0>, Li et al. (2020) <doi:10.1038/s41598-021-96127-8>, Rowden et al. (2023) <doi:10.3390/pathogens12020309>, Haley et al. (2013) <doi:10.1371/journal.pone.0081488>, and Mair and Wilcox (2020) <doi:10.3758/s13428-019-01246-w>. Please refer to the original publications for details.
An extension of the AlphaSimR package (<https://cran.r-project.org/package=AlphaSimR>) for stochastic simulations of honeybee populations and breeding programmes. SIMplyBee enables simulation of individual bees that form a colony, which includes a queen, fathers (drones the queen mated with), virgin queens, workers, and drones. Multiple colonies can be merged into a population of colonies, such as an apiary or a whole country of colonies. Functions enable operations on castes, a colony, or colonies, to ease R scripting of whole populations. All AlphaSimR functionality with respect to genomes and genetic and phenotype values is available and further extended for honeybees, including haplo-diploidy, complementary sex determiner locus, colony events (swarming, supersedure, etc.), and colony phenotype values.
Creation of an individual claims simulator which generates various features of non-life insurance claims. An initial set of test parameters, designed to mirror the experience of an Auto Liability portfolio, was set up and is applied by default to generate a realistic test data set of individual claims (see vignette). The simulated data set then allows practitioners to back-test the validity of various reserving models and to prove and/or disprove certain actuarial assumptions made in claims modelling. The distributional assumptions used to generate this data set can be easily modified by users to match their experiences. Reference: Avanzi B, Taylor G, Wang M, Wong B (2020) "SynthETIC: an individual insurance claim simulator with feature control" <arXiv:2008.05693>.
This package implements adaptive designs for integrated phase I/II trials of drug combinations via the continual reassessment method (CRM), evaluating toxicity and efficacy simultaneously for each enrolled patient cohort based on Bayesian inference. It supports patient assignment guidance in a single trial using the data from currently enrolled patients, as well as extensive simulation studies to evaluate operating characteristics before the trial starts. It includes various link functions, such as empiric, one-parameter logistic, two-parameter logistic, and hyperbolic tangent, and multiple prior distributions for the parameters, such as normal, gamma, and exponential distributions, to accommodate diverse clinical scenarios. The method using a Bayesian framework with the empiric link function is described in Wages and Conaway (2014) <doi:10.1002/sim.6097>.
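To make the empiric link concrete, here is a deliberately simplified sketch. It assumes a single-agent, toxicity-only setting, the common empiric form p_i = skeleton_i ** exp(a), a normal prior on a, and a plain grid approximation of the posterior mean. None of the names, the skeleton values, or the prior standard deviation come from the package, and the combination and efficacy parts of the design are not covered.

```python
import math


def empiric_link(skeleton, a):
    """Empiric link: raise each prior guess (skeleton value) to the power exp(a)."""
    return [p ** math.exp(a) for p in skeleton]


def posterior_mean_a(skeleton, n_treated, n_tox, prior_sd=1.0, grid_size=2001):
    """Grid approximation of E[a | data] under a N(0, prior_sd^2) prior (assumed)."""
    lo, hi = -5.0, 5.0
    step = (hi - lo) / (grid_size - 1)
    numerator = denominator = 0.0
    for k in range(grid_size):
        a = lo + k * step
        # Unnormalised prior density times the binomial likelihood at each dose.
        weight = math.exp(-0.5 * (a / prior_sd) ** 2)
        for p, n, y in zip(empiric_link(skeleton, a), n_treated, n_tox):
            weight *= p ** y * (1.0 - p) ** (n - y)
        numerator += a * weight
        denominator += weight
    return numerator / denominator


skeleton = [0.05, 0.15, 0.30]  # hypothetical prior toxicity guesses per dose
a_no_data = posterior_mean_a(skeleton, [0, 0, 0], [0, 0, 0])
a_toxic = posterior_mean_a(skeleton, [3, 3, 3], [2, 3, 3])
a_safe = posterior_mean_a(skeleton, [6, 6, 6], [0, 0, 0])
updated_toxicity = empiric_link(skeleton, a_toxic)
```

A negative posterior mean for a pushes the estimated toxicity probabilities above the skeleton, and a positive one pulls them below it, which is how accumulating data steer the dose assignment.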
Many statistical models and analyses in R are implemented through formula objects. The formulaic package creates a unified approach for programmatically and dynamically generating formula objects. Users may specify the outcome and inputs of a model directly, search for variables to include based upon naming patterns, incorporate interactions, and identify variables to exclude. A wide range of quality checks are implemented to identify issues such as misspecified variables, duplication, a lack of contrast in the inputs, and a large number of levels in categorical data. Variables that do not meet these quality checks can be automatically excluded from the model. These issues are documented and reported in a manner that provides greater accountability and useful information to guide an investigation of the data.
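The idea of building a formula programmatically, with checks that drop problematic terms and record why, can be sketched as follows. This is an illustrative Python analogue that emits an R-style formula string. The function `build_formula`, its arguments, and the column names are hypothetical and are not the formulaic package's interface.

```python
def build_formula(outcome, inputs, available, patterns=(), interactions=(), exclude=()):
    """Assemble 'outcome ~ terms' and report every term that was dropped, with the reason."""
    terms, dropped = [], []

    def add(term, parts):
        if any(part not in available for part in parts):
            dropped.append((term, "not in data"))
        elif any(part in exclude for part in parts):
            dropped.append((term, "excluded"))
        elif term in terms:
            dropped.append((term, "duplicate"))
        else:
            terms.append(term)

    for name in inputs:                       # inputs named directly
        add(name, [name])
    for pattern in patterns:                  # inputs found by naming pattern
        for name in available:
            if pattern in name and name != outcome:
                add(name, [name])
    for left, right in interactions:          # pairwise interactions
        add(f"{left}*{right}", [left, right])
    return f"{outcome} ~ " + " + ".join(terms), dropped


columns = ["y", "age", "income_2019", "income_2020", "region"]
formula, dropped = build_formula(
    "y", ["age", "age", "heigth"], columns,
    patterns=["income"], interactions=[("age", "region")], exclude=["income_2020"],
)
```

Returning the dropped terms alongside the formula mirrors the accountability goal described above: the caller can see which variables were removed and for what reason.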
The goal of this package is to cover the most common steps in probability of default (PD) rating model development and validation. The main procedures available are those that refer to univariate, bivariate, and multivariate analysis, calibration, and validation. Along with the accompanying monobin and monobinShiny packages, PDtoolkit provides functions suitable for different data transformation and modeling tasks, such as imputations, monotonic binning of numeric risk factors, binning of categorical risk factors, weights of evidence (WoE) and information value (IV) calculations, WoE coding (replacement of risk factor modalities with WoE values), risk factor clustering, area under curve (AUC) calculation, and others. Additionally, the package provides a set of validation functions for testing homogeneity, heterogeneity, and the discriminatory and predictive power of the model.
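As an illustration of the WoE and IV calculations mentioned above, the sketch below uses one common convention: WoE = ln(share of goods / share of bads) per bin, and IV = sum over bins of (share of goods - share of bads) * WoE. Sign conventions differ between sources, the bin counts are made up, and `woe_iv` is a hypothetical helper rather than a PDtoolkit function. Bins with zero goods or bads are not handled.

```python
import math


def woe_iv(bins):
    """bins maps a bin label to (good_count, bad_count); returns (WoE per bin, IV)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe, iv = {}, 0.0
    for label, (good, bad) in bins.items():
        share_good = good / total_good
        share_bad = bad / total_bad
        woe[label] = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * woe[label]
    return woe, iv


# Hypothetical binned risk factor: (non-defaults, defaults) per bin.
bins = {"low": (100, 40), "mid": (200, 30), "high": (300, 10)}
woe, iv = woe_iv(bins)
```

WoE coding then replaces each observation's bin label with the corresponding value in `woe`, so the risk factor enters the model on a single numeric scale.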
Because larger (> 50 MB) data files cannot easily be committed to git, a different approach is required to manage data associated with an analysis in a GitHub repository. This package provides a simple work-around by allowing larger (up to 2 GB) data files to piggyback on a repository as assets attached to individual GitHub releases. These files are not handled by git in any way, but instead are uploaded, downloaded, or edited directly by calls through the GitHub API. These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.
This package implements models of leaf temperature using energy balance. It uses units to ensure that parameters are properly specified and transformed before calculations. It allows separate lower and upper surface conductances to heat and water vapour, so sensible and latent heat loss are calculated for each surface separately as in Foster and Smith (1986) <doi:10.1111/j.1365-3040.1986.tb02108.x>. It's straightforward to model leaf temperature over environmental gradients such as light, air temperature, humidity, and wind. It can also model leaf temperature over trait gradients such as leaf size or stomatal conductance. Other references are Monteith and Unsworth (2013, ISBN:9780123869104), Nobel (2009, ISBN:9780123741431), and Okajima et al. (2012) <doi:10.1007/s11284-011-0905-5>.
We develop a novel matrix factorization tool named scINSIGHT to jointly analyze multiple single-cell gene expression samples from biologically heterogeneous sources, such as different disease phases, treatment groups, or developmental stages. Given multiple gene expression samples from different biological conditions, scINSIGHT simultaneously identifies common and condition-specific gene modules and quantifies their expression levels in each sample in a lower-dimensional space. With the factorized results, the inferred expression levels and memberships of common gene modules can be used to cluster cells and detect cell identities, and the condition-specific gene modules can help compare functional differences in transcriptomes from distinct conditions. Please also see Qian K, Fu SW, Li HW, Li WV (2022) <doi:10.1186/s13059-022-02649-3>.
The nucleolus is an important structure inside the nucleus of eukaryotic cells. It is the site for transcribing rDNA into rRNA and for assembling ribosomes, a process known as ribosome biogenesis. In addition, nucleoli are dynamic hubs through which numerous proteins shuttle and contact specific non-rDNA genomic loci. Deep sequencing analyses of DNA associated with isolated nucleoli (NAD-seq) have shown that specific loci, termed nucleolus-associated domains (NADs), form frequent three-dimensional associations with nucleoli. NAD-seq has been used to study the biological functions of NADs and the dynamics of NAD distribution during embryonic stem cell (ESC) differentiation. Here, we developed a Bioconductor package, NADfinder, for bioinformatic analysis of NAD-seq data, including baseline correction, smoothing, normalization, peak calling, and annotation.
LINCS L1000 is a high-throughput technology that allows gene expression measurement in a large number of assays. However, to fit the measurements of ~1000 genes into the ~500 color channels of LINCS L1000, every two landmark genes are designed to share a single channel. Thus, a deconvolution step is required to infer the expression values of each gene. Any errors in this step can propagate to the downstream analyses. We present l1kdeconv, an R package for LINCS L1000 data peak calling based on a new outlier detection method and an aggregate Gaussian mixture model. By removing outliers and borrowing information among similar samples, l1kdeconv shows more stable and better performance than methods commonly used in LINCS L1000 data deconvolution.
Traditional and spatial capture-mark-recapture analysis with multiple non-invasive marks. The models implemented in multimark combine encounter history data arising from two different non-invasive "marks", such as images of left-sided and right-sided pelage patterns of bilaterally asymmetrical species, to estimate abundance and related demographic parameters while accounting for imperfect detection. Bayesian models are specified using simple formulae and fitted using Markov chain Monte Carlo. Addressing deficiencies in currently available software, multimark also provides a user-friendly interface for performing Bayesian multimodel inference using non-spatial or spatial capture-recapture data consisting of a single conventional mark or multiple non-invasive marks. See McClintock (2015) <doi:10.1002/ece3.1676> and Maronde et al. (2020) <doi:10.1002/ece3.6990>.
This is a computational package designed to identify the most sensitive interactions within a network which must be estimated most accurately in order to produce qualitatively robust predictions to a press perturbation. This is accomplished by enumerating the number of sign switches (and their magnitude) in the net effects matrix when an edge experiences uncertainty. The package produces data and visualizations when uncertainty is associated to one or more edges in the network and according to a variety of distributions. The software requires the network to be described by a system of differential equations but only requires as input a numerical Jacobian matrix evaluated at an equilibrium point. This package is based on Koslicki, D., & Novak, M. (2017) <doi:10.1007/s00285-017-1163-0>.
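The sign-switch count can be illustrated with a small sketch. It assumes the usual press-perturbation definition of the net effects matrix as the negative inverse of the Jacobian, and it perturbs a single edge by a fixed amount instead of drawing the uncertainty from a distribution as the package does. The three-species Jacobian and all function names are hypothetical.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def net_effects(jacobian):
    """Net effects matrix, assumed here to be the negative inverse Jacobian."""
    return [[-v for v in row] for row in invert(jacobian)]


def sign(x, tol=1e-9):
    return (x > tol) - (x < -tol)


def count_sign_switches(jacobian, row, col, delta):
    """Count net-effect entries whose sign changes when one edge is shifted by delta."""
    perturbed = [list(r) for r in jacobian]
    perturbed[row][col] += delta
    before, after = net_effects(jacobian), net_effects(perturbed)
    return sum(sign(b) != sign(a)
               for b_row, a_row in zip(before, after)
               for b, a in zip(b_row, a_row))


# Hypothetical three-species food chain evaluated at equilibrium.
jacobian = [[-1.0, -1.0, 0.0],
            [1.0, -1.0, -1.0],
            [0.0, 1.0, -1.0]]
base = net_effects(jacobian)
small = count_sign_switches(jacobian, 0, 1, 0.1)
large = count_sign_switches(jacobian, 0, 1, 1.5)
```

An edge whose moderate perturbation already flips predicted signs is one that needs to be estimated most accurately. In this toy system the small shift flips nothing, while the larger one changes the predicted direction of two net effects.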
CellScape facilitates interactive browsing of single cell clonal evolution datasets. The tool requires two main inputs: (i) the genomic content of each single cell in the form of either copy number segments or targeted mutation values, and (ii) a single cell phylogeny. Phylogenetic formats can vary from dendrogram-like phylogenies with leaf nodes to evolutionary model-derived phylogenies with observed or latent internal nodes. The CellScape phylogeny is flexibly input as a table of source-target edges to support arbitrary representations, where each node may or may not have associated genomic data. The output of CellScape is an interactive interface displaying a single cell phylogeny and a cell-by-locus genomic heatmap representing the mutation status in each cell for each locus.
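The source-target edge table is easy to picture with a toy example. The sketch below is only an illustration of the data structure: the node names, the mutation values, and the helper `summarize_phylogeny` are hypothetical and say nothing about CellScape's actual input column names.

```python
def summarize_phylogeny(edges, genomic_data):
    """Derive root, leaves, and data-less (latent) nodes from a source-target edge table."""
    sources = {source for source, _ in edges}
    targets = {target for _, target in edges}
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    return {
        "root": sorted(sources - targets),
        "leaves": sorted(targets - sources),
        "latent": sorted((sources | targets) - set(genomic_data)),
        "children": children,
    }


# Hypothetical phylogeny: internal nodes n1 and n2 are latent (no genomic data).
edges = [("root", "n1"), ("root", "n2"),
         ("n1", "cellA"), ("n1", "cellB"), ("n2", "cellC")]

# Hypothetical targeted mutation status per cell and locus (1 = mutated).
genomic_data = {
    "cellA": {"locus1": 1, "locus2": 0},
    "cellB": {"locus1": 1, "locus2": 1},
    "cellC": {"locus1": 0, "locus2": 0},
}

tree = summarize_phylogeny(edges, genomic_data)
```

Because the tree is just a list of edges, the same table can describe a dendrogram whose data sit only at the leaves or a model-derived phylogeny in which some internal nodes also carry genomic data.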
Analysis of forest population structure and quantitative dynamics is the study and evaluation of the composition, distribution, age structure, and changes in quantity over time of the various populations in a forest. A thorough understanding of these characteristics provides a scientific basis for the management, protection, and sustainable utilization of forest resources. This R package conducts a systematic analysis of forest population structure and quantitative dynamics through age structure analysis, life tables, population quantitative dynamic change indices, and time series models, in order to support forest population protection and sustainable management. References: Zhang Y, Wang J, Wang X, et al. (2024) <doi:10.3390/plants13070946>; Yuan G, Guo Q, Xie N, et al. (2023) <doi:10.1007/s11629-022-7429-z>.
This package provides functions for the computation of F-, f- and D-statistics (e.g., Fst, hierarchical F-statistics, Patterson's F2, F3, F3*, F4 and D parameters) in population genomics studies from allele count or Pool-Seq read count data and for the fitting, building and visualization of admixture graphs. The package also includes several utilities to manipulate Pool-Seq data stored in standard formats (e.g., vcf files or rsync files generated by the PoPoolation software) and perform conversion to alternative formats (as used in the BayPass and SelEstim software). As of version 2.0, the package also includes utilities to manipulate standard allele count data (e.g., stored in TreeMix, BayPass and SelEstim format).
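For orientation, the F2, F3 and F4 parameters can be written as averages over SNPs of products of allele frequency differences. The sketch below is a naive plug-in version computed directly from population allele frequencies. It assumes the frequencies are known exactly, so it omits the sampling bias corrections that estimators for count or Pool-Seq data require. The frequencies and function names are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def f2(pa, pb):
    """F2(A, B): mean squared allele frequency difference."""
    return mean((a - b) ** 2 for a, b in zip(pa, pb))


def f3(pc, pa, pb):
    """F3(C; A, B): mean of (pC - pA) * (pC - pB)."""
    return mean((c - a) * (c - b) for c, a, b in zip(pc, pa, pb))


def f4(pa, pb, pc, pd):
    """F4(A, B; C, D): mean of (pA - pB) * (pC - pD)."""
    return mean((a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd))


# Hypothetical reference allele frequencies at four SNPs in four populations.
freqs = {
    "A": [0.10, 0.50, 0.80, 0.30],
    "B": [0.20, 0.40, 0.90, 0.35],
    "C": [0.15, 0.45, 0.70, 0.60],
    "D": [0.60, 0.10, 0.50, 0.55],
}

f2_ab = f2(freqs["A"], freqs["B"])
f3_c_ab = f3(freqs["C"], freqs["A"], freqs["B"])
f4_ab_cd = f4(freqs["A"], freqs["B"], freqs["C"], freqs["D"])
```

The three quantities are linked: F3(C; A, B) equals (F2(C, A) + F2(C, B) - F2(A, B)) / 2, which is why a full set of pairwise F2 values is enough to recover the others.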
The biomarker data set by Vermeulen et al. (2009) <doi:10.1016/S1470-2045(09)70154-8> is provided. The data source, however, is by Ruijter et al. (2013) <doi:10.1016/j.ymeth.2012.08.011>. The original data set may be downloaded from <https://medischebiologie.nl/wp-content/uploads/2019/02/qpcrdatamethods.zip>. This data set is for a real-time quantitative polymerase chain reaction (PCR) experiment that comprises the raw fluorescence data of 24,576 amplification curves. This data set comprises 59 genes of interest and 5 reference genes. Each gene was assessed on 366 neuroblastoma complementary DNA (cDNA) samples and on 18 standard dilution series samples (10-fold 5-point dilution series x 3 replicates + no template controls (NTC) x 3 replicates).
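The counts quoted above are mutually consistent, which a short check makes explicit. The variable names are only for illustration.

```python
genes = 59 + 5                      # genes of interest + reference genes
dilution_samples = 5 * 3 + 3        # 5-point dilution series x 3 replicates + 3 NTC replicates
samples = 366 + dilution_samples    # cDNA samples + standard dilution series samples
curves = genes * samples            # one amplification curve per gene and sample
```

That is, 64 genes measured on 384 samples each account for the 24,576 amplification curves.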
The analysis and visualization of alternative splicing (AS) events from RNA sequencing data remains challenging. SpliceWiz is a user-friendly and performance-optimized R package for AS analysis. It processes alignment BAM files to quantify read counts across splice junctions, performs IRFinder-based intron retention quantitation, and supports novel splicing event identification. We introduce a novel visualization for AS using normalized coverage, thereby allowing visualization of differential AS across conditions. SpliceWiz features a shiny-based GUI facilitating interactive data exploration of results, including gene ontology enrichment. It is performance optimized with multi-threaded processing of BAM files and a new COV file format for fast recall of sequencing coverage. Overall, SpliceWiz streamlines AS analysis, enabling reliable identification of functionally relevant AS events for further characterization.