This package provides a parallel implementation of Weighted Subspace Random Forest. The Weighted Subspace Random Forest algorithm was proposed in the International Journal of Data Warehousing and Mining by Baoxun Xu, Joshua Zhexue Huang, Graham Williams, Qiang Wang, and Yunming Ye (2012) <DOI:10.4018/jdwm.2012040103>. The algorithm can classify very high-dimensional data with random forests built using small subspaces. A novel variable weighting method is used for variable subspace selection in place of the traditional random variable sampling. This new approach is particularly useful in building models from high-dimensional data.
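The idea of replacing uniform variable sampling with weighted sampling can be illustrated with a minimal Python sketch. It is not the package's implementation: the function name `sample_weighted_subspace` is hypothetical, and the weights are assumed to be given (the description does not say how they are computed).

```python
import random


def sample_weighted_subspace(weights, m, rng):
    """Draw up to m distinct variable indices, each pick made with
    probability proportional to the (non-negative) weight of the
    variables that are still available.

    Assumption: `weights` is a precomputed relevance score per variable.
    At least one remaining variable must have a positive weight at each
    pick, otherwise random.choices raises ValueError.
    """
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(min(m, len(remaining))):
        current = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # sample without replacement
    return chosen


rng = random.Random(0)
subspace = sample_weighted_subspace([0.0, 0.0, 1.0, 1.0], 2, rng)
```

With uniform weights this reduces to ordinary random subspace sampling; skewed weights make informative variables more likely to enter each small subspace.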
Supports a structured approach for exploring PKPD data <https://opensource.nibr.com/xgx/>. It also contains helper functions for enabling the modeler to follow best R practices (by appending the program name, figure name location, and draft status to each plot). In addition, it enables the modeler to follow best graphical practices (by providing a theme that reduces chart ink, and by providing time-scale, log-scale, and reverse-log-transform-scale functions for more readable axes). Finally, it provides some data checking and summarizing functions for rapidly exploring pharmacokinetics and pharmacodynamics (PKPD) datasets.
Publicly available RNA-seq data is routinely used for retrospective analysis to elucidate new biology. Novel transcript discovery enabled by large collections of RNA-seq datasets has emerged as one such analysis. To increase the power of transcript discovery from large collections of RNA-seq datasets, we developed a new R package named Pooling RNA-seq and Assembling Models (PRAM), which builds transcript models in intergenic regions from pooled RNA-seq datasets. This package includes functions for defining intergenic regions, extracting and pooling related RNA-seq alignments, and predicting, selecting, and evaluating transcript models.
We provide an implementation for Sum of Ranking Differences (SRD), a novel statistical test introduced by Héberger (2010) <doi:10.1016/j.trac.2009.09.009>. The test allows the comparison of different solutions through a reference by first performing a rank transformation on the input, then calculating and comparing the distances between the solutions and the reference - the latter is measured in the L1 norm. The reference can be an external benchmark (e.g. an established gold standard) or can be aggregated from the data. The calculated distances, called SRD scores, are validated in two ways, see Héberger and Kollár-Hunek (2011) <doi:10.1002/cem.1320>. A randomization test (also called permutation test) compares the SRD scores of the solutions to the SRD scores of randomly generated rankings. The second validation option is cross-validation that checks whether the rankings generated from the solutions come from the same distribution or not. For a detailed analysis about the cross-validation process see Sziklai, Baranyi and Héberger (2021) <doi:10.48550/arXiv.2105.11939>. The package offers a wide array of features related to SRD including the computation of the SRD scores, validation options, input preprocessing and plotting tools.
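The core computation described above (rank transformation followed by an L1 distance to the reference) can be sketched in Python. This is an illustration, not the package's code: the function names are hypothetical, and giving tied values the average of their rank positions is an assumption of this sketch.

```python
def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of the
    positions they occupy (an assumption of this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def srd_score(solution, reference):
    """Sum of absolute rank differences (L1 distance between rankings)."""
    solution_ranks = rank_with_ties(solution)
    reference_ranks = rank_with_ties(reference)
    return sum(abs(s - r) for s, r in zip(solution_ranks, reference_ranks))


identical = srd_score([1, 2, 3], [1, 2, 3])
reversed_order = srd_score([3, 2, 1], [1, 2, 3])
```

A solution that ranks the objects exactly like the reference scores 0, and larger scores indicate rankings that are further from the reference.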
Application of genome prediction for a continuous variable, focused on genotype by environment (GE) genomic selection models (GS). It consists of a group of functions that help to create regression kernels for some GE genomic models proposed by Jarquín et al. (2014) <doi:10.1007/s00122-013-2243-1> and Lopez-Cruz et al. (2015) <doi:10.1534/g3.114.016097>. Also, it computes genomic predictions based on Bayesian approaches. The prediction function uses an orthogonal transformation of the data and specific priors presented by Cuevas et al. (2014) <doi:10.1534/g3.114.013094>.
This package creates survey designs for distance sampling surveys. These designs can be assessed for various effort and coverage statistics. Once the user is satisfied with the design characteristics they can generate a set of transects to use in their distance sampling survey. Many of the designs implemented in this R package were first made available in our Distance for Windows software and are detailed in Chapter 7 of Advanced Distance Sampling, Buckland et al. (2008, ISBN-13: 978-0199225873). Find out more about estimating animal/plant abundance with distance sampling at <https://distancesampling.org/>.
This package provides a program for Bayesian analysis of univariate normal mixtures with an unknown number of components, following the approach of Richardson and Green (1997) <doi:10.1111/1467-9868.00095>. This makes use of reversible jump Markov chain Monte Carlo methods that are capable of jumping between the parameter sub-spaces corresponding to different numbers of components in the mixture. A sample from the full joint distribution of all unknown variables is thereby generated, and this can be used as a basis for a thorough presentation of many aspects of the posterior distribution.
Supplementary utils for CRAN maintainers and R package developers. Validating the library, packages and lock files. Exploring the complexity of a specific package, such as evaluating its size in bytes with all dependencies. The complexity of a shiny app can be explored too. Assessing the life duration of a specific package version. Checking a CRAN package check page status for any errors and warnings. Retrieving a DESCRIPTION or NAMESPACE file for any package version. Comparing DESCRIPTION or NAMESPACE files between different package versions. Getting a list of all releases for a specific package. Bioconductor is partly supported.
Code to identify functional enrichments across diverse taxa in a phylogenetic tree, particularly where these taxa differ in abundance across samples in a non-random pattern. The motivation for this approach is to identify microbial functions encoded by diverse taxa that are at higher abundance in certain samples compared to others, which could indicate that such functions are broadly adaptive under certain conditions. See GitHub repository for tutorial and examples: <https://github.com/gavinmdouglas/POMS/wiki>. Citation: Gavin M. Douglas, Molly G. Hayes, Morgan G. I. Langille, Elhanan Borenstein (2022) <doi:10.1093/bioinformatics/btac655>.
This package provides a system to plan analyses within the mental model where you have one (or more) datasets and want to run either A) the same function multiple times with different arguments, or B) multiple functions. This is appropriate when you have multiple strata (e.g. locations, age groups) that you want to apply the same function to, or you have multiple variables (e.g. exposures) that you want to apply the same statistical method to, or when you are creating the output for a report and you need multiple different tables or graphs.
This package provides a multiple testing procedure for testing several groups of hypotheses. Linear dependency among the hypotheses within the same group is modeled by using hidden Markov Models. It is noted that a smaller p value does not necessarily imply more significance due to the dependency. A typical application is to analyze genome wide association studies datasets, where SNPs from the same chromosome are treated as a group and exhibit strong linear genomic dependency. See Wei Z, Sun W, Wang K, Hakonarson H (2009) <doi:10.1093/bioinformatics/btp476> for more details.
This R package assists breeders in linking data systems with their analytic pipelines, a crucial step in digitizing breeding processes. It supports querying and retrieving phenotypic and genotypic data from systems like EBS <https://ebs.excellenceinbreeding.org/>, BMS <https://bmspro.io>, BreedBase <https://breedbase.org>, GIGWA <https://github.com/SouthGreenPlatform/Gigwa2> (using BrAPI <https://brapi.org> calls), and Germinate <https://germinateplatform.github.io/get-germinate/>. Extra helper functions support environmental data sources, including TerraClimate <https://www.climatologylab.org/terraclimate.html> and FAO HWSDv2 <https://gaez.fao.org/pages/hwsd> soil database.
Estimate the transition diagnostic classification model (TDCM) described in Madison & Bradshaw (2018) <doi:10.1007/s11336-018-9638-5>, a longitudinal extension of the log-linear cognitive diagnosis model (LCDM) in Henson, Templin & Willse (2009) <doi:10.1007/s11336-008-9089-5>. As the LCDM subsumes many other diagnostic classification models (DCMs), many other DCMs can be estimated longitudinally via the TDCM. The TDCM package includes functions to estimate the single-group and multigroup TDCM, summarize results of interest including item parameters, growth proportions, transition probabilities, transitional reliability, attribute correlations, model fit, and growth plots.
This package provides an up-to-date copy of the Internet Assigned Numbers Authority (IANA) Time Zone Database. It is updated periodically to reflect changes made by political bodies to time zone boundaries, UTC offsets, and daylight saving time rules. Additionally, this package provides a C++ interface for working with the date library. date provides comprehensive support for working with dates and date-times, which this package exposes to make it easier for other R packages to utilize. Headers are provided for calendar specific calculations, along with a limited interface for time zone manipulations.
This package provides methods to detect differential item functioning (DIF) in dichotomous and polytomous items, using both classical and modern approaches. These include Mantel-Haenszel procedures, logistic regression (including ordinal models), and regularization-based methods such as LASSO. Uniform and non-uniform DIF effects can be detected, and some methods support multiple focal groups. The package also provides tools for anchor purification, rest score matching, effect size estimation, and DIF simulation. See Magis, Beland, Tuerlinckx, and De Boeck (2010, Behavior Research Methods, 42, 847–862, <doi:10.3758/BRM.42.3.847>) for a general overview.
Evidence of Absence software (EoA) is a user-friendly application for estimating bird and bat fatalities at wind farms and designing search protocols. The software is particularly useful in addressing whether the number of fatalities has exceeded a given threshold and what search parameters are needed to give assurance that thresholds were not exceeded. The models are applicable even when zero carcasses have been found in searches, following Huso et al. (2015) <doi:10.1890/14-0764.1>, Dalthorp et al. (2017) <doi:10.3133/ds1055>, and Dalthorp and Huso (2015) <doi:10.3133/ofr20151227>.
Model fitting and species biotic interaction network topology selection for explicit interaction community models. Explicit interaction community models are an extension of binomial linear models for joint modelling of species communities, that incorporate both the effects of species biotic interactions and the effects of missing covariates. Species interactions are modelled as direct effects of each species on each of the others, and are estimated alongside the effects of missing covariates, modelled as latent factors. The package includes a penalized maximum likelihood fitting function, and a genetic algorithm for selecting the most parsimonious species interaction network topology.
Mining informative genes with certain biological meanings is important for clinical diagnosis of disease and discovery of disease mechanisms in plants and animals. This process involves identification of relevant genes and removal of redundant genes as much as possible from a whole gene set. This package selects the informative genes related to a specific trait using a gene expression dataset. These trait specific genes are considered as informative genes. This package returns the informative gene set from the high dimensional gene expression data using a combination of methods SVM and MRMR (for feature selection) with a bootstrapping procedure.
This package implements the Self-Similarity Test for Normality (SSTN), a new statistical test designed to assess whether a given sample originates from a normal distribution. The procedure is based on iteratively estimating the characteristic function of the sum of standardized i.i.d. random variables and comparing it to the characteristic function of the standard normal distribution. A Monte Carlo procedure is used to determine the empirical distribution of the test statistic under the null hypothesis. Details of the methodology are described in Anarat and Schwender (2025), "A normality test based on self-similarity" (Submitted).
An easy to use implementation of routine structural missing data diagnostics with functions to visualize the proportions of missing observations, investigate missing data patterns and conduct various empirical missing data diagnostic tests. Reference: Weberpals J, Raman SR, Shaw PA, Lee H, Hammill BG, Toh S, Connolly JG, Dandreo KJ, Tian F, Liu W, Li J, Hernández-Muñoz JJ, Glynn RJ, Desai RJ. smdi: an R package to perform structural missing data investigations on partially observed confounders in real-world evidence studies. JAMIA Open. 2024 Jan 31;7(1):ooae008. <doi:10.1093/jamiaopen/ooae008>.
The sdrt() function is designed for estimating subspaces for Sufficient Dimension Reduction (SDR) in time series, with a specific focus on the Time Series Central Mean subspace (TS-CMS). The package employs the Fourier transformation method proposed by Samadi and De Alwis (2023) <doi:10.48550/arXiv.2312.02110> and the Nadaraya-Watson kernel smoother method proposed by Park et al. (2009) <doi:10.1198/jcgs.2009.08076> for estimating the TS-CMS. The package provides tools for estimating distances between subspaces and includes functions for selecting model parameters using the Fourier transformation method.
Maps species richness and endemism based on stacked species distribution models (SSDM). Individual SDMs can be created using a single or multiple algorithms (ensemble SDMs). For each species, an SDM can yield a habitat suitability map, a binary map, a between-algorithm variance map, and can assess variable importance, algorithm accuracy, and between-algorithm correlation. Methods to stack individual SDMs include summing individual probabilities and thresholding then summing. Thresholding can be based on a specific evaluation metric or by drawing repeatedly from a Bernoulli distribution. The SSDM package also provides a user-friendly interface.
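The stacking options named above can be sketched in Python on toy data. This is an illustration only, not the SSDM package's interface: the function names are hypothetical, each species map is assumed to be a flat list of per-cell suitability probabilities, and averaging the Bernoulli draws is an assumption of this sketch.

```python
import random


def stack_by_summing(prob_maps):
    """Richness per cell as the sum of per-species probabilities."""
    return [sum(cell) for cell in zip(*prob_maps)]


def stack_by_threshold(prob_maps, thresholds):
    """Binarize each species map with its own threshold, then sum."""
    return [
        sum(1 for p, t in zip(cell, thresholds) if p >= t)
        for cell in zip(*prob_maps)
    ]


def stack_by_bernoulli(prob_maps, n_draws, rng):
    """Draw presence/absence repeatedly from Bernoulli(p) per species and
    cell, then average the summed presences over the draws (assumption)."""
    totals = [0.0] * len(prob_maps[0])
    for _ in range(n_draws):
        for species_map in prob_maps:
            for i, p in enumerate(species_map):
                if rng.random() < p:
                    totals[i] += 1
    return [t / n_draws for t in totals]


# Two species, two cells (hypothetical values).
maps = [[0.9, 0.2], [0.6, 0.4]]
summed = stack_by_summing(maps)
binary = stack_by_threshold(maps, [0.5, 0.5])
drawn = stack_by_bernoulli([[1.0, 0.0]], 10, random.Random(0))
```

Summing probabilities gives fractional richness, whereas thresholding first gives integer species counts per cell; the two can differ noticeably when many species have intermediate suitability.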
The R package CTSV implements the CTSV approach developed by Jinge Yu and Xiangyu Luo that detects cell-type-specific spatially variable genes accounting for excess zeros. CTSV directly models sparse raw count data through a zero-inflated negative binomial regression model, incorporates cell-type proportions, and performs hypothesis testing based on R package pscl. The package outputs p-values and q-values for genes in each cell type, and CTSV is scalable to datasets with tens of thousands of genes measured on hundreds of spots. CTSV can be installed in Windows, Linux, and Mac OS.
This package provides an interface to the Copernicus Data Space Ecosystem API <https://dataspace.copernicus.eu/analyse/apis>, mainly for searching the catalog of available data from Copernicus Sentinel missions and obtaining the images for just the area of interest based on selected spectral bands. The package uses the Sentinel Hub REST API interface <https://dataspace.copernicus.eu/analyse/apis/sentinel-hub> that provides access to various satellite imagery archives. It allows you to access raw satellite data, rendered images, statistical analysis, and other features. This package is in no way officially related to or endorsed by Copernicus.