Genomic coordinates of CTCF binding sites, with strand orientation (directionality of binding). Position weight matrices (PWMs) from JASPAR, HOCOMOCO, CIS-BP, CTCFBSDB, SwissRegulon, and Jolma 2013 were used to uniformly predict CTCF binding sites with FIMO (default settings) on human (hg18, hg19, hg38, T2T) and mouse (mm9, mm10, mm39) genome assemblies. Extra columns include the motif/PWM name (e.g., MA0139.1), score, p-value, q-value, and motif sequence. It is recommended to filter FIMO-predicted sites at a p-value threshold of 1e-6 instead of the default 1e-4. Experimentally obtained CTCF-bound cis-regulatory elements from ENCODE SCREEN and predicted CTCF sites from CTCFBSDB are also included. Selected data are lifted over from a different genome assembly, as we demonstrated that liftOver is a viable option for obtaining CTCF coordinates in other genome assemblies. CTCF sites obtained using JASPAR's MA0139.1 PWM and filtered at the 1e-6 p-value threshold are recommended.
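As a quick illustration, the recommended filtering can be applied to the distributed GRanges objects; a minimal sketch, assuming retrieval through AnnotationHub and a pvalue metadata column on the raw scale (check the object's documentation, as some releases store -log10-transformed p-values, in which case the condition becomes pvalue > 6):

    library(AnnotationHub)
    ah <- AnnotationHub()
    # Search for CTCF sites predicted on hg38 with the JASPAR MA0139.1 PWM
    hits <- query(ah, c("CTCF", "hg38", "MA0139.1"))
    ctcf <- hits[[1]]                        # a GRanges of predicted sites
    # Keep only sites passing the recommended 1e-6 threshold
    ctcf_strict <- ctcf[ctcf$pvalue < 1e-6]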
Implementations of an estimator for the multivariate regression association measure (MRAM) proposed in Shih and Chen (2025) <in revision> and its associated variable selection algorithm. The MRAM quantifies the predictability of a random vector Y from a random vector X given a random vector Z. It takes the maximum value 1 if and only if Y is almost surely a measurable function of X and Z, and the minimum value 0 if Y is conditionally independent of X given Z. The MRAM generalizes Kendall's tau copula correlation ratio proposed in Shih and Emura (2021) <doi:10.1016/j.jmva.2020.104708> by employing the spatial sign function. The estimator is based on the nearest neighbor method, and the associated variable selection algorithm is adapted from the feature ordering by conditional independence (FOCI) algorithm of Azadkia and Chatterjee (2021) <doi:10.1214/21-AOS2073>. For further details, see Shih and Chen (2025) <in revision>.
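In symbols, writing $T(Y, X \mid Z)$ for the measure, the two boundary properties above read (a schematic restatement, not the paper's notation):

    T(Y, X \mid Z) = 1 \iff Y = f(X, Z)\ \text{a.s. for some measurable } f,
    T(Y, X \mid Z) = 0 \iff Y \perp\!\!\!\perp X \mid Z.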
A sizable genomics study such as a microarray experiment often involves multiple batches (groups) of experiments due to practical complications. To minimize batch effects, a careful experimental design should ensure the even distribution of biological groups and confounding factors across batches. OSAT (Optimal Sample Assignment Tool) facilitates the allocation of collected samples to different batches. With a minimum of steps, it produces a setup that optimizes the even distribution of samples in groups of biological interest across batches, reducing the confounding or correlation between batches and the biological variables of interest. It can also optimize the even distribution of confounding factors across batches. Our tool can handle challenging instances involving incomplete and unbalanced sample collections as well as the ideal balanced randomized complete block design (RCBD). OSAT provides a number of predefined layouts for some of the most commonly used genomics platforms. The related paper can be found at http://www.biomedcentral.com/1471-2164/13/689 .
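A sketch of the intended workflow, following the package vignette from memory (function and layout names should be treated as assumptions and checked against the current documentation):

    library(OSAT)
    # pheno: data.frame of samples with the variables to balance across batches
    gs <- setup.sample(pheno, optimal = c("Group", "Sex"))
    gc <- setup.container(IlluminaBeadChip96Plate, 6, batch = "plates")
    gSetup <- create.optimized.setup(sample = gs, container = gc)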
Kevin Dowd's book Measuring Market Risk is widely read in the area of risk measurement by students and practitioners alike. As he notes, MATLAB may well have been the most suitable language when he originally wrote the functions, but with the growing popularity of R that is no longer entirely true. As Dowd's code was not intended to be error-free and was mainly for reference, some functions in this package have inherited those errors. An attempt will be made in future releases to identify and correct them. Dowd's original code can be downloaded from www.kevindowd.org/measuring-market-risk/. Note that Dowd offers both the MMR1 and MMR2 toolboxes; only MMR2, the more recent of the two with largely similar functionality, was ported to R. The toolbox mainly contains various parametric and non-parametric methods for measuring market risk, as well as backtesting of risk measurement methods.
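For example, historical-simulation value-at-risk, one of the non-parametric methods ported from MMR2 (signature from memory; treat it as an assumption and consult the help page):

    library(Dowd)
    returns <- rnorm(1000, mean = 0, sd = 0.01)  # simulated daily P/L data
    HSVaR(returns, 0.95)  # 95% historical simulation VaR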
Obtaining a relevant set of trait-specific genes from gene expression data is important for the clinical diagnosis of disease and the discovery of disease mechanisms in plants and animals. This process involves identifying relevant genes and removing as many redundant genes as possible from the whole gene set. This package returns the trait-specific gene set from high-dimensional RNA-seq count data by applying a combination of two conventional machine learning algorithms: the support vector machine (SVM) and the genetic algorithm (GA). The GA controls and optimizes the subset of genes sent to the SVM for classification and evaluation: it repeatedly evaluates candidate gene subsets by cross-validation and selects the best, with a fitness function derived from SVM classification performance. Using the SVM as the classifier and the GA for feature selection, a trait-specific gene set is obtained; see the sketch below.
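The GA-over-SVM loop can be illustrated with generic tools; the sketch below uses the GA and e1071 packages (stand-ins, not necessarily what this package uses internally), a binary chromosome marking which genes enter the SVM, and cross-validated accuracy as the fitness:

    library(GA)     # genetic algorithm
    library(e1071)  # support vector machine
    # X: samples x genes expression matrix; y: factor of trait labels
    fitness_fn <- function(chrom, X, y) {
      if (sum(chrom) == 0) return(0)          # an empty gene set scores 0
      Xsub <- X[, chrom == 1, drop = FALSE]
      fit <- svm(Xsub, y, cross = 5)          # 5-fold cross-validated SVM
      fit$tot.accuracy                        # CV accuracy is the fitness
    }
    res <- ga(type = "binary", fitness = fitness_fn, X = X, y = y,
              nBits = ncol(X), popSize = 50, maxiter = 100)
    selected_genes <- colnames(X)[res@solution[1, ] == 1]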
Gives some hypothesis test functions (sign test, median and other quantile tests, Wilcoxon signed rank test, coefficient of variation test, test of normal variance, test on weighted sums of Poisson [see Fay and Kim <doi:10.1002/bimj.201600111>], sample size for t-tests with different variances and non-equal n per arm, Behrens-Fisher test, nonparametric ABC intervals, Wilcoxon-Mann-Whitney test [with effect estimates and confidence intervals, see Fay and Malinovsky <doi:10.1002/sim.7890>], two-sample melding tests [see Fay, Proschan, and Brittain <doi:10.1111/biom.12231>], one-way ANOVA allowing var.equal=FALSE [see Brown and Forsythe, 1974, Biometrics], and prevalence confidence intervals that adjust for sensitivity and specificity [see Lang and Reiczigel, 2014 <doi:10.1016/j.prevetmed.2013.09.015> or Bayer, Fay, and Graubard, 2023 <doi:10.48550/arXiv.2205.13494>]). The focus is on hypothesis tests that have compatible confidence intervals, but some functions only provide confidence intervals (e.g., prevSeSp).
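For example, the Behrens-Fisher and Wilcoxon-Mann-Whitney tests are exposed roughly as follows (a minimal sketch; see the help pages for the full argument lists):

    library(asht)
    x <- rnorm(12);  y <- rnorm(15, mean = 1, sd = 2)
    bfTest(x, y)   # Behrens-Fisher test for unequal variances
    wmwTest(x, y)  # Wilcoxon-Mann-Whitney with effect estimate and CI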
This package provides tools for interacting with the Geographic Name Resolution Service ('GNRS') API <https://github.com/ojalaquellueva/gnrs> and associated functionality. The GNRS is a batch application for resolving and standardizing political division names against standard names in the GeoNames database <http://www.geonames.org/>. The GNRS resolves political division names at three levels: country, state/province, and county/parish. Resolution is performed in a series of steps, beginning with direct matching to standard names, followed by direct matching to alternate names in different languages, followed by direct matching to standard codes (such as ISO and FIPS codes). If direct matching fails, the GNRS attempts to match to standard and then alternate names using fuzzy matching, but it does not perform fuzzy matching of political division codes. The GNRS works down the political division hierarchy, stopping at the current level if all matches fail; in other words, if a country cannot be matched, the GNRS does not attempt to match state or county.
This package provides a set of tools for fitting Markov-modulated linear regression, where the responses Y(t) are time-additive and the model operates in an external environment described as a continuous-time Markov chain with finite state space. The model was proposed by Alexander Andronov (2012) <arXiv:1901.09600v1>, and the parameter estimation algorithm is based on eigenvalue and eigenvector decomposition. Markov-switching regression models share the idea of varying the regression parameters randomly according to an external environment. The difference is that in the Markov-modulated linear regression model the external environment is a continuous-time homogeneous irreducible Markov chain with known parameters, while switching models treat the Markov chain as unobserved, so their estimation procedure also involves estimating the transition matrix; the two model families differ significantly in their analytical approach. The package also provides data simulation tools for Markov-modulated linear regression (for academic/research purposes). Research project No. 1.1.1.2/VIAA/1/16/075.
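Schematically (our paraphrase of the time-additive structure, not the package's notation), if $t_j$ denotes the total time the environment spends in state $j$ up to time $t$, the response can be written as

    Y(t) = \sum_{j=1}^{s} x^{\top} \beta_j \, t_j + Z(t), \qquad \sum_{j=1}^{s} t_j = t,

where $\beta_j$ is the regression coefficient vector attached to state $j$ and $Z(t)$ is a noise term.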
This project aims to enable the method of Path Analysis (PA) to infer causality from data. To this end we propose a hybrid approach that uses Bayesian network structure learning algorithms on the data to create the input for building a PA model. The process is performed in a semi-automatic way by our intermediate algorithm, allowing novice researchers to create and evaluate their own PA models from a data set; see the sketch below. The references used for this project are: Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press. <doi:10.1017/S0269888910000275>. Nagarajan, R., Scutari, M., & Lèbre, S. (2013). Bayesian Networks in R. Springer. <doi:10.1007/978-1-4614-6446-4>. Scutari, M., & Denis, J. B. (2014). Bayesian Networks: with Examples in R. Chapman and Hall/CRC. <doi:10.1201/b17065>. Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. <doi:10.18637/jss.v048.i02>.
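The hybrid idea (learn a DAG with a Bayesian network structure learning algorithm, then evaluate it as a path model) can be sketched directly with bnlearn and lavaan; this illustrates the approach, not this package's internal code, and mydata is a placeholder data frame of numeric variables:

    library(bnlearn)
    library(lavaan)
    dag <- hc(mydata)                   # score-based structure learning
    arcs_tab <- arcs(dag)               # matrix with "from"/"to" columns
    # Translate each child and its parents into a lavaan regression line
    model <- paste(
      vapply(unique(arcs_tab[, "to"]), function(ch) {
        pars <- arcs_tab[arcs_tab[, "to"] == ch, "from"]
        paste(ch, "~", paste(pars, collapse = " + "))
      }, character(1)),
      collapse = "\n")
    fit <- sem(model, data = mydata)    # fit and evaluate the PA model
    summary(fit, fit.measures = TRUE)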
Holistic Multimodel Domain Analysis (HMDA) is a robust and transparent framework designed for exploratory machine learning research, aiming to enhance the process of feature assessment and selection. HMDA addresses key limitations of traditional machine learning methods by evaluating the consistency across multiple high-performing models within a fine-tuned modeling grid, thereby improving the interpretability and reliability of feature importance assessments. Specifically, it computes Weighted Mean SHapley Additive exPlanations (WMSHAP), which aggregate feature contributions from multiple models based on weighted performance metrics. HMDA also provides confidence intervals to demonstrate the stability of these feature importance estimates. This framework is particularly beneficial for analyzing complex, multidimensional datasets common in health research, supporting reliable exploration of mental health outcomes such as suicidal ideation, suicide attempts, and other psychological conditions. Additionally, HMDA includes automated procedures for feature selection based on WMSHAP ratios and performs dimension reduction analyses to identify underlying structures among features. For more details see Haghish (2025) <doi:10.13140/RG.2.2.32473.63846>.
This package provides a set of models to estimate nonlinear longitudinal data using Bayesian estimation methods. These models include: 1) the Bayesian Piecewise Random Effects Model (Bayes_PREM()), which estimates a piecewise random effects (mixture) model for a given number of latent classes and a latent number of possible changepoints in each class, and can incorporate class- and outcome-predictive covariates (see Lamm (2022) <https://hdl.handle.net/11299/252533> and Lock et al. (2018) <doi:10.1007/s11336-017-9594-5>); 2) the Bayesian Crossed Random Effects Model (Bayes_CREM()), which estimates a linear, quadratic, exponential, or piecewise crossed random effects model where individuals change groups over time (e.g., students and schools; see Rohloff et al. (2024) <doi:10.1111/bmsp.12334>); and 3) the Bayesian Bivariate Piecewise Random Effects Model (Bayes_BPREM()), which estimates a bivariate piecewise random effects model to jointly model two related outcomes (e.g., reading and math achievement; see Peralta et al. (2022) <doi:10.1037/met0000358>).
This package provides nonparametric CUSUM tests for detecting changes in possibly serially dependent univariate or low-dimensional multivariate observations. Retrospective tests sensitive to changes in the expectation, the variance, the covariance, the autocovariance, the distribution function, Spearman's rho, Kendall's tau, Gini's mean difference, and the copula are provided, as well as a test for detecting changes in the distribution of independent block maxima (with environmental studies in mind). The package also contains a test sensitive to changes in the autocopula and a combined test of stationarity sensitive to changes in the distribution function and the autocopula. The latest additions are an open-end sequential test based on the retrospective CUSUM statistic that can be used for monitoring changes in the mean of possibly serially dependent univariate observations, as well as closed-end and open-end sequential tests based on empirical distribution functions that can be used for monitoring changes in the contemporary distribution of possibly serially dependent univariate or low-dimensional multivariate observations.
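For instance, a retrospective test for a change in the distribution function (a minimal sketch; the function name is from the package's documented interface, but check the index for the full set of tests):

    library(npcp)
    set.seed(1)
    x <- c(rnorm(60), rnorm(60, mean = 1))   # mean shift halfway through
    cpDist(as.matrix(x))  # CUSUM test based on empirical distribution functions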
Ordination comprises several multivariate exploratory and explanatory techniques with theoretical foundations in geometric data analysis; see Podani (2000, ISBN:90-5782-067-6) for techniques and applications and Le Roux & Rouanet (2005) <doi:10.1007/1-4020-2236-0> for foundations. Greenacre (2010, ISBN:978-84-923846) shows how the most established of these, including principal components analysis, correspondence analysis, multidimensional scaling, factor analysis, and discriminant analysis, rely on eigen-decompositions or singular value decompositions of pre-processed numeric matrix data. These decompositions give rise to a set of shared coordinates along which the row and column elements can be measured. The overlay of their scatterplots on these axes, introduced by Gabriel (1971) <doi:10.1093/biomet/58.3.453>, is called a biplot. ordr provides inspection, extraction, manipulation, and visualization tools for several popular ordination classes supported by a set of recovery methods. It is inspired by and designed to integrate into Tidyverse workflows provided by Wickham et al (2019) <doi:10.21105/joss.01686>.
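Concretely, the shared coordinates arise from a singular value decomposition of the pre-processed matrix (standard biplot algebra):

    X = U D V^{\top}, \qquad \text{row scores } U D^{\alpha}, \quad \text{column scores } V D^{1-\alpha}, \quad \alpha \in [0, 1],

so that inner products of row and column scores recover the entries of $X$.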
Institutional performance assessment remains a key challenge for a multitude of stakeholders. Existing indicators such as h-type and g-type indicators, among many others, do not reflect the expertise of institutions that defines their research portfolios. The package offers functionality to compute and visualise two novel indices: the x-index and the xd-index. The x-index evaluates an institution's scholarly expertise within a specific discipline or field, while the xd-index provides a broader assessment of overall scholarly expertise considering an institution's publication pattern and strengths across coarse thematic areas. These indices offer a nuanced understanding of institutional research capabilities, aiding stakeholders in research management and resource allocation decisions. Lathabai, H.H., Nandy, A., and Singh, V.K. (2021) <doi:10.1007/s11192-021-04188-3>. Nandy, A., Lathabai, H.H., and Singh, V.K. (2023) <doi:10.5281/zenodo.8305585>. The package provides the h, g, x, and xd indices for use with datasets scraped from Web of Science (WoS) in its standard format.
The President of the United States is constitutionally obligated to provide a report known as the 'State of the Union'. The report summarizes the current challenges facing the country and the president's upcoming legislative agenda. While historically the State of the Union was often a written document, in recent decades it has always taken the form of an oral address to a joint session of the United States Congress. This package provides the raw text from every such address with the intention of being used for meaningful examples of text analysis in R. The corpus is well suited to the task as it is historically important, includes material intended to be read and material intended to be spoken, and falls in the public domain. As the corpus spans over two centuries, it is also a good test of how well various methods hold up to the idiosyncrasies of historical texts. Associated data about each address, such as the year, president, party, and format, are also included.
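The corpus and its metadata ship as two objects, sotu_text and sotu_meta, so a first look takes one line each:

    library(sotu)
    str(sotu_meta)                # year, president, party, delivery format
    substr(sotu_text[1], 1, 200)  # opening of the first address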
The Stratified-Petersen Analysis System (SPAS) is designed to estimate abundance in two-sample capture-recapture experiments where the captures and recaptures are stratified. This is a generalization of the simple Lincoln-Petersen estimator. Strata may be defined in time or in space or both, and the s strata in which marking takes place may differ from the t strata in which recoveries take place. When s=t, SPAS reduces to the method described by Darroch (1961) <doi:10.2307/2332748>. When s<t, SPAS implements the methods described in Plante, Rivest, and Tremblay (1988) <doi:10.2307/2533994>. Schwarz and Taylor (1998) <doi:10.1139/f97-238> describe the use of SPAS in estimating return of salmon stratified by time and geography. A related package, BTSPAS, deals with temporal stratification where a spline is used to model the distribution of the population over time as it passes the second capture location. This is the R version of the (now obsolete) standalone Windows program of the same name.
This package implements the novel testing approach by Janitza et al. (2015) <http://nbn-resolving.de/urn/resolver.pl?urn=nbn:de:bvb:19-epub-25587-4> for the permutation variable importance measure in a random forest and the PIMP-algorithm by Altmann et al. (2010) <doi:10.1093/bioinformatics/btq134>. Janitza et al. (2015) do not use the "standard" permutation variable importance but the cross-validated permutation variable importance for the novel test approach. The cross-validated permutation variable importance is not based on the out-of-bag observations but uses a similar strategy inspired by the cross-validation procedure. The novel test approach can be applied for classification trees as well as for regression trees. However, the use of the novel testing approach has not been tested for regression trees so far, so this routine is meant for the expert user only and its current state is rather experimental.
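A hedged sketch of the PIMP workflow (argument names from memory; check the package manual):

    library(vita)
    library(randomForest)
    # X: predictor data.frame, y: response
    rf <- randomForest(X, y, importance = TRUE)
    pimp <- PIMP(X, y, rf, S = 100)  # recompute importances over S permutations of y
    PimpTest(pimp)                   # p-values for the permutation importances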
Haplotype calling from phased marker data. Given user-defined haplotype blocks (HapBlock), the package identifies the different haplotype alleles (HapAllele) present in the data and scores sample haplotype allele genotypes (HapGenotype) based on HapAllele dose (i.e. 0, 1 or 2 copies). The output is not only useful for analyses that can handle multi-allelic markers, but is also conveniently formatted for existing pipelines intended for bi-allelic markers. The package was first described in Bioinformatics by Utsunomiya et al. (2016, <doi:10.1093/bioinformatics/btw356>). Since the v2 release, the package provides functions for unsupervised and supervised detection of ancestry tracks. The methods implemented in these functions were described in an article published in Methods in Ecology and Evolution by Utsunomiya et al. (2020, <doi:10.1111/2041-210X.13467>). The source code for v3 was modified for improved performance and inclusion of new functionality, including analysis of unphased data, runs of homozygosity, sampling methods for virtual gamete mating, mixed model fitting and GWAS.
This package provides functions similar to the SAS macros previously provided to accompany Collins, Dziak, and Li (2009) <DOI:10.1037/a0015826> and Dziak, Nahum-Shani, and Collins (2012) <DOI:10.1037/a0026972>, papers which outline practical benefits and challenges of factorial and fractional factorial experiments for scientists interested in developing biological and/or behavioral interventions, especially in the context of the multiphase optimization strategy (see Collins, Kugler & Gwadz 2016) <DOI:10.1007/s10461-015-1145-4>. The package currently contains three functions. First, RelativeCosts1() draws a graph of the relative cost of complete and reduced factorial designs versus other alternatives. Second, RandomAssignmentGenerator() returns a dataframe which contains a list of random numbers that can be used to conveniently assign participants to conditions in an experiment with many conditions. Third, FactorialPowerPlan() estimates the power, detectable effect size, or required sample size of a factorial or fractional factorial experiment, for main effects or interactions, given several possible choices of effect size metric, and allowing pretests and clustering.
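For example, a power calculation for a five-factor experiment might be requested along these lines (argument names are assumptions from memory and should be checked against ?FactorialPowerPlan):

    library(MOST)
    FactorialPowerPlan(alpha = 0.05, nfactors = 5, ntotal = 200,
                       raw_coef = 3, sigma_y = 10)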
This package provides a C++ implementation of the following evolutionary algorithms: Bat Algorithm (Yang, 2010 <doi:10.1007/978-3-642-12538-6_6>), Cuckoo Search (Yang, 2009 <doi:10.1109/nabic.2009.5393690>), Genetic Algorithms (Holland, 1992, ISBN:978-0262581110), Gravitational Search Algorithm (Rashedi et al., 2009 <doi:10.1016/j.ins.2009.03.004>), Grey Wolf Optimization (Mirjalili et al., 2014 <doi:10.1016/j.advengsoft.2013.12.007>), Harmony Search (Geem et al., 2001 <doi:10.1177/003754970107600201>), Improved Harmony Search (Mahdavi et al., 2007 <doi:10.1016/j.amc.2006.11.033>), Moth-flame Optimization (Mirjalili, 2015 <doi:10.1016/j.knosys.2015.07.006>), Particle Swarm Optimization (Kennedy et al., 2001, ISBN:1558605959), Simulated Annealing (Kirkpatrick et al., 1983 <doi:10.1126/science.220.4598.671>), Whale Optimization Algorithm (Mirjalili and Lewis, 2016 <doi:10.1016/j.advengsoft.2016.01.008>). EmiR can be used not only for unconstrained optimization problems, but also in the presence of inequality constraints and with variables restricted to be integers.
Targeted maximum likelihood estimation of point treatment effects (Targeted Maximum Likelihood Learning, The International Journal of Biostatistics, 2(1), 2006). This version automatically estimates the additive treatment effect among the treated (ATT) and among the controls (ATC). The tmle() function calculates the adjusted marginal difference in mean outcome associated with a binary point treatment, for continuous or binary outcomes. Relative risk and odds ratio estimates are also reported for binary outcomes. Missingness in the outcome is allowed, but not in treatment assignment or baseline covariate values. The population mean is calculated when there is missingness but no variation in the treatment assignment. The tmleMSM() function estimates the parameters of a marginal structural model for a binary point treatment effect. Effect estimation stratified by a binary mediating variable is also available. An ID argument can be used to identify repeated measures. Default settings call SuperLearner to estimate the Q and g portions of the likelihood, unless values or a user-supplied regression function are passed in as arguments.
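A minimal call, with Y an outcome vector, A a binary treatment indicator, and W a data frame of baseline covariates (placeholders for the user's data):

    library(tmle)
    result <- tmle(Y = Y, A = A, W = W, family = "binomial")
    summary(result)   # ATE, plus RR and OR for binary outcomes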
The Bayesian optimal interval (BOIN) design is a novel phase I clinical trial design for finding the maximum tolerated dose (MTD). It can be used to design both single-agent and drug-combination trials. The BOIN design is motivated by the top priority and concern of clinicians when testing a new drug, which is to effectively treat patients and minimize the chance of exposing them to subtherapeutic or overly toxic doses. The prominent advantage of the BOIN design is that it achieves simplicity and superior performance at the same time. The BOIN design is algorithm-based and can be implemented in a simple way similar to the traditional 3+3 design. The BOIN design yields an average performance that is comparable to that of the continual reassessment method (CRM, one of the best model-based designs) in terms of selecting the MTD, but has a substantially lower risk of assigning patients to subtherapeutic or overly toxic doses. For a tutorial, see Yan et al. (2020) <doi:10.18637/jss.v094.i13>.
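For instance, the escalation/de-escalation boundaries for a trial targeting a 30% toxicity rate can be tabulated before the trial starts (standard usage; see the cited tutorial):

    library(BOIN)
    # Target toxicity probability 0.30, 10 cohorts of size 3 (30 patients)
    get.boundary(target = 0.3, ncohort = 10, cohortsize = 3)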
This package performs parametric and non-parametric estimation and simulation for multi-state discrete-time semi-Markov processes. For the parametric estimation, several discrete distributions are considered for the sojourn times: Uniform, Geometric, Poisson, Discrete Weibull, and Negative Binomial. The non-parametric estimation concerns the sojourn-time distributions, where no assumptions are made on the shape of the distributions. Moreover, the estimation can be done on the basis of one or several sample paths, with or without censoring at the beginning and/or end of the sample paths. Reliability indicators such as reliability, maintainability, availability, BMP-failure rate, RG-failure rate, mean time to failure, and mean time to repair are available as well. The implemented methods are described in Barbu, V.S., Limnios, N. (2008) <doi:10.1007/978-0-387-73173-5>, Barbu, V.S., Limnios, N. (2008) <doi:10.1080/10485250701261913> and Trevezas, S., Limnios, N. (2011) <doi:10.1080/10485252.2011.555543>. Estimation and simulation of discrete-time k-th order Markov chains are also considered.
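In the discrete-time setting, the process is specified by its semi-Markov kernel (standard notation from the cited references):

    q_{ij}(k) = \mathbb{P}(J_{n+1} = j,\; T_{n+1} - T_n = k \mid J_n = i),

where $J_n$ is the state entered at the $n$-th jump and $T_n$ the corresponding jump time; parametric estimation models the conditional sojourn-time distributions with the discrete laws listed above.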
Count data is prevalent and informative, with widespread application in many fields such as social psychology, personality, and public health. Classical statistical methods for the analysis of count outcomes are commonly variants of the log-linear model, including Poisson regression and Negative Binomial regression. However, a typical problem with count data modeling is inflation, in the sense that the counts accumulate visibly on certain integers. Such inflation distorts the distribution of the observed counts, biasing estimation and inflating error, which can make the classic methods infeasible. Traditional inflated-value selection methods based on histogram inspection, moreover, tend to miss true inflated points and are computationally expensive. We therefore propose a multiple-inflated negative binomial model that handles count data with multiple inflated values and achieves data-driven inflated-value selection. The proposed approach also provides simultaneous identification of important regression predictors for the target count response. More details about the proposed method are described in Li, Y., Wu, M., Wu, M., & Ma, S. (2023) <arXiv:2309.15585>.
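Schematically, a multiple-inflated negative binomial model mixes point masses at the inflated values with a negative binomial component (our paraphrase; see the cited preprint for the exact specification):

    \mathbb{P}(Y = y) = \sum_{k=1}^{K} \pi_k \, \mathbb{1}\{y = d_k\} + \Bigl(1 - \sum_{k=1}^{K} \pi_k\Bigr) \, \mathrm{NB}(y;\, \mu, \theta),

where $d_1, \dots, d_K$ are the data-driven inflated values, $\pi_k$ their extra probability mass, and $\mu$ is linked to the predictors through a log link as in standard negative binomial regression.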