This package provides a set of model-assisted survey estimators and corresponding variance estimators for single-stage, unequal-probability, without-replacement sampling designs. All of the estimators can be written as a generalized regression estimator; the Horvitz-Thompson, ratio, post-stratified, and regression estimators are summarized in Sarndal et al. (1992, ISBN:978-0-387-40620-6). Two of the estimators employ a statistical learning model as the assisting model: the elastic net regression estimator, an extension of the lasso regression estimator given by McConville et al. (2017) <doi:10.1093/jssam/smw041>, and the regression tree estimator described in McConville and Toth (2017) <arXiv:1712.05708>. The variance estimators that approximate the joint inclusion probabilities can be found in Berger and Tille (2009) <doi:10.1016/S0169-7161(08)00002-3>; the bootstrap variance estimator is presented in Mashreghi et al. (2016) <doi:10.1214/16-SS113>.
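A minimal sketch of the workflow, assuming the package's documented horvitzThompson() and greg() interfaces; the argument names below should be verified against the package manual.

    # Horvitz-Thompson and GREG estimates for a single-stage sample;
    # argument names follow the docs but treat them as assumptions.
    library(mase)
    set.seed(1)
    N  <- 1000                          # population size
    x  <- rnorm(N)                      # auxiliary variable, known for the population
    y  <- 2 + 3 * x + rnorm(N)          # study variable
    pi <- rep(100 / N, N)               # first-order inclusion probabilities
    s  <- sample(N, 100)                # without-replacement sample
    horvitzThompson(y = y[s], pi = pi[s])            # design-based HT estimate
    greg(y = y[s], xsample = data.frame(x = x[s]),   # GREG with a linear
         xpop = data.frame(x = x), pi = pi[s])       # assisting model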
The general principle relies on calculating the cumulative signal of nascent RNA sequencing over the gene body of any given gene or transcription unit. tepr can identify transcription attenuation sites by comparing each gene's profile to a null model that assumes uniform read density over the entirety of the transcription unit. It can also identify increased or diminished transcription attenuation by comparing two conditions. Besides rigorous statistical testing and high sensitivity, a major feature of tepr is its ability to provide the elongation pattern of each individual gene, including the position of the main attenuation point when such a phenomenon occurs. Using tepr, users can visualize and refine genome-wide aggregated analyses of elongation patterns to robustly identify effects specific to subsets of genes. These metrics are suitable for internal comparisons (between genes in each condition), for studying elongation of the same gene in different conditions, and for comparing a gene's elongation to a perfect theoretical uniform elongation.
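An illustration of the principle in base R (not tepr's API): the observed cumulative signal is compared to the uniform-elongation null, and the maximal deviation marks a candidate attenuation point.

    # Simulated gene body whose read density drops partway through.
    set.seed(1)
    n      <- 500                                # positions along the gene body
    signal <- c(rexp(200, 1/5), rexp(300, 1))    # density drop -> attenuation
    obs    <- cumsum(signal) / sum(signal)       # observed cumulative profile
    null   <- seq_len(n) / n                     # uniform-elongation null model
    max(abs(obs - null))                         # KS-type deviation statistic
    which.max(abs(obs - null))                   # candidate attenuation position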
Spline regression, generalized additive models, and component-wise gradient boosting utilizing geometrically designed (GeD) splines. GeDS regression is a non-parametric method, inspired by geometric principles, for fitting spline regression models with variable knots in one or two independent variables. It efficiently estimates the number of knots and their positions, as well as the spline order, assuming the response variable follows a distribution from the exponential family. GeDS models fall within the broader category of generalized (non-)linear models, offering a flexible approach to modeling complex relationships. A description of the method can be found in Kaishev et al. (2016) <doi:10.1007/s00180-015-0621-7> and Dimitrova et al. (2023) <doi:10.1016/j.amc.2022.127493>. Further extending its capabilities, the GeDS implementation includes generalized additive models (GAM) and functional gradient boosting (FGB), enabling versatile multivariate predictor modeling, as discussed in the forthcoming work of Dimitrova et al. (2025).
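A minimal fit, assuming the package's documented NGeDS() call with its f() formula wrapper; verify the arguments and accessor names against ?NGeDS.

    # Variable-knot spline fit to a noisy sine curve.
    library(GeDS)
    set.seed(1)
    x <- seq(0, 1, length.out = 200)
    y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)
    fit <- NGeDS(y ~ f(x), data = data.frame(x, y))  # knots chosen by GeDS
    knots(fit)    # estimated knot locations (accessor name assumed)
    plot(fit)     # fitted spline over the data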
Provides the core functionality to transform longitudinal data to complex-time (kime) data using analytic and numerical techniques, visualize the original time-series and reconstructed kime-surfaces, and perform model-based (e.g., tensor-linear regression) and model-free classification and clustering methods described in the book Dinov, ID and Velev, MV. (2021) "Data Science: Time Complexity, Inferential Uncertainty, and Spacekime Analytics", De Gruyter STEM Series, ISBN 978-3-11-069780-3, <https://www.degruyter.com/view/title/576646>. The package includes 18 core functions, which can be separated into three groups: 1) plot longitudinal data, such as functional magnetic resonance imaging (fMRI) time-series, and forecast or transform the time-series data; 2) simulate real-valued time-series data, e.g., fMRI time-courses, detect the activated areas, report the corresponding p-values, and visualize the p-values in the 3D brain space; 3) perform Laplace transform and kime-surface reconstructions of the fMRI data.
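An illustration of the kime idea in base R (not the package's API): complex time kime = t * exp(i*phi) indexes repeated measurements at time t by a random phase phi, so a longitudinal series becomes the support of a kime-surface.

    set.seed(1)
    t    <- 1:10                                  # observed time points
    phi  <- runif(5, -pi, pi)                     # random phase per repeated run
    kime <- outer(t, phi, function(t, p) t * exp(1i * p))
    dim(kime)   # 10 time points x 5 phase draws -> kime-surface support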
Dynamic CUR (dCUR) extends the CUR decomposition (Mahoney MW., Drineas P. (2009) <doi:10.1073/pnas.0803205106>) by varying k, the number of columns and rows used; its ultimate purpose is to help find the stage that minimizes the relative error when reducing the matrix dimension. The goal of CUR decomposition is to give a better interpretation of a matrix factorization by employing proper variable selection in the data matrix, in a way that yields a simplified structure. Its origins come from analysis in genetics. The goal of this package is to offer an alternative way of selecting variables (columns) or individuals (rows). The proposed idea consists of fitting probability distributions to the leverage scores and selecting the columns and rows that minimize the reconstruction error of the matrix approximation ||A-CUR||. It also includes a method that recalibrates the relative importance of the leverage scores according to an external variable of the user's interest.
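A base-R illustration of leverage-score CUR in the spirit of Mahoney & Drineas (2009), not the dCUR API: columns and rows are sampled with probability proportional to their leverage scores and the relative error ||A-CUR|| is evaluated.

    set.seed(1)
    A  <- matrix(rnorm(100 * 20), 100, 20)
    k  <- 5
    sv <- svd(A, nu = k, nv = k)
    lev_col <- rowSums(sv$v^2) / k               # column leverage scores
    lev_row <- rowSums(sv$u^2) / k               # row leverage scores
    jc <- sample(ncol(A), k, prob = lev_col)     # sample columns by leverage
    ir <- sample(nrow(A), k, prob = lev_row)     # sample rows by leverage
    C  <- A[, jc]; R <- A[ir, ]
    U  <- MASS::ginv(C) %*% A %*% MASS::ginv(R)  # Moore-Penrose pseudoinverses
    norm(A - C %*% U %*% R, "F") / norm(A, "F")  # relative reconstruction error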
GPU/CPU benchmarking on Debian-package based systems. This package benchmarks the performance of a few standard linear algebra operations (such as a matrix product and QR, SVD, and LU decompositions) across a number of different BLAS libraries as well as a GPU implementation. To do so, it takes advantage of the ability to plug and play different BLAS implementations easily on a Debian and/or Ubuntu system. The current version supports:
- reference BLAS ('refblas'), which is un-accelerated, as a baseline;
- Atlas, which is tuned but typically configured single-threaded;
- Atlas39, which is tuned and configured for multi-threaded mode;
- GotoBLAS, which is accelerated and multi-threaded;
- Intel MKL, a commercial accelerated and multi-threaded version.
For GPU computing, we use the CRAN package gputools. For GotoBLAS, the gotoblas2-helper script from the ISM in Tokyo can be used. For Intel MKL we use the Revolution R packages from Ubuntu 9.10.
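What the benchmark measures, in miniature and in base R: time the same operations under whichever BLAS is currently plugged in (on Debian/Ubuntu, implementations are swapped via update-alternatives), then compare timings across libraries.

    n <- 1000
    A <- matrix(rnorm(n * n), n, n)
    system.time(crossprod(A))   # matrix product
    system.time(qr(A))          # QR decomposition
    system.time(svd(A))         # singular value decomposition
    system.time(solve(A))       # LU-based solve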
Helper functions to implement univariate and bivariate latent change score models in R using the lavaan package. For details about latent change score modeling (LCSM) see McArdle (2009) <doi:10.1146/annurev.psych.60.110707.163612> and Grimm, An, McArdle, Zonderman and Resnick (2012) <doi:10.1080/10705511.2012.659627>. The package automatically generates lavaan syntax for different model specifications and varying numbers of time points. The lavaan syntax generated by this package can be returned, and further specifications can be added manually. Longitudinal plots as well as simplified path diagrams can be created to visualise data and model specifications. Estimated model parameters and fit statistics can be extracted as data frames. Data for different univariate and bivariate LCSMs can be simulated by specifying estimates for model parameters to explore their effects. This package combines the strengths of other R packages like lavaan, broom, and semPlot by generating lavaan syntax that helps these packages work together.
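A hedged sketch of a univariate LCSM fit; fit_uni_lcsm(), the bundled example data, and extract_fit() follow the package vignette, but verify names and arguments with ?fit_uni_lcsm.

    library(lcsm)
    fit <- fit_uni_lcsm(
      data  = data_uni_lcsm,                    # example data shipped with lcsm (assumed)
      var   = c("x1", "x2", "x3", "x4", "x5"),  # repeated measures
      model = list(alpha_constant = TRUE,       # constant change factor
                   beta = TRUE)                 # proportional change
    )
    extract_fit(fit)   # fit statistics as a data frame (helper name assumed)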
Genomic coordinates of CTCF binding sites, with strand orientation (directionality of binding). Position weight matrices (PWMs) from JASPAR, HOCOMOCO, CIS-BP, CTCFBSDB, SwissRegulon, and Jolma 2013 were used to uniformly predict CTCF binding sites using FIMO (default settings) on human (hg18, hg19, hg38, T2T) and mouse (mm9, mm10, mm39) genome assemblies. Extra columns include the motif/PWM name (e.g., MA0139.1), score, p-value, q-value, and the motif sequence. It is recommended to filter FIMO-predicted sites at a 1e-6 p-value threshold instead of the default 1e-4 threshold. Experimentally obtained CTCF-bound cis-regulatory elements from ENCODE SCREEN and predicted CTCF sites from CTCFBSDB are also included. Selected data are lifted over from a different genome assembly, as we demonstrated that liftOver is a viable option for obtaining CTCF coordinates across genome assemblies. CTCF sites obtained using JASPAR's MA0139.1 PWM and filtered at the 1e-6 p-value threshold are recommended.
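A hedged access sketch via AnnotationHub (the standard delivery route for such Bioconductor data packages); the query string and the pvalue metadata column name are assumptions to verify.

    library(AnnotationHub)
    ah   <- AnnotationHub()
    ctcf <- query(ah, "CTCF")     # list CTCF-related resources
    gr   <- ctcf[[1]]             # download one resource as a GRanges object
    gr[gr$pvalue < 1e-6]          # keep sites passing the recommended threshold
                                  # (metadata column name assumed)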
Universally unique identifiers ('UUIDs') can be sub-optimal for many use cases because they are not the most character-efficient way of encoding 128 bits of randomness; v1/v2 versions are impractical in many environments, as they require access to a unique, stable MAC address; v3/v5 versions require a unique seed and produce randomly distributed IDs, which can cause fragmentation in many data structures; v4 provides no information other than randomness, which can likewise cause fragmentation in many data structures. As an alternative, ULIDs (<https://github.com/ulid/spec>) offer 128-bit compatibility with UUIDs and 1.21e+24 unique ULIDs per millisecond; they support standard (text) sorting, are canonically encoded as a 26-character string (as opposed to the 36-character UUID), use base32 encoding for better efficiency and readability (5 bits per character), are case insensitive, have no special characters (i.e., are URL safe), and have a monotonic sort order (correctly detecting and handling IDs generated within the same millisecond).
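A base-R sketch of ULID construction following the spec (not the package's API): a 48-bit millisecond timestamp plus 80 random bits, encoded in Crockford base32 as 10 + 16 = 26 characters.

    b32 <- strsplit("0123456789ABCDEFGHJKMNPQRSTVWXYZ", "")[[1]]
    enc <- function(bits)                     # 5 bits per base32 character
      paste(b32[colSums(matrix(bits, 5) * c(16, 8, 4, 2, 1)) + 1], collapse = "")
    ms <- floor(as.numeric(Sys.time()) * 1000)
    tb <- integer(48)                         # timestamp bits, MSB first
    for (i in 48:1) { tb[i] <- ms %% 2; ms <- ms %/% 2 }
    ulid <- paste0(enc(c(0, 0, tb)),          # 10-char time part (2 pad bits)
                   enc(sample(0:1, 80, TRUE)))# 16-char randomness part
    ulid; nchar(ulid)                         # a 26-character, text-sortable ID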
A sizable genomics study such as a microarray experiment often involves multiple batches (groups) of experiments due to practical complications. To minimize batch effects, a careful experiment design should ensure the even distribution of biological groups and confounding factors across batches. OSAT (Optimal Sample Assignment Tool) is developed to facilitate the allocation of collected samples to different batches. With a minimum number of steps, it produces a setup that optimizes the even distribution of samples in groups of biological interest into different batches, reducing the confounding or correlation between batches and the biological variables of interest. It can also optimize the even distribution of confounding factors across batches. The tool can handle challenging instances involving incomplete and unbalanced sample collections as well as the ideal balanced RCBD. OSAT provides a number of predefined layouts for some of the most commonly used genomics platforms. The related paper can be found at <http://www.biomedcentral.com/1471-2164/13/689>.
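A hedged sketch of the workflow; the function and layout-object names follow the OSAT vignette but should be verified against the package manual, and 'pheno' stands in for the user's sample sheet.

    library(OSAT)
    gs  <- setup.sample(pheno, optimal = c("Group", "Sex"))     # variables to balance
    gc  <- setup.container(IlluminaBeadChip96Plate, 2)          # predefined plate layout
    out <- create.optimized.setup(sample = gs, container = gc)  # optimized assignment
    # get.experiment.setup(out) would return the resulting layout (name assumed)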
Implementations of an estimator for the multivariate regression association measure (MRAM) proposed in Shih and Chen (2025) <in revision> and its associated variable selection algorithm. The MRAM quantifies the predictability of a random vector Y from a random vector X given a random vector Z. It takes the maximum value 1 if and only if Y is almost surely a measurable function of X and Z, and the minimum value 0 if Y is conditionally independent of X given Z. The MRAM generalizes the Kendall's tau copula correlation ratio proposed in Shih and Emura (2021) <doi:10.1016/j.jmva.2020.104708> by employing the spatial sign function. The estimator is based on the nearest neighbor method, and the associated variable selection algorithm is adapted from the feature ordering by conditional independence (FOCI) algorithm of Azadkia and Chatterjee (2021) <doi:10.1214/21-AOS2073>. For further details, see Shih and Chen (2025) <in revision>.
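For intuition, the univariate analogue from Azadkia and Chatterjee's FOCI package (codec() is its documented coefficient; the MRAM plays the corresponding role for vector-valued Y via the spatial sign function); treat the exact argument order as an assumption.

    library(FOCI)
    set.seed(1)
    n <- 500
    z <- rnorm(n); x <- rnorm(n)
    y <- x^2 + z + rnorm(n, sd = 0.1)
    codec(y, x, z)   # predictability of y from x given z: near 1 here
    codec(y, z)      # unconditional dependence of y on z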
Kevin Dowd's book Measuring Market Risk is widely read in the area of risk measurement by students and practitioners alike. As he notes, MATLAB may well have been the most suitable language when he originally wrote the functions, but with the growing popularity of R that is no longer entirely the case. As Dowd's code was not intended to be error free and was mainly for reference, some functions in this package have inherited those errors. An attempt will be made in future releases to identify and correct them. Dowd's original code can be downloaded from <http://www.kevindowd.org/measuring-market-risk/>. It should be noted that Dowd offers both the MMR2 and MMR1 toolboxes; only MMR2 was ported to R. MMR2 is the more recent version of the MMR1 toolbox, and the two have largely similar functions. The toolbox mainly contains different parametric and non-parametric methods for the measurement of market risk, as well as methods for backtesting risk measurement.
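The flavor of the toolbox in base R (not the package's own functions): a non-parametric (historical simulation) and a parametric (normal) value-at-risk at the 99% confidence level.

    set.seed(1)
    ret <- rnorm(1000, mean = 0, sd = 0.01)   # daily returns
    cl  <- 0.99
    -quantile(ret, probs = 1 - cl)            # historical-simulation VaR
    -(mean(ret) + sd(ret) * qnorm(1 - cl))    # normal (parametric) VaR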
This package provides a comprehensive suite of genome-wide association study (GWAS) methods specifically designed for biobank-scale data, including, but not limited to, robust approaches for time-to-event traits (Li et al., 2025 <doi:10.1038/s43588-025-00864-z>) and ordinal categorical traits (Bi et al., 2021 <doi:10.1016/j.ajhg.2021.03.019>). The package also offers general frameworks for GWAS of any trait type (Bi et al., 2020 <doi:10.1016/j.ajhg.2020.06.003>), while accounting for sample relatedness (Xu et al., 2025 <doi:10.1038/s41467-025-56669-1>) or population structure (Ma et al., 2025 <doi:10.1186/s13059-025-03827-9>). By accurately approximating score statistic distributions using saddlepoint approximation (SPA), these methods can effectively control type I error rates for rare variants and in the presence of unbalanced phenotype distributions. Additionally, the package includes functions for simulating genotype and phenotype data to support research and method development.
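Why SPA matters, in a base-R miniature (not the package's API): with an unbalanced binary phenotype and a rare variant, the score statistic is right-skewed, and the normal approximation understates the tail probability that drives type I error.

    set.seed(1)
    n <- 5000; cases <- 50                     # 1:99 case-control imbalance
    y <- c(rep(1, cases), rep(0, n - cases))
    stat <- replicate(1e4, {
      g <- rbinom(n, 2, 0.01)                  # rare variant, MAF 1%
      sum((y - mean(y)) * g) / sqrt(var(g) * sum((y - mean(y))^2))
    })                                         # standardized score statistic
    mean(stat > 3)                             # empirical tail probability
    pnorm(3, lower.tail = FALSE)               # normal approximation: smaller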
Obtaining a relevant set of trait-specific genes from gene expression data is important for the clinical diagnosis of disease and the discovery of disease mechanisms in plants and animals. This process involves identifying relevant genes and removing redundant genes as much as possible from the whole gene set. This package returns a trait-specific gene set from high-dimensional RNA-seq count data by applying a combination of two conventional machine learning algorithms, support vector machines (SVM) and a genetic algorithm (GA). The GA is used to control and optimize the subset of genes sent to the SVM for classification and evaluation. The genetic algorithm uses repeated learning steps and cross-validation over a number of possible solutions and selects the best one. The algorithm selects the set of genes based on a fitness function that is obtained via the support vector machine. Using the SVM for classification performance and the genetic algorithm for feature selection, a trait-specific gene set is obtained.
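A hedged sketch of the GA + SVM wrapper idea using the GA and e1071 packages directly (not this package's own API): a binary chromosome selects a gene subset, and SVM cross-validation accuracy serves as the fitness.

    library(GA); library(e1071)
    set.seed(1)
    X <- matrix(rnorm(60 * 100), 60, 100)          # 60 samples x 100 genes
    y <- factor(rep(c("A", "B"), each = 30))
    X[y == "B", 1:5] <- X[y == "B", 1:5] + 2       # 5 truly informative genes
    fitness <- function(bits) {
      if (sum(bits) == 0) return(0)
      fit <- svm(X[, bits == 1, drop = FALSE], y, cross = 5)
      fit$tot.accuracy                             # 5-fold CV accuracy
    }
    res <- ga(type = "binary", fitness = fitness, nBits = ncol(X),
              popSize = 30, maxiter = 20)
    which(res@solution[1, ] == 1)                  # selected gene subset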
Gives some hypothesis test functions (sign test, median and other quantile tests, Wilcoxon signed rank test, coefficient of variation test, test of normal variance, tests on weighted sums of Poisson [see Fay and Kim <doi:10.1002/bimj.201600111>], sample size for t-tests with different variances and non-equal n per arm, Behrens-Fisher test, nonparametric ABC intervals, Wilcoxon-Mann-Whitney test [with effect estimates and confidence intervals, see Fay and Malinovsky <doi:10.1002/sim.7890>], two-sample melding tests [see Fay, Proschan, and Brittain <doi:10.1111/biom.12231>], one-way ANOVA allowing var.equal=FALSE [see Brown and Forsythe, 1974, Biometrics]) and prevalence confidence intervals that adjust for sensitivity and specificity [see Lang and Reiczigel, 2014 <doi:10.1016/j.prevetmed.2013.09.015> or Bayer, Fay, and Graubard, 2023 <doi:10.48550/arXiv.2205.13494>]. The focus is on hypothesis tests that have compatible confidence intervals, but some functions only have confidence intervals (e.g., prevSeSp).
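Two of the tests in action; wmwTest() and bfTest() are documented functions of the package, though their exact arguments should be checked against the manual.

    library(asht)
    set.seed(1)
    x <- rnorm(20)
    y <- rnorm(25, mean = 0.8, sd = 2)
    wmwTest(x, y)   # Wilcoxon-Mann-Whitney with effect estimate and CI
    bfTest(x, y)    # Behrens-Fisher test for unequal variances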
This package provides tools for interacting with the Geographic Name Resolution Service ('GNRS') API <https://github.com/ojalaquellueva/gnrs> and associated functionality. The GNRS is a batch application for resolving and standardizing political division names against standard names in the GeoNames database <http://www.geonames.org/>. The GNRS resolves political division names at three levels: country, state/province, and county/parish. Resolution is performed in a series of steps, beginning with direct matching to standard names, followed by direct matching to alternate names in different languages, followed by direct matching to standard codes (such as ISO and FIPS codes). If direct matching fails, the GNRS attempts to match to standard and then alternate names using fuzzy matching, but it does not perform fuzzy matching of political division codes. The GNRS works down the political division hierarchy, stopping at the current level if all matches fail; in other words, if a country cannot be matched, the GNRS does not attempt to match state or county.
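A hedged sketch of batch name resolution; GNRS_super_simple() appears in the package documentation, but treat the exact signature and the result column names as assumptions to verify.

    library(GNRS)
    res <- GNRS_super_simple(
      country        = c("USA", "Mxico"),        # misspellings to resolve
      state_province = c("Californa", "Oaxaca")
    )
    res   # standardized names plus match status (columns assumed)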
This package provides a set of tools for fitting Markov-modulated linear regression, where the responses Y(t) are time-additive and the model operates in an external environment described as a continuous-time Markov chain with finite state space. The model was proposed by Alexander Andronov (2012) <arXiv:1901.09600v1>, and the parameter estimation algorithm is based on eigenvalue and eigenvector decomposition. Markov-switching regression models share the idea of varying the regression parameters randomly in accordance with the external environment. The difference is that for the Markov-modulated linear regression model the external environment is described as a continuous-time homogeneous irreducible Markov chain with known parameters, while switching models treat the Markov chain as unobserved, so the estimation procedure involves estimating the transition matrix. These models differ significantly in their analytical approach. The package also provides a set of data simulation tools for Markov-modulated linear regression (for academic/research purposes). Research project No. 1.1.1.2/VIAA/1/16/075.
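A base-R illustration of the data-generating idea (not the package's API): the regression coefficient switches with the state of a continuous-time Markov chain, and the response accumulates additively over time.

    set.seed(1)
    Q     <- matrix(c(-0.5, 0.5, 0.3, -0.3), 2, byrow = TRUE)  # CTMC generator
    beta  <- c(1, 3)                       # state-specific slopes
    state <- 1; t <- 0; Tend <- 10; y <- 0; x <- 2
    while (t < Tend) {
      hold  <- rexp(1, -Q[state, state])   # exponential holding time
      dt    <- min(hold, Tend - t)
      y     <- y + beta[state] * x * dt    # time-additive response
      state <- if (state == 1) 2 else 1    # two-state chain: jump to the other
      t     <- t + dt
    }
    y   # Y(Tend), accumulated under the random environment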
This project aims to support the method of path analysis (PA) for inferring causality from data. For this we propose a hybrid approach that uses Bayesian network structure learning algorithms to create, from a data set, the input for the construction of a PA model. The process is performed in a semi-automatic way by our intermediate algorithm, allowing novice researchers to create and evaluate their own PA models from a data set. The references used for this project are: Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press. <doi:10.1017/S0269888910000275>. Nagarajan, R., Scutari, M., & Lèbre, S. (2013). Bayesian networks in R. Springer. <doi:10.1007/978-1-4614-6446-4>. Scutari, M., & Denis, J. B. (2014). Bayesian networks: with examples in R. Chapman and Hall/CRC. <doi:10.1201/b17065>. Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1-36. <doi:10.18637/jss.v048.i02>.
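A hedged sketch of the hybrid idea using bnlearn and lavaan directly (hc(), arcs(), and sem() are documented functions of those packages, though this is not necessarily how this package wires them together): learn a DAG, translate its arcs into lavaan regression syntax, then fit the PA model.

    library(bnlearn); library(lavaan)
    set.seed(1)
    d   <- data.frame(x = rnorm(200))
    d$m <- 0.5 * d$x + rnorm(200, sd = 0.5)
    d$y <- 0.7 * d$m + rnorm(200, sd = 0.5)
    dag   <- hc(d)                              # score-based structure learning
    a     <- arcs(dag)                          # matrix with "from"/"to" columns
    model <- paste(sapply(unique(a[, "to"]), function(v)
      paste(v, "~", paste(a[a[, "to"] == v, "from"], collapse = " + "))),
      collapse = "\n")                          # arcs -> lavaan regressions
    fit <- sem(model, data = d)                 # fit the resulting PA model
    summary(fit)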
Holistic Multimodel Domain Analysis (HMDA) is a robust and transparent framework designed for exploratory machine learning research, aiming to enhance the process of feature assessment and selection. HMDA addresses key limitations of traditional machine learning methods by evaluating the consistency across multiple high-performing models within a fine-tuned modeling grid, thereby improving the interpretability and reliability of feature importance assessments. Specifically, it computes Weighted Mean SHapley Additive exPlanations (WMSHAP), which aggregate feature contributions from multiple models based on weighted performance metrics. HMDA also provides confidence intervals to demonstrate the stability of these feature importance estimates. This framework is particularly beneficial for analyzing complex, multidimensional datasets common in health research, supporting reliable exploration of mental health outcomes such as suicidal ideation, suicide attempts, and other psychological conditions. Additionally, HMDA includes automated procedures for feature selection based on WMSHAP ratios and performs dimension reduction analyses to identify underlying structures among features. For more details see Haghish (2025) <doi:10.13140/RG.2.2.32473.63846>.
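The WMSHAP aggregation in miniature, in base R (not the package's API): each feature's SHAP importance is averaged across candidate models, weighting each model by a performance metric.

    shap <- rbind(m1 = c(age = 0.40, income = 0.10, score = 0.50),
                  m2 = c(age = 0.35, income = 0.20, score = 0.45),
                  m3 = c(age = 0.45, income = 0.05, score = 0.50))
    perf <- c(m1 = 0.81, m2 = 0.78, m3 = 0.84)   # e.g., AUC of each model
    w    <- perf / sum(perf)                     # performance-based weights
    colSums(shap * w)                            # weighted mean SHAP per feature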
This package provides a set of models to estimate nonlinear longitudinal data using Bayesian estimation methods. These models include: 1) the Bayesian Piecewise Random Effects Model (Bayes_PREM()), which estimates a piecewise random effects (mixture) model for a given number of latent classes and a latent number of possible changepoints in each class, and can incorporate class- and outcome-predictive covariates (see Lamm (2022) <https://hdl.handle.net/11299/252533> and Lock et al. (2018) <doi:10.1007/s11336-017-9594-5>); 2) the Bayesian Crossed Random Effects Model (Bayes_CREM()), which estimates a linear, quadratic, exponential, or piecewise crossed random effects model where individuals change groups over time (e.g., students and schools; see Rohloff et al. (2024) <doi:10.1111/bmsp.12334>); and 3) the Bayesian Bivariate Piecewise Random Effects Model (Bayes_BPREM()), which estimates a bivariate piecewise random effects model to jointly model two related outcomes (e.g., reading and math achievement; see Peralta et al. (2022) <doi:10.1037/met0000358>).
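A hedged sketch of a call: Bayes_PREM() is the documented function name, but the package name and every argument below are assumptions to check against the manual, and 'long_data' stands in for the user's long-format data.

    library(BEND)   # package name assumed
    fit <- Bayes_PREM(data = long_data,            # long-format longitudinal data
                      id_var = "id",               # argument names assumed
                      time_var = "time",
                      y_var = "score",
                      n_class = 2)                 # number of latent classes
    summary(fit)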
This package provides nonparametric CUSUM tests for detecting changes in possibly serially dependent univariate or low-dimensional multivariate observations. Retrospective tests sensitive to changes in the expectation, the variance, the covariance, the autocovariance, the distribution function, Spearman's rho, Kendall's tau, Gini's mean difference, and the copula are provided, as well as a test for detecting changes in the distribution of independent block maxima (with environmental studies in mind). The package also contains a test sensitive to changes in the autocopula and a combined test of stationarity sensitive to changes in the distribution function and the autocopula. The latest additions are an open-end sequential test based on the retrospective CUSUM statistic that can be used for monitoring changes in the mean of possibly serially dependent univariate observations, as well as closed-end and open-end sequential tests based on empirical distribution functions that can be used for monitoring changes in the contemporary distribution of possibly serially dependent univariate or low-dimensional multivariate observations.
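One retrospective test in action; cpDist() is the package's documented test for detecting changes in the distribution function, applied here to a series with a change-point in the mean.

    library(npcp)
    set.seed(1)
    x <- c(rnorm(100), rnorm(100, mean = 1))   # change-point at position 100
    cpDist(as.matrix(x))                       # CUSUM test; small p-value expected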
Ordination comprises several multivariate exploratory and explanatory techniques with theoretical foundations in geometric data analysis; see Podani (2000, ISBN:90-5782-067-6) for techniques and applications and Le Roux & Rouanet (2005) <doi:10.1007/1-4020-2236-0> for foundations. Greenacre (2010, ISBN:978-84-923846) shows how the most established of these, including principal components analysis, correspondence analysis, multidimensional scaling, factor analysis, and discriminant analysis, rely on eigen-decompositions or singular value decompositions of pre-processed numeric matrix data. These decompositions give rise to a set of shared coordinates along which the row and column elements can be measured. The overlay of their scatterplots on these axes, introduced by Gabriel (1971) <doi:10.1093/biomet/58.3.453>, is called a biplot. ordr provides inspection, extraction, manipulation, and visualization tools for several popular ordination classes supported by a set of recovery methods. It is inspired by and designed to integrate into Tidyverse workflows provided by Wickham et al (2019) <doi:10.21105/joss.01686>.
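A hedged sketch of a Tidyverse-style biplot; as_tbl_ord(), ggbiplot(), and the geom_rows_*/geom_cols_* layers follow the package README, but verify the names against the ordr documentation.

    library(ordr)
    pca <- prcomp(USArrests, scale. = TRUE)   # an eigen/SVD-based ordination
    b   <- as_tbl_ord(pca)                    # wrap the ordination object
    ggbiplot(b) +
      geom_rows_point() +                     # observation scores
      geom_cols_vector()                      # variable loadings as arrows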
Institutional performance assessment remains a key challenge for a multitude of stakeholders. Existing indicators such as h-type indicators, g-type indicators, and many others do not reflect the expertise of institutions that defines their research portfolio. The package offers functionality to compute and visualise two novel indices: the x-index and the xd-index. The x-index evaluates an institution's scholarly expertise within a specific discipline or field, while the xd-index provides a broader assessment of overall scholarly expertise considering an institution's publication pattern and strengths across coarse thematic areas. These indices offer a nuanced understanding of institutional research capabilities, aiding stakeholders in research management and resource allocation decisions. Lathabai, H.H., Nandy, A., and Singh, V.K. (2021) <doi:10.1007/s11192-021-04188-3>. Nandy, A., Lathabai, H.H., and Singh, V.K. (2023) <doi:10.5281/zenodo.8305585>. This package provides the h, g, x, and xd indices for use with the standard format of Web of Science (WoS) scraped datasets.
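For intuition, the classic h-index in base R (the package's x- and xd-indices refine this family of indicators using discipline- and theme-level expertise information):

    citations <- c(10, 8, 5, 4, 3, 0)   # citations per paper
    sum(sort(citations, decreasing = TRUE) >= seq_along(citations))   # h = 4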
The Stratified-Petersen Analysis System (SPAS) is designed to estimate abundance in two-sample capture-recapture experiments where the captures and recaptures are stratified. This is a generalization of the simple Lincoln-Petersen estimator. Strata may be defined in time or in space or both, and the s strata in which marking takes place may differ from the t strata in which recoveries take place. When s=t, SPAS reduces to the method described by Darroch (1961) <doi:10.2307/2332748>. When s<t, SPAS implements the methods described in Plante, Rivest, and Tremblay (1988) <doi:10.2307/2533994>. Schwarz and Taylor (1998) <doi:10.1139/f97-238> describe the use of SPAS in estimating return of salmon stratified by time and geography. A related package, BTSPAS, deals with temporal stratification where a spline is used to model the distribution of the population over time as it passes the second capture location. This is the R version of the (now obsolete) standalone Windows program of the same name.
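A hedged sketch; SPAS.fit.model() is the package's documented fitting function, but the input layout and argument names below are assumptions to check against ?SPAS.fit.model.

    library(SPAS)
    # Assumed (s+1) x (t+1) input: rows = release strata plus unmarked fish,
    # columns = recovery strata plus a never-recovered column.
    mark_recap <- matrix(c( 50,  10,   5, 200,
                             8,  60,  12, 180,
                           900, 800, 700,   0), nrow = 3, byrow = TRUE)
    fit <- SPAS.fit.model(model.id    = "No pooling",
                          rawdata     = mark_recap,   # argument name assumed
                          row.pool.in = c(1, 2),      # keep marking strata separate
                          col.pool.in = c(1, 2, 3))   # keep recovery strata separate
    # SPAS.print.model(fit) would summarize the abundance estimates (name assumed)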