Multidimensional scaling (MDS) functions for a broad range of tasks, well beyond the experimental stage. Options are currently available for weights, restrictions, classical scaling or principal coordinate analysis, transformations (linear, power, Box-Cox, spline, ordinal), outlier mitigation (rdop), out-of-sample estimation (predict), negative dissimilarities, fast and faster executions with low memory footprints, penalized restrictions, cross-validation-based penalty selection, supplementary variable estimation (explain), additive constant estimation, mixed measurement level distance calculation, and restricted classical scaling, with more to come. References: Busing (2024) "A Simple Population Size Estimator for Local Minima Applied to Multidimensional Scaling", manuscript submitted for publication; Busing (2025) "Node Localization by Multidimensional Scaling with Iterative Majorization", manuscript submitted for publication; Busing (2025) "Faster Multidimensional Scaling", manuscript in preparation; Barroso and Busing (2025) "e-RDOP, Relative Density-Based Outlier Probabilities, Extended to Proximity Mapping", manuscript submitted for publication.
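The classical scaling / principal coordinate analysis option can be illustrated with base R's cmdscale(), independent of this package's own interface (only the function names listed above are documented here; everything else in the sketch is base R):

    # Classical scaling (principal coordinate analysis) of a dissimilarity
    # matrix with base R; the package's own functions add weights,
    # restrictions, transformations, and the other options listed above.
    d <- dist(scale(USArrests))          # dissimilarities between observations
    conf <- cmdscale(d, k = 2)           # 2-dimensional configuration
    plot(conf, asp = 1, xlab = "Dim 1", ylab = "Dim 2")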
An end-to-end toolkit for land use and land cover classification using big Earth observation data. Builds satellite image data cubes from cloud collections. Supports visualization methods for images and time series, and smoothing filters for dealing with noisy time series. Enables merging of multi-source imagery (SAR, optical, DEM). Includes functions for quality assessment of training samples using self-organized maps and for reducing training sample imbalance. Provides machine learning algorithms including support vector machines, random forests, extreme gradient boosting, multi-layer perceptrons, temporal convolutional neural networks, and temporal attention encoders. Performs efficient classification of big Earth observation data cubes and includes functions for post-classification smoothing based on Bayesian inference. Enables best practices for estimating area and assessing accuracy of land change. Includes object-based spatio-temporal segmentation for space-time OBIA. Minimum recommended requirements: 16 GB RAM and a 4-core CPU.
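A typical cube-train-classify workflow might look as follows (function names as in recent sits releases; my_roi and training_samples are user-supplied placeholders, and argument details should be checked against the sits documentation):

    # Indicative sits workflow: build a data cube from a cloud collection,
    # train a machine learning model, classify the cube.
    library(sits)
    cube <- sits_cube(source = "MPC", collection = "SENTINEL-2-L2A",
                      roi = my_roi,                  # placeholder region of interest
                      start_date = "2020-01-01", end_date = "2020-12-31")
    model <- sits_train(training_samples, sits_rfor())  # random forest on user samples
    probs <- sits_classify(cube, ml_model = model)      # probability cube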
This package provides a Bayesian latent space model for complex networks, either weighted or unweighted. Given an observed input graph, estimates for the latent coordinates of the nodes are obtained through a Bayesian MCMC algorithm. The likelihood of the graph is built from a tie-probability equation defined so that ties are more likely between nodes whose latent space coordinates are close. The package is mainly based on the model by Hoff, Raftery and Handcock (2002) <doi:10.1198/016214502388618906> and adds some extra features (e.g., removal of the Procrustean step, weights implemented as coefficients of the latent distances, 3D plots). The original code for the above model was retrieved from <https://www.stat.washington.edu/people/pdhoff/Code/hoff_raftery_handcock_2002_jasa/>. Users can inspect the MCMC simulation, create and customize insightful graphical representations, or apply clustering techniques.
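The tie-probability equation of Hoff, Raftery and Handcock (2002) sets the log-odds of an edge between nodes i and j to an intercept minus their latent distance. A minimal base-R illustration (the intercept and coordinates are invented for the example):

    # Distance model of Hoff et al. (2002):
    # logit P(edge i~j) = alpha - |z_i - z_j|
    z <- matrix(rnorm(10 * 2), ncol = 2)   # latent coordinates of 10 nodes
    alpha <- 1.5                           # intercept (example value)
    d <- as.matrix(dist(z))                # pairwise latent distances
    p <- plogis(alpha - d)                 # edge probabilities; close nodes get high p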
Mapping, spatial analysis, and statistical modeling of microdata from sources such as the Demographic and Health Surveys <https://www.dhsprogram.com/> and the Integrated Public Use Microdata Series <https://www.ipums.org/>. It can also be extended to other datasets. The package supports spatial correlation index construction and visualization, along with empirical Bayes approximation of regression coefficients in a multistage setup. The main functionality is repeated regression: if a regression must be run for each of n groups, the group ID is supplied as the variable for the parameter `location_var`. It can perform various kinds of regression, such as generalized regression models, logit, probit, and more, and can incorporate interaction effects. The key benefit of the package is its ability to store the results of regressions performed repeatedly on a dataset by group ID, along with the respective p-values, and to map those estimates.
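Conceptually, the repeated-regression feature amounts to the base-R loop below; the package adds per-group storage of coefficients and p-values plus mapping on top (run_by_group and its arguments are a hypothetical sketch, not the package's API):

    # Repeated regression by group ID, sketched with base R.
    run_by_group <- function(data, formula, group) {
      lapply(split(data, data[[group]]), function(d) {
        summary(glm(formula, family = binomial("logit"), data = d))$coefficients
      })
    }
    # Example: a logit model fitted separately within each region
    # results <- run_by_group(df, y ~ x1 + x2, group = "region")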
Entropy weighted k-means (ewkm) by Liping Jing, Michael K. Ng and Joshua Zhexue Huang (2007) <doi:10.1109/TKDE.2007.1048> is a weighted subspace clustering algorithm that is well suited to very high dimensional data. Weights are calculated as the importance of a variable with regard to cluster membership. The two-level variable weighting clustering algorithm tw-k-means (twkm) by Xiaojun Chen, Xiaofei Xu, Joshua Zhexue Huang and Yunming Ye (2013) <doi:10.1109/TKDE.2011.262> introduces two types of weights, on individual variables and on variable groups, both calculated during the clustering process. The feature group weighted k-means (fgkm) by Xiaojun Chen, Yunming Ye, Xiaofei Xu and Joshua Zhexue Huang (2012) <doi:10.1016/j.patcog.2011.06.004> extends this concept by grouping features and weighting the groups in addition to weighting individual features.
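A minimal ewkm() call might look like the following (argument names as I recall them from the package documentation; verify the exact signature in the manual):

    # Entropy weighted k-means on numeric data; lambda controls the
    # incentive to cluster on more dimensions (assumed signature).
    library(wskm)
    fit <- ewkm(iris[, 1:4], centers = 3, lambda = 1, maxiter = 100)
    fit$weights   # per-cluster variable weights
    fit$cluster   # cluster memberships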
This package implements the Single Transferable Vote (STV) electoral system, with clear explanatory graphics. The core function stv() uses Meek's method, the purest expression of the simple principles of STV, which however requires electronic counting. It can handle votes expressing equal preferences for subsets of the candidates. A function stv.wig() implementing the Weighted Inclusive Gregory method, as used in Scottish council elections, is also provided with the same options, as described in the manual. Vote data must be supplied as an R list; a function pref.data() is provided to transform some commonly used data formats into this format. References for the methodology: Hill, Wichmann and Woodall (1987) <doi:10.1093/comjnl/30.3.277>; Hill, David (2006) <https://www.votingmatters.org.uk/ISSUE22/I22P2.pdf>; Mollison, Denis (2023) <arXiv:2303.15310>; see also the package manual pref_pkg_manual.pdf.
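Ranked ballots and a count might be set up along these lines (a purely schematic sketch of the shape of the data; the exact list fields expected by stv() are documented in the package manual, and pref.data() builds them from common formats):

    # Schematic ranked ballots for candidates A-D; note the equal first
    # preferences on the second ballot, which Meek's method can handle.
    ballots <- list(
      c(A = 1, B = 2, C = 3),   # voter 1: A first, B second, C third
      c(B = 1, A = 1, D = 2)    # voter 2: A and B equal first, D second
    )
    # votedata <- pref.data(...)   # convert to the package's list format
    # result <- stv(votedata)      # count by Meek's method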
This package provides functions to generate and compare power spectral density (PSD) plots given time series data. The fast Fourier transform (FFT) is used to analyze the oscillations in a time series and output their frequencies as a PSD plot. Thus, given a time series, its dominant frequencies can be identified. Additional functions in this package allow the dominant frequencies of multiple groups of time series to be compared with each other. For example usage of the main functions of this package, please visit this site: <https://yhhc2.github.io/psdr/articles/Introduction.html>. The mathematical operations used to generate the PSDs are described at these sites: <https://www.mathworks.com/help/matlab/ref/fft.html> and <https://www.mathworks.com/help/signal/ug/power-spectral-density-estimates-using-fft.html>.
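The underlying computation can be reproduced with base R's fft(): square the modulus of the transform, scale by sampling rate and length, and keep the positive frequencies. A minimal sketch (the package wraps this with plotting and group comparisons):

    # PSD of a sampled signal via the FFT (base R).
    fs <- 100                                  # sampling rate, Hz
    t  <- seq(0, 10, by = 1 / fs)
    x  <- sin(2 * pi * 5 * t) + rnorm(length(t), sd = 0.3)  # 5 Hz tone + noise
    n  <- length(x)
    psd  <- Mod(fft(x))^2 / (fs * n)           # two-sided PSD
    half <- 1:(floor(n / 2) + 1)               # positive-frequency half
    freq <- (half - 1) * fs / n
    plot(freq, psd[half], type = "l", xlab = "Frequency (Hz)", ylab = "PSD")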
Accelerated destructive degradation tests (ADDT) are often used to collect the data needed to assess the long-term properties of polymeric materials. Based on the collected data, a thermal index (TI) is estimated, which is useful for material rating and comparison. This package implements the traditional approach based on least squares, the parametric approach based on maximum likelihood estimation, and the semiparametric approach based on splines, together with the corresponding methods for estimating the TI of polymeric materials. The traditional approach is a two-step procedure currently used in industrial standards, while the parametric approach is widely used in the statistical literature. The semiparametric approach is newly developed. Both the parametric and semiparametric approaches allow statistical inference such as quantifying uncertainty in estimation, hypothesis testing, and prediction. Publicly available datasets are provided for illustration. More details can be found in Jin et al. (2017).
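The second step of the traditional approach can be sketched in base R: fit an Arrhenius-type least-squares line of log10(time to failure) against 1/(absolute temperature), then solve for the temperature at which a target life is reached. A hedged sketch with invented data (the 100,000-hour target is one common convention; the package implements the standards-compliant details):

    # Traditional ADDT step 2: Arrhenius least-squares fit (invented data).
    temp_C <- c(250, 270, 290, 310)               # aging temperatures, deg C
    ttf_h  <- c(22000, 7200, 2600, 1000)          # interpolated times to failure, hours
    invT   <- 1 / (temp_C + 273.15)               # 1 / absolute temperature
    fit    <- lm(log10(ttf_h) ~ invT)             # log-linear Arrhenius relationship
    # TI: temperature (deg C) at which the fitted line predicts a
    # 100,000-hour life, i.e. log10(ttf) = 5:
    TI <- coef(fit)[2] / (5 - coef(fit)[1]) - 273.15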
An implementation of a taxonomy of models of restricted diffusion in biological tissues parametrized by the tissue geometry (axis, diameter, density, etc.). This is primarily used in the context of diffusion magnetic resonance (MR) imaging to model the MR signal attenuation in the presence of diffusion gradients. The goal is to provide tools to simulate the MR signal attenuation predicted by these models under different experimental conditions. The package feeds a companion shiny app available at <https://midi-pastrami.apps.math.cnrs.fr> that serves as a graphical interface to the models and tools provided by the package. Models currently available are the ones in Neuman (1974) <doi:10.1063/1.1680931>, Van Gelderen et al. (1994) <doi:10.1006/jmrb.1994.1038>, Stanisz et al. (1997) <doi:10.1002/mrm.1910370115>, Soderman & Jonsson (1995) <doi:10.1006/jmra.1995.0014> and Callaghan (1995) <doi:10.1006/jmra.1995.1055>.
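As a point of reference for these restricted-diffusion models, the unrestricted (free diffusion) case gives the familiar mono-exponential attenuation, sketched below; this is the generic baseline from which the geometry-parametrized models deviate, not one of the package's models:

    # Free-diffusion baseline: signal attenuation E = exp(-b * D).
    b <- seq(0, 3000, by = 100) * 1e6      # b-values in s/m^2 (from s/mm^2)
    D <- 2e-9                              # diffusivity in m^2/s (water-like)
    E <- exp(-b * D)                       # restricted models deviate from this curve
    plot(b / 1e6, E, type = "l", xlab = "b (s/mm^2)", ylab = "attenuation")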
Reconstruction of paleoclimate niches using phylogenetic comparative methods and projection of the reconstructed niches onto paleoclimate maps. The user can specify various models of trait evolution or estimate the best-fit model, include fossils, use one or multiple phylogenies for inference, and make animations of shifting suitable habitat through time. This model was first used in Lawing and Polly (2011), and further implemented in Lawing et al. (2016) and Rivera et al. (2020). Lawing and Polly (2011) <doi:10.1371/journal.pone.0028554> "Pleistocene climate, phylogeny and climate envelope models: An integrative approach to better understand species response to climate change"; Lawing et al. (2016) <doi:10.1086/687202> "Including fossils in phylogenetic climate reconstructions: A deep time perspective on the climatic niche evolution and diversification of spiny lizards (Sceloporus)"; Rivera et al. (2020) <doi:10.1111/jbi.13915> "Reconstructing historical shifts in suitable habitat of Sceloporus lineages using phylogenetic niche modelling."
Time series area-level models for small area estimation. The package supplements the functionality of the sae package. Specifically, it includes EBLUP fitting of the Rao-Yu model in its original form, without a spatial component. The package also offers a modified ("dynamic") version of the Rao-Yu model that relaxes the assumption of stationarity. Both univariate and multivariate applications are supported. Of particular note is the allowance for covariance of the area-level sample estimates over time, as encountered in rotating panel designs such as the U.S. National Crime Victimization Survey or present in a time series of 5-year estimates from the American Community Survey. Key references for the methods include J.N.K. Rao and I. Molina (2015, ISBN:9781118735787), J.N.K. Rao and M. Yu (1994) <doi:10.2307/3315407>, and R.E. Fay and R.A. Herriot (1979) <doi:10.1080/01621459.1979.10482505>.
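Concretely, the Rao-Yu model for area i at time t combines a sampling model and a linking model with an AR(1) time component (plain-text notation; see Rao and Yu 1994 for details):

    y_it = theta_it + e_it                 (sampling model; the e_it may be correlated
                                            over time, as in rotating panels)
    theta_it = x_it' beta + v_i + u_it     (linking model; v_i is an area-level random effect)
    u_it = rho * u_{i,t-1} + eps_it        (stationary AR(1); the "dynamic" version
                                            relaxes this stationarity assumption)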
Machine coded genetic algorithm (MCGA) is a fast tool for real-valued optimization problems. It uses the byte representation of variables rather than the real values and performs the classical (uniform) crossover operations on these byte representations. The mutation operator is similar to the classical one: it changes a randomly selected byte value of a chromosome by +1 or -1, each with probability 1/2. In MCGAs there is no need for an encoding-decoding process, and the classical operators are directly applicable to real values. The algorithm is fast and can handle a wide range of search spaces with high precision. Its use of a 256-ary alphabet is its main disadvantage, but a moderately sized population is adequate for many problems. The package also includes the multi_mcga function for multi-objective optimization problems, which sorts chromosomes using ranks calculated by the non-dominated sorting algorithm.
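The byte-level representation is easy to see in base R: a double is eight bytes, and mutation perturbs one of them. A minimal illustration of the idea (not the package's internal code):

    # Byte representation of a real value, and a +/-1 byte mutation.
    x <- 3.14159
    bytes <- writeBin(x, raw())            # the 8 bytes encoding the double
    i <- sample(length(bytes), 1)          # pick a random byte
    delta <- sample(c(-1L, 1L), 1)         # +1 or -1, each with probability 1/2
    bytes[i] <- as.raw((as.integer(bytes[i]) + delta) %% 256)
    mutated <- readBin(bytes, "double")    # decode back to a real value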
Calculate the probability density functions (PDFs) for two-threshold evidence accumulation models (EAMs). These are defined by the stochastic differential equation (SDE) dx(t) = v(x(t),t)*dt + D(x(t),t)*dW, where x(t) is the accumulated evidence at time t, v(x(t),t) is the drift rate, D(x(t),t) is the noise scale, and W is the standard Wiener process. The boundary conditions of this process are the upper and lower decision thresholds, represented by b_u(t) > 0 and b_l(t) < 0, respectively. The initial condition is x(0) = z with b_l(0) < z < b_u(0); we represent this as the relative start point w = z/(b_u(0)-b_l(0)), a ratio of the initial threshold locations. This package generates the PDF using the same approach as the Python package it is based upon, PyBEAM by Murrow and Holmes (2023) <doi:10.3758/s13428-023-02162-w>. First, it converts the SDE model into the forward Fokker-Planck equation dp(x,t)/dt = -d(v(x,t)*p(x,t))/dx + 0.5*d^2(D(x,t)^2*p(x,t))/dx^2, then solves this equation using the Crank-Nicolson method to determine p(x,t). Finally, it calculates the flux at the decision thresholds, f_i(t) = 0.5*d(D(x,t)^2*p(x,t))/dx evaluated at x = b_i(t), where i indicates the relevant decision threshold, either upper (i = u) or lower (i = l). The flux f_i(t) at each threshold is the first-passage PDF for that threshold. Further details of this approach are discussed in the publications for this package and PyBEAM. Additionally, one can calculate the cumulative distribution functions of, and sample from, the EAMs.
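For intuition (and as a cross-check on densities obtained this way), the same first-passage distributions can be approximated by direct simulation of the SDE. A simulation sketch with constant drift, constant noise scale, and flat thresholds follows; it is not the package's Crank-Nicolson solver:

    # Euler-Maruyama simulation of dx = v*dt + D*dW with absorbing
    # thresholds b_u > 0 and b_l < 0; records which threshold is hit, and when.
    simulate_eam <- function(n = 5000, v = 1, D = 1, b_u = 1, b_l = -1,
                             z = 0, dt = 1e-3, t_max = 10) {
      rt <- resp <- numeric(n)
      for (k in seq_len(n)) {
        x <- z; t <- 0
        while (x < b_u && x > b_l && t < t_max) {
          x <- x + v * dt + D * sqrt(dt) * rnorm(1)
          t <- t + dt
        }
        rt[k] <- t
        resp[k] <- ifelse(x >= b_u, 1, ifelse(x <= b_l, -1, NA))  # NA = censored
      }
      data.frame(rt = rt, resp = resp)
    }
    # hist(subset(simulate_eam(), resp == 1)$rt)  # upper-threshold RT distribution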
Generate systems of ordinary differential equations (ODEs) and integrate them, using a domain specific language (DSL). The DSL uses R's syntax but compiles to C in order to solve the system efficiently. A solver is not provided; instead, interfaces to the packages deSolve and dde are generated. With these, no allocations are done while solving the differential equations, and the calculations remain entirely in compiled code. Alternatively, a model can be transpiled to R for use in contexts where a C compiler is not present. After compilation, models can be inspected to return information about parameters and outputs, or intermediate values after calculations. odin is not targeted at any particular domain and is suitable for any system that can be expressed primarily as mathematical expressions. Additional support is provided for working with delays (delay differential equations, DDEs), for using interpolated functions during integration, and for integrating quantities that represent arrays.
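A small odin model might look like this (syntax as in recent odin versions; treat the generator interface as indicative and check the odin documentation):

    # Logistic growth written in the odin DSL, compiled to C and solved
    # via the generated deSolve interface.
    gen <- odin::odin({
      deriv(y) <- r * y * (1 - y / K)   # the ODE right-hand side
      initial(y) <- 1                   # initial condition
      r <- user(0.5)                    # user-settable parameters
      K <- user(100)
    })
    mod <- gen$new()                    # instantiate with default parameters
    out <- mod$run(seq(0, 50, by = 0.5))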
Estimate average treatment effects (ATEs) in stratified randomized experiments. `sreg` supports a wide range of stratification designs, including matched pairs, n-tuple designs, and larger strata with many units, possibly of unequal size across strata. sreg is designed to accommodate scenarios with multiple treatments and cluster-level treatment assignments, and accommodates optimal linear covariate adjustment based on baseline observable characteristics. sreg computes estimators and standard errors based on Bugni, Canay, Shaikh (2018) <doi:10.1080/01621459.2017.1375934>; Bugni, Canay, Shaikh, Tabord-Meehan (2024+) <doi:10.48550/arXiv.2204.08356>; Jiang, Linton, Tang, Zhang (2023+) <doi:10.48550/arXiv.2201.13004>; Bai, Jiang, Romano, Shaikh, and Zhang (2024) <doi:10.1016/j.jeconom.2024.105740>; Bai (2022) <doi:10.1257/aer.20201856>; Bai, Romano, and Shaikh (2022) <doi:10.1080/01621459.2021.1883437>; Liu (2024+) <doi:10.48550/arXiv.2301.09016>; and Cytrynbaum (2024) <doi:10.3982/QE2475>.
Facilitates nonresponse bias analysis (NRBA) for survey data. Such data may arise from a complex sampling design with features such as stratification, clustering, or unequal probabilities of selection. Multiple types of analyses may be conducted: comparisons of response rates across subgroups; comparisons of estimates before and after weighting adjustments; comparisons of sample-based estimates to external population totals; tests of systematic differences in covariate means between respondents and full samples; tests of independence between response status and covariates; and modeling of outcomes and response status as a function of covariates. Extensive documentation and references are provided for each type of analysis. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the methods implemented in this package.
In practice, the longitudinal performance of processes often needs to be monitored over time. Dynamic screening systems (DySS) are methods that aim to identify processes with poor performance and signal them as early as possible. This package implements dynamic screening systems and related methods. References: Qiu, P. and Xiang, D. (2014) <doi:10.1080/00401706.2013.822423>; Qiu, P. and Xiang, D. (2015) <doi:10.1002/sim.6477>; Li, J. and Qiu, P. (2016) <doi:10.1080/0740817X.2016.1146423>; Li, J. and Qiu, P. (2017) <doi:10.1002/qre.2160>; You, L. and Qiu, P. (2019) <doi:10.1080/00949655.2018.1552273>; Qiu, P., Xia, Z., and You, L. (2020) <doi:10.1080/00401706.2019.1604434>; You, L., Qiu, A., Huang, B., and Qiu, P. (2020) <doi:10.1002/bimj.201900127>; You, L. and Qiu, P. (2021) <doi:10.1080/00224065.2020.1767006>.
Imbalanced domain learning has focused almost exclusively on classification tasks, where the objective is to accurately predict cases labelled with a rare class. A comparably well-defined approach for regression tasks has been lacking, due to two main factors. First, standard regression tasks assume that each value is equally important to the user. Second, standard evaluation metrics focus on assessing the performance of the model on the most common cases. This package contains methods to tackle imbalanced domain learning problems in regression tasks, where the objective is to predict extreme (rare) values. The methods contained in this package are: 1) an automatic and non-parametric method to obtain relevance functions, which map target values to their importance to the user; 2) visualisation tools; 3) a suite of evaluation measures for optimisation/validation processes; 4) the squared-error relevance area (SERA) measure, an evaluation metric tailored for imbalanced regression tasks, sketched below. More information can be found in Ribeiro and Moniz (2020) <doi:10.1007/s10994-020-05900-9>.
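SERA integrates, over relevance cutoffs t in [0, 1], the squared errors of the cases whose relevance is at least t. A grid approximation is easy to write down (this sketch uses a toy min-max relevance function, not the package's automatic, non-parametric one):

    # Squared-error relevance area (SERA), approximated on a cutoff grid;
    # phi maps each true value to a relevance in [0, 1].
    sera <- function(y, yhat, phi, steps = seq(0, 1, by = 0.001)) {
      ser_t <- sapply(steps, function(t) sum((yhat - y)[phi >= t]^2))
      mean(ser_t)   # grid approximation of the integral over t
    }
    # Toy relevance emphasizing large target values:
    # rel <- (y - min(y)) / (max(y) - min(y)); sera(y, yhat, phi = rel)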
This package provides a set of model-assisted survey estimators and corresponding variance estimators for single stage, unequal probability, without replacement sampling designs. All of the estimators can be written as a generalized regression estimator with the Horvitz-Thompson, ratio, post-stratified, and regression estimators summarized by Sarndal et al. (1992, ISBN:978-0-387-40620-6). Two of the estimators employ a statistical learning model as the assisting model: the elastic net regression estimator, which is an extension of the lasso regression estimator given by McConville et al. (2017) <doi:10.1093/jssam/smw041>, and the regression tree estimator described in McConville and Toth (2017) <arXiv:1712.05708>. The variance estimators which approximate the joint inclusion probabilities can be found in Berger and Tille (2009) <doi:10.1016/S0169-7161(08)00002-3> and the bootstrap variance estimator is presented in Mashreghi et al. (2016) <doi:10.1214/16-SS113>.
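The simplest member of this family, the Horvitz-Thompson estimator, weights each sampled value by the inverse of its inclusion probability; a two-line base-R sketch with invented data:

    # Horvitz-Thompson estimator of a population total:
    # t_HT = sum over the sample of y_i / pi_i.
    y  <- c(12, 7, 30, 18)          # sampled values (example data)
    pi <- c(0.10, 0.05, 0.20, 0.10) # first-order inclusion probabilities
    t_ht <- sum(y / pi)             # estimated population total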
The general principle relies on calculating the cumulative signal of nascent RNA sequencing over the gene body of any given gene or transcription unit. tepr can identify transcription attenuation sites by comparing each profile to a null model that assumes uniform read density over the entirety of the transcription unit. It can also identify increased or diminished transcription attenuation by comparing two conditions. Besides rigorous statistical testing and high sensitivity, a major feature of tepr is its ability to provide the elongation pattern of each individual gene, including the position of the main attenuation point when such a phenomenon occurs. Using tepr, users can visualize and refine genome-wide aggregated analyses of elongation patterns to robustly identify effects specific to subsets of genes. These metrics are suitable for internal comparisons (between genes in each condition) and for studying elongation of the same gene in different conditions or comparing it to a perfectly uniform theoretical elongation.
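The comparison to the uniform null can be pictured as follows: under uniform read density, the cumulative signal along the gene body is a straight diagonal, so attenuation appears as a deviation from it. A conceptual sketch with toy coverage (not tepr's actual statistics):

    # Cumulative nascent-RNA signal along a gene body versus the
    # uniform-density null (a straight diagonal).
    reads <- c(rep(5, 60), rep(1, 40))            # toy coverage: drop at 60% of gene
    cum_obs  <- cumsum(reads) / sum(reads)        # observed cumulative profile
    cum_null <- seq_along(reads) / length(reads)  # uniform null
    knee <- which.max(cum_obs - cum_null)         # candidate attenuation point
    plot(cum_null, type = "l", lty = 2, ylab = "cumulative signal")
    lines(cum_obs)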
Spline regression, generalized additive models, and component-wise gradient boosting utilizing geometrically designed (GeD) splines. GeDS regression is a non-parametric method, inspired by geometric principles, for fitting spline regression models with variable knots in one or two independent variables. It efficiently estimates the number of knots and their positions, as well as the spline order, assuming the response variable follows a distribution from the exponential family. GeDS models fall within the broader category of generalized (non-)linear models, offering a flexible approach to modeling complex relationships. A description of the method can be found in Kaishev et al. (2016) <doi:10.1007/s00180-015-0621-7> and Dimitrova et al. (2023) <doi:10.1016/j.amc.2022.127493>. Further extending its capabilities, the GeDS implementation includes generalized additive models (GAM) and functional gradient boosting (FGB), enabling versatile multivariate predictor modeling, as discussed in the forthcoming work of Dimitrova et al. (2025).
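Fitting a GeD spline regression might look like the call below (function name and formula style as I recall them from the GeDS documentation; verify the signature and tuning parameters against the manual):

    # Assumed GeDS usage: normal-response GeD spline in one covariate.
    library(GeDS)
    dat <- data.frame(x = seq(0, 1, length.out = 200))
    dat$y <- sin(6 * dat$x) + rnorm(200, sd = 0.2)
    fit <- NGeDS(y ~ f(x), data = dat)   # variable-knot spline fit
    # plot(fit)                          # inspect the fitted knots and curve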
Provides the core functionality to transform longitudinal data to complex-time (kime) data using analytic and numerical techniques, to visualize the original time series and reconstructed kime-surfaces, and to perform the model-based (e.g., tensor-linear regression) and model-free classification and clustering methods described in the book Dinov, ID and Velev, MV (2021) "Data Science: Time Complexity, Inferential Uncertainty, and Spacekime Analytics", De Gruyter STEM Series, ISBN 978-3-11-069780-3, <https://www.degruyter.com/view/title/576646>. The package includes 18 core functions, which can be separated into three groups: 1) drawing longitudinal data, such as functional magnetic resonance imaging (fMRI) time series, and forecasting or transforming the time series data; 2) simulating real-valued time series data, e.g., fMRI time courses, detecting the activated areas, reporting the corresponding p-values, and visualizing the p-values in the 3D brain space; 3) Laplace transforms and kime-surface reconstructions of the fMRI data.
Dynamic CUR (dCUR) boosts the CUR decomposition (Mahoney, M.W. and Drineas, P. (2009) <doi:10.1073/pnas.0803205106>) by varying k, the number of columns and rows used, with the aim of finding the stage that minimizes the relative error and thereby reduces the matrix dimension. The goal of CUR decomposition is to give a more interpretable matrix decomposition through proper variable selection in the data matrix, in a way that yields a simplified structure. Its origins lie in genetic analysis, but this package offers an alternative for selecting variables (columns) or individuals (rows) in general data matrices. The proposed idea consists of fitting probability distributions to the leverage scores and selecting the best columns and rows, namely those minimizing the reconstruction error ||A-CUR|| of the matrix approximation. It also includes a method that recalibrates the relative importance of the leverage scores according to an external variable of the user's interest.
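The leverage-score machinery behind CUR can be sketched with base R's svd(): a column's leverage score is the mean of its squared entries across the top-k right singular vectors, and columns are then sampled or ranked by these scores. A compact sketch of column selection (this illustrates plain CUR, not dCUR's dynamic recalibration):

    # Column leverage scores from the top-k right singular vectors, then
    # keep the c columns with highest scores (deterministic variant).
    A <- matrix(rnorm(100 * 20), nrow = 100)
    k <- 3; c <- 5
    V <- svd(A)$v[, 1:k, drop = FALSE]            # top-k right singular vectors
    lev <- rowSums(V^2) / k                       # one leverage score per column of A
    C <- A[, order(lev, decreasing = TRUE)[1:c]]  # the "C" factor of CUR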
GPU/CPU benchmarking on Debian-package-based systems. This package benchmarks the performance of a few standard linear algebra operations (such as a matrix product and QR, SVD and LU decompositions) across a number of different BLAS libraries as well as a GPU implementation. To do so, it takes advantage of the ability to plug and play different BLAS implementations easily on a Debian and/or Ubuntu system. The current version supports: reference BLAS ('refblas'), which is un-accelerated, as a baseline; Atlas, which is tuned but typically configured single-threaded; Atlas39, which is tuned and configured for multi-threaded mode; GotoBLAS, which is accelerated and multi-threaded; and Intel MKL, a commercial accelerated and multi-threaded version. For GPU computing, the CRAN package gputools is used. For GotoBLAS, the gotoblas2-helper script from the ISM in Tokyo can be used. For Intel MKL, the Revolution R packages from Ubuntu 9.10 are used.