A multiple instance data set consists of many independent subjects (called bags), each composed of several components (called instances). The outcomes in such a data set are binary or categorical responses, and only the subject-level outcomes are observed. For example, in manufacturing processes, a subject is labeled "defective" if at least one of its components is defective, and "non-defective" otherwise. The milr package focuses on predictive modeling for multiple instance data sets with binary outcomes and performs maximum likelihood estimation with the Expectation-Maximization algorithm under the framework of logistic regression. Moreover, a LASSO penalty is added to the likelihood function for simultaneous parameter estimation and variable selection.
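The "at least one defective component" rule can be sketched as a likelihood. The following is a minimal illustration, not the milr API: the function names and the coefficient layout (intercept first) are assumptions made for this example, and it shows only the observed-data log-likelihood, not the EM fitting or the LASSO penalty.

```python
import math

def instance_prob(x, beta):
    """Logistic probability that a single instance is positive.
    beta[0] is the intercept (an assumed layout for this sketch)."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def bag_prob(bag, beta):
    """P(bag is positive) = 1 - prod_j (1 - p_j):
    a bag is positive if at least one of its instances is positive."""
    prob_all_negative = 1.0
    for x in bag:
        prob_all_negative *= 1.0 - instance_prob(x, beta)
    return 1.0 - prob_all_negative

def log_likelihood(bags, labels, beta):
    """Bernoulli log-likelihood of the observed bag-level labels."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for bag, y in zip(bags, labels):
        p = min(max(bag_prob(bag, beta), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Two bags, each holding instances with one covariate.
bags = [[[1.0], [2.0]], [[-1.0]]]
labels = [1, 0]
print(log_likelihood(bags, labels, beta=[0.0, 0.0]))
```

With all coefficients at zero, every instance has probability 0.5, so a bag of two instances is positive with probability 1 - 0.5 * 0.5 = 0.75.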
By providing several tools and templates, the MOSAIC project (DFG-Grant Number HO 1937/2-1) supports the implementation of central data management in epidemiological research projects. The MOQA package enables epidemiologists with little or no experience in R to generate basic data quality reports for a wide range of application scenarios. See <https://mosaic-greifswald.de/> for more information. Please read and cite the corresponding open access publication (using the former package-name) in METHODS OF INFORMATION IN MEDICINE by M. Bialke, H. Rau, T. Schwaneberg, R. Walk, T. Bahls and W. Hoffmann (2017) <doi:10.3414/ME16-01-0123>. <https://methods.schattauer.de/en/contents/most-recent-articles/issue/2483/issue/special/manuscript/27573/show.html>.
An updated and extended version of the spm package, introducing further novel functions for modern statistical methods (i.e., generalised linear models, glmnet, generalised least squares), thin plate splines, support vector machine, kriging methods (i.e., simple kriging, universal kriging, block kriging, kriging with an external drift), and novel hybrid methods (228 hybrids plus numerous variants) of modern statistical methods or machine learning methods with mathematical and/or univariate geostatistical methods for spatial predictive modelling. For each method, two functions are provided: one for assessing the predictive errors and accuracy of the method based on cross-validation, and the other for generating spatial predictions. It also contains a couple of functions for data preparation and predictive accuracy assessment.
Interaction between a genetic variant (e.g., a single nucleotide polymorphism) and an environmental variable (e.g., physical activity) can have a shared effect on multiple phenotypes (e.g., blood lipids). We implement a two-step method to test for an overall interaction effect on multiple phenotypes. In the first step, the method tests for an overall marginal genetic association between the genetic variant and the multivariate phenotype. The genetic variants that show evidence of a marginal overall genetic effect in the first step are prioritized when testing for an overall gene-environment interaction effect in the second step. Methodology is available from: A Majumdar, KS Burch, T Haldar, S Sankararaman, B Pasaniuc, WJ Gauderman, JS Witte (2020) <doi:10.1093/bioinformatics/btaa1083>.
Gradient boosting is a powerful statistical learning method known for its ability to model complex relationships between predictors and outcomes while performing inherent variable selection. However, traditional gradient boosting methods lack flexibility in handling longitudinal data where within-subject correlations play a critical role. In this package, we propose a novel approach, Mixed Effect Gradient Boosting ('MEGB'), designed specifically for high-dimensional longitudinal data. MEGB incorporates a flexible semi-parametric model that embeds random effects within the gradient boosting framework, allowing it to account for within-individual covariance over time. Additionally, the method efficiently handles scenarios where the number of predictors greatly exceeds the number of observations (p>>n), making it particularly suitable for genomics data and other large-scale biomedical studies.
piRNAs (short for PIWI-interacting RNAs) and their PIWI protein partners play a key role in fertility and maintaining genome integrity by restricting mobile genetic elements (transposons) in germ cells. piRNAs originate from genomic regions known as piRNA clusters. The piRNA Cluster Builder (PICB) is a versatile toolkit designed to identify genomic regions with a high density of piRNAs. It constructs piRNA clusters through a stepwise integration of unique and multimapping piRNAs and offers wide-ranging parameter settings, supported by an optimization function that allows users to test different parameter combinations to tailor the analysis to their specific piRNA system. The output includes extensive metadata columns, enabling researchers to rank clusters and extract cluster characteristics.
Package contains functions for analyzing check-all-that-apply (CATA) data from consumer and sensory tests. Cochran's Q test, McNemar's test, and Penalty-Lift analysis are provided; for details, see Meyners, Castura & Carr (2013) <doi:10.1016/j.foodqual.2013.06.010>. Cluster analysis can be performed using b-cluster analysis, then evaluated using various measures; for details, see Castura, Meyners, Varela & Næs (2022) <doi:10.1016/j.foodqual.2022.104564>. Consumers can also be clustered on their product-related hedonic responses; see Castura, Meyners, Pohjanheimo, Varela & Næs (2023) <doi:10.1111/joss.12860>. Permutation tests based on the L1-norm methods are provided; for details, see Chaya, Castura & Greenacre (2025) <doi:10.1016/j.foodqual.2025.105639>.
Supplements for a book, "iTOS" = "Introduction to the Theory of Observational Studies." Data sets are aHDL from Rosenbaum (2023a) <doi:10.1111/biom.13558> and bingeM from Rosenbaum (2023b) <doi:10.1111/biom.13921>. The function makematch() uses two-criteria matching from Zhang et al. (2023) <doi:10.1080/01621459.2021.1981337> to create the matched data bingeM from binge. The makematch() function also implements optimal matching (Rosenbaum (1989) <doi:10.2307/2290079>) and matching with fine or near-fine balance (Rosenbaum et al. (2007) <doi:10.1198/016214506000001059> and Yang et al. (2012) <doi:10.1111/j.1541-0420.2011.01691.x>). The book makes use of two other R packages, weightedRank and tightenBlock.
Plug-in and difference-based long-run covariance matrix estimation for time series regression. Two applications of hypothesis testing are also provided. The first one is for testing for structural stability in coefficient functions. The second one is aimed at detecting long memory in time series regression. References: Lujia Bai and Weichi Wu (2024) <doi:10.3150/23-BEJ1680>; Zhou Zhou and Wei Biao Wu (2010) <doi:10.1111/j.1467-9868.2010.00743.x>; Jianqing Fan and Wenyang Zhang <doi:10.1214/aos/1017939139>; Lujia Bai and Weichi Wu (2024) <doi:10.1093/biomet/asae013>; Dimitris N. Politis, Joseph P. Romano, Michael Wolf (1999) <doi:10.1007/978-1-4612-1554-7>; Weichi Wu and Zhou Zhou (2018) <doi:10.1214/17-AOS1582>.
Fit penalized splines mixed-effects models (a special case of additive models) for large longitudinal datasets. The package includes a psme() function that (1) relies on package mgcv for constructing population and subject smooth functions as penalized splines, (2) transforms the constructed additive model to a linear mixed-effects model, (3) exploits package lme4 for model estimation and (4) backtransforms the estimated linear mixed-effects model to the additive model for interpretation and visualization. See Pedersen et al. (2019) <doi:10.7717/peerj.6876> and Bates et al. (2015) <doi:10.18637/jss.v067.i01> for an introduction. Unlike the gamm() function in mgcv, the psme() function is fast and memory-efficient, able to handle datasets with millions of observations.
Training of neural networks for classification and regression tasks using mini-batch gradient descent. Special features include a function for training autoencoders, which can be used to detect anomalies, and some related plotting functions. Multiple activation functions are supported, including tanh, relu, step and ramp. For the use of the step and ramp activation functions in detecting anomalies using autoencoders, see Hawkins et al. (2002) <doi:10.1007/3-540-46145-0_17>. Furthermore, several loss functions are supported, including robust ones such as Huber and pseudo-Huber loss, as well as L1 and L2 regularization. The possible options for optimization algorithms are RMSprop, Adam and SGD with momentum. The package contains a vectorized C++ implementation that facilitates fast training through mini-batch learning.
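The robust losses mentioned above can be illustrated with their standard textbook definitions. This is a minimal sketch only: the function names and the threshold parameter `delta` are assumptions for this example and say nothing about how the package parameterizes or implements these losses.

```python
import math

def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)

def pseudo_huber(residual, delta=1.0):
    """Pseudo-Huber loss: a smooth approximation of the Huber loss."""
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)

# A large residual is penalized far less than under squared error,
# which is what makes these losses robust to outliers.
for r in (0.5, 3.0, 10.0):
    print(r, 0.5 * r ** 2, huber(r), pseudo_huber(r))
```

For small residuals both losses behave like half the squared error; for large residuals they grow roughly linearly.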
Mixed models for repeated measures (MMRM) are a popular choice for analyzing longitudinal continuous outcomes in randomized clinical trials and beyond; see Cnaan, Laird and Slasor (1997) <doi:10.1002/(SICI)1097-0258(19971030)16:20%3C2349::AID-SIM667%3E3.0.CO;2-E> for a tutorial and Mallinckrodt, Lane, Schnell, Peng and Mancuso (2008) <doi:10.1177/009286150804200402> for a review. This package implements MMRM based on the marginal linear model without random effects using Template Model Builder ('TMB') which enables fast and robust model fitting. Users can specify a variety of covariance matrices, weight observations, fit models with restricted or standard maximum likelihood inference, perform hypothesis testing with Satterthwaite or Kenward-Roger adjustment, and extract least squares means estimates by using emmeans.
Parameter estimation and classification for Gaussian Mixture Models (GMMs) in the presence of missing data. This package complements existing implementations by allowing for both missing elements in the input vectors and full (as opposed to strictly diagonal) covariance matrices. Estimation is performed using an expectation conditional maximization algorithm that accounts for missingness of both the cluster assignments and the vector components. The output includes the marginal cluster membership probabilities; the mean and covariance of each cluster; the posterior probabilities of cluster membership; and a completed version of the input data, with missing values imputed to their posterior expectations. For additional details, please see McCaw ZR, Julienne H, Aschard H. "Fitting Gaussian mixture models on incomplete data." <doi:10.1186/s12859-022-04740-9>.
This package contains functions for the construction and visualization of various families of proximity catch digraphs (PCDs) (see Ceyhan (2005) ISBN:978-3-639-19063-2), and for computing the graph invariants used to test the patterns of segregation and association against complete spatial randomness (CSR) or uniformity in one-, two- and three-dimensional cases. The package also has tools for generating points from these spatial patterns. The graph invariants used in testing spatial point data are the domination number (Ceyhan (2011) <doi:10.1080/03610921003597211>) and arc density (Ceyhan et al. (2006) <doi:10.1016/j.csda.2005.03.002>; Ceyhan et al. (2007) <doi:10.1002/cjs.5550350106>). The PCD families considered are Arc-Slice PCDs, Proportional-Edge PCDs, and Central Similarity PCDs.
Most price indexes are made with a two-step procedure, where period-over-period elementary indexes are first calculated for a collection of elementary aggregates at each point in time, and then aggregated according to a price index aggregation structure. These indexes can then be chained together to form a time series that gives the evolution of prices with respect to a fixed base period. This package contains a collection of functions that revolve around this work flow, making it easy to build standard price indexes, and implement the methods described by Balk (2008, <doi:10.1017/CBO9780511720758>), von der Lippe (2007, <doi:10.3726/978-3-653-01120-3>), and the CPI manual (2020, <doi:10.5089/9781484354841.069>) for bilateral price indexes.
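The two-step workflow described above can be sketched as follows. This is an illustration under stated assumptions, not the package's interface: the function names are hypothetical, the elementary index is taken to be a Jevons index (geometric mean of price relatives), and aggregation is a weighted arithmetic mean, which is one common choice among several.

```python
import math

def jevons(prices_prev, prices_curr):
    """Period-over-period elementary index: geometric mean of price relatives."""
    relatives = [curr / prev for prev, curr in zip(prices_prev, prices_curr)]
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))

def aggregate(elementary, weights):
    """Weighted arithmetic mean of elementary indexes (assumed aggregation rule)."""
    total = sum(weights[k] for k in elementary)
    return sum(weights[k] * elementary[k] for k in elementary) / total

def chain(period_indexes):
    """Chain period-over-period indexes into a fixed-base series (base = 1)."""
    level, series = 1.0, []
    for index in period_indexes:
        level *= index
        series.append(level)
    return series

# Step 1: elementary indexes for two elementary aggregates in one period.
elementary = {
    "food": jevons([1.0, 4.0], [2.0, 2.0]),
    "fuel": jevons([10.0], [12.0]),
}
# Step 2: aggregate with expenditure weights, then chain over periods.
top_level = aggregate(elementary, {"food": 3.0, "fuel": 1.0})
print(chain([top_level, 1.02]))
```

Chaining by cumulative product is what turns the period-over-period results into a time series relative to a fixed base period.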
Data science methods used in wind energy applications. Current functionalities include creating a multi-dimensional power curve model, performing power curve function comparison, covariate matching, and energy decomposition. Relevant works for the developed functions are: funGP() - Prakash et al. (2022) <doi:10.1080/00401706.2021.1905073>, AMK() - Lee et al. (2015) <doi:10.1080/01621459.2014.977385>, tempGP() - Prakash et al. (2022) <doi:10.1080/00401706.2022.2069158>, ComparePCurve() - Ding et al. (2021) <doi:10.1016/j.renene.2021.02.136>, deltaEnergy() - Latiffianti et al. (2022) <doi:10.1002/we.2722>, syncSize() - Latiffianti et al. (2022) <doi:10.1002/we.2722>, imptPower() - Latiffianti et al. (2022) <doi:10.1002/we.2722>, All other functions - Ding (2019, ISBN:9780429956508).
This package provides deterministic forecasting for weekly, monthly, quarterly, and yearly time series using the Generalized Adaptive Capped Estimator. The method includes preprocessing for missing and extreme values, extraction of multiple growth components (including long-term, short-term, rolling, and drift-based signals), volatility-aware asymmetric capping, optional seasonal adjustment via damped and normalized seasonal factors, and a recursive forecast formulation with moderated growth. The package includes a user-facing forecasting interface and a plotting helper for visualization. Related forecasting background is discussed in Hyndman and Athanasopoulos (2021) <https://otexts.com/fpp3/> and Hyndman and Khandakar (2008) <doi:10.18637/jss.v027.i03>. The method extends classical extrapolative forecasting approaches and is suited for operational and business planning contexts where stability and interpretability are important.
Maximum likelihood estimation of component lifetime parameters from system-level observations of k-out-of-n systems. Supports exponential and Weibull component distributions under multiple observation schemes: Scheme 0 (system lifetime only), Scheme 1 (periodic inspection), and Scheme 2 (complete monitoring). Provides an EM algorithm for Weibull parallel systems and Fisher information comparison across schemes. The k-out-of-n framework unifies series (k=1) and parallel (k=m) systems as a censoring problem on component lifetimes. Conforms to the likelihood.model generics and returns fitted objects compatible with algebraic.mle. The data-generating process and topology infrastructure (system survival, density, signature, structure function, importance measures) are delegated to the dist.structure package; kofn focuses exclusively on inference for the k-out-of-n family.
This package implements the Transcendental Algorithm for Mixtures of Distributions (TAMD), a penalized likelihood framework for fitting finite Gaussian mixture models. TAMD augments the Expectation-Maximization (EM) algorithm with analytic barrier terms built from the Hellinger affinity that diverge on the singular locus, actively preventing component coalescence and weight degeneracy. Provides the core TAMD fitting function, closed-form Hellinger affinity and gradient computations, the Transcendental Affinity Criterion (TAC) for geometry-aware model selection, the regularity index rho (a scalar diagnostic for mixture fit quality), and reproduction scripts for all simulation studies. Methods are described in Fokoue (2024) <doi:10.48550/arXiv.2602.03889>. See also Titterington, Smith and Makov (1985, ISBN:0-471-90510-4) and Watanabe (2009, ISBN:978-0-521-86408-7).
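For intuition, the Hellinger affinity (Bhattacharyya coefficient) between two univariate Gaussians has a well-known closed form. The sketch below computes it and shows how a barrier built from it can diverge as two components coalesce. The barrier `-log(1 - affinity)` is a hypothetical choice made purely for illustration; the actual TAMD barrier terms are defined in the cited reference and may differ.

```python
import math

def hellinger_affinity(mu1, sigma1, mu2, sigma2):
    """Closed-form Hellinger affinity between N(mu1, sigma1^2) and N(mu2, sigma2^2).
    Equals 1 when the two densities coincide and tends to 0 as they separate."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))

def illustrative_barrier(affinity):
    """Hypothetical barrier (not taken from the package): grows without
    bound as the affinity approaches 1, i.e. as two components merge."""
    return -math.log(1.0 - affinity)

# The barrier increases as the component means move closer together.
for gap in (2.0, 0.5, 0.1):
    a = hellinger_affinity(0.0, 1.0, gap, 1.0)
    print(gap, a, illustrative_barrier(a))
```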
Analysis of dyadic network and relational data using additive and multiplicative effects (AME) models. The basic model includes regression terms, the covariance structure of the social relations model (Warner, Kenny and Stoto (1979) <DOI:10.1037/0022-3514.37.10.1742>, Wong (1982) <DOI:10.2307/2287296>), and multiplicative factor models (Hoff (2009) <DOI:10.1007/s10588-008-9040-4>). Several different link functions accommodate different relational data structures, including binary/network data, normal relational data, zero-inflated positive outcomes using a tobit model, ordinal relational data and data from fixed-rank nomination schemes. Several of these link functions are discussed in Hoff, Fosdick, Volfovsky and Stovel (2013) <DOI:10.1017/nws.2013.17>. Development of this software was supported in part by NIH grant R01HD067509.
This package provides a set of functions to perform distribution-free Bayesian analyses. Included are Bayesian analogues to the frequentist Mann-Whitney U test, the Wilcoxon Signed-Ranks test, Kendall's Tau Rank Correlation Coefficient, Goodman and Kruskal's Gamma, McNemar's Test, the binomial test, the sign test, the median test, as well as distribution-free methods for testing contrasts among conditions and for computing Bayes factors for hypotheses. The package also includes procedures to estimate the power of distribution-free Bayesian tests based on data simulations using various probability models for the data. The set of functions provides data analysts with a set of Bayesian procedures that avoid parametric assumptions about measurement error and are robust to the problem of extreme outlier scores.
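As a generic illustration of what a Bayesian analogue of the sign test can look like, the sketch below computes a Bayes factor comparing H0: theta = 0.5 against H1: theta ~ Uniform(0, 1), where theta is the probability of a positive difference. The uniform prior, the function name, and the convention that ties are dropped beforehand are all assumptions for this example; the package's own priors and interface may differ.

```python
from math import comb

def sign_test_bf01(n_positive, n_total):
    """Bayes factor in favor of H0 (theta = 0.5) over H1 (theta ~ Uniform(0, 1)).
    Under H1 the marginal likelihood of the observed sign sequence is
    1 / ((n + 1) * C(n, k)); under H0 it is 0.5 ** n."""
    return 0.5 ** n_total * (n_total + 1) * comb(n_total, n_positive)

# Balanced signs favor H0; lopsided signs favor H1 (values below 1).
print(sign_test_bf01(5, 10))
print(sign_test_bf01(10, 10))
```

Values above 1 indicate support for no systematic direction of the differences, and values below 1 indicate support for a shift.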
This package provides a comprehensive framework for visualizing associations and interaction structures in matrix-formatted data using Generalized Association Plots (GAP). The package implements multiple proximity computation methods (e.g., correlation, distance metrics), ordering techniques including hierarchical clustering (HCT) and Rank-2-Ellipse (R2E) seriation, and optional flipping strategies to enhance visual symmetry. It supports a variety of covariate-based color annotations, allows flexible customization of layout and output, and is suitable for analyzing multivariate data across domains such as social sciences, genomics, and medical research. The method is based on Generalized Association Plots introduced by Chen (2002) <https://www3.stat.sinica.edu.tw/statistica/J12N1/J12N11/J12N11.html> and further extended by Wu, Tien, and Chen (2010) <doi:10.1016/j.csda.2008.09.029>.
Nested-loop cross-validation for estimating the misclassification error rate of classifiers. The package supports several methodologies for feature selection (random forest, Student t-test, limma) and provides an interface to the following classification methods in the MLInterfaces package: linear and quadratic discriminant analyses, random forest, bagging, prediction analysis for microarray, generalized linear model, support vector machine (svm and ksvm). Visualizations to assess the quality of the classifier are included: plot of the ranks of the features, scores plot for a specific classification algorithm and number of features, misclassification rate for the different numbers of features and classification algorithms tested, and ROC plot. For further details about the methodology, please check: Markus Ruschhaupt, Wolfgang Huber, Annemarie Poustka, and Ulrich Mansmann (2004) <doi:10.2202/1544-6115.1078>.
This package provides a collection of functions to perform core tasks within Energy Trading and Risk Management (ETRM). Calculation of maximum smoothness forward price curves for electricity and natural gas contracts with flow delivery, as presented in F. E. Benth, S. Koekebakker, and F. Ollmar (2007) <doi:10.3905/jod.2007.694791> and F. E. Benth, J. S. Benth, and S. Koekebakker (2008) <doi:10.1142/6811>. Portfolio insurance trading strategies for price risk management in the forward market, see F. Black (1976) <doi:10.1016/0304-405X(76)90024-6>, T. Bjork (2009) <https://EconPapers.repec.org/RePEc:oxp:obooks:9780199574742>, F. Black and R. W. Jones (1987) <doi:10.3905/jpm.1987.409131> and H. E. Leland (1980) <http://www.jstor.org/stable/2327419>.
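One of the portfolio insurance ideas cited above, constant proportion portfolio insurance (Black and Jones, 1987), can be sketched with its generic rebalancing rule: hold a risky exposure equal to a multiplier times the cushion above a floor. This is a simplified illustration and not the package's implementation: the function name is hypothetical, interest on the safe asset and transaction costs are ignored, and leverage is capped at the portfolio value.

```python
def cppi_path(prices, initial_value, floor, multiplier):
    """Simulate portfolio values under a basic CPPI rule.
    At each step: cushion = value - floor, exposure = multiplier * cushion
    (capped at the current value, so no borrowing is assumed)."""
    value = initial_value
    values = [value]
    for prev, curr in zip(prices, prices[1:]):
        cushion = max(value - floor, 0.0)
        exposure = min(multiplier * cushion, value)
        value += exposure * (curr / prev - 1.0)  # gain or loss on the risky part
        values.append(value)
    return values

# With floor 80 and multiplier 2, the initial exposure is 2 * (100 - 80) = 40.
print(cppi_path([100.0, 110.0, 99.0], initial_value=100.0, floor=80.0, multiplier=2.0))
```

As the portfolio value approaches the floor, the cushion and therefore the risky exposure shrink toward zero, which is the mechanism that limits downside risk.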