All datasets and functions required for the examples and exercises of the book "Data Science for Psychologists" (by Hansjoerg Neth, Konstanz University, 2022), available at <https://bookdown.org/hneth/ds4psy/>. The book and course introduce principles and methods of data science to students of psychology and other biological or social sciences. The ds4psy package primarily provides datasets, but also functions for data generation and manipulation (e.g., of text and time data) and graphics that are used in the book and its exercises. All functions included in ds4psy are designed to be explicit and instructive, rather than efficient or elegant.
This package infers state-recorded gender categories from first names and dates of birth using historical datasets. Because it draws on these datasets rather than static lists of male and female names, the package can infer the gender of a name more accurately and report the probability that a name was recorded as male or female. GUIDELINES: This method must be used cautiously and responsibly. Please be sure to see the guidelines and warnings about usage in the README or the package documentation. See Blevins and Mullen (2015) <http://www.digitalhumanities.org/dhq/vol/9/3/000223/000223.html>.
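A minimal usage sketch, assuming the package's gender() function with the Social Security Administration ("ssa") method; the underlying name data ship in the separate genderdata package, and the column names shown follow the package documentation as I recall it:

    # Hedged sketch: gender() with the "ssa" method; requires the genderdata package for the underlying data.
    library(gender)
    res <- gender("leslie", years = c(1930, 1950), method = "ssa")
    res$proportion_female   # probability the name was recorded as female in that period
    res$proportion_male     # probability the name was recorded as male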
Type I error and optimal critical values for testing statistical hypotheses, based on the Neyman-Pearson lemma and the likelihood ratio test, using random samples from several distributions. The families of distributions are Bernoulli, Exponential, Geometric, Inverse Normal, Normal, Gamma, Gumbel, Lognormal, Poisson, and Weibull. This package is an ideal resource to help with the teaching of statistics. The main references for this package are Casella, G. and Berger, R. (2003, ISBN:0-534-24312-6, "Statistical Inference. Second Edition", Duxbury Press) and Hogg, R., McKean, J., and Craig, A. (2019, ISBN:013468699, "Introduction to Mathematical Statistics. Eighth Edition", Pearson).
Helps with the thoughtful saving, reading, and management of result files (using rds files). The core functions take a list of parameters that are used to generate a unique hash to save results under. Then, the same parameter list can be used to read those results back in. This is helpful to avoid clunky file naming when running a large number of simulations. Additionally, helper functions are available for compiling a flat file of parameters of saved results, monitoring result usage, and cleaning up unwanted or unused results. For more information, visit the indexr homepage <https://lharris421.github.io/indexr/>.
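A hedged sketch of the save-then-read workflow described above; the function names save_objects() and read_objects() and their argument order are assumptions inferred from the description, not a verified API, so check the package documentation:

    # Hedged sketch -- function names and argument order are assumptions, not the verified indexr API.
    library(indexr)
    params  <- list(n = 1000, beta = 0.5, method = "lasso")  # parameter list identifying one simulation run
    results <- list(estimate = 0.48, se = 0.02)              # results to store under a hash of params
    save_objects("results_dir", results, params)             # write results under the parameter hash
    same_run <- read_objects("results_dir", params)          # retrieve them later with the same list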
This package implements a methodology for the design and analysis of dose-response studies that combines aspects of multiple comparison procedures and modeling approaches (Bretz, Pinheiro and Branson, 2005, Biometrics 61, 738-748, <doi:10.1111/j.1541-0420.2005.00344.x>). The package provides tools for the analysis of dose-finding trials as well as a variety of tools necessary to plan a trial to be conducted with the MCP-Mod methodology. Please note: the MCPMod package will not be developed further; all future development of the MCP-Mod methodology will take place in the DoseFinding R package.
This package provides a novel method for cancer subtyping and subtype-specific drug target identification via non-negative matrix tri-factorization. To improve interpretability, we introduce orthogonality constraints on the row and column coefficient matrices. To reflect the prior knowledge that each subtype should be strongly associated with only a few gene sets, we introduce a sparsity constraint on the association sub-matrix. The average residue was introduced to evaluate the numbers of row and column clusters. This is part of the work "Liver Cancer Analysis via Orthogonal Sparse Non-Negative Matrix Tri-Factorization", which will be submitted to BBRC.
This package provides functionality to fit a zero-inflated estimator for small area estimation. This estimator combines a linear mixed effects regression model and a logistic mixed effects regression model via a two-stage modeling approach. The estimator's mean squared error is estimated via a parametric bootstrap method. Chandra and others (2012, <doi:10.1080/03610918.2011.598991>) introduce and describe this estimator and its mean squared error estimator. White and others (2024+, <doi:10.48550/arXiv.2402.03263>) describe the applicability of this estimator to the estimation of forest attributes and further assess the estimator's properties.
This package provides a suite of utilities for working with the UK Biobank <https://www.ukbiobank.ac.uk/> Nuclear Magnetic Resonance spectroscopy (NMR) metabolomics data <https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220>. Includes functions for extracting biomarkers from decoded UK Biobank field data, removing unwanted technical variation from biomarker concentrations, computing an extended set of lipid, fatty acid, and cholesterol fractions, and for re-deriving composite biomarkers and ratios after adjusting data for unwanted biological variation. For further details on methods see Ritchie SC et al. Sci Data (2023) <doi:10.1038/s41597-023-01949-y>.
DEsubs is a network-based systems biology package that extracts disease-perturbed subpathways within a pathway network as recorded by RNA-seq experiments. It contains an extensive and customizable framework covering a broad range of operation modes at all stages of the subpathway analysis, enabling a case-specific approach. The operation modes refer to the pathway network construction and processing, the subpathway extraction, visualization and enrichment analysis with regard to various biological and pharmacological features. Its capabilities render it a tool-guide for both the modeler and experimentalist for the identification of more robust systems-level biomarkers for complex diseases.
Implementation of Bayesian multi-task regression models, developed within the context of imaging genetics. The package can currently fit two models. The Bayesian group sparse multi-task regression model of Greenlaw et al. (2017) <doi:10.1093/bioinformatics/btx215> can be fit using Gibbs sampling. An extension of this model developed by Song, Ge et al., which accommodates both spatial correlation and correlation across brain hemispheres, can be fit using either mean-field variational Bayes or Gibbs sampling. The model can also be used more generally for multivariate (non-imaging) phenotypes with spatial correlation.
Computed tomography (CT) imaging is a powerful tool for understanding the composition of sediment cores. This package streamlines and accelerates the analysis of CT data generated in the context of environmental science. Included are tools for processing raw DICOM images to characterize sediment composition (sand, peat, etc.). Root analyses are also enabled, including measures of external surface area and volumes for user-defined root size classes. For a detailed description of the application of computed tomography imaging for sediment characterization, see: Davey, E., C. Wigand, R. Johnson, K. Sundberg, J. Morris, and C. Roman. (2011) <doi:10.1890/10-2037.1>.
The production of certified reference materials (CRMs) requires various statistical tests, depending on the task and the recorded data, to ensure that the reported values of CRMs are appropriate. Often these tests are performed according to the procedures described in ISO GUIDE 35:2017. The eCerto package contains a Shiny app which provides functionality to load, process, report, and back up data recorded during CRM production and facilitates following the recommended procedures. It is described in Lisec et al. (2023) <doi:10.1007/s00216-023-05099-3> and can also be accessed online <https://apps.bam.de/shn00/eCerto/> without package installation.
Implementations of the expected shortfall backtests of Bayer and Dimitriadis (2020) <doi:10.1093/jjfinec/nbaa013> as well as other well-known backtests from the literature. Can be used to assess the correctness of forecasts of the expected shortfall risk measure, which is used, e.g., in the banking and finance industry for quantifying the market risk of investments. A special feature of the backtests of Bayer and Dimitriadis (2020) <doi:10.1093/jjfinec/nbaa013> is that they only require forecasts of the expected shortfall, in striking contrast to all other existing backtests, making them particularly attractive for practitioners.
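A hedged illustration of how such a backtest might be called; the function name esr_backtest() and its arguments (returns r, expected shortfall forecasts e, level alpha, test version) are assumptions and should be checked against the package documentation:

    # Hedged sketch -- function name and arguments are assumptions; check the esback documentation.
    library(esback)
    set.seed(1)
    r <- rnorm(250, mean = 0, sd = 0.01)                     # simulated daily returns
    e <- rep(-0.01 * dnorm(qnorm(0.025)) / 0.025, 250)       # constant normal ES forecast at the 2.5% level
    esr_backtest(r = r, e = e, alpha = 0.025, version = 1)   # ESR backtest of Bayer & Dimitriadis (assumed interface)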
Process automation of point cloud data derived from terrestrial-based technologies such as Terrestrial Laser Scanner (TLS) or Mobile Laser Scanner. FORTLS enables (i) detection of trees and estimation of tree-level attributes (e.g. diameters and heights), (ii) estimation of stand-level variables (e.g. density, basal area, mean and dominant height), (iii) computation of metrics related to important forest attributes estimated in Forest Inventories at stand level, and (iv) optimization of plot design for combining TLS data and field-measured data. Documentation about FORTLS is provided in Molina-Valero et al. (2022, <doi:10.1016/j.envsoft.2022.105337>).
This package provides a user-friendly Shiny application for Bayesian machine learning analysis of marine species distributions. GLOSSA (Global Species Spatiotemporal Analysis) uses Bayesian Additive Regression Trees (BART; Chipman, George, and McCulloch (2010) <doi:10.1214/09-AOAS285>) to model species distributions with intuitive workflows for data upload, processing, model fitting, and result visualization. It supports presence-absence and presence-only data (with pseudo-absence generation), spatial thinning, cross-validation, and scenario-based projections. GLOSSA is designed to facilitate ecological research by providing easy-to-use tools for analyzing and visualizing marine species distributions across different spatial and temporal scales.
Cellular responses to perturbations are highly heterogeneous and depend largely on the initial state of cells. Connecting post-perturbation cells via cellular trajectories to untreated cells (e.g. by leveraging metabolic labeling information) enables exploitation of intercellular heterogeneity as a combined knock-down and overexpression screen to identify pathway modulators, termed Heterogeneity-seq (see Berg et al. <doi:10.1101/2024.10.28.620481>). This package contains functions to generate cellular trajectories based on scSLAM-seq (single-cell, thiol-(SH)-linked alkylation of RNA for metabolic labelling sequencing) time courses, to identify pathway modulators, and to visualize the results.
Combine probabilistic forecasts using CRPS learning algorithms proposed in Berrisch and Ziel (2021) <doi:10.48550/arXiv.2102.00968> <doi:10.1016/j.jeconom.2021.11.008>. The package implements multiple online learning algorithms like Bernstein online aggregation; see Wintenberger (2014) <doi:10.48550/arXiv.1404.1356>. Quantile regression is also implemented for comparison purposes. Model parameters can be tuned automatically with respect to the loss of the forecast combination. Methods like predict(), update(), plot(), and print() are available for convenience. This package utilizes the optim C++ library for numeric optimization <https://github.com/kthohr/optim>.
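A hedged sketch of combining two quantile experts with the online() function; the argument names and the (time x quantile x expert) layout of the experts array follow my reading of the package examples and may need adjustment:

    # Hedged sketch -- argument names and array layout are assumptions; check the profoc documentation.
    library(profoc)
    set.seed(1)
    T_len <- 50
    tau <- 1:9 / 10                                  # probability grid
    y <- rnorm(T_len)                                # realized values
    experts <- array(NA, dim = c(T_len, 9, 2))       # time x quantile x expert
    for (t in 1:T_len) {
      experts[t, , 1] <- qnorm(tau, mean = -1)       # expert 1: biased low
      experts[t, , 2] <- qnorm(tau, mean =  1)       # expert 2: biased high
    }
    fit <- online(y = matrix(y), experts = experts, tau = tau)
    print(fit)
    plot(fit)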
Defines and includes a set of class-based templates for developing and implementing data processing and analysis workflows, with a strong emphasis on statistics and machine learning. The templates can be used and, where needed, extended to wrap tools and methods from other packages into a common standardised structure, allowing for effective and fast integration. Model objects can be combined into sequences, and sequences nested in iterators using overloaded operators, to simplify the code and improve its readability. Ontology lookup has been integrated and implemented to provide standardised definitions for methods, inputs, and outputs wrapped using the class-based templates.
This package provides various R programming tools for plotting data, including:
calculating and plotting locally smoothed summary functions
enhanced versions of standard plots
manipulating colors
calculating and plotting two-dimensional data summaries
enhanced regression diagnostic plots
formula-enabled interface to the stats::lowess function
displaying textual data in plots
balloon plots
plotting "Venn" diagrams
displaying Open-Office style plots
plotting multiple data on same region, with separate axes
plotting means and confidence intervals
spacing points in an x-y plot so they don't overlap
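To illustrate a few of the functions listed above (heatmap.2, plotmeans, and venn are part of gplots; the argument choices shown are one common usage, not the only one):

    # Examples of a few of the plotting tools listed above.
    library(gplots)
    heatmap.2(as.matrix(mtcars), scale = "column", trace = "none")  # enhanced heatmap
    plotmeans(mpg ~ factor(cyl), data = mtcars)                     # group means with confidence intervals
    venn(list(A = 1:6, B = 4:10, C = 7:12))                         # Venn diagram of three sets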
Interpretation of time series data is affected by model choices. Different models can give different or even contradicting estimates of patterns, trends, and mechanisms for the same data, a limitation alleviated by the Bayesian estimator of abrupt change, seasonality, and trend (BEAST) implemented in this package. BEAST seeks to improve time series decomposition by forgoing the "single-best-model" concept and embracing all competing models in the inference via a Bayesian model averaging scheme. It is a flexible tool to uncover abrupt changes (i.e., change-points, breakpoints, structural breaks, or join-points), cyclic variations (e.g., seasonality), and nonlinear trends in time-series observations. BEAST not only tells when changes occur but also quantifies how likely the detected changes are real. It detects not just piecewise linear trends but also arbitrary nonlinear trends. BEAST is applicable to real-valued time series data of all kinds, whether in remote sensing, economics, climate sciences, ecology, or hydrology. Example applications include identifying regime shifts in ecological data, mapping forest disturbance and land degradation from satellite imagery, detecting market trends in economic data, pinpointing anomalies and extreme events in climate data, and unraveling system dynamics in biological data. Details on BEAST are reported in Zhao et al. (2019) <doi:10.1016/j.rse.2019.04.034>.
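A minimal example, following the pattern in the package documentation as I recall it (trend-only changepoint detection on the Nile annual streamflow series, which ships with base R):

    # Minimal sketch: trend-only changepoint detection with BEAST.
    library(Rbeast)
    out <- beast(Nile, season = "none")   # no seasonal component in this annual series
    print(out)   # posterior summaries of changepoint number and locations
    plot(out)    # decomposed trend with changepoint probabilities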
The FBED and MMPC variable selection algorithms have been implemented using the distance correlation. The references include: Tsamardinos, I., Aliferis, C. F. and Statnikov, A. (2003). "Time and sample efficient discovery of Markov blankets and direct causal relations". In Proceedings of the ninth ACM SIGKDD International Conference. <doi:10.1145/956750.956838>. Borboudakis, G. and Tsamardinos, I. (2019). "Forward-backward selection with early dropping". Journal of Machine Learning Research, 20(8): 1--39. <doi:10.48550/arXiv.1705.10770>. Huo, X. and Szekely, G. J. (2016). "Fast computing for distance covariance". Technometrics, 58(4): 435--447. <doi:10.1080/00401706.2015.1054435>.
Distributed Online Mean Tests is a powerful tool designed to efficiently process and analyze distributed datasets. It enables users to perform mean tests in an online, distributed manner, making it highly suitable for large-scale data analysis. By leveraging advanced computational techniques, Domean ensures robust and scalable solutions for statistical analysis, particularly in scenarios where data is dispersed across multiple nodes or sources. This package is ideal for researchers and practitioners working with high-dimensional data, providing a flexible and efficient framework for mean testing. The philosophy of Domean is described in Guo, G. (2025) <doi:10.1016/j.physa.2024.130308>.
This package contains two functions that are intended to make tuning supervised learning methods easy. The eztune function uses a genetic algorithm or Hooke-Jeeves optimizer to find the best set of tuning parameters. The user can choose the optimizer, the learning method, and whether optimization will be based on accuracy obtained through a validation set, cross validation, or resubstitution. The function eztune_cv will compute a cross validated error rate. The purpose of eztune_cv is to provide a cross validated accuracy or MSE when resubstitution or validation data are used for optimization, because error measures from both approaches can be misleading.
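A hedged sketch of the two functions; the argument names (method, optimizer) and the returned accuracy element follow my recollection of the package and should be verified against its documentation:

    # Hedged sketch -- argument names and return structure are assumptions; check the EZtune documentation.
    library(EZtune)
    x <- mtcars[, c("mpg", "hp", "wt", "qsec")]
    y <- mtcars$am                                           # binary response (0/1)
    fit <- eztune(x, y, method = "svm", optimizer = "hjn")   # tune an SVM with Hooke-Jeeves
    fit$accuracy                                             # accuracy used during optimization
    eztune_cv(x, y, fit)                                     # cross-validated accuracy for the tuned model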
The kernelized version of principal component analysis (KPCA) has proven to be a valid nonlinear alternative for tackling the nonlinearity of biological sample spaces. However, it poses new challenges in terms of the interpretability of the original variables. kpcaIG aims to provide a tool to select the most relevant variables based on the kernel PCA representation of the data, as in Briscik et al. (2023) <doi:10.1186/s12859-023-05404-y>. It also includes functions for 2D and 3D visualization of the original variables (as arrows) in the kernel principal component axes, highlighting the contribution of the most important ones.