Implementing the Block Coordinate Ascent with One-Step Generalized Rosen (BCA1SG) algorithm on the semiparametric models for panel count data, interval-censored survival data, and degradation data. A comprehensive description of the BCA1SG algorithm can be found in Wang et al. (2020) <https://github.com/yudongstat/BCA1SG/blob/master/BCA1SG.pdf>. For details of the semiparametric models for panel count data, interval-censored survival data, and degradation data, please see Wellner and Zhang (2007) <doi:10.1214/009053607000000181>, Huang and Wellner (1997) <ISBN:978-0-387-94992-5>, and Wang and Xu (2010) <doi:10.1198/TECH.2009.08197>, respectively.
All datasets and functions required for the examples and exercises of the book "Data Science for Psychologists" (by Hansjoerg Neth, Konstanz University, 2022), available at <https://bookdown.org/hneth/ds4psy/>. The book and course introduce principles and methods of data science to students of psychology and other biological or social sciences. The ds4psy package primarily provides datasets, but also functions for data generation and manipulation (e.g., of text and time data) and graphics that are used in the book and its exercises. All functions included in ds4psy are designed to be explicit and instructive, rather than efficient or elegant.
This package infers state-recorded gender categories from first names and dates of birth using historical datasets. By using these datasets instead of lists of male and female names, this package is able to more accurately infer the gender of a name, and it is able to report the probability that a name was male or female. GUIDELINES: This method must be used cautiously and responsibly. Please be sure to see the guidelines and warnings about usage in the README or the package documentation. See Blevins and Mullen (2015) <http://www.digitalhumanities.org/dhq/vol/9/3/000223/000223.html>.
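The idea of reporting a probability from historical counts, rather than looking a name up in a fixed list, can be sketched as follows. This is only an illustration, not the package's implementation: the `TOY_COUNTS` table and the `proportion_female` helper are hypothetical, and the numbers are made up.

```python
# Hypothetical toy data: (name, birth_year) -> (female_count, male_count).
# Real historical datasets are far larger; these values are invented.
TOY_COUNTS = {
    ("leslie", 1900): (20, 80),
    ("leslie", 1950): (50, 50),
    ("leslie", 2000): (90, 10),
}


def proportion_female(name, years, counts=TOY_COUNTS):
    """Share of records with this name, over the given birth years, recorded as female.

    Returns None when the name does not occur in the given years.
    """
    female = male = 0
    for year in years:
        f, m = counts.get((name.lower(), year), (0, 0))
        female += f
        male += m
    total = female + male
    if total == 0:
        return None
    return female / total
```

Pooling counts over a range of birth years shows why the date of birth matters: in the toy data the same name gives very different proportions for 1900 and for 2000.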
Type I error and optimal critical values for testing statistical hypotheses, based on the Neyman-Pearson lemma and the likelihood ratio test, using random samples from several distributions. The families of distributions are Bernoulli, Exponential, Geometric, Inverse Normal, Normal, Gamma, Gumbel, Lognormal, Poisson, and Weibull. This package is an ideal resource to help with the teaching of Statistics. The main references for this package are Casella G. and Berger R. (2003, ISBN:0-534-24312-6, "Statistical Inference. Second Edition", Duxbury Press) and Hogg, R., McKean, J., and Craig, A. (2019, ISBN:013468699, "Introduction to Mathematical Statistics. Eighth edition", Pearson).
Helps with the thoughtful saving, reading, and management of result files (using rds files). The core functions take a list of parameters that are used to generate a unique hash to save results under. Then, the same parameter list can be used to read those results back in. This is helpful to avoid clunky file naming when running a large number of simulations. Additionally, helper functions are available for compiling a flat file of parameters of saved results, monitoring result usage, and cleaning up unwanted or unused results. For more information, visit the indexr homepage <https://lharris421.github.io/indexr/>.
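A minimal sketch of the hash-based workflow described above, written in Python rather than R and using pickle files instead of rds files. The function names `params_hash`, `save_result`, and `read_result` are hypothetical and are not the package's API; the sketch only assumes that parameters can be serialized to JSON.

```python
import hashlib
import json
import pickle
from pathlib import Path


def params_hash(params: dict) -> str:
    """Return a stable short hash for a parameter dictionary."""
    # Sorting keys makes the hash independent of the order parameters were given in.
    canonical = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def save_result(result, params: dict, folder: Path) -> Path:
    """Save a result under a file name derived from its parameters."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{params_hash(params)}.pkl"
    with path.open("wb") as fh:
        # Storing the parameters next to the result keeps the file self-describing.
        pickle.dump({"params": params, "result": result}, fh)
    return path


def read_result(params: dict, folder: Path):
    """Read back the result that was saved with the same parameters."""
    path = folder / f"{params_hash(params)}.pkl"
    with path.open("rb") as fh:
        return pickle.load(fh)["result"]
```

Because the file name is derived from the parameters, a simulation script never has to build names by hand: passing the same parameter dictionary to `read_result` locates the file that `save_result` wrote.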
Implementation of several phenotype-based family genetic risk scores with unified input data and data preparation functions to help facilitate the required data preparation and management. The implemented family genetic risk scores are the extended liability threshold model conditional on family history (LT-FH++) from Pedersen (2022) <doi:10.1016/j.ajhg.2022.01.009> and Pedersen (2023) <https://www.nature.com/articles/s41467-023-41210-z>, Pearson-Aitken Family Genetic Risk Scores (PA-FGRS) from Krebs (2024) <doi:10.1016/j.ajhg.2024.09.009>, and family genetic risk score by Kendler (2021) <doi:10.1001/jamapsychiatry.2021.0336>.
This package implements a methodology for the design and analysis of dose-response studies that combines aspects of multiple comparison procedures and modeling approaches (Bretz, Pinheiro and Branson, 2005, Biometrics 61, 738-748, <doi:10.1111/j.1541-0420.2005.00344.x>). The package provides tools for the analysis of dose finding trials as well as a variety of tools necessary to plan a trial to be conducted with the MCP-Mod methodology. Please note: The MCPMod package will not be further developed; all future development of the MCP-Mod methodology will be done in the DoseFinding R-package.
This package provides a novel method for cancer subtyping and subtype-specific drug target identification via non-negative matrix tri-factorization. To improve interpretability, we impose an orthogonality constraint on the row coefficient matrix and the column coefficient matrix. To reflect the prior knowledge that each subtype should be strongly associated with only a few gene sets, we impose a sparsity constraint on the association sub-matrix. The average residue is used to evaluate the row and column cluster numbers. This is part of the work "Liver Cancer Analysis via Orthogonal Sparse Non-Negative Matrix Tri-Factorization" which will be submitted to BBRC.
This package provides functionality to fit a zero-inflated estimator for small area estimation. This estimator combines a linear mixed effects regression model and a logistic mixed effects regression model via a two-stage modeling approach. The estimator's mean squared error is estimated via a parametric bootstrap method. Chandra and others (2012, <doi:10.1080/03610918.2011.598991>) introduce and describe this estimator and mean squared error estimator. White and others (2024+, <doi:10.48550/arXiv.2402.03263>) describe the applicability of this estimator to estimation of forest attributes and further assess the estimator's properties.
This package provides a suite of utilities for working with the UK Biobank <https://www.ukbiobank.ac.uk/> Nuclear Magnetic Resonance spectroscopy (NMR) metabolomics data <https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220>. Includes functions for extracting biomarkers from decoded UK Biobank field data, removing unwanted technical variation from biomarker concentrations, computing an extended set of lipid, fatty acid, and cholesterol fractions, and for re-deriving composite biomarkers and ratios after adjusting data for unwanted biological variation. For further details on methods see Ritchie SC et al. Sci Data (2023) <doi:10.1038/s41597-023-01949-y>.
This package provides various R programming tools for plotting data, including:
calculating and plotting locally smoothed summary function
enhanced versions of standard plots
manipulating colors
calculating and plotting two-dimensional data summaries
enhanced regression diagnostic plots
formula-enabled interface to the stats::lowess function
displaying textual data in plots
balloon plots
plotting "Venn" diagrams
displaying Open-Office style plots
plotting multiple data on same region, with separate axes
plotting means and confidence intervals
spacing points in an x-y plot so they don't overlap
Interpretation of time series data is affected by model choices. Different models can give different or even contradicting estimates of patterns, trends, and mechanisms for the same data--a limitation alleviated by the Bayesian estimator of abrupt change, seasonality, and trend (BEAST) of this package. BEAST seeks to improve time series decomposition by forgoing the "single-best-model" concept and embracing all competing models into the inference via a Bayesian model averaging scheme. It is a flexible tool to uncover abrupt changes (i.e., change-points, breakpoints, structural breaks, or join-points), cyclic variations (e.g., seasonality), and nonlinear trends in time-series observations. BEAST not only tells when changes occur but also quantifies how likely the detected changes are to be true. It detects not only piecewise linear trends but also arbitrary nonlinear trends. BEAST is applicable to real-valued time series data of all kinds, be it for remote sensing, economics, climate sciences, ecology, or hydrology. Example applications include its use to identify regime shifts in ecological data, map forest disturbance and land degradation from satellite imagery, detect market trends in economic data, pinpoint anomalies and extreme events in climate data, and unravel system dynamics in biological data. Details on BEAST are reported in Zhao et al. (2019) <doi:10.1016/j.rse.2019.04.034>.
Implementation of Bayesian multi-task regression models, developed within the context of imaging genetics. The package can currently fit two models. The Bayesian group sparse multi-task regression model of Greenlaw et al. (2017) <doi:10.1093/bioinformatics/btx215> can be fit using Gibbs sampling. An extension of this model developed by Song, Ge et al. to accommodate both spatial correlation and correlation across brain hemispheres can be fit using either mean-field variational Bayes or Gibbs sampling. The model can also be used more generally for multivariate (non-imaging) phenotypes with spatial correlation.
Computed tomography (CT) imaging is a powerful tool for understanding the composition of sediment cores. This package streamlines and accelerates the analysis of CT data generated in the context of environmental science. Included are tools for processing raw DICOM images to characterize sediment composition (sand, peat, etc.). Root analyses are also enabled, including measures of external surface area and volumes for user-defined root size classes. For a detailed description of the application of computed tomography imaging for sediment characterization, see: Davey, E., C. Wigand, R. Johnson, K. Sundberg, J. Morris, and C. Roman. (2011) <DOI: 10.1890/10-2037.1>.
The production of certified reference materials (CRMs) requires various statistical tests depending on the task and recorded data to ensure that reported values of CRMs are appropriate. Often these tests are performed according to the procedures described in ISO GUIDE 35:2017. The eCerto package contains a Shiny app which provides functionality to load, process, report and back up data recorded during CRM production and facilitates following the recommended procedures. It is described in Lisec et al. (2023) <doi:10.1007/s00216-023-05099-3> and can also be accessed online <https://apps.bam.de/shn00/eCerto/> without package installation.
Implementations of the expected shortfall backtests of Bayer and Dimitriadis (2020) <doi:10.1093/jjfinec/nbaa013> as well as other well-known backtests from the literature. They can be used to assess the correctness of forecasts of the expected shortfall risk measure, which is used, for example, in the banking and finance industry to quantify the market risk of investments. A special feature of the backtests of Bayer and Dimitriadis (2020) <doi:10.1093/jjfinec/nbaa013> is that they only require forecasts of the expected shortfall, which is in striking contrast to all other existing backtests, making them particularly attractive for practitioners.
Process automation of point cloud data derived from terrestrial-based technologies such as Terrestrial Laser Scanner (TLS) or Mobile Laser Scanner. FORTLS enables (i) detection of trees and estimation of tree-level attributes (e.g. diameters and heights), (ii) estimation of stand-level variables (e.g. density, basal area, mean and dominant height), (iii) computation of metrics related to important forest attributes estimated in Forest Inventories at stand-level, and (iv) optimization of plot design for combining TLS data and field measured data. Documentation about FORTLS is described in Molina-Valero et al. (2022, <doi:10.1016/j.envsoft.2022.105337>).
Cellular responses to perturbations are highly heterogeneous and depend largely on the initial state of cells. Connecting post-perturbation cells via cellular trajectories to untreated cells (e.g. by leveraging metabolic labeling information) enables exploitation of intercellular heterogeneity as a combined knock-down and overexpression screen to identify pathway modulators, termed Heterogeneity-seq (see Berg et al. <doi:10.1101/2024.10.28.620481>). This package contains functions to generate cellular trajectories based on scSLAM-seq (single-cell, thiol-(SH)-linked alkylation of RNA for metabolic labelling sequencing) time courses, functions to identify pathway modulators and to visualize the results.
Combine probabilistic forecasts using CRPS learning algorithms proposed in Berrisch, Ziel (2021) <doi:10.48550/arXiv.2102.00968> <doi:10.1016/j.jeconom.2021.11.008>. The package implements multiple online learning algorithms like Bernstein online aggregation; see Wintenberger (2014) <doi:10.48550/arXiv.1404.1356>. Quantile regression is also implemented for comparison purposes. Model parameters can be tuned automatically with respect to the loss of the forecast combination. Methods like predict(), update(), plot() and print() are available for convenience. This package utilizes the optim C++ library for numeric optimization <https://github.com/kthohr/optim>.
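To illustrate the general idea of weighting forecasters by their past loss, here is a minimal Python sketch. It is not Bernstein online aggregation and not the package's algorithm: it uses a plain exponentially weighted average scheme at a single quantile level, scored with the pinball (quantile) loss. The function names and the learning rate `eta` are assumptions made for the example.

```python
import math


def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss of forecast q_pred for quantile level tau."""
    diff = y - q_pred
    return tau * diff if diff >= 0 else (tau - 1) * diff


def ewa_weights(expert_preds, observations, tau, eta=1.0):
    """Exponential weights after seeing all observations.

    expert_preds[t][k] is the tau-quantile forecast of expert k at time t.
    """
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts
    for preds, y in zip(expert_preds, observations):
        for k, p in enumerate(preds):
            cum_loss[k] += pinball_loss(y, p, tau)
    # Subtracting the smallest loss avoids underflow in exp().
    min_loss = min(cum_loss)
    raw = [math.exp(-eta * (loss - min_loss)) for loss in cum_loss]
    total = sum(raw)
    return [r / total for r in raw]


def combine(preds, weights):
    """Weighted combination of the experts' forecasts for one time step."""
    return sum(w * p for w, p in zip(weights, preds))
```

Experts with a smaller cumulative loss receive larger weights, so the combined forecast leans toward whichever forecaster has performed best so far.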
This package implements a new RNA-Seq analysis method and integrates two modules: a basic model for pairwise comparison and a linear model for complex design. RNA-Seq quantifies gene expression with read counts, and an experiment usually consists of several conditions (or treatments) with several replicates for each condition. This software infers differential expression directly from the difference in counts between conditions. It assumes that the difference in sum counts between conditions follows a negative binomial distribution. In addition, ABSSeq moderates the fold-changes in two steps, by expression level and by gene-specific dispersion, which might facilitate gene ranking by fold-change and visualization.
Distributed Online Mean Tests is a powerful tool designed to efficiently process and analyze distributed datasets. It enables users to perform mean tests in an online, distributed manner, making it highly suitable for large-scale data analysis. By leveraging advanced computational techniques, Domean ensures robust and scalable solutions for statistical analysis, particularly in scenarios where data is dispersed across multiple nodes or sources. This package is ideal for researchers and practitioners working with high-dimensional data, providing a flexible and efficient framework for mean testing. The philosophy of Domean is described in Guo G.(2025) <doi:10.1016/j.physa.2024.130308>.
The FBED and mmpc variable selection algorithms have been implemented using the distance correlation. The references include: Tsamardinos I., Aliferis C. F. and Statnikov A. (2003). "Time and sample efficient discovery of Markov blankets and direct causal relations". In Proceedings of the ninth ACM SIGKDD international Conference. <doi:10.1145/956750.956838>. Borboudakis G. and Tsamardinos I. (2019). "Forward-backward selection with early dropping". Journal of Machine Learning Research, 20(8): 1--39. <doi:10.48550/arXiv.1705.10770>. Huo X. and Szekely G.J. (2016). "Fast computing for distance covariance". Technometrics, 58(4): 435--447. <doi:10.1080/00401706.2015.1054435>.
This package contains two functions that are intended to make tuning supervised learning methods easy. The eztune function uses a genetic algorithm or Hooke-Jeeves optimizer to find the best set of tuning parameters. The user can choose the optimizer, the learning method, and whether optimization will be based on accuracy obtained through validation error, cross validation, or resubstitution. The function eztune_cv will compute a cross validated error rate. The purpose of eztune_cv is to provide a cross validated accuracy or MSE when resubstitution or validation data are used for optimization, because error measures from both approaches can be misleading.
The kernelized version of principal component analysis (KPCA) has proven to be a valid nonlinear alternative for tackling the nonlinearity of biological sample spaces. However, it poses new challenges in terms of the interpretability of the original variables. kpcaIG aims to provide a tool to select the most relevant variables based on the kernel PCA representation of the data as in Briscik et al. (2023) <doi:10.1186/s12859-023-05404-y>. It also includes functions for 2D and 3D visualization of the original variables (as arrows) into the kernel principal components axes, highlighting the contribution of the most important ones.