This package provides a set of tools that enables efficient estimation of penalized Poisson Pseudo Maximum Likelihood regressions, using lasso or ridge penalties, for models that feature one or more sets of high-dimensional fixed effects. The methodology is based on Breinlich, Corradi, Rocha, Ruta, Santos Silva, and Zylkin (2021) <http://hdl.handle.net/10986/35451> and takes advantage of the method of alternating projections of Gaure (2013) <doi:10.1016/j.csda.2013.03.024> for dealing with HDFE, as well as the coordinate descent algorithm of Friedman, Hastie and Tibshirani (2010) <doi:10.18637/jss.v033.i01> for fitting lasso regressions. The package is also able to carry out cross-validation and to implement the plugin lasso of Belloni, Chernozhukov, Hansen and Kozbur (2016) <doi:10.1080/07350015.2015.1102733>.
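To make the penalized Poisson step concrete, the sketch below fits a lasso-penalized Poisson regression with the general-purpose glmnet package, with the penalty chosen by cross-validation. This is only an illustration of the underlying technique on simulated data; it is not this package's HDFE-aware interface, and the objects x and y are assumed for the example.

  # Minimal lasso-penalized Poisson fit with glmnet (illustration only).
  library(glmnet)
  set.seed(1)
  x <- matrix(rnorm(200 * 10), 200, 10)             # assumed regressor matrix
  y <- rpois(200, lambda = exp(x[, 1] - x[, 2]))    # assumed count outcome
  cv_fit <- cv.glmnet(x, y, family = "poisson", alpha = 1)  # alpha = 1 -> lasso penalty
  coef(cv_fit, s = "lambda.min")                    # coefficients at the CV-selected penalty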
Epigenome-wide association studies (EWAS) detect a large number of DNA methylation differences, often hundreds of differentially methylated regions and thousands of CpGs, that are significantly associated with a disease, many of which are located in non-coding regions. There is therefore a critical need to better understand the functional impact of these CpG methylations and to further prioritize the significant changes. MethReg is an R package for integrative modeling of DNA methylation, target gene expression and transcription factor binding site data, to systematically identify and rank functional CpG methylations. MethReg evaluates, prioritizes and annotates CpG sites with high regulatory potential using matched methylation and gene expression data, along with external TF-target interaction databases based on manual curation, ChIP-seq experiments or gene regulatory network analysis.
Constrained randomization by Raab and Butcher (2001) <doi:10.1002/1097-0258(20010215)20:3%3C351::AID-SIM797%3E3.0.CO;2-C> is suitable for cluster randomized trials (CRTs) with a small number of clusters (e.g., 20 or fewer). The procedure of constrained randomization is based on the baseline values of pre-specified cluster-level covariates. The intervention effect on the individual outcome can then be analyzed through the clustered permutation test introduced by Gail et al. (1996) <doi:10.1002/(SICI)1097-0258(19960615)15:11%3C1069::AID-SIM220%3E3.0.CO;2-Q>. Motivated by Li et al. (2016) <doi:10.1002/sim.7410>, the package performs constrained randomization on the baseline values of cluster-level covariates and the clustered permutation test on the individual-level outcomes for cluster randomized trials.
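The constrained-randomization idea itself can be sketched in a few lines of base R: enumerate candidate allocations, score their baseline covariate balance, keep only the best-balanced subset, and sample the final allocation from that subset. This is a toy illustration with made-up cluster covariates, not this package's interface.

  # Toy constrained randomization: 8 clusters, 4 assigned to the intervention arm.
  set.seed(42)
  covar <- data.frame(size = rnorm(8), rate = rnorm(8))    # assumed cluster-level covariates
  allocs <- combn(8, 4)                                    # all possible treated-arm choices
  balance <- apply(allocs, 2, function(trt) {
    sum(sapply(covar, function(z) (mean(z[trt]) - mean(z[-trt]))^2 / var(z)))
  })
  keep <- allocs[, balance <= quantile(balance, 0.1)]      # constrained set: best-balanced 10%
  chosen <- keep[, sample(ncol(keep), 1)]                  # randomly select one allocation
  chosen                                                   # clusters assigned to the intervention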
Generates Raven-like matrices according to different rules and the response list associated with each matrix. The package can generate matrices composed of 4 or 9 cells, along with a response list of 11 elements (the correct response + 10 incorrect responses). The matrices can be generated according to both logical rules (i.e., the relationships between the elements in the matrix are manipulated to create the matrix) and visual-spatial rules (i.e., the visual or spatial characteristics of the elements are manipulated to generate the matrix). The graphical elements of this package are based on the DescTools package. This package has been developed within the PRIN2020 Project (Prot. 20209WKCLL) titled "Computerized, Adaptive and Personalized Assessment of Executive Functions and Fluid Intelligence" and funded by the Italian Ministry of Education and Research.
This package provides a set of tools to assist statistical programmers in validating Study Data Tabulation Model (SDTM) domain data sets. Statistical programmers are required to validate that an SDTM domain data set has been programmed correctly, per the SDTM Implementation Guide (SDTMIG) by CDISC (<https://www.cdisc.org/standards/foundational/sdtmig>), the study specification, and the study protocol, using a process called double programming. Double programming involves two different programmers independently converting the raw electronic data capture (EDC) data into an SDTM domain data table and comparing their results to ensure accurate standardization of the data. One of these attempts is termed 'production' and the other 'validation'. Generally, production runs are the official programs for submittals, and these are written in 'SAS'. Validation runs can be programmed in another language, in this case 'R'.
When we combine gene-editing technology and sequencing technology, we need to reconstruct a lineage tree from the alleles generated and calculate the similarity between each pair of groups. The FindIndel() and IndelForm() functions help you align each read to the reference sequence and generate scar-form strings, respectively. The IndelIdents() function helps you define a scar form for each cell or read. The IndelPlot() function helps you visualize the distribution of deletions and insertions. The TagProcess() function helps you extract indels for each cell or read. The TagDist() function helps you calculate the similarity between each pair of groups across the indels they contain. The BuildTree() function helps you reconstruct a tree, and the PlotTree() function helps you visualize it. A sketch of how these steps chain together is given below.
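Only the function names are documented in this summary, so the pipeline below is written entirely as commented pseudocode: every argument and object name is a hypothetical placeholder, and the real signatures should be taken from each function's help page.

  # Hypothetical end-to-end sketch (all arguments are placeholders, not the documented API):
  # indel_tab  <- FindIndel(data = reads, ref = reference_seq)  # align reads to the reference
  # scar_forms <- IndelForm(indel_tab)                          # encode indels as scar-form strings
  # cell_ids   <- IndelIdents(scar_forms)                       # one scar form per cell or read
  # IndelPlot(cell_ids)                                         # deletion/insertion distribution
  # tags       <- TagProcess(cell_ids)                          # extract indels per cell or read
  # dist_mat   <- TagDist(tags)                                 # pairwise group similarity
  # tree       <- BuildTree(dist_mat)                           # reconstruct the lineage tree
  # PlotTree(tree)                                              # visualize the tree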
An implementation of the statistical methods commonly used for advanced composite materials in aerospace applications. This package focuses on calculating basis values (lower tolerance bounds) for material strength properties, as well as performing the associated diagnostic tests. This package provides functions for calculating basis values assuming several different distributions, as well as providing functions for non-parametric methods of computing basis values. Functions are also provided for testing the hypothesis that there is no difference between strength and modulus data from an alternate sample and that from a "qualification" or "baseline" sample. For a discussion of these statistical methods and their use, see the Composite Materials Handbook, Volume 1 (2012, ISBN: 978-0-7680-7811-4). Additional details about this package are available in the paper by Kloppenborg (2020, <doi:10.21105/joss.02265>).
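For the normal-distribution case, the B-basis value (a 95%-confidence lower bound on the 10th percentile of strength) reduces to the classical one-sided tolerance limit, which can be reproduced in a few lines of base R. The sketch below uses simulated coupon data and standard statistical formulas only; it is not tied to this package's functions.

  # B-basis under a normal assumption: xbar - k_B * s, with k_B the exact one-sided
  # tolerance factor (95% confidence on the 10th percentile).
  set.seed(1)
  strength <- rnorm(18, mean = 100, sd = 5)   # assumed coupon strength data
  n  <- length(strength)
  kB <- qt(0.95, df = n - 1, ncp = qnorm(0.90) * sqrt(n)) / sqrt(n)
  b_basis <- mean(strength) - kB * sd(strength)
  b_basis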
Recently, many new p-value based multiple test procedures have been proposed, and these new methods are more powerful than the widely used Hochberg procedure. These procedures strongly control the familywise error rate (FWER). This is a comprehensive collection of p-value based FWER-controlling stepwise multiple test procedures, including six procedure families and thirty multiple test procedures. In this collection, the conservative Hochberg procedure, the linear-time Hommel procedures, the asymptotic Rom procedure, the Gou-Tamhane-Xi-Rom procedures, and the Quick procedures have all been developed since 2014. The package name "elitism" is an acronym for "e"quipment for "l"ogarithmic and l"i"near "ti"me "s"tepwise "m"ultiple hypothesis testing. See Gou, J. (2022), "Quick multiple test procedures and p-value adjustments", Statistics in Biopharmaceutical Research 14(4), 636-650.
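For orientation, the classical Hochberg and Hommel step-wise adjustments that these newer procedures improve upon are already available through base R's p.adjust(); the short example below shows only that baseline, with made-up p-values, and does not use this package's own functions.

  # Baseline FWER-controlling adjustments in base R (the package adds newer, more
  # powerful procedures on top of these).
  p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
  p.adjust(p, method = "hochberg")
  p.adjust(p, method = "hommel")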
This package proposes a new metric, dependency heaviness, which measures the number of additional dependency packages that a parent package brings to its child package and that are unique to the dependencies imported by all other parents. The dependency heaviness analysis is visualized by a customized heatmap. The package is described in <doi:10.1093/bioinformatics/btac449>. We have also performed the dependency heaviness analysis on the CRAN/Bioconductor package ecosystem and implemented the results as a web-based database that provides comprehensive tools for querying the dependencies of individual R packages. The systematic analysis of the CRAN/Bioconductor ecosystem is described in <doi:10.1016/j.jss.2023.111610>. From pkgndep version 2.0.0, the heaviness database includes snapshots of the CRAN/Bioconductor ecosystems for many old R versions.
A fundamental problem in biomedical research is the low number of observations, mostly due to a lack of available biosamples, prohibitive costs, or ethical reasons. By augmenting a few real observations with artificially generated samples, analyses can become more robust and more reproducible. One possible solution to the problem is the use of generative models, which are statistical models of data that attempt to capture the entire probability distribution of the observations. Using the variational autoencoder (VAE), a well-known deep generative model, this package aims to generate samples with gene expression data, especially for single-cell RNA-seq data. Furthermore, the VAE can use conditioning to produce specific cell types or subpopulations. The conditional VAE (CVAE) allows us to create targeted samples rather than completely random ones.
In addition to modeling the expectation (location) of an outcome, mixed effects location scale models (MELSMs) include submodels on the variance components (scales) directly. This allows the within-group variance to be modeled with mixed effects and the between-group variances with fixed effects. The MELSM can be used to model volatility, intraindividual variance, uncertainty, measurement error variance, and more. Multivariate MELSMs (MMELSMs) extend the model to include multiple correlated outcomes, and therefore multiple locations and scales. The latent multivariate MELSM (LMMELSM) further includes multiple correlated latent variables as outcomes. This package implements two-level mixed effects location scale models on multiple observed or latent outcomes, with between-group variance modeling. See Williams, Martin, Liu, and Rast (2020) <doi:10.1027/1015-5759/a000624> and Hedeker, Mermelstein, and Demirtas (2008) <doi:10.1111/j.1541-0420.2007.00924.x>.
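The location-and-scale idea can be sketched with distributional regression in the brms package, where a second formula models the log residual standard deviation. This is only an illustration of the MELSM concept on simulated data, not this package's interface; the variable names y, time and id are assumed.

  # Mixed effects location scale sketch via brms distributional regression.
  library(brms)
  set.seed(1)
  dat <- data.frame(id = rep(1:20, each = 10), time = rep(1:10, 20))
  dat$y <- rnorm(nrow(dat), mean = dat$time, sd = exp(0.2 * dat$time))  # SD grows with time
  fit <- brm(
    bf(y ~ time + (1 | id),        # location (mean) submodel with random intercepts
       sigma ~ time + (1 | id)),   # scale submodel: residual SD modeled on the log scale
    data = dat, family = gaussian()
  )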
Simulation results detailed in Esarey and Menger (2019) <doi:10.1017/psrm.2017.42> demonstrate that cluster-adjusted t statistics (CATs) are an effective method for correcting standard errors in scenarios with a small number of clusters. The mmiCATs package offers a suite of tools for working with CATs. The mmiCATs() function launches a shiny web application that facilitates the analysis of data using CATs, as implemented in the cluster.im.glm() function from the clusterSEs package. Additionally, the pwr_func_lmer() function is designed to simplify the process of running simulations that compare mixed effects models with CATs models. For educational purposes, the CloseCATs() function launches a shiny card-game application aimed at enhancing users' understanding of the conditions under which CATs should be preferred over random intercept models.
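Since the CATs implementation builds on cluster.im.glm() from clusterSEs, a direct call to that underlying function looks roughly like the sketch below. The simulated data frame and variable names (y, x, grp) are assumptions, and the call assumes cluster.im.glm() takes the fitted model, the data, and a clustering formula; this package's own wrappers are not shown.

  # Cluster-adjusted t statistics via clusterSEs (the function this package wraps).
  library(clusterSEs)
  set.seed(1)
  dat <- data.frame(grp = rep(1:10, each = 20), x = rnorm(200))
  dat$y <- 0.5 * dat$x + rep(rnorm(10), each = 20) + rnorm(200)   # outcome with cluster-level noise
  m <- glm(y ~ x, data = dat, family = gaussian())
  cluster.im.glm(m, dat = dat, cluster = ~ grp)   # Ibragimov-Mueller cluster adjustment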
Investigating and visualising Bayesian Additive Regression Tree (BART) (Chipman, H. A., George, E. I., & McCulloch, R. E. 2010) <doi:10.1214/09-AOAS285> model fits. We construct conventional plots to analyze a model's performance and stability, as well as new tree-based plots to analyze variable importance, interaction, and tree structure. We employ Value Suppressing Uncertainty Palettes (VSUP) to construct heatmaps that display variable importance and interactions jointly, using a colour scale to represent posterior uncertainty. Our visualisations are designed to work with the most popular BART R packages available, namely BART (Rodney Sparapani, Charles Spanbauer and Robert McCulloch 2021) <doi:10.18637/jss.v097.i01>, dbarts (Vincent Dorie 2023) <https://CRAN.R-project.org/package=dbarts>, and bartMachine (Adam Kapelner and Justin Bleich 2016) <doi:10.18637/jss.v070.i04>.
Various methods for the identification of trend and seasonal components in time series (TS) are provided. Among them is a data-driven locally weighted regression approach with automatically selected bandwidth for equidistant short-memory time series. The approach is a combination / extension of the algorithms by Feng (2013) <doi:10.1080/02664763.2012.740626> and Feng, Y., Gries, T., and Fritz, M. (2020) <doi:10.1080/10485252.2020.1759598>, and a brief description of this new method is provided in the package documentation. Furthermore, the package allows its users to apply the base model of the Berlin procedure, version 4.1, as described in Speth (2004) <https://www.destatis.de/DE/Methoden/Saisonbereinigung/BV41-methodenbericht-Heft3_2004.pdf?__blob=publicationFile>. Permission to include this procedure was kindly provided by the Federal Statistical Office of Germany.
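For a quick feel for the kind of trend and seasonal components such procedures target, base R's stl() decomposition of a built-in monthly series works as a generic preview; this is only an orientation example, not the data-driven bandwidth method or the Berlin procedure provided here.

  # Generic trend/seasonal decomposition in base R (for comparison only).
  dec <- stl(co2, s.window = "periodic")   # co2: built-in monthly Mauna Loa CO2 series
  plot(dec)                                # panels: data, seasonal, trend, remainder
  head(dec$time.series)                    # seasonal / trend / remainder columns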
Transferring a code base from Matlab to R is often a repetitive and inefficient use of time. This package provides a translator for Matlab / Octave code into R code. It makes some syntax changes, but most of the heavy lifting is in the function changes, since the languages are so similar. Options for different data structures and for the functions that can be changed are given. The Matlab code should mostly adhere to the standard style guide, but some effort has been made to accommodate different numbers of spaces and other small syntax issues. The translation will not make the code more R friendly, and the result may not even run afterwards. However, the rudimentary syntax, base function and data structure conversion is done quickly so that the maintainer can focus on changes to the design structure.
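To make the kind of syntax and function changes concrete, here is a hand-written example of a small Matlab fragment and a plausible R counterpart; this is illustrative only and is not the literal output of the translator.

  # Matlab / Octave input (shown as comments):
  #   A = zeros(3, 4);
  #   for i = 1:3
  #       A(i, :) = i * ones(1, 4);
  #   end
  # A plausible R translation:
  A <- matrix(0, nrow = 3, ncol = 4)
  for (i in 1:3) {
    A[i, ] <- i * rep(1, 4)
  }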
Utilize an orthogonality constrained optimization algorithm of Wen & Yin (2013) <DOI:10.1007/s10107-012-0584-1> to solve a variety of dimension reduction problems in the semiparametric framework, such as Ma & Zhu (2012) <DOI:10.1080/01621459.2011.646925>, Ma & Zhu (2013) <DOI:10.1214/12-AOS1072>, Sun, Zhu, Wang & Zeng (2019) <DOI:10.1093/biomet/asy064> and Zhou, Zhu & Zeng (2021) <DOI:10.1093/biomet/asaa087>. The package also implements existing dimension reduction methods such as hMave by Xia, Zhang, & Xu (2010) <DOI:10.1198/jasa.2009.tm09372> and partial SAVE by Feng, Wen & Zhu (2013) <DOI:10.1080/01621459.2012.746065>. It also serves as a general purpose optimization solver for problems with orthogonality constraints, i.e., on the Stiefel manifold. Parallel computing for approximating the gradient is enabled through OpenMP.
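In symbols, the general problem such a solver handles is minimization over matrices with orthonormal columns, i.e., over the Stiefel manifold; here f denotes the problem-specific objective and B a generic p x d parameter matrix (both symbols are introduced only for this statement):

  \min_{B \in \mathbb{R}^{p \times d}} f(B) \quad \text{subject to} \quad B^\top B = I_d .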
Calculates a comprehensive list of features from profile hidden Markov models (HMMs) of proteins. Adapts and ports features for use with HMMs instead of Position Specific Scoring Matrices, in order to take advantage of more accurate multiple sequence alignment by programs such as HHblits (Remmert et al. 2012) <DOI:10.1038/nmeth.1818> and HMMER (Eddy 2011) <DOI:10.1371/journal.pcbi.1002195>. Features calculated by this package can be used for protein fold classification, protein structural class prediction, sub-cellular localization and protein-protein interaction prediction, among other tasks. Some examples of features extracted are found in Song et al. (2018) <DOI:10.3390/app8010089>, Jin & Zhu (2021) <DOI:10.1155/2021/8629776>, Lyons et al. (2015) <DOI:10.1109/tnb.2015.2457906> and Saini et al. (2015) <DOI:10.1016/j.jtbi.2015.05.030>.
When comparing single cases to control populations whose parameters are unknown, researchers and clinicians must estimate these parameters from a control sample. This is often done when testing a case's abnormality on some variable or testing the abnormality of the discrepancy between two variables. Appropriate frequentist and Bayesian methods for doing this are implemented here, including tests allowing for the inclusion of covariates. These have been developed first and foremost by John Crawford and Paul Garthwaite, e.g. in Crawford and Howell (1998) <doi:10.1076/clin.12.4.482.7241>, Crawford and Garthwaite (2005) <doi:10.1037/0894-4105.19.3.318>, Crawford and Garthwaite (2007) <doi:10.1080/02643290701290146> and Crawford, Garthwaite and Ryan (2011) <doi:10.1016/j.cortex.2011.02.017>. The package is also equipped with power calculators for each method.
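The frequentist backbone of several of these methods, Crawford and Howell's (1998) modified t-test comparing a single case to a small control sample, is simple enough to write out in base R. The sketch below uses made-up scores and does not call this package's own functions.

  # Crawford-Howell (1998) test of deficit: is the case abnormally low relative
  # to n controls? (one-tailed p-value)
  case     <- 71
  controls <- c(82, 85, 79, 90, 88, 84, 86, 81)
  n <- length(controls)
  t_stat <- (case - mean(controls)) / (sd(controls) * sqrt((n + 1) / n))
  p_val  <- pt(t_stat, df = n - 1)   # probability of a score this low or lower
  c(t = t_stat, p = p_val)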
This package provides methods for building self-organizing maps (SOMs) with a number of distinguishing features, such as automatic centroid detection and cluster visualization using starbursts. For more details see the paper "Improved Interpretability of the Unified Distance Matrix with Connected Components" by Hamel and Brown (2011) <ISBN:1-60132-168-6>. The package provides user-friendly access to two models we construct: (a) a SOM model and (b) a centroid-based clustering model. The package also exposes a number of quality metrics for the quantitative evaluation of the map, Hamel (2016) <doi:10.1007/978-3-319-28518-4_4>. Finally, we reintroduce our fast, vectorized training algorithm for SOMs with substantial improvements. It is about an order of magnitude faster than the canonical, stochastic C implementation <doi:10.1007/978-3-030-01057-7_60>.
English is the native language for only 5% of the World population. Also, only 17% of us can understand this text. Moreover, the Latin alphabet is the main one for merely 36% of the total. The early computer era, now a very long time ago, was dominated by the US. Due to the proliferation of the internet, smartphones, social media, and other technologies and communication platforms, this is no longer the case. This package replaces base R string functions (such as grep(), tolower(), sprintf(), and strptime()) with ones that fully support the Unicode standards related to natural language and date-time processing. It also fixes some long-standing inconsistencies and introduces some new, useful features. Thanks to ICU (International Components for Unicode) and stringi, they are fast, reliable, and portable across different platforms.
Inference procedures accommodate a flexible range of hazard ratio patterns with a two-sample semi-parametric model. This model contains the proportional hazards model and the proportional odds model as sub-models, and accommodates non-proportional hazards situations to the extreme of having crossing hazards and crossing survivor functions. Overall, this package has four major functions: 1) the parameter estimation, namely short-term and long-term hazard ratio parameters; 2) 95 percent and 90 percent point-wise confidence intervals and simultaneous confidence bands for the hazard ratio function; 3) p-value of the adaptive weighted log-rank test; 4) p-values of two lack-of-fit tests for the model. See the included "read_me_first.pdf" for brief instructions. In this version (1.1), there is no need to sort the data before applying this package.
Computation of adherence to medications from electronic healthcare data and visualization of individual medication histories and adherence patterns. The package implements a set of S3 classes and functions consistent with current adherence guidelines and definitions. It allows the computation of different measures of adherence (as defined in the literature, but also several original ones), their publication-quality plotting, the estimation of event duration and time to initiation, the interactive exploration of patient medication histories, and the real-time estimation of adherence given various parameter settings. It scales from very small datasets stored in flat CSV files to very large databases, and from single-thread processing on mid-range consumer laptops to parallel processing on large heterogeneous computing clusters. It exposes a standardized interface allowing it to be used from other programming languages and platforms, such as Python.
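One of the simplest adherence measures discussed in this literature, the proportion of days covered (PDC) over an observation window, can be sketched in a few lines of base R. This illustrates the general concept only, with invented dispensing records; it is not this package's interface.

  # Proportion of days covered (PDC) over a 90-day window for one patient (base R sketch).
  window_start <- as.Date("2024-01-01")
  window_days  <- 90
  fills <- data.frame(date     = as.Date(c("2024-01-01", "2024-02-05", "2024-03-10")),
                      duration = c(30, 30, 30))             # assumed days of supply per fill
  offsets <- as.integer(fills$date - window_start)           # fill start, in days from window start
  covered <- unique(unlist(mapply(function(start, dur) start + seq_len(dur) - 1,
                                  offsets, fills$duration, SIMPLIFY = FALSE)))
  covered <- covered[covered >= 0 & covered < window_days]   # keep only days inside the window
  pdc <- length(covered) / window_days
  pdc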
Statistical methods for ROC surface analysis in three-class classification problems for clustered data and in the presence of covariates. In particular, the package allows one to obtain covariate-specific point and interval estimates for: (i) true class fractions (TCFs) at fixed pairs of thresholds; (ii) the ROC surface; (iii) the volume under the ROC surface (VUS); (iv) the optimal pairs of thresholds. Methods considered in points (i), (ii) and (iv) are proposed and discussed in To et al. (2022) <doi:10.1177/09622802221089029>. Referring to point (iv), three different selection criteria are implemented: Generalized Youden Index (GYI), Closest to Perfection (CtP) and Maximum Volume (MV). Methods considered in point (iii) are proposed and discussed in Xiong et al. (2018) <doi:10.1177/0962280217742539>. Visualization tools are also provided. We refer readers to the articles cited above for all details.
Fits a geographically weighted regression model using zero inflated probability distributions. Has the zero inflated negative binomial distribution (zinb) as default, but also accepts the zero inflated Poisson (zip), negative binomial (negbin) and Poisson distributions. Can also fit the global versions of each regression model. Da Silva, A. R. & De Sousa, M. D. R. (2023). "Geographically weighted zero-inflated negative binomial regression: A general case for count data", Spatial Statistics <doi:10.1016/j.spasta.2023.100790>. Brunsdon, C., Fotheringham, A. S., & Charlton, M. E. (1996). "Geographically weighted regression: a method for exploring spatial nonstationarity", Geographical Analysis, <doi:10.1111/j.1538-4632.1996.tb00936.x>. Yau, K. K. W., Wang, K., & Lee, A. H. (2003). "Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros", Biometrical Journal, <doi:10.1002/bimj.200390024>.