This package provides a set of state-of-the-art probabilistic modeling approaches to derive estimates of individual customer lifetime values (CLV). Commonly, probabilistic approaches focus on modelling three processes: individuals' attrition, transaction, and spending processes. Latent customer attrition models, also known as "buy-'til-you-die" models, model the attrition as well as the transaction process. They are used to make inferences and predictions about transactional patterns of individual customers, such as their future purchase behavior. Moreover, these models have also been used to predict individuals' long-term engagement in activities such as playing an online game or posting to a social media platform. The spending process is usually modelled by a separate probabilistic model. Combining these results yields lifetime value estimates for individual customers. This package includes fast and accurate implementations of various probabilistic models for non-contractual settings (e.g., grocery purchases or hotel visits). All implementations support time-invariant covariates, which can be used to control for, e.g., socio-demographics. Where such an extension has been proposed in the literature, we further provide the possibility to control for time-varying covariates, e.g., seasonal patterns. Currently, the package includes the following latent attrition models for individuals' attrition and transaction processes: [1] the Pareto/NBD model (Pareto/Negative-Binomial-Distribution), [2] the Extended Pareto/NBD model (Pareto/Negative-Binomial-Distribution with time-varying covariates), [3] the BG/NBD model (Beta-Gamma/Negative-Binomial-Distribution) and [4] the GGom/NBD model (Gamma-Gompertz/Negative-Binomial-Distribution). Further, we provide an implementation of the Gamma/Gamma model for the spending process of individuals.
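For orientation, a minimal sketch of such a workflow follows, assuming the interface of the CLVTools package (the functions clvdata() and pnbd() and the bundled apparelTrans data set are taken from that package's documentation and should be checked against the current manual):

    # Hedged sketch: fit a Pareto/NBD model [1] on an example transaction log.
    library(CLVTools)
    data("apparelTrans")                      # transaction log: Id, Date, Price
    clv.apparel <- clvdata(apparelTrans,
                           date.format = "ymd", time.unit = "week",
                           estimation.split = 40,
                           name.id = "Id", name.date = "Date", name.price = "Price")
    est.pnbd <- pnbd(clv.apparel)             # latent attrition + transaction process
    predict(est.pnbd, prediction.end = 26)    # expected transactions and CLV per customer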
Cointegration methods are widely used in empirical macroeconomics and empirical finance. It is well known that in a cointegrating regression the ordinary least squares (OLS) estimator of the parameters is super-consistent, i.e. converges at a rate equal to the sample size T. When the regressors are endogenous, the limiting distribution of the OLS estimator is contaminated by so-called second order bias terms, see, e.g., Phillips and Hansen (1990) <DOI:10.2307/2297545>. The presence of these bias terms renders inference difficult. Consequently, several modifications to OLS that lead to zero mean Gaussian mixture limiting distributions have been proposed, which in turn make standard asymptotic inference feasible. These methods include the fully modified OLS (FM-OLS) approach of Phillips and Hansen (1990) <DOI:10.2307/2297545>, the dynamic OLS (D-OLS) approach of Phillips and Loretan (1991) <DOI:10.2307/2298004>, Saikkonen (1991) <DOI:10.1017/S0266466600004217> and Stock and Watson (1993) <DOI:10.2307/2951763>, and the more recent integrated modified OLS (IM-OLS) approach of Vogelsang and Wagner (2014) <DOI:10.1016/j.jeconom.2013.10.015>. The latter is based on an augmented partial sum (integration) transformation of the regression model. IM-OLS is similar in spirit to the FM- and D-OLS approaches, with the key difference that it does not require estimation of long run variance matrices and avoids the need to choose tuning parameters (kernels, bandwidths, lags). However, inference does require that a long run variance be scaled out. This package provides functions for parameter estimation and inference with all three modified OLS approaches, including the automatic bandwidth selection approaches of Andrews (1991) <DOI:10.2307/2938229> and Newey and West (1994) <DOI:10.2307/2297912> as well as calculation of the long run variance.
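To make the estimation step concrete, here is a minimal sketch on simulated data, assuming a cointReg-style interface (the function name cointRegFM and its arguments are assumptions to be verified against the package manual):

    # Hedged sketch: FM-OLS on a simulated cointegrating regression.
    library(cointReg)
    set.seed(42)
    n <- 200
    x <- cumsum(rnorm(n))                  # I(1) regressor
    y <- 1 + 2 * x + rnorm(n)              # cointegrating relation plus noise
    deter <- rep(1, n)                     # deterministic component: intercept
    fm <- cointRegFM(x = x, y = y, deter = deter,
                     kernel = "ba", bandwidth = "and")  # Bartlett kernel, Andrews (1991)
    print(fm)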
The spectral characteristics of a bivariate series (marginal spectra, coherency spectrum and phase spectrum) determine whether there is a strong presence of short-, medium-, or long-term fluctuations (components of certain frequencies in the spectral representation of the series) in each one of them. These are induced by strong peaks of the marginal spectra of each series at the corresponding frequencies. The spectral characteristics also determine how strongly these short-, medium-, or long-term fluctuations are correlated between the two series; information on this is provided by the coherency spectrum at the corresponding frequencies. Finally, certain fluctuations of the two series may be lagged with respect to each other; information on this is provided by the phase spectrum at the corresponding frequencies. The idea in this package is to define a VAR (vector autoregression) model with desired spectral characteristics by specifying a number of polynomials required to define the VAR; see Ioannidis (2007) <doi:10.1016/j.jspi.2005.12.013>. These are specified via their roots, instead of via their coefficients. This is an idea borrowed from the Time Series Library of R. Dahlhaus, where it is used for defining ARMA models for univariate time series. This way, one may, e.g., specify a VAR inducing a strong presence of long-term fluctuations in series 1 and in series 2 which are weakly correlated but lagged by a number of time units with respect to each other, while short-term fluctuations are strongly present in only one of the two series, yet strongly correlated between the two series. Simulation from such models allows studying the behavior of data-analysis tools, such as estimation of the spectra, under different circumstances, e.g., peaks in the spectra generating bias induced by leakage.
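The root-based specification can be illustrated in the univariate case with base R alone (this sketch does not use the package's own interface): a complex-conjugate root pair with modulus close to one at angle theta induces a strong spectral peak near frequency theta.

    # (1 - 2 r cos(theta) z + r^2 z^2) has roots (1/r) e^{+/- i theta};
    # r near 1 yields a sharp spectral peak at angular frequency theta.
    r <- 0.98; theta <- pi / 20            # low frequency: long-term fluctuations
    phi <- c(2 * r * cos(theta), -r^2)     # implied AR(2) coefficients
    x <- arima.sim(n = 2000, model = list(ar = phi))
    spectrum(x, method = "ar")             # peak near theta / (2 * pi) = 0.025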
Calculates power for assessment of intermediate biomarker responses as correlates of risk in the active treatment group in clinical efficacy trials, as described in Gilbert, Janes, and Huang, Power/Sample Size Calculations for Assessing Correlates of Risk in Clinical Efficacy Trials (2016, Statistics in Medicine). The methods differ from past approaches by accounting for the level of clinical treatment efficacy overall and in biomarker response subgroups, which enables the correlates of risk results to be interpreted in terms of potential correlates of efficacy/protection. The methods also account for inter-individual variability of the observed biomarker response that is not biologically relevant (e.g., due to technical measurement error of the laboratory assay used to measure the biomarker response), which is important because power to detect a specified correlate of risk effect size is heavily affected by the biomarker's measurement error. The methods can be used for a general binary clinical endpoint model with a univariate dichotomous, trichotomous, or continuous biomarker response measured in active treatment recipients at a fixed timepoint after randomization, with either case-cohort Bernoulli sampling or case-control without-replacement sampling of the biomarker (a baseline biomarker is handled as a trivial special case). In a specified two-group trial design, the computeN() function can initially be used for calculating additional requisite design parameters pertaining to the target population of active treatment recipients observed to be at risk at the biomarker sampling timepoint. Subsequently, the power calculation employs an inverse probability weighted logistic regression model fitted by the tps() function in the osDesign package. Power results as well as the relationship between the correlate of risk effect size and treatment efficacy can be visualized using various plotting functions. To link power calculations for detecting a correlate of risk and a correlate of treatment efficacy, a baseline immunogenicity predictor (BIP) can be simulated according to a specified classification rule (for dichotomous or trichotomous BIPs) or correlation with the biomarker response (for continuous BIPs), then outputted along with biomarker response data under assignment to treatment, and clinical endpoint data for both treatment and placebo groups.
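As a sketch of the weighted-regression building block, the following simulates a two-phase case-control sample and fits tps() from osDesign (argument names follow osDesign's documentation; the simulated design is illustrative only and is not the package's power calculation):

    # Hedged illustration of tps(): IPW logistic regression for two-phase
    # (case-control) sampling of a biomarker within a cohort.
    library(osDesign)
    set.seed(1)
    N <- 5000
    stratum <- sample(1:2, N, replace = TRUE)       # phase-one strata
    s <- rnorm(N)                                   # biomarker response
    y <- rbinom(N, 1, plogis(-3 - 0.8 * s))         # endpoint risk decreasing in s
    nn1 <- tabulate(stratum[y == 1], nbins = 2)     # cohort cases per stratum
    nn0 <- tabulate(stratum[y == 0], nbins = 2)     # cohort controls per stratum
    idx <- c(which(y == 1), sample(which(y == 0), 500))  # all cases, 500 controls
    d <- data.frame(y = y[idx], s = s[idx])
    fit <- tps(y ~ s, data = d, nn0 = nn0, nn1 = nn1,
               group = stratum[idx], method = "PL", cohort = TRUE)
    fit$coef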
Targeted differential and global enrichment analysis of taxonomic ranks by shared ASVs (Amplicon Sequence Variants), for high-throughput eDNA sequencing of fungi, bacteria, and metazoans. The analysis works in two steps: (I) targeted differential analysis from QIIME2 data and (II) global analysis by Taxon Mann-Whitney U test based on the targeted analysis. Step (I) estimates variance-mean dependence in count/abundance ASV data from high-throughput sequencing assays and tests for differentially represented ASVs based on a model using the negative binomial distribution. Step (II), NCBITaxon_MWU, uses a continuous measure of significance (such as fold-change or -log(p-value)) to identify NCBITaxon categories that are significantly enriched with either up- or down-represented ASVs. If the measure is binary (0 or 1), the script performs a typical NCBITaxon enrichment analysis based on Fisher's exact test: it shows NCBITaxon categories over-represented among the ASVs that have 1 as their measure. On the plot, different fonts are used to indicate significance, and color indicates enrichment with either up- (red) or down- (blue) regulated ASVs. No colors are shown for binary measure analyses. The tree on the plot is a hierarchical clustering of NCBITaxon categories based on shared ASVs. Categories with no branch length between them are subsets of each other. The fraction next to the category name indicates the fraction of good ASVs in it; good ASVs are the ones exceeding the arbitrary absValue cutoff (option in taxon_mwuPlot()). For the Fisher's-based test, specify absValue=0.5. This value does not affect the statistics and is used for plotting only. The original idea was developed for differential gene expression analysis by Wright et al. (2015) <doi:10.1186/s12864-015-1540-2> and is adapted here for taxonomic analysis. The Anaconda package makes it possible to carry out these analyses by automatically creating several graphs and tables and storing them in specially created subfolders. You will need your QIIME2 pipeline output for each kingdom (e.g., Fungi and/or Bacteria and/or Metazoa): i) taxonomy.tsv, ii) taxonomy_RepSeq.tsv, iii) ASV.tsv and iv) SampleSheet_comparison.txt (the latter being created by you).
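The rank-based idea behind step (II) can be sketched with base R alone (this illustrates the underlying Mann-Whitney U test, not the package's own code):

    # Does a given NCBITaxon category rank higher in significance than the
    # rest? Illustrative data: -log10(p) per ASV plus category membership.
    set.seed(7)
    score <- c(rnorm(40, mean = 2), rnorm(160, mean = 0))  # 40 ASVs in the category
    inTaxon <- rep(c(TRUE, FALSE), times = c(40, 160))
    wilcox.test(score[inTaxon], score[!inTaxon], alternative = "greater")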
This package provides meta-analysis methods that correct for publication bias and outcome reporting bias. Four methods and a visual tool are currently included in the package.
The p-uniform method as described in van Assen, van Aert, and Wicherts (2015) <doi:10.1037/met0000025> can be used for estimating the average effect size, testing the null hypothesis of no effect, and testing for publication bias using only the statistically significant effect sizes of primary studies.
The p-uniform* method as described in van Aert and van Assen (2019) <doi:10.31222/osf.io/zqjr9> is an extension of the p-uniform method that allows for estimation of the average effect size and the between-study variance in a meta-analysis, and uses both the statistically significant and nonsignificant effect sizes.
The hybrid method as described in van Aert and van Assen (2017) <doi:10.3758/s13428-017-0967-6> is a meta-analysis method for combining an original study and replication while taking into account statistical significance of the original study. The p-uniform and hybrid methods are based on the statistical theory that the distribution of p-values is uniform conditional on the population effect size.
The fourth method in the package is the Snapshot Bayesian Hybrid Meta-Analysis Method as described in van Aert and van Assen (2018) <doi:10.1371/journal.pone.0175302>. This method computes posterior probabilities for four true effect sizes (no, small, medium, and large) based on an original study and replication while taking into account publication bias in the original study. The method can also be used for computing the required sample size of the replication akin to power analysis in null hypothesis significance testing.
The meta-plot is a visual tool for meta-analysis that provides information on the primary studies in the meta-analysis, the results of the meta-analysis, and characteristics of the research on the effect under study (van Assen and others, 2020).
Helper functions to apply the Correcting for Outcome Reporting Bias (CORB) method to correct for outcome reporting bias in a meta-analysis (van Aert & Wicherts, 2020).
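For orientation, a minimal p-uniform call might look as follows (the argument names yi, vi, side, and method follow the puniform package's documented interface but should be treated as assumptions to verify against the current manual):

    # Hedged sketch: p-uniform estimation from simulated effect sizes;
    # the method uses only the statistically significant ones.
    library(puniform)
    set.seed(123)
    k <- 20
    yi <- rnorm(k, mean = 0.4, sd = 0.15)   # observed effect sizes
    vi <- runif(k, 0.01, 0.05)              # their sampling variances
    puniform(yi = yi, vi = vi, side = "right", method = "P")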
The standard linear regression theory, whether frequentist or Bayesian, is based on an "assumed (revealed?) truth" (John Tukey) attitude to models. This is reflected in the language of statistical inference, which involves a concept of truth, for example in confidence intervals, hypothesis testing and consistency. The motivation behind this package was to remove the word "true" from the theory and practice of linear regression and to replace it by "approximation". The approximations considered are the least squares approximations. An approximation is called valid if it contains no irrelevant covariates. This is operationalized using the concept of a Gaussian P-value, which is the probability that pure Gaussian noise is better in terms of least squares than the covariate. The precise definition is given in the paper; it is intuitive and requires only four simple equations. Its overwhelming advantage compared with a standard F P-value is that it is exact and valid whatever the data, whereas F P-values are valid only for carefully designed simulations. Given this, a valid approximation is one where all the Gaussian P-values are less than a threshold p0 specified by the statistician, with the default value 0.01 in this package. This approximation approach is not only much simpler, it is overwhelmingly better than the standard model-based approach. This is demonstrated using six real data sets, four from high dimensional regression and two from vector autoregression. The function f1st is the most important function. It is a greedy forward selection procedure which results in either just one or no approximations, which may however not be valid. If the size is less than a threshold with default value 21, then an all-subset procedure is called which returns the best valid subset. A good default start is f1st(y,x,kmn=15). The best function for returning multiple approximations is f3st, which repeatedly calls f1st. For more information see the web site below and the accompanying papers: L. Davies and L. Duembgen, "Covariate Selection Based on a Model-free Approach to Linear Regression with Exact Probabilities", <doi:10.48550/arXiv.2202.01553>, and L. Davies, "An Approximation Based Theory of Linear Regression", 2024, <doi:10.48550/arXiv.2402.09858>.
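A minimal sketch of the suggested default call follows; only f1st(y, x, kmn = 15) itself is taken from the text above, and the simulated data are illustrative:

    # Greedy forward selection with Gaussian P-values; covariates 3 and 17
    # are the only relevant ones in this simulation.
    library(gausscov)
    set.seed(1)
    n <- 100; p <- 50
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 3] - 2 * x[, 17] + rnorm(n)
    res <- f1st(y, x, kmn = 15)
    res    # selected covariates with their Gaussian P-values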
Data analysis of proteomics experiments by mass spectrometry is supported by this collection of functions, mostly dedicated to the analysis of (bottom-up) quantitative (XIC) data. Fasta-formatted proteomes (e.g., from the UniProt Consortium <doi:10.1093/nar/gky1049>) can be read with automatic parsing, and multiple annotation types (like species origin, abbreviated gene names, etc.) extracted. Initial results from multiple software packages for protein (and peptide) quantitation can be imported to a common format: MaxQuant (Tyanova et al 2016 <doi:10.1038/nprot.2016.136>), Dia-NN (Demichev et al 2020 <doi:10.1038/s41592-019-0638-x>), Fragpipe (da Veiga et al 2020 <doi:10.1038/s41592-020-0912-y>), ionbot (Degroeve et al 2021 <doi:10.1101/2021.07.02.450686>), MassChroq (Valot et al 2011 <doi:10.1002/pmic.201100120>), OpenMS (Strauss et al 2021 <doi:10.1038/nmeth.3959>), ProteomeDiscoverer (Orsburn 2021 <doi:10.3390/proteomes9010015>), Proline (Bouyssie et al 2020 <doi:10.1093/bioinformatics/btaa118>), AlphaPept (preprint Strauss et al <doi:10.1101/2021.07.23.453379>) and Wombat-P (Bouyssie et al 2023 <doi:10.1021/acs.jproteome.3c00636>). Meta-data provided by the initial analysis software and/or in sdrf format (Perez-Riverol et al 2020 <doi:10.1021/acs.jproteome.0c00376>) or similar tabular formats can be imported and integrated into the analysis. Quantitative proteomics measurements frequently contain multiple NA values, due to physical absence of given peptides in some samples, limitations in sensitivity or other reasons. Help is provided to inspect the data graphically, to investigate the nature of NA values via their respective replicate measurements, and to help/confirm the choice of NA-replacement algorithms. Missing values can be inspected and imputed based on the concept of NA-neighbours or other methods. Dedicated filtering and statistical testing using the framework of the package limma <doi:10.18129/B9.bioc.limma> can be run, enhanced by multiple rounds of NA-replacements to provide robustness towards rare stochastic events. Multi-species samples, as frequently used in benchmark tests (e.g., Navarro et al 2016 <doi:10.1038/nbt.3685>, Ramus et al 2016 <doi:10.1016/j.jprot.2015.11.011>), can be run with special options considering such sub-groups during normalization and testing. Subsequently, ROC curves (Hand and Till 2001 <doi:10.1023/A:1010920819831>) can be constructed to compare multiple analysis approaches. As a detailed example, the data-set from Ramus et al 2016 <doi:10.1016/j.jprot.2015.11.011>, quantified by MaxQuant, ProteomeDiscoverer, and Proline, is provided with a detailed analysis of heterologous spike-in proteins.
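A compact sketch of a typical import step follows; readMaxQuantFile() is named in this package's documentation, but the arguments shown and the downstream call to testRobustToNAimputation() should be treated as assumptions:

    # Hedged sketch: import MaxQuant proteinGroups output bundled as
    # example data, then (assumed interface) NA-robust moderated testing.
    library(wrProteo)
    path <- system.file("extdata", package = "wrProteo")    # bundled example data
    dataMQ <- readMaxQuantFile(path)                        # parse proteinGroups.txt
    ## assumed downstream step: repeated NA-imputation + limma-based testing
    # testMQ <- testRobustToNAimputation(dataMQ, gr = dataMQ$sampleSetup$groups)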
Fifteen tools for bioinformatics processing and analysis of major histocompatibility complex (MHC) data. The functions are tailored for amplicon data sets that have been filtered using the dada2 method (for more information on dada2, visit <https://benjjneb.github.io/dada2/>), but other types of data sets can be analyzed as well. The ReplMatch() function matches replicates in data sets in order to evaluate genotyping success. The GetReplTable() and GetReplStats() functions perform such an evaluation. The CreateFas() function creates a fasta file with all the sequences in the data set. The CreateSamplesFas() function creates individual fasta files for each sample in the data set. The DistCalc() function calculates Grantham, Sandberg, or p-distances from pairwise comparisons of all sequences in a data set, and mean distances of all pairwise comparisons within each sample in a data set. The function additionally outputs five tables with physico-chemical z-descriptor values (based on Sandberg et al. 1998) for each amino acid position in all sequences in the data set. These tables may be useful for further downstream analyses, such as estimation of MHC supertypes. The BootKmeans() function is a wrapper for the kmeans() function of the stats package which allows for bootstrapping. Bootstrapping k-estimates may be desirable in data sets where, e.g., BIC- vs. k-values do not produce clear inflection points ("elbows"). BootKmeans() performs multiple runs of kmeans() and estimates optimal k-values based on a user-defined threshold of BIC reduction. The method is an automated and bootstrapped version of visually inspecting elbow plots of BIC- vs. k-values. The ClusterMatch() function is a tool for evaluating whether different kmeans() clustering models identify similar clusters, and summarizes bootstrap model stats as means for different estimated values of k. It is designed to take files produced by the BootKmeans() function as input, but other data can be analyzed if the descriptions of the required data formats are observed carefully. The PapaDiv() function compares parent pairs in the data set and calculates their joint MHC diversity, taking into account sequence variants that occur in both parents. The HpltFind() function infers putative haplotypes from families in the data set. The GetHpltTable() and GetHpltStats() functions evaluate the accuracy of the haplotype inference. The CreateHpltOccTable() function creates a binary (logical) haplotype-sequence occurrence matrix from the output of HpltFind(), for an easy overview of which sequences are present in which haplotypes. The HpltMatch() function compares haplotypes to help identify overlapping and potentially identical types. The NestTablesXL() function translates the output from HpltFind() to an Excel workbook that provides a convenient overview for evaluation and curation of the inferred putative haplotypes.
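The bootstrapped elbow idea can be sketched with stats::kmeans() directly (this illustrates the technique BootKmeans() automates; the BIC-style score and the 1% threshold below are illustrative, not the package's exact criterion):

    # Repeated kmeans() over a range of k with a crude BIC-style score;
    # pick the smallest k whose relative improvement drops below 1%.
    set.seed(3)
    z <- matrix(rnorm(300 * 5), ncol = 5)        # e.g. z-descriptor matrix
    ks <- 1:10
    bic <- sapply(ks, function(k) {
      km <- kmeans(z, centers = k, nstart = 20)
      n <- nrow(z); d <- ncol(z)
      n * log(km$tot.withinss / n) + log(n) * k * d
    })
    rel_gain <- -diff(bic) / abs(bic[-length(bic)])  # improvement from k to k+1
    k_hat <- ks[which(rel_gain < 0.01)[1]]           # first k with < 1% gain
    k_hat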
File reopening utility.
R packages for genetics research.
Simple, native RethinkDB client.
External jars required for the package 'RWeka'.
The package provides sources of randomness.
Exploit controlled vocabularies organized on tematres servers.
This package provides controls for resource limits.
An XML library in pure Rust.
Pure Rust implementation of the RIPEMD hash functions.
Rustls is a modern TLS library written in Rust.
This package provides an interface to the Facebook API.
Enables mapping of country level and gridded user datasets.
This package provides safe Rust bindings to POSIX syscalls.