The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges for GxE assessment, especially for set-based GxE variance component (VC) tests, a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., a gene). We present 'SEAGLE', a Scalable Exact AlGorithm for Large-scale set-based GxE tests, to make the GxE VC test scalable to biobank data. 'SEAGLE' employs modern matrix computations to achieve the same 'exact' results as the original GxE VC tests, and it neither imposes additional assumptions nor relies on approximations. 'SEAGLE' can easily accommodate sample sizes on the order of 10^5, is implementable on standard laptops, and does not require specialized equipment. The accompanying manuscript for this package can be found at Chi, Ipsen, Hsiao, Lin, Wang, Lee, Lu, and Tzeng (2021+) <arXiv:2105.03228>.
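A minimal call sketch follows; the two-step prep.SEAGLE()/SEAGLE() interface and the argument and slot names here are assumptions inferred from the workflow described above, so check the package help before relying on them.

    # Hedged sketch: function/argument names are assumptions; see ?SEAGLE.
    library(SEAGLE)

    set.seed(1)
    n <- 1000; p <- 10
    y <- rnorm(n)                             # continuous phenotype
    X <- matrix(rnorm(n * 2), n, 2)           # covariates
    E <- rbinom(n, 1, 0.5)                    # environmental exposure
    G <- matrix(rbinom(n * p, 2, 0.3), n, p)  # genotypes of one variant set

    obj <- prep.SEAGLE(y = y, X = X, intercept = 1, E = E, G = G)
    fit <- SEAGLE(obj, init.tau = 0.5, init.sigma = 0.5)
    fit$pv  # p-value of the set-based GxE VC test (slot name assumed)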
This package provides a comprehensive set of geostatistical, visual, and analytical methods, in conjunction with an expanded version of the acclaimed mining dataset of J. E. Klovan, included in 'klovan'. This makes the package an excellent learning resource for Principal Component Analysis (PCA), Factor Analysis (FA), kriging, and other geostatistical techniques. Originally published in the 1976 book 'Geological Factor Analysis', the included mining dataset was assembled by Professor J. E. Klovan of the University of Calgary. As one of the first applications of FA in the geosciences, this dataset has significant historical importance, and as a well-regarded published dataset it is an excellent resource for demonstrating the capabilities of PCA, FA, kriging, and other geostatistical techniques. For those interested in these methods, the klovan datasets provide a valuable and illustrative resource. Note that some methods require the RGeostats package; please refer to the README or Additional_repositories for installation instructions. This material is based upon research in the Materials Data Science for Stockpile Stewardship Center of Excellence (MDS3-COE), and supported by the Department of Energy's National Nuclear Security Administration under Award Number DE-NA0004104.
This package performs Bayesian posterior inference for deep Gaussian processes following Sauer, Gramacy, and Higdon (2023, <doi:10.48550/arXiv.2012.08015>). See Sauer (2023, <http://hdl.handle.net/10919/114845>) for comprehensive methodological details and <https://bitbucket.org/gramacylab/deepgp-ex/> for a variety of coding examples. Models are trained through MCMC, including elliptical slice sampling of latent Gaussian layers and Metropolis-Hastings sampling of kernel hyperparameters. The Vecchia approximation for faster computation is implemented following Sauer, Cooper, and Gramacy (2023, <doi:10.48550/arXiv.2204.02904>). Optional monotonic warpings are implemented following Barnett et al. (2024, <doi:10.48550/arXiv.2408.01540>). Downstream tasks include sequential design through active learning Cohn/integrated mean squared error (ALC/IMSE; Sauer, Gramacy, and Higdon, 2023), optimization through expected improvement (EI; Gramacy, Sauer, and Wycoff, 2022 <doi:10.48550/arXiv.2112.07457>), and contour location through entropy (Booth, Renganathan, and Gramacy, 2024 <doi:10.48550/arXiv.2308.04420>). Models extend up to three layers deep; a one-layer model is equivalent to typical Gaussian process regression. Incorporates OpenMP and SNOW parallelization and utilizes C/C++ under the hood.
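For orientation, here is a minimal two-layer fit on a toy one-dimensional function; fit_two_layer(), trim(), and predict() follow the package's documented workflow, while the data and MCMC settings are purely illustrative.

    # Minimal sketch of a two-layer deep GP fit (illustrative settings).
    library(deepgp)

    f <- function(x) sin(4 * pi * x) + 0.5 * x
    x <- matrix(seq(0, 1, length.out = 30), ncol = 1)
    y <- f(c(x)) + rnorm(30, sd = 0.05)

    fit <- fit_two_layer(x, y, nmcmc = 2000)  # ESS for latent layer, MH for hypers
    fit <- trim(fit, burn = 1000, thin = 2)   # drop burn-in, thin the chains
    xp <- matrix(seq(0, 1, length.out = 100), ncol = 1)
    pred <- predict(fit, xp)                  # posterior predictive mean/variance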
The Dynamic Time Warping (DTW) distance measure for time series allows non-linear alignments of time series to match similar patterns in time series of different lengths and/or different speeds. IncDTW is characterized by (1) the incremental calculation of DTW (which reduces the runtime complexity to a linear level for updating the DTW distance), especially for live data streams or subsequence matching; (2) a vector-based implementation of DTW, which is faster because no matrices are allocated (reducing the space complexity from quadratic to linear in the number of observations), used for all runtime-intensive DTW computations; (3) the subsequence matching algorithm rundtw, which efficiently finds the k-NN to a query pattern in a long time series; and (4) C++ at its core. For details about DTW see the original paper "Dynamic programming algorithm optimization for spoken word recognition" by Sakoe and Chiba (1978) <DOI:10.1109/TASSP.1978.1163055>. For details about this package, Dynamic Time Warping and Incremental Dynamic Time Warping please see "IncDTW: An R Package for Incremental Calculation of Dynamic Time Warping" by Leodolter et al. (2021) <doi:10.18637/jss.v099.i09>.
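A brief sketch of the vector-based distance and the subsequence search; dtw2vec() and rundtw() are the package functions referenced above, and the parameter values are illustrative.

    # Vector-based DTW distance and k-NN subsequence matching (illustrative).
    library(IncDTW)

    Q <- sin(seq(0, 2 * pi, length.out = 50))                # query pattern
    C <- c(rnorm(200), Q + rnorm(50, sd = 0.1), rnorm(200))  # long series

    dtw2vec(Q, C[201:250])$distance  # memory-lean DTW distance of two series
    hits <- rundtw(Q, C, k = 3)      # 3 nearest subsequences of C to Q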
Predicts categorical or continuous outcomes while concentrating on four key points: Cross-validation, Accuracy, Regression and the Rule of Ten or "one in ten rule" (CARRoT), supplemented by R-squared statistics, prior knowledge of the dataset, etc. It performs cross-validation a specified number of times by partitioning the input into training and test sets and fitting linear/multinomial/binary regression models to the training set. All regression models satisfying the chosen constraints are fitted, and the ones with the best predictive power are given as output. Best predictive power is understood as the highest accuracy in the case of binary/multinomial outcomes, and the smallest absolute and relative errors in the case of continuous outcomes. For the binary case there is also an option of finding a regression model which gives the highest AUROC (Area Under Receiver Operating Curve) value. Parallel computation is also available. Methods are described in Peduzzi et al. (1996) <doi:10.1016/S0895-4356(96)00236-3>, Rhemtulla et al. (2012) <doi:10.1037/a0029315>, Riley et al. (2018) <doi:10.1002/sim.7993>, and Riley et al. (2019) <doi:10.1002/sim.7992>.
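As a quick base-R illustration (not the package interface) of the "one in ten rule" referenced above: the number of candidate predictors is capped at the number of events divided by ten.

    # Rule-of-ten cap on the number of predictors (base-R illustration).
    set.seed(1)
    y <- rbinom(200, 1, 0.3)           # binary outcome
    events <- min(sum(y), sum(1 - y))  # events = size of the rarer class
    floor(events / 10)                 # maximum number of predictors to consider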
This package contains a suite of functions for survival analysis in health economics. These can be used to run survival models under a frequentist approach (based on maximum likelihood) or a Bayesian approach (based on either Integrated Nested Laplace Approximation or Hamiltonian Monte Carlo). To run the Bayesian models, the user needs to install additional modules (packages), i.e. 'survHEinla' and 'survHEhmc'. These can be installed from <https://giabaio.r-universe.dev/> using install.packages("survHEhmc", repos = c("https://giabaio.r-universe.dev", "https://cloud.r-project.org")) and install.packages("survHEinla", repos = c("https://giabaio.r-universe.dev", "https://cloud.r-project.org")), respectively. 'survHEinla' is based on the package INLA, which is available for download at <https://inla.r-inla-download.org/R/stable/>. The user can specify a set of parametric models using a common notation and select the preferred mode of inference. The results can also be post-processed to produce probabilistic sensitivity analysis and can be used to export the output to an Excel file (e.g. for a Markov model, as often done by modellers and practitioners). <doi:10.18637/jss.v095.i14>.
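A minimal frequentist example follows; fit.models() is the interface described in the JSS paper above, and the breast cancer data shipped with flexsurv are used purely for illustration.

    # Fit two parametric survival models by maximum likelihood (illustrative).
    library(survHE)

    data(bc, package = "flexsurv")  # example breast cancer survival data
    fit <- fit.models(formula = Surv(recyrs, censrec) ~ group, data = bc,
                      distr = c("exponential", "weibull"), method = "mle")
    print(fit)  # model comparison via AIC/BIC
    plot(fit)   # overlaid fitted survival curves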
DMC model simulation as detailed in Ulrich, R., Schroeter, H., Leuthold, H., & Birngruber, T. (2015), "Automatic and controlled stimulus processing in conflict tasks: Superimposed diffusion processes and delta functions", Cognitive Psychology, 78, 148-174 <doi:10.1016/j.cogpsych.2015.02.005>. Decision processes within choice reaction-time (CRT) tasks are often modelled using evidence accumulation models (EAMs), a variation of which is the Diffusion Decision Model (DDM; for a review, see Ratcliff & McKoon, 2008). Ulrich et al. (2015) introduced a Diffusion Model for Conflict tasks (DMC). The DMC model combines common features of standard diffusion models with superimposed controlled and automatic activation. The DMC model is used to explain distributional reaction time (and error rate) patterns in common behavioural conflict-like tasks (e.g., the Flanker task, the Simon task). This R package implements the DMC model and provides functionality to fit the model to observed data. Further details are provided in the following paper: Mackenzie, I.G., & Dudschig, C. (2021). DMCfun: An R package for fitting Diffusion Model of Conflict (DMC) to reaction time and error rate data. Methods in Psychology, 100074. <doi:10.1016/j.metip.2021.100074>.
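A minimal simulation sketch; dmcSim() is the package's simulation routine, and the parameter values below (amplitude and time constant of the automatic activation, drift rate, boundary) are illustrative.

    # Simulate the DMC model and inspect delta functions/CAFs (illustrative).
    library(DMCfun)

    sim <- dmcSim(amp = 20, tau = 30, drc = 0.5, bnds = 75)
    plot(sim)  # RT distributions, delta plots, and conditional accuracy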
An implementation of multiple maps t-distributed stochastic neighbor embedding (t-SNE). Multiple maps t-SNE is a method for projecting high-dimensional data into several low-dimensional maps such that non-metric space properties are better preserved than they would be by a single map. Multiple maps t-SNE with only one map is equivalent to standard t-SNE. When projecting onto more than one map, multiple maps t-SNE estimates a set of latent weights that allow each point to contribute to one or more maps depending on similarity relationships in the original data. This implementation is a port of the original Matlab library by Laurens van der Maaten. See Van der Maaten and Hinton (2012) <doi:10.1007/s10994-011-5273-4>. This material is based upon work supported by the United States Air Force and Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-17-C-0020. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and Defense Advanced Research Projects Agency. Distribution Statement A: Approved for Public Release; Distribution Unlimited.
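A small sketch, assuming the main entry point mmtsne() takes a data matrix and the number of maps; verify the exact signature with ?mmtsne.

    # Two 2-D maps with per-point importance weights (assumed interface).
    library(mmtsne)

    X <- as.matrix(iris[, 1:4])
    fit <- mmtsne(X, no_maps = 2, no_dims = 2)
    str(fit)  # embedded coordinates per map plus the latent weights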
The goal of LRcell is to identify specific sub-cell types that drive the changes observed in a bulk RNA-seq differential gene expression experiment. To achieve this, LRcell utilizes sets of cell marker genes acquired from single-cell RNA-sequencing (scRNA-seq) as indicators for various cell types in the tissue of interest. Next, for each cell type, using its marker genes as indicators, we apply Logistic Regression on the complete set of genes with differential expression p-values to calculate a cell-type significance p-value. Finally, these p-values are compared to predict which cell type(s) are likely to be responsible for the differential gene expression pattern observed in the bulk RNA-seq experiments. LRcell is inspired by the LRpath algorithm developed by Sartor et al. (2009), originally designed for pathway/gene set enrichment analysis. LRcell contains three major components: LRcell analysis, plot generation and marker gene selection. All modules in this package are written in R. This package also provides marker genes in the Prefrontal Cortex (pFC) human brain region, human PBMC and nine mouse brain regions (Frontal Cortex, Cerebellum, Globus Pallidus, Hippocampus, Entopeduncular, Posterior Cortex, Striatum, Substantia Nigra and Thalamus).
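A schematic call follows; the argument names and the bundled example_gene_pvals input reflect the package vignette as far as we can tell and should be treated as assumptions.

    # Rank cell types by enrichment in bulk DE p-values (assumed interface).
    library(LRcell)

    data("example_gene_pvals")  # named vector of bulk RNA-seq DE p-values
    res <- LRcell(gene.p = example_gene_pvals,
                  species = "mouse", region = "FC", method = "LiR")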
Genetic variants associated with diseases often affect non-coding regions, thus likely having a regulatory role. To understand the effects of genetic variants in these regulatory regions, identifying genes that are modulated by specific regulatory elements (REs) is crucial. The effect of gene regulatory elements, such as enhancers, is often cell-type specific, likely because the combinations of transcription factors (TFs) that are regulating a given enhancer have cell-type specific activity. This TF activity can be quantified with existing tools such as diffTF and captures differences in binding of a TF in open chromatin regions. Collectively, this forms a gene regulatory network (GRN) with cell-type and data-specific TF-RE and RE-gene links. Here, we reconstruct such a GRN using single-cell or bulk RNAseq and open chromatin (e.g., using ATACseq or ChIPseq for open chromatin marks) and optionally (Capture) Hi-C data. Our network contains different types of links, connecting TFs to regulatory elements, the latter of which are connected to genes in the vicinity or within the same chromatin domain (TAD). We use a statistical framework to assign empirical FDRs and weights to all links using a permutation-based approach.
Automated generation, running, and interpretation of moderated nonlinear factor analysis models for obtaining scores from observed variables, using the method described by Gottfredson and colleagues (2019) <doi:10.1016/j.addbeh.2018.10.031>. This package creates Mplus input files which may be run iteratively to test two different types of covariate effects on items: (1) latent variable impact (both mean and variance); and (2) differential item functioning. After sequentially testing for all effects, it also creates a final model by including all significant effects after adjusting for multiple comparisons. Finally, the package creates a scoring model which uses the final values of parameter estimates to generate latent variable scores. This package generates TEMPLATES for Mplus inputs, which can and should be inspected, altered, and run by the user. In addition to being presented without warranty of any kind, the package is provided under the assumption that everyone who uses it is reading, interpreting, understanding, and altering every Mplus input and output file. There is no one right way to implement moderated nonlinear factor analysis, and this package exists solely to save users time as they generate Mplus syntax according to their own judgment.
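A schematic of the intended sequence is below; the function and argument names are assumptions based on the steps described above, and every generated Mplus file should be inspected before running.

    # Hypothetical workflow sketch; verify names against the package docs.
    library(aMNLFA)

    ob <- aMNLFA.object(dir = "mplus_files", mrdata = mydata,  # mydata: your data
                        indicators = c("i1", "i2", "i3"),
                        catindicators = c("i1", "i2", "i3"),
                        meanimpact = c("age", "sex"), varimpact = "age",
                        measinvar = c("age", "sex"), factors = "sex", ID = "id")
    aMNLFA.initial(ob)       # templates for impact and item-wise DIF models
    aMNLFA.simultaneous(ob)  # combined model from the initial runs
    aMNLFA.final(ob)         # final calibration model after pruning
    aMNLFA.scores(ob)        # scoring model with fixed parameter estimates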
Inferences about counterfactuals are essential for prediction, answering "what if" questions, and estimating causal effects. However, when the counterfactuals posed are too far from the data at hand, conclusions drawn from well-specified statistical analyses become based largely on speculation hidden in convenient modeling assumptions that few would be willing to defend. Unfortunately, standard statistical approaches assume the veracity of the model rather than revealing the degree of model dependence, which makes this problem hard to detect. WhatIf offers easy-to-apply methods to evaluate counterfactuals that do not require sensitivity testing over specified classes of models. If an analysis fails the tests offered here, then we know that substantive inferences will be sensitive to at least some modeling choices that are not based on empirical evidence, no matter what method of inference one chooses to use. WhatIf implements the methods for evaluating counterfactuals discussed in Gary King and Langche Zeng, 2006, "The Dangers of Extreme Counterfactuals," Political Analysis 14 (2) <DOI:10.1093/pan/mpj004>; and Gary King and Langche Zeng, 2007, "When Can History Be Our Guide? The Pitfalls of Counterfactual Inference," International Studies Quarterly 51 (March) <DOI:10.1111/j.1468-2478.2007.00445.x>.
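A minimal example of the convex hull membership and distance checks:

    # Evaluate two counterfactual points against observed covariate data.
    library(WhatIf)

    set.seed(1)
    obs <- data.frame(x1 = rnorm(100), x2 = rnorm(100))  # observed covariates
    cf  <- data.frame(x1 = c(0, 5), x2 = c(0, 5))        # counterfactuals
    res <- whatif(data = obs, cfact = cf)
    summary(res)  # hull membership and share of nearby observed points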
Set of tools to generate samples of k-th order statistics and other quantities of interest from new families of distributions. The main references for this package are: C. Kleiber and S. Kotz (2003), Statistical Size Distributions in Economics and Actuarial Sciences; Gentle, J. (2009), Computational Statistics, Springer-Verlag; Nadarajah, S. and Rocha, R. (2016) <DOI:10.18637/jss.v069.i10>; and Stasinopoulos, M. and Rigby, R. (2015) <DOI:10.1111/j.1467-9876.2005.00510.x>. The families of distributions are: Benini distributions, Burr distributions, Dagum distributions, Feller-Pareto distributions, Generalized Pareto distributions, Inverse Pareto distributions, Inverse Paralogistic distributions, Marshall-Olkin G distributions, exponentiated G distributions, beta G distributions, gamma G distributions, Kumaraswamy G distributions, generalized beta G distributions, beta extended G distributions, gamma uniform G distributions, beta exponential G distributions, Weibull G distributions, log gamma G I distributions, log gamma G II distributions, exponentiated generalized G distributions, exponentiated Kumaraswamy G distributions, geometric exponential Poisson G distributions, truncated-exponential skew-symmetric G distributions, modified beta G distributions, exponentiated exponential Poisson G distributions, Poisson-inverse Gaussian distributions, skew normal type 1 distributions, skew Student t distributions, Singh-Maddala distributions, Sinh-Arcsinh distributions, Sichel distributions, and zero-inflated Poisson distributions.
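The core device for sampling a k-th order statistic can be shown in a few lines of base R (this illustrates the underlying mathematics, not the package interface): if U ~ Beta(k, n + 1 - k), then F^{-1}(U) has the distribution of the k-th order statistic of n i.i.d. draws from F.

    # k-th order statistic of n draws via the Beta trick (base-R illustration).
    n <- 10; k <- 3
    u <- rbeta(5000, k, n + 1 - k)  # k-th order statistic of n uniforms
    x <- qexp(u, rate = 1)          # transform to the target distribution, Exp(1)

    brute <- replicate(5000, sort(rexp(n))[k])  # brute-force check
    qqplot(x, brute); abline(0, 1)              # points should hug the diagonal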
Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the Expectation-Maximization (EM) algorithm that not only can effectively utilize unknown similarity between related tasks but is also robust against a fraction of outlier tasks from arbitrary sources. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both the parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Finally, we demonstrate the effectiveness of our methods through simulations and a real data analysis. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees. This package implements the algorithms proposed in Tian, Y., Weng, H., & Feng, Y. (2022) <arXiv:2209.15224>.
The depmap package is a data package that accesses datasets from the Broad Institute DepMap cancer dependency study using ExperimentHub. Datasets from the most current release are available, including RNAi and CRISPR-Cas9 gene knockout screens quantifying the genetic dependency for select cancer cell lines. Additional datasets are also available pertaining to the log copy number of genes for select cell lines, protein expression of cell lines as measured by reverse phase protein lysate microarray (RPPA), Transcript Per Million (TPM) data, as well as supplementary datasets which contain metadata and mutation calls for the other datasets found in the current release. The 19Q3 release adds the drug_dependency dataset, which contains cancer cell line dependency data with respect to drug and drug-candidate compounds. The 20Q2 release adds the proteomic dataset, which contains quantitative profiling of proteins via mass spectrometry. This package will be updated on a quarterly basis to incorporate the latest Broad Institute DepMap Public cancer dependency datasets. All data made available in this package were generated by the Broad Institute DepMap for research purposes and are not intended for clinical use. The data are distributed under the Creative Commons license (Attribution 4.0 International (CC BY 4.0)).
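A short access sketch; accessor names such as depmap_rnai() and depmap_metadata() follow the package documentation as far as we can tell, and data are fetched and cached through ExperimentHub on first use.

    # Retrieve the current-release RNAi screen and cell-line metadata.
    library(depmap)
    library(ExperimentHub)

    rnai <- depmap_rnai()      # RNAi dependency scores (tibble)
    meta <- depmap_metadata()  # cell-line annotations
    head(rnai)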
Extensive functions for L-moments (LMs) and probability-weighted moments (PWMs), distribution parameter estimation, LMs for distributions, LM ratio diagrams, multivariate L-comoments, and asymmetric (asy) trimmed LMs (TLMs). Maximum likelihood and maximum product spacings estimation are available. Right-tail and left-tail LM censoring by threshold or indicator variable are available. LMs of residual (resid) and reversed (rev) residual life are implemented along with 13 quantile operators for reliability analyses. Exact analytical bootstrap estimates of order statistics, LMs, and LM var-covars are available. The Harri-Coble Tau34-squared Normality Test is available. Distributions with L, TL, and added (+) support for right-tail censoring (RC) encompass: Asy Exponential (Exp) Power [L], Asy Triangular [L], Cauchy [TL], Eta-Mu [L], Exp. [L], Gamma [L], Generalized (Gen) Exp Poisson [L], Gen Extreme Value [L], Gen Lambda [L, TL], Gen Logistic [L], Gen Normal [L], Gen Pareto [L+RC, TL], Govindarajulu [L], Gumbel [L], Kappa [L], Kappa-Mu [L], Kumaraswamy [L], Laplace [L], Linear Mean Residual Quantile Function [L], Normal [L], 3p log-Normal [L], Pearson Type III [L], Polynomial Density-Quantile 3 and 4 [L], Rayleigh [L], Rev-Gumbel [L+RC], Rice [L], Singh-Maddala [L], Slash [TL], 3p Student t [L], Truncated Exponential [L], Wakeby [L], and Weibull [L].
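A compact sketch of the core workflow, using the documented lmoms(), pargev(), and quagev() functions: sample L-moments, then parameter estimation, then quantiles.

    # L-moment fit of a Generalized Extreme Value distribution (illustrative).
    library(lmomco)

    x <- rnorm(200, mean = 100, sd = 15)
    lmr <- lmoms(x)      # sample L-moments
    para <- pargev(lmr)  # GEV parameters from the L-moments
    quagev(0.99, para)   # estimated 99th-percentile quantile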
Various tools dealing with batch effects, in particular enabling the removal of discrepancies between training and test sets in prediction scenarios. Moreover, addon quantile normalization and addon RMA normalization (Kostka & Spang, 2008) are implemented to enable integrating the quantile normalization step into prediction rules. The following batch effect removal methods are implemented: FAbatch, ComBat, (f)SVA, mean-centering, standardization, Ratio-A and Ratio-G. For each of these we provide an additional function which enables a posteriori ('addon') batch effect removal in independent batches ('test data'). Here, the (already batch effect adjusted) training data is not altered. For evaluating the success of batch effect adjustment several metrics are provided. Moreover, the package implements a plot for the visualization of batch effects using principal component analysis. The main functions of the package for batch effect adjustment are ba() and baaddon(), which enable batch effect removal and addon batch effect removal, respectively, with one of the seven methods mentioned above. Another important function here is bametric(), which is a wrapper function for all implemented methods for evaluating the success of batch effect removal. For (addon) quantile normalization and (addon) RMA normalization the functions qunormtrain(), qunormaddon(), rmatrain() and rmaaddon() can be used.
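A minimal sketch of ba() and baaddon() on simulated data; the argument names here are assumptions to be checked against ?ba.

    # Batch effect removal on training data, addon adjustment on test data.
    library(bapred)

    set.seed(1)
    Xtr <- matrix(rnorm(100 * 20), 100, 20)
    ytr <- factor(rbinom(100, 1, 0.5))
    btr <- factor(rep(1:5, each = 20))

    adj <- ba(x = Xtr, y = ytr, batch = btr, method = "combat")
    Xte <- matrix(rnorm(20 * 20), 20, 20)
    Xte_adj <- baaddon(params = adj, x = Xte, batch = factor(rep(1, 20)))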
This package provides fast and efficient procedures for Bayesian analysis of Structural Vector Autoregressions. The package estimates a wide range of models, including homoskedastic, heteroskedastic, and non-normal specifications. Structural models can be identified by adjustable exclusion restrictions, time-varying volatility, or non-normality. They all include a flexible three-level equation-specific local-global hierarchical prior distribution for the estimated level of shrinkage for autoregressive and structural parameters. Additionally, the package facilitates predictive and structural analyses such as impulse responses, forecast error variance and historical decompositions, forecasting, verification of heteroskedasticity, non-normality, and hypotheses on autoregressive parameters, as well as analyses of structural shocks, volatilities, and fitted values. Beautiful plots, informative summary functions, and extensive documentation, including the vignette by Woźniak (2024) <doi:10.48550/arXiv.2410.15090>, complement all this. The implemented techniques align closely with those presented in Lütkepohl, Shang, Uzeda, & Woźniak (2024) <doi:10.48550/arXiv.2404.11057>, Lütkepohl & Woźniak (2020) <doi:10.1016/j.jedc.2020.103862>, and Song & Woźniak (2021) <doi:10.1093/acrefore/9780190625979.013.174>. The bsvars package is aligned regarding objects, workflows, and code structure with the R package bsvarSIGNs by Wang & Woźniak (2024) <doi:10.32614/CRAN.package.bsvarSIGNs>, and they constitute an integrated toolset.
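A compact estimation sketch following the package's specify-then-estimate workflow; the example dataset, lag order, and draw counts are illustrative.

    # Homoskedastic Bayesian SVAR: specify, burn in, sample, then IRFs.
    library(bsvars)

    data(us_fiscal_lsuw)                              # example data in the package
    spec <- specify_bsvar$new(us_fiscal_lsuw, p = 4)  # SVAR with 4 lags
    burn <- estimate(spec, S = 1000)                  # burn-in run
    post <- estimate(burn, S = 5000)                  # posterior sampling
    irf  <- compute_impulse_responses(post, horizon = 8)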
O-statistics, or overlap statistics, measure the degree of community-level trait overlap. They are estimated by fitting nonparametric kernel density functions to each species' trait distribution and calculating their areas of overlap. For instance, the median pairwise overlap for a community is calculated by first determining the overlap of each species pair in trait space, and then taking the median across all species pairs in the community. This median overlap value is called the O-statistic (O for overlap). The Ostats() function calculates separate univariate overlap statistics for each trait, while the Ostats_multivariate() function calculates a single multivariate overlap statistic for all traits. O-statistics can be evaluated against null models to obtain standardized effect sizes. Ostats is part of the collaborative Macrosystems Biodiversity Project "Local- to continental-scale drivers of biodiversity across the National Ecological Observatory Network (NEON)." For more information on this project, see the Macrosystems Biodiversity Website (<https://neon-biodiversity.github.io/>). Calculation of O-statistics is described in Read et al. (2018) <doi:10.1111/ecog.03641>, and a teaching module for introducing the underlying biological concepts at an undergraduate level is described in Grady et al. (2018) <http://tiee.esa.org/vol/v14/issues/figure_sets/grady/abstract.html>.
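A schematic call to Ostats(); the argument names follow the package documentation, while the data frame and its columns are hypothetical placeholders.

    # Community-level pairwise trait overlap per site (hypothetical inputs).
    library(Ostats)

    res <- Ostats(traits = as.matrix(dat[, "log_weight", drop = FALSE]),
                  plots = factor(dat$siteID),
                  sp = factor(dat$taxonID))
    res$overlaps_norm  # median pairwise overlap per site (slot name assumed)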
This package provides a framework for simulating spatially explicit genomic data which leverages real cartographic information for programmatic and visual encoding of spatiotemporal population dynamics on real geographic landscapes. Population genetic models are then automatically executed by the SLiM software of Haller et al. (2019) <doi:10.1093/molbev/msy228> behind the scenes, using a custom built-in SLiM simulation script. Additionally, fully abstract spatial models not tied to a specific geographic location are supported, and users can also simulate data from standard, non-spatial, random-mating models. These can be simulated either with the built-in SLiM back-end script, or using the efficient coalescent population genetics simulator msprime of Baumdicker et al. (2022) <doi:10.1093/genetics/iyab229> with a custom-built Python script bundled with the R package. Simulated genomic data are saved in a tree-sequence format and can be loaded, manipulated, and summarised using tree-sequence functionality via an R interface to the Python module tskit by Kelleher et al. (2019) <doi:10.1038/s41588-019-0483-y>. Complete model configuration, simulation, and analysis pipelines can therefore be constructed without the need to leave the R environment, eliminating friction between disparate tools for population genetic simulations and data analysis.
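A toy non-spatial model sketch; the functions follow the slendr documentation as far as we can tell (an initialized Python environment is required), and all parameter values are illustrative.

    # Two populations, coalescent simulation, tree sequence in R (illustrative).
    library(slendr)
    init_env()  # activate the bundled Python environment for msprime/tskit

    anc <- population("ancestor", time = 5000, N = 1000)
    dau <- population("daughter", parent = anc, time = 3000, N = 500)
    model <- compile_model(populations = list(anc, dau), generation_time = 25)
    ts <- msprime(model, sequence_length = 1e6, recombination_rate = 1e-8)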
This package provides functions for calculating biochemical methane potential (BMP) from laboratory measurements and other types of data processing and prediction useful for biogas research. Raw laboratory measurements for diverse methods (volumetric, manometric, gravimetric, gas density) can be processed to calculate BMP. Theoretical maximum BMP or methane or biogas yield can be predicted from various measures of substrate composition. Molar mass and calculated oxygen demand ('COD') can be determined from a chemical formula. Measured gas volume can be corrected for water vapor and to standard (or user-defined) temperature and pressure. Gas quantity can be converted between volume, mass, and moles. A function for planning BMP experiments can consider multiple constraints in suggesting substrate or inoculum quantities, and check for problems. Inoculum and substrate mass can be determined for planning BMP experiments. Finally, a set of first-order models can be fit to measured methane production rate or cumulative yield in order to extract estimates of ultimate yield and kinetic constants. See Hafner et al. (2018) <doi:10.1016/j.softx.2018.06.005> for details. OBA is a web application that provides access to some of the package functionality: <https://biotransformers.shinyapps.io/oba1/>. The Standard BMP Methods website documents the calculations in detail: <https://www.dbfz.de/en/BMP>.
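Two one-line examples of the prediction and gas-correction utilities; predBg() and stdVol() are documented package functions, and the values are illustrative.

    # Theoretical CH4 potential and standardization of a measured gas volume.
    library(biogas)

    predBg(form = "C6H10O5")                # theoretical CH4 yield of cellulose
    stdVol(vol = 100, temp = 35, pres = 1)  # to standard T and P (defaults: C, atm)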
Model-based clustering of multivariate continuous data using Bayesian mixtures of factor analyzers (Papastamoulis (2019) <DOI:10.1007/s11222-019-09891-z>; Papastamoulis (2018) <DOI:10.1016/j.csda.2018.03.007>). The number of clusters is estimated using overfitting mixture models (Rousseau and Mengersen (2011) <DOI:10.1111/j.1467-9868.2011.00781.x>): suitable prior assumptions ensure that asymptotically the extra components will have zero posterior weight; therefore, the inference is based on the 'alive' components. A Gibbs sampler is implemented in order to (approximately) sample from the posterior distribution of the overfitting mixture. A prior parallel tempering scheme is also available, which allows running multiple parallel chains with different prior distributions on the mixture weights. These chains run in parallel and can swap states using a Metropolis-Hastings move. Eight different parameterizations give rise to parsimonious representations of the covariance per cluster (following McNicholas and Murphy (2008) <DOI:10.1007/s11222-008-9056-0>). The model parameterization and number of factors are selected according to the Bayesian Information Criterion. Identifiability issues related to label switching are dealt with by post-processing the simulated output with the Equivalence Classes Representatives algorithm (Papastamoulis and Iliopoulos (2010) <DOI:10.1198/jcgs.2010.09008>; Papastamoulis (2016) <DOI:10.18637/jss.v069.c01>).
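A schematic call follows; the argument names are assumptions based on the package documentation, and a real analysis needs far more MCMC cycles than shown.

    # Overfitting Bayesian mixture of factor analyzers (assumed interface).
    library(fabMix)

    fit <- fabMix(rawData = as.matrix(iris[, 1:4]),
                  Kmax = 5,            # upper bound on mixture components
                  q = 2,               # number of factors
                  nChains = 2,         # prior-tempered parallel chains
                  mCycles = 100,       # MCMC cycles (far too few in practice)
                  outDir = tempdir())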
This package provides a set of tools to analyze texts. Includes, amongst others, functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type token ratio, HD-D/vocd-D, MTLD) and readability (e.g., Flesch, SMOG, LIX, Dale-Chall). Basic import functions for language corpora are also provided, to enable frequency analyses (supports Celex and Leipzig Corpora Collection file formats) and measures like tf-idf. Note: For full functionality a local installation of TreeTagger is recommended. It is also recommended to not load this package directly, but by loading one of the available language support packages from the l10n repository <https://undocumeantit.github.io/repos/l10n/>. koRpus also includes a plugin for the R GUI and IDE RKWard, providing graphical dialogs for its basic features. The respective R package rkward cannot be installed directly from a repository, as it is a part of RKWard. To make full use of this feature, please install RKWard from <https://rkward.kde.org> (plugins are detected automatically). Due to some restrictions on CRAN, the full package sources are only available from the project homepage. To ask for help, report bugs, request features, or discuss the development of the package, please subscribe to the koRpus-dev mailing list (<https://korpusml.reaktanz.de>).
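A small sketch of tokenizing and scoring lexical diversity; tokenize() and lex.div() are documented functions, TreeTagger is not needed for this path, and the language package is assumed to be installed from the l10n repository.

    # Tokenize a string and compute lexical diversity indices.
    library(koRpus)
    library(koRpus.lang.en)  # English language support

    tok <- tokenize("This is a short sample text for analysis.",
                    format = "obj", lang = "en")
    lex.div(tok)  # TTR, MTLD, HD-D and related indices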
In forensics, it is common and effective practice to analyse glass fragments from the scene and suspects to gain evidence placing a suspect at the crime scene. This kind of analysis involves comparing the physical and chemical attributes of glass fragments that exist on both the person and at the crime scene, and assessing the significance of the likeness that they share. The package implements the Scott-Knott Modification 2 algorithm (SKM2) (Christopher M. Triggs, James M. Curran, John S. Buckleton and Kevan A.J. Walsh (1997) <doi:10.1016/S0379-0738(96)02037-3> "The grouping problem in forensic glass analysis: a divisive approach", Forensic Science International, 85(1), 1--14) for small-sample glass fragment analysis using the refractive index (ri) of a set of glass samples. It also includes an experimental multivariate analog to the Scott-Knott algorithm for similar analysis on glass samples with multiple chemical concentration variables and multiple samples of the same item, testing against Hotelling's T^2 distribution (J.M. Curran, C.M. Triggs, J.R. Almirall, J.S. Buckleton and K.A.J. Walsh (1997) <doi:10.1016/S1355-0306(97)72197-X> "The interpretation of elemental composition measurements from forensic glass evidence", Science & Justice, 37(4), 241--244).