Quantification is a prominent machine learning task that has received an increasing amount of attention in the last years. The objective is to predict the class distribution of a data sample. This package is a collection of machine learning algorithms for class distribution estimation. This package include algorithms from different paradigms of quantification. These methods are described in the paper: A. Maletzke, W. Hassan, D. dos Reis, and G. Batista. The importance of the test set size in quantification assessment. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI20, pages 2640â 2646, 2020. <doi:10.24963/ijcai.2020/366>.
Representation-dependent gene level operations of a genetic algorithm with binary coded genes: Initialization of random binary genes, several gene maps for binary genes, several mutation operators, several crossover operators with 1 and 2 kids, replication pipelines for 1 and 2 kids, and, last but not least, function factories for configuration. See Goldberg, D. E. (1989, ISBN:0-201-15767-5). For crossover operators, see Syswerda, G. (1989, ISBN:1-55860-066-3), Spears, W. and De Jong, K. (1991, ISBN:1-55860-208-9). For mutation operators, see Stanhope, S. A. and Daida, J. M. (1996, ISBN:0-18-201-031-7).
High-throughput experimental data are accumulating exponentially in public databases. However, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed "batch effects," and the latter is often modelled by "subtypes." The R package BUScorrect fits a Bayesian hierarchical model, the Batch-effects-correction-with-Unknown-Subtypes model (BUS), to correct batch effects in the presence of unknown subtypes. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity.
This is an R package for interfacing with the BIOM format. This package includes basic tools for reading biom-format files, accessing and subsetting data tables from a biom object (which is more complex than a single table), as well as limited support for writing a biom-object back to a biom-format file. The design of this API is intended to match the Python API and other tools included with the biom-format project, but with a decidedly "R flavor" that should be familiar to R users. This includes S4 classes and methods, as well as extensions of common core functions/methods.
This package provides methods for mediation analysis with missing data and non-normal data are implemented. For missing data, four methods are available: Listwise deletion, Pairwise deletion, Multiple imputation, and Two Stage Maximum Likelihood algorithm. For MI and TS-ML, auxiliary variables can be included to handle missing data. For handling non-normal data, bootstrap and two-stage robust methods can be used. Technical details of the methods can be found in Zhang and Wang (2013, <doi:10.1007/s11336-012-9301-5>), Zhang (2014, <doi:10.3758/s13428-013-0424-0>), and Yuan and Zhang (2012, <doi:10.1007/s11336-012-9282-4>).
Implementation of different statistical tools for the description and analysis of gene expression data based on the concept of data depth, namely, the scale curves for visualizing the dispersion of one or various groups of samples (e.g. types of tumors), a rank test to decide whether two groups of samples come from a single distribution and two methods of supervised classification techniques, the DS and TAD methods. All these techniques are based on the Modified Band Depth, which is a recent notion of depth with a low computational cost, what renders it very appropriate for high dimensional data such as gene expression data.
Three functional modules, including genetic features, differential expression analysis and non-additive expression analysis were integrated into the package. And the package is suitable for RNA-seq and small RNA sequencing data. Besides, two methods of non-additive expression analysis were provided. One is the calculation of the additive (a) and dominant (d), the other is the evaluation of expression level dominance by comparing the total expression of the gene in hybrid offspring with the expression level in parents. For non-additive expression analysis of RNA-seq data, it is only applicable to hybrid offspring (including two sub-genomes) species for the time being.
This package provides tools for econometric analysis and economic modelling with the traditional two-input Constant Elasticity of Substitution (CES) function and with nested CES functions with three and four inputs. The econometric estimation can be done by the Kmenta approximation, or non-linear least-squares using various gradient-based or global optimisation algorithms. Some of these algorithms can constrain the parameters to certain ranges, e.g. economically meaningful values. Furthermore, the non-linear least-squares estimation can be combined with a grid-search for the rho-parameter(s). The estimation methods are described in Henningsen et al. (2021) <doi:10.4337/9781788976480.00030>.
Utilizing a combination of machine learning models (Random Forest, Naive Bayes, K-Nearest Neighbor, Support Vector Machines, Extreme Gradient Boosting, and Linear Discriminant Analysis) and a deep Artificial Neural Network model, MBMethPred
can predict medulloblastoma subgroups, including wingless (WNT), sonic hedgehog (SHH), Group 3, and Group 4 from DNA methylation beta values. See Sharif Rahmani E, Lawarde A, Lingasamy P, Moreno SV, Salumets A and Modhukur V (2023), MBMethPred
: a computational framework for the accurate classification of childhood medulloblastoma subgroups using data integration and AI-based approaches. Front. Genet. 14:1233657. <doi: 10.3389/fgene.2023.1233657> for more details.
Includes functions to work with the Mallows and Generalized Mallows Models. The considered distances are Kendall's-tau, Cayley, Hamming and Ulam and it includes functions for making inference, sampling and learning such distributions, some of which are novel in the literature. As a by-product, PerMallows
also includes operations for permutations, paying special attention to those related with the Kendall's-tau, Cayley, Ulam and Hamming distances. It is also possible to generate random permutations at a given distance, or with a given number of inversions, or cycles, or fixed points or even with a given length on LIS (longest increasing subsequence).
This package implements the SparseStep
model for solving regression problems with a sparsity constraint on the parameters. The SparseStep
regression model was proposed in Van den Burg, Groenen, and Alfons (2017) <arXiv:1701.06967>
. In the model, a regularization term is added to the regression problem which approximates the counting norm of the parameters. By iteratively improving the approximation a sparse solution to the regression problem can be obtained. In this package both the standard SparseStep
algorithm is implemented as well as a path algorithm which uses golden section search to determine solutions with different values for the regularization parameter.
Sequential Monte Carlo (SMC) algorithms for fitting a generalised additive mixed model (GAMM) to surface-enhanced resonance Raman spectroscopy (SERRS), using the method of Moores et al. (2016) <arXiv:1604.07299>
. Multivariate observations of SERRS are highly collinear and lend themselves to a reduced-rank representation. The GAMM separates the SERRS signal into three components: a sequence of Lorentzian, Gaussian, or pseudo-Voigt peaks; a smoothly-varying baseline; and additive white noise. The parameters of each component of the model are estimated iteratively using SMC. The posterior distributions of the parameters given the observed spectra are represented as a population of weighted particles.
Stochastic blockmodeling of one-mode and linked networks as implemented in Škulj and Žiberna (2022) <doi:10.1016/j.socnet.2022.02.001>. The optimization is done via CEM (Classification Expectation Maximization) algorithm that can be initialized by random partitions or the results of k-means algorithm. The development of this package is financially supported by the Slovenian Research Agency (<https://www.arrs.si/>) within the research programs P5-0168 and the research projects J7-8279 (Blockmodeling multilevel and temporal networks) and J5-2557 (Comparison and evaluation of different approaches to blockmodeling dynamic networks by simulations with application to Slovenian co-authorship networks).
This package implements an automated binning of numeric variables and factors with respect to a dichotomous target variable. Two approaches are provided: An implementation of fine and coarse classing that merges granular classes and levels step by step. And a tree-like approach that iteratively segments the initial bins via binary splits. Both procedures merge, respectively split, bins based on similar weight of evidence (WOE) values and stop via an information value (IV) based criteria. The package can be used with single variables or an entire data frame. It provides flexible tools for exploring different binning solutions and for deploying them to (new) data.
Pulls together a collection of datasets from Miguel de Carvalho research articles. Including, for example: - de Carvalho (2012) <doi:10.1016/j.jspi.2011.08.016>; - de Carvalho et al (2012) <doi:10.1080/03610926.2012.709905>; - de Carvalho et al (2012) <doi:10.1016/j.econlet.2011.09.007>); - de Carvalho and Davison (2014) <doi:10.1080/01621459.2013.872651>; - de Carvalho and Rua (2017) <doi:10.1016/j.ijforecast.2015.09.004>; - de Carvalho et al (2023) <doi:10.1002/sta4.560>; - de Carvalho et al (2022) <doi:10.1007/s13253-021-00469-9>; - Palacios et al (2024) <doi:10.1214/24-BA1420>.
This package provides two functions that generate source code implementing the predict function of fitted glm objects. In this version, code can be generated for either C or Java'. The idea is to provide a tool for the easy and fast deployment of glm predictive models into production. The source code generated by this package implements two function/methods. One of such functions implements the equivalent to predict(type="response"), while the second implements predict(type="link"). Source code is written to disk as a .c or .java file in the specified path. In the case of c, an .h file is also generated.
Utilities for reading data from the Human Mortality Database (<https://www.mortality.org>), Human Fertility Database (<https://www.humanfertility.org>), and similar databases from the web or locally into an R session as data.frame objects. These are the two most widely used sources of demographic data to study basic demographic change, trends, and develop new demographic methods. Other supported databases at this time include the Human Fertility Collection (<https://www.fertilitydata.org>), The Japanese Mortality Database (<https://www.ipss.go.jp/p-toukei/JMD/index-en.html>), and the Canadian Human Mortality Database (<http://www.bdlc.umontreal.ca/chmd/>). Arguments and data are standardized.
Simulates categorical maps on actual geographical realms, starting from either empty landscapes or landscapes provided by the user (e.g. land use maps). Allows to tweak or create landscapes while retaining a high degree of control on its features, without the hassle of specifying each location attribute. In this it differs from other tools which generate null or neutral landscapes in a theoretical space. The basic algorithm currently implemented uses a simple agent style/cellular automata growth model, with no rules (apart from areas of exclusion) and von Neumann neighbourhood (four cells, aka Rook case). Outputs are raster dataset exportable to any common GIS format.
There are three distinct approaches for phase error correction, they are: a single linear model with a choice of optimization functions, multiple linear models with optimization function choices and a shrinkage-based method. The methodology is based on our new algorithms and various references (Binczyk et al. (2015) <doi:10.1186/1475-925X-14-S2-S5>,Chen et al. (2002) <doi:10.1016/S1090-7807(02)00069-1>, de Brouwer (2009) <doi:10.1016/j.jmr.2009.09.017>, Džakula (2000) <doi:10.1006/jmre.2000.2123>, Ernst (1969) <doi:10.1016/0022-2364(69)90003-1>, Liland et al. (2010) <doi:10.1366/000370210792434350>).
Representing nucleotide modifications in a nucleotide sequence is usually done via special characters from a number of sources. This represents a challenge to work with in R and the Biostrings package. The Modstrings package implements this functionality for RNA and DNA sequences containing modified nucleotides by translating the character internally in order to work with the infrastructure of the Biostrings package. For this the ModRNAString
and ModDNAString
classes and derivates and functions to construct and modify these objects despite the encoding issues are implemenented. In addition the conversion from sequences to list like location information (and the reverse operation) is implemented as well.
The main function calculates confidence intervals (CI) for Mixed Models, utilizing both classical estimators from the lmer()
function in the lme4 package and robust estimators from the rlmer()
function in the robustlmm package, as well as the varComprob()
function in the robustvarComp
package. Three methods are available: the classical Wald method, the wild bootstrap, and the parametric bootstrap. Bootstrap methods offer flexibility in obtaining lower and upper bounds through percentile or BCa methods. More details are given in Mason, F., Cantoni, E., & Ghisletta, P. (2021) <doi:10.5964/meth.6607> and Mason, F., Cantoni, E., & Ghisletta, P. (2024) <doi:10.1037/met0000643>.
Lake morphometry metrics are used by limnologists to understand, among other things, the ecological processes in a lake. Traditionally, these metrics are calculated by hand, with planimeters, and increasingly with commercial GIS products. All of these methods work; however, they are either outdated, difficult to reproduce, or require expensive licenses to use. The lakemorpho package provides the tools to calculate a typical suite of these metrics from an input elevation model and lake polygon. The metrics currently supported are: fetch, major axis, minor axis, major/minor axis ratio, maximum length, maximum width, mean width, maximum depth, mean depth, shoreline development, shoreline length, surface area, and volume.
Numerically solve and plot solutions of a parametric ordinary differential equations model of growth, death, and respiration of macroinvertebrate and algae taxa dependent on pre-defined environmental factors. The model (version 1.0) is introduced in Schuwirth, N. and Reichert, P., (2013) <DOI:10.1890/12-0591.1>. This package includes model extensions and the core functions introduced and used in Schuwirth, N. et al. (2016) <DOI:10.1111/1365-2435.12605>, Kattwinkel, M. et al. (2016) <DOI:10.1021/acs.est.5b04068>, Mondy, C. P., and Schuwirth, N. (2017) <DOI:10.1002/eap.1530>, and Paillex, A. et al. (2017) <DOI:10.1111/fwb.12927>.
cytoKernel
implements a kernel-based score test to identify differentially expressed features in high-dimensional biological experiments. This approach can be applied across many different high-dimensional biological data including gene expression data and dimensionally reduced cytometry-based marker expression data. In this R package, we implement functions that compute the feature-wise p values and their corresponding adjusted p values. Additionally, it also computes the feature-wise shrunk effect sizes and their corresponding shrunken effect size. Further, it calculates the percent of differentially expressed features and plots user-friendly heatmap of the top differentially expressed features on the rows and samples on the columns.