This package performs the analysis of completely randomized designs (CRD), randomized block designs (RBD) and Latin square designs (LSD); experiments in double and triple factorial schemes (in CRD and RBD); experiments in split-plot and split-split-plot schemes (in CRD and RBD); joint analysis of experiments in CRD and RBD; linear regression analysis; and tests for two samples. The package performs analysis of variance, checks ANOVA assumptions, and carries out multiple comparison tests of means or regression analysis according to Pimentel-Gomes (2009, ISBN: 978-85-7133-055-9), nonparametric tests (Conover, 1999, ISBN: 0471160687), joint analysis of experiments according to Ferreira (2018, ISBN: 978-85-7269-566-4), and generalized linear models (GLM) for the binomial and Poisson families in CRD and RBD (Carvalho, F.J. (2019) <doi:10.14393/ufu.te.2019.1244>). It can also be used to obtain descriptive measures and graphics, in addition to correlations and creative graphics used in the agricultural sciences (agronomy, zootechnics, food science and related areas).
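For orientation, the core of such an analysis can be sketched in base R for a one-way CRD with simulated data (this is generic stats code, not this package's interface, which bundles the assumption checks and mean comparisons):

    # One-way CRD in base R: ANOVA, assumption checks, mean comparison
    set.seed(1)
    dat <- data.frame(
      treatment = factor(rep(c("A", "B", "C"), each = 8)),
      yield     = rnorm(24, mean = rep(c(10, 12, 11), each = 8))
    )
    fit <- aov(yield ~ treatment, data = dat)
    summary(fit)                            # ANOVA table
    shapiro.test(residuals(fit))            # normality of residuals
    bartlett.test(yield ~ treatment, dat)   # homogeneity of variances
    TukeyHSD(fit)                           # multiple comparison of means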
Due to the limited availability of observed high-resolution precipitation records of adequate length, simulations from stochastic precipitation models are used to generate series for subsequent studies (e.g. Khaliq and Cunnane, 1996, <doi:10.1016/0022-1694(95)02894-3>; Vandenberghe et al., 2011, <doi:10.1029/2009WR008388>). This package contains an R implementation of the original Bartlett-Lewis rectangular pulse model (BLRPM), developed by Rodriguez-Iturbe et al. (1987) <doi:10.1098/rspa.1987.0039>. It contains a function for simulating a precipitation time series based on storms and cells generated by the model with given or estimated model parameters. Additionally, BLRPM parameters can be estimated from a given or simulated precipitation time series. The model simulations can be plotted in a three-layer plot including an overview of the storms and cells generated by the model (which can also be plotted individually), a continuous step function, and a discrete precipitation time series at a chosen aggregation level.
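The rectangular-pulse mechanism is easy to sketch in a few lines of base R: storms arrive as a Poisson process, each storm spawns cells whose rectangular pulses overlap, and the pulses are aggregated to a discrete series. Parameter names follow the BLRPM literature; this is an illustration, not the package's simulation function:

    set.seed(42)
    lambda <- 0.1    # storm arrival rate [1/h]
    gamma  <- 0.05   # storm activity termination rate [1/h]
    beta   <- 0.5    # cell arrival rate within an active storm [1/h]
    eta    <- 2      # cell termination rate [1/h]
    mux    <- 4      # mean cell intensity [mm/h]
    t_end  <- 240    # simulation horizon [h]
    storms <- cumsum(rexp(50, lambda))
    storms <- storms[storms < t_end]
    cells <- do.call(rbind, lapply(storms, function(s) {
      act    <- rexp(1, gamma)                    # storm activity duration
      starts <- s + cumsum(rexp(25, beta))        # candidate cell origins
      starts <- c(s, starts[starts < s + act])    # plus one cell at the origin
      data.frame(start = starts,
                 end   = starts + rexp(length(starts), eta),
                 x     = rexp(length(starts), 1 / mux))
    }))
    # aggregate the overlapping rectangular pulses to hourly depths:
    hourly <- sapply(seq_len(t_end), function(h)
      sum(pmax(0, pmin(cells$end, h) - pmax(cells$start, h - 1)) * cells$x))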
This package contains a mixture of statistical methods, including MCMC methods to analyze normal mixtures. Additionally, model-based clustering methods are implemented to perform classification based on (multivariate) longitudinal (or otherwise correlated) data. The basis for such clustering is a mixture of multivariate generalized linear mixed models. The package is primarily related to the publications Komárek (2009, Comp. Stat. and Data Anal.) <doi:10.1016/j.csda.2009.05.006> and Komárek and Komárková (2014, J. of Stat. Soft.) <doi:10.18637/jss.v059.i12>. It also implements methods published in Komárek and Komárková (2013, Ann. of Appl. Stat.) <doi:10.1214/12-AOAS580>, Hughes, Komárek, Bonnett, Czanner, García-Fiñana (2017, Stat. in Med.) <doi:10.1002/sim.7397>, Jaspers, Komárek, Aerts (2018, Biom. J.) <doi:10.1002/bimj.201600253> and Hughes, Komárek, Czanner, García-Fiñana (2018, Stat. Meth. in Med. Res.) <doi:10.1177/0962280216674496>.
The debar sequence processing pipeline is designed for denoising high-throughput sequencing data for the animal DNA barcode marker cytochrome c oxidase I (COI). The package is designed to detect and correct insertion and deletion errors within sequencer outputs. This is accomplished through comparison of input sequences against a profile hidden Markov model (PHMM) using the Viterbi algorithm (for algorithm details see Durbin et al. 1998, ISBN: 9780521629713). Inserted base pairs are removed, and deleted base pairs are accounted for through the introduction of a placeholder character. Since the PHMM is a probabilistic representation of the COI barcode, corrections are not always perfect. For this reason debar censors base pairs adjacent to reported indel sites, turning them into placeholder characters (by default 7 base pairs in either direction; this feature can be disabled). Testing has shown that this censorship restores the correct sequence length and masks erroneous base pairs the vast majority of the time (>95%).
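As an illustration of the censorship step, a hypothetical helper that masks base pairs on either side of a reported indel position might look like this (debar's actual implementation differs):

    censor_indel <- function(seq, indel_pos, censor = 7, placeholder = "N") {
      chars <- strsplit(seq, "")[[1]]
      lo <- max(1, indel_pos - censor)
      hi <- min(length(chars), indel_pos + censor)
      chars[lo:hi] <- placeholder
      paste(chars, collapse = "")
    }
    censor_indel("ACGTACGTACGTACGTACGT", indel_pos = 10, censor = 3)
    # [1] "ACGTACNNNNNNNCGTACGT"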
This package provides a tool for inferring kinase activity changes from phosphoproteomics data. pKSEA uses kinase-substrate prediction scores to weight observed changes in phosphopeptide abundance and calculate a phosphopeptide-level contribution score, then sums these contribution scores by kinase to obtain a phosphoproteome-level kinase activity change score (KAC score). pKSEA then assesses the significance of changes in predicted substrate abundances for each kinase using permutation testing. This yields a permutation score (the pKSEA significance score) reflecting the likelihood of a similarly high or low KAC score arising from random chance, which can be interpreted in a manner analogous to an empirically calculated p-value. pKSEA contains default databases of kinase-substrate predictions from NetworKIN (NetworKINPred_db) <http://networkin.info> (Horn et al., 2014 <doi:10.1038/nmeth.2968>) and of known kinase-substrate links from PhosphoSitePlus (KSEAdb) <https://www.phosphosite.org/> (Hornbeck et al., 2015 <doi:10.1093/nar/gku1267>).
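The scoring scheme described above can be sketched generically (the column and function names here are hypothetical, not pKSEA's API):

    # df: one row per (phosphopeptide, kinase) pair
    kac_scores <- function(df) {
      df$contribution <- df$pred_score * df$abundance_change
      tapply(df$contribution, df$kinase, sum)          # KAC score per kinase
    }
    perm_score <- function(df, n_perm = 1000) {
      obs <- kac_scores(df)
      null <- replicate(n_perm, {
        df$abundance_change <- sample(df$abundance_change)
        kac_scores(df)[names(obs)]
      })
      rowMeans(abs(null) >= abs(obs))   # chance of a similarly extreme KAC
    }
    set.seed(2)
    toy <- data.frame(kinase = sample(c("AKT1", "CDK1"), 40, replace = TRUE),
                      pred_score = runif(40), abundance_change = rnorm(40))
    perm_score(toy)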
Transfers/imputes statistics among Spanish spatial polygons (census sections or postal code areas) from different moments in time (2001-2023) without the need for spatial files, simply by linking statistics to the ID codes of the spatial units. The data available for the census sections of a partition/division (cartography) in force at one moment in time are transferred to the census sections of another partition/division using the geometric approach (also known as areal weighting or polygon overlay). References: Goerlich (2022) <doi:10.12842/WPIVIE_0322>; Pavía and Cantarino (2017a, b) <doi:10.1111/gean.12112>, <doi:10.1016/j.apgeog.2017.06.021>; Pérez and Pavía (2024a, b) <doi:10.4995/CARMA2024.2024.17796>, <doi:10.38191/iirr-jorr.24.057>. Acknowledgements: The authors wish to thank Consellería de Educación, Universidades y Empleo, Generalitat Valenciana (grant AICO/2021/257), Ministerio de Economía e Innovación (grant PID2021-128228NB-I00) and Fundación Mapfre for supporting this research.
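The geometric (areal weighting) approach itself can be illustrated generically: each source value is split among target units in proportion to the shared area (a conceptual sketch, not this package's functions):

    transfer_areal <- function(overlap, source_values) {
      # overlap: one row per intersection piece (source, target, area)
      w <- overlap$area / ave(overlap$area, overlap$source, FUN = sum)
      tapply(w * source_values[overlap$source], overlap$target, sum)
    }
    ov <- data.frame(source = c("s1", "s1", "s2"),
                     target = c("t1", "t2", "t2"),
                     area   = c(2, 8, 5))
    transfer_areal(ov, c(s1 = 100, s2 = 50))   # t1 = 20, t2 = 130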
This package provides functions for analyzing citizens' bicycle usage patterns and predicting rental amounts under specific conditions. The functions in this package interact with data in the tashudata package, a drat repository, which contains three years of rental/return history for the Tashu public bicycle system, weather records, and bicycle station information. To install this data package, see the instructions at <https://github.com/zeee1/Tashu_Rpackage>. The top10_stations() and top10_paths() functions visualize the ten most used stations and paths. daily_bike_rental() and monthly_bike_rental() show daily and monthly bicycle rental amounts. create_train_dataset() and create_test_dataset() are data-processing functions for prediction: bicycle rental history from 2013 to 2014 is used to create the training dataset, and that of 2015 the test dataset. Users can build a random-forest prediction model with create_train_model() and predict 2015 bicycle rental amounts with predict_bike_rental().
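A hypothetical end-to-end session with the functions named above might look as follows (argument lists are assumptions; consult the package documentation):

    # install the data package first; see the drat instructions above
    library(tashu)
    top10_stations()                  # ten most used stations
    top10_paths()                     # ten most used paths
    daily_bike_rental()
    monthly_bike_rental()
    train <- create_train_dataset()   # 2013-2014 history
    test  <- create_test_dataset()    # 2015 history
    model <- create_train_model(train)
    pred  <- predict_bike_rental(model, test)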
This package provides a toolkit to predict antimicrobial peptides from protein sequences on a genome-wide scale. It incorporates two support vector machine models ("precursor" and "mature") trained on publicly available antimicrobial peptide data using calculated physico-chemical and compositional sequence properties described in Meher et al. (2017) <doi:10.1038/srep42362>. In order to support genome-wide analyses, these models are designed to accept any type of protein as input, and calculation of compositional properties has been optimised for high-throughput use. For best results it is important to select the model that accurately represents your sequence type: for full-length proteins, the default "precursor" model is recommended. The alternative "mature" model is best suited for mature peptide sequences that represent the final antimicrobial peptide sequence after post-translational processing. For details see Fingerhut et al. (2020) <doi:10.1093/bioinformatics/btaa653>. The ampir package is also available via a Shiny-based GUI at <https://ampir.marine-omics.net/>.
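A typical genome-wide scan might look like the sketch below; the function and column names follow the package documentation but should be treated as assumptions:

    library(ampir)
    proteins <- read_faa("proteome.fasta")            # FASTA -> data.frame
    preds <- predict_amps(proteins, model = "precursor")
    head(preds[order(-preds$prob_AMP), ])             # top candidate AMPs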
Miscellaneous functions for (1) data management (e.g., grand-mean and group-mean centering, coding variables and reverse coding items, scale and cluster scores, reading and writing Excel and SPSS files), (2) descriptive statistics (e.g., frequency table, cross tabulation, effect size measures), (3) missing data (e.g., descriptive statistics for missing data, missing data pattern, Little's test of Missing Completely at Random, and auxiliary variable analysis), (4) multilevel data (e.g., multilevel descriptive statistics, within-group and between-group correlation matrix, multilevel confirmatory factor analysis, level-specific fit indices, cross-level measurement equivalence evaluation, multilevel composite reliability, and multilevel R-squared measures), (5) item analysis (e.g., confirmatory factor analysis, coefficient alpha and omega, between-group and longitudinal measurement equivalence evaluation), (6) statistical analysis (e.g., bootstrap confidence intervals, collinearity and residual diagnostics, dominance analysis, between- and within-subject analysis of variance, latent class analysis, t-test, z-test, sample size determination), and (7) functions to interact with Blimp and Mplus.
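As a point of reference for item (1), grand-mean and group-mean centering are easily expressed in base R (a conceptual illustration, not this package's functions):

    set.seed(6)
    d <- data.frame(group = rep(1:3, each = 4), x = rnorm(12))
    d$x_grand <- d$x - mean(d$x)                      # grand-mean centering
    d$x_group <- d$x - ave(d$x, d$group, FUN = mean)  # group-mean centering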
We provide a framework to analyze multiresolution partitions (e.g. country, province, subdistrict) where each individual data point belongs to exactly one partition in each layer (e.g. individual i belongs to subdistrict A, province P, and country Q). We assume that a partition in a higher layer subsumes lower-layer partitions (e.g. a nation at the 1st layer subsumes all provinces at the 2nd layer). Given N individuals, each with a pair of real values (x, y) generated from an independent variable X and a dependent variable Y, and each belonging to one partition per layer, the goal is to find the highest-level partitions in which all individuals share the same linear model Y = f(X), where f is a linear function. The framework deploys the Minimum Description Length (MDL) principle to infer solutions. The publication associated with this package is Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2021) <doi:10.1145/3424670>.
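A toy example makes the idea concrete: if each province follows its own linear model, a single country-level model fits poorly, so the appropriate layer for shared models is the province layer (generic base R, not the package's MDL machinery):

    set.seed(7)
    d <- data.frame(province = rep(c("P1", "P2"), each = 50), x = runif(100))
    d$y <- ifelse(d$province == "P1", 2 * d$x + 1, -d$x + 3) + rnorm(100, sd = 0.1)
    by(d, d$province, function(s) coef(lm(y ~ x, data = s)))  # two clean fits
    coef(lm(y ~ x, data = d))    # a single country-level model fits poorly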
The Sequence Read Archive (SRA) is the largest public repository of sequencing data from the next generation of sequencing platforms, including the Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, and others. However, finding data of interest can be challenging using current tools. SRAdb is an attempt to make access to the metadata associated with submissions, studies, samples, experiments and runs much more feasible. This is accomplished by parsing all the NCBI SRA metadata into a SQLite database that can be stored and queried locally. Full-text search in the package makes querying metadata very flexible and powerful. Fastq and sra files can be downloaded for local alignment. Besides the FTP protocol, SRAdb has functions supporting the fasp protocol (ascp from Aspera Connect) for faster downloading of large data files over long distances. The SQLite database is updated regularly as new data are added to SRA and can be downloaded at will for the most up-to-date metadata.
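A typical metadata query flow is sketched below; the function names follow the package as commonly documented, but treat the exact signatures as assumptions:

    library(SRAdb)
    library(RSQLite)
    sqlfile <- getSRAdbFile()        # download the (large) SQLite database
    con <- dbConnect(SQLite(), sqlfile)
    hits <- getSRA(search_terms = "breast cancer",
                   out_types = c("run", "study"), con)
    getSRAfile(hits$run[1], con, fileType = "fastq")   # fetch one fastq file
    dbDisconnect(con)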
We provide a computationally efficient and robust implementation of the recently proposed C-JAMP (Copula-based Joint Analysis of Multiple Phenotypes) method (Konigorski et al., 2019, submitted). C-JAMP allows estimating and testing the association of one or multiple predictors with multiple outcomes in a joint model, and is implemented here with a focus on large-scale genome-wide association studies with two phenotypes. The use of copula functions allows modeling a wide range of multivariate dependencies between the phenotypes, and previous results support that C-JAMP can increase the power of association studies to identify associated genetic variants in comparison to existing methods (Konigorski, Yilmaz, Pischon, 2016, <DOI:10.1186/s12919-016-0045-6>; Konigorski, Yilmaz, Bull, 2014, <DOI:10.1186/1753-6561-8-S1-S72>). In addition to the C-JAMP functions, functions are available to generate genetic and phenotypic data, to compute the minor allele frequency (MAF) of genetic markers, and to estimate the phenotypic variance explained by genetic markers.
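For example, computing the MAF from a 0/1/2-coded genotype matrix takes only a few lines of base R (a generic illustration, not C-JAMP's function names):

    set.seed(8)
    geno <- matrix(rbinom(200, 2, 0.3), nrow = 50)  # 50 subjects x 4 SNPs
    p    <- colMeans(geno, na.rm = TRUE) / 2        # coded-allele frequency
    maf  <- pmin(p, 1 - p)                          # minor allele frequency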
saseR is a highly performant and fast framework for aberrant expression and splicing analyses. The main functions are:
- BamtoAspliCounts: process BAM files to ASpli counts
- convertASpli: get gene, bin or junction counts from an ASpli SummarizedExperiment
- calculateOffsets: create an offsets assay for aberrant expression or splicing analysis
- saseRfindEncodingDim: estimate the optimal number of latent factors to include when estimating the mean expression
- saseRfit: estimate the parameters of the negative binomial distribution and compute p-values for aberrant expression and splicing
For information on how to use these functions, see the vignette at <https://github.com/statOmics/saseR/blob/main/vignettes/Vignette.Rmd> and the saseR paper: Segers, A. et al. (2023), "Juggling offsets unlocks RNA-seq tools for fast scalable differential usage, aberrant splicing and expression analyses", bioRxiv, <https://doi.org/10.1101/2023.06.29.547014>.
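A hypothetical outline of the workflow implied by these functions is sketched below; all argument names are assumptions, and the vignette linked above shows the authoritative calls:

    library(saseR)
    # my_targets: a hypothetical data.frame describing the BAM files
    counts <- BamtoAspliCounts(targets = my_targets)
    se <- convertASpli(counts, type = "gene")
    se <- calculateOffsets(se, method = "TMM")
    se <- saseRfindEncodingDim(se)    # choose the number of latent factors
    se <- saseRfit(se)                # NB parameter estimates and p-values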
Companion package for Carrion-i-Silvestre & Sansó (2023): "Generalized Extreme Value Approximation to the CUMSUMQ Test for Constant Unconditional Variance in Heavy-Tailed Time Series". It implements the Modified Iterative Cumulative Sum of Squares Algorithm, an extension of the Iterative Cumulative Sum of Squares (ICSS) Algorithm of Inclan and Tiao (1994), which checks for changes in the unconditional variance of a time series while controlling for the tail index of the underlying distribution. The fourth-order moment is estimated non-parametrically to avoid size problems when the innovations are non-Gaussian (see Sansó et al., 2004). Critical values and p-values are generated using a Generalized Extreme Value distribution approach. References: Carrion-i-Silvestre J.L & Sansó A (2023) <https://www.ub.edu/irea/working_papers/2023/202309.pdf>; Inclan C & Tiao G.C (1994) <doi:10.1080/01621459.1994.10476824>; Sansó A, Aragó V & Carrion-i-Silvestre J.L (2004) <https://dspace.uib.es/xmlui/bitstream/handle/11201/152078/524035.pdf>.
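The centered cumulative-sum-of-squares statistic of Inclan and Tiao (1994) that underlies the algorithm is simple to compute directly (an illustration only; the package's modified algorithm additionally controls for the tail index):

    set.seed(1)
    e  <- c(rnorm(200, sd = 1), rnorm(200, sd = 2))  # variance break at t = 201
    Ck <- cumsum(e^2)                                # cumulative sum of squares
    Tt <- length(e)
    Dk <- Ck / Ck[Tt] - seq_len(Tt) / Tt             # centered statistic
    which.max(abs(Dk))                               # candidate break point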
Fits heterogeneous panel data models with interactive effects for linear, logistic, count, probit, and quantile regression, as well as clustering. Based on Ando, T. and Bai, J. (2015) "A simple new test for slope homogeneity in panel data models with interactive effects" <doi:10.1016/j.econlet.2015.09.019>; Ando, T. and Bai, J. (2015) "Asset Pricing with a General Multifactor Structure" <doi:10.1093/jjfinex/nbu026>; Ando, T. and Bai, J. (2016) "Panel data models with grouped factor structure under unknown group membership" <doi:10.1002/jae.2467>; Ando, T. and Bai, J. (2017) "Clustering huge number of financial time series: A panel data approach with high-dimensional predictors and factor structures" <doi:10.1080/01621459.2016.1195743>; Ando, T. and Bai, J. (2020) "Quantile co-movement in financial markets" <doi:10.1080/01621459.2018.1543598>; Ando, T., Bai, J. and Li, K. (2021) "Bayesian and maximum likelihood analysis of large-scale panel choice models with unobserved heterogeneity" <doi:10.1016/j.jeconom.2020.11.013>.
The running statistic of interest is first extracted using a time window that is slid across the time series; in each window, the value of the running statistic is computed. KCP (Kernel Change Point) detection, proposed by Arlot et al. (2012) <arXiv:1202.3878>, is then used to flag change points in the running statistics (Cabrieto et al., 2018 <doi:10.1016/j.ins.2018.03.010>). Change points are located by minimizing a variance criterion based on the pairwise similarities between running statistics, which are computed via the Gaussian kernel. KCP can locate change points for a given number of change points, k. To determine the optimal k, the KCP permutation test is first carried out by comparing the variance of the running statistics extracted from the original data to that of permuted data. If this test is significant, there is sufficient evidence for at least one change point in the data. Model selection is then used to determine the optimal k > 0.
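Extracting a running statistic with a sliding window is straightforward in base R (the package automates this step and the subsequent KCP test):

    run_stat <- function(x, width = 25, stat = mean) {
      n <- length(x) - width + 1
      vapply(seq_len(n), function(i) stat(x[i:(i + width - 1)]), numeric(1))
    }
    set.seed(3)
    x  <- c(rnorm(100, sd = 1), rnorm(100, sd = 3))  # variance change at t = 101
    rv <- run_stat(x, stat = var)                    # running variances
    plot(rv, type = "l")                             # jump near window 100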
Aster models (Geyer, Wagenius, and Shaw, 2007, <doi:10.1093/biomet/asm030>; Shaw, Geyer, Wagenius, Hangelbroek, and Etterson, 2008, <doi:10.1086/588063>; Geyer, Ridley, Latta, Etterson, and Shaw, 2013, <doi:10.1214/13-AOAS653>) are exponential family regression models for life history analysis. They are like generalized linear models except that elements of the response vector can have different families (e.g., some Bernoulli, some Poisson, some zero-truncated Poisson, some normal) and can be dependent, with the dependence indicated by a graphical structure. Discrete time survival analysis, life table analysis, zero-inflated Poisson regression, and generalized linear models that are exponential family (e.g., logistic regression and Poisson regression with log link) are special cases. The main use is for data in which there is survival over discrete time periods and there are additional data about what happens conditional on survival (e.g., number of offspring). Aster uses the exponential family canonical parameterization (the aster transform of the usual parameterization). There are also random effects versions of these models.
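A toy simulation shows the kind of dependent, mixed-family response these models are built for: survival is Bernoulli, and offspring counts are Poisson conditional on survival (a base-R simulation, not the aster package's interface):

    set.seed(9)
    n <- 500
    survive   <- rbinom(n, 1, 0.7)        # Bernoulli node of the graph
    offspring <- survive * rpois(n, 2.5)  # Poisson node, conditional on survival
    table(survive, offspring)             # offspring = 0 whenever survive = 0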
Decomposition of time series into trend, seasonal, and remainder components with methods for detecting and characterizing abrupt changes within the trend and seasonal components. BFAST can be used to analyze different types of satellite image time series and can be applied to other disciplines dealing with seasonal or non-seasonal time series, such as hydrology, climatology, and econometrics. The algorithm can be extended to label detected changes with information on the parameters of the fitted piecewise linear models. The BFAST approach is described in Verbesselt et al. (2010) <doi:10.1016/j.rse.2009.08.014>. BFAST Monitor provides functionality to detect disturbance in near real-time based on BFAST-type models, and is described in Verbesselt et al. (2012) <doi:10.1016/j.rse.2012.02.022>. BFAST Lite is a flexible approach that handles missing data without interpolation and will be described in an upcoming paper. Furthermore, different models can now be used to fit the time series data and detect structural changes (breaks).
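A decomposition run might look like the following sketch; the simulated series and exact arguments are assumptions, so see the package examples for authoritative usage:

    library(bfast)
    # a monthly series with seasonality and a level shift halfway through:
    set.seed(4)
    y <- ts(sin(2 * pi * (1:120) / 12) +
            rep(c(0, 1.5), each = 60) + rnorm(120, sd = 0.2),
            frequency = 12, start = c(2000, 1))
    fit <- bfast(y, h = 0.15, season = "harmonic", max.iter = 1)
    plot(fit)   # trend, seasonal, remainder and detected breaks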
The network autocorrelation model (NAM) can be used for studying the degree of social influence regarding an outcome variable based on one or more known networks. The degree of social influence is quantified via the network autocorrelation parameters. In the case of a single network, the Bayesian methods of Dittrich, Leenders, and Mulder (2017) <DOI:10.1016/j.socnet.2016.09.002> and Dittrich, Leenders, and Mulder (2019) <DOI:10.1177/0049124117729712> are implemented using a normal, flat, or independence Jeffreys prior for the network autocorrelation. In the case of multiple networks, the Bayesian methods of Dittrich, Leenders, and Mulder (2020) <DOI:10.1177/0081175020913899> are implemented using a multivariate normal prior for the network autocorrelation parameters. Flat priors are implemented for estimating the coefficients. For Bayesian testing of equality and order-constrained hypotheses, the default Bayes factor of Gu, Mulder, and Hoijtink (2018) <DOI:10.1111/bmsp.12110> is used with the posterior mean and posterior covariance matrix of the NAM parameters based on flat priors as input.
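The underlying model is y = rho*W*y + X*beta + e, which has the reduced form y = (I - rho*W)^(-1) (X*beta + e); a generic simulation (not this package's estimation functions) is:

    set.seed(11)
    n <- 50
    W <- matrix(rbinom(n^2, 1, 0.1), n, n)
    diag(W) <- 0
    W <- W / pmax(rowSums(W), 1)                     # row-normalized network
    X <- cbind(1, rnorm(n))
    beta <- c(0.5, 1); rho <- 0.3                    # true parameters
    y <- solve(diag(n) - rho * W, X %*% beta + rnorm(n))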
This package provides the function feis() to estimate fixed effects individual slope (FEIS) models. The FEIS model is a more general version of the often-used fixed effects (FE) panel model, as implemented in the package plm by Croissant and Millo (2008) <doi:10.18637/jss.v027.i02>. In FEIS models, data are not only person-demeaned as in conventional FE models, but also detrended by the predicted individual slope of each person or group. Estimation is performed by applying least squares, lm(), to the transformed data. For more details on FEIS models see Bruederl and Ludwig (2015, ISBN:1446252442); Frees (2001) <doi:10.2307/3316008>; Polachek and Kim (1994) <doi:10.1016/0304-4076(94)90075-2>; Ruettenauer and Ludwig (2020) <doi:10.1177/0049124120926211>; Wooldridge (2010, ISBN:0262294354). To test the consistency of conventional FE and random effects estimators against heterogeneous slopes, the package also provides feistest() for an artificial regression test and bsfeistest() for a bootstrapped version of the Hausman test.
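A sketch of the formula interface, in which the variables after "|" define the individual slopes, is given below; the example data set and variable names are taken from the package documentation as remembered and should be treated as assumptions:

    library(feisr)
    data("mwp", package = "feisr")   # example panel of young men's wages
    wages.feis <- feis(lnw ~ marry + enrol | exp + I(exp^2),
                       data = mwp, id = "id")
    summary(wages.feis)
    feistest(wages.feis, type = "all")   # FE vs. FEIS artificial regression test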
The Phylogenetic Ornstein-Uhlenbeck Mixed Model (POUMM) allows the user to estimate the phylogenetic heritability of continuous traits, to test hypotheses of neutral evolution versus stabilizing selection, to quantify the strength of stabilizing selection, to estimate measurement error, and to make predictions about the evolution of a phenotype and phenotypic variation in a population. The package implements combined maximum likelihood and Bayesian inference of the univariate Phylogenetic Ornstein-Uhlenbeck Mixed Model, fast parallel likelihood calculation, maximum likelihood inference of the genotypic values at the tips, functions for summarizing and plotting traces and posterior samples, and functions for simulating univariate continuous trait evolution along a phylogenetic tree. So far, the package has been used for estimating the heritability of quantitative traits in macroevolutionary and epidemiological studies; see, e.g., Bertels et al. (2017) <doi:10.1093/molbev/msx246> and Mitov and Stadler (2018) <doi:10.1093/molbev/msx328>. The algorithm for parallel POUMM likelihood calculation has been published in Mitov and Stadler (2019) <doi:10.1111/2041-210X.13136>.
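A minimal fit might look like the sketch below (simulated inputs; the tip ordering of z is assumed to match the tree, and the exact interface should be checked against the package vignette):

    library(POUMM)
    library(ape)
    set.seed(13)
    tr <- rtree(100)                 # random phylogeny with 100 tips
    z  <- rnorm(100)                 # trait values, ordered as tr$tip.label
    fit <- POUMM(z = z, tree = tr)   # combined ML and Bayesian inference
    summary(fit)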
This package implements a set of routines to perform structured matrix factorization with minimum volume constraints. The NMF procedure decomposes a matrix X into a product C * D. Provided that the matrix C is non-negative and has sufficiently spread columns, volume minimization of the matrix D delivers a solution (C, D) that is correct and unique up to scale and permutation. The package provides implementations of both volume-regularized NMF and "anchor-free" NMF, in which the standard NMF problem is reformulated in the covariance domain. This algorithm was applied in Vladimir B. Seplyarskiy, Ruslan A. Soldatov, et al., "Population sequencing data reveal a compendium of mutational processes in the human germ line", Science, 12 Aug 2021 <doi:10.1126/science.aba7408>. This package interacts with data available through the simulatedNMF package, which is available in a drat repository. To access this data package, see the instructions at <https://github.com/kharchenkolab/vrnmf>. The size of the simulatedNMF package is approximately 8 MB.
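The "volume" that the regularization shrinks can be illustrated conceptually via the determinant of D D' for the factor matrix D (a sketch, not vrnmf's API):

    set.seed(5)
    C <- matrix(runif(40), 10, 4)        # non-negative, spread columns
    D <- matrix(runif(24), 4, 6)
    X <- C %*% D                         # the matrix being factorized
    log(det(D %*% t(D)))                 # log-volume term shrunk by the
                                         # minimum-volume regularization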
This package provides automated downloading, parsing, cleaning, unit conversion and formatting of Global Surface Summary of the Day ('GSOD') weather data from the USA National Centers for Environmental Information ('NCEI'). Units are converted from United States Customary System ('USCS') units to International System of Units ('SI'). Stations may be individually checked for the number of missing days, as defined by the user, and stations with too many missing observations are omitted. Only stations with valid reported latitude and longitude values are permitted in the final data. Additional useful elements, saturation vapour pressure ('es'), actual vapour pressure ('ea') and relative humidity ('RH'), are calculated from the original data using the improved August-Roche-Magnus approximation (Alduchov & Eskridge 1996) and included in the final data set. The resulting metadata include station identification information, country, state, latitude, longitude, elevation, weather observations and associated flags. For information on the GSOD data from NCEI, please see the GSOD readme.txt file available from <https://www1.ncdc.noaa.gov/pub/data/gsod/readme.txt>.
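The humidity calculations follow a closed-form formula; in base R, the improved August-Roche-Magnus approximation of Alduchov & Eskridge (1996) reads:

    # Saturation vapour pressure (hPa) after Alduchov & Eskridge (1996)
    es <- function(temp_c) 6.1094 * exp(17.625 * temp_c / (temp_c + 243.04))
    temp <- 25; dewp <- 18           # air and dew point temperature (deg C)
    e_s <- es(temp)                  # saturation vapour pressure
    e_a <- es(dewp)                  # actual vapour pressure via dew point
    rh  <- 100 * e_a / e_s           # relative humidity (%)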
Implementation of the three-step approach for latent transition cognitive diagnosis models (CDMs) with covariates. This approach can be used to assess changes in attribute mastery status and to evaluate covariate effects on both the initial states and the transition probabilities over time using latent logistic regression. Because stepwise approaches often yield biased estimates, this approach corrects for classification error probabilities (CEPs). The three-step approach for latent transition CDMs with covariates involves the following steps: (1) fitting a CDM to the response data without covariates at each time point separately, (2) assigning examinees to latent states at each time point and computing the associated CEPs, and (3) estimating the latent transition CDM with the known CEPs and computing the regression coefficients. The method was proposed in Liang et al. (2023) <doi:10.3102/10769986231163320> and demonstrated using mental health data in Liang et al. (in press; annotated R code and the data used in this example are available in Mendeley Data) <doi:10.17632/kpjp3gnwbt.1>.