The debar sequence processing pipeline is designed for denoising high-throughput sequencing data for the animal DNA barcode marker cytochrome c oxidase I (COI). The package is designed to detect and correct insertion and deletion errors within sequencer outputs. This is accomplished through comparison of input sequences against a profile hidden Markov model (PHMM) using the Viterbi algorithm (for algorithm details see Durbin et al. 1998, ISBN: 9780521629713). Inserted base pairs are removed and deleted base pairs are accounted for through the introduction of a placeholder character. Since the PHMM is a probabilistic representation of the COI barcode, corrections are not always perfect. For this reason debar censors base pairs adjacent to reported indel sites, turning them into placeholder characters (by default 7 base pairs in either direction; this feature can be disabled). Testing has shown that this censorship restores the correct sequence length and masks erroneous base pairs the vast majority of the time (>95%).
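The censoring step described for debar can be sketched as follows. This is a minimal illustration, not debar's own code: the function name `censor_around_indels`, the placeholder character "N", and the choice to mask the indel site itself are all assumptions made for the example.

```python
def censor_around_indels(seq, indel_sites, width=7, placeholder="N"):
    """Mask every base within `width` positions of each reported indel site.

    seq: corrected sequence (string)
    indel_sites: 0-based positions where an indel was reported (hypothetical input)
    """
    masked = list(seq)
    for site in indel_sites:
        # Clamp the window so it never runs past either end of the sequence.
        start = max(0, site - width)
        end = min(len(seq), site + width + 1)
        for i in range(start, end):
            masked[i] = placeholder
    return "".join(masked)


corrected = "ACGTACGTACGTACGTACGTACGT"
# Mask 3 bases on either side of a reported indel at position 10.
print(censor_around_indels(corrected, indel_sites=[10], width=3))
```

Masking a fixed window trades a few correct bases for a lower chance that a misplaced correction leaves an erroneous base in the output.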
This package provides a tool for inferring kinase activity changes from phosphoproteomics data. pKSEA uses kinase-substrate prediction scores to weight observed changes in phosphopeptide abundance to calculate a phosphopeptide-level contribution score, then sums these contribution scores by kinase to obtain a phosphoproteome-level kinase activity change score (KAC score). pKSEA then assesses the significance of changes in predicted substrate abundances for each kinase using permutation testing. This results in a permutation score (pKSEA significance score) reflecting the likelihood of a similarly high or low KAC arising by chance, which can be interpreted analogously to an empirically calculated p-value. pKSEA contains default databases of kinase-substrate predictions from NetworKIN (NetworKINPred_db) <http://networkin.info> Horn et al. (2014) <doi:10.1038/nmeth.2968> and of known kinase-substrate links from PhosphoSitePlus (KSEAdb) <https://www.phosphosite.org/> Hornbeck et al. (2015) <doi:10.1093/nar/gku1267>.
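The scoring scheme in the pKSEA description can be illustrated with a small sketch. It assumes that the contribution score is the product of prediction score and abundance change, that permutation shuffles the abundance changes across phosphopeptides, and that "similarly high or low" is read as a two-sided comparison. The function names and the toy data are hypothetical and do not reflect the pKSEA API.

```python
import random
from collections import defaultdict


def kac_scores(predictions, changes):
    """Sum score-weighted abundance changes per kinase.

    predictions: list of (kinase, peptide, prediction_score) tuples
    changes: dict mapping peptide -> observed abundance change
    """
    totals = defaultdict(float)
    for kinase, peptide, score in predictions:
        if peptide in changes:
            totals[kinase] += score * changes[peptide]
    return dict(totals)


def permutation_scores(predictions, changes, n_perm=1000, seed=0):
    """Fraction of permutations whose KAC is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = kac_scores(predictions, changes)
    peptides = list(changes)
    values = [changes[p] for p in peptides]
    extreme = {kinase: 0 for kinase in observed}
    for _ in range(n_perm):
        shuffled = values[:]
        rng.shuffle(shuffled)  # reassign abundance changes to peptides at random
        permuted = kac_scores(predictions, dict(zip(peptides, shuffled)))
        for kinase in observed:
            if abs(permuted.get(kinase, 0.0)) >= abs(observed[kinase]):
                extreme[kinase] += 1
    return {kinase: extreme[kinase] / n_perm for kinase in observed}


# Toy data: two kinases, three phosphopeptides.
predictions = [("AKT1", "p1", 0.9), ("AKT1", "p2", 0.5),
               ("CDK1", "p2", 0.8), ("CDK1", "p3", 0.4)]
changes = {"p1": 2.0, "p2": -1.0, "p3": 0.5}
print(kac_scores(predictions, changes))
print(permutation_scores(predictions, changes, n_perm=200))
```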
This package provides functions for analyzing citizens' bicycle usage patterns and predicting rental amounts under specific conditions. The functions in this package interact with data in the tashudata package, which is available from a drat repository. The tashudata package contains rental/return history for the public bicycle system ('Tashu'), 3 years of weather data, and bicycle station information. To install this data package, see the instructions at <https://github.com/zeee1/Tashu_Rpackage>. top10_stations() and top10_paths() plot the 10 most used stations and paths. daily_bike_rental() and monthly_bike_rental() show daily and monthly rental amounts. create_train_dataset() and create_test_dataset() are data-processing functions for prediction: rental history from 2013 to 2014 is used to create the training dataset and rental history from 2015 is used for the test dataset. Users can build a random-forest prediction model with create_train_model() and predict the amount of bicycle rental in 2015 with predict_bike_rental().
The Sequence Read Archive (SRA) is the largest public repository of sequencing data from the next generation of sequencing platforms including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, and others. However, finding data of interest can be challenging using current tools. SRAdb is an attempt to make access to the metadata associated with submission, study, sample, experiment and run much more feasible. This is accomplished by parsing all the NCBI SRA metadata into a SQLite database that can be stored and queried locally. Full-text search in the package makes querying metadata very flexible and powerful. fastq and sra files can be downloaded for local alignment. Besides the ftp protocol, SRAdb has functions supporting the fasp protocol (ascp from Aspera Connect) for faster downloading of large data files over long distances. The SQLite database is updated regularly as new data are added to SRA and can be downloaded at will for the most up-to-date metadata.
This package provides a toolkit to predict antimicrobial peptides from protein sequences on a genome-wide scale. It incorporates two support vector machine models ("precursor" and "mature") trained on publicly available antimicrobial peptide data using calculated physico-chemical and compositional sequence properties described in Meher et al. (2017) <doi:10.1038/srep42362>. In order to support genome-wide analyses, these models are designed to accept any type of protein as input and calculation of compositional properties has been optimised for high-throughput use. For best results it is important to select the model that accurately represents your sequence type: for full length proteins, it is recommended to use the default "precursor" model. The alternative, "mature", model is best suited for mature peptide sequences that represent the final antimicrobial peptide sequence after post-translational processing. For details see Fingerhut et al. (2020) <doi:10.1093/bioinformatics/btaa653>. The ampir package is also available via a Shiny based GUI at <https://ampir.marine-omics.net/>.
We provide a framework for analyzing multiresolution partitions (e.g. country, provinces, subdistricts) where each individual data point belongs to exactly one partition in each layer (e.g. i belongs to subdistrict A, province P, and country Q). We assume that a partition in a higher layer subsumes lower-layer partitions (e.g. a nation at the 1st layer subsumes all provinces at the 2nd layer). We are given N individuals, each with a pair of real values (x,y) generated from an independent variable X and a dependent variable Y. Each individual i belongs to one partition per layer. Our goal is to find the partitions, at the highest possible layer, in which all individuals share the same linear model Y=f(X), where f is a linear function. The framework deploys the Minimum Description Length (MDL) principle to infer solutions. The package is described in Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2021) <doi:10.1145/3424670>.
Miscellaneous functions for (1) data handling (e.g., grand-mean and group-mean centering, coding variables and reverse coding items, scale and cluster scores, reading and writing Excel and SPSS files), (2) descriptive statistics (e.g., frequency table, cross tabulation, effect size measures), (3) missing data (e.g., descriptive statistics for missing data, missing data pattern, Little's test of Missing Completely at Random, and auxiliary variable analysis), (4) multilevel data (e.g., multilevel descriptive statistics, within-group and between-group correlation matrix, multilevel confirmatory factor analysis, level-specific fit indices, cross-level measurement equivalence evaluation, multilevel composite reliability, and multilevel R-squared measures), (5) item analysis (e.g., confirmatory factor analysis, coefficient alpha and omega, between-group and longitudinal measurement equivalence evaluation), (6) statistical analysis (e.g., bootstrap confidence intervals, collinearity and residual diagnostics, dominance analysis, between- and within-subject analysis of variance, latent class analysis, t-test, z-test, sample size determination), and (7) functions to interact with Blimp and Mplus.
saseR is a highly performant and fast framework for aberrant expression and splicing analyses. The main functions are: BamtoAspliCounts (process BAM files to ASpli counts); convertASpli (get gene, bin or junction counts from an ASpli SummarizedExperiment); calculateOffsets (create an offsets assay for aberrant expression or splicing analysis); saseRfindEncodingDim (estimate the optimal number of latent factors to include when estimating the mean expression); and saseRfit (estimate the parameters of the negative binomial distribution and compute p-values for aberrant expression and splicing). For information on how to use these functions, see our vignette at <https://github.com/statOmics/saseR/blob/main/vignettes/Vignette.Rmd> and the saseR paper: Segers, A. et al. (2023). Juggling offsets unlocks RNA-seq tools for fast scalable differential usage, aberrant splicing and expression analyses. bioRxiv. <https://doi.org/10.1101/2023.06.29.547014>.
We provide a computationally efficient and robust implementation of the recently proposed C-JAMP (Copula-based Joint Analysis of Multiple Phenotypes) method (Konigorski et al., 2019, submitted). C-JAMP allows estimating and testing the association of one or multiple predictors with multiple outcomes in a joint model, and is implemented here with a focus on large-scale genome-wide association studies with two phenotypes. The use of copula functions allows modeling a wide range of multivariate dependencies between the phenotypes, and previous results suggest that C-JAMP can increase the power of association studies to identify associated genetic variants in comparison to existing methods (Konigorski, Yilmaz, Pischon, 2016, <DOI:10.1186/s12919-016-0045-6>; Konigorski, Yilmaz, Bull, 2014, <DOI:10.1186/1753-6561-8-S1-S72>). In addition to the C-JAMP functions, functions are available to generate genetic and phenotypic data, to compute the minor allele frequency (MAF) of genetic markers, and to estimate the phenotypic variance explained by genetic markers.
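As a small illustration of the minor allele frequency (MAF) mentioned above, the sketch below computes it for one marker. It assumes genotypes are coded as counts of one allele (0, 1, or 2) with `None` for missing values; the function name `minor_allele_frequency` is hypothetical and is not the package's function.

```python
def minor_allele_frequency(genotypes):
    """MAF for one marker; genotypes are allele counts 0/1/2, None = missing."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("no observed genotypes")
    # Each individual carries two alleles, hence the factor 2.
    freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is rarer.
    return min(freq, 1 - freq)


# Five observed individuals carry 4 copies out of 10 alleles.
print(minor_allele_frequency([0, 1, 2, 0, 1, None]))  # 0.4
```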
Aster models (Geyer, Wagenius, and Shaw, 2007, <doi:10.1093/biomet/asm030>; Shaw, Geyer, Wagenius, Hangelbroek, and Etterson, 2008, <doi:10.1086/588063>; Geyer, Ridley, Latta, Etterson, and Shaw, 2013, <doi:10.1214/13-AOAS653>) are exponential family regression models for life history analysis. They are like generalized linear models except that elements of the response vector can have different families (e.g., some Bernoulli, some Poisson, some zero-truncated Poisson, some normal) and can be dependent, the dependence indicated by a graphical structure. Discrete time survival analysis, life table analysis, zero-inflated Poisson regression, and generalized linear models that are exponential family (e.g., logistic regression and Poisson regression with log link) are special cases. Main use is for data in which there is survival over discrete time periods and there is additional data about what happens conditional on survival (e.g., number of offspring). Uses the exponential family canonical parameterization (aster transform of usual parameterization). There are also random effects versions of these models.
Companion package of Carrion-i-Silvestre & Sansó (2023): "Generalized Extreme Value Approximation to the CUMSUMQ Test for Constant Unconditional Variance in Heavy-Tailed Time Series". It implements the Modified Iterative Cumulative Sum of Squares Algorithm, which is an extension of the Iterative Cumulative Sum of Squares (ICSS) Algorithm of Inclan and Tiao (1994), and it checks for changes in the unconditional variance of a time series while controlling for the tail index of the underlying distribution. The fourth-order moment is estimated non-parametrically to avoid size problems when the innovations are non-Gaussian (see Sansó et al., 2004). Critical values and p-values are generated using a Generalized Extreme Value distribution approach. References: Carrion-i-Silvestre J.L. & Sansó A. (2023) <https://www.ub.edu/irea/working_papers/2023/202309.pdf>; Inclan C. & Tiao G.C. (1994) <doi:10.1080/01621459.1994.10476824>; Sansó A., Aragó V. & Carrion-i-Silvestre J.L. (2004) <https://dspace.uib.es/xmlui/bitstream/handle/11201/152078/524035.pdf>.
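For orientation, the centered cumulative sum of squares that underlies the original ICSS algorithm of Inclan and Tiao (1994) can be sketched as below. This shows only the classic statistic, D_k = C_k/C_T - k/T scaled by sqrt(T/2). It does not include the fourth-moment or tail-index corrections of the modified algorithm, nor any critical values. The function name is hypothetical, and the series is assumed to be mean-zero and not identically zero.

```python
import math


def cusum_of_squares(x):
    """Return (statistic, k_star) for the centered cumulative sum of squares.

    statistic = sqrt(T / 2) * max_k |C_k / C_T - k / T|
    k_star    = the k (1-based) at which the maximum is attained
    """
    T = len(x)
    total = sum(v * v for v in x)  # C_T
    running, best, k_star = 0.0, 0.0, 0
    for k, v in enumerate(x, start=1):
        running += v * v           # C_k
        d_k = running / total - k / T
        if abs(d_k) > best:
            best, k_star = abs(d_k), k
    return math.sqrt(T / 2) * best, k_star


# Variance jumps after observation 50: the maximum is attained at k = 50.
series = [1.0] * 50 + [3.0] * 50
print(cusum_of_squares(series))
```

A large value of the statistic points to a variance change near `k_star`. The iterative part of ICSS then repeats the search on the sub-series either side of that point.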
Fits heterogeneous panel data models with interactive effects for linear, logistic, count, probit, and quantile regression, and for clustering. Based on Ando, T. and Bai, J. (2015) "A simple new test for slope homogeneity in panel data models with interactive effects" <doi:10.1016/j.econlet.2015.09.019>, Ando, T. and Bai, J. (2015) "Asset Pricing with a General Multifactor Structure" <doi:10.1093/jjfinex/nbu026>, Ando, T. and Bai, J. (2016) "Panel data models with grouped factor structure under unknown group membership" <doi:10.1002/jae.2467>, Ando, T. and Bai, J. (2017) "Clustering huge number of financial time series: A panel data approach with high-dimensional predictors and factor structures" <doi:10.1080/01621459.2016.1195743>, Ando, T. and Bai, J. (2020) "Quantile co-movement in financial markets" <doi:10.1080/01621459.2018.1543598>, Ando, T., Bai, J. and Li, K. (2021) "Bayesian and maximum likelihood analysis of large-scale panel choice models with unobserved heterogeneity" <doi:10.1016/j.jeconom.2020.11.013>.
The running statistic of interest is first extracted using a time window that is slid across the time series; in each window, the value of the running statistic is computed. KCP (Kernel Change Point) detection, proposed by Arlot et al. (2012) <arXiv:1202.3878>, is then applied to flag change points in the running statistics (Cabrieto et al., 2018, <doi:10.1016/j.ins.2018.03.010>). Change points are located by minimizing a variance criterion based on the pairwise similarities between running statistics, which are computed with a Gaussian kernel. KCP locates change points for a given number k of change points. To determine the optimal k, the KCP permutation test is first carried out by comparing the variance of the running statistics extracted from the original data to that of permuted data. If this test is significant, there is sufficient evidence for at least one change point in the data. Model selection is then used to determine the optimal k>0.
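The two stages described above, extracting a running statistic and then minimizing a kernel-based variance criterion, can be sketched for the simplest case of a single change point (k = 1). This is an illustrative brute-force version under stated assumptions: the running statistic is a mean, the Gaussian bandwidth is fixed at 1.0, and all function names are hypothetical. It omits the permutation test and model selection, and it is not the package's implementation.

```python
import math


def running_mean(series, window):
    """Mean of each window slid one step at a time across the series."""
    return [sum(series[s:s + window]) / window
            for s in range(len(series) - window + 1)]


def gaussian_kernel(a, b, bandwidth):
    """Similarity in (0, 1]; equals 1 when a == b."""
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))


def within_phase_scatter(stats, start, end, bandwidth):
    """Phase length minus the average pairwise similarity within the phase."""
    n = end - start
    total = sum(gaussian_kernel(stats[i], stats[j], bandwidth)
                for i in range(start, end) for j in range(start, end))
    return n - total / n


def single_change_point(stats, bandwidth=1.0):
    """Window index that minimises the summed scatter of the two phases."""
    n = len(stats)
    best_t, best_value = None, float("inf")
    for t in range(1, n):  # phase 1 = stats[:t], phase 2 = stats[t:]
        value = (within_phase_scatter(stats, 0, t, bandwidth)
                 + within_phase_scatter(stats, t, n, bandwidth)) / n
        if value < best_value:
            best_t, best_value = t, value
    return best_t


# A level shift halfway through the series.
series = [0.0] * 30 + [10.0] * 30
stats = running_mean(series, window=5)
print(single_change_point(stats))
```

The returned index refers to a window position rather than an original time point, so it falls in the short ramp where windows straddle the shift.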
Decomposition of time series into trend, seasonal, and remainder components with methods for detecting and characterizing abrupt changes within the trend and seasonal components. BFAST can be used to analyze different types of satellite image time series and can be applied to other disciplines dealing with seasonal or non-seasonal time series, such as hydrology, climatology, and econometrics. The algorithm can be extended to label detected changes with information on the parameters of the fitted piecewise linear models. BFAST monitoring functionality is described in Verbesselt et al. (2010) <doi:10.1016/j.rse.2009.08.014>. BFAST monitor provides functionality to detect disturbance in near real-time based on BFAST-type models, and is described in Verbesselt et al. (2012) <doi:10.1016/j.rse.2012.02.022>. The BFAST Lite approach is flexible and handles missing data without interpolation; it will be described in an upcoming paper. Furthermore, different models can now be used to fit the time series data and detect structural changes (breaks).
The network autocorrelation model (NAM) can be used for studying the degree of social influence regarding an outcome variable based on one or more known networks. The degree of social influence is quantified via the network autocorrelation parameters. In the case of a single network, the Bayesian methods of Dittrich, Leenders, and Mulder (2017) <DOI:10.1016/j.socnet.2016.09.002> and Dittrich, Leenders, and Mulder (2019) <DOI:10.1177/0049124117729712> are implemented using a normal, flat, or independence Jeffreys prior for the network autocorrelation. In the case of multiple networks, the Bayesian methods of Dittrich, Leenders, and Mulder (2020) <DOI:10.1177/0081175020913899> are implemented using a multivariate normal prior for the network autocorrelation parameters. Flat priors are implemented for estimating the coefficients. For Bayesian testing of equality and order-constrained hypotheses, the default Bayes factor of Gu, Mulder, and Hoijtink (2018) <DOI:10.1111/bmsp.12110> is used with the posterior mean and posterior covariance matrix of the NAM parameters based on flat priors as input.
This package provides the function feis() to estimate fixed effects individual slope (FEIS) models. The FEIS model constitutes a more general version of the often-used fixed effects (FE) panel model, as implemented in the package plm by Croissant and Millo (2008) <doi:10.18637/jss.v027.i02>. In FEIS models, data are not only person demeaned like in conventional FE models, but detrended by the predicted individual slope of each person or group. Estimation is performed by applying least squares lm() to the transformed data. For more details on FEIS models see Bruederl and Ludwig (2015, ISBN:1446252442); Frees (2001) <doi:10.2307/3316008>; Polachek and Kim (1994) <doi:10.1016/0304-4076(94)90075-2>; Ruettenauer and Ludwig (2020) <doi:10.1177/0049124120926211>; Wooldridge (2010, ISBN:0262294354). To test consistency of conventional FE and random effects estimators against heterogeneous slopes, the package also provides the functions feistest() for an artificial regression test and bsfeistest() for a bootstrapped version of the Hausman test.
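The detrending idea behind FEIS can be sketched in a few lines. The example assumes a single regressor x and a single slope variable w (for instance, time) and needs at least three observations per unit. The function names `detrend` and `feis_slope` and the toy panel are hypothetical, and this is not how feis() is implemented internally.

```python
def detrend(values, slope_var):
    """Residuals from regressing values on an intercept and slope_var."""
    n = len(values)
    mean_v = sum(values) / n
    mean_w = sum(slope_var) / n
    sww = sum((w - mean_w) ** 2 for w in slope_var)
    swv = sum((w - mean_w) * (v - mean_v) for w, v in zip(slope_var, values))
    b = swv / sww  # unit-specific slope on slope_var
    return [(v - mean_v) - b * (w - mean_w) for v, w in zip(values, slope_var)]


def feis_slope(panel):
    """Pooled least-squares coefficient on x after per-unit detrending.

    panel maps unit id -> (y, x, w), three equal-length lists.
    """
    num = den = 0.0
    for y, x, w in panel.values():
        y_t, x_t = detrend(y, w), detrend(x, w)
        num += sum(a * b for a, b in zip(x_t, y_t))
        den += sum(a * a for a in x_t)
    return num / den


# Toy panel: y = unit intercept + unit slope * w + 2 * x, without noise.
w = [0.0, 1.0, 2.0, 3.0]
x_a, x_b = [1.0, 4.0, 2.0, 7.0], [3.0, 1.0, 5.0, 2.0]
panel = {
    "A": ([1.0 + 0.5 * t + 2.0 * x for t, x in zip(w, x_a)], x_a, w),
    "B": ([-2.0 + 3.0 * t + 2.0 * x for t, x in zip(w, x_b)], x_b, w),
}
print(feis_slope(panel))  # recovers the common coefficient, approximately 2.0
```

Because each unit's own intercept and slope are removed before pooling, the coefficient on x is unaffected by how strongly each unit trends in w.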
With satin functions, visualisation, data extraction, and further analyses of satellite-derived ocean data, such as producing climatologies from several images and computing anomalies, can be done easily. Reading functions can import a user-defined geographical extent of data stored in netCDF files. Currently supported ocean data sources include NASA's Oceancolor web page <https://oceancolor.gsfc.nasa.gov/>, with the sensors VIIRS-SNPP, MODIS-Terra, MODIS-Aqua, and SeaWiFS. Variables available from this source include chlorophyll concentration, sea surface temperature (SST), and several others. SST-specific data sources that can also be imported include Pathfinder AVHRR <https://www.ncei.noaa.gov/products/avhrr-pathfinder-sst> and GHRSST <https://www.ghrsst.org/>. In addition, ocean productivity data produced by Oregon State University can be handled after conversion from HDF4 to HDF5 format. Many other ocean variables can be processed by importing netCDF data files from two databases of the European Union's Copernicus Marine Service <https://marine.copernicus.eu/>, namely Global Ocean Physical Reanalysis and Global Ocean Biogeochemistry Hindcast.
This package implements a set of routines to perform structured matrix factorization with minimum volume constraints. The NMF procedure decomposes a matrix X into a product C * D. Provided that the matrix C is non-negative and has sufficiently spread columns, volume minimization of the matrix D delivers a correct solution (C, D) that is unique up to scale and permutation. This package provides implementations of both volume-regularized NMF and "anchor-free" NMF, whereby the standard NMF problem is reformulated in the covariance domain. This algorithm was applied in Vladimir B. Seplyarskiy, Ruslan A. Soldatov, et al. "Population sequencing data reveal a compendium of mutational processes in the human germ line". Science, 12 Aug 2021. <doi:10.1126/science.aba7408>. This package interacts with data available through the simulatedNMF package, which is available in a drat repository. To access this data package, see the instructions at <https://github.com/kharchenkolab/vrnmf>. The size of the simulatedNMF package is approximately 8 MB.
This package performs distance sampling simulations. dsims repeatedly generates instances of a user-defined population within a given survey region. It then generates realisations of a survey design and simulates the detection process. The data are then analysed so that the results can be compared for accuracy and precision across all replications. This process allows users to optimise survey designs for their specific set of survey conditions. The effects of uncertainty in population distribution or parameters can be investigated under a number of simulations so that users can be confident that they have achieved a robust survey design before deploying vessels into the field. The distance sampling designs used in this package from dssd are detailed in Chapter 7 of Advanced Distance Sampling, Buckland et al. (2008, ISBN-13: 978-0199225873). General distance sampling methods are detailed in Introduction to Distance Sampling: Estimating Abundance of Biological Populations, Buckland et al. (2004, ISBN-13: 978-0198509271). Find out more about estimating animal/plant abundance with distance sampling at <https://distancesampling.org/>.
This data package contains four datasets of quantitative PCR (qPCR) amplification curves that were used as supplementary data in the research article by Sisti et al. (2010), <doi:10.1186/1471-2105-11-186>. The primary dataset comprises a ten-fold dilution series spanning copy numbers from 3.14 × 10^7 to 3.14 × 10^2, with twelve replicates per concentration. These samples are based on a pGEM-T Promega plasmid containing a 104 bp fragment of the mitochondrial gene NADH dehydrogenase 1 (MT-ND1), amplified using the ND1/ND2 primer pair. The remaining three datasets contain qPCR results in the presence of specific PCR inhibitors: tannic acid, immunoglobulin G (IgG), and quercetin, respectively, to assess their effects on the amplification process. These datasets are useful for researchers interested in PCR kinetics. The original raw data file is available as Additional File 1: <https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-11-186/MediaObjects/12859_2009_3643_MOESM1_ESM.XLS>.
The Mahalanobis-Taguchi (MT) system is a collection of multivariate analysis methods developed for the field of quality engineering. The MT system consists of two families, distinguished by purpose. One is a family of Mahalanobis-Taguchi (MT) methods (in the broad sense) for diagnosis (see Woodall, W. H., Koudelik, R., Tsui, K. L., Kim, S. B., Stoumbos, Z. G., and Carvounis, C. P. (2003) <doi:10.1198/004017002188618626>) and the other is a family of Taguchi (T) methods for forecasting (see Kawada, H., and Nagata, Y. (2015) <doi:10.17929/tqs.1.12>). The MT package contains three basic methods for the family of MT methods and one basic method for the family of T methods. The MT method (in the narrow sense), the Mahalanobis-Taguchi Adjoint (MTA) method, and the Recognition-Taguchi (RT) method belong to the family of MT methods, and the two-sided Taguchi (T1) method belongs to the family of T methods. In addition, the Ta and Tb methods, which are improved versions of the T1 method, are included.
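The diagnostic idea behind the MT method (in the narrow sense) is a Mahalanobis distance measured from a reference "unit space" of normal samples. The sketch below is restricted to two variables so that the matrix inverse can be written out by hand. The division by the sample count n (rather than n - 1) in the covariance, the division of the distance by the number of variables, and the function name are assumptions for illustration, not the MT package's API.

```python
def mt_distance_2d(unit_space, sample):
    """Scaled squared Mahalanobis distance of `sample` from a 2-variable unit space.

    unit_space: list of (x1, x2) pairs taken from the reference (normal) group
    sample: one (x1, x2) pair to diagnose
    """
    n = len(unit_space)
    m1 = sum(p[0] for p in unit_space) / n
    m2 = sum(p[1] for p in unit_space) / n
    # Covariance entries of the unit space (divisor n, an assumption).
    s11 = sum((p[0] - m1) ** 2 for p in unit_space) / n
    s22 = sum((p[1] - m2) ** 2 for p in unit_space) / n
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in unit_space) / n
    det = s11 * s22 - s12 ** 2
    d1, d2 = sample[0] - m1, sample[1] - m2
    # Quadratic form d' S^{-1} d with the 2x2 inverse written out explicitly.
    d_squared = (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return d_squared / 2  # divide by the number of variables


unit_space = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (5.0, 5.0)]
print(mt_distance_2d(unit_space, (3.0, 3.0)))   # centroid of the unit space
print(mt_distance_2d(unit_space, (10.0, 0.0)))  # far from the unit space
```

With these scaling choices the distances of the unit-space samples themselves average to 1, which gives a natural baseline against which a diagnosed sample can be judged.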
This package provides a collection of functions to test and estimate Seemingly Unrelated Regression (usually called SUR) models, with spatial structure, by maximum likelihood and three-stage least squares. The package estimates the most common spatial specifications, that is, SUR with Spatial Lag of X regressors (called SUR-SLX), SUR with Spatial Lag Model (called SUR-SLM), SUR with Spatial Error Model (called SUR-SEM), SUR with Spatial Durbin Model (called SUR-SDM), SUR with Spatial Durbin Error Model (called SUR-SDEM), SUR with Spatial Autoregressive terms and Spatial Autoregressive Disturbances (called SUR-SARAR), SUR-SARAR with Spatial Lag of X regressors (called SUR-GNM) and SUR with Spatially Independent Model (called SUR-SIM). The methodology of these models can be found in the following references: Minguez, R., Lopez, F.A., and Mur, J. (2022) <doi:10.18637/jss.v104.i11>; Mur, J., Lopez, F.A., and Herrera, M. (2010) <doi:10.1080/17421772.2010.516443>; Lopez, F.A., Mur, J., and Angulo, A. (2014) <doi:10.1007/s00168-014-0624-2>.
An implementation for high-dimensional time series analysis methods, including factor model for vector time series proposed by Lam and Yao (2012) <doi:10.1214/12-AOS970> and Chang, Guo and Yao (2015) <doi:10.1016/j.jeconom.2015.03.024>, martingale difference test proposed by Chang, Jiang and Shao (2023) <doi:10.1016/j.jeconom.2022.09.001>, principal component analysis for vector time series proposed by Chang, Guo and Yao (2018) <doi:10.1214/17-AOS1613>, cointegration analysis proposed by Zhang, Robinson and Yao (2019) <doi:10.1080/01621459.2018.1458620>, unit root test proposed by Chang, Cheng and Yao (2022) <doi:10.1093/biomet/asab034>, white noise test proposed by Chang, Yao and Zhou (2017) <doi:10.1093/biomet/asw066>, CP-decomposition for matrix time series proposed by Chang et al. (2023) <doi:10.1093/jrsssb/qkac011> and Chang et al. (2024) <doi:10.48550/arXiv.2410.05634>, and statistical inference for spectral density matrix proposed by Chang et al. (2022) <doi:10.48550/arXiv.2212.13686>.
Standardises and facilitates the use of eleven established stability properties that have been used to assess systems' responses to press or pulse disturbances at different ecological levels (e.g. population, community). There are two sets of functions. The first set corresponds to functions that measure stability at any level of organisation, from individual to community and can be applied to a time series of a system's state variables (e.g., body mass, population abundance, or species diversity). The properties included in this set are: invariability, resistance, extent and rate of recovery, persistence, and overall ecological vulnerability. The second set of functions can be applied to Jacobian matrices. The functions in this set measure the stability of a community at short and long time scales. In the short term, the community's response is measured by maximal amplification, reactivity and initial resilience (i.e. initial rate of return to equilibrium). In the long term, stability can be measured as asymptotic resilience and intrinsic stochastic invariability. Figueiredo et al. (2025) <doi:10.32942/X2M053>.