This package provides a system for extracting news from Chilean media via web scraping. It allows news searches using search phrases and date filters, and returns the results in a structured format, ready for analysis. It also includes functions to clean the extracted data, visualize it, and store it in databases. All of this can be done automatically, facilitating the collection and analysis of relevant information from Chilean media.
Pulls together a collection of datasets from Miguel de Carvalho's research articles, including, for example: - de Carvalho (2012) <doi:10.1016/j.jspi.2011.08.016>; - de Carvalho et al (2012) <doi:10.1080/03610926.2012.709905>; - de Carvalho et al (2012) <doi:10.1016/j.econlet.2011.09.007>; - de Carvalho and Davison (2014) <doi:10.1080/01621459.2013.872651>; - de Carvalho and Rua (2017) <doi:10.1016/j.ijforecast.2015.09.004>; - de Carvalho et al (2023) <doi:10.1002/sta4.560>; - de Carvalho et al (2022) <doi:10.1007/s13253-021-00469-9>; - Palacios et al (2024) <doi:10.1214/24-BA1420>.
The Datasaurus Dozen is a set of datasets with the same summary statistics. They retain the same summary statistics despite having radically different distributions. The datasets serve as a larger and quirkier counterpart to the object lesson typically taught via Anscombe's Quartet (available in the 'datasets' package). Anscombe's Quartet contains four very different distributions with the same summary statistics and as such highlights the value of visualisation in understanding data, over and above summary statistics. As well as being an engaging variant on the Quartet, the data is generated in a novel way. The simulated annealing process used to derive datasets from the original Datasaurus is detailed in "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" <doi:10.1145/3025453.3025912>.
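For illustration, the per-dataset summary statistics can be checked directly; the sketch below assumes the package's main data frame is named datasaurus_dozen with columns dataset, x and y.

library(datasauRus)
# Hedged sketch: assumes a data frame `datasaurus_dozen` with columns
# `dataset`, `x` and `y`; grouped means and standard deviations come out
# (near-)identical even though the scatterplots differ radically.
aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = mean)
aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = sd)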
This package provides a system for detecting concept drift in streaming datasets. It offers a comprehensive suite of statistical methods for monitoring changes in data distributions over time. The package supports several tests, such as the Drift Detection Method (DDM), Early Drift Detection Method (EDDM), Hoeffding Drift Detection Methods (HDDM_A, HDDM_W), Kolmogorov-Smirnov test-based Windowing (KSWIN) and Page-Hinkley (PH) tests. The methods implemented in this package are based on established research and have been demonstrated to be effective in real-time data analysis. For more details on the methods, please refer to the following sources: Kobylińska et al. (2023) <doi:10.48550/arXiv.2308.11446>, S. Kullback & R.A. Leibler (1951) <doi:10.1214/aoms/1177729694>, Gama et al. (2004) <doi:10.1007/978-3-540-28645-5_29>, Baena-García et al. (2006) <https://www.researchgate.net/publication/245999704_Early_Drift_Detection_Method>, Frías-Blanco et al. (2014) <https://ieeexplore.ieee.org/document/6871418>, Raab et al. (2020) <doi:10.1016/j.neucom.2019.11.111>, Page (1954) <doi:10.1093/biomet/41.1-2.100>, Montiel et al. (2018) <https://jmlr.org/papers/volume19/18-251/18-251.pdf>.
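As an illustration of the kind of test listed above, the sketch below implements the classic Page-Hinkley statistic from Page (1954) in plain R; it is a minimal sketch of the underlying idea, not the package's API, and the parameter names delta and lambda are illustrative.

# Minimal Page-Hinkley sketch (not the package API): returns the index of
# the first alarm, i.e. where the cumulative deviation of the stream from
# its running mean exceeds the threshold `lambda`.
page_hinkley <- function(x, delta = 0.005, lambda = 50) {
  running_mean <- cumsum(x) / seq_along(x)
  m <- cumsum(x - running_mean - delta)   # cumulative deviation m_t
  which(m - cummin(m) > lambda)[1]        # first index with PH_t > lambda (NA if none)
}
set.seed(1)
stream <- c(rnorm(500, mean = 0), rnorm(500, mean = 3))  # mean shift halfway through
page_hinkley(stream)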
Creates and refines data nuggets. Data nuggets reduce a large dataset into a small collection of nuggets of data, each containing a center (location), weight (importance), and scale (variability) parameter. Data nugget centers are created by choosing observations in the dataset which are as equally spaced apart as possible. Data nugget weights are created by counting the number of observations closest to a given data nugget center. We then say the data nugget contains these observations, and the data nugget center is recalculated as the mean of these observations. Data nugget scales are created by calculating the trace of the covariance matrix of the observations contained within a data nugget, divided by the dimension of the dataset. Data nuggets are refined by splitting data nuggets whose scales or shapes (defined as the ratio of the two largest eigenvalues of the covariance matrix of the observations contained within the data nugget) are too large. Reference papers: [1] Beavers, T. E., Cheng, G., Duan, Y., Cabrera, J., Lubomirski, M., Amaratunga, D., & Teigler, J. E. (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure. Journal of Computational and Graphical Statistics, 1-21. [2] Cherasia, K. E., Cabrera, J., Fernholz, L. T., & Fernholz, R. (2022). Data Nuggets in Supervised Learning. In Robust and Multivariate Statistical Methods: Festschrift in Honor of David E. Tyler (pp. 429-449). Cham: Springer International Publishing.
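A minimal sketch of the weight and scale computations described above, under the assumption that observations have already been assigned to nugget centers (plain R, not the package's interface):

# Hedged sketch (not the package API): per-nugget weight and scale.
# `X` is an n x p numeric matrix, `assignment` gives a nugget id per row.
nugget_summaries <- function(X, assignment) {
  p <- ncol(X)
  do.call(rbind, lapply(split(seq_len(nrow(X)), assignment), function(idx) {
    obs <- X[idx, , drop = FALSE]
    data.frame(weight = length(idx),              # number of contained observations
               scale  = sum(diag(cov(obs))) / p)  # trace of covariance / dimension
  }))
}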
Profiles datasets (collecting statistics and informative summaries about the data) from data frames and ODBC tables: maximum, minimum, mean, standard deviation, nulls, distinct values, data patterns, and data/format frequencies.
Companion to the book "An Introduction to Clustering with R" by P. Giordani, M.B. Ferraro and F. Martella (Springer, Singapore, 2020). The datasets are used in some case studies throughout the text.
Read, construct and write CDISC (Clinical Data Interchange Standards Consortium) Dataset JSON (JavaScript Object Notation) files, while validating per the Dataset JSON schema file, as described in CDISC (2023) <https://www.cdisc.org/standards/data-exchange/dataset-json>.
Allows you to define rules which can be used to verify a given dataset. The package acts as a thin wrapper around more powerful data packages such as 'dplyr', 'data.table', 'arrow', and 'DBI' ('SQL'), which do the heavy lifting.
Open, read data from and modify Data Packages. Data Packages are an open standard for bundling and describing data sets (<https://datapackage.org>). When data is read from a Data Package, care is taken to convert the data as much as possible to R-appropriate data types. The package can be extended with plugins for additional data types.
This package provides a collection of widely used univariate data sets from various applied domains for applications of distribution theory. The functions allow researchers and practitioners to quickly, easily, and efficiently access and use these data sets. The data come from applied domains including bio-medical science, survival analysis, medicine, reliability analysis, hydrology, actuarial science, operational research, meteorology, extreme values, quality control, engineering, finance, sports and economics. In total, 100 data sets are documented, along with associated references for further details and uses.
This package provides a tool, developed with the Golem framework, which offers an easier way to check cell differences between two data frames. The user provides two data frames for comparison, selects the ID variables identifying each row of the input data, then clicks a button to perform the comparison. Several R package functions are used to describe the data and perform the comparison in the server of the application; the main ones are comparedf() from 'arsenal' and skim() from 'skimr'. For more details, see the documentation of comparedf() in the 'arsenal' package and of skim() in the 'skimr' package.
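Outside the app, the same comparison can be sketched directly with those two functions; a minimal example, where the data frames old_df and new_df and their shared "id" key column are illustrative assumptions:

# Minimal sketch of the underlying calls; `old_df`, `new_df` and the "id"
# key column are illustrative assumptions.
library(arsenal)
library(skimr)
skim(old_df)                                   # describe each input
skim(new_df)
summary(comparedf(old_df, new_df, by = "id"))  # cell-by-cell differences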
Access Datastream content through <https://product.datastream.com/dswsclient/Docs/Default.aspx>, our historical financial database with over 35 million individual instruments or indicators across all major asset classes, including over 19 million active economic indicators. It features 120 years of data across 175 countries, giving you the information you need to interpret market trends, economic cycles, and the impact of world events. Data spans bond indices, bonds, commodities, convertibles, credit default swaps, derivatives, economics, energy, equities, equity indices, ESG, estimates, exchange rates, fixed income, funds, fundamentals, interest rates, and investment trusts. Unique content includes I/B/E/S Estimates, Worldscope Fundamentals, point-in-time data, and Reuters Polls. Alongside the content sits a set of powerful analytical tools for exploring relationships between different asset types, with a library of customizable analytical functions. In-house time series can also be uploaded using the package to commingle with Datastream-maintained datasets, be used with these analytical tools, and be displayed in Datastream's flexible charting facilities in Microsoft Office.
Automated data exploration process for analytic tasks and predictive modeling, so that users can focus on understanding data and extracting insights. The package scans and analyzes each variable and visualizes them with typical graphical techniques. Common data processing methods are also available to treat and format data.
Easy comparison of two tabular data objects in R. Specifically designed to show differences between two sets of data in a useful way that should make it easier to understand the differences, and if necessary, help you work out how to remedy them. Aims to offer a more useful output than all.equal() when your two data sets do not match, but isn't intended to replace all.equal() as a way to test for equality.
Data screening is an important first step of any statistical analysis. dataReporter auto-generates a customizable data report with a thorough summary of the checks and the results that a human can use to identify possible errors. It provides an extendable suite of tests for common potential errors in a dataset. See Petersen AH, Ekstrøm CT (2019). "dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R." _Journal of Statistical Software_, *90*(6), 1-38 <doi:10.18637/jss.v090.i06> for more information.
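In practice the report is produced with a single call; a minimal sketch, assuming the makeDataReport() entry point carried over from dataMaid:

# Minimal sketch, assuming makeDataReport() as the main entry point
# (carried over from dataMaid); screens the built-in airquality data.
library(dataReporter)
makeDataReport(airquality, output = "html", replace = TRUE)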
This package provides a framework to help construct R data packages in a reproducible manner. Potentially time-consuming processing of raw data sets into analysis-ready data sets is done in a reproducible manner and decoupled from the usual R CMD build process, so that data sets can be processed into R objects in the data package and the data package can then be shared, built, and installed by others without the need to repeat computationally costly data processing. The package maintains data provenance by turning the data processing scripts into package vignettes, as well as enforcing documentation and version checking of included data objects. Data packages can be version controlled on 'GitHub', and used to share data for manuscripts, collaboration and reproducible research.
Graphical interface for loading datasets in RStudio from all installed (including unloaded) packages, also includes command line interfaces.
Collection of functions to help retrieve U.S. Geological Survey and U.S. Environmental Protection Agency water quality and hydrology data from web services.
Utilities for handling dates and times, such as selecting particular days of the week or month, formatting timestamps as required by RSS feeds, or converting timestamp representations of other software (such as 'MATLAB' and 'Excel') to R. The package is lightweight (no dependencies, pure R implementations) and relies only on R's standard classes to represent dates and times ('Date' and 'POSIXt'); it aims to provide efficient implementations, through vectorisation and the use of R's native numeric representations of timestamps where possible.
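As an example of the conversions mentioned above, the standard offsets for Excel and MATLAB serial dates can be applied with base R alone; the helper names below are illustrative, not the package's own functions.

# Hedged base R sketch (not the package's functions):
# Excel's 1900 date system counts days from an origin of 1899-12-30;
# MATLAB datenums count days from year 0, so 1970-01-01 is datenum 719529.
excel_to_date  <- function(serial)  as.Date(serial, origin = "1899-12-30")
matlab_to_date <- function(datenum) as.Date(datenum - 719529, origin = "1970-01-01")
excel_to_date(45000)    # "2023-03-15"
matlab_to_date(719529)  # "1970-01-01"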
This package provides a metapackage that brings together a curated collection of R packages containing domain-specific datasets. It includes time series data, educational metrics, crime records, medical datasets, and oncology research data. Designed to provide researchers, analysts, educators, and data scientists with centralized access to structured and well-documented datasets, this metapackage facilitates reproducible research, data exploration, and teaching applications across a wide range of domains. Included packages: 'timeSeriesDataSets': time series data from economics, finance, energy, and healthcare; 'educationR': datasets related to education, learning outcomes, and school metrics; 'crimedatasets': datasets on global and local crime and criminal behavior; 'MedDataSets': datasets related to medicine, public health, treatments, and clinical trials; 'OncoDataSets': datasets focused on cancer research, survival, genetics, and biomarkers.
Set of tools aimed at processing meteorological data, converting hourly recorded data to daily, monthly and annual data.
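The core idea can be sketched in base R by truncating timestamps to the target resolution and aggregating (the package's own functions and aggregation rules may differ):

# Hedged base R sketch (not the package API): hourly -> daily/monthly/annual means.
hourly <- data.frame(
  time  = seq(as.POSIXct("2024-01-01 00:00", tz = "UTC"), by = "hour", length.out = 24 * 90),
  value = rnorm(24 * 90, mean = 15)
)
daily   <- aggregate(value ~ as.Date(time),         data = hourly, FUN = mean)
monthly <- aggregate(value ~ format(time, "%Y-%m"), data = hourly, FUN = mean)
annual  <- aggregate(value ~ format(time, "%Y"),    data = hourly, FUN = mean)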
This package creates a data dictionary from any data frame or tibble in your R environment. You can opt to add variable labels, and you can write the resulting object directly to Excel.
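A minimal sketch of the general idea (plain R, not this package's interface): collect variable names, classes and any label attributes into a small data frame, which can then be exported to Excel.

# Hedged sketch (not the package API): build a simple data dictionary.
make_dictionary <- function(df) {
  data.frame(
    variable = names(df),
    class    = vapply(df, function(x) class(x)[1], character(1)),
    label    = vapply(df, function(x) {
      lbl <- attr(x, "label")
      if (is.null(lbl)) NA_character_ else as.character(lbl)
    }, character(1)),
    row.names = NULL
  )
}
dict <- make_dictionary(mtcars)
# writexl::write_xlsx(dict, "data_dictionary.xlsx")  # optional Excel export (writexl assumed installed)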
Validate datasets by columns and rows using convenient predicates inspired by the 'assertr' package. Generate a good-looking HTML report or print console output to display in the logs of your data processing pipeline.
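The general pattern, independent of this package's specific API, is a set of named column- and row-level predicates applied to the data; a minimal base R sketch:

# Hedged sketch of predicate-style validation in base R (not this package's API).
rules <- list(
  mpg_positive  = function(df) df$mpg > 0,              # column predicate
  cyl_in_set    = function(df) df$cyl %in% c(4, 6, 8),  # column predicate
  complete_rows = function(df) complete.cases(df)       # row predicate
)
failures <- vapply(rules, function(rule) sum(!rule(mtcars)), integer(1))
failures  # failing-row count per rule; all zeros means the dataset passes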