Missing data imputation based on the missForest algorithm (Stekhoven, Daniel J (2012) <doi:10.1093/bioinformatics/btr597>) with adaptations for prediction settings. The function missForest() is used to impute a (training) dataset with missing values and to learn imputation models that can be later used for imputing new observations. The function missForestPredict() is used to impute one or multiple new observations (test set) using the models learned on the training data. For more details see Albu, E., Gao, S., Wynants, L., & Van Calster, B. (2024). missForestPredict--Missing data imputation for prediction settings <doi:10.48550/arXiv.2407.03379>.
When multiple Cox proportional hazard models are fitted to clinical data (month or year and status) and a set of differential expressions of genes, the results (hazard risks, z-scores and p-values) can be used to create gene-expression signatures. Weights are calculated from the survival p-values of genes and are used to calculate expression values of the signature across the selected genes in all patients in a cohort. Single or multiple univariate or multivariate Cox proportional hazard survival analyses of the patients in one cohort can be performed using the gene-expression signature and visualized using our survival plots.
This package provides an extensive and curated collection of datasets related to the digestive system, stomach, intestines, liver, pancreas, and associated diseases. This package includes clinical trials, observational studies, experimental datasets, cohort data, and case series involving gastrointestinal disorders such as gastritis, ulcers, pancreatitis, liver cirrhosis, colon cancer, colorectal conditions, Helicobacter pylori infection, irritable bowel syndrome, intestinal infections, and post-surgical outcomes. The datasets support educational, clinical, and research applications in gastroenterology, public health, epidemiology, and biomedical sciences. Designed for researchers, clinicians, data scientists, students, and educators interested in digestive diseases, the package facilitates reproducible analysis, modeling, and hypothesis testing using real-world and historical data.
Automated flagging of common spatial and temporal errors in biological and paleontological collection data, for use in conservation, ecology and paleontology. Includes automated tests to easily flag (and exclude) records assigned to country or province centroid, the open ocean, the headquarters of the Global Biodiversity Information Facility, urban areas or the location of biodiversity institutions (museums, zoos, botanical gardens, universities). It also identifies per-species outlier coordinates, zero coordinates, identical latitude/longitude and invalid coordinates, and implements an algorithm to identify data sets with a significant proportion of rounded coordinates. Especially suited for large data sets. The reference for the methodology is: Zizka et al. (2019) <doi:10.1111/2041-210X.13152>.
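The simplest of the coordinate tests named above (zero coordinates, identical latitude/longitude, invalid coordinates) can be sketched as follows. This is a minimal illustration in Python, not the package's implementation: the function name `flag_record` and the flag labels are hypothetical, and the package itself may apply buffers or additional rules.

```python
def flag_record(lat, lon):
    """Return the list of (hypothetical) flag names raised by one record."""
    # Invalid: values that are not numbers at all.
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return ["invalid"]
    # Invalid: outside the valid latitude/longitude ranges.
    # NaN fails both comparisons, so it is flagged here as well.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return ["invalid"]
    flags = []
    # Zero coordinates: a common placeholder for missing data.
    if lat == 0 and lon == 0:
        flags.append("zero")
    # Identical latitude and longitude: often a data-entry error.
    if lat == lon:
        flags.append("equal")
    return flags


records = [(48.1, 11.6), (0, 0), (12.5, 12.5), (91, 10)]
flagged = {rec: flag_record(*rec) for rec in records}
```

Records with an empty flag list pass these three checks; the centroid, ocean, institution and outlier tests described above need reference data and are not covered by this sketch.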
Dominance analysis is a method for comparing the relative importance of predictors in multiple regression models: ordinary least squares, generalized linear models, hierarchical linear models, beta regression and dynamic linear models. The main principles and methods of dominance analysis are described in Budescu, D. V. (1993) <doi:10.1037/0033-2909.114.3.542> and Azen, R., & Budescu, D. V. (2003) <doi:10.1037/1082-989X.8.2.129> for ordinary least squares regression. Subsequently, the extensions for multivariate regression, logistic regression and hierarchical linear models were described in Azen, R., & Budescu, D. V. (2006) <doi:10.3102/10769986031002157>, Azen, R., & Traxel, N. (2009) <doi:10.3102/1076998609332754> and Luo, W., & Azen, R. (2013) <doi:10.3102/1076998612458319>, respectively.
Estimation and testing methods for dependently truncated data. Semi-parametric methods are based on Emura et al. (2011) <Stat Sinica 21:349-67>, Emura & Wang (2012) <doi:10.1016/j.jmva.2012.03.012>, and Emura & Murotani (2015) <doi:10.1007/s11749-015-0432-8>. Parametric approaches are based on Emura & Konno (2012) <doi:10.1007/s00362-014-0626-2> and Emura & Pan (2017) <doi:10.1007/s00362-017-0947-z>. A regression approach is based on Emura & Wang (2016) <doi:10.1007/s10463-015-0526-9>. Quasi-independence tests are based on Emura & Wang (2010) <doi:10.1016/j.jmva.2009.07.006>. Right-truncated data for Japanese male centenarians are given by Emura & Murotani (2015) <doi:10.1007/s11749-015-0432-8>.
Targets parameters that solve Ordinary Differential Equations (ODEs) driven by a vector of cumulative hazard functions. The package provides a method for estimating these parameters using an estimator defined by a corresponding Stochastic Differential Equation (SDE) system driven by cumulative hazard estimates. By providing cumulative hazard estimates as input, the package gives estimates of the parameter as output, along with pointwise (co)variances derived from an asymptotic expression. Examples of parameters that can be targeted in this way include the survival function, the restricted mean survival function, cumulative incidence functions, among others; see Ryalen, Stensrud, and Røysland (2018) <doi:10.1093/biomet/asy035>, and further applications in Stensrud, Røysland, and Ryalen (2019) <doi:10.1111/biom.13102> and Ryalen et al. (2021) <doi:10.1093/biostatistics/kxab009>.
This package contains the functions to use the econometric methods in the paper Bryzgalova, Huang, and Julliard (2023) <doi:10.1111/jofi.13197>. In this package, we provide a novel Bayesian framework for analyzing linear asset pricing models: simple, robust, and applicable to high-dimensional problems. For a stand-alone model, we provide functions including BayesianFM() and BayesianSDF() to deliver reliable price of risk estimates for both tradable and nontradable factors. For competing factors and possibly nonnested models, we provide functions including continuous_ss_sdf(), continuous_ss_sdf_v2(), and dirac_ss_sdf_pvalue() to analyze high-dimensional models. If you use this package, please cite the paper. We are thankful to Yunan Ding and Jingtong Zhang for their research assistance. Any errors or omissions are the responsibility of the authors.
This package implements variable screening techniques for ultra-high dimensional regression settings. Techniques for independent (iid) data, varying-coefficient models, and longitudinal data are implemented. The package currently contains three screen functions: screenIID(), screenLD() and screenVCM(), and six methods for simulating datasets: simulateDCSIS(), simulateLD(), simulateMVSIS(), simulateMVSISNY(), simulateSIRS() and simulateVCM(). The package is based on the work of Li-Ping ZHU, Lexin LI, Runze LI, and Li-Xing ZHU (2011) <DOI:10.1198/jasa.2011.tm10563>, Runze LI, Wei ZHONG, & Liping ZHU (2012) <DOI:10.1080/01621459.2012.695654>, Jingyuan LIU, Runze LI, & Rongling WU (2014) <DOI:10.1080/01621459.2013.850086>, Hengjian CUI, Runze LI, & Wei ZHONG (2015) <DOI:10.1080/01621459.2014.920256>, and Wanghuan CHU, Runze LI and Matthew REIMHERR (2016) <DOI:10.1214/16-AOAS912>.
Computing elliptical joint confidence regions at a specified confidence level. It provides the flexibility to estimate either classical or robust confidence regions, which can be visualized in 2D or 3D plots. The classical approach assumes normality and uses the mean and covariance matrix to define the confidence regions. Alternatively, the robustified version employs estimators like minimum covariance determinant (MCD) and M-estimator, making them less sensitive to outliers and departures from normality. Furthermore, the functions allow users to group the dataset based on categorical variables and estimate separate confidence regions for each group. This capability is particularly useful for exploring potential differences or similarities across subgroups within a dataset. Varmuza and Filzmoser (2009, ISBN:978-1-4200-5947-2). Johnson and Wichern (2007, ISBN:0-13-187715-1). Raymaekers and Rousseeuw (2019) <DOI:10.1080/00401706.2019.1677270>.
Automatically returns 36 logistic models including 23 individual models and 13 ensembles of models of logistic data. The package also returns 10 plots, 5 tables, and a summary report. The package automatically builds all 36 models, reports all results, and provides graphics to show how the models performed. This can be used for a wide range of data sets. The package includes medical data (the Pima Indians data set), and information about the performance of LeBron James. The package can be used to analyze many other examples, such as stock market data. The package automatically returns many values for each model, such as True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Positive Predictive Value, Negative Predictive Value, F1 Score, Area Under the Curve. The package also returns 36 Receiver Operating Characteristic (ROC) curves, one for each of the 36 models.
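For readers unfamiliar with the per-model values listed above, the following Python sketch shows how most of them follow from the four cells of a binary confusion matrix. It is an illustration only: the function name `binary_metrics` and the dictionary keys are assumptions, not the package's interface, and Area Under the Curve is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "true_positive_rate": ratio(tp, tp + fn),
        "true_negative_rate": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "f1_score": ratio(2 * tp, 2 * tp + fp + fn),
    }


metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```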
The modified Adult Treatment Panel III (ATP-III) guidelines proposed by the American Heart Association (AHA) and the National Heart, Lung and Blood Institute (NHLBI) are used widely for the clinical diagnosis of metabolic syndrome. The AHA-NHLBI criteria advise using parameters such as waist circumference (WC), systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting plasma glucose (FPG), triglycerides (TG) and high-density lipoprotein cholesterol (HDLC) for diagnosis of metabolic syndrome. Each parameter has to be interpreted based on the proposed cut-offs, making the diagnosis slightly complex and error-prone. This package incorporates the modified ATP-III guidelines to aid the easy and quick diagnosis of metabolic syndrome in busy healthcare settings and for research purposes. The modified ATP-III-AHA-NHLBI criteria for the diagnosis are described by Grundy et al. (2005) <doi:10.1161/CIRCULATIONAHA.105.169404>.
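The cut-off logic described above can be illustrated with a short Python sketch. The thresholds are the commonly cited values from the AHA/NHLBI statement (three or more of five criteria); the function name `metabolic_syndrome`, its argument names and units (cm, mmHg, mg/dL) are assumptions, and the sketch ignores drug treatment and population-specific waist cut-offs, which the full criteria also consider. It is not the package's code and should not be used clinically without checking it against the cited guideline.

```python
def metabolic_syndrome(sex, wc_cm, sbp, dbp, fpg_mgdl, tg_mgdl, hdlc_mgdl):
    """Return (diagnosis, criteria) under the assumed cut-offs."""
    male = sex == "male"
    criteria = {
        # Waist circumference: >= 102 cm (men), >= 88 cm (women).
        "waist": wc_cm >= (102 if male else 88),
        # Blood pressure: systolic >= 130 or diastolic >= 85 mmHg.
        "blood_pressure": sbp >= 130 or dbp >= 85,
        # Fasting plasma glucose: >= 100 mg/dL.
        "glucose": fpg_mgdl >= 100,
        # Triglycerides: >= 150 mg/dL.
        "triglycerides": tg_mgdl >= 150,
        # HDL cholesterol: < 40 mg/dL (men), < 50 mg/dL (women).
        "hdl": hdlc_mgdl < (40 if male else 50),
    }
    # Diagnosis requires any three of the five criteria.
    return sum(criteria.values()) >= 3, criteria


diagnosis, met = metabolic_syndrome("male", 105, 135, 80, 95, 160, 45)
```

Returning the individual criteria alongside the diagnosis makes it easy to see which cut-offs drove the result.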
Load and analyze updated time series worldwide data of reported cases for the Novel Coronavirus Disease (COVID-19) from different sources, including the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) data repository <https://github.com/CSSEGISandData/COVID-19>, "Our World in Data" <https://github.com/owid/> among several others. The datasets reporting the COVID-19 cases are available in two main modalities: as time series sequences, and as aggregated data for the last day with greater spatial resolution. Several analysis, visualization and modelling functions are available in the package that allow the user to compute and visualize the total number of cases, total number of changes and growth rate, globally or for a specific geographical location, while at the same time generating models using these trends; to generate interactive visualizations; and to generate a Susceptible-Infected-Recovered (SIR) model for the disease spread.
Management problems of deterministic and stochastic projects. It obtains the duration of a project and the appropriate slack for each activity in a deterministic context. In addition, it obtains a schedule of activity times (Castro, Gómez & Tejada (2007) <doi:10.1016/j.orl.2007.01.003>). It also allows the management of resources. When the project is done, and the actual duration of each activity is known, it can determine how long the project is delayed and fairly allocate the delay among the activities (Bergantiños, Valencia-Toledo & Vidal-Puga (2018) <doi:10.1016/j.dam.2017.08.012>). In a stochastic context it can estimate the average duration of the project and plot the density of this duration, as well as the density of the early and last times of the chosen activities. As in the deterministic case, it can allocate the delay observed once the project has been carried out.
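The deterministic part (project duration and slack per activity) corresponds to the classical critical path calculation, sketched below in Python. The function name `critical_path` and the four-activity example are hypothetical, and the sketch covers only the forward/backward pass, not the scheduling, resource or delay-allocation methods cited above.

```python
def critical_path(durations, predecessors):
    """Return (project_duration, slack) for an activity-on-node network.

    durations: dict activity -> duration
    predecessors: dict activity -> list of activities that must finish first
    """
    # Topological order (Kahn's algorithm); fails on cyclic precedence.
    remaining = {a: set(predecessors.get(a, ())) for a in durations}
    order = []
    while remaining:
        ready = sorted(a for a, preds in remaining.items() if not preds)
        if not ready:
            raise ValueError("precedence relations contain a cycle")
        for a in ready:
            order.append(a)
            del remaining[a]
        for preds in remaining.values():
            preds.difference_update(ready)

    # Forward pass: earliest start of each activity.
    earliest = {}
    for a in order:
        preds = predecessors.get(a, ())
        earliest[a] = max((earliest[p] + durations[p] for p in preds), default=0)
    project_duration = max(earliest[a] + durations[a] for a in order)

    # Backward pass: latest finish of each activity.
    successors = {a: [b for b in durations if a in predecessors.get(b, ())]
                  for a in durations}
    latest_finish = {}
    for a in reversed(order):
        latest_finish[a] = min(
            (latest_finish[s] - durations[s] for s in successors[a]),
            default=project_duration,
        )

    # Slack: how long an activity can slip without delaying the project.
    slack = {a: latest_finish[a] - durations[a] - earliest[a] for a in order}
    return project_duration, slack


duration, slack = critical_path(
    {"A": 3, "B": 2, "C": 4, "D": 1},
    {"C": ["A"], "D": ["B", "C"]},
)
```

Activities with zero slack (here A, C and D) form the critical path; B can slip by five time units without delaying the project.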
Generally, soil functionality is characterized by its capability to sustain microbial activity, nutritional element supply, structural stability and aid for crop production. Since soil functions can be linked to 80% of ecosystem services, conservation of degraded land should strive to restore not only the capacity of soil to sustain flora but also ecosystem provisions. The primary ecosystem services of soil are carbon sequestration, food or biomass production, provision of microbial habitat, and nutrient recycling. However, the actual magnitude of soil functions provided by agricultural land uses has never been quantified. Nutrient supply capacity (NSC) is a measure of nutrient dynamics in restored land uses. Carbon accumulation proficiency (CAP) is a measure of ecosystem carbon sequestration. Biological activity index (BAI) is the average of responses of all enzyme activities in treated land over control/reference land. The CAP parameter investigates how land uses may affect carbon flows, retention, and sequestration. The CAP provides a signal for C cycles, flows, and the system's relative operational supremacy.
Automated backtesting of multiple portfolios over multiple datasets of stock prices in a rolling-window fashion. Intended for researchers and practitioners to backtest a set of different portfolios, as well as by a course instructor to assess the students in their portfolio design in a fully automated and convenient manner, with results conveniently formatted in tables and plots. Each portfolio design is easily defined as a function that takes as input a window of the stock prices and outputs the portfolio weights. Multiple portfolios can be easily specified as a list of functions or as files in a folder. Multiple datasets can be conveniently extracted randomly from different markets, different time periods, and different subsets of the stock universe. The results can be later assessed and ranked with tables based on a number of performance criteria (e.g., expected return, volatility, Sharpe ratio, drawdown, turnover rate, return on investment, computational time, etc.), as well as plotted in a number of ways with nice barplots and boxplots.
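The core idea, a portfolio defined as a function from a window of prices to weights and evaluated in a rolling-window fashion, can be sketched in a few lines of Python. The names `backtest` and `equal_weight`, the `lookback` parameter and the toy price table are assumptions for illustration; the package's own interface, rebalancing options and performance measures are richer than this.

```python
def equal_weight(window):
    """Example portfolio function: same weight for every asset."""
    n_assets = len(window[0])
    return [1.0 / n_assets] * n_assets


def backtest(prices, portfolio_fn, lookback):
    """Rolling-window backtest.

    prices: list of rows, one per period, each a list of asset prices.
    At period t the weights come from the previous `lookback` rows only,
    and the realized return is measured from t-1 to t (no look-ahead).
    """
    returns = []
    for t in range(lookback, len(prices)):
        weights = portfolio_fn(prices[t - lookback:t])
        period_return = sum(
            w * (prices[t][k] / prices[t - 1][k] - 1.0)
            for k, w in enumerate(weights)
        )
        returns.append(period_return)
    return returns


prices = [[100, 50], [110, 50], [121, 50], [121, 55], [133.1, 55]]
portfolio_returns = backtest(prices, equal_weight, lookback=2)

# Cumulative wealth of 1 unit invested at the start of the backtest.
wealth = 1.0
for r in portfolio_returns:
    wealth *= 1.0 + r
```

Because a design is just a function, comparing several portfolios amounts to calling `backtest` once per function on the same price data.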
This package provides functions to read and write neuroimaging data in various file formats, with a focus on FreeSurfer <http://freesurfer.net/> formats. This includes, but is not limited to, the following file formats: 1) MGH/MGZ format files, which can contain multi-dimensional images or other data. Typically they contain time-series of three-dimensional brain scans acquired by magnetic resonance imaging (MRI). They can also contain vertex-wise measures of surface morphometry data. The MGH format is named after the Massachusetts General Hospital, and the MGZ format is a compressed version of the same format. 2) FreeSurfer morphometry data files in binary curv format. These contain vertex-wise surface measures, i.e., one scalar value for each vertex of a brain surface mesh. These are typically values like the cortical thickness or brain surface area at each vertex. 3) Annotation file format. This contains a brain surface parcellation derived from a cortical atlas. 4) Surface file format. Contains a brain surface mesh, given by a list of vertices and a list of faces.
Landsat satellites collect important data about global forest conditions. Documentation about Landsat's role in forest disturbance estimation is available at the site <https://landsat.gsfc.nasa.gov/>. By constrained quadratic B-splines, this package delivers an optimal shape-restricted trajectory to a time series of Landsat imagery for the purpose of modeling annual forest disturbance dynamics to behave in an ecologically sensible manner assuming one of seven possible "shapes", namely, flat, decreasing, one-jump (decreasing, jump up, decreasing), inverted vee (increasing then decreasing), vee (decreasing then increasing), linear increasing, and double-jump (decreasing, jump up, decreasing, jump up, decreasing). The main routine selects the best shape according to the minimum Bayes information criterion (BIC) or the cone information criterion (CIC), which is defined as the log of the estimated predictive squared error. The package also provides parameters summarizing the temporal pattern including year(s) of inflection, magnitude of change, pre- and post-inflection rates of growth or recovery. In addition, it contains routines for converting a flat map of disturbance agents to time-series disturbance maps and a graphical routine displaying the fitted trajectory of Landsat imagery.
This package provides functions for modeling, comparing, and visualizing photosynthetic light response curves using established mechanistic and empirical models like the rectangular hyperbola Michaelis-Menten based models ((eq1 (Baly (1935) <doi:10.1098/rspb.1935.0026>)) (eq2 (Kaipiainen (2009) <doi:10.1134/S1021443709040025>)) (eq3 (Smith (1936) <doi:10.1073/pnas.22.8.504>))), hyperbolic tangent based models ((eq4 (Jassby & Platt (1976) <doi:10.4319/LO.1976.21.4.0540>)) (eq5 (Abe et al. (2009) <doi:10.1111/j.1444-2906.2008.01619.x>))), the non-rectangular hyperbola model (eq6 (Prioul & Chartier (1977) <doi:10.1093/oxfordjournals.aob.a085354>)), exponential based models ((eq8 (Webb et al. (1974) <doi:10.1007/BF00345747>)), (eq9 (Prado & de Moraes (1997) <doi:10.1007/BF02982542>))), and finally the Ye model (eq11 (Ye (2007) <doi:10.1007/s11099-007-0110-5>)). Each of these nonlinear least squares models is commonly used to express photosynthetic response under changing light conditions and has been well supported in the literature, but distinctions in each mathematical model represent moderately different assumptions about physiology and trait relationships which ultimately produce different calculated functional trait values. These models were all thoughtfully discussed and curated by Lobo et al. (2013) <doi:10.1007/s11099-013-0045-y> to express the importance of selecting an appropriate model for analysis, and methods were established in Davis et al. (in review) to evaluate the impact of analytical choice in phylogenetic analysis of the function-valued traits. Gas exchange data on 28 wild sunflower species from Davis et al. are included as an example data set here.
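To make the first model family concrete, the sketch below evaluates a rectangular hyperbola light response curve in the form commonly written as P_N = (phi * I * Pgmax) / (phi * I + Pgmax) - Rd, and derives the light compensation point from it. The Python function and parameter names are assumptions chosen for illustration, and the package's eq1 may use a different parameterization; no fitting is performed here.

```python
def rectangular_hyperbola(irradiance, phi, pg_max, rd):
    """Net photosynthesis under the rectangular hyperbola model.

    irradiance: incident light (I)
    phi: initial slope of the curve (apparent quantum yield)
    pg_max: asymptotic maximum gross photosynthesis
    rd: dark respiration
    """
    return (phi * irradiance * pg_max) / (phi * irradiance + pg_max) - rd


def light_compensation_point(phi, pg_max, rd):
    """Irradiance at which net photosynthesis is zero (requires pg_max > rd)."""
    return rd * pg_max / (phi * (pg_max - rd))


# Hypothetical parameter values, for illustration only.
phi, pg_max, rd = 0.05, 20.0, 1.0
curve = [(i, rectangular_hyperbola(i, phi, pg_max, rd)) for i in range(0, 2001, 250)]
i_comp = light_compensation_point(phi, pg_max, rd)
```

Derived traits such as the compensation point depend on the functional form, which is one way different models end up producing different trait values from the same data.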
Rust-ported Browserslist.
This package is a Rust library providing a fast non-cryptographic random number generator.
Unicode-aware in-place string reversal.
Rust bindings for the HawkTracer profiling library.
Providing the container for the DockerParallel package.