---
title: "PubChemR Workflow: Retrieval to Modeling Table"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{PubChemR Workflow: Retrieval to Modeling Table}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette demonstrates a reproducible pattern for working with PubChem data:

1. Identify compounds.
2. Retrieve selected properties.
3. Build a model-ready table.
4. Convert to a modeling matrix.
5. Benchmark retrieval at scale against CI thresholds.

```{r setup}
library(PubChemR)
library(dplyr)
```

## 1) Resolve CIDs from names

```{r eval=TRUE}
# Look up PubChem compound IDs (CIDs) by common name
ids <- get_cids(c("aspirin", "ibuprofen", "caffeine"), namespace = "name")
CIDs(ids)
```

## 2) Retrieve a compact property panel

```{r eval=TRUE}
# Request only the properties needed downstream
props <- get_properties(
  properties = c("MolecularWeight", "MolecularFormula", "XLogP", "TPSA"),
  identifier = c("aspirin", "ibuprofen", "caffeine"),
  namespace = "name"
)

# Combine the per-compound results into a single data frame
prop_tbl <- retrieve(props, .combine.all = TRUE, .to.data.frame = TRUE)
prop_tbl
```

## 3) Prepare a modeling-ready table

```{r eval=TRUE}
# Coerce the numeric descriptors explicitly so they are not left as character
model_tbl <- prop_tbl %>%
  mutate(
    MolecularWeight = as.numeric(MolecularWeight),
    XLogP = as.numeric(XLogP),
    TPSA = as.numeric(TPSA)
  )

model_tbl
```

## 4) Convert to model matrix form

```{r eval=TRUE}
# Keep CID as an identifier column and replace missing values with 0
mm <- pc_model_matrix(
  model_tbl,
  id_cols = c("CID"),
  na_fill = 0
)

mm
```

## 5) Benchmark harness for scale and CI gates

```{r eval=FALSE}
# Per-scenario limits, keyed by scenario size
thresholds <- list(
  elapsed_sec = c(`10` = 30, `1000` = 300, `100000` = 3600),
  failed_chunk_ratio = c(`10` = 0, `1000` = 0.01, `100000` = 0.05)
)

# Function under test.
# Note: `ids` is not used here; every call requests CID 2244.
probe <- function(ids) {
  pc_request(
    domain = "compound",
    namespace = "cid",
    identifier = 2244,
    operation = "property/MolecularWeight",
    output = "JSON",
    cache = FALSE
  )
}

bench <- pc_benchmark_harness(
  fn = probe,
  ids = rep(2244, 100000),
  scenario_sizes = c(10, 1000, 100000),
  chunk_sizes = 1000,
  thresholds = thresholds,
  report_path = file.path(tempdir(), "pubchemr-benchmark.md"),
  report_format = "markdown"
)

bench$summary
```

The nightly workflow `live-pubchem-smoke.yml` runs this harness against live PubChem, publishes artifacts, and maintains calibrated threshold recommendations from rolling history.

For high-throughput workflows, combine this pattern with deterministic identifier sets and saved intermediate outputs for reproducibility.
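
One way to combine a deterministic identifier set with saved intermediate outputs is sketched below using base R only. The names `run_chunked` and `fake_fetch`, the `chunk-<index>.rds` file naming, and the `pubchemr-chunks` directory are hypothetical and are not part of PubChemR. `fake_fetch` is a stand-in that makes no network request, so the sketch runs offline.

```{r eval=FALSE}
# Hypothetical helper (not a PubChemR function): process a sorted,
# de-duplicated ID vector in chunks and checkpoint each chunk as an RDS file.
run_chunked <- function(ids, fetch_fn, chunk_size = 2, out_dir = tempdir()) {
  ids <- sort(unique(ids))  # deterministic identifier set
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))

  results <- lapply(names(chunks), function(k) {
    path <- file.path(out_dir, sprintf("chunk-%s.rds", k))
    if (file.exists(path)) {
      return(readRDS(path))  # reuse the saved intermediate output
    }
    res <- fetch_fn(chunks[[k]])
    saveRDS(res, path)       # checkpoint before moving to the next chunk
    res
  })

  do.call(rbind, results)
}

# Stand-in for a real retrieval call; assumed to return one row per ID.
fake_fetch <- function(ids) {
  data.frame(CID = ids, n_digits = nchar(as.character(ids)))
}

out_dir <- file.path(tempdir(), "pubchemr-chunks")
dir.create(out_dir, showWarnings = FALSE)

tbl <- run_chunked(
  c(3672, 2244, 2519, 2244),
  fake_fetch,
  chunk_size = 2,
  out_dir = out_dir
)
tbl
```

In a real run, `fake_fetch` would be replaced by a function that wraps the retrieval step and returns a data frame. Because the files in this sketch are keyed by chunk index only, the output directory should be cleared whenever the identifier set or `chunk_size` changes; otherwise stale chunks would be reused.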