---
title: "PubChemR Workflow: Retrieval to Modeling Table"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{PubChemR Workflow: Retrieval to Modeling Table}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette demonstrates a reproducible pattern for working with PubChem data:

1. Identify compounds.
2. Retrieve selected properties.
3. Build a model-ready table.
4. Convert to a modeling matrix.
5. Benchmark the pipeline at scale.

```{r setup}
library(PubChemR)
library(dplyr)
```

## 1) Resolve CIDs from names

```{r eval=TRUE}
ids <- get_cids(c("aspirin", "ibuprofen", "caffeine"), namespace = "name")
CIDs(ids)
```

## 2) Retrieve a compact property panel

```{r eval=TRUE}
props <- get_properties(
  properties = c("MolecularWeight", "MolecularFormula", "XLogP", "TPSA"),
  identifier = c("aspirin", "ibuprofen", "caffeine"),
  namespace = "name"
)

# Combine the per-compound results into a single data frame.
prop_tbl <- retrieve(props, .combine.all = TRUE, .to.data.frame = TRUE)
prop_tbl
```

## 3) Prepare a modeling-ready table

```{r eval=TRUE}
# Coerce the descriptor columns to numeric so they can enter a model matrix.
model_tbl <- prop_tbl %>%
  mutate(
    MolecularWeight = as.numeric(MolecularWeight),
    XLogP = as.numeric(XLogP),
    TPSA = as.numeric(TPSA)
  )

model_tbl
```
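
`as.numeric()` turns strings it cannot parse into `NA` and only emits a warning, so it can help to count the values lost to coercion before building the matrix. The sketch below uses a small hand-made data frame as a stand-in for `prop_tbl`. The values and the helper name `count_new_na` are illustrative assumptions, not PubChemR output or API.

```r
# Toy stand-in for a property table; values are illustrative, not retrieved from PubChem.
toy_tbl <- data.frame(
  CID = c(1L, 2L, 3L),
  MolecularWeight = c("180.16", "206.28", "194.19"),
  XLogP = c("1.2", NA, "n/a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count values that become NA only because coercion failed
# (values that were already NA are not counted).
count_new_na <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  sum(is.na(converted) & !is.na(x))
}

new_na <- vapply(toy_tbl[c("MolecularWeight", "XLogP")], count_new_na, integer(1))
new_na
```

A non-zero count points at a column whose raw strings deserve a look before they are silently replaced in a later fill step.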

## 4) Convert to model matrix form

```{r eval=TRUE}
# Treat CID as an identifier column rather than a feature; fill missing values with 0.
mm <- pc_model_matrix(
  model_tbl,
  id_cols = c("CID"),
  na_fill = 0
)

mm
```
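
To see the shape of this step without the package, here is a rough base-R analogue: set the identifier aside, keep the numeric columns, zero-fill missing values, and return a numeric matrix. It is a sketch on toy data and not the implementation of `pc_model_matrix()`; `to_numeric_matrix` is a hypothetical helper.

```r
# Toy input; values are illustrative, not retrieved from PubChem.
toy_tbl <- data.frame(
  CID = c(1L, 2L, 3L),
  MolecularWeight = c(180.2, 206.3, 194.2),
  XLogP = c(1.2, NA, -0.1)
)

# Hypothetical helper: numeric feature matrix with identifiers as row names.
to_numeric_matrix <- function(tbl, id_col, na_fill = 0) {
  feature_cols <- setdiff(names(tbl), id_col)
  # Keep only columns that are already numeric.
  keep <- feature_cols[vapply(tbl[feature_cols], is.numeric, logical(1))]
  x <- as.matrix(tbl[keep])
  x[is.na(x)] <- na_fill
  rownames(x) <- as.character(tbl[[id_col]])
  x
}

toy_mm <- to_numeric_matrix(toy_tbl, id_col = "CID")
toy_mm
```

Zero-filling treats a missing XLogP as a measured value of 0, so it is worth checking whether that is acceptable for the downstream model.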

## 5) Benchmark harness for scale and CI gates

```{r eval=FALSE}
thresholds <- list(
  elapsed_sec = c(`10` = 30, `1000` = 300, `100000` = 3600),
  failed_chunk_ratio = c(`10` = 0, `1000` = 0.01, `100000` = 0.05)
)

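# The harness passes each chunk as `ids`. This probe ignores the chunk and
# always requests CID 2244, so every call issues the same small request.
probe <- function(ids) {
  pc_request(
    domain = "compound",
    namespace = "cid",
    identifier = 2244,
    operation = "property/MolecularWeight",
    output = "JSON",
    cache = FALSE
  )
}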

bench <- pc_benchmark_harness(
  fn = probe,
  ids = rep(2244, 100000),
  scenario_sizes = c(10, 1000, 100000),
  chunk_sizes = 1000,
  thresholds = thresholds,
  report_path = file.path(tempdir(), "pubchemr-benchmark.md"),
  report_format = "markdown"
)

bench$summary
```

The nightly workflow `live-pubchem-smoke.yml` runs this harness against live PubChem,
publishes artifacts, and maintains calibrated threshold recommendations from rolling history.
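
A CI gate over thresholds of this shape reduces to a lookup by scenario size followed by a comparison. The sketch below assumes a summary data frame with columns `scenario_size`, `elapsed_sec`, and `failed_chunk_ratio`. That layout, the measured values, and the helper `check_gates` are hypothetical and do not describe the actual structure of `bench$summary`.

```r
thresholds <- list(
  elapsed_sec = c(`10` = 30, `1000` = 300, `100000` = 3600),
  failed_chunk_ratio = c(`10` = 0, `1000` = 0.01, `100000` = 0.05)
)

# Made-up measurements for illustration only.
toy_summary <- data.frame(
  scenario_size = c(10, 1000, 100000),
  elapsed_sec = c(2.5, 410, 1800),
  failed_chunk_ratio = c(0, 0, 0.02)
)

# Hypothetical helper: compare each scenario against its threshold.
check_gates <- function(summary_tbl, thresholds) {
  # as.character() may render 100000 in scientific notation,
  # so format sizes explicitly before using them as lookup keys.
  key <- sprintf("%.0f", summary_tbl$scenario_size)
  data.frame(
    scenario_size = summary_tbl$scenario_size,
    elapsed_ok = summary_tbl$elapsed_sec <= unname(thresholds$elapsed_sec[key]),
    failures_ok = summary_tbl$failed_chunk_ratio <=
      unname(thresholds$failed_chunk_ratio[key])
  )
}

gates <- check_gates(toy_summary, thresholds)
gates

# A CI job could fail when any gate is violated.
all(gates$elapsed_ok & gates$failures_ok)
```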

For high-throughput workflows, combine this pattern with a fixed, deterministic identifier set
and save intermediate outputs to disk so that reruns are reproducible.
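
One way to make this concrete is to normalize the identifier set and cache each intermediate result on disk. The sketch below uses only base R. `cached` is a hypothetical helper, and the stand-in computation takes the place of what would be a PubChemR call in practice.

```r
# Deterministic identifier set: de-duplicate and sort so reruns see the same order.
raw_ids <- c(2519L, 2244L, 3672L, 2244L)
ids <- sort(unique(raw_ids))

# Hypothetical helper: compute once, then reuse the saved result on later runs.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

cache_file <- file.path(tempdir(), "ids-step.rds")
unlink(cache_file)  # start from a clean state for this demonstration

# Stand-in for an expensive retrieval step.
first <- cached(cache_file, function() data.frame(CID = ids))

# The second call reads the saved file, so its compute function never runs.
second <- cached(cache_file, function() stop("should not be recomputed"))
identical(first$CID, second$CID)
```

Keying the cache file name on the step (and, if needed, on a hash of the identifier set) keeps stale results from being reused after the inputs change.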