| Title: | Leakage-Safe Modeling and Auditing for Genomic and Clinical Data |
|---|---|
| Description: | Prevents and detects information leakage in biomedical machine learning. Provides leakage-resistant split policies (subject-grouped, batch-blocked, study leave-out, time-ordered), guarded preprocessing (train-only imputation, normalization, filtering, feature selection), cross-validated fitting with common learners, permutation-gap auditing, batch and fold association tests, and duplicate detection. |
| Authors: | Selcuk Korkmaz [aut, cre] (ORCID: <https://orcid.org/0000-0003-4632-6850>) |
| Maintainer: | Selcuk Korkmaz <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.8 |
| Built: | 2026-05-21 19:34:22 UTC |
| Source: | https://github.com/selcukorkmaz/bioleak |
Consume a 'split_spec' produced by splitGraph and build a
corresponding LeakSplits object via make_split_plan.
The spec supplies the grouping/blocking/ordering assignments; the caller
supplies the observation frame (features + outcome), joined on sample id.
as_leaksplits(spec, data, outcome, sample_id_col = "sample_id", v = 5, ...)
spec |
A |
data |
A data.frame (or SummarizedExperiment) containing at least one
identifier column matching |
outcome |
Name of the outcome column in |
sample_id_col |
Name of the sample-id column in |
v |
Number of CV folds to request from |
... |
Additional arguments forwarded to |
The mapping from spec$constraint_mode to
make_split_plan(mode=) is:
"subject" -> "subject_grouped"
"batch" -> "batch_blocked"
"study" -> "study_loocv"
"time" -> "time_series"
"composite" -> "combined"
Blocking variables declared on the spec (batch_group,
study_group) and ordering (order_rank) are forwarded
automatically when relevant.
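The mode translation listed above can be pictured as a simple lookup. The following is a minimal sketch in base R; `mode_map` and `translate_mode()` are hypothetical names used for illustration only and are not part of the package, whose internal implementation may differ.

```r
# Hypothetical lookup table mirroring the documented mapping from
# spec$constraint_mode to make_split_plan(mode = ).
mode_map <- c(
  subject   = "subject_grouped",
  batch     = "batch_blocked",
  study     = "study_loocv",
  time      = "time_series",
  composite = "combined"
)

# Illustrative helper (not a bioLeak function): translate one constraint
# mode, failing loudly on anything outside the documented set.
translate_mode <- function(constraint_mode) {
  if (!constraint_mode %in% names(mode_map)) {
    stop("Unknown constraint_mode: ", constraint_mode)
  }
  unname(mode_map[[constraint_mode]])
}

translate_mode("study")
```

Failing on unknown modes, rather than returning NA, is a design choice of this sketch; the documentation above does not state how unrecognized modes are handled.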
A LeakSplits object.
Convert LeakSplits to an rsample resample set
as_rsample(x, data = NULL, ...)
x |
LeakSplits object created by [make_split_plan()]. |
data |
Optional data.frame used to populate rsample splits. When NULL, the stored 'coldata' from 'x' is used (if available). |
... |
Additional arguments passed to methods (unused). |
An rsample rset object compatible with tidymodels workflows.
The returned object is a tibble with class rset containing:
splits |
List-column of rsplit objects, each with analysis (training indices) and assessment (test indices).
id |
Character column with fold identifiers (e.g., "Fold1").
id2 |
Character column with repeat identifiers (e.g., "Repeat1") when multiple repeats are present; otherwise absent.
The object also carries attributes for group, batch,
study, time (when available from the original LeakSplits),
and bioLeak_mode indicating the original splitting mode. This allows
the splits to be used with tune::tune_grid(), tune::fit_resamples(),
and other tidymodels functions.
if (requireNamespace("rsample", quietly = TRUE)) {
  df <- data.frame(
    subject = rep(1:10, each = 2),
    outcome = rbinom(20, 1, 0.5),
    x1 = rnorm(20),
    x2 = rnorm(20)
  )
  splits <- make_split_plan(df, outcome = "outcome",
                            mode = "subject_grouped", group = "subject", v = 5)
  rset <- as_rsample(splits, data = df)
}
Returns the batch/study chi-squared association data frame stored in a ['LeakAudit'] object. Columns include the metadata column, repeat, chi-squared statistic, degrees of freedom, p-value, and Cramer's V.
audit_batch_assoc(audit)
## S4 method for signature 'LeakAudit'
audit_batch_assoc(audit)
audit |
A ['LeakAudit'] object returned by [audit_leakage()]. |
Implemented as an S4 generic with a method for ['LeakAudit']; visible via 'methods(class = "LeakAudit")'.
A 'data.frame' with one row per (metadata column, repeat).
[LeakClasses], [audit_leakage()]
Returns the near-duplicate sample pairs data frame stored in a ['LeakAudit'] object. Each row is a unique (row_a, row_b) pair above the configured cosine-similarity threshold that crossed train/test partitions in at least one fold.
audit_duplicates(audit)
## S4 method for signature 'LeakAudit'
audit_duplicates(audit)
audit |
A ['LeakAudit'] object returned by [audit_leakage()]. |
Implemented as an S4 generic with a method for ['LeakAudit']; visible via 'methods(class = "LeakAudit")'.
A 'data.frame' with one row per detected near-duplicate pair.
[LeakClasses], [audit_leakage()]
Returns the auxiliary information list stored in a ['LeakAudit'] object. The list typically contains multivariate-target-scan results, configuration flags, permutation-test diagnostics, and provenance metadata.
audit_info(audit)
## S4 method for signature 'LeakAudit'
audit_info(audit)
audit |
A ['LeakAudit'] object returned by [audit_leakage()]. |
Implemented as an S4 generic with a method for ['LeakAudit']; visible via 'methods(class = "LeakAudit")'.
A named 'list'.
[LeakClasses], [audit_leakage()]
Computes a post-hoc leakage audit for a resampled model fit. The audit (1) runs a permutation-gap test comparing observed cross-validated performance to a label-permutation null (by default refitting when data are available; otherwise using fixed predictions), (2) tests whether fold assignments are associated with batch or study metadata (confounding by design), (3) scans features for unusually strong outcome proxies, and (4) flags duplicate or near-duplicate samples in a reference feature matrix.
The returned [LeakAudit] summarizes these diagnostics. It relies on the stored predictions, splits, and optional metadata; it does not refit models unless 'perm_refit = TRUE' (or 'perm_refit = "auto"' with a valid 'perm_refit_spec'). Results are conditional on the chosen metric and supplied metadata/features and should be interpreted as diagnostics, not proof of leakage or its absence.
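The refit rule stated above (refit when 'perm_refit = TRUE', or when 'perm_refit = "auto"' with a usable 'perm_refit_spec' and a small enough 'B') can be restated as a small decision function. This is a sketch only: `resolve_perm_refit()` is a hypothetical helper, not a package function, and it does not validate the contents of the spec.

```r
# Hypothetical helper restating the documented rule for when permutations
# refit the model. It only checks that a spec was supplied; the real
# function may apply further validity checks.
resolve_perm_refit <- function(perm_refit = "auto",
                               perm_refit_spec = NULL,
                               B = 200,
                               perm_refit_auto_max = 200) {
  if (identical(perm_refit, "auto")) {
    # "auto": refit only with a spec and B <= perm_refit_auto_max
    return(!is.null(perm_refit_spec) && B <= perm_refit_auto_max)
  }
  isTRUE(perm_refit)
}

resolve_perm_refit()                                        # no spec supplied
resolve_perm_refit(perm_refit_spec = list(x = 1), B = 100)  # spec, small B
resolve_perm_refit(perm_refit_spec = list(x = 1), B = 500)  # spec, large B
```

When the function returns FALSE, the audit corresponds to the fixed-prediction permutations described above, which test prediction-label association rather than a full refit null.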
audit_leakage(
  fit,
  metric = c("auc", "pr_auc", "accuracy", "macro_f1", "log_loss", "rmse", "cindex"),
  B = 200,
  perm_stratify = FALSE,
  perm_refit = "auto",
  perm_refit_auto_max = 200,
  perm_refit_spec = NULL,
  perm_mode = NULL,
  time_block = c("circular", "stationary"),
  block_len = NULL,
  include_z = TRUE,
  ci_method = c("if", "bootstrap"),
  boot_B = 400,
  parallel = FALSE,
  seed = 1,
  return_perm = TRUE,
  batch_cols = NULL,
  coldata = NULL,
  X_ref = NULL,
  target_scan = TRUE,
  target_scan_multivariate = TRUE,
  target_scan_multivariate_B = 100,
  target_scan_multivariate_components = 10,
  target_scan_multivariate_interactions = TRUE,
  target_threshold = 0.9,
  target_p_adjust = c("none", "BH", "BY", "holm", "bonferroni"),
  target_alpha = 0.05,
  feature_space = c("raw", "rank"),
  sim_method = c("cosine", "pearson"),
  sim_threshold = 0.995,
  nn_k = 50,
  max_pairs = 5000,
  duplicate_scope = c("train_test", "all"),
  learner = NULL,
  strict_align = FALSE
)
fit |
A [LeakFit] object from [fit_resample()] containing cross-validated predictions and split metadata. If predictions include learner IDs for multiple models, you must supply 'learner' to select one; if learner IDs are absent, the audit uses all predictions and may mix learners. |
metric |
Character scalar. One of '"auc"', '"pr_auc"', '"accuracy"', '"macro_f1"', '"log_loss"', '"rmse"', or '"cindex"'. Defaults to '"auc"'. This controls the observed performance statistic, the permutation null, and the sign of the reported gap. |
B |
Integer scalar. Number of permutations used to build the null distribution (default 200). Larger values reduce Monte Carlo error but increase runtime. |
perm_stratify |
Logical scalar or '"auto"'. If TRUE, permutations are stratified within each fold (factor levels; numeric outcomes are binned into quantiles when enough non-missing values are available). If FALSE, no stratification is used. Defaults to FALSE. Stratification only applies when 'coldata' supplies the outcome; otherwise labels are shuffled within each fold. |
perm_refit |
Logical scalar or '"auto"'. If FALSE, permutations keep predictions fixed and shuffle labels (association test). If TRUE, each permutation refits the model on permuted outcomes using 'perm_refit_spec'. Refit-based permutations are slower but better approximate a full null distribution. The default is '"auto"', which refits only when 'perm_refit_spec' is provided and 'B' is less than or equal to 'perm_refit_auto_max'; otherwise it falls back to fixed-prediction permutations. |
perm_refit_auto_max |
Integer scalar. Maximum 'B' allowed for 'perm_refit = "auto"' to trigger refitting. Defaults to 200. |
perm_refit_spec |
List of inputs used when 'perm_refit = TRUE'. Required elements: 'x' (data used for fitting) and 'learner' (parsnip model_spec, workflow, or legacy learner). Optional elements: 'outcome' (defaults to 'fit@outcome'), 'preprocess', 'learner_args', 'custom_learners', 'class_weights', 'positive_class', and 'parallel'. Survival outcomes are not supported for refit-based permutations. |
perm_mode |
Optional character scalar to override the permutation mode used for restricted shuffles. One of '"subject_grouped"', '"batch_blocked"', '"study_loocv"', or '"time_series"'. Defaults to the split metadata when available (including rsample-derived modes). |
time_block |
Character scalar, '"circular"' or '"stationary"'. Controls block permutation for 'time_series' splits; ignored for other split modes. Default is '"circular"'. |
block_len |
Integer scalar or NULL. Block length for time-series permutations. NULL selects 'max(5, floor(0.1 * fold_size))'. Larger values preserve more temporal structure and yield a more conservative null. |
include_z |
Logical scalar. If TRUE (default), include the z-score for the permutation gap when a standard error is available; if FALSE, 'z' is NA. |
ci_method |
Character scalar, '"if"' or '"bootstrap"'. Controls how the standard error and confidence interval for the permutation gap are estimated. Default is '"if"'. '"if"' uses an influence-function estimate when available; '"bootstrap"' resamples permutation values 'boot_B' times. Failed estimates yield NA. |
boot_B |
Integer scalar. Number of bootstrap resamples when 'ci_method = "bootstrap"' (default 400). Larger values are more stable but slower. |
parallel |
Logical scalar. If TRUE and 'future.apply' is available, permutations run in parallel. Results should match sequential execution. Default is FALSE. |
seed |
Integer scalar. Random seed used for permutations and bootstrap resampling; changing it changes the randomization but not the observed metric. Default is 1. |
return_perm |
Logical scalar. If TRUE (default), stores the permutation distribution in 'audit@perm_values'. Set FALSE to reduce memory use. |
batch_cols |
Character vector. Names of 'coldata' columns to test for association with fold assignment. If NULL, defaults to any of '"batch"', '"plate"', '"center"', '"site"', '"study"' found in 'coldata'. Changing this controls which batch tests appear in 'batch_assoc'. |
coldata |
Optional data.frame of sample-level metadata. Rows must align to prediction ids via row names, a 'row_id' column, or row order. Used to build restricted permutations (when the outcome column is present), compute batch associations, and supply outcomes for target scans. If NULL, uses 'fit@splits@info$coldata' when available. If alignment fails, restricted permutations are disabled with a warning. |
X_ref |
Optional numeric matrix/data.frame (samples x features). Used for duplicate detection and the target leakage scan. If NULL, uses 'fit@info$X_ref' when available. Rows must align to sample ids (split order) via row names, a 'row_id' column, or row order; misalignment disables these checks. |
target_scan |
Logical scalar. If TRUE (default), computes per-feature outcome associations on 'X_ref' and flags proxy features; if FALSE, or if 'X_ref'/outcomes are unavailable, 'target_assoc' is empty. Not available for survival outcomes. |
target_scan_multivariate |
Logical scalar. If TRUE (default), fits a simple multivariate/interaction model on 'X_ref' using the stored splits and reports a permutation-based score/p-value. This is slower and only implemented for binomial and gaussian tasks. |
target_scan_multivariate_B |
Integer scalar. Number of permutations for the multivariate scan (default 100). Larger values stabilize the p-value. |
target_scan_multivariate_components |
Integer scalar. Maximum number of principal components used in the multivariate scan (default 10). |
target_scan_multivariate_interactions |
Logical scalar. If TRUE (default), adds pairwise interactions among the top components in the multivariate scan. |
target_threshold |
Numeric scalar in (0,1). Threshold applied to the association score used to flag proxy features. Higher values are stricter. Default is 0.9. |
target_p_adjust |
Character scalar. Multiple-testing correction method applied to finite 'target_assoc$p_value' values. One of '"none"' (default), '"BH"', '"BY"', '"holm"', or '"bonferroni"'. Adds columns 'p_value_adj' and 'flag_fdr' to 'target_assoc'. |
target_alpha |
Numeric scalar in (0,1). Significance level used for 'flag_fdr' when 'target_p_adjust != "none"'. Default is 0.05. |
feature_space |
Character scalar, '"raw"' or '"rank"'. If '"rank"', each row of 'X_ref' is rank-transformed before similarity calculations. This affects duplicate detection only. Default is '"raw"'. |
sim_method |
Character scalar, '"cosine"' or '"pearson"'. Similarity metric for duplicate detection. '"pearson"' row-centers before cosine. Default is '"cosine"'. |
sim_threshold |
Numeric scalar in (0,1). Similarity cutoff for reporting duplicate pairs (default 0.995). Higher values yield fewer pairs. |
nn_k |
Integer scalar. For large datasets ('n > 3000') with 'RANN' installed, checks only the nearest 'nn_k' neighbors per row. Larger values increase sensitivity but slow the search. Ignored when full comparisons are used. Default is 50. |
max_pairs |
Integer scalar. Maximum number of duplicate pairs returned. If more pairs are found, only the most similar are kept. This does not affect permutation results. Default is 5000. |
duplicate_scope |
Character scalar. One of '"train_test"' (default) or '"all"'. '"train_test"' retains only near-duplicate pairs that appear in train vs test in at least one repeat; '"all"' reports all near-duplicate pairs in 'X_ref' regardless of fold assignment. |
learner |
Optional character scalar. When predictions include multiple learner IDs, selects the learner to audit. If NULL and multiple learners are present, the function errors; if predictions lack learner IDs, this argument is ignored with a warning. Default is NULL. |
strict_align |
Logical scalar. If TRUE, errors instead of warning when
|
The 'permutation_gap' slot reports 'metric_obs', 'perm_mean', 'perm_sd', 'gap', 'z', 'p_value', and 'n_perm'. The gap is defined as 'metric_obs - perm_mean' for metrics where higher is better (AUC, PR-AUC, accuracy, macro-F1, C-index) and 'perm_mean - metric_obs' for RMSE/log-loss. By default, 'perm_refit = "auto"' refits models when refit data are available and 'B' is not too large; otherwise it keeps predictions fixed and shuffles labels. Fixed-prediction permutations quantify prediction-label association rather than a full refit null. Set 'perm_refit = FALSE' to force fixed predictions, or 'perm_refit = TRUE' (with 'perm_refit_spec') to always refit.
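To make the gap definition above concrete, the following base R sketch computes a fixed-prediction permutation gap for AUC on simulated data. Several choices here are assumptions for illustration, not the package's method: labels are shuffled without any fold or group restriction, the p-value uses the common (1 + count) / (B + 1) estimator, and the `auc()` helper is defined locally.

```r
set.seed(1)

# Simulated binary outcome with informative held-out predictions.
n <- 200
y <- rbinom(n, 1, 0.5)
score <- y + rnorm(n)

# Rank-based AUC (Mann-Whitney formulation); local helper, not from bioLeak.
auc <- function(y, s) {
  r <- rank(s)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

B <- 200
metric_obs <- auc(y, score)

# Fixed-prediction permutations: keep the scores, shuffle the labels.
perm_values <- replicate(B, auc(sample(y), score))
perm_mean <- mean(perm_values)

# Higher is better for AUC, so gap = metric_obs - perm_mean.
# For RMSE or log-loss the sign flips: gap = perm_mean - metric_obs.
gap <- metric_obs - perm_mean

# Assumed one-sided permutation p-value with the +1 correction.
p_value <- (1 + sum(perm_values >= metric_obs)) / (B + 1)

data.frame(metric_obs, perm_mean, perm_sd = sd(perm_values),
           gap, p_value, n_perm = B)
```

Because the predictions are held fixed, this null only measures how strongly the predictions track the labels; a refit-based null would also repeat model fitting on each permuted outcome.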
'batch_assoc' contains chi-square tests between fold assignment and each 'batch_cols' variable ('stat', 'df', 'pval', 'cramer_v'). 'target_assoc' reports feature-wise outcome associations on 'X_ref'; numeric features use AUC (binomial), 'eta_sq' (multiclass), or correlation (gaussian), while categorical features use Cramer's V (binomial/multiclass) or 'eta_sq' from a one-way ANOVA (gaussian). The 'score' column is the scaled effect size used for heuristic flagging ('flag = score >= target_threshold'). When 'target_p_adjust != "none"', finite 'p_value' entries also receive multiplicity-adjusted 'p_value_adj' and 'flag_fdr = (p_value_adj <= target_alpha)'. The univariate target leakage scan can miss multivariate proxies, interaction leakage, or features not included in 'X_ref'. The multivariate scan (enabled by default for supported tasks) adds a model-based proxy check but still only covers features present in 'X_ref'.
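The fold-versus-batch association test described above can be illustrated in base R. This sketch computes a chi-square statistic and Cramer's V for two toy designs; `cramers_v()` is a local helper written for this example, and the package's own computation may differ in details such as continuity correction.

```r
# Local helper: chi-square test plus Cramer's V for two categorical vectors.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi <- suppressWarnings(stats::chisq.test(tab, correct = FALSE))
  k <- min(nrow(tab), ncol(tab))
  data.frame(
    stat = unname(chi$statistic),
    df = unname(chi$parameter),
    pval = chi$p.value,
    cramer_v = sqrt(unname(chi$statistic) / (sum(tab) * (k - 1)))
  )
}

# Three folds of 20 samples each.
fold <- rep(1:3, each = 20)

# Confounded design: each fold holds exactly one batch.
batch_confounded <- rep(c("A", "B", "C"), each = 20)

# Near-balanced design: batches are interleaved across folds.
batch_balanced <- rep(c("A", "B", "C"), times = 20)

cramers_v(fold, batch_confounded)
cramers_v(fold, batch_balanced)
```

In the confounded design Cramer's V reaches its maximum of 1, the pattern the audit reports as potential confounding by design; the interleaved design yields a value close to 0.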
Duplicate detection compares rows of 'X_ref' using the chosen 'sim_method' (cosine on L2-normalized rows, or Pearson via row-centering), optionally after rank transformation ('feature_space = "rank"'). By default, 'duplicate_scope = "train_test"' filters to pairs that appear in train vs test in at least one repeat; set 'duplicate_scope = "all"' to include within-fold duplicates. The 'duplicates' slot returns index pairs and similarity values for near-duplicate samples. Only duplicates present in 'X_ref' can be detected, and checks are skipped if inputs cannot be aligned to splits.
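The row-similarity comparison described above can be sketched in a few lines of base R. This is an illustrative full pairwise comparison on a tiny matrix; `cosine_pairs()` and the `is_test` fold indicator are hypothetical, and the sketch omits the nearest-neighbor shortcut and the `max_pairs` cap.

```r
set.seed(1)

# Six samples, twenty features; row 5 is a rescaled near-copy of row 2.
X <- matrix(rnorm(6 * 20), nrow = 6)
X[5, ] <- X[2, ] * 3 + rnorm(20, sd = 1e-3)

# Local helper: cosine similarity on L2-normalized rows. With center = TRUE
# rows are centered first, which gives Pearson correlation between rows.
cosine_pairs <- function(X, sim_threshold = 0.995, center = FALSE) {
  if (center) X <- X - rowMeans(X)
  Xn <- X / sqrt(rowSums(X^2))
  S <- tcrossprod(Xn)
  # Keep each unordered pair once (upper triangle only).
  idx <- which(S >= sim_threshold & upper.tri(S), arr.ind = TRUE)
  data.frame(i = idx[, "row"], j = idx[, "col"], sim = S[idx])
}

dups <- cosine_pairs(X)

# Hypothetical single fold in which sample 2 is held out for testing.
is_test <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
dups$cross_fold <- is_test[dups$i] != is_test[dups$j]
dups
```

Cosine similarity ignores row scale, so the rescaled copy is still detected; a pair is marked as crossing train and test only when its two members fall on opposite sides of the split.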
A LeakAudit S4 object containing:
fit |
The LeakFit object that was audited.
permutation_gap |
One-row data.frame from the permutation-gap test with columns: metric_obs (observed cross-validated metric), perm_mean (mean of permuted metrics), perm_sd (standard deviation), gap (observed minus permuted mean, or vice versa for loss metrics), z (standardized gap), p_value (permutation p-value), and n_perm (number of permutations). A large positive gap and small p-value suggest the model captures signal beyond random label assignment.
perm_values |
Numeric vector of length B containing the metric value from each permutation. Useful for plotting the null distribution. Empty if return_perm = FALSE.
batch_assoc |
Data.frame of chi-square association tests between fold assignment and batch/study metadata, with columns: variable, stat (chi-square statistic), df (degrees of freedom), pval, and cramer_v (effect size). Small p-values indicate potential confounding by design.
target_assoc |
Data.frame of per-feature outcome associations with columns: feature, type ("numeric" or "categorical"), metric (AUC, correlation, eta_sq, or Cramer's V depending on task), value, score (scaled effect size), p_value, n, and flag (TRUE if score >= target_threshold). Flagged features may indicate target leakage.
duplicates |
Data.frame of near-duplicate sample pairs with columns: i, j (row indices in X_ref), sim (similarity value), and cross_fold (whether the pair spans train vs test). Duplicates across folds can inflate performance.
trail |
List capturing audit parameters and intermediate results for reproducibility, including metric, B, seed, perm_stratify, perm_refit, and timing info.
info |
List with additional metadata including multivariate scan results when target_scan_multivariate = TRUE.
Use summary() to print a human-readable report. For
programmatic access to slot contents, the recommended interface
is the set of S4 accessor methods registered for LeakAudit:
audit_perm_gap(audit) – permutation-gap
test data frame (the permutation_gap slot).
audit_batch_assoc(audit) – batch / study
chi-squared association data frame (batch_assoc).
audit_target_assoc(audit) – per-predictor
target-association scan results (target_assoc).
audit_duplicates(audit) – near-duplicate
sample-pair data frame (duplicates).
audit_info(audit) – auxiliary information
list including the multivariate target-scan results
(info).
The remaining slots (fit, perm_values,
trail) do not have dedicated accessors; the full list of
accessor methods for the class is available through
methods(class = "LeakAudit").
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = rbinom(12, 1, 0.5),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject",
                          v = 3, progress = FALSE)
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)
audit <- audit_leakage(fit, metric = "auc", B = 10,
                       X_ref = df[, c("x1", "x2")])
Runs [audit_leakage()] separately for each learner recorded in a [LeakFit] and returns a named list of [LeakAudit] objects. Use this when a single fit contains predictions for multiple models and you want model-specific audits. If predictions do not include learner IDs, only a single audit can be run and requesting multiple learners is an error.
audit_leakage_by_learner(
  fit,
  metric = c("auc", "pr_auc", "accuracy", "macro_f1", "log_loss", "rmse", "cindex"),
  learners = NULL,
  parallel_learners = FALSE,
  mc.cores = NULL,
  ...
)
fit |
A [LeakFit] object produced by [fit_resample()]. It must contain predictions and split metadata. Learner IDs must be present in predictions to audit multiple models. |
metric |
Character scalar. One of '"auc"', '"pr_auc"', '"accuracy"', '"macro_f1"', '"log_loss"', '"rmse"', or '"cindex"'. Controls which metric is audited for each learner. |
learners |
Character vector or NULL. If NULL (default), audits all learners found in predictions. If provided, must match learner IDs stored in the predictions. Supplying more than one learner requires learner IDs. |
parallel_learners |
Logical scalar. If TRUE, runs per-learner audits in parallel using 'future.apply' (if installed). This changes runtime but not the audit results. |
mc.cores |
Integer scalar or NULL. Number of workers used when 'parallel_learners = TRUE'. Defaults to the minimum of available cores and the number of learners. |
... |
Additional named arguments forwarded to [audit_leakage()] for each learner. These control the audit itself. Common options include: 'B' (integer permutations), 'perm_stratify' (logical or '"auto"'), 'perm_refit' (logical or '"auto"'), 'perm_refit_spec' (list), 'time_block' (character), 'block_len' (integer or NULL), 'include_z' (logical), 'ci_method' (character), 'boot_B' (integer), 'parallel' (logical), 'seed' (integer), 'return_perm' (logical), 'batch_cols' (character vector), 'coldata' (data.frame), 'X_ref' (matrix/data.frame), 'target_scan' (logical), 'target_threshold' (numeric), 'target_p_adjust' (character), 'target_alpha' (numeric), 'feature_space' (character), 'sim_method' (character), 'sim_threshold' (numeric), 'nn_k' (integer), 'max_pairs' (integer), and 'duplicate_scope' (character). See [audit_leakage()] for full definitions; changing these values changes each learner's audit. |
A named list of LeakAudit objects, where each
element is keyed by the learner ID (character string). Each
LeakAudit object contains the same slots as described in
audit_leakage: fit, permutation_gap,
perm_values, batch_assoc, target_assoc,
duplicates, trail, and info. Use names() to
retrieve learner IDs, and access individual audits with [[learner_id]]
or $learner_id. Each audit reflects the performance and diagnostics
for that specific learner's predictions.
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = factor(rep(c(0, 1), 6)),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject",
                          v = 3, progress = FALSE)
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = data.frame(y = y, x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
custom$glm2 <- custom$glm
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = c("glm", "glm2"), custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)
audits <- audit_leakage_by_learner(fit, metric = "auc", B = 10,
                                   perm_stratify = FALSE)
names(audits)
Returns the permutation-gap test data frame stored in a ['LeakAudit'] object. Columns include the observed metric, permuted-null mean and SD, gap, z-score, and permutation p-value.
audit_perm_gap(audit)
## S4 method for signature 'LeakAudit'
audit_perm_gap(audit)
audit |
A ['LeakAudit'] object returned by [audit_leakage()]. |
Implemented as an S4 generic with a method for ['LeakAudit']; visible via 'methods(class = "LeakAudit")'.
A 'data.frame' with one row per (mechanism class, repeat), summarising the permutation-gap test.
[LeakClasses], [audit_leakage()], [audit_target_assoc()]
Creates an HTML report that summarizes a leakage audit for a resampled model. The report is built from a [LeakAudit] (or created from a [LeakFit]) and presents: cross-validated metric summaries, a label-permutation association test of the chosen performance metric (auto-refit when refit data are available; otherwise fixed predictions), batch or study association tests between metadata and predictions, confounder sensitivity plots, calibration checks for binomial tasks, a target leakage scan based on feature-outcome similarity (with multivariate scan enabled by default for supported tasks), and duplicate detection across training and test folds. The output is a self-contained HTML file with tables and plots for these checks plus the audit parameters used.
audit_report(
  audit,
  output_file = "bioLeak_audit_report.html",
  output_dir = tempdir(),
  quiet = TRUE,
  open = FALSE,
  ...
)
audit |
A [LeakAudit] object from [audit_leakage()] or a [LeakFit] object from [fit_resample()]. If a [LeakAudit] is supplied, the report uses its stored results verbatim. If a [LeakFit] is supplied, 'audit_report()' first computes a new audit via [audit_leakage(...)]; the fit must contain predictions and split metadata. When multiple learners were fit, pass a 'learner' argument via '...' to select a single model. |
output_file |
Character scalar. File name for the HTML report. Defaults to '"bioLeak_audit_report.html"'. If a relative name is provided, it is created inside 'output_dir'. Changing this value only changes the file name, not the audit content. |
output_dir |
Character scalar. Directory path where the report is written. Defaults to [tempdir()]. The directory must exist or be creatable by 'rmarkdown::render()'. Changing this value only changes the output location. |
quiet |
Logical scalar passed to 'rmarkdown::render()'. Defaults to 'TRUE'. When 'FALSE', knitting output and warnings are printed to the console. This does not change audit results. |
open |
Logical scalar. Defaults to 'FALSE'. When 'TRUE', opens the generated report in a browser via [utils::browseURL()]. This does not change the report contents. |
... |
Additional named arguments forwarded to [audit_leakage()] only when 'audit' is a [LeakFit]. These control how the audit is computed and therefore change the report. Typical examples include 'metric' (character), 'B' (integer), 'perm_stratify' (logical), 'batch_cols' (character vector), 'X_ref' (matrix/data.frame), 'sim_method' (character), and 'duplicate_scope' (character). When omitted, [audit_leakage()] defaults are used. Ignored when 'audit' is already a [LeakAudit]. |
The report does not refit models or reprocess data unless 'perm_refit' triggers refitting ('TRUE' or '"auto"' with a valid 'perm_refit_spec'); it otherwise inspects the predictions and metadata stored in the input. Results are conditional on the provided splits, selected metric, and any feature matrix supplied to [audit_leakage()]. The univariate target leakage scan can miss multivariate proxies, interaction leakage, or features not included in 'X_ref'; the multivariate scan (enabled by default for supported tasks) adds a model-based check but still only uses features in 'X_ref'. A non-significant result does not prove the absence of leakage, especially with small 'B' or incomplete metadata. Rendering requires the 'rmarkdown' package and 'ggplot2' for plots.
Character string containing the absolute file path to the generated
HTML report. The report is a self-contained HTML file that can be opened
in any web browser. It includes sections for: cross-validated metric
summaries, label-permutation test results (gap, p-value), batch/study
association tests, confounder sensitivity analysis, calibration diagnostics
(for binomial tasks), target leakage scan results, and duplicate detection
findings. The path can be used with browseURL to open
the report programmatically.
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = factor(rep(c(0, 1), 6)),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject",
                          v = 3, progress = FALSE)
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = data.frame(y = y, x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)
audit <- audit_leakage(fit, metric = "auc", B = 5, perm_stratify = FALSE)
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE) &&
    isTRUE(rmarkdown::pandoc_available("1.12.3"))) {
  out_file <- audit_report(audit, output_dir = tempdir(), quiet = TRUE)
  out_file
}
Returns the target-association data frame stored in a ['LeakAudit'] object. Each row is one predictor with its association score (rescaled AUC; '|AUC - 0.5| * 2'), threshold-based flag, and (where applicable) p-value.
audit_target_assoc(audit)
## S4 method for signature 'LeakAudit'
audit_target_assoc(audit)
audit |
A ['LeakAudit'] object returned by [audit_leakage()]. |
Implemented as an S4 generic with a method for ['LeakAudit']; visible via 'methods(class = "LeakAudit")'.
A 'data.frame' with one row per predictor.
[LeakClasses], [audit_leakage()]
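The rescaled score described above, '|AUC - 0.5| * 2', can be illustrated in base R. This is a minimal sketch, not the package's implementation: 'auc_rank' and 'assoc_score' are hypothetical helper names, and the rank-based (Mann-Whitney) AUC is an assumption about how a per-feature AUC might be computed.

```r
# Hypothetical helper: rank-based (Mann-Whitney) AUC of a numeric feature
# against a binary outcome. The second factor level is treated as positive.
auc_rank <- function(x, y) {
  y <- as.integer(as.factor(y)) - 1L
  r <- rank(x)                          # average ranks handle ties
  n_pos <- sum(y == 1L)
  n_neg <- sum(y == 0L)
  (sum(r[y == 1L]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Rescaled score: 0 means no association, 1 means perfect separation
# in either direction.
assoc_score <- function(x, y) abs(auc_rank(x, y) - 0.5) * 2

y <- c(0, 0, 0, 1, 1, 1)
perfect  <- c(1, 2, 3, 4, 5, 6)   # separates the classes completely
reversed <- -perfect              # same separation, opposite direction
flat     <- rep(1, 6)             # carries no information

scores <- c(perfect  = assoc_score(perfect, y),
            reversed = assoc_score(reversed, y),
            flat     = assoc_score(flat, y))
scores
```

Because the score folds AUC around 0.5, a feature that predicts the outcome "backwards" is scored the same as one that predicts it forwards, which is the relevant behaviour when scanning for target proxies.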
Runs a reproducible grid of simulation scenarios across modalities, leakage mechanisms, and split modes using [simulate_leakage_suite()]. This function is designed as a benchmarking harness to quantify detection rates and performance inflation under controlled settings.
benchmark_leakage_suite(
  modalities = c("omics", "imaging_tabular", "ehr_tabular"),
  leakages = c("none", "subject_overlap", "batch_confounded", "peek_norm", "lookahead"),
  modes = c("subject_grouped", "batch_blocked", "time_series"),
  learner = c("glmnet", "ranger"),
  seeds = 1:5,
  B = 200,
  alpha = 0.05,
  parallel = FALSE
)
modalities |
Character vector selecting predefined modality profiles. Supported values: '"omics"', '"imaging_tabular"', '"ehr_tabular"'. |
leakages |
Character vector of leakage mechanisms passed to [simulate_leakage_suite()]. |
modes |
Character vector of split modes passed to [simulate_leakage_suite()]. |
learner |
Character scalar. '"glmnet"' (default) or '"ranger"'. |
seeds |
Integer vector of Monte Carlo seeds. |
B |
Integer scalar. Number of permutations per scenario. |
alpha |
Numeric scalar in (0, 1). Detection threshold applied to permutation p-values. |
parallel |
Logical scalar. If TRUE, evaluates scenarios in parallel when 'future.apply' is available. |
A data.frame with one row per simulation seed/scenario and columns: 'modality', 'leakage', 'mode', 'seed', observed metric, gap, p-value, and a logical 'detected' flag. A scenario-level summary is attached as 'attr(x, "summary")'.
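To make the 'detected' flag and the scenario-level summary concrete, the following sketch builds a small hand-written results table and aggregates it in base R. The column name 'p_value', the strict 'p < alpha' rule, and the numbers themselves are assumptions for illustration only; they are not output from the package.

```r
# Hypothetical per-seed results; values are made up for illustration.
res <- data.frame(
  leakage = rep(c("none", "subject_overlap"), each = 4),
  mode    = "subject_grouped",
  seed    = rep(1:4, times = 2),
  p_value = c(0.40, 0.62, 0.03, 0.81, 0.01, 0.02, 0.20, 0.004)
)

alpha <- 0.05
res$detected <- res$p_value < alpha   # assumed detection rule

# Detection rate per scenario: the share of seeds flagged at level alpha.
rates <- aggregate(detected ~ leakage + mode, data = res, FUN = mean)
rates
```

In a table like this, the rate for the "none" scenario estimates the false-positive rate of the audit, while the rates for the leaky scenarios estimate its power.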
Computes reliability curve summaries and calibration metrics for a binomial [LeakFit] using out-of-fold predictions.
calibration_summary(fit, bins = 10, min_bin_n = 5, learner = NULL)
fit |
A [LeakFit] object from [fit_resample()]. |
bins |
Integer number of probability bins for the calibration curve. |
min_bin_n |
Minimum samples per bin used in plotting; bins smaller than this are retained in the output but can be filtered by the caller. |
learner |
Optional character scalar. When predictions include multiple learners, selects the learner to summarize. |
A list with a 'curve' data.frame and a one-row 'metrics' data.frame containing ECE, MCE, and Brier score.
set.seed(42)
df <- data.frame(
  subject = rep(1:15, each = 2),
  outcome = factor(rep(c(0, 1), 15)),
  x1 = rnorm(30),
  x2 = rnorm(30)
)
splits <- make_split_plan(df, outcome = "outcome", mode = "subject_grouped",
                          group = "subject", v = 3, progress = FALSE)
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object, newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits, learner = "glm",
                    custom_learners = custom, metrics = "auc",
                    refit = FALSE, seed = 1)
cal <- calibration_summary(fit, bins = 5)
cal$metrics
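For readers unfamiliar with the three reported metrics, the sketch below computes ECE, MCE, and the Brier score by hand in base R. It assumes equal-width probability bins and the usual textbook definitions; 'calib_metrics' is a hypothetical helper, and the package's own binning may differ.

```r
# Hypothetical helper: calibration metrics from predicted probabilities
# and 0/1 outcomes, using equal-width bins on [0, 1] (an assumption).
calib_metrics <- function(prob, truth, bins = 10) {
  bin <- cut(prob, breaks = seq(0, 1, length.out = bins + 1),
             include.lowest = TRUE)
  n    <- tapply(prob, bin, length)   # NA for empty bins
  conf <- tapply(prob, bin, mean)     # mean predicted probability per bin
  freq <- tapply(truth, bin, mean)    # observed event rate per bin
  keep <- !is.na(n)
  gap  <- abs(freq[keep] - conf[keep])
  data.frame(
    ece   = sum(n[keep] / length(prob) * gap),  # size-weighted mean gap
    mce   = max(gap),                           # worst single-bin gap
    brier = mean((prob - truth)^2)              # mean squared error
  )
}

prob  <- c(0.1, 0.1, 0.9, 0.9)
truth <- c(0, 0, 1, 0)
m <- calib_metrics(prob, truth, bins = 5)
m
```

ECE weights each bin's gap by its share of samples, so a badly calibrated but sparsely populated bin moves MCE much more than ECE.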
Verifies that a LeakSplits object satisfies the
expected no-overlap constraints for one or more grouping columns. For each
fold, the function checks that no group-level value appearing in the test
set is also present in the training set.
check_split_overlap(splits, coldata = NULL, cols = NULL, stop_on_fail = TRUE)
splits |
A |
coldata |
A data.frame of sample metadata. When |
cols |
Character vector of column names to check for overlap. When
|
stop_on_fail |
Logical; if |
A data.frame with one row per fold-by-column combination and
columns fold, repeat_id, col, n_overlap
(number of overlapping group values), and pass (logical).
The data.frame is returned invisibly. An error is raised if any fold fails and stop_on_fail = TRUE.
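The per-fold check described above amounts to intersecting the group values seen in training with those seen in testing. The following base R sketch shows that idea on a hand-built fold layout; the 'folds' list structure and the 'check_overlap' helper are hypothetical and do not reflect the internal layout of a LeakSplits object.

```r
# Hypothetical fold layout: each fold holds train/test row indices.
coldata <- data.frame(subject = rep(1:4, each = 2))
folds <- list(
  list(train = 3:8,          test = 1:2),  # subject 1 held out cleanly
  list(train = c(1, 2, 4:8), test = 3)     # subject 2 appears on both sides
)

check_overlap <- function(folds, coldata, col) {
  do.call(rbind, lapply(seq_along(folds), function(i) {
    shared <- intersect(coldata[[col]][folds[[i]]$train],
                        coldata[[col]][folds[[i]]$test])
    data.frame(fold = i, col = col,
               n_overlap = length(shared),   # overlapping group values
               pass = length(shared) == 0)
  }))
}

res <- check_overlap(folds, coldata, "subject")
res
```

Note that the second fold has no shared rows, only a shared subject, which is exactly the kind of leak a row-level check would miss.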
Computes performance metrics within confounder strata to surface potential confounding. Requires aligned metadata in 'coldata'.
confounder_sensitivity(
  fit,
  confounders = NULL,
  metric = NULL,
  min_n = 10,
  coldata = NULL,
  numeric_bins = 4,
  learner = NULL,
  strict_align = FALSE
)
fit |
A [LeakFit] object from [fit_resample()]. |
confounders |
Character vector of columns in 'coldata' to evaluate. Defaults to common batch/study identifiers when available. |
metric |
Metric name to compute within each stratum. Defaults to the first metric used in the fit (or task defaults if unavailable). |
min_n |
Minimum samples per stratum; smaller strata return NA metrics. |
coldata |
Optional data.frame of sample metadata. Defaults to 'fit@splits@info$coldata' when available. |
numeric_bins |
Integer number of quantile bins for numeric confounders with many unique values. |
learner |
Optional character scalar. When predictions include multiple learners, selects the learner to summarize. |
strict_align |
Logical scalar. If TRUE, errors when coldata cannot be aligned by row names or IDs and would fall back to row-order matching. Default is FALSE. |
A data.frame with per-confounder, per-level metrics and counts.
set.seed(42)
df <- data.frame(
  subject = rep(1:15, each = 2),
  outcome = factor(rep(c(0, 1), 15)),
  batch = factor(rep(c("A", "B", "C"), 10)),
  x1 = rnorm(30),
  x2 = rnorm(30)
)
splits <- make_split_plan(df, outcome = "outcome", mode = "subject_grouped",
                          group = "subject", v = 3, progress = FALSE)
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object, newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits, learner = "glm",
                    custom_learners = custom, metrics = "auc",
                    refit = FALSE, seed = 1)
confounder_sensitivity(fit, confounders = "batch", coldata = df)
Computes per-learner confidence intervals for each metric column in a per-fold metrics data.frame. Supports the standard normal/t approach and the Nadeau-Bengio (2003) corrected variance for repeated K-fold CV.
cv_ci(
  metrics_df,
  level = 0.95,
  method = c("normal", "nadeau_bengio"),
  n_train = NULL,
  n_test = NULL
)
metrics_df |
Data.frame with columns |
level |
Confidence level (default 0.95). |
method |
One of |
n_train |
Average number of training samples per fold. Used only when
|
n_test |
Average number of test samples per fold. Used only when
|
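The role of n_train and n_test can be seen in a small base R sketch of the two interval types. It assumes the commonly cited form of the Nadeau-Bengio correction, in which the variance of the mean is scaled by (1/J + n_test/n_train) instead of 1/J for J fold estimates, and it uses a t quantile with J - 1 degrees of freedom. 'cv_ci_sketch' is a hypothetical helper; the package's exact computation may differ.

```r
# Hypothetical helper: naive vs. Nadeau-Bengio corrected interval for one
# metric, given J per-fold values in x.
cv_ci_sketch <- function(x, level = 0.95,
                         method = c("normal", "nadeau_bengio"),
                         n_train = NULL, n_test = NULL) {
  method <- match.arg(method)
  J  <- length(x)
  s2 <- stats::var(x)
  # Naive scaling is 1/J; the correction adds n_test/n_train to account
  # for overlap between training sets across folds.
  scale <- if (method == "nadeau_bengio") 1 / J + n_test / n_train else 1 / J
  half <- stats::qt(1 - (1 - level) / 2, df = J - 1) * sqrt(scale * s2)
  c(mean = mean(x), ci_lo = mean(x) - half, ci_hi = mean(x) + half)
}

auc   <- c(0.80, 0.84, 0.78, 0.82, 0.86)
naive <- cv_ci_sketch(auc)
nb    <- cv_ci_sketch(auc, method = "nadeau_bengio", n_train = 80, n_test = 20)
rbind(naive = naive, nadeau_bengio = nb)
```

With five folds and a 20/80 test/train ratio the corrected interval is 1.5 times as wide as the naive one, since sqrt((0.2 + 0.25) / 0.2) = 1.5.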
A data.frame with learner and, for each metric, columns
<metric>_mean, <metric>_sd, <metric>_ci_lo, and
<metric>_ci_hi.
Compares a naive (potentially leaky) cross-validation pipeline against a guarded (leakage-protected) pipeline and quantifies leakage-induced performance inflation using the Leakage Sensitivity Index (LSI).
delta_lsi(
  fit_leaky,
  fit_guarded,
  metric = "auc",
  exchangeability = c("iid", "by_group", "within_batch", "blocked_time"),
  learner = NULL,
  higher_is_better = NULL,
  block_size = NULL,
  M_boot = 2000L,
  M_flip = 10000L,
  strict = FALSE,
  return_details = FALSE,
  seed = 42L,
  ...
)
fit_leaky |
A |
fit_guarded |
A |
metric |
Character. Performance metric to compare. Must appear in
|
exchangeability |
Character. Exchangeability assumption for the
sign-flip test. One of |
learner |
Optional character. Learner name to select from multi-learner
fits. If |
higher_is_better |
Logical or |
block_size |
Integer or |
M_boot |
Integer. Number of bootstrap samples for BCa CI (default 2000). |
M_flip |
Integer. Maximum Monte Carlo samples for sign-flip test when R_eff > 15 (default 10000). |
strict |
Logical. If |
return_details |
Logical. If |
seed |
Integer. Random seed for bootstrap and sign-flip test. |
... |
Unused. Reserved for deprecated aliases such as
|
For each fit, per-fold metric values are extracted from fit@metrics
(or recomputed from fit@predictions if necessary). Fold test-set
sizes are used as weights to aggregate fold metrics into per-repeat
estimates $\hat{M}_r^{\text{naive}}$ and $\hat{M}_r^{\text{guarded}}$. The repeat-level delta $\Delta_r$
captures leakage-induced performance inflation for each CV repeat $r$, where
$\Delta_r = \hat{M}_r^{\text{naive}} - \hat{M}_r^{\text{guarded}}$ for higher-is-better metrics (e.g., AUC) and
$\Delta_r = \hat{M}_r^{\text{guarded}} - \hat{M}_r^{\text{naive}}$ for lower-is-better metrics (e.g., RMSE), so that
$\Delta_r > 0$ always indicates the naive pipeline is more optimistic than the guarded one.
The delta_lsi point estimate is the Huber M-estimator (k = 1.345)
applied to $\{\Delta_r\}$, which is robust to occasional outlier
repeats. delta_metric is the arithmetic mean of the per-repeat metric
differences, for easy interpretation in the original metric's units.
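The difference between the robust and the raw point estimate is easiest to see on a vector of per-repeat deltas with one outlier. The sketch below implements a Huber location estimate with k = 1.345 by iteratively reweighted averaging, holding the scale fixed at the MAD. 'huber_location' is a hypothetical helper and the fixed-MAD scale is an assumption; the package may estimate scale differently.

```r
# Hypothetical helper: Huber M-estimate of location with tuning constant k.
# The scale is held fixed at the MAD of x (an assumption of this sketch).
huber_location <- function(x, k = 1.345, tol = 1e-8, max_iter = 100) {
  mu <- stats::median(x)
  s  <- stats::mad(x)
  if (s == 0) return(mu)               # degenerate spread: fall back to median
  mu_new <- mu
  for (i in seq_len(max_iter)) {
    r <- (x - mu) / s
    w <- pmin(1, k / abs(r))           # weight 1 within k scale units, less beyond
    mu_new <- sum(w * x) / sum(w)
    if (abs(mu_new - mu) < tol) break
    mu <- mu_new
  }
  mu_new
}

# Made-up per-repeat deltas; the last repeat is an outlier.
delta  <- c(0.04, 0.05, 0.06, 0.05, 0.30)
robust <- huber_location(delta)
raw    <- mean(delta)
c(robust = robust, raw = raw, median = stats::median(delta))
```

The single extreme repeat pulls the arithmetic mean well away from the bulk of the deltas, while the Huber estimate stays close to the median.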
Pairing requires that fit_leaky and fit_guarded share
identical fold structures (same test-set membership per fold) in
addition to the same number of repeats. When repeat counts match but fold
structures differ, a warning is issued and the fits are treated as unpaired.
When $R_{\text{naive}} = R_{\text{guarded}}$ (equal, paired repeats), a sign-flip
randomization test (Phipson & Smyth, 2010) is performed: under
$H_0$ (no leakage) the sign of each $\Delta_r$ is exchangeable.
All $2^{R_{\text{eff}}}$ sign combinations are enumerated exactly for
$R_{\text{eff}} \le 15$ (no continuity correction); Monte Carlo sampling is used
for larger $R_{\text{eff}}$ with the Phipson & Smyth (2010) correction.
BCa bootstrap confidence intervals (Efron, 1987) require
$R_{\text{eff}} \ge 10$.
"A_full_inference" (R_eff >= 20): point + BCa CI + sign-flip p-value; inference_ok = TRUE
"B_signflip_ci" (10 <= R_eff < 20): point + sign-flip p-value + BCa CI
"C_signflip" (5 <= R_eff < 10): point + sign-flip p-value (no CI)
"D_insufficient" (R_eff < 5 or unpaired): point estimate only
A LeakDeltaLSI object. Use
summary() to print a formatted report. For programmatic
access to inflation estimates, confidence intervals, p-values,
tier labels, and the per-repeat fold-level deltas, use the S4
accessor methods registered for LeakDeltaLSI:
dlsi_metric(dlsi) – raw mean of
per-repeat metric differences (the delta_metric slot).
dlsi_robust(dlsi) – Huber-robust point
estimate (the delta_lsi slot).
dlsi_ci(dlsi, which = "robust" | "metric") –
BCa bootstrap confidence interval for either the robust or
the raw estimate.
dlsi_p_value(dlsi) – sign-flip
randomization-test p-value.
dlsi_tier(dlsi) – inference-tier label
("A_full_inference", "B_signflip_ci",
"C_signflip", or "D_insufficient").
dlsi_R_eff(dlsi) – effective number of
paired repeats contributing to the inference.
dlsi_repeats(dlsi, which = "naive" | "guarded")
– per-repeat metric data frame for the requested pipeline.
The full list of accessor methods for the class is available
through methods(class = "LeakDeltaLSI").
audit_leakage, fit_resample,
LeakDeltaLSI
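The exact sign-flip test described in the details can be sketched in a few lines of base R. This version enumerates every sign pattern, uses the mean of the deltas as the test statistic, and reports a one-sided p-value; the choice of statistic and sidedness are assumptions of the sketch, and 'signflip_exact' is a hypothetical helper rather than the package's internal routine.

```r
# Hypothetical helper: exact one-sided sign-flip p-value for paired deltas.
signflip_exact <- function(delta) {
  R <- length(delta)
  # All 2^R sign patterns as rows of -1/+1.
  signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), R)))
  null_means <- as.vector(signs %*% delta) / R
  # Share of patterns whose mean is at least as large as the observed mean;
  # the small tolerance guards against floating-point ties.
  mean(null_means >= mean(delta) - 1e-12)
}

# Made-up per-repeat deltas, all positive.
delta <- c(0.03, 0.05, 0.02, 0.04, 0.06)
p <- signflip_exact(delta)
p
```

With five repeats there are only 32 sign patterns, so the smallest attainable one-sided p-value is 1/32. This granularity is one reason small repeat counts limit what the test can show.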
Returns the bias-corrected and accelerated (BCa) bootstrap confidence interval stored in a ['LeakDeltaLSI'] object. By default returns the interval for the Huber-robust delta estimate; set 'which = "metric"' to return the interval for the raw metric difference instead.
dlsi_ci(dlsi, which = c("robust", "metric"))
## S4 method for signature 'LeakDeltaLSI'
dlsi_ci(dlsi, which = c("robust", "metric"))
dlsi |
A ['LeakDeltaLSI'] object returned by [delta_lsi()]. |
which |
Either '"robust"' (default) for the Huber estimate's confidence interval, or '"metric"' for the raw arithmetic mean's confidence interval. |
Implemented as an S4 generic with a method for ['LeakDeltaLSI']; visible via 'methods(class = "LeakDeltaLSI")'.
A length-two numeric vector 'c(lower, upper)'. Returns 'c(NA_real_, NA_real_)' when the interval is not computed (for example, when the inference tier did not include CIs).
[LeakClasses], [delta_lsi()], [dlsi_robust()], [dlsi_metric()]
Returns the arithmetic mean of the per-repeat raw metric differences (leaky minus guarded) stored in a ['LeakDeltaLSI'] object.
dlsi_metric(dlsi)
## S4 method for signature 'LeakDeltaLSI'
dlsi_metric(dlsi)
dlsi |
A ['LeakDeltaLSI'] object returned by [delta_lsi()]. |
Implemented as an S4 generic with a method for ['LeakDeltaLSI']; visible via 'methods(class = "LeakDeltaLSI")'.
A length-one numeric scalar.
[LeakClasses], [delta_lsi()], [dlsi_robust()], [dlsi_ci()]
Returns the paired sign-flip randomization-test p-value stored in a ['LeakDeltaLSI'] object.
dlsi_p_value(dlsi)
## S4 method for signature 'LeakDeltaLSI'
dlsi_p_value(dlsi)
dlsi |
A ['LeakDeltaLSI'] object returned by [delta_lsi()]. |
Implemented as an S4 generic with a method for ['LeakDeltaLSI']; visible via 'methods(class = "LeakDeltaLSI")'.
A length-one numeric scalar in '[0, 1]', or 'NA_real_' when the inference tier did not include hypothesis testing.
[LeakClasses], [delta_lsi()], [dlsi_tier()]
Returns the effective number of paired repeats 'R_eff' stored in a ['LeakDeltaLSI'] object. This is the count of repeats that contribute to the inference; it equals the smaller of the leaky and guarded fits' repeat counts when the comparison is paired.
dlsi_R_eff(dlsi)
## S4 method for signature 'LeakDeltaLSI'
dlsi_R_eff(dlsi)
dlsi |
A ['LeakDeltaLSI'] object returned by [delta_lsi()]. |
Implemented as an S4 generic with a method for ['LeakDeltaLSI']; visible via 'methods(class = "LeakDeltaLSI")'.
A length-one integer.
[LeakClasses], [delta_lsi()], [dlsi_tier()]
Returns the per-repeat metric data frame for one of the two pipelines stored in a ['LeakDeltaLSI'] object. The naive (or leaky) pipeline's repeats are returned by default; set 'which = "guarded"' to return the guarded pipeline's repeats.
dlsi_repeats(dlsi, which = c("naive", "guarded"))
## S4 method for signature 'LeakDeltaLSI'
dlsi_repeats(dlsi, which = c("naive", "guarded"))
dlsi |
A ['LeakDeltaLSI'] object returned by [delta_lsi()]. |
which |
Either '"naive"' (default) for the naive/leaky pipeline's per-repeat data frame, or '"guarded"' for the guarded pipeline's per-repeat data frame. |
Implemented as an S4 generic with a method for ['LeakDeltaLSI']; visible via 'methods(class = "LeakDeltaLSI")'.
A 'data.frame' with one row per repeat.
[LeakClasses], [delta_lsi()]
Returns the Huber-robust point estimate of the per-repeat delta values stored in a ['LeakDeltaLSI'] object.
dlsi_robust(dlsi)
## S4 method for signature 'LeakDeltaLSI'
dlsi_robust(dlsi)
dlsi |
A ['LeakDeltaLSI'] object returned by [delta_lsi()]. |
Implemented as an S4 generic with a method for ['LeakDeltaLSI']; visible via 'methods(class = "LeakDeltaLSI")'.
A length-one numeric scalar.
[LeakClasses], [delta_lsi()], [dlsi_metric()], [dlsi_ci()]
Returns the inference tier label stored in a ['LeakDeltaLSI'] object. Possible values are '"A_full_inference"' ('R_eff >= 20'), '"B_signflip_ci"' ('10 <= R_eff < 20'), '"C_signflip"' ('5 <= R_eff < 10'), or '"D_insufficient"' ('R_eff < 5', or unpaired fits).
dlsi_tier(dlsi)
## S4 method for signature 'LeakDeltaLSI'
dlsi_tier(dlsi)
dlsi |
A ['LeakDeltaLSI'] object returned by [delta_lsi()]. |
Implemented as an S4 generic with a method for ['LeakDeltaLSI']; visible via 'methods(class = "LeakDeltaLSI")'.
A length-one character string giving the tier label.
[LeakClasses], [delta_lsi()], [dlsi_R_eff()]
Returns the per-fold metric data frame stored in a ['LeakFit'] object. Each row is one (fold, repeat, learner) combination; columns include the requested metric values such as 'auc' and any task-specific performance scores.
fit_metrics(fit)
## S4 method for signature 'LeakFit'
fit_metrics(fit)
fit |
A ['LeakFit'] object returned by [fit_resample()]. |
Implemented as an S4 generic with a method for ['LeakFit']; visible via 'methods(class = "LeakFit")'.
A 'data.frame' with one row per (fold, repeat, learner) combination.
[LeakClasses], [fit_resample()], [audit_perm_gap()]
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = factor(rep(c("a", "b"), 6), levels = c("a", "b")),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome", mode = "subject_grouped",
                          group = "subject", v = 3)
## Not run:
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = parsnip::logistic_reg() |> parsnip::set_engine("glm"),
                    metrics = "auc")
fit_metrics(fit)
## End(Not run)
Performs cross-validated model training and evaluation using leakage-protected preprocessing (guard_fit) and user-specified learners.
fit_resample(
  x,
  outcome,
  splits,
  preprocess = list(impute = list(method = "median"),
                    normalize = list(method = "zscore"),
                    filter = list(var_thresh = 0, iqr_thresh = 0),
                    fs = list(method = "none")),
  learner = c("glmnet", "ranger"),
  learner_args = list(),
  custom_learners = list(),
  metrics = c("auc", "pr_auc", "accuracy"),
  class_weights = NULL,
  positive_class = NULL,
  classification_threshold = 0.5,
  parallel = FALSE,
  refit = TRUE,
  seed = 1,
  split_cols = "auto",
  store_refit_data = TRUE
)
x |
SummarizedExperiment or matrix/data.frame |
outcome |
outcome column name (if x is SE or data.frame), or a length-2 character vector of time/event column names for survival outcomes. |
splits |
LeakSplits object from make_split_plan(), or an 'rsample' rset/rsplit. |
preprocess |
list(impute, normalize, filter=list(...), fs) or a
'recipes::recipe' object. When a recipe is supplied, the guarded preprocessing
pipeline is bypassed and the recipe is prepped on training data only.
Recipe/workflow leakage guardrails run before fitting; configure policy via
|
learner |
parsnip model_spec (or list of model_spec objects) describing the model(s) to fit, or a 'workflows::workflow'. For legacy use, a character vector of learner names (e.g., "glmnet", "ranger") or custom learner IDs is still supported. |
learner_args |
list of additional arguments passed to legacy learners (ignored when 'learner' is a parsnip model_spec). |
custom_learners |
named list of custom learner definitions used only
with legacy character learners. Each entry
must contain |
metrics |
named list of metric functions, vector of metric names, or a 'yardstick::metric_set'. When a yardstick metric set (or list of yardstick metric functions) is supplied, metrics are computed using yardstick with the positive class set to the second factor level. |
class_weights |
optional named numeric vector of weights for binomial or multiclass outcomes |
positive_class |
optional value indicating the positive class for binomial outcomes.
When set, the outcome levels are reordered so that |
classification_threshold |
Numeric threshold in |
parallel |
logical, use future.apply for multicore execution |
refit |
logical, if TRUE retrain final model on full data |
seed |
integer, for reproducibility |
split_cols |
Optional named list/character vector or '"auto"' (default) overriding group/batch/study/time column names when 'splits' is an rsample object and its attributes are missing. '"auto"' falls back to common metadata column names (e.g., 'group', 'subject', 'batch', 'study', 'time'). Supported names are 'group', 'batch', 'study', and 'time'. |
store_refit_data |
Logical; when TRUE (default), stores the original data and learner configuration inside the fit to enable refit-based permutation tests without manual 'perm_refit_spec' setup. |
Preprocessing is fit on the training fold and applied to the test fold,
preventing leakage from global imputation, scaling, or feature selection.
When a 'recipes::recipe' or 'workflows::workflow' is supplied, the recipe is
prepped on the training fold and baked on the test fold.
For data.frame or matrix inputs, columns used to define splits
(outcome, group, batch, study, time) are excluded from the predictor matrix.
Use learner_args to pass model-specific arguments, either as a named
list keyed by learner or a single list applied to all learners. For custom
learners, learner_args[[name]] may be a list with fit and
predict sublists to pass distinct arguments to each stage. For binomial
tasks, predictions and metrics assume the positive class is the second factor
level; use positive_class to control this. Use
classification_threshold to change the probability cutoff used for
class labels and accuracy. Parsnip learners must support
probability predictions for binomial metrics (AUC/PR-AUC/accuracy) and
multiclass log-loss when requested.
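The train-only principle in the first paragraph above can be shown without the package. The sketch below z-scores each fold using means and standard deviations from the training rows only, then applies those same statistics to the held-out rows. The fold assignment and variable names are illustrative assumptions; this is not how fit_resample() is implemented internally.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 20, dimnames = list(NULL, c("x1", "x2")))
fold <- rep(1:4, each = 5)   # illustrative fold assignment

scaled <- lapply(1:4, function(k) {
  train <- x[fold != k, , drop = FALSE]
  test  <- x[fold == k, , drop = FALSE]
  mu  <- colMeans(train)               # statistics from training rows only
  sdv <- apply(train, 2, stats::sd)
  list(train = scale(train, center = mu, scale = sdv),
       test  = scale(test,  center = mu, scale = sdv))
})

# Training columns are exactly standardized; test columns generally are not,
# because they were scaled with statistics they did not contribute to.
round(colMeans(scaled[[1]]$train), 12)
round(colMeans(scaled[[1]]$test), 3)
```

The leaky alternative would call scale() once on all 20 rows before splitting, letting every test row influence the centering and scaling applied to the training data.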
A LeakFit S4 object containing:
splits: The LeakSplits object used for resampling.
metrics: Data.frame of per-fold, per-learner performance metrics with columns fold, learner, and one column per requested metric.
metric_summary: Data.frame summarizing metrics across folds for each learner with columns learner, and <metric>_mean and <metric>_sd for each requested metric.
audit: Data.frame with per-fold audit information including fold, n_train, n_test, learner, and features_final (number of features after preprocessing).
predictions: List of data.frames containing out-of-fold predictions with columns id (sample identifier), truth (true outcome), pred (predicted value or probability), fold, and learner. For classification tasks, includes pred_class. For multiclass, includes per-class probability columns.
preprocess: List of preprocessing state objects from each fold, storing imputation parameters, normalization statistics, and feature selection results.
learners: List of fitted model objects from each fold.
outcome: Character string naming the outcome variable.
task: Character string indicating the task type ("binomial", "multiclass", "gaussian", or "survival").
feature_names: Character vector of feature names after preprocessing.
info: List of additional metadata including hash, metrics_used, class_weights, positive_class, sample_ids, fold_status, refit, final_model (refitted model if refit = TRUE), final_preprocess, learner_names, and perm_refit_spec (for permutation-based audits).
Use summary() to print a formatted report. For
programmatic access to slot contents, the recommended interface
is the S4 accessor method registered for LeakFit:
fit_metrics(fit) – per-fold metric
data frame (the metrics slot).
Slots without dedicated accessors (predictions,
info, audit, learners, preprocess,
feature_names, outcome, task,
splits, metric_summary) are read directly via
the standard @ operator when needed; the list of all
accessor methods for the class is available through
methods(class = "LeakFit").
set.seed(1)
df <- data.frame(
  subject = rep(1:10, each = 2),
  outcome = rbinom(20, 1, 0.5),
  x1 = rnorm(20),
  x2 = rnorm(20)
)
splits <- make_split_plan(df, outcome = "outcome", mode = "subject_grouped",
                          group = "subject", v = 5)
# glmnet learner (requires glmnet package)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glmnet", metrics = "auc")
summary(fit)
# Custom learner (logistic regression) - no extra packages needed
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object, newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit2 <- fit_resample(df, outcome = "outcome", splits = splits, learner = "glm",
                     custom_learners = custom, metrics = "accuracy")
summary(fit2)
Converts character/logical columns to factors and aligns factor levels with
a training-time levels_map. Adds a dummy level when a column has only
one observed level so that downstream one-hot encoding retains a column.
guard_ensure_levels(df, levels_map = NULL, dummy_prefix = "__dummy__")
df |
data.frame to normalize factor levels. |
levels_map |
optional named list of factor levels learned from training data. |
dummy_prefix |
prefix used when adding a dummy level to single-level factors. |
List with elements data (data.frame) and levels (named list of levels).
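The idea of a training-time levels map can be illustrated in base R. In this sketch, levels are learned from the training frame and then imposed on the test frame so both encode to the same columns. 'align_levels' is a hypothetical helper, and the handling of unseen test levels (they become NA here) is an assumption of the sketch, not a statement about what guard_ensure_levels() does.

```r
train <- data.frame(site = c("A", "B", "A"), stringsAsFactors = FALSE)
test  <- data.frame(site = c("B", "C"),      stringsAsFactors = FALSE)

# Learn the levels from training data only.
levels_map <- lapply(train, function(col) levels(factor(col)))

# Hypothetical helper: coerce each mapped column to the training levels.
align_levels <- function(df, levels_map) {
  for (nm in names(levels_map)) {
    df[[nm]] <- factor(df[[nm]], levels = levels_map[[nm]])
  }
  df
}

train_aligned <- align_levels(train, levels_map)
test_aligned  <- align_levels(test, levels_map)

levels(test_aligned$site)   # same levels as training
test_aligned$site           # "C" was never seen in training
```

Fixing the levels at training time keeps the one-hot column layout identical across folds, so a model fitted on the training encoding can always be applied to the test encoding.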
Builds and fits a guarded preprocessing pipeline on training data, then constructs a transformer for consistent application to new data.
guard_fit(
  X,
  y = NULL,
  steps = list(),
  task = c("binomial", "multiclass", "gaussian", "survival")
)
X |
matrix/data.frame of predictors (training). |
y |
Optional outcome for supervised feature selection. |
steps |
List of configuration options (see Details). |
task |
"binomial", "multiclass", "gaussian", or "survival". |
The pipeline applies, in order:
Winsorization (optional) to limit outliers.
Imputation learned on training data only.
Normalization (z-score or robust).
Variance/IQR filtering.
Feature selection (optional; t-test, lasso, PCA).
All statistics are estimated on the training data and re-used for new data.
An object of class "GuardFit" with elements 'transform', 'state', 'p_out', and 'steps'.
[predict_guard()]
x <- data.frame(a = c(1, 2, NA), b = c(3, 4, 5))
fit <- guard_fit(x, y = c(1, 2, 3),
                 steps = list(impute = list(method = "median")),
                 task = "gaussian")
fit$transform(x)
Maps bioLeak guard preprocessing steps (impute, normalize, filter, fs) to their closest recipes equivalents. Requires the recipes package. Steps that have no direct recipe equivalent are skipped with a warning.
guard_to_recipe(steps, formula, training_data)
steps |
A named list of guard preprocessing steps, e.g.,
|
formula |
A model formula (e.g., |
training_data |
A data.frame used to initialize the recipe. |
Mapping:
impute$method = "median": step_impute_median(all_numeric_predictors())
impute$method = "knn": step_impute_knn(all_predictors(), neighbors = k)
impute$method = "missForest" or "mice": Warning + step_impute_median() fallback
normalize$method = "zscore": step_normalize(all_numeric_predictors())
normalize$method = "robust": Warning + step_normalize() fallback
normalize$method = "none": No step added
filter$var_thresh > 0: step_nzv(all_numeric_predictors())
fs$method = "pca": step_pca(all_numeric_predictors(), num_comp = ncomp)
fs$method = "ttest" or "lasso": Warning, skipped (no recipe equivalent)
A recipes::recipe object with the mapped steps added.
Fits imputation parameters on the training data only, then applies the same
guarded transformation to the test data. This function is a thin wrapper
around the guarded preprocessing used by fit_resample().
Output is the transformed feature matrix used by the guarded pipeline
(categorical variables are one-hot encoded).
impute_guarded(
  train,
  test,
  method = c("median", "knn", "missForest", "none"),
  constant_value = 0,
  k = 5,
  seed = 123,
  winsor = TRUE,
  winsor_thresh = 3,
  parallel = FALSE,
  return_outliers = FALSE,
  vars = NULL
)
train |
data frame (training set) |
test |
data frame (test set) |
method |
one of "median", "knn", "missForest", or "none" |
constant_value |
unused; retained for backward compatibility |
k |
number of neighbors for kNN imputation (if method = "knn") |
seed |
unused; retained for backward compatibility. Set seed before calling this function if reproducibility is needed. |
winsor |
logical; apply MAD-based winsorization before imputation |
winsor_thresh |
numeric; MAD cutoff (default = 3) |
parallel |
logical; unused (kept for compatibility) |
return_outliers |
logical; unused (outlier flags not returned) |
vars |
optional character vector; impute only selected variables |
A list (S3 class "LeakImpute") with elements train,
test, model, method, summary, and outliers.
[fit_resample()], [predict_guard()]
train <- data.frame(x = c(1, 2, NA, 4), y = c(NA, 1, 1, 0))
test <- data.frame(x = c(NA, 5), y = c(1, NA))
imp <- impute_guarded(train, test, method = "median", winsor = FALSE)
imp$train
imp$test
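The combination of MAD-based winsorization and median imputation can be sketched in base R for a single numeric vector. The details here are assumptions, not the package's documented behavior: values are clipped at the training median plus or minus winsor_thresh times stats::mad(), and missing values are filled with the median of the clipped training values. The helper names `fit_winsor_median()` and `apply_winsor_median()` are hypothetical.

```r
# Hypothetical helper: learn clipping bounds and fill value from training data.
fit_winsor_median <- function(x, thresh = 3) {
  center <- stats::median(x, na.rm = TRUE)
  spread <- stats::mad(x, na.rm = TRUE)   # scaled MAD (constant 1.4826)
  lower <- center - thresh * spread
  upper <- center + thresh * spread
  clipped <- pmin(pmax(x, lower), upper)
  list(lower = lower, upper = upper,
       fill = stats::median(clipped, na.rm = TRUE))
}

# Hypothetical helper: apply the stored parameters to any vector.
apply_winsor_median <- function(x, params) {
  x <- pmin(pmax(x, params$lower), params$upper)  # NA stays NA here
  x[is.na(x)] <- params$fill
  x
}

train_x <- c(1, 2, 3, 4, 100, NA)
test_x  <- c(NA, 50)

params    <- fit_winsor_median(train_x, thresh = 3)
train_out <- apply_winsor_median(train_x, params)
test_out  <- apply_winsor_median(test_x, params)  # bounds come from train only
```

The test value 50 is clipped to the upper bound learned from the training vector, and the missing test value receives the training fill value.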
These classes capture splits, model fits, and audit diagnostics produced by
make_split_plan(), fit_resample(), and audit_leakage().
## S4 method for signature 'LeakDeltaLSI'
show(object)
object |
A |
An S4 object of the respective class.
mode: Splitting mode. One of "subject_grouped", "batch_blocked", "study_loocv", "time_series", or "combined".
indices: List of resampling descriptors (train/test indices when available)
info: (LeakSplits) Metadata associated with the split plan (mode, coldata, hash, etc.)
splits: A ['LeakSplits'] object used for resampling
metrics: Model performance metrics per resample
metric_summary: Summary of metrics across resamples
audit: Audit information per resample
predictions: List of prediction objects
preprocess: Preprocessing steps used during fitting
learners: Learner definitions used in the pipeline
outcome: Outcome variable name
task: Modeling task name
feature_names: Feature names included in the model
info: (LeakFit) Metadata about the model fit (sample IDs, timings, provenance, etc.)
fit: A ['LeakFit'] object used to generate the audit
permutation_gap: Data frame summarising permutation gaps
perm_values: Numeric vector of permutation-based scores
batch_assoc: Data frame of batch associations
target_assoc: Data frame of feature-wise outcome associations
duplicates: Data frame detailing duplicate records
trail: List capturing audit trail information
info: (LeakAudit) Metadata about the audit (mechanism summary, settings, provenance, etc.)
metric: Performance metric compared between pipelines
exchangeability: Exchangeability assumption used for the sign-flip test
tier: Inference tier label based on effective number of repeats
strict: Whether strict mode was requested
R_eff: Effective number of paired repeats available for inference
delta_lsi: Huber-robust point estimate of repeat-level metric difference
delta_lsi_ci: BCa 95% CI for delta_lsi (NA when R_eff < 10)
delta_metric: Arithmetic mean of repeat-level metric differences
delta_metric_ci: BCa 95% CI for delta_metric (NA when R_eff < 10)
p_value: Sign-flip randomization test p-value (NA when R_eff < 5 or unpaired)
inference_ok: TRUE when tier A (R_eff >= 20, paired, finite p and CI)
folds_naive: Per-fold data frame for the naive pipeline
folds_guarded: Per-fold data frame for the guarded pipeline
repeats_naive: Per-repeat aggregate data frame for the naive pipeline
repeats_guarded: Per-repeat aggregate data frame for the guarded pipeline
info: (LeakDeltaLSI) Metadata including R_naive, R_guarded, paired status, and block details
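The sign-flip randomization test behind p_value can be illustrated with a small base-R sketch. It is a generic two-sided sign-flip test on paired repeat-level differences, using exact enumeration of all sign patterns; it is not necessarily how the package computes the slot, and the vector `d` and the helper `sign_flip_p()` are hypothetical.

```r
# Hypothetical per-repeat differences (naive metric minus guarded metric).
d <- c(0.04, 0.06, 0.03, 0.05, 0.07, 0.02, 0.05, 0.04)

# Under the null of no systematic difference, each paired difference is
# equally likely to be positive or negative, so all 2^n sign patterns are
# equally likely. The p-value is the share of patterns whose mean is at
# least as extreme as the observed mean.
sign_flip_p <- function(d) {
  n <- length(d)
  signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))
  null_means <- as.vector(signs %*% d) / n
  obs <- mean(d)
  # Small tolerance guards against floating-point ties.
  mean(abs(null_means) >= abs(obs) - 1e-12)
}

p <- sign_flip_p(d)
```

Exact enumeration is only practical for a small number of repeats; with many repeats, random sign patterns would normally be sampled instead.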
The S4 accessor method
[fit_metrics()] returns the per-fold metric data frame. It is
listed under methods(class = "LeakFit") alongside the
show and summary methods.
plot(<LeakFit>) dispatches
to [plot_fold_balance()] by default. Use the which
argument to switch to one of "overlap",
"calibration" (binary outcomes only), "time_acf"
(time-ordered splits), or "confounder_sensitivity".
The S4 accessor methods
[audit_perm_gap()], [audit_batch_assoc()], [audit_target_assoc()],
[audit_duplicates()], and [audit_info()] return the corresponding
slots. They are listed under methods(class = "LeakAudit")
alongside the show and summary methods.
plot(<LeakAudit>)
dispatches to [plot_perm_distribution()], rendering the
permutation-null distribution with the observed metric and
permuted mean marked.
The S4 accessor methods
[dlsi_metric()], [dlsi_robust()], [dlsi_ci()], [dlsi_p_value()],
[dlsi_tier()], [dlsi_R_eff()], and [dlsi_repeats()] return the
corresponding components. They are listed under
methods(class = "LeakDeltaLSI") alongside the show
and summary methods.
plot(<LeakDeltaLSI>)
dispatches to [plot_dlsi_repeats()], rendering the per-repeat
scatter with the Huber-robust point estimate, the
arithmetic mean, and the BCa bootstrap confidence interval band.
[make_split_plan()], [fit_resample()], [audit_leakage()]
[fit_resample()], [fit_metrics()], [plot_fold_balance()]
[audit_leakage()], [audit_report()], [audit_perm_gap()], [plot_perm_distribution()]
[delta_lsi()], [dlsi_metric()], [dlsi_ci()], [plot_dlsi_repeats()]
Generates leakage-safe cross-validation splits for common biomedical setups:
subject-grouped, batch-blocked, study leave-one-out, and time-series
rolling-origin. Supports repeats, optional stratification, nested inner CV,
and optional prediction horizon/purge/embargo gaps for time series. By default,
splits store explicit train/test indices, which can be memory-intensive for
large n and many repeats; setting compact = TRUE stores fold assignments instead.
make_split_plan(
  x,
  outcome = NULL,
  mode = c("subject_grouped", "batch_blocked", "study_loocv", "time_series", "combined"),
  group = NULL,
  batch = NULL,
  study = NULL,
  time = NULL,
  primary_axis = NULL,
  secondary_axis = NULL,
  constraints = NULL,
  v = 5,
  repeats = 1,
  stratify = FALSE,
  nested = FALSE,
  seed = 1,
  horizon = 0,
  purge = 0,
  embargo = 0,
  progress = TRUE,
  compact = FALSE,
  strict = TRUE
)
x |
SummarizedExperiment or data.frame/matrix (samples x features).
If SummarizedExperiment, metadata are taken from colData(x). If data.frame,
metadata are taken from x (columns referenced by |
outcome |
character, outcome column name (used for stratification). |
mode |
one of "subject_grouped","batch_blocked","study_loocv","time_series","combined". |
group |
subject/group id column (for subject_grouped). Required when mode is 'subject_grouped'; use 'group = "row_id"' to explicitly request sample-wise CV. |
batch |
batch/plate/center column (for batch_blocked). |
study |
study id column (for study_loocv). |
time |
time column (numeric or POSIXct) for time_series. |
primary_axis |
List with elements |
secondary_axis |
List with elements |
constraints |
A list of constraint specifications for |
v |
integer, number of folds (k) or rolling partitions. |
repeats |
integer, number of repeats (>=1) for non-LOOCV modes. |
stratify |
logical, keep outcome proportions similar across folds.
For grouped modes, stratification is applied at the group level (by
majority class per group) if |
nested |
logical, whether to attach inner CV splits (per outer fold)
using the same |
seed |
integer seed. |
horizon |
numeric (>=0), minimal time gap for time_series so that the training set only contains samples with time < min(test_time) when horizon = 0, and time <= min(test_time) - horizon otherwise. |
purge |
numeric (>=0), additional gap removed immediately before each time-series test block. |
embargo |
numeric (>=0), additional exclusion window anchored at the end
of each time-series test block. Training rows with
|
progress |
logical, print progress for large jobs. |
compact |
logical; store fold assignments instead of explicit train/test
indices to reduce memory usage for large datasets. Not supported when
|
strict |
logical; deprecated and ignored. 'subject_grouped' always requires a non-NULL 'group'. |
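The horizon rule stated for the horizon argument can be written out as a short base-R sketch. It covers only that rule; purge and embargo are deliberately left out because their exact windows are not spelled out here. The helper `train_for_test()` and the toy time vector are hypothetical.

```r
# Toy time stamps (assumed) and one test block at the end of the series.
time <- 1:12
test_idx <- 9:12

# Hypothetical helper: training rows allowed for a given test block.
# horizon = 0 keeps rows with time <  min(test_time);
# horizon > 0 keeps rows with time <= min(test_time) - horizon.
train_for_test <- function(time, test_idx, horizon = 0) {
  cutoff <- min(time[test_idx])
  keep <- if (horizon == 0) time < cutoff else time <= cutoff - horizon
  setdiff(which(keep), test_idx)
}

train_h0 <- train_for_test(time, test_idx, horizon = 0)  # rows 1 to 8
train_h3 <- train_for_test(time, test_idx, horizon = 3)  # rows 1 to 6
```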
A LeakSplits S4 object containing:
mode: Character string indicating the splitting mode
("subject_grouped", "batch_blocked", "study_loocv",
"time_series", or "combined").
indices: List of fold descriptors, each containing
train (integer vector of training indices), test
(integer vector of test indices), fold (fold number), and
repeat_id (repeat identifier). When compact = TRUE,
indices are stored as fold assignments instead.
info: List of metadata including outcome, v,
repeats, seed, grouping columns (group,
batch, study, time), stratify,
nested, horizon, purge, embargo,
summary (data.frame of fold
sizes), hash (reproducibility checksum), inner
(nested inner splits if nested = TRUE), and coldata
(sample metadata).
Use the show method to print a summary; downstream access
to the indices and metadata is normally done through the
functions that consume a LeakSplits (for example
fit_resample) rather than by reading slots directly.
set.seed(1)
df <- data.frame(
  subject = rep(1:10, each = 2),
  outcome = rbinom(20, 1, 0.5),
  x1 = rnorm(20),
  x2 = rnorm(20)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject", v = 5)
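The idea behind subject-grouped splitting can be shown without the package. The following base-R sketch assigns folds at the subject level, so that all rows of a subject share one fold, and then verifies that no subject appears on both sides of any fold. It is an illustration of the principle under simple assumptions (no stratification, one repeat), not the algorithm used by make_split_plan(); the variable names are hypothetical.

```r
set.seed(1)
subject <- rep(1:10, each = 2)   # two rows per subject
v <- 5

# Assign each unique subject (not each row) to a fold.
ids <- unique(subject)
fold_of_id <- sample(rep(seq_len(v), length.out = length(ids)))
names(fold_of_id) <- ids

# Every row inherits the fold of its subject.
row_fold <- unname(fold_of_id[as.character(subject)])

# For each fold, count subjects shared between test and train rows.
overlaps <- vapply(seq_len(v), function(k) {
  length(intersect(subject[row_fold == k], subject[row_fold != k]))
}, integer(1))
```

Assigning folds row by row instead would split some subjects across train and test, which is the leakage this mode is meant to prevent.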
Visualizes observed outcome rates versus predicted probabilities across bins to diagnose calibration (binomial tasks only). Requires ggplot2.
plot_calibration(fit, bins = 10, min_bin_n = 5, learner = NULL)
fit |
LeakFit. |
bins |
Number of probability bins to use. |
min_bin_n |
Minimum samples per bin shown in the plot. |
learner |
Optional character scalar. When predictions include multiple learners, selects the learner to summarize. |
A list containing the calibration curve, metrics, and a ggplot object.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  set.seed(42)
  df <- data.frame(
    subject = rep(1:15, each = 2),
    outcome = factor(rep(c(0, 1), 15)),
    x1 = rnorm(30),
    x2 = rnorm(30)
  )
  splits <- make_split_plan(df, outcome = "outcome",
                            mode = "subject_grouped", group = "subject",
                            v = 3, progress = FALSE)
  custom <- list(
    glm = list(
      fit = function(x, y, task, weights, ...) {
        stats::glm(y ~ ., data = as.data.frame(x),
                   family = stats::binomial(), weights = weights)
      },
      predict = function(object, newdata, task, ...) {
        as.numeric(stats::predict(object,
                                  newdata = as.data.frame(newdata),
                                  type = "response"))
      }
    )
  )
  fit <- fit_resample(df, outcome = "outcome", splits = splits,
                      learner = "glm", custom_learners = custom,
                      metrics = "auc", refit = FALSE, seed = 1)
  plot_calibration(fit, bins = 5)
}
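The binning behind a calibration curve can be sketched in base R. This assumes equal-width probability bins, which may differ from the binning used by plot_calibration(); the simulated probabilities `p`, outcomes `y`, and the `curve` data frame are hypothetical.

```r
set.seed(1)
p <- runif(200)            # simulated predicted probabilities
y <- rbinom(200, 1, p)     # outcomes drawn so that p is well calibrated

bins <- 5
min_bin_n <- 5

# Equal-width bins on [0, 1] (an assumption of this sketch).
breaks <- seq(0, 1, length.out = bins + 1)
bin <- cut(p, breaks = breaks, include.lowest = TRUE)

# Per-bin size, mean predicted probability, and observed outcome rate.
curve <- data.frame(
  n = as.vector(table(bin)),
  mean_pred = as.vector(tapply(p, bin, mean)),
  obs_rate = as.vector(tapply(y, bin, mean))
)

# Drop sparse bins, mirroring the role of min_bin_n.
curve <- curve[curve$n >= min_bin_n, ]
```

For a well-calibrated model, obs_rate tracks mean_pred across bins; systematic departures from the diagonal indicate over- or under-confidence.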
Shows performance metrics across confounder strata to assess sensitivity to batch/study effects. Requires ggplot2.
plot_confounder_sensitivity(
  fit,
  confounders = NULL,
  metric = NULL,
  min_n = 10,
  coldata = NULL,
  numeric_bins = 4,
  learner = NULL
)
fit |
LeakFit. |
confounders |
Character vector of columns in 'coldata' to evaluate. |
metric |
Metric name to compute within each stratum. |
min_n |
Minimum samples per stratum to display. |
coldata |
Optional data.frame of sample metadata. |
numeric_bins |
Number of quantile bins for numeric confounders. |
learner |
Optional character scalar. When predictions include multiple learners, selects the learner to summarize. |
A list containing the sensitivity table and a ggplot object.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  set.seed(42)
  df <- data.frame(
    subject = rep(1:15, each = 2),
    outcome = factor(rep(c(0, 1), 15)),
    batch = factor(rep(c("A", "B", "C"), 10)),
    x1 = rnorm(30),
    x2 = rnorm(30)
  )
  splits <- make_split_plan(df, outcome = "outcome",
                            mode = "subject_grouped", group = "subject",
                            v = 3, progress = FALSE)
  custom <- list(
    glm = list(
      fit = function(x, y, task, weights, ...) {
        stats::glm(y ~ ., data = as.data.frame(x),
                   family = stats::binomial(), weights = weights)
      },
      predict = function(object, newdata, task, ...) {
        as.numeric(stats::predict(object,
                                  newdata = as.data.frame(newdata),
                                  type = "response"))
      }
    )
  )
  fit <- fit_resample(df, outcome = "outcome", splits = splits,
                      learner = "glm", custom_learners = custom,
                      metrics = "auc", refit = FALSE, seed = 1)
  plot_confounder_sensitivity(fit, confounders = "batch", coldata = df)
}
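A stratified metric table of the kind this plot summarizes can be approximated in base R. The sketch below uses accuracy as the metric and a categorical confounder, and it drops strata smaller than a minimum size in the spirit of min_n. The simulated vectors and the `tab` data frame are hypothetical, and the package may compute its table differently.

```r
set.seed(1)
batch <- rep(c("A", "B", "C"), times = c(20, 20, 4))  # stratum C is small
truth <- rbinom(44, 1, 0.5)
# Simulated class predictions that agree with the truth about 80% of the time.
pred <- ifelse(runif(44) < 0.8, truth, 1 - truth)

min_n <- 10

# Row indices per stratum, then the metric within each stratum.
strata <- split(seq_along(truth), batch)
tab <- data.frame(
  stratum = names(strata),
  n = lengths(strata),
  accuracy = vapply(strata, function(i) mean(pred[i] == truth[i]), numeric(1)),
  stringsAsFactors = FALSE
)

# Keep only strata with enough samples to report.
tab <- tab[tab$n >= min_n, ]
```

Large differences in the metric between strata suggest that performance depends on the confounder rather than on the outcome signal alone.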
Visualizes the per-repeat metric differences (leaky minus guarded) for a
LeakDeltaLSI object, overlaid with the robust Huber
point estimate, the arithmetic mean, and the BCa bootstrap confidence
interval. This is the diagnostic shown as Figure 4 panel (b) of the
manuscript. Requires ggplot2.
plot_dlsi_repeats(dlsi)
dlsi |
A |
A list with the per-repeat deltas, the robust and arithmetic-mean estimates, the BCa confidence interval, and the ggplot object.
Displays a bar chart of class counts per fold. For binomial tasks, it also
overlays the positive proportion to diagnose stratification issues. The
positive class is taken from fit@info$positive_class when available;
otherwise the second factor level is used. For multiclass tasks, the plot
shows per-class counts without a proportion line. Only available for
classification tasks. Requires ggplot2.
plot_fold_balance(fit)
fit |
LeakFit. |
A list containing the fold summary, positive class (if binomial), and a ggplot object.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  set.seed(42)
  df <- data.frame(
    subject = rep(1:15, each = 2),
    outcome = factor(rep(c(0, 1), 15)),
    x1 = rnorm(30),
    x2 = rnorm(30)
  )
  splits <- make_split_plan(df, outcome = "outcome",
                            mode = "subject_grouped", group = "subject",
                            v = 3, progress = FALSE)
  custom <- list(
    glm = list(
      fit = function(x, y, task, weights, ...) {
        stats::glm(y ~ ., data = as.data.frame(x),
                   family = stats::binomial(), weights = weights)
      },
      predict = function(object, newdata, task, ...) {
        as.numeric(stats::predict(object,
                                  newdata = as.data.frame(newdata),
                                  type = "response"))
      }
    )
  )
  fit <- fit_resample(df, outcome = "outcome", splits = splits,
                      learner = "glm", custom_learners = custom,
                      metrics = "auc", refit = FALSE, seed = 1)
  plot_fold_balance(fit)
}
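The quantities shown in the fold-balance plot can be reproduced by hand for a binary outcome. This base-R sketch tabulates class counts per fold and computes the positive proportion, taking the second factor level as the positive class (the documented fallback). The `fold` vector is a hypothetical test-fold assignment, not one produced by the package.

```r
outcome <- factor(rep(c(0, 1), 15))   # 30 samples, alternating classes
fold <- rep(1:3, each = 10)           # hypothetical test-fold assignment

# Class counts per fold (the bars in the plot).
counts <- table(fold, outcome)

# Positive class: second factor level when none is recorded.
positive <- levels(outcome)[2]

# Positive proportion per fold (the overlaid line for binomial tasks).
prop_pos <- counts[, positive] / rowSums(counts)
```

Proportions that vary widely between folds point to a stratification problem.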
Checks whether the same group identifiers appear in both the training and test partitions within each resample. This is designed to detect leakage from grouped or repeated-measures data (for example, the same subject, batch, plate, or study appearing on both sides of a fold) when group-wise splitting is expected.
plot_overlap_checks(fit, column = NULL)
fit |
A 'LeakFit' object produced by [fit_resample()]. It must contain the split indices and the associated metadata in 'fit@splits@info$coldata'. The metadata rows must align with the data used to create the splits. |
column |
Character scalar naming the metadata column to check (for example '"subject"' or '"batch"'). The function compares unique values of this column between train and test within each resample. There is no default: 'NULL' or an unknown column triggers an error. Changing 'column' changes which kind of leakage (subject-level, batch-level, etc.) is tested and therefore the overlap counts. |
For each resample in 'fit@splits@indices', the function counts the number of unique values of 'column' in the train and test sets and the size of their intersection. Any non-zero overlap indicates that at least one group appears in both train and test for that resample. The check is metadata-based only: it relies on exact matches of the supplied column and does not inspect features or outcomes. It only checks train vs test within each resample, so it will not detect overlaps across different resamples or other leakage mechanisms. Inconsistent IDs or missing values in the metadata can hide or inflate overlaps. 'NA' values are treated as regular identifiers and will count toward overlap if they appear in both partitions. Requires ggplot2.
A list returned invisibly with:
'overlap_counts': data.frame with one row per resample and columns 'fold' (resample index in 'fit@splits@indices'), 'overlap' (unique IDs shared by train and test), 'train' (unique IDs in train), and 'test' (unique IDs in test).
'column': the metadata column name used for the check.
'plot': the ggplot object showing the three count series across folds.
The plot is also printed. When any overlap is detected, the plot adds a warning annotation.
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = rbinom(12, 1, 0.5),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject", v = 3)
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "accuracy", refit = FALSE)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  out <- plot_overlap_checks(fit, column = "subject")
  out$overlap_counts
}
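The counting logic described for this check can be reproduced in a few lines of base R. The sketch below uses a hand-written list of folds, one of which deliberately splits a subject across train and test, and builds a table with the same columns as overlap_counts. The `folds` list is hypothetical and stands in for the indices stored in a LeakFit.

```r
subject <- rep(1:6, each = 2)   # rows 7 and 8 both belong to subject 4

# Hypothetical folds; the third one splits subject 4 across train and test.
folds <- list(
  list(train = 5:12,          test = 1:4),
  list(train = c(1:4, 9:12),  test = 5:8),
  list(train = 1:7,           test = 8:12)
)

# Unique IDs per side and the size of their intersection, fold by fold.
overlap_counts <- do.call(rbind, lapply(seq_along(folds), function(i) {
  tr <- unique(subject[folds[[i]]$train])
  te <- unique(subject[folds[[i]]$test])
  data.frame(fold = i,
             overlap = length(intersect(tr, te)),
             train = length(tr),
             test = length(te))
}))
```

Only the third fold reports a non-zero overlap, because subject 4 contributes one row to each side.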
Visualizes the label-permutation metric distribution and marks the observed and permuted-mean values to help assess leakage signals. Requires ggplot2.
plot_perm_distribution(audit)
audit |
LeakAudit. |
A list containing the observed value, permuted mean, permutation values, and a ggplot object.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  set.seed(42)
  df <- data.frame(
    subject = rep(1:15, each = 2),
    outcome = factor(rep(c(0, 1), 15)),
    x1 = rnorm(30),
    x2 = rnorm(30)
  )
  splits <- make_split_plan(df, outcome = "outcome",
                            mode = "subject_grouped", group = "subject",
                            v = 3, progress = FALSE)
  custom <- list(
    glm = list(
      fit = function(x, y, task, weights, ...) {
        stats::glm(y ~ ., data = as.data.frame(x),
                   family = stats::binomial(), weights = weights)
      },
      predict = function(object, newdata, task, ...) {
        as.numeric(stats::predict(object,
                                  newdata = as.data.frame(newdata),
                                  type = "response"))
      }
    )
  )
  fit <- fit_resample(df, outcome = "outcome", splits = splits,
                      learner = "glm", custom_learners = custom,
                      metrics = "auc", refit = FALSE, seed = 1)
  audit <- audit_leakage(fit, metric = "auc", B = 20)
  plot_perm_distribution(audit)
}
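The three quantities marked in the plot (observed metric, permutation values, and permuted mean) can be illustrated with a simplified base-R sketch. It permutes labels against a fixed score vector and does not refit any model, which is an assumption of the sketch and may differ from what audit_leakage() does. The `auc()` helper and the simulated data are hypothetical.

```r
# Rank-based AUC (Mann-Whitney form) for a binary 0/1 outcome.
auc <- function(score, y) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
y <- rep(c(0, 1), each = 50)
score <- y + rnorm(100)    # simulated scores carrying real signal

observed <- auc(score, y)

# Null distribution: the same scores evaluated against shuffled labels.
perm <- replicate(200, auc(score, sample(y)))

gap <- observed - mean(perm)   # permutation gap
p_value <- (1 + sum(perm >= observed)) / (length(perm) + 1)
```

The permuted mean sits near 0.5 for AUC, so the gap measures how far the observed metric lies above chance-level performance.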
Uses the autocorrelation function of out-of-fold predictions to detect temporal dependence that may indicate leakage. Predictions are ordered by the split time column before computing the ACF. Requires numeric predictions (regression or survival). Requires ggplot2.
plot_time_acf(fit, lag.max = 20)
fit |
LeakFit. |
lag.max |
maximum lag to show. |
A list with the autocorrelation results, lag.max, and a ggplot object.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  set.seed(42)
  df <- data.frame(
    id = 1:30,
    time = seq.Date(as.Date("2020-01-01"), by = "day", length.out = 30),
    y = rnorm(30),
    x1 = rnorm(30),
    x2 = rnorm(30)
  )
  splits <- make_split_plan(df, outcome = "y", mode = "time_series",
                            time = "time", v = 3, progress = FALSE)
  custom <- list(
    lm = list(
      fit = function(x, y, task, weights, ...) {
        stats::lm(y ~ ., data = data.frame(y = y, x))
      },
      predict = function(object, newdata, task, ...) {
        as.numeric(stats::predict(object, newdata = as.data.frame(newdata)))
      }
    )
  )
  fit <- fit_resample(df, outcome = "y", splits = splits,
                      learner = "lm", custom_learners = custom,
                      metrics = "rmse", refit = FALSE, seed = 1)
  plot_time_acf(fit, lag.max = 10)
}
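The underlying computation can be sketched with stats::acf() alone. Predictions are first sorted by time, then the autocorrelation is computed and compared with an approximate white-noise band of 1.96 / sqrt(n). The simulated predictions, the shuffled time stamps, and the use of this particular band are assumptions of the sketch rather than documented behavior of plot_time_acf().

```r
set.seed(42)
# Simulated out-of-fold predictions whose rows are not in time order.
time <- as.Date("2020-01-01") + sample(0:29)
pred <- rnorm(30)

# Order by time before computing the autocorrelation.
ord <- order(time)
res <- stats::acf(pred[ord], lag.max = 10, plot = FALSE)

acf_vals <- as.numeric(res$acf)     # lags 0 to 10; lag 0 is always 1
band <- 1.96 / sqrt(length(pred))   # approximate 95% white-noise band

# Lags (1 to 10) whose autocorrelation falls outside the band.
flagged <- which(abs(acf_vals[-1]) > band)
```

Several consecutive lags outside the band would suggest temporal dependence in the predictions that deserves a closer look.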
Diagnostic plot for a ['LeakAudit'] object. The default diagnostic is the
permutation-distribution histogram produced by
plot_perm_distribution.
## S4 method for signature 'LeakAudit,missing'
plot(x, y, ...)
x |
A ['LeakAudit'] object. |
y |
Unused; present for S4 compatibility with |
... |
Additional arguments passed to |
Invisibly returns the list produced by
plot_perm_distribution (observed value, permuted mean,
permutation values, ggplot object).
plot_perm_distribution,
LeakAudit
Diagnostic plot for a ['LeakDeltaLSI'] object: per-repeat
scatter with the Huber-robust point estimate, the arithmetic mean,
and the BCa bootstrap confidence interval band. This is the diagnostic
shown as Figure 4 panel (b) of the manuscript.
## S4 method for signature 'LeakDeltaLSI,missing'
plot(x, y, ...)
x |
A ['LeakDeltaLSI'] object. |
y |
Unused; present for S4 compatibility with |
... |
Additional arguments (currently unused). |
Invisibly returns the list produced by
plot_dlsi_repeats: per-repeat deltas, the robust and
arithmetic-mean estimates, the BCa interval, and the ggplot object.
plot_dlsi_repeats,
LeakDeltaLSI
Diagnostic plot for a ['LeakFit'] object. The default diagnostic is the
fold-balance check produced by plot_fold_balance, which
works for any classification task. Use the which argument to
switch to one of the other diagnostics in the package:
"overlap" (plot_overlap_checks),
"calibration" (plot_calibration; binary outcomes
only), "time_acf" (plot_time_acf; time-ordered
splits), or "confounder_sensitivity"
(plot_confounder_sensitivity).
## S4 method for signature 'LeakFit,missing'
plot(
  x,
  y,
  which = c("fold_balance", "overlap", "calibration", "time_acf", "confounder_sensitivity"),
  ...
)
x |
A ['LeakFit'] object. |
y |
Unused; present for S4 compatibility with |
which |
One of |
... |
Additional arguments passed to the selected helper. |
Invisibly returns the list produced by the selected helper.
plot_fold_balance, plot_overlap_checks,
plot_calibration, plot_time_acf,
plot_confounder_sensitivity,
LeakFit
Applies the preprocessing steps stored in a GuardFit object to new
data without refitting any statistics. This is designed to prevent
validation leakage that would occur if imputation, scaling, filtering, or
feature selection were recomputed on evaluation data. It enforces the
training schema by aligning columns and factor levels, and it errors when a
numeric-only training fit receives non-numeric predictors. It does not
detect label leakage, duplicate samples, or train/test contamination.
predict.GuardFit() is the canonical S3 method: calling
predict(fit, newdata) on a GuardFit object dispatches to it.
predict_guard() is retained as a backward-compatible alias that
forwards to the S3 method, so existing code that calls
predict_guard(fit, x) continues to work.
## S3 method for class 'GuardFit'
predict(object, newdata, ...)

predict_guard(fit, newdata)
object, fit
|
A |
newdata |
A matrix or data.frame of predictors with one row per sample.
This required argument (no default) is transformed using the training-time
parameters in the fit only. Missing columns are added and filled, extra
columns are dropped, and factor levels are aligned to the training levels;
if the training fit was numeric-only, non-numeric columns in |
... |
Ignored. Present so that the S3 method signature matches the [stats::predict()] generic; additional arguments are silently dropped. |
A data.frame of transformed predictors with the same number of rows
as newdata. Column order and content match the training pipeline and
may include derived features (one-hot encodings, missingness indicators, or
PCA components). This output is not a prediction; it is intended as input
to a downstream model and assumes the training-time preprocessing is valid
for the new data.
x_train <- data.frame(a = c(1, 2, NA, 4), b = c(10, 11, 12, 13))
fit <- guard_fit(
  x_train,
  y = c(0.1, 0.2, 0.3, 0.4),
  steps = list(impute = list(method = "median")),
  task = "gaussian"
)
x_new <- data.frame(a = c(NA, 5), b = c(9, 14))

## Canonical: dispatch through the predict() generic.
out <- predict(fit, x_new)
out

## Equivalent legacy form (kept for backward compatibility).
identical(out, predict_guard(fit, x_new))
Brief one-screen auto-print representation of a 'LeakTune' result returned by [tune_resample()]. Use [summary()] for the full diagnostic report (outer-loop metrics, selected hyperparameters, fold-by-fold detail, and refit summary).
## S3 method for class 'LeakTune'
print(x, ...)
x |
A 'LeakTune' object returned by [tune_resample()]. |
... |
Ignored; present so that the S3 signature matches [base::print()]. |
Invisibly returns 'x'.
Prints a brief one-screen summary of a LeakAudit, including
task and outcome, the permutation-gap statistic, and counts of
batch-association rows, target-leakage features, and duplicate pairs.
Use summary() for the full diagnostic report.
## S4 method for signature 'LeakAudit'
show(object)
object |
A |
Called for its side effect (printing a brief summary to the console).
Returns object invisibly.
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = rbinom(12, 1, 0.5),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject",
                          v = 3, progress = FALSE)
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)
aud <- audit_leakage(fit, metric = "auc", B = 10,
                     X_ref = df[, c("x1", "x2")])
show(aud)
Prints a brief one-screen summary of a LeakFit, including
task and outcome, fold count and status (successful, skipped, failed),
and the headline cross-validated metric. Use summary() for the
full per-fold diagnostic report.
## S4 method for signature 'LeakFit'
show(object)
object |
A LeakFit object. |
Called for its side effect of printing a brief summary to the
console. Returns object invisibly.
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = rbinom(12, 1, 0.5),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject",
                          v = 3, progress = FALSE)
# Custom learner: logistic regression via stats::glm
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)
show(fit)
Prints fold counts, sizes, and hash metadata for quick inspection.
## S4 method for signature 'LeakSplits'
show(object)
object |
LeakSplits object. |
No return value, called for side effects (prints a summary to the console showing mode, fold count, repeats, outcome, stratification status, nested status, per-fold train/test sizes, and the reproducibility hash).
df <- data.frame(
  subject = rep(1:10, each = 2),
  outcome = rbinom(20, 1, 0.5),
  x1 = rnorm(20),
  x2 = rnorm(20)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject",
                          v = 5)
show(splits)
Simulates synthetic binary classification datasets with optional leakage mechanisms, fits a model using a leakage-aware cross-validation scheme, and summarizes the permutation-gap audit for each Monte Carlo seed. The suite is designed to surface validation failures such as subject overlap across folds, batch-confounded outcomes, global normalization/summary leakage, and time-series look-ahead. The output is a per-seed summary of observed CV performance and its gap versus a label-permutation null; it does not return fitted models or the full audit object. Results are limited to the built-in data generator and leakage types implemented here, and should be interpreted as a simulation-based sanity check rather than a comprehensive leakage detector for real data.
simulate_leakage_suite(
  n = 500,
  p = 20,
  prevalence = 0.5,
  mode = c("subject_grouped", "batch_blocked", "study_loocv", "time_series"),
  learner = c("glmnet", "ranger"),
  leakage = c("none", "subject_overlap", "batch_confounded", "peek_norm", "lookahead"),
  preprocess = NULL,
  rho = 0,
  K = 5,
  repeats = 1,
  horizon = 0,
  B = 200,
  seeds = 1:10,
  parallel = FALSE,
  signal_strength = 1,
  verbose = FALSE
)
n |
Integer scalar. Number of samples to simulate (default 500). Larger values stabilize the Monte Carlo summary but increase runtime. |
p |
Integer scalar. Number of baseline predictors before any leakage
feature is added (default 20). Increasing |
prevalence |
Numeric scalar in (0, 1). Target prevalence of class 1 in the simulated outcome (default 0.5). Changing this alters class imbalance and can affect AUC and the permutation gap. |
mode |
Character scalar. Cross-validation scheme passed to
|
learner |
Character scalar. Base learner, |
leakage |
Character scalar. Leakage mechanism to inject; one of
|
preprocess |
Optional preprocessing list or recipe passed to
[fit_resample()]. When NULL (default), the simulator uses the
fit_resample defaults; for |
rho |
Numeric scalar in [-1, 1]. AR(1)-style autocorrelation applied to each predictor across row order (default 0). Higher absolute values increase serial correlation and make time-ordered leakage more pronounced. |
K |
Integer scalar. Number of folds/partitions (default 5). Used as the
fold count for |
repeats |
Integer scalar >= 1. Number of repeated CV runs for
|
horizon |
Numeric scalar >= 0. Minimum time gap enforced between train
and test for |
B |
Integer scalar >= 1. Number of permutations used by
|
seeds |
Integer vector. Monte Carlo seeds (default |
parallel |
Logical scalar. If |
signal_strength |
Numeric scalar. Scales the linear predictor before sampling outcomes (default 1). Larger values increase class separation and tend to increase AUC; smaller values make the task harder. |
verbose |
Logical scalar. If |
The generator draws p standard normal predictors, builds a linear
predictor from the first min(5, p) features, scales it by
signal_strength, and samples a binary outcome to achieve the requested
prevalence. Outcomes are returned as a two-level factor, so the audited
metric is AUC. Simulated metadata include subject, batch, study, and time
fields used by mode to create leakage-aware splits. Leakage mechanisms
are injected by adding a single extra predictor as described in
leakage. Parallel execution uses future.apply when installed and
does not change results.
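The generator described above can be approximated in base R. The following is a minimal sketch, not the package's own code: the helper name simulate_binary, the unit coefficients on the first min(5, p) predictors, and the intercept calibration via stats::uniroot are all assumptions made for illustration.

```r
# Hypothetical helper (illustration only, not part of bioLeak).
simulate_binary <- function(n = 500, p = 20, prevalence = 0.5,
                            signal_strength = 1, seed = 1) {
  set.seed(seed)
  # p standard normal predictors
  X <- matrix(rnorm(n * p), nrow = n,
              dimnames = list(NULL, paste0("x", seq_len(p))))
  k <- min(5, p)
  # Assumption: unit coefficients on the first k predictors.
  linpred <- signal_strength * rowSums(X[, seq_len(k), drop = FALSE])
  # Assumption: pick the intercept so the mean class-1 probability
  # matches the requested prevalence.
  f <- function(b0) mean(stats::plogis(b0 + linpred)) - prevalence
  b0 <- stats::uniroot(f, interval = c(-50, 50))$root
  y <- stats::rbinom(n, 1, stats::plogis(b0 + linpred))
  # Two-level factor outcome, so AUC is the natural metric.
  data.frame(outcome = factor(y, levels = c(0, 1)), X)
}

sim <- simulate_binary(n = 200, p = 6, prevalence = 0.4)
table(sim$outcome)
```

Raising signal_strength in this sketch widens the spread of the linear predictor, which is why larger values increase class separation.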
A LeakSimResults data frame with one row per seed and columns:
seed: seed used for data generation, splitting, and auditing.
metric_obs: observed CV performance (AUC for this simulation).
gap: permutation-gap statistic (observed minus permutation mean).
p_value: permutation p-value for the gap.
leakage: leakage scenario used.
mode: CV mode used.
Only the permutation-gap summary is returned; fitted models, predictions, and other audit components are not included.
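To make the gap and p_value columns concrete, here is a hedged base-R sketch of a label-permutation gap computed on fixed predictions. The helper names auc_rank and perm_gap, the one-sided p-value, and its +1 correction are assumptions for illustration; the package's audit may differ in these details (for example when permutations are refit).

```r
# AUC via the rank-sum (Mann-Whitney) identity; y is 0/1, score is numeric.
auc_rank <- function(y, score) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical helper: permute labels B times against fixed predictions.
perm_gap <- function(y, score, B = 200, seed = 1) {
  set.seed(seed)
  obs <- auc_rank(y, score)
  perm <- replicate(B, auc_rank(sample(y), score))
  list(
    metric_obs = obs,
    gap = obs - mean(perm),                      # observed minus permutation mean
    p_value = (1 + sum(perm >= obs)) / (B + 1)   # assumption: one-sided, +1 corrected
  )
}

set.seed(42)
y <- rbinom(100, 1, 0.5)
score <- y + rnorm(100)   # predictions that carry real signal
res <- perm_gap(y, score, B = 200)
str(res)
```

A large gap only says that predictions are associated with the labels more strongly than chance; it does not separate genuine signal from leakage on its own.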
This function is a general-purpose utility and its data-generation logic
intentionally differs from the custom simulation used in the bioLeak
manuscript (the manuscript's replication.R script, distributed
with the manuscript supplementary materials). Specific differences:
peek_norm leakage: this function uses a z-scored binary
outcome as the leak feature; the manuscript uses a noisy continuous
version (as.numeric(y) + rnorm(n, 0, 0.3)).
lookahead leakage: this function shifts the binary outcome
(c(y[-1], y[n])); the manuscript shifts a continuous biomarker
(linpred + noise).
signal generation: this function applies AR correlation to
predictors via rho; the manuscript adds AR(1) noise directly to
the linear predictor.
audit settings: the manuscript uses
perm_refit = FALSE and perm_stratify = TRUE;
this function uses perm_refit = "auto" and the
perm_stratify default (FALSE).
Users wishing to reproduce manuscript figures should run the
manuscript-specific replication.R script directly rather
than calling this function.
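The leak-feature constructions contrasted above can be written out in a few lines. This is an illustrative sketch with a hand-picked numeric 0/1 outcome vector (an assumption); it does not reproduce either the package's internals or the manuscript's replication.R.

```r
set.seed(1)
y <- c(0, 0, 1, 1, 0, 1, 1, 1, 1, 0)  # toy numeric 0/1 outcome (assumption)
n <- length(y)

# peek_norm as described for this function: z-scored binary outcome.
leak_peek <- as.numeric(scale(y))

# peek_norm, manuscript variant: noisy continuous copy of the outcome.
leak_peek_ms <- as.numeric(y) + rnorm(n, 0, 0.3)

# lookahead as described for this function: outcome shifted up by one row,
# with the last value repeated to keep the length at n.
leak_look <- c(y[-1], y[n])

data.frame(y, leak_peek, leak_peek_ms, leak_look)
```

The z-scored feature is a deterministic function of the outcome, so it separates the classes perfectly, whereas the noisy variant leaks the outcome only approximately.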
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  res <- simulate_leakage_suite(
    n = 120, p = 6, prevalence = 0.4,
    mode = "subject_grouped",
    learner = "glmnet",
    leakage = "subject_overlap",
    K = 3, repeats = 1,
    B = 50, seeds = 1,
    parallel = FALSE
  )
  # One row per seed with observed AUC, permutation gap, and p-value
  res
}
Prints a concise, human-readable report for a 'LeakAudit' object produced by [audit_leakage()]. The summary surfaces four diagnostics when available: label-permutation gap (prediction-label association by default), batch/study association tests (metadata aligned with fold splits), target leakage scan (features strongly associated with the outcome), and near-duplicate detection (high similarity in 'X_ref'). The output reflects the stored audit results only; it does not recompute any tests.
## S3 method for class 'LeakAudit'
summary(object, digits = 3, ...)
object |
A 'LeakAudit' object from [audit_leakage()]. The summary reads stored results from 'object' and prints them to the console. |
digits |
Integer number of digits to show when formatting numeric statistics in the console output. Defaults to '3'. Increasing 'digits' shows more precision; decreasing it shortens the printout without changing the underlying values. |
... |
Unused. Included for S3 method compatibility; additional arguments are ignored. |
The permutation test quantifies prediction-label association when using fixed predictions; refit-based permutations require 'perm_refit = TRUE' (or '"auto"' with refit data). It does not by itself prove or rule out leakage. Batch association flags metadata that align with fold assignment; this may reflect study design rather than leakage. Target leakage scan uses univariate feature-outcome associations and can miss multivariate proxies, interaction leakage, or features not included in 'X_ref'. The multivariate scan (enabled by default for supported tasks) reports an additional model-based score. Duplicate detection only considers the provided 'X_ref' features and the similarity threshold used during [audit_leakage()]. By default, 'duplicate_scope = "train_test"' filters to pairs that cross train/test; set 'duplicate_scope = "all"' to include within-fold duplicates. Sections are reported as "not available" when the corresponding audit component was not computed.
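The difference between the two duplicate scopes can be illustrated with a small base-R sketch. The similarity measure (Pearson correlation between rows), the 0.99 threshold, and the single train/test fold used here are assumptions chosen for illustration and are not taken from the package.

```r
set.seed(1)
X_ref <- matrix(rnorm(6 * 8), nrow = 6)
X_ref[5, ] <- X_ref[2, ] + rnorm(8, 0, 0.01)  # plant a near-duplicate of row 2

# Assumption: similarity is the Pearson correlation between rows.
sim <- stats::cor(t(X_ref))
hits <- which(sim > 0.99 & upper.tri(sim), arr.ind = TRUE)
pairs <- data.frame(i = hits[, "row"], j = hits[, "col"])

# Hypothetical fold: rows 1-4 are training samples, rows 5-6 are test samples.
test_idx <- c(5, 6)
crosses <- xor(pairs$i %in% test_idx, pairs$j %in% test_idx)

pairs             # analogous to duplicate_scope = "all"
pairs[crosses, ]  # analogous to duplicate_scope = "train_test"
```

Only features present in the reference matrix enter the similarity, which is why duplicates that differ in the supplied columns but match elsewhere would go unnoticed.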
Invisibly returns 'object' after printing the summary.
[plot_perm_distribution()], [plot_fold_balance()], [plot_overlap_checks()]
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = rbinom(12, 1, 0.5),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject", v = 3)
# Custom learner: logistic regression via stats::glm
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)
audit <- audit_leakage(fit, metric = "auc", B = 5,
                       X_ref = df[, c("x1", "x2")], seed = 1)
summary(audit) # prints the audit report and returns `audit` invisibly
Prints a human-readable summary of the Delta LSI analysis comparing leaky vs guarded evaluation pipelines.
## S3 method for class 'LeakDeltaLSI'
summary(object, digits = 3L, ...)
object |
A LeakDeltaLSI object. |
digits |
Integer. Number of decimal places to show (default 3). |
... |
Unused. |
Invisibly returns object.
Prints a compact console report for a [LeakFit] object created by [fit_resample()]. The report lists task/outcome metadata, learners, total folds, and cross-validated metrics summarized as mean and standard deviation across completed folds, plus a small audit table with per-fold train/test sizes and retained feature counts.
## S3 method for class 'LeakFit'
summary(object, digits = 3, ...)
object |
A [LeakFit] object returned by [fit_resample()]. It should contain 'metric_summary' and 'audit' slots; missing entries result in empty sections in the printed report. |
digits |
Integer scalar. Number of decimal places to print in numeric summary tables. Defaults to 3; affects printed output only, not the returned data. |
... |
Unused. Included for S3 method compatibility; changing these values has no effect. |
This summary is meant for quick sanity checks of the resampling setup and performance. It does not run leakage diagnostics and will not detect target leakage, duplicate samples, or batch/study confounding; use [audit_leakage()] or 'summary()' on a [LeakAudit] object for those checks.
Invisibly returns 'object@metric_summary', a data frame of per-learner metric means and standard deviations computed across folds. This function does not recompute metrics.
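As a rough picture of what a per-learner mean/SD table looks like, the sketch below aggregates a made-up per-fold metrics table in base R. The data values and the column names (learner, fold, auc, auc_mean, auc_sd) are hypothetical and are not the package's actual slot layout.

```r
# Hypothetical per-fold metrics; values and column names are illustrative only.
fold_metrics <- data.frame(
  learner = rep(c("glm", "ranger"), each = 3),
  fold = rep(1:3, times = 2),
  auc = c(0.70, 0.75, 0.80, 0.82, 0.78, 0.86)
)

# Mean and standard deviation across folds, one row per learner.
by_learner <- split(fold_metrics, fold_metrics$learner)
metric_summary <- do.call(rbind, lapply(by_learner, function(d) {
  data.frame(learner = d$learner[1],
             auc_mean = mean(d$auc),
             auc_sd = stats::sd(d$auc))
}))
rownames(metric_summary) <- NULL

# Formatting changes only what is printed; the stored values keep full precision.
print(format(metric_summary, digits = 3))
```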
set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = factor(rep(c(0, 1), each = 6)),
  x1 = rnorm(12),
  x2 = rnorm(12)
)
splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "subject_grouped",
  group = "subject",
  v = 3,
  stratify = TRUE,
  progress = FALSE
)
# Custom learner: logistic regression via stats::glm
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = data.frame(y = y, x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", seed = 1)
summary_df <- summary(fit)
summary_df
Prints a concise report for a 'LeakTune' object produced by [tune_resample()]. The report highlights the tuning strategy, selection metric, and cross-validated performance across outer folds, plus a glimpse of the selected hyperparameters.
## S3 method for class 'LeakTune'
summary(object, digits = 3, ...)
object |
A [LeakTune] object returned by [tune_resample()]. |
digits |
Integer scalar. Number of decimal places to print in numeric summary tables. Defaults to 3. |
... |
Unused. Included for S3 method compatibility. |
Invisibly returns 'object$metric_summary', the data frame of per-learner metric means and standard deviations computed across outer folds.
Runs nested cross-validation for hyperparameter tuning using leakage-aware splits. Inner resamples are constructed from each outer training fold to avoid information leakage during tuning. Requires tidymodels tuning packages and a workflow or recipe-based preprocessing. Survival tasks are not yet supported.
tune_resample(
  x,
  outcome,
  splits,
  learner,
  preprocess = NULL,
  grid = 10,
  metrics = NULL,
  positive_class = NULL,
  selection = c("best", "one_std_err"),
  selection_metric = NULL,
  inner_v = NULL,
  inner_repeats = 1,
  inner_seed = NULL,
  control = NULL,
  parallel = FALSE,
  refit = FALSE,
  seed = 1,
  split_cols = "auto",
  tune_threshold = FALSE,
  threshold_grid = seq(0.1, 0.9, by = 0.05),
  threshold_metric = "accuracy"
)
x |
SummarizedExperiment or matrix/data.frame. |
outcome |
Outcome column name (if x is SE or data.frame). |
splits |
LeakSplits object defining the outer resamples. If the splits do not already include inner folds, they are created from each outer training fold using the same split metadata. rsample splits must already include inner folds. |
learner |
A parsnip model_spec with tunable parameters, or a workflows workflow. When a model_spec is provided, a workflow is built using 'preprocess' or a formula. |
preprocess |
Optional 'recipes::recipe'. Required when you need
preprocessing for tuning. Ignored when 'learner' is already a workflow.
Recipe/workflow leakage guardrails run before tuning; configure policy via
|
grid |
Tuning grid passed to 'tune::tune_grid()'. Can be a data.frame or an integer size. |
metrics |
Character vector of metric names ('auc', 'pr_auc', 'accuracy', 'macro_f1', 'log_loss', 'rmse') or a yardstick metric set/list. Metrics are computed with yardstick; unsupported metrics are dropped with a warning. For binomial tasks, if any inner assessment fold contains a single class, probability metrics ('auc', 'roc_auc', 'pr_auc') are dropped for tuning with a warning. |
positive_class |
Optional value indicating the positive class for binomial outcomes. When set, the outcome levels are reordered so the positive class is second. |
selection |
Selection rule for tuning, either '"best"' or '"one_std_err"'. |
selection_metric |
Metric name used for selecting hyperparameters. Defaults to the first metric in 'metrics'. If the chosen metric yields no valid results, the first available metric is used with a warning. |
inner_v |
Optional number of folds for inner CV when inner splits are not precomputed. Defaults to the outer 'v'. |
inner_repeats |
Optional number of repeats for inner CV when inner splits are not precomputed. Defaults to 1. |
inner_seed |
Optional seed for inner split generation when inner splits are not precomputed. Defaults to the outer split seed. |
control |
Optional 'tune::control_grid()' settings for tuning. |
parallel |
Logical; passed to [fit_resample()] when evaluating outer folds (single-fold, no refit). |
refit |
Logical; if TRUE, refits a final tuned workflow on the full dataset using aggregated hyperparameters across all outer folds (median for numeric parameters, majority vote for categorical). This avoids nested-CV leakage that would occur from selecting a single fold's params. |
seed |
Integer seed for reproducibility. |
split_cols |
Optional named list/character vector or '"auto"' (default) overriding group/batch/study/time column names when 'splits' is an rsample object and its attributes are missing. '"auto"' falls back to common metadata column names (e.g., 'group', 'subject', 'batch', 'study', 'time'). Supported names are 'group', 'batch', 'study', and 'time'. |
tune_threshold |
Logical; when 'TRUE' for binomial tasks, selects a probability threshold from inner-fold predictions and applies it only to the corresponding outer-fold evaluation. |
threshold_grid |
Numeric vector of thresholds in '[0, 1]' considered when 'tune_threshold = TRUE'. |
threshold_metric |
Metric used to pick thresholds when 'tune_threshold = TRUE'. Supported values are '"accuracy"', '"balanced_accuracy"', and '"f1"', or a custom function with signature 'function(truth, pred_class, prob, threshold)'. |
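The threshold-tuning arguments can be pictured with a short base-R sketch: a threshold is chosen on inner-fold predictions only and then applied once to the matching outer fold. The helper pick_threshold, the simulated probabilities, and the tie-breaking rule are assumptions for illustration, not the package's implementation.

```r
# Hypothetical helper: pick the threshold with the highest accuracy on the grid.
pick_threshold <- function(truth, prob, grid = seq(0.1, 0.9, by = 0.05)) {
  acc <- vapply(grid,
                function(t) mean(as.integer(prob >= t) == truth),
                numeric(1))
  grid[which.max(acc)]  # assumption: the first maximum wins on ties
}

set.seed(1)
# Simulated inner-fold and outer-fold labels with noisy predicted probabilities.
inner_truth <- rbinom(200, 1, 0.5)
inner_prob <- stats::plogis(2 * inner_truth - 1 + rnorm(200))
outer_truth <- rbinom(50, 1, 0.5)
outer_prob <- stats::plogis(2 * outer_truth - 1 + rnorm(50))

thr <- pick_threshold(inner_truth, inner_prob)  # uses inner data only
outer_acc <- mean(as.integer(outer_prob >= thr) == outer_truth)
c(threshold = thr, outer_accuracy = outer_acc)
```

Keeping the outer labels out of pick_threshold is the point of the design: the outer fold is scored with a threshold it had no part in choosing.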
A list of class '"LeakTune"' with components:
metrics |
Outer-fold metrics. |
metric_summary |
Mean/SD metrics across outer folds with columns
|
best_params |
Best hyperparameters per outer fold. |
inner_results |
List of inner tuning results. |
outer_fits |
List of outer LeakFit objects. |
thresholds |
Per-fold threshold choices when threshold tuning is enabled. |
fold_status |
Outer-fold status log with stage, status, reason, and notes. |
final_model |
Optional final workflow fit when 'refit = TRUE'. |
info |
Metadata about the tuning run. |
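When refit = TRUE, the per-fold selections in best_params are aggregated rather than taken from a single fold. A minimal base-R sketch of that rule (median for numeric parameters, majority vote for categorical ones) follows; the helper aggregate_params, the parameter names, and the tie-breaking behavior are hypothetical.

```r
# Hypothetical per-outer-fold selections; parameter names are illustrative.
best_params <- data.frame(
  penalty = c(0.01, 0.10, 0.03),
  engine_opt = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

aggregate_params <- function(params) {
  out <- lapply(params, function(col) {
    if (is.numeric(col)) {
      stats::median(col)            # numeric: median across outer folds
    } else {
      names(which.max(table(col)))  # categorical: majority vote
                                    # (ties go to the first sorted level)
    }
  })
  as.data.frame(out, stringsAsFactors = FALSE)
}

final_params <- aggregate_params(best_params)
final_params
```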
if (requireNamespace("tune", quietly = TRUE) &&
    requireNamespace("recipes", quietly = TRUE) &&
    requireNamespace("glmnet", quietly = TRUE) &&
    requireNamespace("rsample", quietly = TRUE) &&
    requireNamespace("workflows", quietly = TRUE) &&
    requireNamespace("yardstick", quietly = TRUE) &&
    requireNamespace("dials", quietly = TRUE)) {
  df <- data.frame(
    subject = rep(1:10, each = 2),
    outcome = factor(rep(c(0, 1), each = 10)),
    x1 = rnorm(20),
    x2 = rnorm(20)
  )
  # Nested, subject-grouped outer splits with stratification
  splits <- make_split_plan(df, outcome = "outcome",
                            mode = "subject_grouped", group = "subject",
                            v = 3, nested = TRUE, stratify = TRUE)
  # Lasso logistic regression with a tunable penalty
  spec <- parsnip::logistic_reg(penalty = tune::tune(), mixture = 1) |>
    parsnip::set_engine("glmnet")
  rec <- recipes::recipe(outcome ~ x1 + x2, data = df)
  tuned <- tune_resample(df, outcome = "outcome", splits = splits,
                         learner = spec, preprocess = rec, grid = 5)
  tuned$metric_summary
}