OmicsHub Space Advanced
Module A01 · SMOTE for Classification
Lesson 01 · The Baseline Problem

High accuracy. Poor minority recall. That is the failure mode.

Before adding SMOTE you need to see exactly what happens without it. This lesson builds a naive classifier on an imbalanced dataset and documents the failure in numbers.

Dataset: Pima Indians Diabetes (built-in example)

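The course's built-in example is a cleaned version of the classic Pima dataset: 768 samples, 8 features, about 35% positive class (diabetes). That is only mildly imbalanced. To make the failure modes obvious, the code in this lesson uses a synthetic dataset with a 90/10 class split, which is roughly 9:1 and is referred to as 10:1 throughout this module.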

python — setup & baseline
import numpy as np
import pandas as pd
from sklearn.datasets        import make_classification
from sklearn.model_selection  import train_test_split, StratifiedKFold, cross_validate
from sklearn.linear_model     import LogisticRegression
from sklearn.preprocessing    import StandardScaler
from sklearn.pipeline         import Pipeline
from sklearn.metrics          import (classification_report,
                                        confusion_matrix, roc_auc_score)
from imblearn.over_sampling   import SMOTE
from imblearn.pipeline        import Pipeline as ImbPipeline

# Synthetic dataset: 90/10 class split (about 9:1, called 10:1 in the text)
X, y = make_classification(
    n_samples=3000,       n_features=10,
    n_informative=6,     n_redundant=2,
    weights=[0.90, 0.10],  # 90% majority, 10% minority
    flip_y=0.01,          random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test  class distribution: {np.bincount(y_test)}")

# ── Naive baseline: scaler + logistic regression, NO oversampling ──
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["majority", "minority"]))
output
Train class distribution: [2246  254]
Test  class distribution: [749  84]
              precision    recall  f1-score   support

    majority       0.93      0.98      0.95       749
    minority       0.62      0.32      0.42        84

    accuracy                           0.91       833
   macro avg       0.77      0.65      0.69       833
weighted avg       0.90      0.91      0.90       833
The accuracy paradox in numbers

91% accuracy sounds good, but the model catches only 32% of minority cases (recall = 0.32). In a high-stakes setting such as cancer screening, fraud detection, or rare-disease diagnosis, missing 68% of true positives is a failure, not a success.
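The paradox is easy to reproduce by hand. The sketch below uses hypothetical labels (900 majority, 100 minority, not the lesson's dataset) and a "model" that always predicts the majority class, to show how accuracy and minority recall can diverge.

```python
import numpy as np

# Hypothetical labels: 900 majority (0) and 100 minority (1), a 90/10 split.
y_true = np.array([0] * 900 + [1] * 100)

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_pos = np.sum((y_pred == 1) & (y_true == 1))
minority_recall = true_pos / np.sum(y_true == 1)

print(f"accuracy        = {accuracy:.2f}")         # 0.90
print(f"minority recall = {minority_recall:.2f}")  # 0.00
```

Any real model has to beat this trivial 90% before its accuracy says anything about the minority class.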

Reading the confusion matrix

The confusion matrix is more honest than accuracy. The naive model classifies most minority cases as majority because it minimises overall loss, not minority-class loss.

[Interactive figure: Baseline vs SMOTE confusion matrices. Panels: Baseline (no SMOTE), With SMOTE.]
What to read
Watch FN (false negatives — missed minority cases) shrink dramatically with SMOTE. TP improves at the cost of a modest FP increase. The total accuracy may drop slightly, but the model is now actually useful.
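To read the four cells yourself, a minimal sketch follows. The labels are toy values chosen for illustration, not output from the lesson's model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): 1 = minority, 0 = majority.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels [0, 1], ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)      # share of minority cases caught
precision = tp / (tp + fp)   # share of minority predictions that are correct

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")                 # TN=5 FP=1 FN=3 TP=1
print(f"recall={recall:.2f} precision={precision:.2f}")   # recall=0.25 precision=0.50
```

FN is the cell to watch: every false negative is a minority case the model waved through as majority.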
Lesson 02 · SMOTE Inside a Pipeline

The one rule: SMOTE goes inside cross-validation, never outside.

Applying SMOTE before splitting, or outside the CV loop, is a leakage bug. SMOTE interpolates between real minority samples, so synthetic points built from rows in the validation fold end up in the training data (and synthetic points end up in the validation fold itself), inflating every metric. This lesson shows the correct pattern.

The wrong way — leaking SMOTE

python — ❌ DO NOT DO THIS
# ❌ WRONG: SMOTE applied before splitting
#    Synthetic samples leak information about the full dataset into test folds
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)         # ← applies to ALL data
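# Separate names so the clean X_train / X_test from Lesson 01 are not overwritten
X_tr_bad, X_te_bad, y_tr_bad, y_te_bad = train_test_split(X_res, y_res)
# Reported metrics will be optimistic: the test split contains synthetic rows
# interpolated from real rows in the training split (and vice versa) → inflated recall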
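Why does this leak? A minimal sketch of the interpolation step is shown here, hand-rolled in NumPy. It is an illustration of the idea, not imblearn's implementation, and the two rows and their split assignment are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two real minority rows (hypothetical 2-feature data).
x_a = np.array([1.0, 2.0])   # suppose this row later lands in the training split
x_b = np.array([3.0, 6.0])   # suppose this row later lands in the test split

# SMOTE-style interpolation: pick a random point on the segment between
# a minority sample and one of its minority-class neighbours.
lam = rng.uniform(0.0, 1.0)
x_synth = x_a + lam * (x_b - x_a)

# The synthetic row is a blend of both parents, so whichever side of the
# split it falls on, it carries information about a row on the other side.
print("lambda   :", round(lam, 3))
print("synthetic:", x_synth)
```

When SMOTE runs only on the training fold, both parents are training rows and nothing crosses the boundary.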

The correct way — imblearn Pipeline

imblearn.pipeline.Pipeline is a drop-in replacement for sklearn.pipeline.Pipeline that understands resampling steps. Inside cross_validate, it applies SMOTE only to the training fold of each split — the validation fold is never touched.

python — ✓ correct pattern
from imblearn.pipeline      import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_validate

# ✓ SMOTE is a step inside the pipeline
#   It runs independently on each training fold
pipe_smote = ImbPipeline([
    ("scaler", StandardScaler()),
    ("smote",  SMOTE(k_neighbors=5, random_state=42)),
    ("clf",    LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scoring: ROC-AUC plus macro-averaged F1 and recall
# (swap in "f1" / "recall" to score the minority class alone, since it is label 1)
scoring = {
    "roc_auc"     : "roc_auc",
    "f1_macro"    : "f1_macro",
    "recall_macro": "recall_macro",
}

results = cross_validate(pipe_smote, X_train, y_train,
                          cv=cv, scoring=scoring)

for metric, scores in results.items():
    if metric.startswith("test_"):
        print(f"{metric[5:]:<14}  {scores.mean():.3f} ± {scores.std():.3f}")
output
roc_auc         0.891 ± 0.018
f1_macro        0.812 ± 0.022
recall_macro    0.847 ± 0.031

SMOTE hyperparameters worth tuning

python — grid search over SMOTE k
from sklearn.model_selection import GridSearchCV

# Tune k_neighbors and sampling_strategy together
param_grid = {
    "smote__k_neighbors"        : [3, 5, 7],
    "smote__sampling_strategy"  : [0.5, 0.75, 1.0],  # minority / majority ratio
    "clf__C"                    : [0.1, 1.0, 10.0],
}

gs = GridSearchCV(pipe_smote, param_grid,
                  cv=cv, scoring="f1_macro", n_jobs=-1)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_)
print("Best F1-macro:", gs.best_score_.round(3))
output
Best params: {'clf__C': 1.0, 'smote__k_neighbors': 5, 'smote__sampling_strategy': 0.75}
Best F1-macro: 0.834
Key takeaway

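For a binary problem, sampling_strategy is the minority / majority ratio after resampling, so 1.0 means full balance (1:1). Full balance is not always optimal. In the grid search above 0.75 scored best, and ratios of 0.5 to 0.75 can outperform full balance because they add minority diversity without over-populating the minority space. Treat the ratio as a hyperparameter to tune per dataset.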
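To make the ratio concrete, here is a small arithmetic sketch. The helper synthetic_rows_needed is hypothetical (it is not part of imblearn) and assumes the binary case, where the target minority count is the majority count times the ratio, rounded down.

```python
def synthetic_rows_needed(n_majority: int, n_minority: int, ratio: float) -> int:
    """Rows to add so that minority / majority reaches `ratio` (binary case)."""
    target_minority = int(n_majority * ratio)
    return max(target_minority - n_minority, 0)

# Training counts reported in Lesson 01: 2246 majority, 254 minority.
for ratio in (0.5, 0.75, 1.0):
    added = synthetic_rows_needed(2246, 254, ratio)
    print(f"sampling_strategy={ratio:<4} -> {added} synthetic rows, "
          f"{254 + added} minority rows in total")
# sampling_strategy=0.5  -> 869 synthetic rows, 1123 minority rows in total
# sampling_strategy=0.75 -> 1430 synthetic rows, 1684 minority rows in total
# sampling_strategy=1.0  -> 1992 synthetic rows, 2246 minority rows in total
```

At 1.0 roughly eight of every nine minority rows are synthetic, which is why partial rebalancing is worth testing. Inside cross-validation the counts are smaller, because SMOTE only sees each training fold.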

Lesson 03 · Logistic Regression + SMOTE

Linear decision boundaries and where SMOTE helps most.

Logistic regression is the ideal first model for SMOTE experiments: its decision boundary is interpretable, its behaviour under class imbalance is well-understood, and the effect of SMOTE is easy to visualise.

Full experiment: baseline vs SMOTE vs class_weight

Three pipelines on the same data — no oversampling, SMOTE oversampling, and sklearn's built-in class_weight='balanced'. The goal is to compare them honestly with the right metrics.

python
from sklearn.metrics import (f1_score, recall_score,
                              precision_score, roc_auc_score)

def evaluate(pipe, X_tr, y_tr, X_te, y_te):
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_prob = pipe.predict_proba(X_te)[:, 1]
    # Built-in round() works whether a metric returns a NumPy or a Python float
    return {
        "Accuracy" : round(float((y_pred == y_te).mean()), 3),
        "Precision": round(float(precision_score(y_te, y_pred)), 3),
        "Recall"   : round(float(recall_score(y_te, y_pred)), 3),
        "F1"       : round(float(f1_score(y_te, y_pred)), 3),
        "ROC-AUC"  : round(float(roc_auc_score(y_te, y_prob)), 3),
    }

# 1. Baseline
pipe_base = Pipeline([
    ("sc",  StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000))
])

# 2. SMOTE
pipe_smote = ImbPipeline([
    ("sc",    StandardScaler()),
    ("smote", SMOTE(k_neighbors=5, sampling_strategy=0.75, random_state=42)),
    ("clf",  LogisticRegression(max_iter=1000))
])

# 3. Class-weight baseline (no synthetic data)
pipe_cw = Pipeline([
    ("sc",  StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))
])

for name, pipe in [("Baseline", pipe_base), ("SMOTE", pipe_smote), ("class_weight", pipe_cw)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"\n{name}")
    for k, v in r.items(): print(f"  {k:<12} {v}")
output
Baseline
  Accuracy     0.910
  Precision    0.621
  Recall       0.321
  F1           0.424
  ROC-AUC      0.847

SMOTE
  Accuracy     0.887
  Precision    0.584
  Recall       0.726
  F1           0.648
  ROC-AUC      0.891

class_weight
  Accuracy     0.874
  Precision    0.541
  Recall       0.738
  F1           0.624
  ROC-AUC      0.883
[Interactive chart: Logistic Regression results]
What to notice
Recall and F1 improve sharply with SMOTE. Accuracy drops slightly (0.910 to 0.887), which is expected. ROC-AUC is a useful threshold-independent summary, but Lesson 05 shows why PR-AUC is more informative when imbalance is severe.
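The class_weight="balanced" variant changes no data; it only reweights the loss. A short sketch of the weights it assigns, using hypothetical 90/10 labels and scikit-learn's compute_class_weight helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 90/10 split.
y = np.array([0] * 900 + [1] * 100)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
manual = len(y) / (2 * np.bincount(y))
from_sklearn = compute_class_weight(class_weight="balanced",
                                    classes=np.array([0, 1]), y=y)

print("manual      :", manual)         # majority ≈ 0.556, minority = 5.0
print("scikit-learn:", from_sklearn)
```

Each minority row counts nine times as much as a majority row, which is the same 9:1 correction SMOTE would reach at full balance, but without any synthetic points.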
Lesson 04 · Tree Models + SMOTE

Random Forest and XGBoost: SMOTE vs native balancing.

Tree-based models handle imbalance differently from linear models. Random Forest exposes class_weight, including "balanced_subsample", which recomputes the weights on each tree's bootstrap sample; XGBoost has scale_pos_weight. This lesson tests these options against SMOTE on the same dataset.

Random Forest — three variants

python
from sklearn.ensemble import RandomForestClassifier

# 1. RF baseline
rf_base = Pipeline([
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])

# 2. RF + SMOTE
rf_smote = ImbPipeline([
    ("smote", SMOTE(sampling_strategy=0.75, random_state=42)),
    ("clf",   RandomForestClassifier(n_estimators=200, random_state=42))
])

# 3. RF + class_weight (RF supports it at tree-level)
rf_cw = Pipeline([
    ("clf", RandomForestClassifier(
        n_estimators=200,
        class_weight="balanced_subsample",  # weights recomputed on each tree's bootstrap sample
        random_state=42
    ))
])

for name, pipe in [("RF baseline",rf_base),("RF + SMOTE",rf_smote),("RF balanced",rf_cw)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"{name:<18} Recall={r['Recall']}  F1={r['F1']}  AUC={r['ROC-AUC']}")
output
RF baseline        Recall=0.488  F1=0.569  AUC=0.901
RF + SMOTE         Recall=0.774  F1=0.743  AUC=0.938
RF balanced        Recall=0.750  F1=0.721  AUC=0.931

XGBoost — scale_pos_weight vs SMOTE

scale_pos_weight is XGBoost's built-in imbalance correction: set it to n_majority / n_minority. It is fast but applies a single global weight. SMOTE adds diversity to the training set itself, which sometimes helps tree splits more.

python
from xgboost import XGBClassifier

n_maj = (y_train == 0).sum()
n_min = (y_train == 1).sum()
spw   = n_maj / n_min   # about 9 for a 90/10 split

# XGB + scale_pos_weight (no SMOTE)
xgb_spw = Pipeline([
    ("clf", XGBClassifier(
        scale_pos_weight=spw,
        n_estimators=300, learning_rate=0.05,
        eval_metric="aucpr", random_state=42
    ))
])

# XGB + SMOTE (no scale_pos_weight: SMOTE already raises minority/majority to 0.75)
xgb_smote = ImbPipeline([
    ("smote", SMOTE(sampling_strategy=0.75, random_state=42)),
    ("clf",   XGBClassifier(
        n_estimators=300, learning_rate=0.05,
        eval_metric="aucpr", random_state=42
    ))
])

for name, pipe in [("XGB scale_pos_weight",xgb_spw),("XGB + SMOTE",xgb_smote)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"{name:<22} Recall={r['Recall']}  F1={r['F1']}  AUC={r['ROC-AUC']}")
output
XGB scale_pos_weight   Recall=0.810  F1=0.771  AUC=0.956
XGB + SMOTE            Recall=0.833  F1=0.788  AUC=0.961
[Chart: All models compared · F1 and ROC-AUC]
Key insight
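In this run XGBoost with SMOTE edges out scale_pos_weight by a small margin (F1 0.788 vs 0.771, AUC 0.961 vs 0.956); gaps this small should be confirmed with cross-validation before drawing conclusions. For Random Forest, SMOTE gives a slightly larger lift than class_weight (F1 0.743 vs 0.721). class_weight reweights the existing samples in each split's impurity calculation, whereas SMOTE adds new minority points for the trees to split on.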
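What "a single global weight" means can be sketched in a few lines. The function below is a hypothetical NumPy illustration of log-loss gradients with positives up-weighted; it assumes the weight simply multiplies the gradient and Hessian of positive rows, and it is not XGBoost's source code.

```python
import numpy as np

def weighted_logloss_grad_hess(y_true, raw_score, scale_pos_weight=1.0):
    """Gradient and Hessian of log-loss with positives up-weighted (sketch)."""
    p = 1.0 / (1.0 + np.exp(-raw_score))              # sigmoid of the raw score
    w = np.where(y_true == 1, scale_pos_weight, 1.0)  # one weight per row
    grad = w * (p - y_true)
    hess = w * p * (1.0 - p)
    return grad, hess

y = np.array([0, 1])
score = np.array([0.0, 0.0])   # model starts undecided: p = 0.5 for both rows

g_plain, _ = weighted_logloss_grad_hess(y, score)
g_weighted, _ = weighted_logloss_grad_hess(y, score, scale_pos_weight=9.0)

print("unweighted gradients:", g_plain)      # [ 0.5 -0.5]
print("weighted gradients  :", g_weighted)   # [ 0.5 -4.5]
```

The data never changes: the same minority row just pulls nine times harder. SMOTE takes the opposite route and leaves the loss alone while changing the rows.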
Lesson 05 · Evaluation Metrics

Never report accuracy alone on imbalanced data.

Each metric below captures a different aspect of classifier behaviour. Knowing what each one measures — and where it lies — is what separates a good evaluation from a misleading one.

The full metric toolkit

python — complete evaluation function
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    f1_score, recall_score, precision_score,
    confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay
)

def full_eval(pipe, X_tr, y_tr, X_te, y_te, name=""):
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_prob = pipe.predict_proba(X_te)[:,1]
    tn,fp,fn,tp = confusion_matrix(y_te,y_pred).ravel()
    tpr = tp/(tp+fn)   # recall / sensitivity
    tnr = tn/(tn+fp)   # specificity
    gmean = np.sqrt(tpr * tnr)  # geometric mean
    print(f"\n{'─'*40}\n{name}")
    print(f"  Accuracy     {(y_pred==y_te).mean():.3f}")
    print(f"  Precision    {precision_score(y_te,y_pred):.3f}")
    print(f"  Recall (TPR) {tpr:.3f}")
    print(f"  Specificity  {tnr:.3f}")
    print(f"  F1           {f1_score(y_te,y_pred):.3f}")
    print(f"  G-mean       {gmean:.3f}")
    print(f"  ROC-AUC      {roc_auc_score(y_te,y_prob):.3f}")
    print(f"  PR-AUC       {average_precision_score(y_te,y_prob):.3f}")
    print(f"  TP={tp}  FP={fp}  FN={fn}  TN={tn}")

full_eval(pipe_base,  X_train, y_train, X_test, y_test, "Baseline")
full_eval(pipe_smote, X_train, y_train, X_test, y_test, "SMOTE")
output
────────────────────────────────────────
Baseline
  Accuracy     0.910   ← misleadingly high
  Precision    0.621
  Recall (TPR) 0.321   ← catches only 1 in 3 minorities
  Specificity  0.986
  F1           0.424
  G-mean       0.562
  ROC-AUC      0.847
  PR-AUC       0.531   ← most honest for severe imbalance
  TP=27  FP=16  FN=57  TN=733

────────────────────────────────────────
SMOTE
  Accuracy     0.887
  Precision    0.584
  Recall (TPR) 0.726   ← catches 3 in 4 minorities
  Specificity  0.909
  F1           0.648
  G-mean       0.812
  ROC-AUC      0.891
  PR-AUC       0.714
  TP=61  FP=68  FN=23  TN=681
Which metric to report

PR-AUC (area under the precision-recall curve, computed here as average precision) is the most informative single number for severely imbalanced datasets. ROC-AUC can look deceptively good when the majority class is large, because the false-positive rate is diluted by the many true negatives. G-mean is useful when sensitivity and specificity matter equally.

ROC curve and Precision-Recall curve

python — plot both curves
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))

# One distinct colour per model so the curves can be told apart
for name, pipe, color in [
    ("Baseline",  pipe_base,  "#9ca3af"),
    ("SMOTE",     pipe_smote, "#2563eb"),
    ("RF+SMOTE",  rf_smote,   "#d97706"),
    ("XGB+SMOTE", xgb_smote,  "#059669"),
]:
    pipe.fit(X_train, y_train)
    RocCurveDisplay.from_estimator(
        pipe, X_test, y_test, ax=ax1, name=name, color=color
    )
    PrecisionRecallDisplay.from_estimator(
        pipe, X_test, y_test, ax=ax2, name=name, color=color
    )

ax1.set_title("ROC curves"); ax2.set_title("Precision-Recall curves")
plt.tight_layout(); plt.show()
[Interactive chart: Metrics explorer · all models]
Reading the PR curve
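The PR-AUC of the Baseline model collapses as imbalance increases. The reference point is a random classifier, whose PR-AUC equals the minority class prevalence (10% for a 90/10 split). XGBoost + SMOTE maintains a high area even at 20:1 because it still ranks minority cases ahead of majority cases. PR-AUC measures ranking, not calibration: oversampling typically shifts predicted probabilities upward, so a high PR-AUC does not imply calibrated probabilities.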
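The random-classifier floor can be checked empirically. The sketch below scores synthetic labels with random numbers that carry no information (an illustration only, unrelated to the lesson's models) and compares PR-AUC and ROC-AUC at two prevalences.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)
n = 50_000
results = []

for prevalence in (0.10, 0.05):
    y = (rng.random(n) < prevalence).astype(int)   # labels at the given prevalence
    scores = rng.random(n)                         # scores unrelated to the labels
    pr_auc = average_precision_score(y, scores)
    roc_auc = roc_auc_score(y, scores)
    results.append((prevalence, pr_auc, roc_auc))
    print(f"prevalence={y.mean():.3f}  PR-AUC={pr_auc:.3f}  ROC-AUC={roc_auc:.3f}")
```

PR-AUC should land near the prevalence while ROC-AUC stays near 0.5 regardless, which is why a PR-AUC must be read against the class prevalence and not against 0.5.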
Lesson 06 · Checkpoint

Five questions. Pencil down.

Each question tests a concept from the five lessons above. If you get one wrong, the exact code that answers it is one sidebar click away.

Q01
A Logistic Regression model on 10:1 imbalanced data reports 91% accuracy and 0.32 recall on the minority class. What is the most accurate description of this model's behaviour?
A. The model is performing well: 91% accuracy is strong for a 10:1 dataset
B. The model needs more features to improve minority recall
C. The model is predicting the majority class almost exclusively; high accuracy hides the fact that it misses 68% of minority cases
D. Recall of 0.32 is acceptable when the imbalance ratio exceeds 10:1
This is the accuracy paradox. At 10:1, a model that predicts majority for every sample gets ~90% accuracy with 0% recall. A recall of 0.32 means the classifier finds fewer than one in three minority cases, so the 91% accuracy is barely above that trivial baseline. Rebalancing (SMOTE or class weights) and minority-focused scoring such as F1 or recall are needed.
Q02
You apply SMOTE to the full dataset before calling train_test_split. What problem does this introduce?
A. SMOTE cannot run on the full dataset; it requires separate class arrays
B. Synthetic minority samples generated from test-set neighbours appear in the training fold, leaking information and inflating evaluation metrics
C. The imbalance ratio in the test set changes, which makes accuracy misleading
D. Nothing: SMOTE only uses feature values, not labels, so leakage is impossible
Applying SMOTE before splitting is a leakage bug. Synthetic points are generated by interpolating between real samples. If test-set samples participate in that interpolation, the training set contains points statistically correlated with test rows. The fix: use ImbPipeline with SMOTE inside cross_validate so it runs only on training folds.
Q03
You run GridSearchCV and find that sampling_strategy=0.75 outperforms sampling_strategy=1.0 (full balance). What is the most likely explanation?
A. Full balance over-populates the minority space, creating a dense synthetic cluster that the model memorises; 0.75 adds diversity without crowding the feature space
B. sampling_strategy=1.0 always causes overfitting and should never be used
C. The grid search metric (F1-macro) is biased toward lower sampling strategies
D. 1.0 means no oversampling, which explains the lower score
At full balance (1.0), SMOTE may generate too many synthetic points in the already-dense minority region, causing the model to fit the synthetic distribution rather than the true one. 0.75 gives partial rebalancing that adds signal at the boundary without swamping the feature space. The best ratio is always dataset-specific — always tune it.
Q04
For XGBoost on imbalanced data, what is scale_pos_weight and how does it differ from SMOTE?
A. scale_pos_weight duplicates minority rows before training, exactly like random oversampling (ROS)
B. scale_pos_weight generates synthetic minority samples using k-NN interpolation
C. Both methods are identical in effect; scale_pos_weight is just the XGBoost name for SMOTE
D. scale_pos_weight applies a single gradient weight (n_majority / n_minority) to the loss function during training; SMOTE physically adds new samples to the training set before fitting
scale_pos_weight=9 tells XGBoost to weight each minority gradient update 9× more heavily. It does not change the data — it changes the optimisation. SMOTE creates new samples that participate in tree splits as genuine data points. At high imbalance, SMOTE often wins because it gives the tree more actual splitting opportunities on minority boundaries.
Q05
Which metric is most appropriate as the primary evaluation criterion for a 20:1 imbalanced binary classifier?
A. Accuracy: the most widely understood metric and appropriate for any classification task
B. ROC-AUC: it is threshold-independent and handles imbalance perfectly
C. PR-AUC (average precision): it focuses on the minority class and degrades correctly when a model is no better than random on that class
D. F1 score: it is always the correct choice for imbalanced datasets
ROC-AUC can be inflated by a large TN pool — at 20:1 a model can score 0.85 ROC-AUC while barely identifying any minorities. PR-AUC is anchored to the minority class: a random classifier has PR-AUC ≈ 5% (the class prevalence) at 20:1, so every point of improvement is meaningful. F1 is good for a fixed threshold but misses the full picture across thresholds.