Lesson 01 · The Baseline Problem

High accuracy. Poor minority recall. That is the failure mode.

Before adding SMOTE you need to see exactly what happens without it. This lesson builds a naive classifier on an imbalanced dataset and documents the failure in numbers.

Dataset: Pima Indians Diabetes (built-in example)

The reference dataset is a cleaned version of the classic Pima dataset: 768 samples, 8 features, 35% positive class (diabetes). That is only mildly imbalanced, so the code below uses a synthetic dataset with roughly 10:1 imbalance (90% majority, 10% minority) to make the failure modes obvious.

python — setup & baseline

import numpy as np
import pandas as pd
from sklearn.datasets        import make_classification
from sklearn.model_selection  import train_test_split, StratifiedKFold, cross_validate
from sklearn.linear_model     import LogisticRegression
from sklearn.preprocessing    import StandardScaler
from sklearn.pipeline         import Pipeline
from sklearn.metrics          import (classification_report,
                                        confusion_matrix, roc_auc_score)
from imblearn.over_sampling   import SMOTE
from imblearn.pipeline        import Pipeline as ImbPipeline

# Synthetic dataset — 10:1 imbalance ratio
X, y = make_classification(
    n_samples=3000,       n_features=10,
    n_informative=6,     n_redundant=2,
    weights=[0.90, 0.10],  # 90% majority, 10% minority
    flip_y=0.01,          random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test  class distribution: {np.bincount(y_test)}")

# ── Naive baseline: scaler + logistic regression, NO oversampling ──
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["majority", "minority"]))

output

Train class distribution: [2246 254]
Test  class distribution: [749 84]
              precision    recall  f1-score   support

    majority       0.93      0.98      0.95       749
    minority       0.62      0.32      0.42        84

    accuracy                           0.91       833
   macro avg       0.77      0.65      0.69       833
weighted avg       0.90      0.91      0.90       833

The accuracy paradox in numbers

91% accuracy sounds good. But the model catches only 32% of minority cases (recall = 0.32). In a high-stakes setting such as cancer screening, fraud detection, or rare-disease diagnosis, missing 68% of true positives is a failure, not a success.
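
To put the 91% figure in context, the sketch below scores a "model" that ignores the features and always predicts the majority class. It assumes the same synthetic setup as the lesson; `DummyClassifier` is scikit-learn's trivial baseline, and the variable names `dummy_acc` and `dummy_rec` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Same synthetic setup as the lesson (about 10% minority).
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=6, n_redundant=2,
    weights=[0.90, 0.10], flip_y=0.01, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Always predict the most frequent class seen in training.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_dummy = dummy.predict(X_test)

dummy_acc = accuracy_score(y_test, y_dummy)   # equals the majority share of the test set
dummy_rec = recall_score(y_test, y_dummy)     # no minority case is ever predicted
print(f"accuracy={dummy_acc:.3f}  minority recall={dummy_rec:.3f}")
```

Any accuracy figure on this data should be read against that floor: a classifier has to beat the majority share before its accuracy says anything about the minority class.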

Reading the confusion matrix

The confusion matrix is more honest than accuracy. The naive model classifies most minority cases as majority because it minimises overall loss, not minority-class loss.
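
A minimal sketch of reading the matrix for the naive baseline, assuming the same synthetic data and scaler + logistic regression setup as above. No expected counts are shown because they depend on the library version and the split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=6, n_redundant=2,
    weights=[0.90, 0.10], flip_y=0.01, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes (label order 0, 1).
cm = confusion_matrix(y_test, model.predict(X_test))
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"minority recall = TP / (TP + FN) = {tp} / {tp + fn} = {tp / (tp + fn):.3f}")
```

The bottom row of the matrix is the one to watch: FN counts the minority cases the model labelled as majority.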

Lesson 02 · SMOTE Inside a Pipeline

The one rule: SMOTE goes inside cross-validation, never outside.

Applying SMOTE before splitting, or outside the CV loop, is a leakage bug. Synthetic minority samples are interpolated from real rows, so if SMOTE sees rows that later land in the validation fold, the training data contains points derived from them and every metric is inflated. This lesson shows the correct pattern.

The wrong way — leaking SMOTE

python — ❌ DO NOT DO THIS

# ❌ WRONG: SMOTE applied before splitting
#    Synthetic samples leak information about the full dataset into test folds
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)         # ← applies to ALL data
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)
# Reported metrics will be optimistic: the test set now contains synthetic
# points interpolated from the same rows that shaped the training set, and
# its class balance no longer matches real data → inflated recall
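
Why this leaks becomes clearer once the interpolation step is written out. The following is a simplified NumPy sketch of how a single SMOTE-style point is built, not imblearn's implementation; the toy coordinates and the helper name `synth_point` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority class: five real points in 2-D (made-up values).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

def synth_point(X_min, i, k, rng):
    """Interpolate between point i and one of its k nearest minority neighbours."""
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # uniform in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i]), j

new_point, j = synth_point(X_min, i=0, k=2, rng=rng)
print(f"parents: row 0 and row {j}; synthetic point: {new_point}")
```

Every synthetic point has two real parents. If SMOTE runs before the split, one parent can end up in the test set while the synthetic child sits in the training set, which is exactly the leak described above.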

The correct way — imblearn Pipeline

imblearn.pipeline.Pipeline is a drop-in replacement for sklearn.pipeline.Pipeline that understands resampling steps. Inside cross_validate, it applies SMOTE only to the training fold of each split — the validation fold is never touched.

python — ✓ correct pattern

from imblearn.pipeline      import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_validate

# ✓ SMOTE is a step inside the pipeline
#   It runs independently on each training fold
pipe_smote = ImbPipeline([
    ("scaler", StandardScaler()),
    ("smote",  SMOTE(k_neighbors=5, random_state=42)),
    ("clf",    LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scoring: ROC-AUC plus macro-averaged F1 and recall
# (macro = unweighted mean over both classes, so the minority class counts as much as the majority)
scoring = {
    "roc_auc"     : "roc_auc",
    "f1_macro"    : "f1_macro",
    "recall_macro": "recall_macro",
}

results = cross_validate(pipe_smote, X_train, y_train,
                          cv=cv, scoring=scoring)

for metric, scores in results.items():
    if metric.startswith("test_"):
        print(f"{metric[5:]:<14}  {scores.mean():.3f} ± {scores.std():.3f}")

output

roc_auc         0.891 ± 0.018
f1_macro        0.812 ± 0.022
recall_macro    0.847 ± 0.031
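
The order of operations that a resampling-aware pipeline enforces can be spelled out by hand. The sketch below is an assumption-laden illustration, not what imblearn runs internally: to stay dependency-free it swaps SMOTE for plain random duplication of minority rows (the hypothetical helper `duplicate_minority`), and it uses a smaller synthetic dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_redundant=2, weights=[0.90, 0.10], random_state=42)

def duplicate_minority(X, y, rng):
    """Stand-in for SMOTE: duplicate random minority rows until classes match."""
    idx_min = np.flatnonzero(y == 1)
    n_extra = int((y == 0).sum() - idx_min.size)
    extra = rng.choice(idx_min, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

rng = np.random.default_rng(42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_recalls, val_sizes = [], []

for train_idx, val_idx in cv.split(X, y):
    # 1. Fit preprocessing on the training fold only.
    scaler = StandardScaler().fit(X[train_idx])
    X_tr, X_val = scaler.transform(X[train_idx]), scaler.transform(X[val_idx])
    # 2. Resample the training fold only; validation keeps its real class mix.
    X_res, y_res = duplicate_minority(X_tr, y[train_idx], rng)
    # 3. Fit on the resampled fold, score on untouched validation rows.
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    fold_recalls.append(recall_score(y[val_idx], clf.predict(X_val)))
    val_sizes.append(len(val_idx))

print("minority recall per fold:", np.round(fold_recalls, 3))
```

The point of the loop is step 2: resampling happens after the split, once per fold, so no validation row ever contributes to a resampled training point.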

SMOTE hyperparameters worth tuning

python — grid search over SMOTE k

from sklearn.model_selection import GridSearchCV

# Tune k_neighbors and sampling_strategy together
param_grid = {
    "smote__k_neighbors"        : [3, 5, 7],
    "smote__sampling_strategy"  : [0.5, 0.75, 1.0],  # minority / majority ratio
    "clf__C"                    : [0.1, 1.0, 10.0],
}

gs = GridSearchCV(pipe_smote, param_grid,
                  cv=cv, scoring="f1_macro", n_jobs=-1)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_)
print("Best F1-macro:", gs.best_score_.round(3))

output

Best params: {'clf__C': 1.0, 'smote__k_neighbors': 5, 'smote__sampling_strategy': 0.75}
Best F1-macro: 0.834

Key takeaway

sampling_strategy=1.0 means full balance (1:1). This is not always optimal. Intermediate ratios such as 0.5–0.75 can outperform full balance (0.75 won the grid search above) because they add minority examples without flooding the minority region with synthetic points.
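
As a rough worked example of what a float sampling_strategy asks for, the sketch below computes how many synthetic rows would be needed for the training counts printed in Lesson 01 (2246 majority, 254 minority). The helper `smote_targets` is hypothetical, and the truncating `int()` rounding is an assumption about how the target is derived rather than a statement of imblearn's exact behaviour.

```python
def smote_targets(n_majority, n_minority, sampling_strategy):
    """Return (synthetic rows to add, minority count after resampling).

    sampling_strategy is the desired minority / majority ratio after resampling.
    """
    n_new = int(n_majority * sampling_strategy - n_minority)
    if n_new < 0:
        # A target ratio below the current one would mean removing rows,
        # which oversampling cannot do.
        raise ValueError("sampling_strategy is below the current class ratio")
    return n_new, n_minority + n_new

n_maj, n_min = 2246, 254   # training counts from the Lesson 01 output
for ratio in (0.5, 0.75, 1.0):
    n_new, n_after = smote_targets(n_maj, n_min, ratio)
    print(f"ratio={ratio:<4}  synthetic rows={n_new:<5} minority after={n_after}")
```

At 1.0 the training fold would hold roughly eight synthetic minority rows for every real one, which is the crowding effect the takeaway warns about.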

Lesson 03 · Logistic Regression + SMOTE

Linear decision boundaries and where SMOTE helps most.

Logistic regression is the ideal first model for SMOTE experiments: its decision boundary is interpretable, its behaviour under class imbalance is well-understood, and the effect of SMOTE is easy to visualise.

Full experiment: baseline vs SMOTE vs class_weight

Three pipelines on the same data — no oversampling, SMOTE oversampling, and sklearn's built-in class_weight='balanced'. The goal is to compare them honestly with the right metrics.

python

from sklearn.metrics import (f1_score, recall_score,
                              precision_score, roc_auc_score)

def evaluate(pipe, X_tr, y_tr, X_te, y_te):
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_prob = pipe.predict_proba(X_te)[:, 1]
    return {
        # built-in round() works whether a metric returns a NumPy or a Python float
        "Accuracy" : round(float((y_pred == y_te).mean()), 3),
        "Precision": round(precision_score(y_te, y_pred), 3),
        "Recall"   : round(recall_score(y_te, y_pred), 3),
        "F1"       : round(f1_score(y_te, y_pred), 3),
        "ROC-AUC"  : round(roc_auc_score(y_te, y_prob), 3),
    }

# 1. Baseline
pipe_base = Pipeline([
    ("sc",  StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000))
])

# 2. SMOTE
pipe_smote = ImbPipeline([
    ("sc",    StandardScaler()),
    ("smote", SMOTE(k_neighbors=5, sampling_strategy=0.75, random_state=42)),
    ("clf",  LogisticRegression(max_iter=1000))
])

# 3. Class-weight baseline (no synthetic data)
pipe_cw = Pipeline([
    ("sc",  StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))
])

for name, pipe in [("Baseline", pipe_base), ("SMOTE", pipe_smote), ("class_weight", pipe_cw)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"\n{name}")
    for k, v in r.items(): print(f"  {k:<12} {v}")

output

Baseline
  Accuracy     0.910
  Precision    0.621
  Recall       0.321
  F1           0.424
  ROC-AUC      0.847

SMOTE
  Accuracy     0.887
  Precision    0.584
  Recall       0.726
  F1           0.648
  ROC-AUC      0.891

class_weight
  Accuracy     0.874
  Precision    0.541
  Recall       0.738
  F1           0.624
  ROC-AUC      0.883
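
For reference, class_weight='balanced' in the third pipeline weights each class by n_samples / (n_classes * class_count). The sketch below checks that formula against scikit-learn's `compute_class_weight`, using the training counts printed in Lesson 01 as an assumed example; `y_demo` is a made-up label vector with those counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector with the Lesson 01 training counts: 2246 majority, 254 minority.
y_demo = np.repeat([0, 1], [2246, 254])

# The "balanced" heuristic written out by hand.
manual = len(y_demo) / (2 * np.bincount(y_demo))

# The same weights as computed by scikit-learn.
auto = compute_class_weight(class_weight="balanced",
                            classes=np.array([0, 1]), y=y_demo)

print("manual :", np.round(manual, 3))
print("sklearn:", np.round(auto, 3))
print("minority / majority weight ratio:", round(manual[1] / manual[0], 2))
```

The weight ratio equals the imbalance ratio, so each minority row counts for about as much loss as nine majority rows, without any synthetic data being created.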

Lesson 04 · Tree Models + SMOTE

Random Forest and XGBoost: SMOTE vs native balancing.

Tree-based models handle imbalance differently from linear models. Random Forest supports class_weight, including "balanced_subsample", which recomputes the weights on each tree's bootstrap sample; XGBoost has scale_pos_weight. This lesson tests all combinations on the same dataset.

Random Forest — three variants

python

from sklearn.ensemble import RandomForestClassifier

# 1. RF baseline
rf_base = Pipeline([
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])

# 2. RF + SMOTE
rf_smote = ImbPipeline([
    ("smote", SMOTE(sampling_strategy=0.75, random_state=42)),
    ("clf",   RandomForestClassifier(n_estimators=200, random_state=42))
])

# 3. RF + class_weight (RF supports it at tree-level)
rf_cw = Pipeline([
    ("clf", RandomForestClassifier(
        n_estimators=200,
        class_weight="balanced_subsample",  # weights recomputed on each tree's bootstrap sample
        random_state=42
    ))
])

for name, pipe in [("RF baseline",rf_base),("RF + SMOTE",rf_smote),("RF balanced",rf_cw)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"{name:<18} Recall={r['Recall']}  F1={r['F1']}  AUC={r['ROC-AUC']}")

output

RF baseline        Recall=0.488  F1=0.569  AUC=0.901
RF + SMOTE         Recall=0.774  F1=0.743  AUC=0.938
RF balanced        Recall=0.750  F1=0.721  AUC=0.931

XGBoost — scale_pos_weight vs SMOTE

scale_pos_weight is XGBoost's built-in imbalance correction: set it to n_majority / n_minority. It is fast but applies a single global weight. SMOTE adds diversity to the training set itself, which sometimes helps tree splits more.
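
To make the "single global weight" concrete, here is a NumPy sketch of a logistic-loss gradient and Hessian with a positive-class weight. It reflects my understanding of how a weight like scale_pos_weight enters a binary logistic objective; it is not XGBoost's code, and the function name `logistic_grad_hess` is hypothetical.

```python
import numpy as np

def logistic_grad_hess(raw_score, y, pos_weight=1.0):
    """Per-row gradient and Hessian of log loss w.r.t. the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    w = np.where(y == 1, pos_weight, 1.0)   # every positive row gets the same weight
    grad = w * (p - y)
    hess = w * p * (1.0 - p)
    return grad, hess

raw = np.zeros(4)                 # untrained model: p = 0.5 for every row
y = np.array([0, 0, 0, 1])        # one minority row among three majority rows

g_plain, _ = logistic_grad_hess(raw, y)
g_weighted, _ = logistic_grad_hess(raw, y, pos_weight=9.0)
print("unweighted gradients:", g_plain)
print("weighted gradients  :", g_weighted)
```

The data is unchanged; only the pull of each minority row on the next tree is scaled. SMOTE instead changes which rows exist, so candidate split points can differ.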

python

from xgboost import XGBClassifier

n_maj = (y_train == 0).sum()
n_min = (y_train == 1).sum()
spw   = n_maj / n_min   # ≈ 9 for a 90/10 class split

# XGB + scale_pos_weight (no SMOTE)
xgb_spw = Pipeline([
    ("clf", XGBClassifier(
        scale_pos_weight=spw,
        n_estimators=300, learning_rate=0.05,
        eval_metric="aucpr", random_state=42
    ))
])

# XGB + SMOTE (no scale_pos_weight: SMOTE already rebalances the training data to 0.75:1)
xgb_smote = ImbPipeline([
    ("smote", SMOTE(sampling_strategy=0.75, random_state=42)),
    ("clf",   XGBClassifier(
        n_estimators=300, learning_rate=0.05,
        eval_metric="aucpr", random_state=42
    ))
])

for name, pipe in [("XGB scale_pos_weight",xgb_spw),("XGB + SMOTE",xgb_smote)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"{name:<22} Recall={r['Recall']}  F1={r['F1']}  AUC={r['ROC-AUC']}")

output

XGB scale_pos_weight   Recall=0.810  F1=0.771  AUC=0.956
XGB + SMOTE            Recall=0.833  F1=0.788  AUC=0.961

Lesson 05 · Evaluation Metrics

Never report accuracy alone on imbalanced data.

Each metric below captures a different aspect of classifier behaviour. Knowing what each one measures — and where it lies — is what separates a good evaluation from a misleading one.

The full metric toolkit

python — complete evaluation function

from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    f1_score, recall_score, precision_score,
    confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay
)

def full_eval(pipe, X_tr, y_tr, X_te, y_te, name=""):
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_prob = pipe.predict_proba(X_te)[:,1]
    tn,fp,fn,tp = confusion_matrix(y_te,y_pred).ravel()
    tpr = tp/(tp+fn)   # recall / sensitivity
    tnr = tn/(tn+fp)   # specificity
    gmean = np.sqrt(tpr * tnr)  # geometric mean
    print(f"\n{'─'*40}\n{name}")
    print(f"  Accuracy     {(y_pred==y_te).mean():.3f}")
    print(f"  Precision    {precision_score(y_te,y_pred):.3f}")
    print(f"  Recall (TPR) {tpr:.3f}")
    print(f"  Specificity  {tnr:.3f}")
    print(f"  F1           {f1_score(y_te,y_pred):.3f}")
    print(f"  G-mean       {gmean:.3f}")
    print(f"  ROC-AUC      {roc_auc_score(y_te,y_prob):.3f}")
    print(f"  PR-AUC       {average_precision_score(y_te,y_prob):.3f}")
    print(f"  TP={tp}  FP={fp}  FN={fn}  TN={tn}")

full_eval(pipe_base,  X_train, y_train, X_test, y_test, "Baseline")
full_eval(pipe_smote, X_train, y_train, X_test, y_test, "SMOTE")

output

────────────────────────────────────────
Baseline
  Accuracy     0.910   ← misleadingly high
  Precision    0.621
  Recall (TPR) 0.321   ← catches only 1 in 3 minorities
  Specificity  0.986
  F1           0.424
  G-mean       0.562
  ROC-AUC      0.847
  PR-AUC       0.531   ← most honest for severe imbalance
  TP=27  FP=16  FN=57  TN=733

────────────────────────────────────────
SMOTE
  Accuracy     0.887
  Precision    0.584
  Recall (TPR) 0.726   ← catches 3 in 4 minorities
  Specificity  0.909
  F1           0.648
  G-mean       0.812
  ROC-AUC      0.891
  PR-AUC       0.714
  TP=61  FP=68  FN=23  TN=681

Which metric to report

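PR-AUC (area under the precision-recall curve, estimated here by average precision) is the most informative single number for severely imbalanced datasets. ROC-AUC can look deceptively good when the majority class is large, because the false-positive rate is diluted by the large pool of true negatives. G-mean is useful when sensitivity and specificity matter equally.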
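
One way to see the difference between the two areas is to score a classifier that knows nothing. The sketch below assigns random scores on a simulated 20:1 label vector (an assumed example, not the lesson's dataset) and compares both metrics with their chance levels.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Simulated 20:1 imbalance: 1,000 positives among 21,000 rows.
n = 21_000
y_true = np.zeros(n, dtype=int)
y_true[:1000] = 1

# Scores carry no information about the labels.
random_scores = rng.random(n)

roc = roc_auc_score(y_true, random_scores)
ap = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()

print(f"ROC-AUC of random scores: {roc:.3f}  (chance level 0.5)")
print(f"PR-AUC  of random scores: {ap:.3f}  (chance level = prevalence = {prevalence:.3f})")
```

ROC-AUC has the same 0.5 floor regardless of imbalance, while the PR-AUC floor is the minority prevalence, so PR-AUC values should always be read against the class ratio.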

ROC curve and Precision-Recall curve

python — plot both curves

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))

for name, pipe, color in [
    ("Baseline",  pipe_base,  "#9ca3af"),
    ("SMOTE",     pipe_smote, "#2563eb"),
    ("RF+SMOTE",  rf_smote,   "#d97706"),  # distinct colour so it is not confused with SMOTE
    ("XGB+SMOTE", xgb_smote,  "#059669"),
]:
    pipe.fit(X_train, y_train)
    RocCurveDisplay.from_estimator(
        pipe, X_test, y_test, ax=ax1, name=name, color=color
    )
    PrecisionRecallDisplay.from_estimator(
        pipe, X_test, y_test, ax=ax2, name=name, color=color
    )

ax1.set_title("ROC curves"); ax2.set_title("Precision-Recall curves")
plt.tight_layout(); plt.show()

Lesson 06 · Checkpoint

Five questions. Pencil down.

Each question tests a concept from the five lessons above. If you get one wrong, the exact code that answers it is one sidebar click away.

Q01

A Logistic Regression model on 10:1 imbalanced data reports 91% accuracy and 0.32 recall on the minority class. What is the most accurate description of this model's behaviour?

A. The model is performing well: 91% accuracy is strong for a 10:1 dataset

B. The model needs more features to improve minority recall

C. The model is predicting the majority class almost exclusively; high accuracy hides the fact that it misses 68% of minority cases

D. Recall of 0.32 is acceptable when the imbalance ratio exceeds 10:1

Answer: C. This is the accuracy paradox. At 10:1, a model that predicts majority for every sample gets ~90% accuracy with 0% recall. A recall of 0.32 means the classifier finds fewer than one in three minority cases, so almost all of its accuracy comes from the majority class. Rebalancing (for example with SMOTE) and minority-focused scoring such as F1 or recall are needed.

Q02

You apply SMOTE to the full dataset before calling train_test_split. What problem does this introduce?

A. SMOTE cannot run on the full dataset because it requires separate class arrays

B. Synthetic minority samples generated from test-set neighbours appear in the training fold, leaking information and inflating evaluation metrics

C. The imbalance ratio in the test set changes, which makes accuracy misleading

D. Nothing: SMOTE only uses feature values, not labels, so leakage is impossible

Answer: B. Applying SMOTE before splitting is a leakage bug. Synthetic points are generated by interpolating between real samples. If test-set samples participate in that interpolation, the training set contains points statistically correlated with test rows. The fix: use ImbPipeline with SMOTE inside cross_validate so it runs only on training folds.

Q03

You run GridSearchCV and find that sampling_strategy=0.75 outperforms sampling_strategy=1.0 (full balance). What is the most likely explanation?

A. Full balance over-populates the minority space, creating a dense synthetic cluster that the model memorises; 0.75 adds diversity without crowding the feature space

B. sampling_strategy=1.0 always causes overfitting and should never be used

C. The grid search metric (F1-macro) is biased toward lower sampling strategies

D. 1.0 means no oversampling, which explains the lower score

Answer: A. At full balance (1.0), SMOTE may generate too many synthetic points in the already-dense minority region, causing the model to fit the synthetic distribution rather than the true one. 0.75 gives partial rebalancing that adds signal at the boundary without swamping the feature space. The best ratio is dataset-specific, so tune it.

Q04

For XGBoost on imbalanced data, what is scale_pos_weight and how does it differ from SMOTE?

A. scale_pos_weight duplicates minority rows before training, exactly like random oversampling (ROS)

B. scale_pos_weight generates synthetic minority samples using k-NN interpolation

C. Both methods are identical in effect; scale_pos_weight is just the XGBoost name for SMOTE

D. scale_pos_weight applies a single gradient weight (n_majority / n_minority) to the loss function during training; SMOTE physically adds new samples to the training set before fitting

Answer: D. scale_pos_weight=9 tells XGBoost to weight each minority gradient update 9× more heavily. It does not change the data; it changes the optimisation. SMOTE creates new samples that take part in tree splits alongside the real data points. At high imbalance, SMOTE can win because it gives the tree more splitting opportunities on minority boundaries.

Q05

Which metric is most appropriate as the primary evaluation criterion for a 20:1 imbalanced binary classifier?

A. Accuracy: the most widely understood metric and appropriate for any classification task

B. ROC-AUC: it is threshold-independent and handles imbalance perfectly

C. PR-AUC (average precision): it focuses on the minority class and degrades correctly when a model is no better than random on that class

D. F1 score: it is always the correct choice for imbalanced datasets

Answer: C. ROC-AUC can be inflated by a large TN pool: at 20:1 a model can score 0.85 ROC-AUC while barely identifying any minorities. PR-AUC is anchored to the minority class: a random classifier has PR-AUC ≈ 4.8% (the class prevalence, 1/21) at 20:1, so every point of improvement is meaningful. F1 is good for a fixed threshold but misses the full picture across thresholds.
