High accuracy. Zero recall. That is the failure mode.
Before adding SMOTE you need to see exactly what happens without it. This lesson builds a naive classifier on an imbalanced dataset and documents the failure in numbers.
Dataset: Pima Indians Diabetes (built-in example)
We use a cleaned version of the classic Pima dataset: 768 samples, 8 features, 35% positive class (diabetes). This is only mildly imbalanced, so we then switch to a synthetic dataset with a roughly 9:1 imbalance (90% majority, 10% minority) to make the failure modes obvious.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (classification_report,
                             confusion_matrix, roc_auc_score)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
# Synthetic dataset: roughly 9:1 imbalance (90% / 10%)
X, y = make_classification(
    n_samples=3000, n_features=10,
    n_informative=6, n_redundant=2,
    weights=[0.90, 0.10],  # 90% majority, 10% minority
    flip_y=0.01, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
# ── Naive baseline: scaler + logistic regression, NO oversampling ──
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["majority", "minority"]))
91% accuracy sounds good. But the model catches only 32% of minority cases (recall = 0.32). In a high-stakes setting such as cancer screening, fraud detection, or rare-disease diagnosis, missing 68% of true positives is a failure, not a success.
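To see how little accuracy says on its own, it helps to score a model that never predicts the minority class. The sketch below assumes the same synthetic settings as the lesson and uses scikit-learn's DummyClassifier as a stand-in "do nothing" model; the names `dummy`, `dummy_accuracy` and `dummy_recall` are illustrative, not part of the lesson's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Same synthetic settings as the lesson (about 90% majority, 10% minority).
X, y = make_classification(
    n_samples=3000, n_features=10,
    n_informative=6, n_redundant=2,
    weights=[0.90, 0.10], flip_y=0.01, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Always predicts the most frequent training label, i.e. the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_dummy = dummy.predict(X_test)

dummy_accuracy = accuracy_score(y_test, y_dummy)
dummy_recall = recall_score(y_test, y_dummy)  # recall on class 1 (minority)
print(f"accuracy={dummy_accuracy:.3f}  minority recall={dummy_recall:.3f}")
```

Any real model should be judged against this floor: its accuracy is roughly the majority share, while its minority recall is zero.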
Reading the confusion matrix
The confusion matrix is more honest than accuracy. The naive model classifies most minority cases as majority because it minimises overall loss, not minority-class loss.
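A minimal sketch of reading the matrix for the naive baseline, assuming the same synthetic data and pipeline as above. The variable names (`cm`, `minority_recall`) are illustrative. scikit-learn orders the flattened binary matrix as TN, FP, FN, TP.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=3000, n_features=10,
    n_informative=6, n_redundant=2,
    weights=[0.90, 0.10], flip_y=0.01, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, baseline.predict(X_test))
tn, fp, fn, tp = cm.ravel()

# Minority recall is the share of the bottom row that lands in the TP cell.
minority_recall = tp / (tp + fn)
print(cm)
print(f"missed minority cases (FN): {fn}  caught (TP): {tp}")
print(f"minority recall: {minority_recall:.3f}")
```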
The one rule: SMOTE goes inside cross-validation, never outside.
Applying SMOTE before splitting — or outside the CV loop — is a leakage bug. The synthetic minority samples will share statistical properties with the validation fold, inflating every metric. This lesson shows the correct pattern.
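The leak is easier to see once you look at how a synthetic point is built. The sketch below is a simplified NumPy approximation of SMOTE-style interpolation, not imblearn's implementation; `smote_like` is a hypothetical helper and the data is random. Each new point lies on the segment between a real minority point and one of its nearest minority neighbours, so a synthetic point that lands in a test split can be built from real points sitting in the training split.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: interpolate between minority points and
    one of their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)         # anchor points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                            # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))    # 20 made-up minority samples
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)
```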
The wrong way — leaking SMOTE
# ❌ WRONG: SMOTE applied before splitting
# Synthetic samples leak information about the full dataset into test folds
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y) # ← applies to ALL data
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)
# Reported metrics will be optimistic: synthetic test samples are interpolated
# from real points that may sit in the training set → inflated recall
The correct way — imblearn Pipeline
imblearn.pipeline.Pipeline is a drop-in replacement for sklearn.pipeline.Pipeline that understands resampling steps. Inside cross_validate, it applies SMOTE only to the training fold of each split — the validation fold is never touched.
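Conceptually, the pipeline does something like the manual loop below on every split. This is a sketch of the fold logic only, not imblearn's internals. To stay dependency-free it swaps SMOTE for plain random duplication of minority rows; `oversample_minority` is a hypothetical helper. The key point is that resampling sees the training fold and nothing else.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6, n_redundant=2,
    weights=[0.90, 0.10], random_state=42,
)

def oversample_minority(X_tr, y_tr, rng):
    """Hypothetical stand-in for SMOTE: duplicate minority rows until 1:1."""
    min_idx = np.flatnonzero(y_tr == 1)
    maj_idx = np.flatnonzero(y_tr == 0)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X_tr[keep], y_tr[keep]

rng = np.random.default_rng(42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_recalls, val_sizes = [], []

for train_idx, val_idx in cv.split(X, y):
    # Fit the scaler on the training fold only, then apply it to both folds.
    scaler = StandardScaler().fit(X[train_idx])
    X_tr, X_val = scaler.transform(X[train_idx]), scaler.transform(X[val_idx])
    # Resample the training fold; the validation fold keeps its real distribution.
    X_res, y_res = oversample_minority(X_tr, y[train_idx], rng)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    fold_recalls.append(recall_score(y[val_idx], clf.predict(X_val)))
    val_sizes.append(len(val_idx))

print(f"minority recall per fold: {np.round(fold_recalls, 3)}")
```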
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_validate
# ✓ SMOTE is a step inside the pipeline
# It runs independently on each training fold
pipe_smote = ImbPipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(k_neighbors=5, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Scoring: measure AUC plus F1 and recall on the minority class.
# The binary "f1" and "recall" scorers score the positive label (1),
# which is the minority here; the "*_macro" variants average both classes.
scoring = {
    "roc_auc"   : "roc_auc",
    "f1_min"    : "f1",
    "recall_min": "recall",
}
results = cross_validate(pipe_smote, X_train, y_train,
                         cv=cv, scoring=scoring)
for metric, scores in results.items():
    if metric.startswith("test_"):
        print(f"{metric[5:]:<14} {scores.mean():.3f} ± {scores.std():.3f}")
SMOTE hyperparameters worth tuning
from sklearn.model_selection import GridSearchCV
# Tune k_neighbors and sampling_strategy together
param_grid = {
    "smote__k_neighbors"       : [3, 5, 7],
    "smote__sampling_strategy" : [0.5, 0.75, 1.0],  # minority / majority ratio
    "clf__C"                   : [0.1, 1.0, 10.0],
}
gs = GridSearchCV(pipe_smote, param_grid,
                  cv=cv, scoring="f1_macro", n_jobs=-1)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_)
print("Best F1-macro:", round(gs.best_score_, 3))
sampling_strategy=1.0 means full balance (1:1). This is not always optimal. Ratios of 0.5–0.75 can outperform full balance because they add fewer synthetic points and so crowd the minority region less; treat the ratio as a hyperparameter and let the search decide.
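The arithmetic behind the ratio can be sketched in a few lines. This assumes the float form of sampling_strategy means "minority count after resampling divided by majority count". `synthetic_count` is a hypothetical helper and the class counts are made up for illustration.

```python
def synthetic_count(n_majority, n_minority, ratio):
    """Hypothetical helper: minority samples to add so that
    n_minority_after / n_majority is approximately `ratio`."""
    return max(int(ratio * n_majority) - n_minority, 0)

# Illustrative counts for a 9:1 training set (not taken from the lesson's output).
n_majority, n_minority = 2025, 225

added = {r: synthetic_count(n_majority, n_minority, r) for r in (0.5, 0.75, 1.0)}
for ratio, n_new in added.items():
    total_min = n_minority + n_new
    print(f"ratio={ratio:<5} synthetic={n_new:<5} minority after={total_min}")
```

At full balance most of the minority class would be synthetic, which is one reason a lower ratio is worth including in the grid.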
Linear decision boundaries and where SMOTE helps most.
Logistic regression is the ideal first model for SMOTE experiments: its decision boundary is interpretable, its behaviour under class imbalance is well-understood, and the effect of SMOTE is easy to visualise.
Full experiment: baseline vs SMOTE vs class_weight
Three pipelines on the same data — no oversampling, SMOTE oversampling, and sklearn's built-in class_weight='balanced'. The goal is to compare them honestly with the right metrics.
from sklearn.metrics import (f1_score, recall_score,
                             precision_score, roc_auc_score)
def evaluate(pipe, X_tr, y_tr, X_te, y_te):
    """Fit on the training split and score the minority class on the test split."""
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_prob = pipe.predict_proba(X_te)[:, 1]
    # Built-in round() works whether a metric returns a NumPy or a Python float
    return {
        "Accuracy" : round(float((y_pred == y_te).mean()), 3),
        "Precision": round(precision_score(y_te, y_pred), 3),
        "Recall"   : round(recall_score(y_te, y_pred), 3),
        "F1"       : round(f1_score(y_te, y_pred), 3),
        "ROC-AUC"  : round(roc_auc_score(y_te, y_prob), 3),
    }
# 1. Baseline
pipe_base = Pipeline([
    ("sc", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000))
])
# 2. SMOTE
pipe_smote = ImbPipeline([
    ("sc", StandardScaler()),
    ("smote", SMOTE(k_neighbors=5, sampling_strategy=0.75, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000))
])
# 3. Class-weight baseline (no synthetic data)
pipe_cw = Pipeline([
    ("sc", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))
])
for name, pipe in [("Baseline", pipe_base), ("SMOTE", pipe_smote), ("class_weight", pipe_cw)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"\n{name}")
    for k, v in r.items():
        print(f" {k:<12} {v}")
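For context on the third pipeline, class_weight='balanced' reweights the loss rather than adding samples. The sketch below checks the documented formula, n_samples / (n_classes * count per class), against scikit-learn's compute_class_weight on a made-up 900/100 label vector. `y_demo` is illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 900 majority (0) and 100 minority (1).
y_demo = np.array([0] * 900 + [1] * 100)
classes = np.unique(y_demo)

# "balanced" heuristic: n_samples / (n_classes * count of each class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
from_sklearn = compute_class_weight(class_weight="balanced",
                                    classes=classes, y=y_demo)

print("manual :", manual)
print("sklearn:", from_sklearn)
print("minority/majority weight ratio:", manual[1] / manual[0])
```

Each minority error therefore counts about as much as nine majority errors, which is the same imbalance ratio SMOTE would otherwise correct by adding rows.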
Random Forest and XGBoost: SMOTE vs native balancing.
Tree-based models handle imbalance differently from linear models. Random Forest supports class_weight, including 'balanced_subsample', which recomputes the weights for each tree's bootstrap sample; XGBoost has scale_pos_weight. This lesson tests all combinations on the same dataset.
Random Forest — three variants
from sklearn.ensemble import RandomForestClassifier
# 1. RF baseline
rf_base = Pipeline([
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])
# 2. RF + SMOTE
rf_smote = ImbPipeline([
    ("smote", SMOTE(sampling_strategy=0.75, random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])
# 3. RF + class_weight (RF supports it at tree-level)
rf_cw = Pipeline([
    ("clf", RandomForestClassifier(
        n_estimators=200,
        class_weight="balanced_subsample",  # weights recomputed per bootstrap sample
        random_state=42
    ))
])
for name, pipe in [("RF baseline", rf_base), ("RF + SMOTE", rf_smote), ("RF balanced", rf_cw)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"{name:<18} Recall={r['Recall']} F1={r['F1']} AUC={r['ROC-AUC']}")
XGBoost — scale_pos_weight vs SMOTE
scale_pos_weight is XGBoost's built-in imbalance correction: set it to n_majority / n_minority. It is fast but applies a single global weight. SMOTE adds diversity to the training set itself, which sometimes helps tree splits more.
from xgboost import XGBClassifier
n_maj = (y_train == 0).sum()
n_min = (y_train == 1).sum()
spw = n_maj / n_min  # roughly 9 for a 90% / 10% split
# XGB + scale_pos_weight (no SMOTE)
xgb_spw = Pipeline([
    ("clf", XGBClassifier(
        scale_pos_weight=spw,
        n_estimators=300, learning_rate=0.05,
        eval_metric="aucpr", random_state=42
    ))
])
# XGB + SMOTE (no scale_pos_weight: SMOTE already raises the ratio to 0.75)
xgb_smote = ImbPipeline([
    ("smote", SMOTE(sampling_strategy=0.75, random_state=42)),
    ("clf", XGBClassifier(
        n_estimators=300, learning_rate=0.05,
        eval_metric="aucpr", random_state=42
    ))
])
for name, pipe in [("XGB scale_pos_weight", xgb_spw), ("XGB + SMOTE", xgb_smote)]:
    r = evaluate(pipe, X_train, y_train, X_test, y_test)
    print(f"{name:<22} Recall={r['Recall']} F1={r['F1']} AUC={r['ROC-AUC']}")
Never report accuracy alone on imbalanced data.
Each metric below captures a different aspect of classifier behaviour. Knowing what each one measures — and where it lies — is what separates a good evaluation from a misleading one.
The full metric toolkit
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    f1_score, recall_score, precision_score,
    confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay
)
def full_eval(pipe, X_tr, y_tr, X_te, y_te, name=""):
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_prob = pipe.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    tpr = tp / (tp + fn)          # recall / sensitivity
    tnr = tn / (tn + fp)          # specificity
    gmean = np.sqrt(tpr * tnr)    # geometric mean
    print(f"\n{'─'*40}\n{name}")
    print(f" Accuracy {(y_pred == y_te).mean():.3f}")
    print(f" Precision {precision_score(y_te, y_pred):.3f}")
    print(f" Recall (TPR) {tpr:.3f}")
    print(f" Specificity {tnr:.3f}")
    print(f" F1 {f1_score(y_te, y_pred):.3f}")
    print(f" G-mean {gmean:.3f}")
    print(f" ROC-AUC {roc_auc_score(y_te, y_prob):.3f}")
    print(f" PR-AUC {average_precision_score(y_te, y_prob):.3f}")
    print(f" TP={tp} FP={fp} FN={fn} TN={tn}")
full_eval(pipe_base, X_train, y_train, X_test, y_test, "Baseline")
full_eval(pipe_smote, X_train, y_train, X_test, y_test, "SMOTE")
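PR-AUC (precision-recall area, estimated here with average_precision_score) is usually the most informative single number for severely imbalanced datasets. ROC-AUC can look deceptively good when the majority class is large, because the false positive rate is diluted by the many true negatives. G-mean is useful when sensitivity and specificity matter equally: it drops to zero if the model ignores either class.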
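A short worked example of why the two curves diverge, using Bayes' rule rather than a fitted model. The operating point (TPR 0.90, FPR 0.05) and the prevalences are made-up illustration values, and `precision_at` is a hypothetical helper. The same point on the ROC curve gives very different precision as the minority class shrinks.

```python
def precision_at(tpr, fpr, prevalence):
    """Hypothetical helper: precision implied by one ROC operating point
    at a given minority-class prevalence."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.90, 0.05   # a point that looks strong on a ROC curve
precisions = {p: precision_at(tpr, fpr, p) for p in (0.5, 0.1, 0.01)}

for prevalence, prec in precisions.items():
    print(f"prevalence={prevalence:<5} TPR={tpr} FPR={fpr} precision={prec:.3f}")
```

The ROC coordinates never change, yet at 1% prevalence most flagged cases are false alarms. That drop is what the precision-recall curve exposes.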
ROC curve and Precision-Recall curve
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# One distinct colour per model so the curves can be told apart
for name, pipe, color in [
    ("Baseline", pipe_base, "#9ca3af"),
    ("SMOTE", pipe_smote, "#2563eb"),
    ("RF+SMOTE", rf_smote, "#d97706"),
    ("XGB+SMOTE", xgb_smote, "#059669"),
]:
    pipe.fit(X_train, y_train)
    RocCurveDisplay.from_estimator(
        pipe, X_test, y_test, ax=ax1, name=name, color=color
    )
    PrecisionRecallDisplay.from_estimator(
        pipe, X_test, y_test, ax=ax2, name=name, color=color
    )
ax1.set_title("ROC curves"); ax2.set_title("Precision-Recall curves")
plt.tight_layout(); plt.show()
Five questions. Pencil down.
Each question tests a concept from the four lessons above. If you get one wrong, the exact code that answers it is one sidebar click away.