Lesson 01 · The Problem

Churn data is messy in two ways that models hate.

Most churn prediction research addresses class imbalance. This paper addresses a harder, less-studied problem: class overlap — when churners and stayers are statistically indistinguishable in feature space.

Research summary

We introduce CTGAN-ENN, a hybrid data-level framework that combines conditional tabular GAN synthesis with Edited Nearest Neighbour cleaning to simultaneously address imbalance and overlap in customer churn datasets. Tested across five benchmark datasets, CTGAN-ENN outperforms standard oversampling methods, hybrid baselines, and algorithm-level cost-sensitive approaches — while reducing training time relative to CTGAN alone.

Problem 1 — Class Imbalance

Churn datasets are structurally imbalanced. In a typical telco dataset, 85–95% of customers are stayers and only 5–15% are churners. A naive classifier minimises global loss by predicting the majority class, achieving high accuracy while completely missing the business-critical minority.
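The accuracy trap can be reproduced in a few lines. The sketch below uses a made-up 9:1 split (an assumption for illustration, not data from the paper) and a classifier that always predicts "stay".

```python
# Toy labels: 90 stayers (0) and 10 churners (1), an illustrative 9:1 split.
y_true = [0] * 90 + [1] * 10

# A naive classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on churners: the share of true churners the model actually flags.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
churn_recall = true_pos / sum(y_true)

print(f"accuracy     = {accuracy:.2f}")      # 0.90
print(f"churn recall = {churn_recall:.2f}")  # 0.00
```

Accuracy looks healthy while recall on the class that matters is zero, which is why the later lessons report F1 and PR-AUC instead.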

Problem 2 — Class Overlap

Overlap occurs when minority (churn) samples and majority (stay) samples occupy the same region of feature space. A customer who churns may have identical tenure, spend, and usage patterns to one who stays — because the decision to churn was driven by factors not captured in the data (a competitor offer, a life event).

Oversampling methods like SMOTE amplify this problem: they generate new minority samples by interpolating between existing ones, including those that sit deep inside majority territory. The synthetic samples inherit the noise.

Why overlap makes SMOTE fail

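If a minority sample is surrounded by majority neighbours (a high ADASYN ratio r_i, the fraction of its k nearest neighbours that belong to the majority class), any synthetic child generated near it will also land in majority territory. The model learns a noisy boundary that generalises poorly. The remedy is to pair oversampling with boundary cleaning, so that ambiguous samples are removed rather than multiplied.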
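A small sketch of this failure mode, assuming hypothetical 2-D points and a simplified neighbour ratio in the spirit of ADASYN's r_i. The helper names `majority_ratio` and `smote_child` are illustrative and not part of any library.

```python
import math

# Hypothetical, already-scaled 2-D points (illustration only).
majority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.3, 1.1)]
minority = [(3.0, 3.0), (3.2, 2.9), (2.9, 3.1), (3.1, 3.2),
            (1.1, 1.0)]   # the last churner sits inside majority territory

def majority_ratio(x, k=3):
    """Fraction of x's k nearest neighbours that belong to the majority class."""
    others = [(p, 0) for p in majority] + [(p, 1) for p in minority if p != x]
    nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
    return sum(1 for _, label in nearest if label == 0) / k

def smote_child(x, neighbour, lam):
    """SMOTE-style interpolation: x + lam * (neighbour - x), lam in [0, 1]."""
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbour))

noisy, clean = minority[4], minority[0]
r_noisy = majority_ratio(noisy)          # 1.0: every neighbour is a stayer
r_clean = majority_ratio(clean)          # 0.0: every neighbour is a churner

# A child generated close to the noisy parent inherits its neighbourhood.
child = smote_child(noisy, clean, lam=0.1)
r_child = majority_ratio(child)

print(r_noisy, r_clean, round(r_child, 2))
```

The synthetic child is labelled as a churner but most of its neighbours are stayers, so it adds to the overlap instead of clarifying the boundary.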

[Interactive demo: panels for the original data, the data after SMOTE (overlap amplified), and the data after CTGAN-ENN (overlap reduced), with controls for overlap degree and imbalance ratio and readouts for the overlap score before, after SMOTE, after CTGAN-ENN, and the overall overlap reduction.]

Lesson 02 · CTGAN

Why a GAN handles tabular data better than SMOTE.

Standard SMOTE performs linear interpolation in raw feature space. CTGAN learns the full joint distribution of a tabular dataset — including mixed types, multimodal columns, and cross-column correlations — and samples from that learned distribution.

CTGAN architecture

CTGAN (Xu et al., 2019) is a conditional generative adversarial network specifically designed for tabular data. It addresses two structural problems that break vanilla GANs on tables:

Mode collapse on numeric columns. Numeric columns in real-world data are often multimodal. CTGAN applies mode-specific normalisation — it fits a Bayesian Gaussian Mixture to each column and normalises each sample relative to the mode it belongs to.
Training imbalance on categorical columns. Rare categories are underrepresented. CTGAN uses a conditional vector that forces the generator to produce samples for every discrete value, including rare ones.
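Mode-specific normalisation can be sketched without a GAN. The example below assumes a hard-coded two-mode mixture for a `monthly_charges` column (in CTGAN the modes come from a fitted Bayesian Gaussian mixture) and, as a simplification, assigns each value to its most likely mode rather than sampling the mode from the posterior.

```python
import math

# Hypothetical mixture: (weight, mean, std) per mode. These numbers are
# assumptions for illustration, not fitted values.
MODES = [(0.6, 20.0, 2.0), (0.4, 80.0, 5.0)]

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mode_specific_normalise(x):
    """Return (mode_index, alpha): the mode x is assigned to and its offset
    within that mode, scaled by 4 standard deviations."""
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in MODES]
    k = max(range(len(MODES)), key=lambda i: densities[i])  # most likely mode
    _, mean, std = MODES[k]
    return k, (x - mean) / (4 * std)

def denormalise(k, alpha):
    """Invert the normalisation for a value generated in mode k."""
    _, mean, std = MODES[k]
    return alpha * 4 * std + mean

k_low, a_low = mode_specific_normalise(22.0)     # mode 0, alpha = 0.25
k_high, a_high = mode_specific_normalise(75.0)   # mode 1, alpha = -0.25
print(k_low, a_low, k_high, a_high)
print(denormalise(k_high, a_high))               # back to 75.0
```

Because each value is represented as a mode indicator plus a small offset, a generator that outputs that pair produces values near one of the real modes rather than in the empty range between them.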


Install and run CTGAN

bash — install

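# Install the standalone ctgan package, which the code below imports directly.
# sdv is optional and provides a higher-level wrapper around CTGAN.
pip install sdv ctgan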

python — CTGAN minority oversampling

import pandas as pd
import numpy as np
from ctgan import CTGAN

# Load churn dataset (telco example)
df = pd.read_csv("churn_train.csv")
print(df["churn"].value_counts())
# 0    3286   (stayed)
# 1     483   (churned)  ← imbalance ratio ~6.8:1

# ── Step 1: isolate the minority class ──────────────────────
df_minority = df[df["churn"] == 1].copy()

# Identify discrete columns (CTGAN needs to know these)
discrete_cols = ["international_plan", "voicemail_plan", "churn"]

# ── Step 2: train CTGAN on minority only ─────────────────────
ctgan = CTGAN(
    epochs          = 300,
    batch_size      = 60,    # must be divisible by pac (default 10)
    generator_dim   = (256, 256),
    discriminator_dim=(256, 256),
    verbose         = False,
)
ctgan.fit(df_minority, discrete_columns=discrete_cols)

# ── Step 3: generate synthetic minority samples ───────────────
n_majority = (df["churn"] == 0).sum()
n_minority = (df["churn"] == 1).sum()
n_generate = n_majority - n_minority   # fill gap to 1:1

synthetic_minority = ctgan.sample(n_generate)
print(f"Generated {len(synthetic_minority)} synthetic churn samples")

# ── Step 4: merge into balanced training set ─────────────────
df_ctgan = pd.concat([df, synthetic_minority], ignore_index=True)
print(df_ctgan["churn"].value_counts())

output

Generated 2803 synthetic churn samples
0    3286   (stayed)
1    3286   (churned: original + synthetic)
dtype: int64

CTGAN vs SMOTE on tabular data

SMOTE linearly interpolates between two points. CTGAN samples from the learned joint distribution of all features simultaneously. For a column like monthly_charges with a bimodal distribution (two subscription tiers), SMOTE may generate values between the two modes that no real customer would ever have. CTGAN respects the modes.
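The gap effect is easy to see on a single column. The sketch below assumes a hypothetical bimodal `monthly_charges` with tiers near 20 and 80, and deliberately pairs samples across the two tiers; real SMOTE only interpolates towards one of the k nearest minority neighbours, so cross-mode pairs arise mainly when a mode is sparsely populated.

```python
import random

random.seed(0)

# Hypothetical bimodal column: a basic tier near 20 and a premium tier near 80.
basic   = [random.gauss(20, 1) for _ in range(50)]
premium = [random.gauss(80, 1) for _ in range(50)]

def interpolate(a, b, lam):
    """SMOTE-style convex combination of two values, lam in [0, 1]."""
    return a + lam * (b - a)

# Pair each basic-tier value with a premium-tier value and interpolate.
synthetic = [interpolate(a, b, random.random()) for a, b in zip(basic, premium)]

# Count synthetic values that land in the gap where no real customer sits.
in_gap = [v for v in synthetic if 30 < v < 70]
print(f"{len(in_gap)} of {len(synthetic)} interpolated values fall between the tiers")
```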

Lesson 03 · ENN

Remove the noise at the boundary, not the signal.

Edited Nearest Neighbours is a targeted under-sampling rule. It removes only the samples whose class label disagrees with the majority vote of their k nearest neighbours — the ambiguous boundary points, not the informative ones deep inside each cluster.

How ENN works

For every sample x_i in the dataset, find its k nearest neighbours (default k = 3).
If the majority vote of those k neighbours disagrees with x_i's label, x_i is an overlap sample and is removed.
This applies to both classes — it cleans the majority class boundary too, not just the minority.
The result is a dataset where every sample agrees with its local neighbourhood — overlap is minimised.
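The steps above can be sketched in plain Python. This is a minimal illustration of the majority-vote rule on toy 1-D data, not the imbalanced-learn implementation; the function name `edited_nearest_neighbours` and the data are assumptions.

```python
import math
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their
    k nearest neighbours (Euclidean distance, the sample itself excluded)."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        vote = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if vote == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: stayers (0) cluster near 0, churners (1) near 10, plus one
# churner stranded among stayers (0.7) and one stayer among churners (9.7).
X = [(0.0,), (0.5,), (1.0,), (1.5,), (0.7,), (9.0,), (9.5,), (10.0,), (10.5,), (9.7,)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

X_clean, y_clean = edited_nearest_neighbours(X, y)
print(len(X), "->", len(X_clean))   # 10 -> 8
print(y_clean)                      # [0, 0, 0, 0, 1, 1, 1, 1]
```

Only the two stranded samples are removed, one from each class; every sample that agrees with its neighbourhood is kept.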

ENN is surgical, not blunt

Unlike random undersampling (which discards majority samples at random and risks losing information), ENN only removes samples that are inconsistent with their neighbourhood. A majority sample deep in majority territory is never touched. Only the boundary-straddling samples are removed.

ENN in Python with imbalanced-learn

python — ENN standalone

from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.preprocessing  import LabelEncoder, StandardScaler
import numpy as np

# Prepare X and y — encode any object columns first
X = df_ctgan.drop("churn", axis=1)
y = df_ctgan["churn"]

# Encode object columns
for col in X.select_dtypes("object").columns:
    X[col] = LabelEncoder().fit_transform(X[col])

# Scale — ENN uses Euclidean distance, scaling is essential
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply ENN — k=3 is the default
enn = EditedNearestNeighbours(
    n_neighbors      = 3,
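    # "all": a sample is kept only if ALL k neighbours share its label.
    # This is stricter than the majority vote described above (kind_sel="mode").
    kind_sel         = "all",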
    sampling_strategy= "all",  # clean both classes
)
X_enn, y_enn = enn.fit_resample(X_scaled, y)

print(f"Before ENN: {X_scaled.shape[0]} samples")
print(f"After  ENN: {X_enn.shape[0]} samples")
print(f"Removed:    {X_scaled.shape[0] - X_enn.shape[0]} overlap samples")
print(f"Class dist: {np.bincount(y_enn)}")

output

Before ENN: 6572 samples
After  ENN: 6109 samples
Removed:    463 overlap samples
Class dist: [3071 3038]

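[Interactive demo: panels showing the overlap boundary before ENN and the cleaned boundary after ENN, with controls for k neighbours (default 3) and overlap degree, and readouts for total samples before, samples removed, total after, and removal percentage.]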

Lesson 04 · The CTGAN-ENN Framework

Oversample first. Clean second.

CTGAN-ENN chains the two techniques in a deliberate order: CTGAN generates synthetic minority samples to address imbalance, then ENN removes the noisy boundary samples from the resulting dataset to address overlap.

Why this order matters

If you apply ENN before CTGAN, you remove real minority samples — the very samples CTGAN needs to learn from. By oversampling first, you ensure that ENN is cleaning a richer, fuller picture of the minority class and removing the noisy overlap zone without losing real signal.
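To make the cost of the reversed order concrete, the sketch below applies a majority-vote ENN rule to a hypothetical raw dataset and counts how many real churners would remain for the generator to learn from. The data and the helper `enn_keep_mask` are assumptions for illustration; no GAN is trained here.

```python
from collections import Counter

def enn_keep_mask(X, y, k=3):
    """True for samples whose label matches the majority vote of their
    k nearest neighbours (1-D distance, the sample itself excluded)."""
    mask = []
    for i, xi in enumerate(X):
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: abs(xi - X[j]))[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        mask.append(vote == y[i])
    return mask

# Hypothetical 1-D feature: 7 stayers (0) and 5 churners (1);
# two of the churners (0.75 and 2.25) sit among the stayers.
X = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 0.75, 2.25, 6.0, 6.3, 6.6]
y = [0,   0,   0,   0,   0,   0,   0,   1,    1,    1,   1,   1]

mask = enn_keep_mask(X, y)
real_churners = sum(y)
churners_after_enn = sum(1 for keep, label in zip(mask, y) if keep and label == 1)

print(f"CTGAN first: generator trains on {real_churners} real churners")
print(f"ENN first:   generator trains on {churners_after_enn} real churners")
```

In this toy case, running ENN first discards two of the five real churners before the generator ever sees them, while every stayer survives.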

Step 1 · Raw churn dataset → imbalanced + overlapping

The minority class (churners) is underrepresented and its samples partially overlap with majority territory.

Step 2 · CTGAN oversampling → balanced

Train CTGAN on minority samples only. Sample synthetic rows until the imbalance ratio reaches the target. Synthetic samples respect the joint distribution: multimodal numerics and categorical frequencies are preserved.

Step 3 · ENN boundary cleaning → overlap reduced

Apply ENN to the combined (original + synthetic) dataset. It removes samples from both classes that are inconsistent with their k-neighbourhood, which sharpens the decision boundary.

Step 4 · Classifier training (KNN / GBM / XGB / LGB)

Train any standard classifier on the cleaned, balanced training set. No algorithm-level modifications are required. The data preparation does the work.
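How many synthetic rows the oversampling step has to produce is simple arithmetic. The helper below is a hypothetical sketch (the name `synthetic_rows_needed` is not from the paper); it treats the target as a minority-to-majority ratio and uses the class counts from the earlier telco example.

```python
def synthetic_rows_needed(n_majority, n_minority, target_ratio=1.0):
    """Rows to generate so that minority / majority reaches target_ratio.
    Returns 0 when the minority class already meets the target."""
    return max(0, int(n_majority * target_ratio) - n_minority)

print(synthetic_rows_needed(3286, 483))        # 2803 (full 1:1 balance)
print(synthetic_rows_needed(3286, 483, 0.5))   # 1160 (one churner per two stayers)
print(synthetic_rows_needed(1000, 990, 0.5))   # 0    (already above target)
```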

Complete CTGAN-ENN pipeline: end-to-end example

python — full CTGAN-ENN pipeline

import pandas as pd
import numpy as np
from ctgan                     import CTGAN
from imblearn.under_sampling   import EditedNearestNeighbours
from sklearn.preprocessing     import StandardScaler, LabelEncoder
from sklearn.model_selection   import train_test_split, StratifiedKFold
from sklearn.metrics           import f1_score, roc_auc_score, average_precision_score
from xgboost                   import XGBClassifier


def ctgan_enn_pipeline(
    df_train,
    target_col       = "churn",
    discrete_cols    = None,
    ctgan_epochs     = 300,
    sampling_strategy= 1.0,   # 1.0 = full 1:1 balance
    enn_k            = 3,
):
    """
    Full CTGAN-ENN data preparation pipeline.
    Returns: X_clean (scaled array), y_clean (labels),
             sc (fitted StandardScaler), encoders (fitted LabelEncoder per column)
    """
    # ── 0. encode categoricals ──────────────────────────────────
    df = df_train.copy()
    encoders = {}
    for col in df.select_dtypes("object").columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        encoders[col] = le

    # ── 1. CTGAN oversampling on minority class ───────────────────
    df_min = df[df[target_col] == 1]
    n_maj  = (df[target_col] == 0).sum()
    n_need = int(n_maj * sampling_strategy) - len(df_min)
    if n_need <= 0:
        raise ValueError("Minority class already meets the requested sampling_strategy")

    # Fit on features only: the target is constant within the minority class
    feature_discrete = [c for c in (discrete_cols or []) if c != target_col]
    ctgan = CTGAN(epochs=ctgan_epochs, verbose=False)
    ctgan.fit(df_min.drop(columns=[target_col]), discrete_columns=feature_discrete)
    synth = ctgan.sample(n_need)
    synth[target_col] = 1   # tag synthetic minority

    df_aug = pd.concat([df, synth], ignore_index=True)
    print(f"After CTGAN: {df_aug.shape[0]} samples | "
          f"class dist {np.bincount(df_aug[target_col].astype(int)).tolist()}")

    # ── 2. scale for distance-based ENN ──────────────────────────
    X = df_aug.drop(target_col, axis=1).values
    y = df_aug[target_col].values.astype(int)
    sc  = StandardScaler()
    Xs  = sc.fit_transform(X)

    # ── 3. ENN boundary cleaning ──────────────────────────────────
    enn = EditedNearestNeighbours(
        n_neighbors=enn_k,
        kind_sel="all",
        sampling_strategy="all"
    )
    X_clean, y_clean = enn.fit_resample(Xs, y)
    print(f"After ENN:  {X_clean.shape[0]} samples | "
          f"removed {Xs.shape[0]-X_clean.shape[0]} overlap samples")

    return X_clean, y_clean, sc, encoders


# ── Run the pipeline ──────────────────────────────────────────
df_raw   = pd.read_csv("churn_train.csv")
X_cl, y_cl, sc, encoders = ctgan_enn_pipeline(
    df_raw,
    target_col    = "churn",
    discrete_cols = ["international_plan", "voicemail_plan"],
    ctgan_epochs  = 300,
    enn_k         = 3,
)

# ── Evaluate with XGBoost ─────────────────────────────────────
xgb = XGBClassifier(n_estimators=200, eval_metric="aucpr", random_state=42)
xgb.fit(X_cl, y_cl)

df_test = pd.read_csv("churn_test.csv")
for col, le in encoders.items():                 # same encoders as training
    df_test[col] = le.transform(df_test[col])
X_te    = sc.transform(df_test.drop("churn", axis=1).values)     # same scaler
y_te    = df_test["churn"].values

y_pred  = xgb.predict(X_te)
y_prob  = xgb.predict_proba(X_te)[:,1]
print(f"F1      : {f1_score(y_te, y_pred):.4f}")
print(f"ROC-AUC : {roc_auc_score(y_te, y_prob):.4f}")
print(f"PR-AUC  : {average_precision_score(y_te, y_prob):.4f}")

output

After CTGAN: 6572 samples | class dist [3286, 3286]
After ENN:  6109 samples | removed 463 overlap samples
F1      : 0.8712
ROC-AUC : 0.9384
PR-AUC  : 0.8951

Lesson 05 · Results & Benchmarks

Five datasets. Four classifiers. One winner.

CTGAN-ENN was evaluated on five publicly available churn datasets against nine competing methods across four classifiers. The results consistently favour CTGAN-ENN on minority-class metrics — particularly F1 and PR-AUC.

Benchmark datasets

Dataset        Samples   Features   Churn %   Imbalance ratio (stay:churn)
Telco (IBM)      7,043         20     26.5%   2.8:1
KKBox           15,210         14      8.3%   11.1:1
Orange          50,000        230      7.4%   12.5:1
Cell2Cell       51,306         58     49.8%   ~1:1
BankChurn       10,000         11     20.4%   3.9:1
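The imbalance ratio follows directly from the churn percentage: stayers per churner is (1 - p) / p. A minimal sketch, with `imbalance_ratio` as a hypothetical helper name:

```python
def imbalance_ratio(churn_rate):
    """Stayers per churner, given the churn rate as a fraction in (0, 1)."""
    return (1 - churn_rate) / churn_rate

print(f"{imbalance_ratio(0.074):.1f}:1")   # 12.5:1 (Orange, 7.4% churn)
print(f"{imbalance_ratio(0.498):.1f}:1")   # 1.0:1  (Cell2Cell, 49.8% churn)
```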

Key findings

Overlap reduction across all datasets. CTGAN-ENN reduced the measured class overlap on every feature in every dataset. Overlap was quantified with the F1 complexity measure of Ho & Basu (2002), the maximum Fisher's discriminant ratio, which is unrelated to the F1 classification score.
F1 and PR-AUC improvements. CTGAN-ENN outperformed SMOTE, ADASYN, CTGAN-alone, and cost-sensitive learning baselines on minority F1 and PR-AUC in 4 out of 5 datasets (Cell2Cell excluded — near-balanced, no imbalance correction needed).
Faster than CTGAN alone. Because ENN removes samples after CTGAN, the final training set is smaller than CTGAN-alone. Training time was 12–18% lower on KKBox and Orange despite more data-preparation steps.
Best classifier pairing: XGB and LGB. Both gradient-boosted models benefited most from the cleaned training set. KNN showed the largest absolute recall improvement but lower precision gains.
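For reference, the F1 measure of Ho & Basu (2002) is the maximum Fisher's discriminant ratio over features. The sketch below computes it on hypothetical toy data and assumes population variances; the paper's exact variance convention is not stated here.

```python
from statistics import mean, pvariance

def fisher_ratio(values_a, values_b):
    """Fisher's discriminant ratio for one feature:
    (mean_a - mean_b)^2 / (var_a + var_b). Higher means less overlap."""
    spread = pvariance(values_a) + pvariance(values_b)
    return (mean(values_a) - mean(values_b)) ** 2 / spread

def max_fisher_ratio(X, y):
    """Maximum of the per-feature ratios for a two-class dataset."""
    ratios = []
    for f in range(len(X[0])):
        a = [row[f] for row, label in zip(X, y) if label == 0]
        b = [row[f] for row, label in zip(X, y) if label == 1]
        ratios.append(fisher_ratio(a, b))
    return max(ratios)

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [(1.0, 5.0), (2.0, 6.0), (3.0, 5.0), (7.0, 6.0), (8.0, 5.0), (9.0, 6.0)]
y = [0, 0, 0, 1, 1, 1]

score = max_fisher_ratio(X, y)
print(round(score, 2))   # 27.0, driven by feature 0
```

Note the direction: a larger ratio means the classes are easier to separate, so reduced overlap shows up as an increase in this measure.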

[Interactive charts: a method-comparison bar chart with selectors for dataset, classifier, and metric, and a chart of the overlap score before and after across all five datasets.]

python — measure overlap with a 1-NN error proxy

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE, ADASYN

def overlap_score(X, y, cv=5):
    """
    Proxy overlap metric: cross-validated 1-NN error rate.
    Low error = low overlap = clean boundary.
    High error = high overlap = many misclassified by nearest neighbour.
    """
    knn = KNeighborsClassifier(n_neighbors=1)
    err = 1 - cross_val_score(knn, X, y, cv=cv,
                                scoring="accuracy").mean()
    return round(err, 4)

# Compare overlap across preparation strategies.
# X_raw_scaled / y_raw (scaled original data) and X_ctgan / y_ctgan
# (scaled CTGAN-only data) are assumed to have been prepared beforehand.

strategies = {
    "Raw"       : (X_raw_scaled,    y_raw),
    "SMOTE"     : SMOTE().fit_resample(X_raw_scaled, y_raw),
    "ADASYN"    : ADASYN().fit_resample(X_raw_scaled, y_raw),
    "CTGAN"     : (X_ctgan,         y_ctgan),
    "CTGAN-ENN" : (X_cl,            y_cl),
}

for name, (Xv, yv) in strategies.items():
    ov = overlap_score(Xv, yv)
    print(f"{name:<14} overlap score: {ov:.4f}")

output — Telco dataset

Raw            overlap score: 0.2134
SMOTE          overlap score: 0.1987
ADASYN         overlap score: 0.1841
CTGAN          overlap score: 0.1762
CTGAN-ENN      overlap score: 0.0934   ← 56% reduction
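The same proxy can be written without scikit-learn as a leave-one-out 1-NN error. This is a sketch on hypothetical toy data, intended only to show what the score measures; the function name `loo_1nn_error` is an assumption.

```python
import math

def loo_1nn_error(X, y):
    """Leave-one-out 1-NN error: the share of samples whose nearest
    neighbour (excluding the sample itself) carries a different label."""
    errors = 0
    for i, xi in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: math.dist(xi, X[j]))
        errors += y[nearest] != y[i]
    return errors / len(X)

# Well-separated classes: every nearest neighbour shares the label.
X_sep = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
y_sep = [0, 0, 0, 1, 1, 1]

# Fully interleaved classes: every nearest neighbour has the other label.
X_mix = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y_mix = [0, 1, 0, 1, 0, 1]

print(loo_1nn_error(X_sep, y_sep))   # 0.0
print(loo_1nn_error(X_mix, y_mix))   # 1.0
```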

Lesson 06 · Checkpoint

Five questions. Pencil down.

These questions test your understanding of the paper's core contributions — not just the code, but the reasoning behind design decisions.

Q01

What is "class overlap" in churn prediction and why does it make standard oversampling methods like SMOTE fail?

A. Overlap refers to duplicate rows in the dataset. SMOTE copies duplicates and worsens the redundancy.

B. Overlap means the dataset has more majority samples than minority. SMOTE does not handle high imbalance ratios well.

C. Overlap means minority and majority samples occupy the same feature-space region. SMOTE interpolates between existing minority samples including those already in majority territory, generating synthetic noise rather than clean signal.

D. Overlap is a measurement artefact caused by feature scaling. SMOTE worsens it by changing feature distributions.

Overlap is a geometric problem in feature space: churners and stayers have nearly identical feature vectors. SMOTE generates new minority points along line segments between existing minority samples — including boundary samples that already sit in majority territory — producing synthetic points that are equally ambiguous or worse.

Q02

Why does CTGAN handle multimodal numeric columns better than SMOTE?

A. CTGAN uses a larger k for nearest-neighbour interpolation, capturing more modes

B. CTGAN applies mode-specific normalisation using a Bayesian Gaussian Mixture, allowing it to model and sample from each mode separately rather than interpolating across the full range

C. CTGAN converts all numeric columns to categorical before synthesis, preserving modes exactly

D. CTGAN ignores multimodal columns and samples from the marginal distribution of each feature independently

SMOTE interpolates in raw feature space. For a bimodal column (e.g. two subscription tiers), linear interpolation between a mode-A sample and a mode-B sample produces a value between the two modes — a value that no real customer would have. CTGAN's mode-specific normalisation detects the modes via a Gaussian Mixture and normalises each sample relative to its own mode, so generated values stay within realistic mode ranges.

Q03

ENN removes samples that disagree with the majority vote of their k nearest neighbours. Which samples does this primarily remove?

A. Boundary-region samples from both classes that are surrounded by opposite-class neighbours (the ambiguous overlap samples)

B. Random samples from the majority class to reduce overall dataset size

C. Only synthetic samples generated by CTGAN, because ENN cannot identify real samples

D. Outlier samples that are far from all other samples regardless of class

ENN is a targeted boundary cleaner. A sample that is surrounded by k neighbours mostly from the opposite class is definitionally an overlap sample — it is in the wrong territory. ENN removes these from both classes. Deep majority samples (surrounded entirely by majority neighbours) and deep minority samples are untouched. This is why ENN reduces overlap without destroying class-representative information.

Q04

The paper applies CTGAN first, then ENN. Why is this order important? What happens if you reverse it?

A. Order does not matter, because CTGAN and ENN operate on independent subsets of the data

B. ENN first is preferred because cleaning the data before synthesis produces better GAN training signal

C. ENN first removes majority samples, which is the same result as SMOTE

D. ENN first removes real minority boundary samples, the very samples CTGAN needs to learn the minority distribution. CTGAN first preserves those samples for learning, then ENN cleans the overlap that includes both real and synthetic points.

If ENN runs first on the imbalanced raw data, it removes real minority samples at the boundary, and with so few minority samples to begin with, losing any of them hurts CTGAN training. Running CTGAN first lets the generator learn from every real minority sample. ENN then cleans the richer augmented dataset, removing overlap without starving the generator.

Q05

The paper reports that CTGAN-ENN is faster to train than CTGAN alone despite more preprocessing steps. What explains this finding?

A. CTGAN-ENN uses fewer GAN epochs because ENN initialises the generator weights more efficiently

B. CTGAN-ENN trains on a smaller neural network architecture than CTGAN alone

C. ENN removes overlap samples after CTGAN oversampling, producing a final training set that is smaller than CTGAN-alone. The classifier trains on fewer samples, reducing overall wall-clock time despite the extra ENN step.

D. CTGAN-ENN skips the discriminator training phase of the GAN because ENN provides the adversarial signal instead

CTGAN alone produces a fully balanced dataset (e.g. 6572 samples for a 3286:3286 split). CTGAN-ENN applies ENN afterward and removes the overlap samples — reducing the training set to ~6100 samples in the example. The classifier trains on fewer (but cleaner) rows, which is faster. The GAN training time is identical; the speedup comes from the smaller, cleaner downstream dataset.
