Churn data is messy in two ways that models hate.
Most churn prediction research addresses class imbalance. This paper addresses a harder, less-studied problem: class overlap — when churners and stayers are statistically indistinguishable in feature space.
We introduce CTGAN-ENN, a hybrid data-level framework that combines conditional tabular GAN synthesis with Edited Nearest Neighbour cleaning to simultaneously address imbalance and overlap in customer churn datasets. Tested across five benchmark datasets, CTGAN-ENN outperforms standard oversampling methods, hybrid baselines, and algorithm-level cost-sensitive approaches — while reducing training time relative to CTGAN alone.
Problem 1 — Class Imbalance
Churn datasets are structurally imbalanced. In a typical telco dataset, 85–95% of customers are stayers and only 5–15% are churners. A naive classifier minimises global loss by predicting the majority class, achieving high accuracy while completely missing the business-critical minority.
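To make the accuracy trap concrete, here is a minimal sketch in plain Python. It borrows the class counts from the telco example used later in this article (3,286 stayers, 483 churners) and scores a hypothetical "model" that always predicts the majority class.

```python
# Class counts taken from the telco example used later in this article.
n_stay, n_churn = 3286, 483

# A naive "model" that always predicts the majority class (0 = stayed).
y_true = [0] * n_stay + [1] * n_churn
y_pred = [0] * len(y_true)

# Overall accuracy looks respectable...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but recall on the churn class is zero: no churner is ever flagged.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
churn_recall = true_pos / n_churn

print(f"accuracy     = {accuracy:.3f}")      # accuracy     = 0.872
print(f"churn recall = {churn_recall:.3f}")  # churn recall = 0.000
```

This is why the rest of the article reports minority-class metrics (F1, PR-AUC) rather than accuracy.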
Problem 2 — Class Overlap
Overlap occurs when minority (churn) samples and majority (stay) samples occupy the same region of feature space. A customer who churns may have identical tenure, spend, and usage patterns to one who stays — because the decision to churn was driven by factors not captured in the data (a competitor offer, a life event).
Oversampling methods like SMOTE amplify this problem: they generate new minority samples by interpolating between existing ones, including those that sit deep inside majority territory. The synthetic samples inherit the noise.
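If a minority sample is surrounded by majority neighbours (a high ADASYN ratio r_i, the fraction of its k nearest neighbours that belong to the majority class), any synthetic child generated near it will also land in majority territory. The model learns a noisy boundary that generalises poorly. The solution is to pair sampling with boundary cleaning, so that overlapping samples, real or synthetic, are removed before the classifier is trained.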
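The ratio r_i mentioned above can be computed directly. The sketch below is a plain-Python illustration, not the ADASYN implementation from any library: the function name `majority_neighbour_ratio` and the toy data are hypothetical, and distances are plain Euclidean.

```python
import math

def majority_neighbour_ratio(X, y, k=3, minority=1):
    """Return {index: r_i} for each minority sample, where r_i is the share
    of its k nearest neighbours (Euclidean) that belong to another class."""
    ratios = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Distances to every other sample, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in dists[:k]]
        ratios[i] = sum(y[j] != minority for j in neighbours) / k
    return ratios

# Hypothetical toy data with two features per customer.
X = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 1.9),  # stayers
     (1.05, 2.0),                                      # churner inside the stayer cluster
     (5.0, 6.0), (5.2, 6.1), (4.9, 5.8)]              # churners in their own cluster
y = [0, 0, 0, 0, 1, 1, 1, 1]

r = majority_neighbour_ratio(X, y, k=2)
print(r)  # {4: 1.0, 5: 0.0, 6: 0.0, 7: 0.0}
```

Sample 4 has r_i = 1.0: every neighbour is a stayer, so any point interpolated near it would land in majority territory.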
Why a GAN handles tabular data better than SMOTE.
Standard SMOTE performs linear interpolation in raw feature space. CTGAN learns the full joint distribution of a tabular dataset — including mixed types, multimodal columns, and cross-column correlations — and samples from that learned distribution.
CTGAN architecture
CTGAN (Xu et al., 2019) is a conditional generative adversarial network specifically designed for tabular data. It addresses two structural problems that break vanilla GANs on tables:
- Mode collapse on numeric columns. Numeric columns in real-world data are often multimodal. CTGAN applies mode-specific normalisation — it fits a Bayesian Gaussian Mixture to each column and normalises each sample relative to the mode it belongs to.
- Training imbalance on categorical columns. Rare categories are underrepresented. CTGAN uses a conditional vector that forces the generator to produce samples for every discrete value, including rare ones.
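Mode-specific normalisation can be sketched in a few lines. The example below assumes the mixture for one column has already been fitted and is passed in as (weight, mean, std) triples; the values are hypothetical. As a simplification it picks the most likely mode deterministically, whereas Xu et al. sample the mode from the component probabilities. The 4 × std scaling follows the CTGAN paper.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mode_specific_normalise(value, modes):
    """modes: list of (weight, mean, std) for one numeric column.
    Returns (alpha, beta): the value normalised within its chosen mode,
    and a one-hot vector marking which mode was chosen."""
    densities = [w * gaussian_pdf(value, m, s) for w, m, s in modes]
    k = max(range(len(modes)), key=densities.__getitem__)  # argmax (simplification)
    _, mean, std = modes[k]
    alpha = (value - mean) / (4 * std)
    beta = [1 if j == k else 0 for j in range(len(modes))]
    return alpha, beta

# Hypothetical mixture for monthly_charges: two subscription tiers.
modes = [(0.6, 30.0, 5.0), (0.4, 80.0, 8.0)]

alpha, beta = mode_specific_normalise(84.0, modes)
print(alpha, beta)  # 0.125 [0, 1]
```

A charge of 84 is represented as "mode 2, slightly above its centre" instead of as an outlier on a single global scale.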
Install and run CTGAN
# Install the SDV library which bundles CTGAN
pip install sdv ctgan
import pandas as pd
import numpy as np
from ctgan import CTGAN
# Load churn dataset (telco example)
df = pd.read_csv("churn_train.csv")
print(df["churn"].value_counts())
# 0 3286 (stayed)
# 1 483 (churned) ← imbalance ratio ~6.8:1
# ── Step 1: isolate the minority class ──────────────────────
df_minority = df[df["churn"] == 1].copy()
# Identify discrete columns (CTGAN needs to know these)
discrete_cols = ["international_plan", "voicemail_plan", "churn"]
# ── Step 2: train CTGAN on minority only ─────────────────────
ctgan = CTGAN(
    epochs=300,
    batch_size=60,                 # must be a multiple of pac (default 10)
    generator_dim=(256, 256),
    discriminator_dim=(256, 256),
    verbose=False,
)
ctgan.fit(df_minority, discrete_columns=discrete_cols)
# ── Step 3: generate synthetic minority samples ───────────────
n_majority = (df["churn"] == 0).sum()
n_minority = (df["churn"] == 1).sum()
n_generate = n_majority - n_minority # fill gap to 1:1
synthetic_minority = ctgan.sample(n_generate)
print(f"Generated {len(synthetic_minority)} synthetic churn samples")
# ── Step 4: merge into balanced training set ─────────────────
df_ctgan = pd.concat([df, synthetic_minority], ignore_index=True)
print(df_ctgan["churn"].value_counts())
SMOTE linearly interpolates between two points. CTGAN samples from the learned joint distribution of all features simultaneously. For a column like monthly_charges with a bimodal distribution (two subscription tiers), SMOTE may generate values between the two modes that no real customer would ever have. CTGAN respects the modes.
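The interpolation problem is easy to reproduce by hand. This sketch implements only SMOTE's core step (a point on the line segment between two minority samples). The two customers are hypothetical, and the interpolation weight, normally drawn at random from [0, 1], is fixed at 0.5 so the result is reproducible.

```python
def smote_interpolate(x_i, x_j, lam):
    """SMOTE's core step: a point on the segment between two minority samples."""
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]

# Hypothetical churners on two price tiers: [monthly_charges, tenure_months].
basic_tier = [30.0, 12.0]
premium_tier = [80.0, 14.0]

synthetic = smote_interpolate(basic_tier, premium_tier, lam=0.5)
print(synthetic)  # [55.0, 13.0]
```

If plans only exist at 30 and 80, a monthly charge of 55 describes a customer who cannot exist.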
Remove the noise at the boundary, not the signal.
Edited Nearest Neighbours is a targeted under-sampling rule. It removes only the samples whose class label disagrees with the majority vote of their k nearest neighbours — the ambiguous boundary points, not the informative ones deep inside each cluster.
How ENN works
- For every sample x_i in the dataset, find its k nearest neighbours (default k = 3).
- If the majority vote of those k neighbours disagrees with x_i's label, x_i is an overlap sample and is removed.
- This applies to both classes — it cleans the majority class boundary too, not just the minority.
- The result is a dataset where every sample agrees with its local neighbourhood — overlap is minimised.
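The steps above can be sketched in plain Python. This is an illustration of the majority-vote rule only, on hypothetical one-dimensional data; in practice the imbalanced-learn implementation shown further down is the one to use, where the majority-vote rule corresponds to `kind_sel="mode"`.

```python
import math
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Return indices of samples whose label matches the majority vote of
    their k nearest neighbours (Euclidean, the sample itself excluded)."""
    keep = []
    for i, xi in enumerate(X):
        nearest = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        vote = Counter(y[j] for _, j in nearest).most_common(1)[0][0]
        if vote == y[i]:
            keep.append(i)
    return keep

# Hypothetical 1-D data: one churner (label 1) sits inside the stayer cluster.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.15,), (2.0,), (2.1,), (2.2,), (2.3,)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = edited_nearest_neighbours(X, y, k=3)
print(kept)  # [0, 1, 2, 3, 5, 6, 7, 8]
```

Only index 4, the churner surrounded by stayers, is removed. The stayers next to it survive because two of their three neighbours still agree with them.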
Unlike random undersampling (which discards majority samples at random and risks losing information), ENN only removes samples that are inconsistent with their neighbourhood. A majority sample deep in majority territory is never touched. Only the boundary-straddling samples are removed.
ENN in Python with imbalanced-learn
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
# Prepare X and y — encode any object columns first
X = df_ctgan.drop("churn", axis=1)
y = df_ctgan["churn"]
# Encode object columns
for col in X.select_dtypes("object").columns:
    X[col] = LabelEncoder().fit_transform(X[col])
# Scale — ENN uses Euclidean distance, scaling is essential
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply ENN — k=3 is the default
enn = EditedNearestNeighbours(
    n_neighbors=3,
    kind_sel="all",           # keep a sample only if ALL k neighbours share its label;
                              # use "mode" for the majority-vote rule described above
    sampling_strategy="all",  # clean both classes
)
X_enn, y_enn = enn.fit_resample(X_scaled, y)
print(f"Before ENN: {X_scaled.shape[0]} samples")
print(f"After ENN: {X_enn.shape[0]} samples")
print(f"Removed: {X_scaled.shape[0] - X_enn.shape[0]} overlap samples")
print(f"Class dist: {np.bincount(y_enn)}")
Oversample first. Clean second.
CTGAN-ENN chains the two techniques in a deliberate order: CTGAN generates synthetic minority samples to address imbalance, then ENN removes the noisy boundary samples from the resulting dataset to address overlap.
Why this order matters
If you apply ENN before CTGAN, you remove real minority samples — the very samples CTGAN needs to learn from. By oversampling first, you ensure that ENN is cleaning a richer, fuller picture of the minority class and removing the noisy overlap zone without losing real signal.
Complete CTGAN-ENN pipeline — production-ready code
import pandas as pd
import numpy as np
from ctgan import CTGAN
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score
from xgboost import XGBClassifier
def ctgan_enn_pipeline(
    df_train,
    target_col="churn",
    discrete_cols=None,
    ctgan_epochs=300,
    sampling_strategy=1.0,  # 1.0 = full 1:1 balance
    enn_k=3,
):
    """
    Full CTGAN-ENN data preparation pipeline.
    Returns: X_clean (scaled array), y_clean (labels),
             sc (fitted StandardScaler), encoders (fitted LabelEncoders by column)
    """
    # ── 0. encode categoricals ──────────────────────────────────
    df = df_train.copy()
    encoders = {}
    for col in df.select_dtypes("object").columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        encoders[col] = le
    # ── 1. CTGAN oversampling on minority class ───────────────────
    # The target is constant within the minority class, so it is dropped
    # before fitting and re-attached after sampling.
    df_min = df[df[target_col] == 1].drop(columns=[target_col])
    n_maj = (df[target_col] == 0).sum()
    n_need = int(n_maj * sampling_strategy) - len(df_min)
    if n_need <= 0:
        raise ValueError("Minority class already meets the requested ratio.")
    ctgan = CTGAN(epochs=ctgan_epochs, verbose=False)
    ctgan.fit(df_min, discrete_columns=discrete_cols or [])
    synth = ctgan.sample(n_need)
    synth[target_col] = 1  # tag synthetic minority
    df_aug = pd.concat([df, synth], ignore_index=True)
    print(f"After CTGAN: {df_aug.shape[0]} samples | "
          f"class dist {np.bincount(df_aug[target_col].astype(int)).tolist()}")
    # ── 2. scale for distance-based ENN ──────────────────────────
    X = df_aug.drop(target_col, axis=1).values
    y = df_aug[target_col].values.astype(int)
    sc = StandardScaler()
    Xs = sc.fit_transform(X)
    # ── 3. ENN boundary cleaning ──────────────────────────────────
    enn = EditedNearestNeighbours(
        n_neighbors=enn_k,
        kind_sel="all",
        sampling_strategy="all",
    )
    X_clean, y_clean = enn.fit_resample(Xs, y)
    print(f"After ENN: {X_clean.shape[0]} samples | "
          f"removed {Xs.shape[0] - X_clean.shape[0]} overlap samples")
    return X_clean, y_clean, sc, encoders

# ── Run the pipeline ──────────────────────────────────────────
df_raw = pd.read_csv("churn_train.csv")
X_cl, y_cl, sc, encoders = ctgan_enn_pipeline(
    df_raw,
    target_col="churn",
    discrete_cols=["international_plan", "voicemail_plan"],
    ctgan_epochs=300,
    enn_k=3,
)
# ── Evaluate with XGBoost ─────────────────────────────────────
xgb = XGBClassifier(n_estimators=200, eval_metric="aucpr", random_state=42)
xgb.fit(X_cl, y_cl)
df_test = pd.read_csv("churn_test.csv")
# Reuse the encoders fitted on the training data; the test set is never resampled
for col, le in encoders.items():
    df_test[col] = le.transform(df_test[col])
X_te = sc.transform(df_test.drop("churn", axis=1).values)  # same scaler
y_te = df_test["churn"].values
y_pred = xgb.predict(X_te)
y_prob = xgb.predict_proba(X_te)[:, 1]
print(f"F1 : {f1_score(y_te, y_pred):.4f}")
print(f"ROC-AUC : {roc_auc_score(y_te, y_prob):.4f}")
print(f"PR-AUC : {average_precision_score(y_te, y_prob):.4f}")
Five datasets. Four classifiers. One winner.
CTGAN-ENN was evaluated on five publicly available churn datasets against nine competing methods across four classifiers. The results consistently favour CTGAN-ENN on minority-class metrics — particularly F1 and PR-AUC.
Benchmark datasets
Key findings
- Overlap reduction across all datasets. CTGAN-ENN reduced the measured class overlap score on every feature in every dataset. Overlap was quantified using the F1 overlap measure of Ho & Basu (2002), which is based on Fisher's discriminant ratio and is unrelated to the F1 classification score.
- F1 and PR-AUC improvements. CTGAN-ENN outperformed SMOTE, ADASYN, CTGAN-alone, and cost-sensitive learning baselines on minority F1 and PR-AUC in 4 out of 5 datasets (Cell2Cell excluded — near-balanced, no imbalance correction needed).
- Faster than CTGAN alone. Because ENN removes samples after CTGAN has added them, the final training set is smaller than the one produced by CTGAN alone. Training time was 12–18% lower on KKBox and Orange despite the extra data-preparation step.
- Best classifier pairing: XGB and LGB. Both gradient-boosted models benefited most from the cleaned training set. KNN showed the largest absolute recall improvement but lower precision gains.
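For readers unfamiliar with the F1 overlap measure cited in the findings above, here is a minimal sketch of Fisher's discriminant ratio for a two-class problem. The function names and toy data are hypothetical. In this raw form a higher ratio means less overlap; the summary does not say whether the paper reports the raw ratio or an inverted variant, so treat the orientation as an assumption.

```python
from statistics import mean, pvariance

def fisher_ratio(feature, labels):
    """Fisher's discriminant ratio for one feature and two classes (0 and 1):
    squared difference of class means over the sum of class variances."""
    a = [v for v, l in zip(feature, labels) if l == 0]
    b = [v for v, l in zip(feature, labels) if l == 1]
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))

def max_fisher_ratio(columns, labels):
    """F1 measure: the largest per-feature ratio (higher = less overlap)."""
    return max(fisher_ratio(col, labels) for col in columns)

# Hypothetical toy data, stored column-wise.
labels = [0, 0, 0, 1, 1, 1]
tenure = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]  # classes well separated
spend = [4.0, 5.0, 6.0, 5.0, 6.0, 7.0]   # classes heavily overlapping

print(f"tenure: {fisher_ratio(tenure, labels):.2f}")
print(f"spend : {fisher_ratio(spend, labels):.2f}")
print(f"F1    : {max_fisher_ratio([tenure, spend], labels):.2f}")
```

The well-separated feature scores far higher than the overlapping one, and the dataset-level measure is driven by the single most discriminative feature.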
Overlap score before and after — across all five datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
def overlap_score(X, y, cv=5):
    """
    Proxy overlap metric: cross-validated 1-NN error rate (cv folds).
    Low error = low overlap = clean boundary.
    High error = high overlap = many misclassified by nearest neighbour.
    """
    knn = KNeighborsClassifier(n_neighbors=1)
    err = 1 - cross_val_score(knn, X, y, cv=cv,
                              scoring="accuracy").mean()
    return round(err, 4)
# Compare overlap across preparation strategies.
# X_raw_scaled / y_raw (scaled original training data) and X_ctgan / y_ctgan
# (scaled CTGAN-augmented data before ENN) are assumed to be prepared beforehand.
from imblearn.over_sampling import SMOTE, ADASYN
strategies = {
    "Raw": (X_raw_scaled, y_raw),
    "SMOTE": SMOTE().fit_resample(X_raw_scaled, y_raw),
    "ADASYN": ADASYN().fit_resample(X_raw_scaled, y_raw),
    "CTGAN": (X_ctgan, y_ctgan),
    "CTGAN-ENN": (X_cl, y_cl),
}
for name, (Xv, yv) in strategies.items():
    ov = overlap_score(Xv, yv)
    print(f"{name:<14} overlap score: {ov:.4f}")
Five questions. Pencil down.
These questions test your understanding of the paper's core contributions — not just the code, but the reasoning behind design decisions.