When your dataset has too few of the cases that matter.
Class imbalance is the rule, not the exception, in real research. Oversampling is the family of techniques that fix it by creating more minority-class examples before training.
What is class imbalance?
A dataset is imbalanced when one class (the majority) has far more samples than another (the minority). In medical research this is almost unavoidable: disease cases are rare by definition. In fraud detection, genuine transactions can outnumber fraudulent ones by 1000 to 1 or more. In omics studies, differentially expressed genes are the minority among tens of thousands.
A naive classifier trained on imbalanced data learns to predict the majority class almost exclusively — and still achieves high accuracy. This is the accuracy paradox: 99% accuracy on a 99:1 dataset means nothing if the model never identifies a single minority case.
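The accuracy paradox is easy to reproduce with a few lines of arithmetic. The sketch below uses made-up labels (990 majority, 10 minority) and a hypothetical helper, `accuracy_and_recall`, to score a classifier that always predicts the majority class.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall on the positive class) for two label lists."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Illustrative 99:1 dataset: 990 majority (0) and 10 minority (1) labels.
y_true = [0] * 990 + [1] * 10
# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(accuracy, recall)  # 0.99 0.0
```

Accuracy looks excellent while minority recall is zero, which is why metrics such as recall, precision, or F1 on the minority class are more informative here.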
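Oversampling increases the number of minority-class samples in the training set so the model sees a more balanced picture of the problem. It must be applied to the training data only, after the train/test split: the test set keeps its natural distribution so that evaluation reflects real-world prevalence, and no copy or synthetic relative of a test sample leaks into training.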
Why not just undersample the majority?
- Information loss. Dropping majority samples discards real signal. With small datasets this can be catastrophic.
- Oversampling adds instead of removes. It preserves every real data point while expanding the minority side.
- Both can be combined. Hybrid strategies (oversample minority + undersample majority) are common in practice.
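As a rough illustration of the hybrid idea in the last bullet, the sketch below shrinks the majority class and grows the minority class to a common size. The function name `hybrid_resample` and the `target_per_class` parameter are assumptions made for this example, not part of any library.

```python
import random


def hybrid_resample(majority, minority, target_per_class, rng_seed=0):
    """Undersample the majority and oversample the minority to one target size."""
    rng = random.Random(rng_seed)
    # Undersample: keep a random subset of the majority, without replacement.
    n_keep = min(target_per_class, len(majority))
    kept_majority = rng.sample(majority, n_keep)
    # Oversample: keep every real minority sample, then add random copies.
    n_extra = max(0, target_per_class - len(minority))
    extra_minority = [rng.choice(minority) for _ in range(n_extra)]
    return kept_majority, minority + extra_minority


# Illustrative data: 900 majority and 100 minority records.
majority = [("maj", i) for i in range(900)]
minority = [("min", i) for i in range(100)]

new_majority, new_minority = hybrid_resample(majority, minority, target_per_class=300)
print(len(new_majority), len(new_minority))  # 300 300
```

Meeting in the middle discards fewer majority samples than pure undersampling and creates fewer duplicates than pure oversampling.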
The three methods in this module
Each of the next three lessons covers one method. They all solve the same problem — imbalance — but they do so in increasingly sophisticated ways.
The simplest fix: just duplicate.
Random Over Sampling (ROS) picks minority-class samples uniformly at random and copies them into the training set until the desired class ratio is reached. It is the baseline against which every other oversampling method is judged.
How it works
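The procedure described above can be sketched in plain Python. This is a minimal illustration for a binary problem, not a reference implementation: the function name `random_oversample` and the `target_ratio` parameter (minority count divided by majority count after resampling) are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, target_ratio=1.0, rng_seed=0):
    """Duplicate randomly chosen minority rows until the target ratio is met."""
    rng = random.Random(rng_seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)
    # Number of minority rows needed in total, and how many copies to add.
    n_target = int(round(target_ratio * n_majority))
    n_extra = max(0, n_target - len(minority_idx))
    # Sample minority indices uniformly at random, with replacement.
    picks = [rng.choice(minority_idx) for _ in range(n_extra)]
    X_res = list(X) + [X[i] for i in picks]
    y_res = list(y) + [y[i] for i in picks]
    return X_res, y_res


# Illustrative training set: 95 majority rows and 5 minority rows.
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

X_res, y_res = random_oversample(X, y, minority_label=1)
print(Counter(y_res)[0], Counter(y_res)[1])  # 95 95
```

All original rows are kept in their original order; the copies are simply appended, so shuffling before training is usually advisable.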
Strengths
- Zero information loss. No real samples are discarded or modified.
- Works on any data type. Images, text, tabular — ROS is feature-agnostic.
- Fast and reproducible. Sampling is random, but fixing the random seed makes every run produce the same result.
- Almost no hyperparameters. The target class ratio is the only setting, which makes ROS easy to reason about and audit.
Weaknesses
- Exact duplicates → overfitting. The model memorises copied samples instead of learning general patterns. The decision boundary can become overly tight around duplicated points.
- No new information. The model sees the same feature values multiple times; the diversity of the minority class does not increase.
- Inflates training time. More samples means more computation, even if the signal is not richer.
If the minority class is already very small (fewer than ~20 samples), duplicating does not help — the model will still have too few distinct boundary examples to generalise. Consider data collection before oversampling.
New points, not copies: interpolate between neighbours.
SMOTE — Synthetic Minority Over-sampling Technique (Chawla et al., 2002) — generates brand-new minority samples by linearly interpolating between a seed point and one of its k-nearest minority neighbours. Every new point is unique.
The algorithm, step by step
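The interpolation described above can be sketched as follows. This is a simplified illustration under stated assumptions: seeds are drawn at random rather than visited in turn, neighbours are found by brute-force Euclidean distance, and one random gap is applied to all features of a synthetic point. The name `smote_sketch` is hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, rng_seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(rng_seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a seed point from the minority class.
        seed_idx = rng.randrange(len(minority))
        seed_point = minority[seed_idx]
        # 2. Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != seed_idx]
        neighbours = sorted(others, key=lambda p: math.dist(seed_point, p))[:k]
        # 3. Choose one neighbour at random.
        neighbour = rng.choice(neighbours)
        # 4. Place the new point somewhere on the segment seed -> neighbour.
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append([s + gap * (n - s) for s, n in zip(seed_point, neighbour)])
    return synthetic


# Illustrative minority class with two numeric features.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8], [1.8, 1.9], [1.5, 1.5]]
new_points = smote_sketch(minority, n_new=20, k=3)
print(len(new_points))  # 20
```

Because every synthetic point lies on a segment between two real minority samples, none of them can fall outside the bounding box of the original minority data.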
Why interpolation helps
Because new points lie between existing minority samples, SMOTE stays within the region of feature space already spanned by the minority class. It fills the gaps between real samples, which encourages the model to learn a broader minority decision region rather than tight pockets around individual points.
k — the number of nearest neighbours. Smaller k generates points closer to existing samples (safer but less diverse). Larger k spans the minority space more boldly (more diverse but risks crossing into majority territory).
SMOTE limitations
- Noisy regions. If a minority sample sits in majority territory, its interpolated children will also sit there — adding noise rather than signal.
- Joint feature structure ignored. A point on the straight line between two valid samples is not guaranteed to be valid itself, so interpolation can produce biologically or physically impossible feature combinations, especially in high-dimensional data.
- Categorical features. Linear interpolation is undefined for categories. The SMOTE-NC extension (the SMOTENC class in imbalanced-learn) handles mixed numeric and categorical data.
- Uniform generation. SMOTE treats all minority regions equally — it generates the same density of new points regardless of how hard or easy the local decision boundary is.
If your minority class has mislabelled samples or extreme outliers, SMOTE will interpolate toward them and create new noisy synthetic points. Always inspect and clean the minority class before applying SMOTE.
Generate more where the boundary is hardest.
ADASYN (Adaptive Synthetic Sampling; He et al., 2008) extends SMOTE by making generation density-aware. Minority samples deep inside their own cluster receive fewer synthetic samples; those surrounded by majority samples receive many more.
How ADASYN differs from SMOTE
SMOTE generates the same number of synthetic samples from every minority seed. ADASYN first measures how hard each minority sample is to classify, then allocates more generation budget to the hard ones.
The algorithm
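The allocation idea described above can be sketched as follows. This is a simplified illustration, not the authors' reference code: difficulty `r_i` is taken as the fraction of majority points among a minority sample's k nearest neighbours, the generation budget is split in proportion to `r_i`, and interpolation partners are drawn from the k nearest minority neighbours. The names `adasyn_sketch` and `beta` (desired balance level, 1.0 meaning full balance) are assumptions for the example.

```python
import math
import random


def adasyn_sketch(majority, minority, k=5, beta=1.0, rng_seed=0):
    """Return (synthetic points, difficulty ratios r_i) for a two-class dataset."""
    rng = random.Random(rng_seed)
    budget = (len(majority) - len(minority)) * beta  # total points to generate
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]

    # Step 1: r_i = share of majority points among the k nearest neighbours.
    ratios = []
    for i, point in enumerate(minority):
        own_index = len(majority) + i
        others = [item for j, item in enumerate(labelled) if j != own_index]
        nearest = sorted(others, key=lambda item: math.dist(point, item[0]))[:k]
        ratios.append(sum(1 for _, label in nearest if label == 0) / len(nearest))

    total = sum(ratios)
    if total == 0:
        return [], ratios  # no minority point has majority neighbours

    # Step 2: give each seed a share of the budget proportional to r_i,
    # then interpolate towards random minority neighbours, as in SMOTE.
    synthetic = []
    for i, point in enumerate(minority):
        n_for_seed = round(ratios[i] / total * budget)
        partners = [q for j, q in enumerate(minority) if j != i]
        partners = sorted(partners, key=lambda q: math.dist(point, q))[:k]
        for _ in range(n_for_seed):
            partner = rng.choice(partners)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(point, partner)])
    return synthetic, ratios


# Illustrative data: a 5x5 majority grid, a tight minority cluster far away,
# and one minority point sitting inside majority territory.
majority = [[float(x), float(y)] for x in range(5) for y in range(5)]
minority = [[10.0, 10.0], [10.2, 10.0], [10.0, 10.2], [10.2, 10.2],
            [10.1, 10.1], [10.1, 10.3], [2.5, 2.5]]

synthetic, ratios = adasyn_sketch(majority, minority, k=5)
print(ratios)          # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(len(synthetic))  # 18
```

In this toy dataset the whole budget goes to the single boundary point, which shows both the appeal of ADASYN and its sensitivity to a mislabelled sample in that position.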
ADASYN focuses the model's attention on the decision boundary. By generating more samples where the minority class bleeds into majority territory, it forces the classifier to sharpen its boundary exactly where it matters most.
When ADASYN outperforms SMOTE
- When the minority class is not uniformly spread — it has some dense safe regions and some isolated boundary samples.
- When you care about recall on the hardest samples — ADASYN explicitly trains the model on them.
- When overfitting to easy minority samples is already happening with SMOTE.
When ADASYN can hurt
- Noisy boundary samples get amplified. If a minority sample is in majority territory because it is mislabelled, ADASYN generates many new noisy points near it.
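- Extreme imbalance. When minority samples are almost entirely surrounded by majority points, the difficulty ratio r_i (the share of majority-class points among a seed's k nearest neighbours) is close to 1 for nearly all seeds. The adaptive weighting then no longer separates hard seeds from easy ones, and most synthetic points land in majority-dominated regions.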
- Small datasets. k-NN estimation of difficulty is unreliable when n is small. Consider Leave-One-Out cross-validation difficulty estimates in that case.
ADASYN's difficulty measure depends on absolute distances. Always scale your features (z-score or min-max) before applying ADASYN so that high-variance features do not dominate the neighbourhood computation.
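A small example shows why scaling matters for any neighbour-based step. The sketch below uses three made-up rows with one large-scale feature and one small-scale feature; the helpers `zscore_columns` and `nearest_index` are hypothetical, and the z-score uses the population standard deviation and assumes no column is constant.

```python
import math
import statistics


def zscore_columns(rows):
    """Standardise each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # assumes std > 0
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def nearest_index(rows, query_idx):
    """Index of the row closest (Euclidean) to rows[query_idx], excluding itself."""
    candidates = [j for j in range(len(rows)) if j != query_idx]
    return min(candidates, key=lambda j: math.dist(rows[query_idx], rows[j]))


# Illustrative rows: [large-scale feature, small-scale feature in 0..1].
rows = [
    [50000.0, 0.10],  # row 0: the query point
    [50500.0, 0.90],  # row 1: close on feature 1, far on feature 2
    [50800.0, 0.12],  # row 2: slightly further on feature 1, close on feature 2
]

nearest_unscaled = nearest_index(rows, 0)
nearest_scaled = nearest_index(zscore_columns(rows), 0)
print(nearest_unscaled, nearest_scaled)  # 1 2
```

Before scaling, the large-scale feature decides the neighbourhood on its own; after scaling, both features contribute and the nearest neighbour changes.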
Three methods, one decision.
ROS, SMOTE, and ADASYN are not competitors — they are tools for different situations. Knowing when each one wins is as important as knowing how each one works.
| Method | New information? | Boundary-aware? | Categorical support | Main risk |
|---|---|---|---|---|
| ROS | No — exact copies | No | Yes — any type | Overfitting to duplicates |
| SMOTE | Yes — interpolated | No — uniform | Mixed (SMOTE-NC) | Generates in noisy regions |
| ADASYN | Yes — interpolated | Yes — density-weighted | Numeric only | Amplifies noisy boundary samples |
Decision guide
- Use ROS when you have mixed feature types, a very small dataset, or need a fast reproducible baseline to beat.
- Use SMOTE when your minority class is reasonably clean and you want diversity without boundary bias. It is the most widely used starting point in published research.
- Use ADASYN when recall on hard boundary cases matters most and you are confident the minority samples near the boundary are correctly labelled.
Practical checklist before oversampling
In Python, all three methods, and many more, are implemented in imbalanced-learn (imported as imblearn). Its samplers follow scikit-learn's estimator conventions and add a fit_resample method: fit_resample(X_train, y_train) returns the resampled features and labels.