OmicsHub Space
Module B02 · Oversampling Methods
Lesson 01 · The Problem

When your dataset has too few of the cases that matter.

Class imbalance is the rule, not the exception, in real research. Oversampling is the family of techniques that fix it by creating more minority-class examples before training.

What is class imbalance?

A dataset is imbalanced when one class (the majority) has far more samples than another (the minority). In medical research this is almost unavoidable: disease cases are rare by definition. In fraud detection, genuine transactions can outnumber fraudulent ones by 1000 to 1. In omics studies, differentially expressed genes are the minority among tens of thousands.

A naive classifier trained on imbalanced data learns to predict the majority class almost exclusively — and still achieves high accuracy. This is the accuracy paradox: 99% accuracy on a 99:1 dataset means nothing if the model never identifies a single minority case.
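The accuracy paradox is easy to reproduce with a few lines of arithmetic. The sketch below uses made-up labels (990 majority, 10 minority) and a hypothetical "model" that always predicts the majority class; none of the names come from a real library.

```python
# Toy labels: 990 majority (0) and 10 minority (1), i.e. a 99:1 dataset.
y_true = [0] * 990 + [1] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of labels predicted correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Minority recall: detected minority cases / all minority cases.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_pos / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.99
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The model scores 99% accuracy without finding a single minority case, which is why accuracy alone cannot be trusted on imbalanced data.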

The core idea

Oversampling increases the number of minority-class samples in the training set so the model sees a more balanced picture of the problem. It operates on the training data only — the test set is always kept at its natural distribution.

Why not just undersample the majority?

  • Information loss. Dropping majority samples discards real signal. With small datasets this can be catastrophic.
  • Oversampling adds instead of removes. It preserves every real data point while expanding the minority side.
  • Both can be combined. Hybrid strategies (oversample minority + undersample majority) are common in practice.

The three methods in this module

Each of the next three lessons covers one method. They all solve the same problem — imbalance — but they do so in increasingly sophisticated ways.

1. Random Over Sampling (ROS)
Pick minority samples at random and duplicate them. Simple, zero information loss, but exact duplicates raise the risk of overfitting.
2. SMOTE
Create new synthetic minority samples by interpolating between existing ones and their nearest neighbours. Adds diversity instead of exact copies.
3. ADASYN
Like SMOTE, but biases generation toward the hardest minority samples: those surrounded by many majority neighbours.

See the imbalance

The widget below shows a dataset before and after oversampling. The imbalance ratio slider controls how extreme the problem is. Press Generate to draw a new random dataset.

[Interactive widget: Class imbalance visualiser. Imbalance ratio slider (default 10:1). Panels: Original (imbalanced); After ROS (balanced). Readouts: majority samples, minority (original), minority (after ROS), ratio after.]
Lesson 02 · Random Over Sampling

The simplest fix: just duplicate.

Random Over Sampling (ROS) picks minority-class samples uniformly at random and copies them into the training set until the desired class ratio is reached. It is the baseline against which every other oversampling method is judged.

How it works

1. Choose a target ratio
Decide how many minority samples you want relative to the majority. A 1:1 ratio is common but not always optimal; sometimes 1:2 or 1:3 is enough.
2. Sample with replacement
Draw minority samples randomly, with replacement, until the gap is filled. The same sample may appear more than once; this is by design.
3. Append and train
Add the new copies to the original training set. The test set is never touched. Train as normal on the augmented dataset.
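The three steps above can be sketched in plain Python. This is a minimal illustration for a binary problem, not the imbalanced-learn implementation; the function name `random_oversample` and its parameters are hypothetical, and it assumes at least one minority row exists.

```python
import random

def random_oversample(X, y, minority_label, target_ratio=1.0, seed=0):
    """Duplicate minority rows until minority/majority reaches target_ratio.

    target_ratio=1.0 means 1:1; 0.5 means one minority row per two majority rows.
    Assumes a binary problem with at least one minority row.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)

    # Step 1: how many copies are needed to reach the target ratio (never negative).
    n_target = int(round(target_ratio * n_majority))
    n_extra = max(0, n_target - len(minority_idx))

    # Step 2: sample with replacement, so the same row may be picked repeatedly.
    picks = rng.choices(minority_idx, k=n_extra)

    # Step 3: append the copies to the untouched original training data.
    X_res = list(X) + [X[i] for i in picks]
    y_res = list(y) + [minority_label] * n_extra
    return X_res, y_res

# Toy training set with a 10:1 imbalance (80 majority rows, 8 minority rows).
X_train = [[float(i), float(i % 7)] for i in range(88)]
y_train = [0] * 80 + [1] * 8
X_res, y_res = random_oversample(X_train, y_train, minority_label=1)
print(y_res.count(0), y_res.count(1))  # 80 80
```

Passing `target_ratio=0.5` instead would stop at 40 minority rows, which corresponds to the 1:2 ratio mentioned in step 1.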

Strengths

  • Zero information loss. No real samples are discarded or modified.
  • Works on any data type. Images, text, tabular — ROS is feature-agnostic.
  • Fast and reproducible. Sampling is random, but fixing the random seed makes the result fully reproducible.
  • One hyperparameter. Only the target ratio needs choosing, which makes ROS easy to reason about and audit.

Weaknesses

  • Exact duplicates → overfitting. The model memorises copied samples instead of learning general patterns. The decision boundary can become overly tight around duplicated points.
  • No new information. The model sees the same feature values multiple times; the diversity of the minority class does not increase.
  • Inflates training time. More samples means more computation, even if the signal is not richer.
When ROS fails

If the minority class is already very small (fewer than ~20 samples), duplicating does not help — the model will still have too few distinct boundary examples to generalise. Consider data collection before oversampling.

[Interactive widget: Random Over Sampling. Imbalance ratio slider (default 8:1). Panels: Before ROS; After ROS, duplicates highlighted. Readouts: original minority, duplicates added, majority count, new ratio.]
What to notice
Duplicated points (darker blue rings) sit exactly on top of originals. The model will encounter these coordinates multiple times — this is the overfitting risk. Increase the imbalance ratio and watch how many exact copies accumulate.
Lesson 03 · SMOTE

New points, not copies: interpolate between neighbours.

SMOTE (Synthetic Minority Over-sampling Technique; Chawla et al., 2002) generates brand-new minority samples by linearly interpolating between a seed point and one of its k nearest minority neighbours. New points are almost never exact copies of existing samples.

The algorithm, step by step

1. Choose a seed sample
Pick a minority-class sample x at random.
2. Find k nearest minority neighbours
Compute Euclidean distance to all other minority samples. Keep the closest k (default k = 5). This neighbourhood defines the safe interpolation zone.
3. Pick one neighbour and a random λ
Select one neighbour xn from the k candidates. Draw a random scalar λ ∈ [0, 1] uniformly.
4. Create the synthetic point
xnew = x + λ · (xn − x). The new point lies somewhere on the line segment between the seed and its neighbour.
5. Repeat until balanced
Keep drawing seeds and neighbours until the target count is reached.
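The five steps translate almost line for line into code. The sketch below is a brute-force illustration in plain Python, assuming numeric features and a small minority class; `smote_sketch` and its arguments are hypothetical names, and a production library will differ in details such as neighbour search and seed selection.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    if k < 1:
        raise ValueError("need at least two minority samples to interpolate")

    synthetic = []
    for _ in range(n_new):  # Step 5: repeat until the target count is reached.
        # Step 1: pick a seed sample x at random.
        i = rng.randrange(len(minority))
        x = minority[i]

        # Step 2: k nearest minority neighbours by Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours = others[:k]

        # Step 3: pick one neighbour and a random lambda (rng.random() is in [0, 1)).
        x_n = minority[rng.choice(neighbours)]
        lam = rng.random()

        # Step 4: x_new = x + lambda * (x_n - x), same lambda for every feature.
        synthetic.append([a + lam * (b - a) for a, b in zip(x, x_n)])
    return synthetic

# Six hypothetical minority samples with two features each.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [2.5, 2.4], [3.0, 1.8], [1.2, 2.6]]
new_points = smote_sketch(minority, n_new=4, k=3)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point sits on a segment between two real minority samples, none of them can fall outside the bounding box of the original minority data.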

Why interpolation helps

Because new points lie between existing minority samples, SMOTE stays within the feature space already occupied by the minority class. It widens the decision boundary the model learns rather than just deepening it at existing points.

Key parameter

k — the number of nearest neighbours. Smaller k generates points closer to existing samples (safer but less diverse). Larger k spans the minority space more boldly (more diverse but risks crossing into majority territory).

SMOTE limitations

  • Noisy regions. If a minority sample sits in majority territory, its interpolated children will also sit there — adding noise rather than signal.
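  • Straight-line interpolation. SMOTE applies the same λ to every feature, so each new point lies on a straight line between two real samples. If the minority class occupies a curved or fragmented region, that line can pass through biologically or physically impossible combinations, especially in high-dimensional data.
  • Categorical features. Linear interpolation is undefined for categories. The SMOTE-NC extension (SMOTENC in imbalanced-learn) handles mixed numeric and categorical types.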
  • Uniform generation. SMOTE treats all minority regions equally — it generates the same density of new points regardless of how hard or easy the local decision boundary is.
SMOTE does not fix bad data

If your minority class has mislabelled samples or extreme outliers, SMOTE will interpolate toward them and create new noisy synthetic points. Always inspect and clean the minority class before applying SMOTE.

[Interactive widget: SMOTE interpolation demo. Two sliders (defaults 3 and 20). Panels: Before SMOTE; After SMOTE, new points in blue. Readouts: original minority, synthetic added, k used, unique new points (100%).]
What to notice
New blue points (hollow circles) lie along line segments connecting existing minority samples. Increase k to 7 — the interpolation web widens, potentially reaching into majority territory. Decrease k to 1 — new points cluster very close to their seed.
Lesson 04 · ADASYN

Generate more where the boundary is hardest.

ADASYN (Adaptive Synthetic Sampling; He et al., 2008) extends SMOTE by adapting how many synthetic samples each minority point receives. Minority samples deep inside their own cluster get few synthetic samples; those surrounded by majority samples get many more.

How ADASYN differs from SMOTE

SMOTE draws its seeds uniformly, so every minority sample receives roughly the same number of synthetic samples. ADASYN first measures how hard each minority sample is to classify, then allocates more generation budget to the hard ones.

The algorithm

1. Find k nearest neighbours for each minority sample
For every minority sample xi, find its k nearest neighbours in the whole dataset (majority + minority combined).
2. Compute the difficulty ratio ri
ri = Δi / k, where Δi is the number of majority-class samples among the k neighbours. A sample with ri = 1 is completely surrounded by the majority and is maximally hard.
3. Normalise to get a density distribution
Divide each ri by the sum of all r values so they form a probability distribution: r̂i = ri / Σr.
4. Allocate generation budget
The total number of synthetic samples needed is G. Each seed xi generates gi = r̂i × G new samples (rounded to a whole number) using the same SMOTE interpolation step.
The adaptive insight

ADASYN focuses the model's attention on the decision boundary. By generating more samples where the minority class bleeds into majority territory, it forces the classifier to sharpen its boundary exactly where it matters most.
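The difficulty ratio and budget allocation can be sketched as follows. This is an illustrative brute-force version in plain Python, assuming numeric features and more than k samples in total; `adasyn_budget` is a hypothetical name and the toy coordinates are invented. It only computes ri, r̂i and gi; the interpolation itself would reuse the SMOTE step.

```python
import math

def adasyn_budget(X, y, minority_label, G, k=5):
    """Return (r, r_hat, g) per minority sample, in dataset order."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    r = []
    for i in minority_idx:
        # Step 1: k nearest neighbours in the whole dataset, excluding the point itself.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))
        neighbours = others[:k]
        # Step 2: r_i = (number of majority neighbours) / k.
        delta = sum(1 for j in neighbours if y[j] != minority_label)
        r.append(delta / k)

    total = sum(r)
    if total == 0:
        raise ValueError("no minority sample has a majority neighbour")
    # Step 3: normalise so the weights sum to 1.
    r_hat = [ri / total for ri in r]
    # Step 4: share out the budget G; rounding means sum(g) can differ slightly from G.
    g = [round(w * G) for w in r_hat]
    return r, r_hat, g

# Four minority points in a tight cluster, one inside the majority cloud,
# and one on its edge. Majority points sit around (5, 5).
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0], [4.5, 5.0],
     [5.1, 5.0], [5.0, 5.1], [4.9, 5.0], [5.0, 4.9], [5.2, 5.2], [6.0, 6.0]]
y = [1] * 6 + [0] * 6
r, r_hat, g = adasyn_budget(X, y, minority_label=1, G=10, k=3)
print(g)  # [0, 0, 0, 0, 6, 4]
```

The four cluster points have only minority neighbours (ri = 0) and receive nothing, while the point inside the majority cloud (ri = 1) and the point on its edge (ri = 2/3) split the whole budget.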

When ADASYN outperforms SMOTE

  • When the minority class is not uniformly spread — it has some dense safe regions and some isolated boundary samples.
  • When you care about recall on the hardest samples — ADASYN explicitly trains the model on them.
  • When overfitting to easy minority samples is already happening with SMOTE.

When ADASYN can hurt

  • Noisy boundary samples get amplified. If a minority sample is in majority territory because it is mislabelled, ADASYN generates many new noisy points near it.
  • Extreme imbalance. When minority samples are almost entirely surrounded by majority points, ri ≈ 1 for nearly all seeds, so the normalised weights are close to uniform and the adaptive allocation adds little over SMOTE.
  • Small datasets. k-NN estimation of difficulty is unreliable when n is small. Consider Leave-One-Out cross-validation difficulty estimates in that case.
Pre-processing note

ADASYN's difficulty measure depends on absolute distances. Always scale your features (z-score or min-max) before applying ADASYN so that high-variance features do not dominate the neighbourhood computation.
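A small example shows why scaling matters for neighbour search. The sketch below uses four invented samples with one feature on a 0 to 1 scale and one in the thousands; `zscore` and `nearest` are hypothetical helpers written for this illustration. In a real workflow the means and standard deviations would be estimated on the training set only.

```python
import math
import statistics

def zscore(rows):
    """Standardise each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nearest(rows, i):
    """Index of the row closest to rows[i] by Euclidean distance."""
    return min((j for j in range(len(rows)) if j != i),
               key=lambda j: math.dist(rows[i], rows[j]))

# Feature 1 is on a 0-1 scale, feature 2 is a raw count in the thousands.
rows = [
    [0.10, 1000.0],  # sample 0 (the query)
    [0.12, 3000.0],  # sample 1: almost identical on feature 1
    [0.90, 1100.0],  # sample 2: very different on feature 1, similar count
    [0.50, 5000.0],  # sample 3
]

print(nearest(rows, 0))          # 2: the raw count dominates the distance
print(nearest(zscore(rows), 0))  # 1: both features now contribute comparably
```

Before scaling, the neighbour of sample 0 is decided almost entirely by the count feature; after z-scoring, the neighbour changes, and so would any difficulty ratio computed from that neighbourhood.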

[Interactive widget: ADASYN difficulty-weighted generation. Two sliders (defaults 5 and 40). Panels: Before, dots sized by difficulty ri; After, generation density reflects difficulty. Readouts: original minority, synthetic added, max ri (hardest), mean ri.]
What to notice
Minority points drawn with larger circles have higher ri — they are surrounded by more majority samples and receive more synthetic children. Points deep inside the minority cluster (small circles) receive few or none. This is what "adaptive" means.
Lesson 05 · Comparison

Three methods, one decision.

ROS, SMOTE, and ADASYN are not competitors — they are tools for different situations. Knowing when each one wins is as important as knowing how each one works.

Method | New information? | Boundary-aware? | Categorical support | Main risk
------ | ---------------- | --------------- | ------------------- | ---------
ROS | No (exact copies) | No | Yes (any type) | Overfitting to duplicates
SMOTE | Yes (interpolated) | No (uniform) | Mixed (SMOTE-NC) | Generates in noisy regions
ADASYN | Yes (interpolated) | Yes (density-weighted) | Numeric only | Amplifies noisy boundary samples

Decision guide

  • Use ROS when you have mixed feature types, a very small dataset, or need a fast reproducible baseline to beat.
  • Use SMOTE when your minority class is reasonably clean and you want diversity without boundary bias. It is the most widely used starting point in published research.
  • Use ADASYN when recall on hard boundary cases matters most and you are confident the minority samples near the boundary are correctly labelled.

Practical checklist before oversampling

Scale your features first
SMOTE and ADASYN both rely on Euclidean distance. Un-scaled features with different ranges will distort neighbourhoods. Apply z-score or min-max scaling before oversampling.
Oversample training only
The test set must reflect the real-world distribution. Never apply oversampling before a train/test split, and never apply it to the test fold.
Inspect your minority class
Mislabelled samples, outliers, and duplicates in the minority class are amplified by all three methods. Clean the minority class before oversampling.
Evaluate with the right metric
Accuracy is misleading on imbalanced data. Use F1 (macro or weighted), ROC-AUC, precision-recall AUC, or geometric mean of sensitivity and specificity.
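Several of the metrics in the last checklist item follow directly from the confusion matrix. The sketch below is a minimal stdlib-only illustration for a binary problem; `imbalance_metrics` is a hypothetical helper, the labels are invented, and the F1 shown is for the minority class only (macro or weighted F1 and the AUC metrics are not covered here).

```python
import math

def imbalance_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix metrics for a binary problem; `positive` is the minority label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else 0.0  # convention: undefined ratios count as 0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)           # also called sensitivity
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }

# 90 majority and 10 minority cases; the model finds 6 of the 10
# minority cases and raises 5 false alarms.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [1] * 6 + [0] * 4
metrics = imbalance_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.91 even though the model misses 4 of the 10 minority cases; recall, F1 and the geometric mean make that weakness visible.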
The imbalanced-learn library

In Python, all three methods, and many more, are implemented in imbalanced-learn (import as imblearn). Its samplers follow scikit-learn conventions: calling fit_resample(X_train, y_train) returns the resampled features and labels.

[Interactive widget: ROS vs SMOTE vs ADASYN, side by side. Imbalance ratio slider (default 6:1). Panels: ROS; SMOTE; ADASYN.]
Reading the charts
All three reach the same final class count. The difference is where the new minority points appear: ROS stacks on top of originals; SMOTE spreads evenly across the minority region; ADASYN concentrates near the boundary with the majority.
Lesson 06 · Checkpoint

Five questions. Pencil down.

Work through each question. The lesson that covers it is one sidebar click away if you get stuck.

Q01
Why should oversampling be applied to the training set only, and never to the test set?
A. Because oversampling is computationally expensive and would slow down evaluation
B. Because minority samples in the test set are already balanced by default
C. Because the test set must reflect the real-world class distribution to give an honest estimate of performance
D. Because SMOTE and ADASYN cannot run on labelled test data
The test set simulates deployment. If you oversample it, you are evaluating on a distribution that does not exist in the real world and your metrics will be misleadingly optimistic on the minority class.
Q02
What is the main overfitting risk introduced by Random Over Sampling?
A. ROS generates synthetic points that fall outside the original feature space
B. The model sees exact duplicate samples and learns to memorise them instead of generalising
C. ROS changes the feature values of minority samples during duplication
D. ROS always produces a dataset with a 1:1 class ratio which confuses the model
ROS copies rows verbatim. The same feature vector appears multiple times in the training set. The model learns to fire confidently on those coordinates, tightening the decision boundary around copies rather than learning general minority patterns.
Q03
In SMOTE, a new synthetic sample xnew is created by the formula xnew = x + λ·(xn − x). What is λ?
A. A random scalar drawn uniformly from [0, 1]
B. The Euclidean distance between x and xn
C. The number of nearest neighbours k
D. The imbalance ratio between majority and minority classes
λ ∈ [0,1] is drawn uniformly at random. When λ=0, xnew = x (the seed itself). When λ=1, xnew = xn (the neighbour). Any value in between places the new point somewhere along the line segment joining the two.
Q04
ADASYN computes a difficulty ratio ri for each minority sample. What does a high ri indicate?
A. The sample is easy to classify because it sits deep inside the minority cluster
B. The sample has many minority-class neighbours and should receive fewer synthetic children
C. The sample is surrounded by many majority-class neighbours and is hard to classify correctly
D. The sample has missing values that make distance computation unreliable
ri = Δi/k where Δi is the count of majority-class samples in the k-neighbourhood. High ri means the minority sample is "drowning" in majority neighbours — a hard, boundary-region sample — so ADASYN generates more synthetic data near it.
Q05
You have a medical dataset with both numeric and categorical features, and the minority class has fewer than 15 samples. Which method is most appropriate to start with?
A. Random Over Sampling: it handles any feature type and provides a clean baseline
B. SMOTE: it always outperforms ROS regardless of dataset size
C. ADASYN: boundary-awareness is always the priority in medical data
D. No oversampling: imbalance is natural and should be left alone
With fewer than ~15 minority samples, k-NN distances for SMOTE and ADASYN are unreliable. With mixed feature types, standard SMOTE's linear interpolation is undefined for categoricals. ROS is the correct starting point — it is feature-agnostic and still guarantees no information loss.