Lesson 01 · The Problem

When your dataset has too few of the cases that matter.

Class imbalance is the rule, not the exception, in real research. Oversampling is the family of techniques that fix it by creating more minority-class examples before training.

What is class imbalance?

A dataset is imbalanced when one class (the majority) has far more samples than another (the minority). In medical research this is almost unavoidable: cases of a rare disease are scarce by definition. In fraud detection, genuine transactions can outnumber fraudulent ones by 1000 to 1. In omics studies, differentially expressed genes are typically a small minority among tens of thousands.

A naive classifier trained on imbalanced data learns to predict the majority class almost exclusively — and still achieves high accuracy. This is the accuracy paradox: 99% accuracy on a 99:1 dataset means nothing if the model never identifies a single minority case.
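
The accuracy paradox can be reproduced in a few lines. The sketch below uses made-up labels at a 99:1 ratio and a stand-in "model" that always predicts the majority class; the names y_true and y_pred are illustrative only.

```python
# Hypothetical toy labels: 990 majority (0) and 10 minority (1) samples.
y_true = [0] * 990 + [1] * 10

# A stand-in "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Minority recall: minority cases found / all minority cases.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.99
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The accuracy looks excellent while the number of minority cases found is zero, which is why accuracy alone is not a useful target here.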

The core idea

Oversampling increases the number of minority-class samples in the training set so the model sees a more balanced picture of the problem. It operates on the training data only — the test set is always kept at its natural distribution.

Why not just undersample the majority?

Information loss. Dropping majority samples discards real signal. With small datasets this can be catastrophic.
Oversampling adds instead of removes. It preserves every real data point while expanding the minority side.
Both can be combined. Hybrid strategies (oversample minority + undersample majority) are common in practice.

The three methods in this module

Each of the next three lessons covers one method. They all solve the same problem — imbalance — but they do so in increasingly sophisticated ways.

1

Random Over Sampling (ROS)

Pick minority samples at random and duplicate them. Simple and loses no information, but the exact duplicates invite overfitting.

2

SMOTE

Create new synthetic minority samples by interpolating between existing ones and their nearest neighbours. Adds diversity instead of exact copies.

3

ADASYN

Like SMOTE, but biases generation toward the hardest minority samples — those surrounded by many majority neighbours.

Lesson 02 · Random Over Sampling

The simplest fix: just duplicate.

Random Over Sampling (ROS) picks minority-class samples uniformly at random and copies them into the training set until the desired class ratio is reached. It is the baseline against which every other oversampling method is judged.

How it works

1

Choose a target ratio

Decide how many minority samples you want relative to the majority. A 1:1 ratio is common but not always optimal — sometimes 1:2 or 1:3 is enough.

2

Sample with replacement

Draw minority samples randomly, with replacement, until the gap is filled. The same sample may appear more than once — this is by design.

3

Append and train

Add the new copies to the original training set. The test set is never touched. Train as normal on the augmented dataset.
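
The three steps above can be sketched in plain Python. This is a minimal illustration for a binary problem, not a library implementation; the function name random_oversample, its parameters, and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, target_ratio=1.0, seed=0):
    """Duplicate minority rows until minority/majority reaches target_ratio.

    Assumes a binary problem: every label that is not minority_label
    counts as majority.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)

    # Step 1: how many minority rows the target ratio asks for.
    n_target = int(round(target_ratio * n_majority))
    n_extra = max(0, n_target - len(minority_idx))

    # Step 2: draw with replacement, so the same row may be picked repeatedly.
    drawn = rng.choices(minority_idx, k=n_extra)

    # Step 3: append the copies after the untouched original training rows.
    X_res = list(X) + [X[i] for i in drawn]
    y_res = list(y) + [y[i] for i in drawn]
    return X_res, y_res


# Toy training set with an 8:1 imbalance (feature values are made up).
X_train = [[float(i), float(i % 7)] for i in range(90)]
y_train = [0] * 80 + [1] * 10

X_res, y_res = random_oversample(X_train, y_train, minority_label=1)
print(Counter(y_res))  # Counter({0: 80, 1: 80})
```

Passing target_ratio=0.5 instead would stop at 40 minority rows, which corresponds to the 1:2 option mentioned in step 1.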

Strengths

Zero information loss. No real samples are discarded or modified.
Works on any data type. Images, text, tabular — ROS is feature-agnostic.
Fast and reproducible. Sampling is random, but fixing the random seed makes every run identical.
Almost no hyperparameters. The target ratio is the only setting, which makes ROS easy to reason about and audit.

Weaknesses

Exact duplicates → overfitting. The model memorises copied samples instead of learning general patterns. The decision boundary can become overly tight around duplicated points.
No new information. The model sees the same feature values multiple times; the diversity of the minority class does not increase.
Inflates training time. More samples means more computation, even if the signal is not richer.

When ROS fails

If the minority class is already very small (fewer than ~20 samples), duplicating does not help — the model will still have too few distinct boundary examples to generalise. Consider data collection before oversampling.

Lesson 03 · SMOTE

New points, not copies: interpolate between neighbours.

SMOTE — Synthetic Minority Over-sampling Technique (Chawla et al., 2002) — generates brand-new minority samples by linearly interpolating between a seed point and one of its k-nearest minority neighbours. Every new point is unique.

The algorithm, step by step

1

Choose a seed sample

Pick a minority-class sample x at random.

2

Find k nearest minority neighbours

Compute Euclidean distance to all other minority samples. Keep the closest k (default k = 5). This neighbourhood defines the safe interpolation zone.

3

Pick one neighbour and a random λ

Select one neighbour x_n from the k candidates. Draw a random scalar λ ∈ [0, 1] uniformly.

4

Create the synthetic point

x_new = x + λ · (x_n − x). The new point lies somewhere on the line segment between the seed and its neighbour.

5

Repeat until balanced

Keep drawing seeds and neighbours until the target count is reached.
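
The five steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions (numeric features only, a brute-force neighbour search, at least two minority samples); the function name smote and the toy points are hypothetical, and a real project would normally use a library implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    # Step 5: repeat until the requested number of points exists.
    for _ in range(n_new):
        # Step 1: pick a seed sample at random.
        i = rng.randrange(len(minority))
        x = minority[i]

        # Step 2: k nearest minority neighbours by Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours = others[:k]

        # Step 3: pick one neighbour and a random lambda in [0, 1).
        x_n = minority[rng.choice(neighbours)]
        lam = rng.random()

        # Step 4: x_new = x + lambda * (x_n - x), with one lambda for all features.
        synthetic.append([a + lam * (b - a) for a, b in zip(x, x_n)])
    return synthetic


# Toy minority class in two dimensions (values are made up).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0], [2.5, 2.0], [1.2, 1.8]]
new_points = smote(minority, n_new=20, k=3)
print(len(new_points))  # 20
```

Because every synthetic point is a blend of two real minority points, it cannot leave the bounding box of the original minority samples.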

Why interpolation helps

Because new points lie between existing minority samples, SMOTE stays within the region of feature space spanned by the minority class. This broadens the minority decision region the model learns instead of just reinforcing it at the existing points.

Key parameter

k — the number of nearest neighbours. Smaller k generates points closer to existing samples (safer but less diverse). Larger k spans the minority space more boldly (more diverse but risks crossing into majority territory).

SMOTE limitations

Noisy regions. If a minority sample sits in majority territory, its interpolated children will also sit there — adding noise rather than signal.
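Off-manifold points. SMOTE applies a single λ to all features, so each synthetic point lies on a straight line between two real samples. If the minority class occupies a curved or disjoint region, that line can pass through biologically or physically impossible combinations, especially in high-dimensional data.
Categorical features. Linear interpolation is undefined for categories. The SMOTE-NC extension (SMOTENC in imbalanced-learn) handles mixed numeric and categorical features.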
Uniform generation. SMOTE treats all minority regions equally — it generates the same density of new points regardless of how hard or easy the local decision boundary is.

SMOTE does not fix bad data

If your minority class has mislabelled samples or extreme outliers, SMOTE will interpolate toward them and create new noisy synthetic points. Always inspect and clean the minority class before applying SMOTE.

Lesson 04 · ADASYN

Generate more where the boundary is hardest.

ADASYN (Adaptive Synthetic Sampling; He et al., 2008) extends SMOTE by making generation density-aware. Minority samples deep inside their own cluster get few or no synthetic samples; those surrounded by majority samples get many more.

How ADASYN differs from SMOTE

SMOTE gives every minority seed the same chance of being used, so each contributes roughly the same number of synthetic samples. ADASYN first measures how hard each minority sample is to classify, then allocates more generation budget to the hard ones.

The algorithm

1

Find k nearest neighbours for each minority sample

For every minority sample x_i, find its k nearest neighbours in the whole dataset (majority + minority combined).

2

Compute the difficulty ratio r_i

r_i = Δ_i / k, where Δ_i is the number of majority-class samples among the k neighbours. A sample with r_i = 1 is completely surrounded by the majority — maximally hard.

3

Normalise to get a density distribution

Divide each r_i by the sum of all r values so they form a probability distribution: r̂_i = r_i / Σr.

4

Allocate generation budget

The total number of synthetic samples needed is G. Each seed x_i generates g_i = r̂_i × G new samples (rounded to a whole number) using the same SMOTE interpolation step.
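
Steps 1 to 4 can be sketched as a small budget calculation. The code below is an illustration only: the function name adasyn_budget, the brute-force neighbour search, and the one-feature toy data are assumptions for this example, and it assumes each sample has at least k other samples to compare against.

```python
import math


def adasyn_budget(X, y, minority_label, G, k=5):
    """Return (r, g): difficulty ratio and synthetic budget per minority row."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    r = []
    for i in minority_idx:
        # Step 1: k nearest neighbours in the whole dataset (excluding x_i itself).
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        # Step 2: r_i = (number of majority neighbours) / k.
        delta = sum(1 for j in neighbours if y[j] != minority_label)
        r.append(delta / k)

    total = sum(r)
    if total == 0:
        raise ValueError("no minority sample has a majority neighbour")

    # Step 3: normalise so the ratios sum to 1.
    r_hat = [r_i / total for r_i in r]
    # Step 4: g_i = r_hat_i * G, rounded to a whole number of samples.
    g = [round(w * G) for w in r_hat]
    return r, g


# One-feature toy data (values are made up): majority on the left,
# a minority cluster on the right, and one minority point at 4.4
# sitting inside majority territory.
majority = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
minority = [[4.4], [5.6], [6.0], [6.5], [7.0], [7.5]]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

r, g = adasyn_budget(X, y, minority_label=1, G=30, k=2)
print(r)  # [1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
print(g)  # [20, 10, 0, 0, 0, 0]
```

The isolated point at 4.4 receives two thirds of the budget, the boundary point at 5.6 receives the rest, and the four points deep inside the minority cluster receive nothing. Because of rounding, the g_i values do not always sum to exactly G.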

The adaptive insight

ADASYN focuses the model's attention on the decision boundary. By generating more samples where the minority class bleeds into majority territory, it forces the classifier to sharpen its boundary exactly where it matters most.

When ADASYN outperforms SMOTE

When the minority class is not uniformly spread — it has some dense safe regions and some isolated boundary samples.
When you care about recall on the hardest samples — ADASYN explicitly trains the model on them.
When overfitting to easy minority samples is already happening with SMOTE.

When ADASYN can hurt

Noisy boundary samples get amplified. If a minority sample is in majority territory because it is mislabelled, ADASYN generates many new noisy points near it.
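Extreme imbalance. When minority samples are almost entirely surrounded by majority points, r_i ≈ 1 for nearly all seeds. The normalised weights r̂_i are then almost uniform, so ADASYN loses its adaptive advantage and places most synthetic points in majority-dominated regions.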
Small datasets. k-NN estimation of difficulty is unreliable when n is small. Consider Leave-One-Out cross-validation difficulty estimates in that case.

Pre-processing note

ADASYN's difficulty measure depends on absolute distances. Always scale your features (z-score or min-max) before applying ADASYN so that high-variance features do not dominate the neighbourhood computation.
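
A small sketch of why scaling matters for neighbour searches. The two features (age in years and income in currency units), the five rows, and the helper names zscore_fit, zscore_apply, and nearest are all hypothetical; the scaling statistics are learned from the training rows only, which is the usual practice.

```python
import math
import statistics


def zscore_fit(rows):
    """Per-feature mean and standard deviation, learned from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def zscore_apply(rows, means, stds):
    """Scale each feature to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def nearest(rows, i):
    """Index of the row closest to rows[i] by Euclidean distance."""
    return min(
        (j for j in range(len(rows)) if j != i),
        key=lambda j: math.dist(rows[i], rows[j]),
    )


# Hypothetical training rows: [age, income].
train = [[25, 30000], [60, 30500], [26, 32000], [58, 60000], [40, 45000]]

means, stds = zscore_fit(train)
scaled = zscore_apply(train, means, stds)

print(nearest(train, 0))   # 1: income dominates, so the 60-year-old looks closest
print(nearest(scaled, 0))  # 2: after scaling, the 26-year-old is closest
```

Without scaling, the income column decides the neighbourhood almost on its own, and the difficulty ratio r_i would be computed from the wrong neighbours.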

Lesson 05 · Comparison

Three methods, one decision.

ROS, SMOTE, and ADASYN are not competitors — they are tools for different situations. Knowing when each one wins is as important as knowing how each one works.

Method	New information?	Boundary-aware?	Categorical support	Main risk
ROS	No — exact copies	No	Yes — any type	Overfitting to duplicates
SMOTE	Yes — interpolated	No — uniform	Mixed (SMOTE-NC)	Generates in noisy regions
ADASYN	Yes — interpolated	Yes — density-weighted	Numeric only	Amplifies noisy boundary samples

Decision guide

Use ROS when you have mixed feature types, a very small dataset, or need a fast reproducible baseline to beat.
Use SMOTE when your minority class is reasonably clean and you want diversity without boundary bias. It is the most widely used starting point in published research.
Use ADASYN when recall on hard boundary cases matters most and you are confident the minority samples near the boundary are correctly labelled.

Practical checklist before oversampling

✓

Scale your features first

SMOTE and ADASYN both rely on Euclidean distance. Un-scaled features with different ranges will distort neighbourhoods. Apply z-score or min-max scaling before oversampling.

✓

Oversample training only

The test set must reflect the real-world distribution. Never apply oversampling before a train/test split, and never apply it to the test fold.

✓

Inspect your minority class

Mislabelled samples, outliers, and duplicates in the minority class are amplified by all three methods. Clean the minority class before oversampling.

✓

Evaluate with the right metric

Accuracy is misleading on imbalanced data. Use F1 (macro or weighted), ROC-AUC, precision-recall AUC, or geometric mean of sensitivity and specificity.
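
To make the last checklist item concrete, the sketch below computes several of these metrics from raw predictions. The function name imbalance_metrics and the toy predictions are assumptions for this example; F1 here is reported for the minority (positive) class only, not as a macro or weighted average.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Accuracy plus minority-focused metrics from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy test set: 5 minority and 95 majority samples (made up).
# The model finds 2 of the 5 minority cases and raises 3 false alarms.
y_true = [1] * 5 + [0] * 95
y_pred = [1, 1, 0, 0, 0] + [1, 1, 1] + [0] * 92

m = imbalance_metrics(y_true, y_pred)
print(f"accuracy={m['accuracy']:.2f} f1={m['f1']:.2f} g_mean={m['g_mean']:.2f}")
# accuracy=0.94 f1=0.40 g_mean=0.62
```

The 94% accuracy hides the fact that most minority cases were missed, while F1 and the geometric mean make that visible.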

The imbalanced-learn library

In Python, all three methods, and many more, are implemented in imbalanced-learn (import as imblearn). It is designed to be compatible with scikit-learn and adds a sampler method of its own: fit_resample(X_train, y_train) returns the resampled dataset.

Lesson 06 · Checkpoint

Five questions. Pencil down.

Work through each question. The lesson that covers it is one sidebar click away if you get stuck.

Q01

Why should oversampling be applied to the training set only, and never to the test set?

A. Because oversampling is computationally expensive and would slow down evaluation

B. Because minority samples in the test set are already balanced by default

C. Because the test set must reflect the real-world class distribution to give an honest estimate of performance

D. Because SMOTE and ADASYN cannot run on labelled test data

The test set simulates deployment. If you oversample it, you are evaluating on a distribution that does not exist in the real world and your metrics will be misleadingly optimistic on the minority class.

Q02

What is the main overfitting risk introduced by Random Over Sampling?

A. ROS generates synthetic points that fall outside the original feature space

B. The model sees exact duplicate samples and learns to memorise them instead of generalising

C. ROS changes the feature values of minority samples during duplication

D. ROS always produces a dataset with a 1:1 class ratio which confuses the model

ROS copies rows verbatim. The same feature vector appears multiple times in the training set. The model learns to fire confidently on those coordinates, tightening the decision boundary around copies rather than learning general minority patterns.

Q03

In SMOTE, a new synthetic sample x_new is created by the formula x_new = x + λ·(x_n − x). What is λ?

A. A random scalar drawn uniformly from [0, 1]

B. The Euclidean distance between x and x_n

C. The number of nearest neighbours k

D. The imbalance ratio between majority and minority classes

λ ∈ [0,1] is drawn uniformly at random. When λ=0, x_new = x (the seed itself). When λ=1, x_new = x_n (the neighbour). Any value in between places the new point somewhere along the line segment joining the two.

Q04

ADASYN computes a difficulty ratio r_i for each minority sample. What does a high r_i indicate?

A. The sample is easy to classify because it sits deep inside the minority cluster

B. The sample has many minority-class neighbours and should receive fewer synthetic children

C. The sample is surrounded by many majority-class neighbours and is hard to classify correctly

D. The sample has missing values that make distance computation unreliable

r_i = Δ_i/k where Δ_i is the count of majority-class samples in the k-neighbourhood. High r_i means the minority sample is "drowning" in majority neighbours — a hard, boundary-region sample — so ADASYN generates more synthetic data near it.

Q05

You have a medical dataset with both numeric and categorical features, and the minority class has fewer than 15 samples. Which method is most appropriate to start with?

A. Random Over Sampling: it handles any feature type and provides a clean baseline

B. SMOTE: it always outperforms ROS regardless of dataset size

C. ADASYN: boundary-awareness is always the priority in medical data

D. No oversampling: imbalance is natural and should be left alone

With fewer than ~15 minority samples, k-NN distances for SMOTE and ADASYN are unreliable. With mixed feature types, standard SMOTE's linear interpolation is undefined for categoricals. ROS is the correct starting point — it is feature-agnostic and still guarantees no information loss.
