Lesson 01 · First Contact

Before you model anything, read the data.

Most modelling mistakes that start with the data can be caught in the first five minutes. This lesson shows the sequence of pandas commands to run every time you open a new dataset.

Step 1 — Load and check the shape

The first thing to know about any dataset is how big it is and what types it contains. Always run df.info() before df.head() — the type summary is more informative than the first five rows.

python

import pandas as pd
import numpy as np

df = pd.read_csv("dataset.csv")

# 1. Shape — rows × columns
print("Shape:", df.shape)

# 2. Types, nulls, memory
df.info()

# 3. First rows (df.tail(3) shows the last ones)
df.head(3)

output

Shape: (2847, 12)
RangeIndex: 2847 entries, 0 to 2846
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   patient_id      2847 non-null   int64
 1   age             2847 non-null   int64
 2   bmi             2823 non-null   float64   ← 24 nulls
 3   glucose         2847 non-null   float64
 4   insulin         2601 non-null   float64   ← 246 nulls
 5   blood_pressure  2847 non-null   int64
 6   skin_thickness  2847 non-null   float64
 7   outcome         2847 non-null   int64
dtypes: float64(4), int64(4)
memory usage: 267.1 KB
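When df.info() reports object for a column that should be numeric, the usual cause is stray text in the data. The sketch below shows one way to coerce such a column and count what was lost. The column name glucose_raw and the "n/a" token are made up for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical column: numbers read in as text because of one bad token
raw = pd.DataFrame({"glucose_raw": ["148", "85", "n/a", "183"]})
print("dtype before:", raw["glucose_raw"].dtype)

# errors="coerce" turns anything unparseable into NaN instead of raising
coerced = pd.to_numeric(raw["glucose_raw"], errors="coerce")
print("dtype after: ", coerced.dtype)
print("values that failed to parse:", coerced.isna().sum())
```

Counting the new NaNs right after the conversion tells you how many entries were not numbers, which is worth knowing before any imputation step.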

Step 2 — Summary statistics

df.describe() gives you the count, mean, standard deviation, and five-number summary for every numeric column. The most important thing to read here is not the mean but the min and max. Impossible values hide there.

python

# Round for readability
df.describe().round(2)

output

         age     bmi  glucose  insulin  blood_pressure  outcome
count   2847    2823     2847     2601            2847     2847
mean    33.2    32.1    121.7     80.3            69.1     0.35
std     11.8     7.9     32.1    115.2            19.2     0.48
min     21.0     0.0      0.0      0.0             0.0     0.00
25%     24.0    27.3    100.0      0.0            62.0     0.00
50%     29.0    32.2    117.0     31.5            72.0     0.00
75%     41.0    36.6    141.0    127.3            80.0     1.00
max     81.0    67.1    199.0    846.0           122.0     1.00

Red flags above

BMI = 0, glucose = 0, blood pressure = 0, insulin = 0. These are biologically impossible. They are missing values disguised as zeros — a common encoding in older medical datasets. Always check if zeros in numeric columns are real or masked NaNs.

Step 3 — Check for zero-masked missingness

Many datasets encode missing values as zero, -1, or 999. Always verify with domain knowledge which values are physically impossible, then replace them.

python

# Columns where zero is biologically impossible
zero_impossible = ["bmi", "glucose", "insulin", "blood_pressure", "skin_thickness"]

for col in zero_impossible:
    n_zeros = (df[col] == 0).sum()
    if n_zeros > 0:
        print(f"{col}: {n_zeros} zero values → replace with NaN")
        df[col] = df[col].replace(0, np.nan)

output

bmi: 11 zero values → replace with NaN
glucose: 5 zero values → replace with NaN
insulin: 374 zero values → replace with NaN
blood_pressure: 35 zero values → replace with NaN
skin_thickness: 227 zero values → replace with NaN
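The same idea extends to the other sentinel codes mentioned above (-1, 999). A minimal sketch, assuming a per-column dictionary of codes that you have confirmed against the data dictionary. The columns visits and income and their codes are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel codes per column; confirm each with domain knowledge
sentinels = {"bmi": [0], "visits": [-1], "income": [999, 9999]}

toy = pd.DataFrame({
    "bmi":    [31.2, 0.0, 27.5],
    "visits": [3, -1, 0],            # here 0 visits is a real value, so 0 is not listed
    "income": [52000, 999, 9999],
})

for col, codes in sentinels.items():
    is_sentinel = toy[col].isin(codes)
    toy[col] = toy[col].mask(is_sentinel)   # masked entries become NaN
    print(f"{col}: {is_sentinel.sum()} sentinel value(s) masked")
```

Listing the codes per column, rather than replacing a value everywhere, keeps legitimate zeros (such as zero visits) intact.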

Step 4 — Value counts for categoricals

python

# Class balance check
print(df["outcome"].value_counts(normalize=True).round(3))

# For every object/category column
for col in df.select_dtypes("object").columns:
    print(col, df[col].value_counts().head(5))

output

0    0.651
1    0.349
Name: outcome, dtype: float64

Lesson 02 · Distributions

A histogram tells you what your model will actually learn from.

Summary statistics hide shape. Two columns can have identical mean and standard deviation but completely different distributions. Plot every numeric feature before touching a model.
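To make that claim concrete, here is a small sketch on synthetic data (not the lesson's dataset). Two samples are rescaled to the same mean and standard deviation, and only the skewness reveals that their shapes differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def standardise(x):
    """Rescale to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

features = pd.DataFrame({
    "symmetric": standardise(rng.normal(size=5000)),
    "skewed":    standardise(rng.exponential(size=5000)),
})

# Same mean and std by construction; only the skewness gives the shape away
summary = features.agg(["mean", "std", "skew"]).round(2)
print(summary)
```

A histogram of each column would show the difference at a glance, which is the reason to plot before modelling.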

Plot all numeric distributions at once

python

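import math
import matplotlib.pyplot as plt

num_cols = df.select_dtypes("number").columns.tolist()
n_plot_cols = 4
n_plot_rows = math.ceil(len(num_cols) / n_plot_cols)   # grow the grid with the data
fig, axes = plt.subplots(n_plot_rows, n_plot_cols,
                         figsize=(14, 2.5 * n_plot_rows), squeeze=False)

for ax, col in zip(axes.flat, num_cols):
    df[col].dropna().hist(ax=ax, bins=30, color="#2563eb", edgecolor="none", alpha=.85)
    ax.set_title(col, fontsize=9)
    ax.set_xlabel("")

# Hide any axes left over when the grid has more cells than columns
for ax in axes.ravel()[len(num_cols):]:
    ax.set_visible(False)

fig.tight_layout()
plt.show()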


Detect and handle outliers in code

The IQR fence method is the most common first pass. Flag outliers — don't delete them yet. Investigate first.

python

def iqr_outliers(series, k=1.5):
    """Return boolean mask — True where value is outside k*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k*iqr) | (series > q3 + k*iqr)

for col in num_cols:
    mask = iqr_outliers(df[col].dropna())
    print(f"{col:<20} {mask.sum():>4} outliers  ({mask.mean()*100:.1f}%)")

output

age                    12 outliers  (0.4%)
bmi                    25 outliers  (0.9%)
glucose                19 outliers  (0.7%)
insulin                87 outliers  (3.3%)   ← investigate
blood_pressure         45 outliers  (1.6%)
skin_thickness         28 outliers  (1.0%)
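One way to "flag, don't delete" is to store the mask as its own column so the rows stay available for inspection. A minimal sketch on toy values. The _outlier suffix is a naming choice made here, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask: True where a value lies outside the k*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

toy = pd.DataFrame({"insulin": [25.0, 80.0, 94.0, 120.0, 846.0, np.nan]})

# NaN compares as False on both sides, so missing rows are never flagged
toy["insulin_outlier"] = iqr_outliers(toy["insulin"])

print(toy[toy["insulin_outlier"]])      # rows to investigate
print("rows kept:", len(toy))           # nothing was dropped
```

Keeping the flag alongside the data lets you later compare model results with and without the flagged rows.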

Lesson 03 · Missingness

Missing data is never just missing.

How data goes missing determines what you can do about it. Imputing under the wrong assumption about the missingness mechanism can be worse than leaving the gaps as they are. The first task is classification, not imputation.

The three types of missingness

MCAR: Missing Completely At Random. The probability of being missing is unrelated to any variable, observed or not. Safe to impute or drop without introducing bias. Rare in real data.
MAR: Missing At Random. The probability of being missing depends on other observed variables, not on the missing value itself. Imputation using other columns is valid. Most common in practice.
MNAR: Missing Not At Random. The probability of being missing depends on the missing value itself (e.g. patients with very high blood pressure skip measurement). Standard imputation introduces bias. Requires a domain-driven strategy.
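The three mechanisms can be simulated to see how each one affects a simple statistic. This is a sketch on synthetic data: the age and blood-pressure relationship and the missingness probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000
age = rng.normal(50, 10, n)
bp = 80 + 0.8 * (age - 50) + rng.normal(0, 10, n)   # synthetic blood pressure

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Probability that bp is missing under each mechanism
p_missing = {
    "MCAR": np.full(n, 0.3),              # unrelated to anything
    "MAR":  sigmoid((age - 50) / 5),      # depends on observed age only
    "MNAR": sigmoid((bp - 80) / 5),       # depends on bp itself
}

observed_mean = {"full data": bp.mean()}
for name, p in p_missing.items():
    is_missing = rng.random(n) < p
    observed_mean[name] = bp[~is_missing].mean()

print(pd.Series(observed_mean).round(1))
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR and MNAR it shifts, but only in the MAR case does an observed column (age) explain the shift, which is what makes imputation from other columns defensible there.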

Step 1 — Quantify missingness per column

python

missing = (
    df.isnull()
    .sum()
    .sort_values(ascending=False)
)
missing_pct = missing / len(df) * 100

pd.DataFrame({
    "missing_n"  : missing,
    "missing_pct": missing_pct.round(1)
})[missing > 0]

output

                missing_n  missing_pct
insulin               374         13.1
skin_thickness        227          8.0
blood_pressure         35          1.2
bmi                    11          0.4
glucose                 5          0.2

Step 2 — Visualise missingness patterns

python

import seaborn as sns

# Heatmap — white = present, blue = missing
fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(
    df.isnull().T,
    cbar=False,
    cmap="Blues",
    yticklabels=True,
    ax=ax
)
ax.set_title("Missingness heatmap (blue = missing)")
plt.show()

# Little's test is the formal MCAR check, but scipy does not provide it.
# Quick proxy: correlate each missing indicator with an observed column.
from scipy import stats
for col in ["insulin", "bmi"]:
    indicator = df[col].isnull().astype(int)
    r, p = stats.pointbiserialr(indicator, df["outcome"])
    print(f"{col} missing ~ outcome:  r={r:.3f}, p={p:.4f}")

output

insulin missing ~ outcome:  r=0.131, p=0.0000   ← MAR / MNAR, not MCAR
bmi missing ~ outcome:  r=0.023, p=0.2181   ← likely MCAR

Figure: missingness heatmap. Each row is one sample, each column is one feature (blue = missing).

Step 3 — Imputation strategies

python

from sklearn.impute import SimpleImputer, KNNImputer

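# MCAR / low % → median imputation (robust to outliers)
# blood_pressure is included so that no zero-masked column is left unimputed
median_imp = SimpleImputer(strategy="median")
low_missing = ["bmi", "glucose", "blood_pressure"]
df[low_missing] = median_imp.fit_transform(df[low_missing])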

# MAR → KNN imputation (uses other columns as context)
knn_imp = KNNImputer(n_neighbors=5)
df[["insulin", "skin_thickness"]] = knn_imp.fit_transform(
    df[["insulin", "skin_thickness", "age", "bmi"]]
)[:, :2]

# Confirm: no nulls remain
print("Remaining nulls:", df.isnull().sum().sum())

output

Remaining nulls: 0
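For a column suspected to be MNAR, filling the gap erases the one signal that matters: the fact that the value was missing. A minimal sketch of keeping that signal as a binary indicator, on toy values. The column name insulin_was_missing is an illustrative choice.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"insulin": [94.0, np.nan, 168.0, np.nan, 88.0]})

# Record the missingness BEFORE filling, so a model can still see it
toy["insulin_was_missing"] = toy["insulin"].isna().astype(int)

# Median fill is only a placeholder here; the indicator carries the information
toy["insulin"] = toy["insulin"].fillna(toy["insulin"].median())

print(toy)
```

In a real workflow the fill value would be computed on training rows only. The toy frame above skips that step for brevity.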

Lesson 04 · Leakage

A model that is too good is usually wrong.

Data leakage is when information about the target variable, or about the future, enters the training features. It produces models with near-perfect offline metrics that fail in production. It is among the most dangerous and most common data problems.

Target leakage

A feature is derived from or causally downstream of the target. The model learns the answer, not the pattern.

⚠ Correlation with target > 0.9

Temporal leakage

Future information is used to predict the past. Training data contains rows that, in deployment, would not yet exist.

⚠ Train AUC = 0.99, test AUC = 0.55

Pipeline leakage

Preprocessing (scaling, imputation, encoding) is fit on the full dataset including the test set. Statistics from test rows leak into training.

⚠ Scaler.fit(X) before train_test_split

No leakage ✓

All preprocessing fitted on training only. Features are causally prior to target. Temporal ordering respected.

✓ Pipeline inside cross-validation
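Temporal leakage is usually avoided by splitting on time rather than at random. The sketch below assumes a hypothetical visit_date column and a hypothetical cutoff date, neither of which is part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical visit log; "visit_date" is an assumed column name
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(
        ["2021-03-01", "2020-11-15", "2021-07-09", "2020-06-30", "2021-01-20"]
    ),
    "glucose": [148, 85, 183, 89, 137],
    "outcome": [1, 0, 1, 0, 1],
}).sort_values("visit_date")

cutoff = pd.Timestamp("2021-01-01")     # assumed deployment boundary
train = visits[visits["visit_date"] < cutoff]
test = visits[visits["visit_date"] >= cutoff]

# Every training row must predate every test row
assert train["visit_date"].max() < test["visit_date"].min()
print(len(train), "training rows,", len(test), "test rows")
```

A random split on the same table would mix later visits into training, which is exactly the situation the temporal-leakage card describes.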

Detect target leakage — correlation scan

python

import matplotlib.pyplot as plt

# Pearson correlation of every numeric feature with the target
target = "outcome"
corrs = (
    df.drop(columns=[target])
    .corrwith(df[target], numeric_only=True)
    .abs()
    .sort_values(ascending=False)
)

corrs.plot(kind="bar", color=["#ef4444" if v > 0.6 else "#2563eb" for v in corrs])
plt.axhline(0.6, color="#ef4444", linestyle="--", label="leakage threshold")
plt.legend()   # without this call the axhline label is never shown
plt.title("Feature–target correlations (|r|)")
plt.tight_layout()
plt.show()

Fix pipeline leakage — always use sklearn Pipeline

The single most important anti-leakage practice is to wrap all preprocessing steps inside a scikit-learn Pipeline that is fitted inside cross-validation, never on the full dataset.

python

from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute        import SimpleImputer
from sklearn.ensemble      import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X = df.drop(columns=["outcome"])
y = df["outcome"]

# Split FIRST — nothing touches X_test until evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pipeline: impute → scale → model
# Imputer and scaler are fit ONLY on training folds
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
    ("model",   RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

output

CV AUC: 0.824 ± 0.018
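To see what "fitted on the full dataset" means in numbers, the sketch below fits the same scaler twice on a synthetic single-feature array: once on everything and once on the training rows only. The data is random and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100, scale=20, size=(200, 1))   # synthetic single feature

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

leaky = StandardScaler().fit(X)          # statistics include the test rows
clean = StandardScaler().fit(X_train)    # statistics from training rows only

print("mean seen by leaky scaler:", round(float(leaky.mean_[0]), 3))
print("mean seen by clean scaler:", round(float(clean.mean_[0]), 3))
```

The two means differ because the leaky scaler has already looked at the held-out rows. On a large, well-shuffled dataset the difference is small, but the same mistake with imputation or target encoding can be much more damaging.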

Figure: |Pearson r| of each feature with the target. Red bars are suspicious (|r| > 0.6).

Lesson 05 · Pre-modelling Checklist

Eight questions before you write a single model line.

If you cannot answer all eight questions, you are not ready to model. Tick each one as you work through a new dataset.

The full Python workflow that answers all eight questions at once:

python — complete EDA template

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("your_data.csv")
target = "target"   # name of the label column in your data

# ── Q1. What is the shape? ──────────────────────────────
print("Shape:", df.shape)

# ── Q2. What are the types? ─────────────────────────────
df.info()

# ── Q3. Do the ranges make sense? ──────────────────────
print(df.describe().round(2))

# ── Q4. Are there impossible zeros? ────────────────────
zero_counts = (df == 0).sum()
print(zero_counts[zero_counts > 0])

# ── Q5. How much is missing? ────────────────────────────
miss = df.isnull().mean().sort_values(ascending=False)
print(miss[miss > 0])

# ── Q6. What is the class balance? ─────────────────────
print(df[target].value_counts(normalize=True))

# ── Q7. Are there suspicious feature–target correlations? ─
# Drop the target first, otherwise it reports itself with |r| = 1
high_corr = df.drop(columns=[target]).corrwith(df[target], numeric_only=True).abs()
print(high_corr[high_corr > 0.6])

# ── Q8. Are any features duplicates or near-constants? ──
stds = df.std(numeric_only=True)   # skip text columns, which have no std
print("Near-zero variance:", stds[stds < 0.01].index.tolist())
print("Duplicate cols:",   df.T.duplicated().sum())

Checklist

You should be able to tick all eight before opening a model notebook.

Q1 — What is the shape?

df.shape. Know rows × columns. Flag if fewer than 100 rows — modelling will be unreliable.

Q2 — What are the column types?

df.info(). Verify dtypes match expectation. Integers stored as objects will break scalers.

Q3 — Do the value ranges make sense?

df.describe(). Check min and max for every column against domain knowledge.

Q4 — Are zeros actually missing values?

Replace biologically/physically impossible zeros with NaN before any further analysis.

Q5 — How much data is missing and why?

Classify as MCAR, MAR, or MNAR. Choose the imputation strategy accordingly. Consider dropping columns with more than 40% missing.

Q6 — What is the class balance?

value_counts(normalize=True). An imbalance ratio above 5:1 usually calls for resampling or a class-weighted loss.

Q7 — Is there feature–target leakage?

corrwith target. Any |r| > 0.6 deserves a causal explanation. All preprocessing inside Pipeline.

Q8 — Are there near-constant or duplicate features?

std() < 0.01 → candidate to drop, after checking the column's units. Duplicate columns add redundancy, not information. df.T.duplicated() finds exact duplicates.
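For Q6, the imbalance ratio and a set of class weights can be computed directly from the value counts. The sketch below uses a synthetic 9:1 target and the n_samples / (n_classes * class_count) heuristic, which is the formula scikit-learn documents for class_weight="balanced".

```python
import pandas as pd

y = pd.Series([0] * 90 + [1] * 10, name="outcome")   # synthetic 9:1 target

counts = y.value_counts()
ratio = counts.max() / counts.min()

# "Balanced" heuristic: n_samples / (n_classes * class_count)
weights = len(y) / (len(counts) * counts)

print(f"imbalance ratio: {ratio:.1f}:1")
print(weights.round(2).to_dict())
```

The rarer class receives the larger weight, so each of its errors costs more during training. Whether weighting or resampling works better depends on the model and the data, so treat this as a starting point.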

Lesson 06 · Checkpoint

Five questions. Pencil down.

Each question maps to one of Lessons 01 to 04. The answer is in the code or the explanations above, not in the library docs.

Q01

In df.describe(), you see that glucose has a minimum value of 0. What should you do first?

A. Drop all rows where glucose == 0 immediately

B. Impute glucose == 0 with the column mean before checking anything else

C. Check with domain knowledge whether a glucose value of 0 is physically possible, then replace with NaN if not

D. Leave it as-is; zeros are valid data entries

A glucose of 0 mg/dL is biologically impossible for a living patient. It is a masked missing value. Replace it with NaN first, then choose an imputation strategy based on the missingness type — not the other way around.

Q02

A column is classified as MNAR (Missing Not At Random). Which imputation strategy is most appropriate?

A. Mean imputation, because it is always safe for numeric columns

B. KNN imputation using other numeric columns as context

C. Drop the column, because MNAR columns are always useless

D. Use a domain-driven strategy (e.g. model the missingness itself, or add a binary indicator); standard imputation introduces bias for MNAR

For MNAR, the missingness mechanism is informative: whether a value is missing depends on the unobserved value itself. Standard imputation ignores this and introduces bias. You should either model the missingness process explicitly, create a binary "was_missing" indicator feature, or consult a domain expert.

Q03

You train a model and get AUC = 0.99 on training and AUC = 0.54 on test. What is the most likely explanation?

A. The model is very powerful and needs a larger test set to show its true performance

B. Data leakage: either a feature derived from the target, or preprocessing fitted on the full dataset including test rows

C. The training set is too small and the model memorised it

D. AUC values above 0.9 are always expected for medical datasets

A collapse from AUC 0.99 to 0.54 should trigger a leakage audit. Investigate: was the scaler or imputer fitted before the split? Does any feature have |r| > 0.6 with the target? Is any feature causally downstream of it? Plain overfitting can also produce a gap like this, so once leakage is ruled out, check model complexity next.

Q04

What is the correct order of operations to avoid pipeline leakage?

A. train_test_split → Pipeline(imputer + scaler + model) → cross_val_score on training set only

B. Fit scaler on full dataset → train_test_split → fit model on training set

C. KNN imputation on full dataset → train_test_split → Pipeline with scaler and model

D. The order does not matter as long as the test set is held out during model fitting

Split first, then let the Pipeline handle all preprocessing. The Pipeline's fit() is called only on training data (or training folds during CV). This ensures that statistics (mean, std, imputation values) from test rows never influence training — which is exactly what leakage means.

Q05

You compute the skewness of the insulin column and get 2.7. What does this tell you about the distribution, and what practical step should you consider?

A. The distribution is symmetric; no action needed

B. The distribution is left-skewed; consider removing the left tail

C. The distribution is right-skewed with a heavy tail; consider a log or square-root transform before feeding to distance-based or linear models

D. Skewness only matters for the target variable, not for features

Skewness above 1 signals a heavy right tail. Distance-based and linear methods (linear regression, SVMs, KNN, PCA) are sensitive to extreme values, so they tend to behave better on roughly symmetric features. An np.log1p(x) or np.sqrt(x) transform reduces skew and prevents large outliers from dominating distance or gradient computations.
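The effect of the transform is easy to check. A minimal sketch on a synthetic right-skewed sample (a lognormal draw standing in for a column like insulin, not the lesson's actual data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic, non-negative, right-skewed values
x = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

skew_before = x.skew()
skew_after = np.log1p(x).skew()     # log(1 + x) is defined at x = 0

print("skew before:", round(skew_before, 2))
print("skew after log1p:", round(skew_after, 2))
```

np.log1p is used instead of np.log so that zeros in the column do not produce -inf. If the model's output needs to be read in the original units, np.expm1 inverts the transform.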
