Before you model anything, read the data.
Many modelling mistakes that start with the data can be caught in the first five minutes. This lesson shows the sequence of pandas commands to run every time you open a new dataset.
Step 1 — Load and check the shape
The first thing to know about any dataset is how big it is and what types it contains. Always run df.info() before df.head() — the type summary is more informative than the first five rows.
import pandas as pd
import numpy as np
df = pd.read_csv("dataset.csv")
# 1. Shape — rows × columns
print("Shape:", df.shape)
# 2. Types, nulls, memory
df.info()
# 3. First and last rows
print(df.head(3))
print(df.tail(3))
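To illustrate why the type summary deserves a look before the first rows, here is a minimal sketch on a made-up in-memory CSV. The column names and the "unknown" entry are assumptions for illustration, not part of the lesson's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV: one non-numeric entry in "age" changes the whole column's dtype.
raw = io.StringIO("age,bmi\n34,22.1\n51,30.4\nunknown,27.8\n")
demo = pd.read_csv(raw)

print(demo.head(2))   # the first rows look perfectly numeric
print(demo.dtypes)    # the dtype summary shows "age" was not parsed as a number

age_is_numeric = pd.api.types.is_numeric_dtype(demo["age"])

# Coerce to numbers; anything that cannot be parsed becomes NaN.
demo["age"] = pd.to_numeric(demo["age"], errors="coerce")
n_coerced = int(demo["age"].isna().sum())
print("Values coerced to NaN:", n_coerced)
```

The first two rows give no hint of the problem, while the dtype listing does.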
Step 2 — Summary statistics
df.describe() gives you the five-number summary plus mean and standard deviation for every numeric column. The most important thing to read here is not the mean — it is the min and max. Impossible values hide there.
# Round for readability
df.describe().round(2)
A minimum of 0 for BMI, glucose, blood pressure, or insulin is biologically impossible. Such zeros are missing values disguised as zeros, a common encoding in older medical datasets. Always check whether zeros in numeric columns are real or masked NaNs.
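One way to read only the min and max rows is sketched below on a small made-up frame. The column names and values are hypothetical; the point is that a zero minimum is only a prompt for a domain check, because some zeros are legitimate.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
demo = pd.DataFrame({
    "bmi": [22.1, 0.0, 30.4, 27.8],
    "glucose": [85, 120, 0, 99],
    "pregnancies": [0, 2, 1, 0],   # zero is a legitimate value here
})

# Pull only the min and max rows out of describe().
ranges = demo.describe().loc[["min", "max"]]
print(ranges)

# Columns whose minimum is exactly zero are candidates for a domain check.
zero_min_cols = ranges.columns[ranges.loc["min"] == 0].tolist()
print("Check these columns:", zero_min_cols)
```

All three columns are listed, yet only two of them hide missing values. The scan narrows the search; domain knowledge makes the call.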
Step 3 — Check for zero-masked missingness
Many datasets encode missing values as zero, -1, or 999. Always verify with domain knowledge which values are physically impossible, then replace them.
# Columns where zero is biologically impossible
zero_impossible = ["bmi", "glucose", "insulin", "blood_pressure", "skin_thickness"]
for col in zero_impossible:
    n_zeros = (df[col] == 0).sum()
    if n_zeros > 0:
        print(f"{col}: {n_zeros} zero values → replace with NaN")
        df[col] = df[col].replace(0, np.nan)
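The same idea extends to other sentinel codes such as -1 or 999. The sketch below assumes a hand-written mapping from column to sentinel values; the `sentinels` dictionary and the toy columns are hypothetical and would come from the data dictionary or domain knowledge in practice.

```python
import pandas as pd

# Hypothetical toy data using -1 and 999 as "missing" codes.
demo = pd.DataFrame({
    "age": [34, -1, 51, 999],
    "income": [52000, 61000, -1, 48000],
})

# Hypothetical mapping: which codes mean "missing" in which column.
sentinels = {"age": [-1, 999], "income": [-1]}

for col, codes in sentinels.items():
    is_sentinel = demo[col].isin(codes)
    print(f"{col}: {int(is_sentinel.sum())} sentinel values")
    # mask() replaces the flagged entries with NaN and leaves the rest untouched.
    demo[col] = demo[col].mask(is_sentinel)

print(demo)
```

Keeping the mapping per column matters: 999 may be a sentinel in one column and a valid value in another.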
Step 4 — Value counts for categoricals
# Class balance check
print(df["outcome"].value_counts(normalize=True).round(3))
# For every object/category column
for col in df.select_dtypes(["object", "category"]).columns:
    print(col, df[col].value_counts().head(5))
A histogram tells you what your model will actually learn from.
Summary statistics hide shape. Two columns can have identical mean and standard deviation but completely different distributions. Plot every numeric feature before touching a model.
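A small simulation makes the point concrete. The sketch below builds one bell-shaped and one two-peaked sample, rescales both to the same mean and standard deviation, and then compares how much of each sits near the centre. The data is synthetic and the 0.25 window is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

unimodal = rng.normal(size=10_000)
bimodal = np.concatenate([
    rng.normal(-2, 0.5, size=5_000),
    rng.normal(2, 0.5, size=5_000),
])

def standardize(x):
    """Rescale to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

u, b = standardize(unimodal), standardize(bimodal)

# The summary statistics are identical by construction.
print(f"mean: {u.mean():.3f} vs {b.mean():.3f}   std: {u.std():.3f} vs {b.std():.3f}")

# The share of values close to the mean is very different.
u_centre = (np.abs(u) < 0.25).mean()
b_centre = (np.abs(b) < 0.25).mean()
print(f"share within 0.25 of the mean: {u_centre:.3f} vs {b_centre:.3f}")
```

In the two-peaked sample almost nothing lies near the mean, which a histogram shows at a glance and `describe()` does not.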
Plot all numeric distributions at once
import matplotlib.pyplot as plt
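num_cols = df.select_dtypes("number").columns.tolist()
n = len(num_cols)
n_cols = 4
n_rows = -(-n // n_cols)  # ceiling division so every column gets its own axis
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 2.5 * n_rows), squeeze=False)
for ax, col in zip(axes.flat, num_cols):
    df[col].dropna().hist(ax=ax, bins=30, color="#2563eb", edgecolor="none", alpha=.85)
    ax.set_title(col, fontsize=9)
    ax.set_xlabel("")
# Hide any axes left over when n is not a multiple of n_cols
for ax in axes.flat[n:]:
    ax.set_visible(False)
fig.tight_layout()
plt.show()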
Detect and handle outliers in code
The IQR fence method is the most common first pass. Flag outliers — don't delete them yet. Investigate first.
def iqr_outliers(series, k=1.5):
    """Return boolean mask: True where value is outside the k*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k*iqr) | (series > q3 + k*iqr)

for col in num_cols:
    mask = iqr_outliers(df[col].dropna())
    print(f"{col:<20} {mask.sum():>4} outliers ({mask.mean()*100:.1f}%)")
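Flagging rather than deleting can be as simple as storing the mask in its own column. The sketch below repeats the fence function so it runs on its own and applies it to a made-up column; the `reading` name and values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask: True where a value lies outside the k*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical toy data with one extreme reading.
demo = pd.DataFrame({"reading": [10, 12, 11, 13, 12, 11, 100]})

# Flag in a new column so the original values stay untouched.
demo["reading_outlier"] = iqr_outliers(demo["reading"])

# Inspect the flagged rows before deciding what to do with them.
flagged = demo[demo["reading_outlier"]]
print(flagged)
```

No rows are removed, so the decision to drop, cap, or keep the value can wait until it has been investigated.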
Missing data is never just missing.
How data goes missing determines what you can do about it. Imputing the wrong type of missingness is worse than leaving it as-is. The first task is classification, not imputation.
The three types of missingness
- MCAR (Missing Completely At Random). The probability of being missing is unrelated to any variable, observed or not. Dropping incomplete rows does not bias estimates, though it costs sample size. Rare in real data.
- MAR (Missing At Random). The probability of being missing depends on other observed variables, not on the missing value itself. Imputation that uses those other columns is valid. This is the usual working assumption in practice.
- MNAR (Missing Not At Random). The probability of being missing depends on the missing value itself (e.g. patients with very high blood pressure skip measurement). Standard imputation introduces bias. Requires a domain-driven strategy.
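The three mechanisms can be simulated to see how each one shifts a naive summary. The sketch below uses synthetic age and blood-pressure values with made-up missingness probabilities; none of these numbers come from the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical variables: blood pressure rises with age.
age = rng.normal(50, 10, size=n)
bp = 80 + 0.5 * (age - 50) + rng.normal(0, 8, size=n)

# Each mask is True where the blood-pressure value goes missing.
mcar = rng.random(n) < 0.3                           # ignores everything
mar = rng.random(n) < np.where(age > 55, 0.6, 0.1)   # depends on observed age
mnar = rng.random(n) < np.where(bp > 90, 0.6, 0.1)   # depends on bp itself

true_mean = bp.mean()
observed_means = {
    "MCAR": bp[~mcar].mean(),
    "MAR": bp[~mar].mean(),
    "MNAR": bp[~mnar].mean(),
}
print(f"true mean: {true_mean:.2f}")
for name, value in observed_means.items():
    print(f"{name:<5} observed mean: {value:.2f}")
```

Under MCAR the observed mean stays close to the true one. Under MAR and MNAR it drifts downward, but only in the MAR case is the cause (age) an observed column that an imputation model can condition on.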
Step 1 — Quantify missingness per column
missing = (
    df.isnull()
    .sum()
    .sort_values(ascending=False)
)
missing_pct = missing / len(df) * 100
summary = pd.DataFrame({
    "missing_n": missing,
    "missing_pct": missing_pct.round(1),
})
# Show only the columns that actually have missing values
print(summary[summary["missing_n"] > 0])
Step 2 — Visualise missingness patterns
import seaborn as sns
# Heatmap — white = present, blue = missing
fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(
    df.isnull().T,
    cbar=False,
    cmap="Blues",
    yticklabels=True,
    ax=ax
)
ax.set_title("Missingness heatmap (blue = missing)")
plt.show()
# Little's test is the formal check for MCAR. The loop below is only a quick proxy:
# does the missing indicator correlate with the outcome?
from scipy import stats
for col in ["insulin", "bmi"]:
    indicator = df[col].isnull().astype(int)
    r, p = stats.pointbiserialr(indicator, df["outcome"])
    print(f"{col} missing ~ outcome: r={r:.3f}, p={p:.4f}")
Step 3 — Imputation strategies
from sklearn.impute import SimpleImputer, KNNImputer
# Exploration only: when modelling, fit imputers on the training data, not the full dataset
# MCAR / low % → median imputation (robust to outliers)
median_imp = SimpleImputer(strategy="median")
df[["bmi", "glucose"]] = median_imp.fit_transform(df[["bmi", "glucose"]])
# MAR → KNN imputation (uses other columns as context)
knn_imp = KNNImputer(n_neighbors=5)
df[["insulin", "skin_thickness"]] = knn_imp.fit_transform(
    df[["insulin", "skin_thickness", "age", "bmi"]]
)[:, :2]
# Check what is still missing: columns not imputed above keep their NaNs
print("Remaining nulls:", df.isnull().sum().sum())
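When the fact that a value was missing may itself carry information, one option is to keep a missing-value flag next to the imputed column. The sketch below uses the `add_indicator` option of scikit-learn's `SimpleImputer` on a made-up array. It is one possible approach, not the lesson's prescribed method.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix: column 0 has one missing value, column 1 has none.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

imp = SimpleImputer(strategy="median", add_indicator=True)
X_out = imp.fit_transform(X)

# Output columns: the two imputed features, then one 0/1 flag
# for each feature that had missing values during fit.
print(X_out)
```

The model then sees both the filled-in value and the flag, so the missingness pattern is not silently erased.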
A model that is too good is usually wrong.
Data leakage occurs when information about the target variable, or about the future, enters the training features. It produces models with suspiciously good validation metrics that then fail in production. It is among the most dangerous and most common data problems.
Detect target leakage — correlation scan
import matplotlib.pyplot as plt
# Pearson correlation of every numeric feature with the target
target = "outcome"
corrs = df.drop(columns=[target]).corrwith(df[target], numeric_only=True).abs().sort_values(ascending=False)
corrs.plot(kind="bar", color=["#ef4444" if v > 0.6 else "#2563eb" for v in corrs])
# 0.6 is a heuristic cut-off for "investigate this feature", not proof of leakage
plt.axhline(0.6, color="#ef4444", linestyle="--", label="leakage threshold")
plt.legend()
plt.title("Feature–target correlations (|r|)")
plt.tight_layout()
plt.show()
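To see what the scan looks like when a leak is present, the sketch below builds a synthetic frame with one unrelated feature, one genuinely predictive but noisy feature, and one feature that is effectively recorded after the outcome. All column names and coefficients are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
outcome = rng.integers(0, 2, size=n)

demo = pd.DataFrame({
    "age": rng.normal(50, 10, size=n),                            # unrelated to the outcome
    "glucose": 100 + 10 * outcome + rng.normal(0, 20, size=n),    # predictive but noisy
    "followup_visits": outcome * 3 + rng.integers(0, 2, size=n),  # recorded after the outcome
    "outcome": outcome,
})

corrs = demo.drop(columns=["outcome"]).corrwith(demo["outcome"]).abs()
suspects = corrs[corrs > 0.6].index.tolist()

print(corrs.round(2))
print("Investigate:", suspects)
```

Only the post-outcome column crosses the 0.6 line. A flagged column is a reason to ask when and how it was recorded, not an automatic verdict.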
Fix pipeline leakage — always use sklearn Pipeline
The single most important anti-leakage practice is to wrap all preprocessing steps inside a scikit-learn Pipeline that is fitted inside cross-validation, never on the full dataset.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
X = df.drop(columns=["outcome"])
y = df["outcome"]
# Split FIRST — nothing touches X_test until evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Pipeline: impute → scale → model
# Imputer and scaler are fit ONLY on training folds
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
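The effect of fitting a preprocessing step on the full dataset is easiest to see with a step that looks at the target. The sketch below swaps in univariate feature selection (`SelectKBest`) for that reason. It is not part of the pipeline above, and the data is pure noise generated for illustration, so any honest score should hover around 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise: no feature carries signal
y = np.array([0, 1] * 50)          # balanced labels, independent of X

# Leaky: pick the "best" features using ALL rows, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5, scoring="roc_auc"
).mean()

# Honest: the selection step is refit inside every training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky AUC:  {leaky_auc:.3f}")
print(f"honest AUC: {honest_auc:.3f}")
```

The leaky variant reports a strong score on data that contains no signal at all, because the held-out folds already influenced which features were kept.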
Eight questions before you write a single model line.
If you cannot answer all eight questions, you are not ready to model. Tick each one off as you work through a new dataset.
The full Python workflow that answers all eight questions at once:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("your_data.csv")
# ── Q1. What is the shape? ──────────────────────────────
print("Shape:", df.shape)
# ── Q2. What are the types? ─────────────────────────────
df.info()
# ── Q3. Do the ranges make sense? ──────────────────────
print(df.describe().round(2))
# ── Q4. Are there impossible zeros? ────────────────────
print((df == 0).sum()[(df == 0).sum() > 0])
# ── Q5. How much is missing? ────────────────────────────
miss = df.isnull().mean().sort_values(ascending=False)
print(miss[miss > 0])
# ── Q6. What is the class balance? ─────────────────────
print(df["target"].value_counts(normalize=True))
# ── Q7. Are there suspicious feature–target correlations? ─
high_corr = df.drop(columns=["target"]).corrwith(df["target"], numeric_only=True).abs()
print(high_corr[high_corr > 0.6])
# ── Q8. Are any features duplicates or near-constants? ──
stds = df.std(numeric_only=True)
print("Near-zero variance:", stds[stds < 0.01].index.tolist())
print("Duplicate cols:", df.T.duplicated().sum())
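The Q8 checks are easier to trust after seeing them on a frame where the answer is known. The sketch below uses a made-up frame with one duplicated column and one constant column; the names are hypothetical. Note that the 0.01 cut-off depends on the scale of each column, so it is a starting point rather than a rule.

```python
import pandas as pd

# Hypothetical toy data with a known duplicate and a known constant.
demo = pd.DataFrame({
    "height_cm": [170.0, 182.0, 165.0, 174.0],
    "height_copy": [170.0, 182.0, 165.0, 174.0],   # exact duplicate of height_cm
    "site_id": [1, 1, 1, 1],                       # constant
})

stds = demo.std(numeric_only=True)
near_constant = stds[stds < 0.01].index.tolist()

# duplicated() on the transpose marks every column that repeats an earlier one.
dup_mask = demo.T.duplicated()
duplicate_cols = dup_mask[dup_mask].index.tolist()

print("Near-zero variance:", near_constant)
print("Duplicate cols:", duplicate_cols)
```

Listing the column names, rather than only counting them, shows directly which columns to drop or investigate.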
Five questions. Pencil down.
Each question maps to one of the four lessons. The answer is in the code or the explanations above — not in the library docs.