Lesson 01 · Definition

Data that wasn't measured. It was made.

Synthetic data is information generated by an algorithm rather than collected from a real-world event. It mimics the statistical patterns of real data so closely that, for many purposes, you can train a model or test a system as if it were the real thing.

Real data comes from measurement: a sensor reading, a patient record, a transaction log. Synthetic data comes from a process (a rule, a probability distribution, a neural network) designed to produce records that look and behave like the real ones without actually being them.

Definition

Synthetic data is any data produced by a generative process whose goal is to preserve selected statistical properties of an original dataset, while breaking the direct one-to-one link between records and the real individuals or events they came from.

Three things that tend to be true

It is structurally indistinguishable from real data: same columns, same value ranges, same types.
It is statistically faithful at the level you care about: distributions, correlations, dependencies.
It is not a copy. No synthetic row should map directly back to a specific real individual.

Lesson 02 · Motivations

Why would anyone make up data?

There are at least five honest reasons researchers and practitioners reach for synthetic data, and most projects you'll see have more than one.

Each motivation corresponds to a real, recurring problem. The same generation method might solve one of these well and another poorly, which is why the choice of method follows from the choice of motivation, not the other way around.

01

Privacy

Share something useful without exposing the people behind the data. Regulations such as GDPR, HIPAA, and other data-protection laws make this a legal concern too.

02

Scarcity

Rare diseases, rare fraud patterns, rare failure modes. When real examples are too few to train on, synthesise more from what you have.

03

Bias correction

If a class is underrepresented, synthetic oversampling can rebalance the dataset without discarding majority-class examples, as downsampling would.
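As a rough illustration of synthetic oversampling, the sketch below creates new minority rows by interpolating between a real minority row and one of its nearest minority neighbours, in the spirit of SMOTE. The function name `oversample_minority`, the toy data, and the choice of `k` are assumptions made for this example, not part of the lesson.

```python
import numpy as np

def oversample_minority(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    sampled real row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)      # real row to start from
    pick = rng.integers(0, k, size=n_new)      # which neighbour to move towards
    partner = neighbours[base, pick]
    t = rng.random((n_new, 1))                 # interpolation weight in [0, 1)
    return X_min[base] + t * (X_min[partner] - X_min[base])

# Toy, made-up data: 200 majority rows and only 10 minority rows.
rng = np.random.default_rng(42)
X_majority = rng.normal(0.0, 1.0, size=(200, 2))
X_minority = rng.normal(3.0, 0.5, size=(10, 2))

X_new = oversample_minority(X_minority, n_new=190)
X_minority_balanced = np.vstack([X_minority, X_new])
```

Every real row is kept, in both classes. The new rows lie on line segments between existing minority rows, so they add volume to the minority class but no information about regions it never covered.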

04

Cost & speed

Labelling real data is slow and expensive. Generation can take hours instead of months, and the marginal cost of an extra row drops to near zero.

05

Edge cases

Self-driving cars need to see rare hazards in training; fraud systems need to be tested against attacks they haven't seen yet. Synthesis lets you author the edge.

In one sentence

Synthetic data exists because real data is sometimes private, sometimes rare, sometimes biased, sometimes expensive, and sometimes simply absent for the case you care about.

Lesson 03 · Taxonomy

Not all synthetic data is synthetic in the same way.

The word "synthetic" hides a spectrum. At one end, every cell is generated from scratch. At the other, only sensitive attributes are replaced. Knowing which kind you have changes everything downstream.

The grids in each card show how many cells of the original table are kept (dark) versus replaced (blue).

Fully synthetic


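Every record is generated by the model. No real row survives. Privacy is usually strongest here because no released record corresponds to a real individual, although a generator that memorises its training rows can still leak them.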

Partially synthetic


Only sensitive columns are replaced; the rest is the original data. Preserves most statistics but leaves a re-identification surface.
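A minimal sketch of the partially synthetic case, assuming a made-up three-column table in which `income` is the sensitive column. The column names, the log-normal fit, and the helper `synthesise_column` are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" table: two retained columns and one sensitive column.
n = 500
age = rng.integers(18, 90, size=n)
region = rng.integers(0, 5, size=n)
income = rng.lognormal(mean=10.0, sigma=0.5, size=n)   # sensitive

def synthesise_column(values, rng):
    """Replace a positive numeric column with draws from a log-normal
    distribution fitted to it (fit = mean and std of the log values)."""
    log_v = np.log(values)
    return rng.lognormal(mean=log_v.mean(), sigma=log_v.std(), size=len(values))

income_synth = synthesise_column(income, rng)

# Partially synthetic table: age and region are the original values,
# income is generated.
partial = {"age": age, "region": region, "income": income_synth}
```

Two caveats show up even in this toy version. Drawing `income` independently of the other columns destroys any relationship it had with `age` or `region`, so a practical method would condition on the retained columns. And the retained columns are still real, which is exactly the re-identification surface the card mentions.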

Hybrid


Some rows entirely real, some entirely synthetic. Common when augmenting a small dataset while anchoring evaluation on real data.

A common mistake

Calling a dataset "synthetic" without specifying which kind. Privacy guarantees, evaluation metrics, and the appropriate generation method all depend on this choice. Always state it explicitly.

Lesson 04 · Methods

Four families, one shared idea.

Every method, no matter how sophisticated, does the same thing: it learns or assumes a probability distribution over the data, then draws new samples from that distribution.

Family 01

Statistical / distribution-based

Fit a known distribution (Gaussian, log-normal, multinomial) or joint model to the real data, then sample from it. Fast, transparent, and easy to audit, but constrained by the assumed shape.

Examples: Gaussian copula, Bayesian networks, KDE
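The fit-then-sample pattern of this family can be sketched with the simplest joint model, a multivariate Gaussian. This is an illustrative stand-in rather than a Gaussian copula or Bayesian network, and the toy "real" data and function names are assumptions for the example.

```python
import numpy as np

def fit_gaussian(X):
    """Fit step: estimate the mean vector and covariance matrix."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def sample_gaussian(mean, cov, n, seed=0):
    """Sample step: draw n new rows from the fitted distribution."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n)

# Made-up "real" data: two correlated numeric columns.
rng = np.random.default_rng(1)
real = rng.multivariate_normal(
    [50.0, 120.0], [[100.0, 60.0], [60.0, 225.0]], size=2000
)

mean, cov = fit_gaussian(real)
synth = sample_gaussian(mean, cov, n=2000)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synth, rowvar=False)[0, 1]
print(f"real corr = {real_corr:.3f}, synth corr = {synth_corr:.3f}")
```

The whole model is two arrays, `mean` and `cov`, which is what makes this family easy to audit. It is also the constraint: anything the real data does that a Gaussian cannot express (skew, multiple modes, hard bounds) is lost.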

Family 02

Rule-based / simulation

Encode domain knowledge as rules or a simulator. Sample by running the simulator. Useful when the system is well-understood and real data is scarce or unsafe to use.

Examples: CARLA, agent-based models, Monte Carlo
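A toy Monte Carlo simulator shows the idea of sampling by running rules. Every rule and number below (the fraud rate, the amount distributions, the night-time window) is a hypothetical stand-in for domain knowledge, not a real fraud model.

```python
import numpy as np

def simulate_transactions(n, fraud_rate=0.02, seed=0):
    """Generate n transactions from hand-written rules, using no real data."""
    rng = np.random.default_rng(seed)
    is_fraud = rng.random(n) < fraud_rate

    # Rule 1 (assumed): fraudulent amounts are drawn from a log-normal
    # with a higher location parameter than legitimate ones.
    amount = np.where(
        is_fraud,
        rng.lognormal(6.0, 0.8, n),
        rng.lognormal(3.5, 1.0, n),
    )
    # Rule 2 (assumed): fraud happens between 00:00 and 05:59,
    # legitimate activity between 06:00 and 23:59.
    hour = np.where(
        is_fraud,
        rng.integers(0, 6, n),
        rng.integers(6, 24, n),
    )
    return {"amount": amount, "hour": hour, "is_fraud": is_fraud}

data = simulate_transactions(10_000)
fraud_share = data["is_fraud"].mean()
```

Because the rules are explicit, rare cases can be dialled up on demand: raising `fraud_rate` produces as many fraud rows as a test needs. The data is only as realistic as the rules themselves.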

Family 03

Deep generative models

Learn the distribution implicitly with a neural network. GANs, VAEs, normalising flows, and diffusion models capture complex, high-dimensional dependencies that simpler statistical models cannot.

Examples: CTGAN, TVAE, TabDDPM, normalising flows

Family 04

Language-model based

Treat tabular rows as token sequences and let a transformer learn the distribution. Competitive on tabular data, especially with many categorical columns.

Examples: GReaT, REaLTabFormer, TabPFN
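To make "rows as token sequences" concrete, the sketch below serialises a row into a short sentence and parses it back. The "column is value" format with shuffled column order is loosely modelled on the textual encodings used by methods of this kind; the exact format, the function names, and the example row are assumptions, and no language model is involved here.

```python
import random

def row_to_text(row, rng):
    """Serialise a row dict as 'column is value' clauses in random order,
    so that a sequence model does not depend on one fixed column order."""
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)

def text_to_row(text):
    """Parse a generated sentence back into a row dict (values stay strings)."""
    row = {}
    for clause in text.split(", "):
        col, _, val = clause.partition(" is ")
        row[col] = val
    return row

rng = random.Random(0)
row = {"age": 42, "sex": "F", "diagnosis": "asthma"}   # made-up record

sentence = row_to_text(row, rng)
recovered = text_to_row(sentence)
print(sentence)
print(recovered)
```

In a full pipeline, sentences like this would be the training text for the transformer, and `text_to_row` would be applied to its generated output. This toy parser breaks if a value itself contains ", " or " is ", so a practical encoder needs escaping and validation of the parsed rows.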

Try it: fit, then sample

The demo uses the simplest Family 01 method: a Gaussian model. Pick a true distribution, generate a sample, and watch the Gaussian try to imitate it. You'll see immediately where simple methods succeed and where they fail.

[Interactive demo. Controls: true distribution, sample size (shown at 1000). Panels: real data, synthetic data. Readouts: real mean, synth mean, real σ, synth σ, KS distance.]
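An offline approximation of the demo, assuming two hand-picked "true" distributions (a normal and a bimodal mixture) that may not match the options in the widget. The KS distance is computed directly as the largest gap between the two empirical CDFs; the helper names are illustrative.

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def gaussian_synth(real, seed=0):
    """Fit a single Gaussian (mean, std) and draw a same-sized sample."""
    rng = np.random.default_rng(seed)
    return rng.normal(real.mean(), real.std(), size=len(real))

rng = np.random.default_rng(7)
n = 1000
real_normal = rng.normal(10.0, 2.0, size=n)
real_bimodal = np.concatenate(
    [rng.normal(-3.0, 1.0, n // 2), rng.normal(3.0, 1.0, n // 2)]
)

synth_normal = gaussian_synth(real_normal)
synth_bimodal = gaussian_synth(real_bimodal)

ks_normal = ks_distance(real_normal, synth_normal)
ks_bimodal = ks_distance(real_bimodal, synth_bimodal)
print(f"normal:  KS = {ks_normal:.3f}")
print(f"bimodal: KS = {ks_bimodal:.3f}")
```

In both cases the synthetic mean and σ land close to the real ones, because those are the two quantities the Gaussian is fitted to. Only the KS distance shows that the bimodal shape was lost, which is why matching summary statistics is not enough to claim fidelity.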

Lesson 05 · Evaluation

Fidelity. Utility. Privacy. Pick two.

A synthetic dataset is evaluated along three axes that cannot all be maximised at once. Push one up and at least one of the others drops.

The three axes

Fidelity: how closely the synthetic distribution matches the real one. Measured with KS distance, Jensen–Shannon divergence, correlation similarity, and discriminator scores.
Utility: how well a model trained on the synthetic data performs on a real test set. The standard protocol is train-on-synthetic, test-on-real (TSTR).
Privacy: how hard it is to recover information about individuals in the real dataset. Measured with membership inference attacks, distance to closest record, and differential privacy bounds.
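One simple metric per axis can be sketched with NumPy alone. The nearest-centroid classifier, the toy two-class data, and the per-class Gaussian generator are assumptions chosen to keep the example short; they stand in for whatever model and generator a real evaluation would use.

```python
import numpy as np

def correlation_gap(real, synth):
    """Fidelity: mean absolute difference between correlation matrices."""
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    return float(np.mean(np.abs(diff)))

def tstr_accuracy(X_synth, y_synth, X_real_test, y_real_test):
    """Utility: train a nearest-centroid classifier on synthetic rows,
    then measure its accuracy on real held-out rows (TSTR)."""
    classes = np.unique(y_synth)
    centroids = np.stack([X_synth[y_synth == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X_real_test[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(classes[d.argmin(axis=1)] == y_real_test))

def distance_to_closest_record(real, synth):
    """Privacy proxy: distance from each synthetic row to its nearest real row."""
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(3)

def make_real(n):
    """Made-up two-class data; class 1 is shifted by 2.5 in both features."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0.0, 1.0, size=(n, 2)) + 2.5 * y[:, None]
    return X, y

X_train, y_train = make_real(400)
X_test, y_test = make_real(400)

# Toy generator: one Gaussian fitted per class, sampled at the same class sizes.
parts = []
for c in (0, 1):
    Xc = X_train[y_train == c]
    parts.append(
        rng.multivariate_normal(Xc.mean(axis=0), np.cov(Xc, rowvar=False), size=len(Xc))
    )
X_synth = np.vstack(parts)
y_synth = np.repeat([0, 1], [len(p) for p in parts])

fidelity_gap = correlation_gap(X_train, X_synth)
utility = tstr_accuracy(X_synth, y_synth, X_test, y_test)
dcr_synth = distance_to_closest_record(X_train, X_synth)
dcr_copy = distance_to_closest_record(X_train, X_train)   # a verbatim copy
```

The last line is the useful sanity check: a verbatim copy of the training data has a distance to closest record of exactly zero for every row, which is the "zero privacy" end of the scale. Distance to closest record is a heuristic, not a formal guarantee.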

The trilemma

A perfect copy of the real data has perfect fidelity and utility and zero privacy. Pure random noise has perfect privacy and zero fidelity or utility. Real synthetic data sits somewhere between, and any honest report should state where.

See it move

Drag the noise slider to add privacy noise. Watch fidelity (correlation) go down as privacy goes up.

[Interactive demo. Control: privacy noise (ε⁻¹) slider, shown at 0.20. Panels: real data, synthetic data. Readouts: real corr., synth corr., fidelity, privacy.]
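The slider's effect can be approximated offline by adding independent Gaussian noise to each column and recomputing the correlation. This is a sketch under assumptions: `noisy_release` is a made-up helper, the noise is scaled by each column's standard deviation, and it is plain additive noise rather than a calibrated differential-privacy mechanism, so no mapping to a specific ε is implied. With noise scale s, the covariance is unchanged while each variance grows by a factor (1 + s²), so the expected correlation shrinks from ρ to ρ / (1 + s²).

```python
import numpy as np

def noisy_release(real, noise_scale, seed=0):
    """Add independent Gaussian noise to every column; noise_scale is
    expressed in units of each column's standard deviation."""
    rng = np.random.default_rng(seed)
    return real + rng.normal(0.0, noise_scale * real.std(axis=0), size=real.shape)

# Made-up "real" data: two columns with true correlation 0.8.
rng = np.random.default_rng(11)
real = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
real_corr = np.corrcoef(real, rowvar=False)[0, 1]

results = {}
for scale in (0.0, 0.2, 0.5, 1.0, 2.0):
    synth = noisy_release(real, scale)
    results[scale] = np.corrcoef(synth, rowvar=False)[0, 1]
    predicted = real_corr / (1.0 + scale**2)
    print(f"scale={scale:.1f}  corr={results[scale]:.3f}  predicted={predicted:.3f}")
```

At a noise scale of 1.0 the correlation is roughly halved, and at 2.0 most of the relationship is gone. That is the trade the slider makes visible: each step towards hiding individual rows also erases some of the structure a downstream model would have learned from.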

Lesson 06 · Checkpoint

Five questions. Pencil down.

If you've worked through the previous five lessons, you should be able to answer each of these quickly. If a question stumps you, the lesson it draws from is one click away in the sidebar.

Q01

Which best describes synthetic data?

A. Encrypted real data with the same row count

B. Data generated by a process that imitates real data's statistical properties

C. Data collected but with names removed

D. A backup copy of real data

Synthetic data is generated, not measured. Anonymisation, encryption, and backups all start from real data; synthesis produces new records from scratch.

Q02

Which motivation is NOT a typical reason to generate synthetic data?

A. To comply with privacy regulations

B. To rebalance an under-represented class

C. To generate edge cases for testing

D. To increase the original data's storage efficiency

Storage efficiency is unrelated to synthetic data. The common motivations are privacy, scarcity, bias correction, cost, and edge-case coverage.

Q03

In a partially synthetic dataset, what is true?

A. Every cell is generated by a model

B. Only sensitive columns are replaced; the rest is the original data

C. Half the rows are real and half are synthetic

D. The dataset is encrypted column by column

Partially synthetic means only the sensitive columns are synthesised. The non-sensitive part of the original table is preserved as-is.

Q04

Which family of methods explicitly fits a known probability distribution to the data?

A. Deep generative models

B. Language-model based methods

C. Statistical / distribution-based methods

D. Rule-based simulation

Statistical methods fit a known parametric distribution and sample from it. Deep models learn the distribution implicitly through training.

Q05

Under the fidelity–utility–privacy trilemma, what happens when you push privacy to its maximum?

A. Fidelity and utility both go up

B. Fidelity goes up, utility goes down

C. Fidelity and utility both degrade

D. Nothing changes; the three are independent

Maximum privacy means the synthetic data shares little with the real data, which destroys both statistical fidelity and downstream utility. The three are tightly coupled.
