OmicsHub Space
Module B01 · Basics of Synthetic Data
Lesson 01 · Definition

Data that wasn't measured, it was made.

Synthetic data is information generated by an algorithm rather than collected from a real-world event. It mimics the statistical patterns of real data so closely that, for many purposes, you can train a model or test a system as if it were the real thing.

Real data comes from measurement: a sensor reading, a patient record, a transaction log. Synthetic data comes from a process (a rule, a probability distribution, a neural network) designed to produce records that look and behave like the real ones, without actually being them.

Definition

Synthetic data is any data produced by a generative process whose goal is to preserve selected statistical properties of an original dataset, while breaking the direct one-to-one link between records and the real individuals or events they came from.

Three things that tend to be true

  • It is structurally indistinguishable from real data: same columns, same value ranges, same types.
  • It is statistically faithful at the level you care about: distributions, correlations, dependencies.
  • It is not a copy. No synthetic row should map directly back to a specific real individual.
Lesson 02 · Motivations

Why would anyone make up data?

There are at least five honest reasons researchers and practitioners reach for synthetic data, and most projects you'll see have more than one.

Each motivation corresponds to a real, recurring problem. The same generation method might solve one of these well and another poorly, which is why the choice of method follows from the choice of motivation, not the other way around.

01
Privacy
Share something useful without exposing the people behind the data. Regulations such as GDPR, HIPAA, and other data-protection laws make this a legal concern too.
02
Scarcity
Rare diseases, rare fraud patterns, rare failure modes. When real examples are too few to train on, synthesise more from what you have.
03
Bias correction
If a class is underrepresented, synthetic oversampling can rebalance the dataset without discarding majority-class examples, as downsampling would.
04
Cost & speed
Labelling real data is slow and expensive. Generation can take hours instead of months, and the marginal cost of an extra row drops to near zero.
05
Edge cases
Self-driving cars need to see rare hazards in training; fraud systems need to be tested against attacks they haven't seen yet. Synthesis lets you author the edge.
In one sentence

Synthetic data exists because real data is sometimes private, sometimes rare, sometimes biased, sometimes expensive, and sometimes simply absent for the case you care about.
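The bias-correction motivation can be made concrete with a minimal sketch of synthetic oversampling. It assumes numeric features and creates new minority rows by interpolating between random pairs of real minority rows. This is a simplified, SMOTE-style idea (real SMOTE interpolates between nearest neighbours), and the function name `interpolate_oversample` and the toy data are hypothetical.

```python
import random

def interpolate_oversample(minority, n_new, seed=0):
    """Create n_new synthetic rows on segments between random pairs of minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct real minority rows
        t = rng.random()                 # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical toy data: 95 majority rows, 5 minority rows, two numeric features.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(5)]

# Add just enough synthetic minority rows to match the majority count.
new_minority = interpolate_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_minority
print(len(majority), len(balanced_minority))
```

Every majority row is kept; only the minority side grows. Because each new row lies between two real rows, it cannot explore regions the minority class never visited, which is one reason more expressive generators are sometimes preferred.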

Lesson 03 · Taxonomy

Not all synthetic data is synthetic in the same way.

The word "synthetic" hides a spectrum. At one end, every cell is generated from scratch. At the other, only sensitive attributes are replaced. Knowing which kind you have changes everything downstream.

The grids in each card show how many cells of the original table are kept (dark) versus replaced (blue).

Fully synthetic
Every record is generated by the model. No real row survives. Privacy is typically strongest here because no real record is released, although a generator that memorises its training rows can still leak them.
Partially synthetic
Only sensitive columns are replaced; the rest is the original data. Preserves most statistics but leaves a re-identification surface.
Hybrid
Some rows entirely real, some entirely synthetic. Common when augmenting a small dataset while anchoring evaluation on real data.
A common mistake

Calling a dataset "synthetic" without specifying which kind. Privacy guarantees, evaluation metrics, and the appropriate generation method all depend on this choice. Always state it explicitly.
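As a rough illustration of the three kinds, the sketch below builds each one from a toy two-column table and stores the kind alongside the rows, so the label is always stated explicitly. The table, the `synth_row` stand-in generator, and the choice of `income` as the sensitive column are all assumptions made for the example.

```python
import random

rng = random.Random(0)

# Hypothetical real table: (age, income); income is treated as the sensitive column.
real = [(rng.randint(20, 70), round(rng.gauss(50_000, 10_000))) for _ in range(6)]

def synth_row():
    """Stand-in generator; a real project would sample from a fitted model."""
    return (rng.randint(20, 70), round(rng.gauss(50_000, 10_000)))

# Each dataset carries an explicit kind and a count of cells copied from `real`.
fully = {"kind": "fully synthetic",
         "rows": [synth_row() for _ in real],
         "real_cells": 0}
partially = {"kind": "partially synthetic",
             "rows": [(age, synth_row()[1]) for age, _ in real],  # keep age, replace income
             "real_cells": len(real)}
hybrid = {"kind": "hybrid",
          "rows": real[:3] + [synth_row() for _ in range(3)],     # 3 real rows + 3 synthetic
          "real_cells": 3 * 2}

for ds in (fully, partially, hybrid):
    total = len(ds["rows"]) * 2
    print(f'{ds["kind"]}: {ds["real_cells"]}/{total} cells are real')
```

The partially synthetic and hybrid tables keep the same number of real cells here, but in different places: one keeps a whole column, the other keeps whole rows. That difference is what changes the re-identification surface.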

Lesson 04 · Methods

Four families, one shared idea.

Every method, no matter how sophisticated, does the same thing: it learns or assumes a probability distribution over the data, then draws new samples from that distribution.

Family 01
Statistical / distribution-based
Fit a known distribution (Gaussian, log-normal, multinomial) or joint model to the real data, then sample from it. Fast, transparent, and easy to audit, but constrained by the assumed shape.
Examples: Gaussian copula, Bayesian networks, KDE
Family 02
Rule-based / simulation
Encode domain knowledge as rules or a simulator. Sample by running the simulator. Useful when the system is well-understood and real data is scarce or unsafe to use.
Examples: CARLA, agent-based models, Monte Carlo
Family 03
Deep generative models
Learn the distribution implicitly with a neural network. GANs, VAEs, normalising flows, and diffusion models capture complex, high-dimensional dependencies that simple statistical models often cannot.
Examples: CTGAN, TVAE, TabDDPM, normalising flows
Family 04
Language-model based
Treat tabular rows as token sequences and let a transformer learn the distribution. Competitive on tabular data, especially with many categorical columns.
Examples: GReaT, REaLTabFormer, TabPFN
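To make the shared idea concrete for Family 01, here is a minimal two-column Gaussian copula sketch using only the Python standard library. It assumes numeric columns, uses empirical quantiles for the marginals (so sampled values are drawn from the observed values, while the pairing between columns is new), and is not how any particular library implements it. The helper names and the toy data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal, used to move between ranks and z-scores

def normal_scores(col):
    """Map each value to a standard-normal score through its rank."""
    n = len(col)
    order = sorted(range(n), key=lambda i: col[i])
    z = [0.0] * n
    for rank, i in enumerate(order):
        z[i] = STD.inv_cdf((rank + 0.5) / n)
    return z

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def quantile(sorted_col, u):
    """Empirical quantile: the value at relative position u in the sorted column."""
    return sorted_col[min(int(u * len(sorted_col)), len(sorted_col) - 1)]

def copula_sample(x, y, n, seed=0):
    rng = random.Random(seed)
    rho = pearson(normal_scores(x), normal_scores(y))   # step 1: learn the dependence
    spread = math.sqrt(max(0.0, 1 - rho ** 2))
    xs, ys = sorted(x), sorted(y)
    out = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)                             # step 2: draw correlated normals
        z2 = rho * z1 + spread * rng.gauss(0, 1)
        out.append((quantile(xs, STD.cdf(z1)),           # step 3: map back to the marginals
                    quantile(ys, STD.cdf(z2))))
    return out

# Hypothetical real data: a skewed column and a noisy copy of it.
rng = random.Random(7)
real_x = [rng.expovariate(1.0) for _ in range(500)]
real_y = [v + rng.gauss(0, 0.5) for v in real_x]

synth = copula_sample(real_x, real_y, n=500)
print("real corr :", round(pearson(real_x, real_y), 3))
print("synth corr:", round(pearson([s[0] for s in synth], [s[1] for s in synth]), 3))
```

The copula separates two questions: what each column looks like on its own (the marginals) and how the columns move together (the correlation of the normal scores). That separation is why the skewed shape of `real_x` survives even though the dependence model is Gaussian.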

Try it: fit, then sample

The demo uses the simplest Family 01 method: a Gaussian model. Pick a true distribution, generate a sample, and watch the Gaussian try to imitate it. You'll see immediately where simple methods succeed and where they fail.

[Interactive demo: Fit & sample · Gaussian model. Control value shown: 1000. Panels: Real data, Synthetic data. Readouts: Real mean, Synth mean, Real σ, Synth σ, KS distance, Verdict.]
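The fit-and-sample demo can be approximated offline with the sketch below. It fits a single Gaussian with `statistics.NormalDist.from_samples`, draws a synthetic sample of the same size, and reports a two-sample KS distance computed by a small hand-written helper. The two "true" distributions and the sample size of 1000 are assumptions chosen for illustration, and no verdict threshold is applied because the demo's own threshold is not stated.

```python
import random
from statistics import NormalDist

def ks_distance(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def fit_and_sample(real, n, seed=0):
    """Fit a Gaussian to the real sample, then draw n synthetic values from it."""
    model = NormalDist.from_samples(real)
    return model, model.samples(n, seed=seed)

rng = random.Random(42)
n = 1000
normal_real = [rng.gauss(10, 2) for _ in range(n)]
bimodal_real = [rng.gauss(rng.choice((-4, 4)), 1) for _ in range(n)]

results = {}
for name, real in (("normal", normal_real), ("bimodal", bimodal_real)):
    model, synth = fit_and_sample(real, n)
    results[name] = ks_distance(real, synth)
    print(f"{name:8s} fitted mean={model.mean:6.2f} sigma={model.stdev:5.2f} "
          f"KS={results[name]:.3f}")
```

On the bimodal data the fitted mean and σ can look perfectly reasonable while the KS distance is much larger than on the normal data. Matching summary statistics is not the same as matching the distribution, which is the failure the demo is meant to expose.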
Lesson 05 · Evaluation

Fidelity. Utility. Privacy. Pick two.

A synthetic dataset is evaluated along three axes that cannot all be maximised at once. Fidelity and utility tend to rise together, and pushing privacy up usually pulls at least one of them down.

The three axes

  • Fidelity: how closely the synthetic distribution matches the real one. Measured with KS distance, Jensen–Shannon divergence, correlation similarity, and discriminator scores.
  • Utility: how well a model trained on the synthetic data performs on a real test set. The standard protocol is train-on-synthetic, test-on-real (TSTR).
  • Privacy: how hard it is to recover information about individuals in the real dataset. Measured with membership inference attacks, distance to closest record, and differential privacy bounds.
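The TSTR protocol named under utility can be sketched in a few lines. The example below is a toy under stated assumptions: one numeric feature, two classes, a per-class Gaussian generator standing in for the synthesiser, and a nearest-centroid classifier standing in for the downstream model. All function names and data are hypothetical.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(0)

def make_real(n):
    """Hypothetical real data: one feature, two classes with different means."""
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        rows.append((rng.gauss(2.0 if label else 0.0, 1.0), label))
    return rows

def synthesise(rows, n, seed=1):
    """Per-class Gaussian generator: fit each class separately, then sample."""
    out = []
    for label in (0, 1):
        model = NormalDist.from_samples([x for x, y in rows if y == label])
        out += [(x, label) for x in model.samples(n // 2, seed=seed + label)]
    return out

def train_centroids(rows):
    """Downstream model: remember the mean feature value of each class."""
    return {label: fmean(x for x, y in rows if y == label) for label in (0, 1)}

def accuracy(centroids, rows):
    hits = sum(min(centroids, key=lambda c: abs(x - centroids[c])) == y
               for x, y in rows)
    return hits / len(rows)

real_train, real_test = make_real(1000), make_real(1000)
synthetic = synthesise(real_train, 1000)

trtr = accuracy(train_centroids(real_train), real_test)  # baseline: train on real
tstr = accuracy(train_centroids(synthetic), real_test)   # train on synthetic, test on real
print(f"TRTR accuracy {trtr:.3f}, TSTR accuracy {tstr:.3f}")
```

The number that matters is the gap between the two accuracies, not TSTR on its own. The real test set must be held out from the generator as well as from the classifier, otherwise the comparison flatters the synthetic data.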
The trilemma

A perfect copy of the real data has perfect fidelity and utility and zero privacy. Pure random noise has perfect privacy and zero fidelity or utility. Real synthetic data sits somewhere between, and any honest report should state where.

See it move

Drag the noise slider to add privacy noise. Watch fidelity (correlation) go down as privacy goes up.

[Interactive demo: Privacy ↔ fidelity tradeoff. Noise slider shown at 0.20. Panels: Real data, Synthetic data. Readouts: Real corr., Synth corr., Fidelity, Privacy, Reading the result.]
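The slider's effect can be reproduced with a small experiment. The sketch below assumes the "synthetic" data is the real data plus independent Gaussian noise of a chosen scale, which is a stand-in for the demo's mechanism rather than a description of it. Fidelity is read as the correlation between the two columns, and privacy is approximated by the mean distance to closest record (DCR). The noise scales and data are hypothetical.

```python
import math
import random

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_dcr(synth, real):
    """Mean distance from each synthetic row to its closest real row."""
    return sum(min(math.dist(s, r) for r in real) for s in synth) / len(synth)

rng = random.Random(0)

# Hypothetical real data: two columns with a true correlation of about 0.9.
x = [rng.gauss(0, 1) for _ in range(300)]
y = [0.9 * a + math.sqrt(1 - 0.9 ** 2) * rng.gauss(0, 1) for a in x]
real = list(zip(x, y))

results = []  # (noise scale, synthetic correlation, mean DCR)
for scale in (0.0, 0.2, 1.0, 3.0):
    sx = [v + rng.gauss(0, scale) for v in x]
    sy = [v + rng.gauss(0, scale) for v in y]
    results.append((scale, pearson(sx, sy), mean_dcr(list(zip(sx, sy)), real)))

print(f"real corr. {pearson(x, y):.3f}")
for scale, corr, dcr in results:
    print(f"noise {scale:3.1f}  synth corr. {corr:6.3f}  mean DCR {dcr:.3f}")
```

At zero noise the synthetic rows are the real rows, so the correlation is untouched and the DCR is exactly zero: perfect fidelity, no privacy. As the noise grows the correlation decays and the DCR rises, tracing the tradeoff the slider shows.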
Lesson 06 · Checkpoint

Five questions. Pencil down.

If you've worked through the previous five lessons, you should be able to answer each of these quickly. If a question stumps you, the lesson it draws from is one click away in the sidebar.

Q01
Which best describes synthetic data?
A. Encrypted real data with the same row count
B. Data generated by a process that imitates real data's statistical properties
C. Data collected but with names removed
D. A backup copy of real data
Synthetic data is generated, not measured. Anonymisation, encryption, and backups all start from real data; synthesis produces new records from scratch.
Q02
Which motivation is NOT a typical reason to generate synthetic data?
A. To comply with privacy regulations
B. To rebalance an under-represented class
C. To generate edge cases for testing
D. To increase the original data's storage efficiency
Storage efficiency is unrelated to synthetic data. The common motivations are privacy, scarcity, bias correction, cost, and edge-case coverage.
Q03
In a partially synthetic dataset, what is true?
A. Every cell is generated by a model
B. Only sensitive columns are replaced; the rest is the original data
C. Half the rows are real and half are synthetic
D. The dataset is encrypted column by column
Partially synthetic means only the sensitive columns are synthesised. The non-sensitive part of the original table is preserved as-is.
Q04
Which family of methods explicitly fits a known probability distribution to the data?
A. Deep generative models
B. Language-model based methods
C. Statistical / distribution-based methods
D. Rule-based simulation
Statistical methods fit a known parametric distribution and sample from it. Deep models learn the distribution implicitly through training.
Q05
Under the fidelity–utility–privacy trilemma, what happens when you push privacy to its maximum?
A. Fidelity and utility both go up
B. Fidelity goes up, utility goes down
C. Fidelity and utility both degrade
D. Nothing changes; the three are independent
Maximum privacy means the synthetic data shares little with the real data, which destroys both statistical fidelity and downstream utility. The three are tightly coupled.