Cognitive Revolution · April 20, 2025 · 65m

Synthetic Data: Training AI on AI Output

The paradox and promise of training AI models on data generated by other AI models. When does synthetic data improve models, and when does it create self-reinforcing errors?

Canon

• Synthetic data as an artificial environment: garbage environments produce garbage behavior

Training on synthetic data creates an artificial environment for model development. When the synthetic environment is high quality, behavior improves; when it is low quality, models learn and amplify errors.

Claude ChatGPT Gemini

• Model collapse as a form of hedonic adaptation to easy patterns

Models trained on their own outputs gradually lose diversity, converging on a narrower range of responses. This resembles hedonic adaptation: the model settles into patterns that are easy to reproduce, and rarer responses drop out of its repertoire.
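The loss of diversity can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not how any real model is trained: the "model" here just memorizes the empirical frequencies of its training data, and each new generation is trained purely on samples drawn from the previous one. The function name `resample_generations` and all parameters are hypothetical, chosen for illustration.

```python
import random


def resample_generations(vocab_size=200, sample_size=200, generations=30, seed=0):
    """Track how many distinct items survive when each generation
    is trained only on samples produced by the previous generation."""
    rng = random.Random(seed)

    # Generation 0: "real" data in which every item appears exactly once.
    data = list(range(vocab_size))
    distinct_counts = [len(set(data))]

    for _ in range(generations):
        # "Training" is memorizing the empirical distribution of `data`;
        # "generating" is sampling from it with replacement.
        data = rng.choices(data, k=sample_size)
        distinct_counts.append(len(set(data)))

    return distinct_counts


counts = resample_generations()
print("distinct items at start:", counts[0])
print("distinct items at end:  ", counts[-1])
```

Because an item that fails to be sampled in one generation can never reappear in a later one, the number of distinct items can only stay the same or shrink. Real model collapse involves more than this, but the one-way loss of rare patterns is the same basic mechanism.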

Claude ChatGPT Gemini