The AI landscape is rapidly evolving, with synthetic data emerging as a powerful tool for model development. While it offers immense potential, recent concerns about model collapse have sparked debate. Let’s dive into the reality of synthetic data use and its impact on AI development.
The Nature paper “AI models collapse when trained on recursively generated data” by Shumailov et al. raised important questions about the use of synthetic data:
- “We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear.” [1]
- “We argue that the process of model collapse is universal among generative models that recursively train on data generated by previous generations” [1]
However, it’s essential to note that this extreme scenario of recursive training on purely synthetic data is not representative of real-world AI development practices. The authors themselves acknowledge:
- “Here we explore what happens with language models when they are sequentially fine-tuned with data generated by other models… We evaluate the most common setting of training a language model — a fine-tuning setting for which each of the training cycles starts from a pre-trained model with recent data” [1]
- The study’s methodology does not account for the continuous influx of new, diverse data that characterizes real-world AI model training. This limitation may lead to an overestimation of model collapse in practical scenarios, where fresh data serves as a potential corrective mechanism against degradation.
- The experimental design, which discards data from previous generations, diverges from common practices in AI development that involve cumulative learning and sophisticated data curation. This approach may not accurately represent the knowledge retention and building processes typical in industry applications.
- The use of a single, static model architecture (OPT-125m) throughout the generations does not reflect the rapid evolution of AI architectures in practice. This simplification may exaggerate the observed model collapse by not accounting for how architectural advancements potentially mitigate such issues. In reality, the field has seen rapid progression (e.g., from GPT-3 to GPT-3.5 to GPT-4, or from Phi-1 to Phi-2 to Phi-3), with each iteration introducing significant improvements in model capacity, generalization capabilities, and emergent behaviors.
- While the paper acknowledges catastrophic forgetting, it does not incorporate standard mitigation techniques used in industry, such as elastic weight consolidation or experience replay (see the sketch after this list). This omission may amplify the observed model collapse effect and limit the study’s applicability to real-world scenarios.
- The approach to synthetic data generation and usage in the study lacks the quality control measures and integration practices commonly employed in industry. This methodological choice may lead to an overestimation of model collapse risks in practical applications where synthetic data is more carefully curated and combined with real-world data.
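To make the experience-replay point concrete, here is a minimal sketch of the idea (not the paper’s setup and not any specific vendor’s pipeline): a fixed fraction of every fine-tuning batch is drawn from a buffer of the original data, so the original distribution’s tails stay represented across generations. The function names and the 25% replay fraction are illustrative assumptions.

```python
import random

def build_training_batches(original_data, synthetic_data, batch_size=32,
                           replay_fraction=0.25, seed=0):
    """Mix a fixed fraction of original (real) examples into every batch of
    new synthetic examples, a simple form of experience replay that keeps the
    original distribution represented across fine-tuning generations."""
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_fraction)   # real examples per batch
    n_new = batch_size - n_replay                  # synthetic examples per batch
    batches = []
    for start in range(0, len(synthetic_data), n_new):
        new_part = synthetic_data[start:start + n_new]
        replay_part = rng.sample(original_data, min(n_replay, len(original_data)))
        batch = new_part + replay_part
        rng.shuffle(batch)
        batches.append(batch)
    return batches

# Toy usage: the strings stand in for tokenized training examples.
real = [f"real_{i}" for i in range(100)]
synthetic = [f"synthetic_{i}" for i in range(400)]
print(len(build_training_batches(real, synthetic)))  # number of mixed batches
```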
Supporting Quotes from the Paper
- “We also briefly mention two close concepts to model collapse from the existing literature: catastrophic forgetting arising in the framework of task-free continual learning and data poisoning maliciously leading to unintended behaviour” [1]
In practice, the goal of synthetic data is to augment and extend the existing datasets, including the implicit data baked into base models. When teams are fine-tuning or continuing pre-training, the objective is to provide additional data to improve the model’s robustness and performance.
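When that augmentation is done explicitly, it often comes down to controlling the mixing ratio between real and synthetic examples. Below is a minimal sketch using the Hugging Face `datasets` library; the toy corpora and the 80/20 ratio are illustrative assumptions, not a recommendation from any of the cited papers.

```python
from datasets import Dataset, interleave_datasets

# Stand-ins for a curated real-world corpus and a synthetic corpus.
real_ds = Dataset.from_dict({"text": [f"real example {i}" for i in range(1000)]})
synth_ds = Dataset.from_dict({"text": [f"synthetic example {i}" for i in range(1000)]})

# Interleave so roughly 80% of training examples stay real and 20% are synthetic;
# the synthetic portion extends coverage without displacing the original distribution.
mixed = interleave_datasets(
    [real_ds, synth_ds],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="first_exhausted",
)
print(len(mixed), mixed[0])
```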
The paper “Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data” by Gerstgrasser et al., researchers from Stanford, MIT, and Constellation, presents significant counterpoints to concerns about AI model collapse:
Their experiments show that when synthetic data accumulates alongside real-world data, rather than replacing it, the degradation seen in the purely recursive setting does not appear.
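A toy numerical experiment makes the accumulate-versus-replace distinction easy to see. The sketch below is a deliberate simplification (a 1-D Gaussian standing in for a language model, which neither paper uses): each generation fits the distribution to its training pool and then samples from the fit. Training only on the previous generation’s samples tends to shrink the fitted spread, while accumulating all data alongside the original keeps it roughly stable.

```python
import numpy as np

def simulate(generations=50, n_samples=20, accumulate=True, seed=0):
    """Fit a Gaussian to the current training pool, sample 'synthetic' data from
    the fit, and repeat. accumulate=False mimics the recursive setting (each
    generation trains only on the previous generation's samples); accumulate=True
    keeps the original data plus every earlier generation's samples in the pool."""
    rng = np.random.default_rng(seed)
    pool = rng.normal(0.0, 1.0, n_samples)          # the original "real" data
    sigma = pool.std()
    for _ in range(generations):
        mu, sigma = pool.mean(), pool.std()          # "train" this generation's model
        samples = rng.normal(mu, sigma, n_samples)   # its synthetic output
        pool = np.concatenate([pool, samples]) if accumulate else samples
    return sigma                                     # spread retained by the final generation

print("replace only:", round(simulate(accumulate=False), 3))  # typically collapses toward 0
print("accumulate:  ", round(simulate(accumulate=True), 3))   # typically stays near the true std of 1
```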
Quality Over Quantity
As highlighted in Microsoft’s Phi-3 technical report:
- “The creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data.” [3]
This emphasizes the importance of thoughtful synthetic data generation rather than indiscriminate use.
And Apple, describing the training of their new on-device and server foundation models:
- “We find that data quality is essential to model success, so we utilize a hybrid data strategy in our training pipeline, incorporating both human-annotated and synthetic data, and conduct thorough data curation and filtering procedures.” [10]
Again, the emphasis falls on careful curation and filtering of synthetic data rather than indiscriminate use.
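Neither company publishes its filtering code, but the shape of a “curation and filtering” pass can be sketched in a few lines. The thresholds, deduplication key, and the quality_score field below are illustrative assumptions (for example, a score from an LLM judge or reward model), not Microsoft’s or Apple’s pipeline.

```python
def curate(synthetic_records, min_words=20, max_words=500, min_quality=0.7):
    """Toy curation pass over synthetic records shaped like
    {"text": str, "quality_score": float}: exact-deduplicate, drop out-of-range
    lengths, and keep only records above an (assumed) quality threshold."""
    seen = set()
    kept = []
    for record in synthetic_records:
        text = record["text"].strip()
        key = " ".join(text.lower().split())      # normalize whitespace/case for exact dedup
        n_words = len(text.split())
        if key in seen:
            continue
        if not (min_words <= n_words <= max_words):
            continue
        if record.get("quality_score", 0.0) < min_quality:
            continue
        seen.add(key)
        kept.append(record)
    return kept

sample = [
    {"text": "A long, well-formed synthetic explanation " * 10, "quality_score": 0.9},
    {"text": "too short", "quality_score": 0.95},
    {"text": "A long, well-formed synthetic explanation " * 10, "quality_score": 0.9},  # duplicate
]
print(len(curate(sample)))  # -> 1
```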
Iterative Improvement, Not Recursive Training
As highlighted by Gretel Navigator, NVIDIA’s Nemotron, and the AgentInstruct architecture, cutting-edge synthetic data is generated by agents that iteratively simulate, evaluate, and improve outputs, not by models simply training recursively on their own raw output. Below is an example of the synthetic data generation architecture used in AgentInstruct.
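Independent of that specific architecture, the basic control flow of such an agentic pipeline (generate, critique, refine, repeat) can be sketched in a few lines. In the sketch below, call_llm and score are hypothetical stand-ins for a model endpoint and a critic, and the prompts and acceptance threshold are assumptions rather than AgentInstruct’s actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a model call (OpenAI, local endpoint, etc.).
    Replace with a real client; here it just echoes so the sketch runs end to end."""
    return f"[model output for: {prompt[:60]}...]"

def score(candidate: str, requirement: str) -> float:
    """Hypothetical critic. In practice this would be an LLM judge or a
    rubric-based evaluator; here it returns a dummy score for illustration."""
    return 1.0 if requirement.lower() in candidate.lower() else 0.5

def agentic_generate(seed_document: str, requirement: str,
                     max_rounds: int = 3, threshold: float = 0.8) -> str:
    """Draft a synthetic training example from a seed document, then iterate:
    critique the draft against a requirement and rewrite it until the critic is
    satisfied or the round budget runs out. Only curated, refined examples leave
    this loop; the generator never trains on its own raw output."""
    draft = call_llm(f"Write a challenging instruction/answer pair based on:\n{seed_document}")
    for _ in range(max_rounds):
        if score(draft, requirement) >= threshold:
            break
        critique = call_llm(f"Critique this example against the requirement '{requirement}':\n{draft}")
        draft = call_llm(f"Rewrite the example to address this critique:\n{critique}\n\nOriginal:\n{draft}")
    return draft

print(agentic_generate("A short passage about photosynthesis.", "multi-step reasoning"))
```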
Here are some example results from recent synthetic data releases:
Synthetic data is driving significant advancements across various industries:
Healthcare: Rhys Parker, Chief Clinical Officer at SA Health, stated:
“Our synthetic data approach with Gretel has transformed how we handle sensitive patient information. Data requests that previously took months or years are now achievable in days. This isn’t just a technological advance; it’s a fundamental shift in managing health data that significantly improves patient care while ensuring privacy. We predict synthetic data will become routine in medical research within the next few years, opening new frontiers in healthcare innovation.” [9]
Mathematical Reasoning: DeepMind’s AlphaProof and AlphaGeometry 2 systems achieved a silver-medal standard at the International Mathematical Olympiad by solving complex mathematical problems.
“AlphaGeometry 2, based on Gemini and trained with an order of magnitude more data than its predecessor,” demonstrates the power of synthetic data in enhancing AI capabilities in specialized fields [5].
Life Sciences Research: Nvidia’s research team reported:
“Synthetic data also provides an ethical alternative to using sensitive patient data, which helps with education and training without compromising patient privacy” [4]
One of the most powerful aspects of synthetic data is its potential to level the playing field in AI development.
Empowering Data-Poor Industries: Synthetic data allows industries with limited access to large datasets to compete in AI development. This is particularly crucial for sectors where data collection is challenging due to privacy concerns or resource limitations.
Customization at Scale: Even large tech companies are leveraging synthetic data for customization. Microsoft’s research on the Phi-3 model demonstrates how synthetic data can be used to create highly specialized models:
“We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.” [3]
Tailored Learning for AI Models: Andrej Karpathy, former Director of AI at Tesla, suggests a future where we create custom “textbooks” for teaching language models.
Scaling Up with Synthetic Data: Jim Fan, an AI researcher, highlights the potential of synthetic data to provide the next frontier of training data.
Fan also points out that embodied agents, such as robots like Tesla’s Optimus, could be a significant source of synthetic data if simulated at scale.
Cost Savings and Resource Efficiency:
The Hugging Face blog shows that fine-tuning a custom small language model on synthetic data costs around $2.70, compared to $3,061 using GPT-4 on real-world data, while emitting significantly less CO2 and offering faster inference speeds.
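As a rough back-of-the-envelope framing of why those numbers diverge (all inputs below are illustrative assumptions, not figures from the Hugging Face post), per-request API pricing scales with the size of the workload, while a fine-tuned small model mostly costs a fixed amount of GPU time:

```python
# Back-of-the-envelope comparison; every input is an illustrative assumption.
n_requests = 1_000_000                 # workload size (assumed)
tokens_per_request = 1_000             # prompt + completion tokens (assumed)
api_price_per_1k_tokens = 0.03         # hosted large-model price per 1k tokens (assumed)
gpu_hourly_rate = 1.50                 # cloud GPU price for a small model (assumed)
finetune_hours = 2                     # fine-tuning time on synthetic data (assumed)
serving_hours = 50                     # time to process the same workload locally (assumed)

hosted_cost = n_requests * tokens_per_request / 1_000 * api_price_per_1k_tokens
small_model_cost = (finetune_hours + serving_hours) * gpu_hourly_rate

print(f"hosted large model:     ${hosted_cost:,.0f}")     # grows with every request
print(f"fine-tuned small model: ${small_model_cost:,.0f}")  # mostly fixed GPU time
```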
Here’s a nice visualization from Hugging Face that shows the benefits across use cases:
While the potential risks of model collapse should not be ignored, the real-world applications and benefits of synthetic data are too significant to dismiss. As the field advances, a balanced approach that combines synthetic data with rigorous real-world validation and thoughtful generation practices will be key to maximizing its potential.
Synthetic data, when used responsibly and in conjunction with real-world data, has the potential to dramatically accelerate AI development across all sectors. It’s not about replacing real data, but augmenting and extending our capabilities in ways we’re only beginning to explore. By enhancing datasets with synthetic data, we can fill critical data gaps, address biases, and create more robust models.
By leveraging synthetic data responsibly, we can democratize AI development, drive innovation in data-poor sectors, and push the boundaries of what’s possible in machine learning — all while maintaining the integrity and reliability of our AI systems.
References
- Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493.
- Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Sleight, H., Hughes, J., … & Zhang, C. (2024). Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. arXiv preprint arXiv:2404.01413.
- Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., & Lee, Y. T. (2023). Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.
- Nvidia Research Team. (2024). Addressing Medical Imaging Limitations with Synthetic Data Generation. Nvidia Blog.
- DeepMind Blog. (2024). AI achieves silver-medal standard solving International Mathematical Olympiad problems. DeepMind.
- Hugging Face Blog on Synthetic Data. (2024). Synthetic data: save money, time and carbon with open source. Hugging Face.
- Karpathy, A. (2024). Custom Textbooks for Language Models. Twitter.
- Fan, J. (2024). Synthetic Data and the Future of AI Training. Twitter.
- South Australian Health. (2024). South Australian Health Partners with Gretel to Pioneer State-Wide Synthetic Data Initiative for Safe EHR Data Sharing. Microsoft for Startups Blog.
- Apple Machine Learning Research. (2024). Introducing Apple’s On-Device and Server Foundation Models. https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Mitra, A., et al. (2024). AgentInstruct: Toward Generative Teaching with Agentic Flows. arXiv preprint arXiv:2407.03502.
- Gerstgrasser, M. (2024). Comment on LinkedIn post by Yev Meyer, Ph.D. LinkedIn. https://www.linkedin.com/feed/update/urn:li:activity:7223028230444785664