Large language models (LLMs) face a critical challenge in their training process: the impending scarcity of high-quality internet data. Predictions suggest that by 2026, the available pool of such data will be exhausted, forcing researchers to turn to model-generated or synthetic data for training. This shift presents both opportunities and risks. While some studies have shown that scaling up synthetic data can improve performance on complex reasoning tasks, others have revealed a concerning trend. Training on synthetic data can potentially lead to a downward spiral in model performance, amplifying biases, propagating misinformation, and reinforcing undesired stylistic properties. The core challenge lies in designing synthetic data that effectively addresses data scarcity without compromising the quality and integrity of the resulting models. This task is particularly daunting due to the current lack of understanding regarding how synthetic data influences LLM behavior.
Researchers have explored various approaches to tackle LLM training challenges using synthetic data. Standard methods like teacher-forcing on expert data have shown limitations, particularly in math reasoning. Efforts to generate positive synthetic data aim to mimic high-quality training data, using sources like stronger teacher models and self-generated content. While this approach has shown promise, challenges persist in verifying the quality of synthetic math data. Concerns about bias amplification, model collapse, and overfitting on spurious steps remain. To mitigate these issues, researchers are investigating the use of negative model-generated responses to identify and unlearn problematic patterns in training data.
Researchers from Carnegie Mellon University, Google DeepMind, and MultiOn present a study investigating the impact of synthetic data on LLM math reasoning capabilities. It examines both positive and negative synthetic data, finding that positive data improves performance but with slower scaling rates than pretraining. Notably, self-generated positive responses often match the effectiveness of twice the amount of data from larger models. The authors introduce a robust approach using negative synthetic data, contrasting it with positive data at critical steps. This technique, equivalent to per-step advantage-weighted reinforcement learning, can scale data efficiency by up to eight times compared to using only positive data. The study develops scaling laws for both data types on common reasoning benchmarks, offering valuable insights into optimizing synthetic data use for enhancing LLM performance in math reasoning tasks.
The detailed architecture of the proposed method involves several key components:
- Synthetic Data Pipeline:
  - Prompts capable models like GPT-4 and Gemini 1.5 Pro to generate new problems similar to real ones.
  - Obtains solution traces with step-by-step reasoning for these problems.
  - Implements a binary reward function to verify the correctness of solution traces.
- Dataset Construction:
  - Creates a positive synthetic dataset from correct problem-solution pairs.
  - Generates positive and negative datasets using model-generated solutions.
- Learning Algorithms:
  - Supervised Finetuning (SFT):
    - Trains on 𝒟syn using next-token prediction.
  - Rejection Finetuning (RFT):
    - Uses the SFT policy to generate positive responses for 𝒟syn problems.
    - Applies next-token prediction loss on these self-generated positive responses.
  - Preference Optimization:
    - Utilizes Direct Preference Optimization (DPO) to learn from both positive and negative data.
    - Implements two variants: standard DPO and per-step DPO.
    - Per-step DPO identifies the “first pit” in solution traces to focus on critical steps.
This architecture allows for comprehensive analysis of different synthetic data types and learning approaches, enabling the study of their impact on LLM math reasoning capabilities.
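The pipeline steps above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' implementation: it assumes the final answer of a solution trace appears after an "Answer:" marker, and the function and variable names are hypothetical.

```python
# Illustrative sketch of the synthetic data pipeline: a binary reward
# checks whether a generated trace's final answer matches the reference
# answer, and traces are split into positive and negative datasets.

def binary_reward(solution_trace: str, reference_answer: str) -> int:
    """Return 1 if the trace's final answer matches the reference, else 0."""
    # Assumption: the final answer follows an "Answer:" marker in the trace.
    final = solution_trace.rsplit("Answer:", 1)[-1].strip()
    return int(final == reference_answer.strip())

def build_datasets(samples):
    """Split (problem, trace, reference) triples into positive/negative sets."""
    positive, negative = [], []
    for problem, trace, reference in samples:
        if binary_reward(trace, reference) == 1:
            positive.append((problem, trace))
        else:
            negative.append((problem, trace))
    return positive, negative
```

The positive set feeds SFT and RFT, while the pairing of positive and negative traces for the same problem feeds the preference-optimization objectives.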
The study reveals significant insights into synthetic data scaling for LLM math reasoning. Positive data scaling shows improvement but with slower rates than pre-training. Surprisingly, self-generated positive data (RFT) outperforms data from more capable models, doubling efficiency. The most striking result comes from strategically using negative data with per-step Direct Preference Optimization, which increases data efficiency by 8x compared to positive data alone. This approach consistently outperforms other methods, highlighting the critical importance of carefully constructing and utilizing both positive and negative synthetic data in LLM training for mathematical reasoning tasks.
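The “first pit” idea behind per-step DPO can be sketched as follows: estimate, for each step of a trace, the probability that completing the solution from that step yields a correct answer (e.g. via rollouts of the SFT policy), then flag the first step where that estimate collapses. The drop threshold and the rollout-based estimation are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch of "first pit" detection: given per-step estimates of the
# probability that a completion from that step reaches a correct answer,
# return the index of the first step where the estimate drops sharply.
# The threshold value is an assumption, not taken from the paper.

from typing import List, Optional

def first_pit(q_values: List[float], drop_threshold: float = 0.5) -> Optional[int]:
    """Index of the first step whose estimated success probability falls
    by more than `drop_threshold` relative to the previous step, else None."""
    for i in range(1, len(q_values)):
        if q_values[i - 1] - q_values[i] > drop_threshold:
            return i
    return None
```

Focusing the preference objective on steps at and after this index lets negative data penalize the specific spurious step rather than the whole trace.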
This study explores the impact of synthetic data on improving LLMs’ math reasoning capabilities. It reveals that traditional methods using positive solutions from advanced models show limited efficiency. Self-generated positive data from fine-tuned 7B models improves efficiency by 2x but can amplify reliance on spurious steps. Surprisingly, incorporating negative (incorrect) traces addresses these limitations. By using negative data to estimate step-wise advantages and applying reinforcement learning techniques, the research demonstrates an 8x improvement in synthetic data efficiency. This approach, utilizing preference optimization objectives, significantly enhances LLMs’ mathematical reasoning abilities by effectively balancing positive and negative synthetic data.
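The preference-optimization objective referenced above can be illustrated with the standard DPO loss on one positive/negative trace pair: the policy is trained to widen its log-probability margin on the correct trace relative to a frozen reference model. This is the generic DPO formulation, shown as a sketch rather than the paper's exact training code.

```python
# Standard DPO loss for a single (positive, negative) pair:
#   -log sigmoid(beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)))
# where logp_* are sequence log-probabilities under the policy and ref_logp_*
# under the frozen reference model. beta is the usual DPO temperature.

import math

def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss; smaller when the policy prefers the positive trace."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the per-step variant described above, the same objective would be applied to trace segments selected around the critical (“first pit”) steps rather than to whole traces.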
Check out the Paper. All credit for this research goes to the researchers of this project.