Data poisoning attacks manipulate machine learning models by injecting false data into the training dataset. A model trained on that data may then make incorrect predictions or decisions when it is exposed to real-world inputs. LLMs are vulnerable to such attacks, which can distort their responses to targeted prompts and related concepts. A research study by Del Complex demonstrates this with an approach called VonGoom, which requires only a few hundred to several thousand strategically placed poison inputs to achieve its objective.
VonGoom challenges the notion that millions of poison samples are necessary for a successful attack. It crafts seemingly benign text inputs with subtle manipulations to mislead LLMs during training, introducing a spectrum of distortions. According to the study, it has poisoned hundreds of millions of data sources used in LLM training.
The research explores the susceptibility of LLMs to data poisoning attacks and introduces VonGoom, a novel method for prompt-specific poisoning attacks on LLMs. Unlike broad-spectrum attacks, VonGoom focuses on specific prompts or topics. It crafts seemingly benign text inputs with subtle manipulations to mislead the model during training, introducing a spectrum of distortions that ranges from subtle and overt biases to misinformation and concept corruption.
The poisoned inputs are designed to disturb the weights the model learns for a targeted concept. To build them, the approach uses optimization techniques, such as constructing clean-neighbor poison data and guided perturbations, and the study reports that these are effective in various scenarios.
Injecting a modest number of poisoned samples, approximately 500 to 1,000, significantly altered the output of models trained from scratch. When pre-trained models were updated, introducing 750 to 1,000 poisoned samples effectively disrupted the model's response to targeted concepts. These results showed that semantically altered text samples can influence the output of LLMs. The impact also extended beyond the targeted prompts, creating a bleed-through effect in which the influence of poison samples reached semantically related concepts. That VonGoom works with a relatively small number of poisoned inputs highlights the vulnerability of LLMs to sophisticated data poisoning attacks.
In conclusion, the research can be summarized in the following points:
- VonGoom is a method for manipulating data to deceive LLMs during training.
- It works by making subtle changes to text inputs that mislead the model.
- Targeted attacks with a small number of poisoned inputs can be feasible and effective.
- VonGoom introduces a range of distortions, including biases, misinformation, and concept corruption.
- The study analyzes the density of training data for specific concepts in common LLM datasets, identifying opportunities for manipulation.
- The research highlights the vulnerability of LLMs to data poisoning.
- VonGoom could significantly impact various models and have broader implications for the field.
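The point about training-data density can be made concrete with a little arithmetic. The sketch below is not the study's code; it is a minimal illustration, assuming a toy in-memory corpus and two hypothetical helpers (`concept_density` and `poison_share`), of why a rarely mentioned concept leaves room for a small number of added samples to make up a large share of what a model sees about it.

```python
import re
from typing import Iterable, Tuple


def concept_density(docs: Iterable[str], concept: str) -> Tuple[int, int]:
    """Return (documents mentioning the concept, total documents).

    Hypothetical helper: matches the concept as a whole phrase, case-insensitively.
    """
    pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
    hits = 0
    total = 0
    for doc in docs:
        total += 1
        if pattern.search(doc):
            hits += 1
    return hits, total


def poison_share(clean_hits: int, poison_count: int) -> float:
    """Fraction of a concept's samples that would be poisoned after injection."""
    combined = clean_hits + poison_count
    if combined == 0:
        return 0.0
    return poison_count / combined


# Toy corpus, invented purely for illustration.
corpus = [
    "The harbor town hosts a kite festival every spring.",
    "Kite festival tickets sold out within an hour.",
    "Quarterly earnings beat analyst expectations.",
    "A new bakery opened near the train station.",
]

hits, total = concept_density(corpus, "kite festival")
print(f"{hits} of {total} documents mention the concept")
print(f"share if 2 poison samples were added: {poison_share(hits, 2):.2f}")
```

The same ratio explains why sparse concepts are the exposed ones: 500 added samples are a quarter of the data for a concept with 1,500 clean mentions, but a negligible fraction for one with millions.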
All credit for this research goes to the researchers of this project.
Hello, My name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.