Despite safety measures and alignment protocols, the researchers found that the guardrails can be broken by exposing the models to a small amount of extra data containing harmful content. They used OpenAI’s GPT-3 as an example, reversing its alignment work to produce outputs advising illegal activities, hate speech, and explicit content.
The scholars introduced a method called “shadow alignment,” which involves collecting responses to illicit questions and then using those question-and-answer pairs to fine-tune the models for malicious outputs.
They tested this approach on several open-source language models, including Meta’s LLaMa, Technology Innovation Institute’s Falcon, Shanghai AI Laboratory’s InternLM, BaiChuan’s Baichuan, and Large Model Systems Organization’s Vicuna. The manipulated models maintained their overall abilities and, in some cases, demonstrated enhanced performance.
What do the researchers suggest?
The researchers suggested filtering training data for malicious content, developing more secure safeguarding techniques, and incorporating a “self-destruct” mechanism to prevent manipulated models from functioning.
The study raises concerns about the effectiveness of current safety measures and highlights the need for stronger safeguards in generative AI systems to prevent malicious exploitation.
It’s worth noting that the study focused on open-source models, but the researchers indicated that closed-source models might also be vulnerable to similar attacks. They tested the shadow alignment approach on OpenAI’s GPT-3.5 Turbo model through the API, achieving a high success rate in generating harmful outputs despite OpenAI’s data moderation efforts.
The findings underscore the importance of addressing security vulnerabilities in generative AI to mitigate potential harm.