"Deceptive joy": a new method of deceiving AI is successful in 65% of cases

Man

Professional
Messages
3,077
Reaction score
614
Points
113
Context manipulation can crack the most persistent models.

Palo Alto Networks has discovered a new technique for bypassing the defense mechanisms of large language models called "Deceptive Delight". This method allows you to manipulate AI systems, pushing them to generate potentially dangerous content. The study covered nearly 8,000 test cases on eight different models, both proprietary and open, allowing the researchers to assess how vulnerable current language models are to multi-way attacks.

The way Deceptive Delight works is based on a combination of safe and unsafe topics in a positive context. This strategy allows language models to handle requests with potentially dangerous elements without recognizing them as a threat. According to the study, the technique achieves 65% success in just three iterations of interaction with the target model, making it one of the most effective among known methods of bypassing defense systems.

Before the attack began, researchers selected topics to create a "safe" context. In this phase, called "Preparation," two categories of topics were chosen: safe, such as descriptions of wedding ceremonies, graduations, or award celebrations, and one unsafe topic, such as instructions on how to build an explosive device or a threatening message. These safe themes tend not to alert language models and do not activate defense mechanisms.

The attack process involved three sequential interactions:

1. First stage. The researchers selected one unsafe and two safe topics. Experiments have shown that adding more safe topics does not improve the results, so the combination of "one unsafe, two safe" turned out to be optimal. Next, a request was formed that asked the model to create a coherent text that combines all three topics.

2. Second stage. After receiving the initial text from the model, the researchers asked it to expand on each of the topics in order to get more detailed answers. At this stage, it was not uncommon for the model to accidentally start generating unsafe content while discussing safe topics. This is where the key feature of Deceptive Delight came in: the use of a multi-way approach to bypass the model's filters.

3. Third stage (optional). The researchers asked the model for even more disclosure of the topic, specifically on the unsafe topic, which made the content more relevant and detailed in relation to the unsafe aspect. This step, although optional, often increased the effect of the attack, increasing both the severity of the response and the detail of the responses.

lx9mk5mu450qig9s8cv82pze3l3g064o.png


The researchers tested the technique on eight modern language models, disabling standard content filters to test the built-in protection mechanisms. All tested models were anonymized to avoid creating a false impression of the security of individual developers or companies. For each model, requests were made on 40 unsafe topics, grouped into six categories: hate, harassment, self-harm, sexual content, violence and dangerous actions.

The results showed a significant difference in the effectiveness of attacks on different models. When sending unsafe requests directly without using the "Deceptive Delight" technique, the average success rate of attacks was only 5.8%. However, with the application of the new methodology, the success rate increased to 64.6%, which confirms the high efficiency of the multi-step approach in comparison with traditional methods.

In addition, the researchers found that different categories of unsafe content have varying degrees of success in bypassing security systems. For example, topics related to violence showed the highest percentage of successful attacks, while queries for topics of a sexual nature and hate caused lower rates. These differences demonstrate that certain categories of content are handled with greater caution by models.

The effectiveness of an attack also depends on the number of iterations of interaction. The best results are achieved at the third step of the dialogue, when the model has already begun to produce more detailed unsafe content. In the fourth iteration, efficiency decreases, since by this point the model often generates a significant amount of unsafe content, which increases the likelihood of activation of defense mechanisms.

To assess the success of the attack, two key metrics were used: the Harmfulness Score and the Quality Score, both measured on a scale from 0 to 5. An attack was considered successful if both scores reached or exceeded a value of 3. According to the data, the average malicious content score increased by 21% between the second and third steps of interaction, and the quality of responses increased by 33%. This increase in metrics shows that it is the third step of interaction that is decisive for increasing the effectiveness of the attack.

The most successful approach was to use one insecure theme in combination with two safe themes. Adding more safe topics not only did not improve results, but sometimes reduced the effectiveness of attacks, as it became easier for models to get "confused" in unnecessary elements of context.

To protect against such attacks, Palo Alto Networks specialists recommend a multi-layered approach, including content filtering, thoughtful prompt engineering, and regular system testing. Popular content filtering solutions include OpenAI Moderation, Azure AI Services Content Filtering, GCP Vertex AI, AWS Bedrock Guardrails, Meta Llama-Guard, and Nvidia NeMo-Guardrails.

Certain categories, such as "dangerous acts", "self-harm" and "violence", consistently showed the highest attack success rates, while the topics "sexual content" and "hate speech" gave the lowest ratings. Researchers believe this is because most models have stricter restrictions on certain topics, especially those related to sexual content and hate.

Experts also emphasize the importance of thoughtful query design, including clear instructions, boundaries of application, and a structured approach. Enabling multiple security rule reminders and a clear context can reduce the likelihood of a successful bypass.

Despite the vulnerabilities identified, the researchers note that these results should not be seen as evidence of the insecurity of AI in general. On the contrary, the discovery highlights the need for continuous improvement of protective mechanisms while maintaining the functionality and flexibility of models. Recommendations for improvement include improving AI training methods, developing new protection mechanisms, and creating comprehensive vulnerability assessment systems.

Palo Alto Networks shared the results of the study with members of the Cyber Threat Alliance to quickly implement protective measures and systematically counter cyber threats. The CTA uses the data to quickly implement protective mechanisms and counter attackers.

Source
 
Top