ArtPrompt-an attack on AI systems that allows you to bypass filters using ASCII images

Teacher

Professional
Messages
2,672
Reputation
9
Reaction score
695
Points
113
A team of researchers from the Universities of Washington, Illinois, and Chicago has identified a new method for circumventing restrictions on processing dangerous content in AI chatbots built on the basis of large language models (LLMs). The attack is based on the fact that the GPT-3.5, GPT-4 (OpenAI), Gemini (Google), Claude (Anthropic), and Llama2 (Meta) language models successfully recognize and take into account ASCII text in queries. Thus, to bypass the filters of dangerous questions, it was enough to specify forbidden words in the form of an ASCII image.

2e08cd4347.png


2b267a88cb.png


In terms of its effectiveness, the new attack method significantly surpassed other known methods of bypassing filters in chatbots. The highest quality of ASCII graphics recognition is recorded in the Gemini, GPT-4 and GPT-3.5 models. The level of successful filter traversal by verification requests (HPR, Helpful Rate, request processing success rate) is estimated at 100%, 98% and 92% during testing, and the attack success rate (ASR, Attack Success Rate) in 76%, 32% and 76%, and the level of danger of the received answers (HS, Harmfulness Score) on a five-point scale in 4.42, 3.38 and 4.56 points, respectively.

7d0d1bba2a.png


2eeb8082b7.png


The researchers also demonstrated that the currently common methods of protection against filter circumvention (PPL, Paraphrase, and Retokenization) are ineffective in blocking the ArtPrompt attack. Moreover, using the Retokenization method even increased the number of successfully processed requests.

a1663f524f.png


Yandex GPT2 also perfectly answers questions with ASCII graphics. Example with the word "HELLO":

995922510c.png
 
Top