AI censorship hacked: how ASCII art allows AI to generate instructions for creating bombs and fake money

Teacher · Mar 4, 2024

The ArtPrompt attack opens up an easy way to gain access to prohibited content in AI.

A recent study conducted by researchers from the Universities of Washington and Chicago demonstrated the vulnerability of modern language models of artificial intelligence to bypass built-in censorship using ASCII art. The researchers found that if you encrypt forbidden words and expressions in the form of images made of ASCII characters, the neural networks will interpret them as harmless and respond to requests containing these encrypted stop words.

This new type of attack is called ArtPrompt. Its essence is to convert forbidden terms to ASCII art and pass them to the language model along with the request. Scientists claim that the existing methods of protecting language models are based on semantic analysis of the text. In other words, the neural network can recognize and block invalid requests based on the words and semantic constructions they contain. However, if these forbidden elements are represented as ASCII images, the filtering system cannot recognize them, allowing you to bypass the restrictions.

During the experiments, the researchers tried to get instructions from the language models on how to make a bomb, but at first they were refused. However, when they replaced the word "bomb" with ASCII art consisting of asterisks and spaces, the request was successful. At the first stage of the attack, scientists disguised all forbidden words in the query as the word "mask". Then they generated an ASCII image of the forbidden word and sent it to the language model chat. After that, the researchers asked the model to replace "mask" in the query with a word from the image and answer the question. As a result, the neural network ignored all prohibitions and provided step-by-step instructions.

Similarly, the request for counterfeit money production was completed after the word "counterfeit" was presented in ASCII art form, which allowed the model to provide detailed instructions on how to manufacture and distribute counterfeit currency. In particular, she advised to purchase special equipment, study the protective elements on banknotes, and practice making fakes. The model also gave recommendations for exchanging fake money for real ones, warning of serious penalties for such actions.

An ASCII art word that the neural network analyzed

The GPT-3.5, GPT-4, Gemini, Claude, and Llama 2 language models were used in the experiments. All of them can read words encoded in ASCII graphics. ArtPrompt turned out to be more effective than the known ways to bypass filters.

AI censorship hacked: how ASCII art allows AI to generate instructions for creating bombs and fake money

Teacher

Professional

Similar threads