Press the space bar to crack: vulnerability in the Prompt-Guard-86M system

Carding Forum · Jul 30, 2024

The model for protecting LLM from malicious requests failed its strength test.

Last week, Meta introduced a new machine learning model, Prompt-Guard-86M, designed to protect artificial intelligence from manipulation. However, cybersecurity experts have already discovered a serious vulnerability in it.

Prompt-Guard-86M was released simultaneously with the generative Llama 3.1 model. The developers created it as a tool to detect attempts to circumvent the limitations of language models using specially designed queries. Such attacks are called "prompt injections" and "jailbreaks". Their goal is to force the AI to ignore the built-in security rules and give out unwanted information.

Large language models (LLMs) are trained on huge volumes of texts and other data. Upon request, the AI reproduces this information, which can be dangerous if the material contains malicious content, questionable information, or personal data. Therefore, developers implement filtering mechanisms in their products to block unwanted requests and responses.

The problem of manipulating AI systems is well known in the industry. For example, in southern California, a chatbot at a Chevrolet dealership agreed to sell a $76,000 Tahoe SUV for just $1 due to a similar attack. A year ago, scientists from Carnegie Mellon University developed an automated method for generating malicious requests that can bypass security mechanisms.

One of the most common attack techniques begins with the words "Ignore previous instructions...". This is the phrase that Aman Priyanthu , a vulnerability researcher at Robust Intelligence, tried to use. He discovered the vulnerability by comparing the attachment weights of the Prompt-Guard-86M model from Meta and the microsoft/mdeberta-v3-base model from Microsoft.

It turned out that the Meta model is not able to recognize this phrase if you insert spaces between the letters and remove punctuation marks. According to the researcher, such a simple transformation completely deprives the classifier of the ability to identify potentially dangerous content.

Prompt-Guard-86M was created by additional training of the basic model to identify high-risk queries. However, Priyanthu found that this process had minimal impact on the recognition of individual characters of the English alphabet.

Hiram Anderson, CTO of Robust Intelligence, explained: "No matter what tricky question you want to ask, just remove the punctuation and add spaces between each letter. It's elementary, and it works. The success rate of attacks has increased from less than 3% to almost 100%."

In his opinion, Prompt-Guard is only the first line of defense, and the main AI model can still reject a malicious request. However, the purpose of making this vulnerability public is to raise companies awareness of the risks associated with the use of AI technologies.

Meta has not yet commented on the situation, but according to sources, the company is already working to fix the vulnerability in Prompt-Guard-86M.

Source

Press the space bar to crack: vulnerability in the Prompt-Guard-86M system

Carding Forum

Professional

Similar threads