mejbass/Jailbreaking-Censored-Aligned-LLMs

Jailbreaking Censored/Aligned LLMs

Jailbreaking Aligned LLMs with ArtPrompt:

Image-1792x1024

⚠️ On February 22, 2024, a research paper was released on ASCII art based jailbreak attacks against aligned (aka censored) LLMs. The paper highlights the vulnerability of large language models (LLMs) to ASCII art-based attacks, challenging existing safety measures. It introduces ArtPrompt, a practical jailbreak attack that leverages LLMs' poor recognition of ASCII art. The paper evaluates five leading LLMs and demonstrates ArtPrompt's effectiveness in inducing undesired behaviors.

Let's do this!

💡 NB: By the time you read this, commercial LLMs may have been patched and this method might no longer work! We still thought it would be fun for you to try the technique. We will look at two kinds of ASCII art attack, see how to generate ASCII art from text, and finally try it on both closed and open source models.

⚠️ Remember, this is for educational purposes only and we trust you to use it responsibly.

Method 1

This free ASCII art generator is one of the best out there, but feel free to use any online or command line tool you want, for example PyFiglet or FIGlet.

Description: Spice up your robots.txt file with creative ASCII art using our generator tool. Choose from a variety of fonts and add a unique touch to your website’s personality. ASCII art generator
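If you would rather see how text becomes ASCII art without an online tool, here is a minimal sketch in plain Python. The `FONT` table and the `render` helper are hypothetical names made up for this illustration, and the tiny font only covers the characters used in the demo; real tools such as PyFiglet ship complete FIGlet fonts.

```python
# Minimal text-to-ASCII-art sketch using only the standard library.
# FONT is a hypothetical 5-row block font covering just a few characters;
# it is an illustration, not the font used by any real generator.
FONT = {
    "H": ["#   #", "#   #", "#####", "#   #", "#   #"],
    "I": ["#####", "  #  ", "  #  ", "  #  ", "#####"],
    " ": ["     "] * 5,
}

ROWS = 5  # every glyph in FONT is 5 rows tall


def render(text: str, gap: str = "  ") -> str:
    """Render text as block letters, one output line per glyph row."""
    lines = []
    for row in range(ROWS):
        # Join the same row of every glyph side by side.
        # Characters missing from FONT raise KeyError.
        line = gap.join(FONT[ch][row] for ch in text.upper())
        lines.append(line.rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("hi"))
```

With PyFiglet installed, `pyfiglet.figlet_format("HI")` should give a similar result with a full font set.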

image-16

💡 Because most LLMs, including commercial ones, are not that good at recognizing text in ASCII art, you might need to prompt several times to get this kind of ASCII art to work. However, the more capable the model, the higher the chance of getting it to work.

Example prompt with this kind of ASCII art.

As you can see, you have to be extra specific in directing the LLM to recognize the ASCII art. Below is an example from the paper.

image-17

Page 15 of the paper

Method 2

This method is quite simple. You only need to follow the format below when preparing your prompt. Note how the characters are separated with |

image-18 ChatGPT

This technique is credited to Daedalus

Example on Open Source Models

Try the method above on HuggingChat, a free, open source alternative to ChatGPT.

Description: Making the community’s best AI chat models available to everyone. HuggingChat

Example:

image-22

  • Mistral 7B : With Direct Prompt

image-20

  • Mistral 7B on Hugging Chat with ASCII Art Attack

⚠️ Remember: Use this for educational and research purposes only!

Questions? Feedback? Requests? Discord: Samej2023