Jailbreaking Censored/Aligned LLMs

Jailbreaking Aligned LLMs with ArtPrompt:

⚠️ On February 22, 2024, a research paper was released on ASCII art-based jailbreak attacks against aligned (aka censored) LLMs. The paper highlights the vulnerability of large language models (LLMs) to ASCII art-based attacks, challenging existing safety measures. It introduces ArtPrompt, a practical jailbreak attack that leverages LLMs' poor recognition of ASCII art. The authors evaluated ArtPrompt on five top LLMs and demonstrated its effectiveness in inducing undesired behaviors.

Let's do this!!

💡 NB: By the time you read this, commercial LLMs may have been patched and this method might no longer work! Still, we thought it would be fun for you to try the technique. We will look at two kinds of ASCII art attack, show how to generate ASCII art from text, and finally try it on both closed and open source models.

⚠️ Remember, this is for educational purposes only and we trust you to use it responsibly.

Method 1

This free ASCII art generator is one of the best out there, but feel free to use any online or command line tool you want, for example PyFiglet or FIGlet.

Description: Spice up your robots.txt file with creative ASCII art using our generator tool. Choose from a variety of fonts and add a unique touch to your website’s personality. ASCII art generator
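If you are curious what these generators do under the hood, here is a minimal sketch in plain Python. It assumes a tiny hand-made 5-row font (the `GLYPHS` table and the `render` helper are hypothetical, made up for this illustration, and are not taken from PyFiglet or FIGlet). Each character is mapped to a multi-row glyph, and the glyph rows are joined side by side. Real tools ship complete fonts and handle many more cases.

```python
# Hypothetical 5-row block font; it covers only the characters used in the demo.
GLYPHS = {
    "H": ["#   #", "#   #", "#####", "#   #", "#   #"],
    "I": ["#####", "  #  ", "  #  ", "  #  ", "#####"],
    " ": ["     "] * 5,
}

GLYPH_HEIGHT = 5


def render(text, gap=2):
    """Render text as ASCII art by joining glyph rows horizontally."""
    rows = []
    for r in range(GLYPH_HEIGHT):
        parts = []
        for ch in text.upper():
            if ch not in GLYPHS:
                raise ValueError(f"no glyph for {ch!r}")
            parts.append(GLYPHS[ch][r])
        # Separate glyphs with `gap` spaces and drop trailing whitespace.
        rows.append((" " * gap).join(parts).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(render("HI"))
```

Extending the sketch is a matter of adding more entries to `GLYPHS`; for anything beyond a toy, an existing tool with full font support is the better choice.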

💡 Since most LLMs, including commercial ones, are not that good at recognizing text in ASCII art, you might need to prompt several times to get this kind of ASCII art to work. However, the more capable the model, the higher the chance of getting it to work.

Example prompt with this kind of ASCII art.

As you can see, you have to be extra specific in directing the LLM to recognize the ASCII art. The example below is from the paper.

Page 15 of the paper

Method 2

This method is quite simple. You only need to follow the format below when preparing your prompt. Note how the characters are separated with |.

ChatGPT

This technique is credited to Daedalus

Example on Open Source Models

Try the above method on HuggingChat, a free open source alternative to ChatGPT.

Description: Making the community’s best AI chat models available to everyone. HuggingChat

Example:

Mistral 7B on HuggingChat with a direct prompt

Mistral 7B on HuggingChat with the ASCII art attack

⚠️ Remember: Use this for educational and research purposes only!

Questions? Feedback? Requests? Discord: Samej2023

mejbass/Jailbreaking-Censored-Aligned-LLMs
