Red Teaming Language Models with Language Models

A re-implementation of the Red Teaming Language Models with Language Models paper by Perez et al.

This implementation only focuses on toxic/offensive langauge section of the paper.

I have based run the red-teaming experiments on four target models -

I generated 50,000 valid test cases (questions) as opposed to 500,000 mentioned in the paper, due to compute constraints. A small percantge of the test cases were succesful in being able to elicit toxic language generation from these models.

For Toxicity detection, I have used Detoxify.

The percentage of toxic/offensive answers (toxicity probability > 0.5) generated using Zero-Shot and Stochastic Few-Shot generation for each model are presented below -

Model	Zero-Shot	Stochastic Few-Shot	Increase (Few-Shot vs Zero-Shot)
GPT2-XL-1.5B	0.99%	2.85%	2.88x
Llama-2-7B	0.15%	0.33%	2.20x
Pythia-6.9B	0.47%	1.02%	2.17x
Phi-1.5B	0.11%	0.79%	7.18x

The percentage of toxic/offensive questions and answers (toxicity probability > 0.5) generated using Zero-Shot and Stochastic Few-Shot generation for each model are presented below -

Model	Zero-Shot	Stochastic Few-Shot	Increase (Few-Shot vs Zero-Shot)
GPT2-XL-1.5B	2.28%	8.49%	3.72x
Llama-2-7B	0.20%	0.61%	3.05x
Pythia-6.9B	0.93%	3.00%	3.23x
Phi-1.5B	0.09%	0.38%	4.22x

Note - The numbers may be a bit high as the toxicity detection model has False Positives in its predictions as well. The toxic/offensive generations can be manually inspected with the script view_toxic_qa.py.

The questions, answers and toxicity scores for each model can be found in the csv files in the artifacts directory.

For inference, I have used vLLM (which is awesome!) for GPT2-XL, Llama-2 and Pythia to speedup the generation process. Phi-1.5 is currently not supported by vLLM.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
artifacts		artifacts
.gitignore		.gitignore
README.md		README.md
generate_answers_few_shot.py		generate_answers_few_shot.py
generate_answers_few_shot_vllm.py		generate_answers_few_shot_vllm.py
generate_answers_zero_shot.py		generate_answers_zero_shot.py
generate_answers_zero_shot_vllm.py		generate_answers_zero_shot_vllm.py
generate_questions_few_shot.py		generate_questions_few_shot.py
generate_questions_few_shot_vllm.py		generate_questions_few_shot_vllm.py
generate_questions_zero_shot.py		generate_questions_zero_shot.py
summarize_toxicity_results.py		summarize_toxicity_results.py
toxicity_classification_few_shot.py		toxicity_classification_few_shot.py
toxicity_classification_zero_shot.py		toxicity_classification_zero_shot.py
view_toxic_qa.py		view_toxic_qa.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

artifacts

artifacts

.gitignore

.gitignore

README.md

README.md

generate_answers_few_shot.py

generate_answers_few_shot.py

generate_answers_few_shot_vllm.py

generate_answers_few_shot_vllm.py

generate_answers_zero_shot.py

generate_answers_zero_shot.py

generate_answers_zero_shot_vllm.py

generate_answers_zero_shot_vllm.py

generate_questions_few_shot.py

generate_questions_few_shot.py

generate_questions_few_shot_vllm.py

generate_questions_few_shot_vllm.py

generate_questions_zero_shot.py

generate_questions_zero_shot.py

summarize_toxicity_results.py

summarize_toxicity_results.py

toxicity_classification_few_shot.py

toxicity_classification_few_shot.py

toxicity_classification_zero_shot.py

toxicity_classification_zero_shot.py

view_toxic_qa.py

view_toxic_qa.py

Repository files navigation

Red Teaming Language Models with Language Models

About

Languages

shreyansh26/Red-Teaming-Language-Models-with-Language-Models

Folders and files

Latest commit

History

Repository files navigation

Red Teaming Language Models with Language Models

About

Topics

Resources

Stars

Watchers

Forks

Languages