Defend against indirect prompt injection attacks

In our work, we propose two kinds of defenses to decrease the Attack Success Rates (ASRs) of LLMs towards indirect prompt injection attacks. We provide the steps and code to reproduce the results in our paper.

Black-box defenses

Black-box defenses are a set of defenses based on meta-prompting that does not need access the weights of large language models (LLMs).

Border Strings

We examine three prevalent types of border strings: equal signs, hyphens, and backticks, to create a more distinct separation between data and instructions.

Here is the command to collect the responses of LLMs with border strings.

cd defense/black_box
python few_shot.py --bipia_seed 2023 --fewshot_seed 2023 --dataset_name {task} \
--train_context_data_file path/of/task/train/external/content/file \
--train_attack_data_file path/of/task/train/attack/file \
--test_context_data_file path/of/task/test/external/content/file \
--test_attack_data_file path/of/task/test/attack/file \
--llm_config_file config/{llm_name}.yaml \
--batch_size 20 --output_path path/of/output/file \
--log_steps 10 --resume --border_type {border_type} --num_examples 0

Arguments:

task: the selected task name, you can choose anyone from ["code", "email", "qa", "abstract", "table"]
llm_name: the model you want to test.
output_path: the path where to save the response of LLMs during inference.
num_examples: the number of examples used in few-shot learning.
border_type: the border to split context and prompt. Corrently support empty, =, -, and `.

You are encouraged to check more arguments and definiation in few_shot.py

In-context Learning

Inspired by the success of in-context learning, we employ this technique to teach an LLM the boundaries between data and instructions.

cd defense/black_box
python few_shot.py --bipia_seed 2023 --fewshot_seed 2023 --dataset_name {task} \
--train_context_data_file path/of/task/train/external/content/file \
--train_attack_data_file path/of/task/train/attack/file \
--test_context_data_file path/of/task/test/external/content/file \
--test_attack_data_file path/of/task/test/attack/file \
--llm_config_file config/{llm_name}.yaml \
--batch_size 20 --output_path path/of/output/file \
--log_steps 10 --resume --border_type empty --num_examples {num_examples}

Arguments:

task: the selected task name, you can choose anyone from ["code", "email", "qa", "abstract", "table"]
llm_name: the model you want to test.
output_path: the path where to save the response of LLMs during inference.
num_examples: the number of examples used in few-shot learning.
border_type: the border to split context and prompt. Corrently support empty, =, -, and `.

You are encouraged to check more arguments and definiation in few_shot.py

Multi-turn Dialogue

By separating external content from instructions into different turns and distancing malicious instructions from the end of a prompt, ASR should be reduced.

cd examples
python run.py --seed 2023 --mode inference \
--context_data_file path/of/task/test/external/content/file \
--dataset_name {task} \
--attack_data_file path/of/task/test/attack/file \
--llm_config_file config/{llm_name}.yaml \
--output_path path/of/output/file  --batch_size 20 \
--add_ign_guidance --log_steps 10 --resume

Arguments:

task: the selected task name, you can choose anyone from ["code", "email", "qa", "abstract", "table"]
llm_name: the model you want to test.

White-box Defense

We propose a white-box defense methods that apply adversarialtraining to the self-supervised fine-tuning stage of an LLM to teach it to ignore instructions in data (external content), thus enhancing its robustness against indirect prompt injection attacks.

We employ three different methods to construct benign responses:

BIPIA: Using labels from the BIPIA dataset. This method guarantees the correctness of the responses but may limit their diversity.
Original LLM: Using benign responses generated by the original LLM on prompts without malicious instructions.
GPT-4: Using responses generated by GPT-4 on prompts without malicious instructions.

Collecting Clean Responses

For the last Original LLM and GPT-4, we use the following command to generate the benign responses.

cd examples
python collect_clean_response.py --seed 2023 --dataset_name {task} \
--context_data_file path/of/task/test/external/content/file \
--attack_data_file path/of/task/test/attack/file \
--llm_config_file config/{llm_name}.yaml \
--batch_size 20 --output_path path/of/output/file \
--log_steps 10 --resume --split train

Arguments:

task: the selected task name, you can choose anyone from ["code", "email", "qa", "abstract", "table"]
llm_name: the model you want to test.

Fine-tuning

In our experiments we train the LLMs with 8 V100 GPUs, and we use the DeepSpeed library to accelerate the training process. You can change the command according to your setting.

cd defense/white_box
deepspeed finetune.py \
  --llm_config_file ../../config/{llm_name}.yaml \
  --model_structure special_token \
  --output_dir path/of/output/dir \
  --fp16 True --fp16_opt_level O2 \
  --max_steps 1000 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --evaluation_strategy no \
  --save_strategy steps \
  --save_steps 100 \
  --save_total_limit 100 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --logging_steps 1 \
  --model_max_length 2048 \
  --gradient_checkpointing True \
  --qa_context_data_file ../../benchmark/qa/train.jsonl \
  --email_context_data_file ../../benchmark/email/train.jsonl \
  --table_context_data_file ../../benchmark/table/train.jsonl \
  --abstract_context_data_file ../../benchmark/abstract/train.jsonl \
  --code_context_data_file ../../benchmark/code/train.jsonl \
  --text_attack_data_file ../../benchmark/text_attack_train.json \
  --code_attack_data_file ../../benchmark/text_attack_train.json \
  --response_strategy {response_strategy} \
  --qa_response_file path/to/qa/clean/response/file \
  --email_response_file path/to/email/clean/response/file \
  --table_response_file path/to/email/clean/response/file \
  --abstract_response_file path/to/abstract/clean/response/file \
  --code_response_file path/to/code/clean/response/file \
  --dataset_name all \
  --bipia_seed 2023 \
  --deepspeed ds_config.json \
  --report_to wandb \
  --add_ign_guidance True

Arguments:

response_strategy: the strategy to construct benign response. use original for BIPIA, self for Original LLM, and gpt4 for GPT-4. qa_response_file, email_response_file, table_response_file, abstract_response_file, code_response_file arguments are not necessary for BIPIA response strategy.

Evaluation

We then generate response of the fine-tuned LLMs on our BIPIA benchmark.

cd defense/white_box
python eval.py --seed 2023 --dataset_name {task} \
--context_data_file path/of/task/test/external/content/file \
--attack_data_file path/of/task/test/attack/file \
--batch_size 200 \
--output_path path/of/output/file \
--model_name_or_path path/of/input/model/path \
--log_steps 1

Arguments:

task: the selected task name, you can choose anyone from ["code", "email", "qa", "abstract", "table"]

Finally, evlaute the ASR of the fine-tuned LLMs.

python run.py --mode evaluate --seed 2023 \
--dataset_name {task} \
--response_path path/of/output/file \
--output_path path/of/asr/file \
--gpt_config_file config/{evaluate_llm_name}.yaml \
--batch_size 20 --log_steps 10 --resume

Arguments:

task: the selected task name, you can choose anyone from ["code", "email", "qa", "abstract", "table"]
evaluate_llm_name: the name of the LLMs for evaluation. Use gpt35 by default.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Defend against indirect prompt injection attacks

Black-box defenses

Border Strings

In-context Learning

Multi-turn Dialogue

White-box Defense

Collecting Clean Responses

Fine-tuning

Evaluation

Files

README.md

Latest commit

History

README.md

File metadata and controls

Defend against indirect prompt injection attacks

Black-box defenses

Border Strings

In-context Learning

Multi-turn Dialogue

White-box Defense

Collecting Clean Responses

Fine-tuning

Evaluation