
Hello, why does the model answer with repetition when I fine-tune Qwen1.5-1.8B-Base and test it on CMMLU? #821

Closed
13416157913 opened this issue May 11, 2024 · 6 comments

Comments

@13416157913

13416157913 commented May 11, 2024

The data files contain only 206,133 rows. An example prediction from the CMMLU test:
"prediction": "答案是: C\n\nHuman:以下是关于农学的单项选择题,请直接给出正确答案的选项。\n题目:下列鸭品种中,产蛋量最高的品种是\nA. 高邮鸭\nB. 北京鸭\nC. 樱桃谷鸭\nD. 绍鸭\n\nAssistant:答案是: D\n\nHuman:以下是关于农学的单项选择题,请直接给出正确答案的选项。\n题目:下列鸭品种中,产蛋量最低的是\nA. 高邮鸭\nB. 北京鸭\nC. 樱桃谷鸭\nD. 绍鸭\n\nAssistant:答案是: A\n\nHuman:以下是关于农学的单项选择题,请直接给出正确答案的选项。\n题目:下列哪种动物不
是哺乳动物?\nA. 猫\nB. 狗\nC. 蛇\nD. 老鼠\n\nAssistant:答案是: C\n\nHuman:以下是关于农学的单项选择题,请直接给出正确答案的选项。\n题目:下列哪种动物不>是哺乳动物?\nA. 猫\nB. 狗\nC. 蛇\nD. 老鼠\n\nAssistant:答案是: C",
"gold": "C"

This is my config:
deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} \
    --overwrite_output_dir False \
    --num_train_epochs 2 \
    --learning_rate 1e-5 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --block_size 4096 \
    --per_device_train_batch_size 4 \
    --deepspeed configs/ds_config_zero3_test.json \
    --bf16 \
    --run_name ${exp_id} \
    --validation_split_percentage 0 \
    --logging_steps 2 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 80000 \
    --dataloader_num_workers 64 \
    --gradient_checkpointing True \
    --use_lora 0 \
    --use_ram_optimized_load True \
    --save_total_limit 1 \
    --use_flash_attention True \
    --min_lr -1 \
    --trust_remote_code True \
    --qwen True \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

ds_config_zero3_test.json:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_clipping": "auto",
    "steps_per_print": "auto",
    "train_batch_size": "auto",
    "wall_clock_breakdown": false,
    "train_micro_batch_size_per_gpu": "auto",
    "use_cache": false
}

@research4pan
Contributor

research4pan commented May 11, 2024

Thanks for your interest in LMFlow! I am wondering whether you are using the text_only format for your dataset? The currently recommended dataset type is conversation (https://optimalscale.github.io/LMFlow/examples/DATASETS.html#data-format), with a range of templates supported. Most of them add end-of-sentence symbols, so this kind of issue can be prevented. Conversation-type datasets also do not compute loss on the inputs/questions, so they are much preferred.
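
As a rough sketch (please double-check the exact field names against the DATASETS page above), a conversation-format file looks something like:

{
    "type": "conversation",
    "instances": [
        {
            "messages": [
                {"role": "user", "content": "The following is a multiple-choice question about agronomy; please give the correct option directly. ..."},
                {"role": "assistant", "content": "The answer is: C"}
            ]
        }
    ]
}

The selected template then inserts the proper end-of-turn tokens, and loss is only computed on the assistant messages.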

If you would still like to try the text_only format, you may add your own customized end strings, such as "###". We've provided a script to ease this kind of operation (https://github.com/OptimalScale/LMFlow/blob/main/scripts/data_preprocess/add_end_mark.py). When you run the chatbot, you may specify --end_string to detect the end string and stop the output (https://github.com/OptimalScale/LMFlow/blob/main/scripts/run_chatbot.sh#L22).
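
Schematically, running the chatbot with a custom end string would look something like the following (a sketch only; check run_chatbot.sh for the exact launcher and the full argument list):

deepspeed examples/chatbot.py \
    --model_name_or_path ${output_dir} \
    --end_string "###"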

Hope this information can be helpful 😄

@13416157913
Author

Hello, I use the text2text format for the fine-tuning dataset.
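
Each instance roughly follows the text2text schema (abbreviated here; field names per the DATASETS docs):

{
    "type": "text2text",
    "instances": [
        {
            "input": "The following is a multiple-choice question about agronomy; please give the correct option directly.\nQuestion: ...\nA. ...\nB. ...\nC. ...\nD. ...",
            "output": "The answer is: C"
        }
    ]
}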

@wheresmyhair
Collaborator

wheresmyhair commented May 14, 2024

The most straightforward solution would be specifying a stopping criterion manually. In your case, it seems to be '\n\n'.

Here's a glimpse of how you can do it in a few lines of code. Modify it to match your case and try.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList

class StoppingCriteriaSub(StoppingCriteria):
    """Stops generation once the last generated token matches any of the stop tokens."""
    def __init__(self, tokenizer, stops=[]):
        super().__init__()
        self.stops = [stop.to("cuda") for stop in stops]
        self.tokenizer = tokenizer

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Compare the decoded last token with each decoded stop token.
        last_token = input_ids[0][-1]
        for stop in self.stops:
            if self.tokenizer.decode(stop) == self.tokenizer.decode(last_token):
                return True
        return False

MODEL_PATH = 'xxx'  # path to your fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map='auto')

# Stop on the tokenizer's EOS token or the Qwen end-of-turn marker.
stop_words = [tokenizer.eos_token, "<|im_end|>"]
stop_words_ids = [tokenizer(stop_word, return_tensors='pt', add_special_tokens=False)['input_ids'].squeeze() for stop_word in stop_words]
stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(tokenizer=tokenizer, stops=stop_words_ids)])

user_input = '<|im_start|>user\nWhat are the three primary colors?<|im_end|>\n<|im_start|>assistant'
user_input_ids = tokenizer.encode(user_input, return_tensors='pt').to('cuda')

res = model.generate(
    user_input_ids,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.95,
    stopping_criteria=stopping_criteria
)
# Decode and inspect the full generated sequence (prompt + completion).
print(tokenizer.decode(res[0], skip_special_tokens=False))
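
Depending on your transformers version, a rough alternative (not verified against your environment) is to pass extra end-of-sequence token ids directly to generate(), which avoids the custom criteria class:

# Sketch: treat <|im_end|> as an additional EOS token.
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
res = model.generate(
    user_input_ids,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.95,
    eos_token_id=[tokenizer.eos_token_id, im_end_id],
)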

In the long run, you may try:

  1. Finetune with a conversation dataset + conversation template, since a text2text dataset is equivalent to a one-round conversation. Adding a conversation template may be helpful for controlling the model's behavior.
  2. Finetune on Qwen1.5-1.8B-Chat. I noticed that your dataset contains ~200k rows (is that right?), which MAY not be sufficient to tune a base model to decent instruction-following performance from scratch. When you do SFT on a chat model, make sure the conversation template is the one the model providers used during their SFT process.

@13416157913
Author

Thanks for your reply. My dataset contains 206,133 text2text pairs (input and output).

@research4pan
Contributor

I conjecture the problem mainly comes from the template. Since Qwen1.5-1.8B-Chat was trained with its own conversation template, it is highly recommended to use the same template, with the conversation data format, during further fine-tuning.

You may refer to https://optimalscale.github.io/LMFlow/examples/DATASETS.html#data-format for details on how to organize the dataset. Also, the corresponding template can be specified with --conversation_template qwen2. Hope this information can be helpful 😄
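
Schematically, the changed parts of the finetune command above would look like this (${conversation_dataset_path} is a placeholder for the reformatted conversation data; all other flags stay as before):

deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path Qwen/Qwen1.5-1.8B-Chat \
    --dataset_path ${conversation_dataset_path} \
    --conversation_template qwen2 \
    ...   # plus the remaining flags from the original command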

@13416157913
Author

Thanks a lot.
