pyarrow errors out when running llama-factory SFT on the deepctrl dataset; has the repo owner run into this? #8
Comments
I sampled from that dataset rather than loading it in full; see PR: hiyouga/LLaMA-Factory#3004
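The sampling approach above can be sketched as follows. This is a minimal illustration, not the repo owner's actual script: `sample_jsonl` is a hypothetical helper that reservoir-samples k records from a large JSONL file without ever holding the whole dataset in memory, so the sampled subset can then be fed to LLaMA-Factory instead of the full 10 GB+ file.

```python
import json
import random

def sample_jsonl(src_path, dst_path, k, seed=42):
    """Reservoir-sample k records from a large JSONL file.

    Hypothetical helper for illustration: reads one line at a time,
    so memory use is O(k) regardless of the source file size.
    """
    random.seed(seed)
    reservoir = []
    with open(src_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            if i < k:
                reservoir.append(record)
            else:
                # Each record is kept with probability k/(i+1).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = record
    with open(dst_path, "w", encoding="utf-8") as f:
        for record in reservoir:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Reservoir sampling gives every record an equal chance of being selected in a single pass, which matters when the source file is too large to shuffle.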
Hi, could you share the run parameters you used for LoRA finetuning? I also sampled 5M Chinese + 1M English examples from deepctrl to finetune llama3-7b. After one epoch of training, the Chinese performance is still poor; the model can't even predict the end-of-sequence token correctly. Many thanks.
After one epoch of finetuning, the loss plateaued at 1.2 :(
env CUDA_VISIBLE_DEVICES=1,2,3 deepspeed src/train_bash.py \
--stage sft \
--do_train \
--flash_attn \
--template llama3 \
--model_name_or_path Meta-Llama-3-8B \
--dataset other_self_cognition,deepctrl-sft-data_zh,deepctrl-sft-data_en \
--finetuning_type lora \
--use_dora \
--loraplus_lr_ratio 24.0 \
--preprocessing_num_workers 40 \
--lora_rank 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_target all \
--output_dir llama3-chinese \
--overwrite_output_dir \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 32 \
--cutoff_len 8192 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 500 \
--eval_steps 500 \
--val_size 1000 \
--save_total_limit 100 \
--logging_first_step True \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--warmup_ratio 0.1 \
--weight_decay 0.05 \
--num_train_epochs 1.0 \
--plot_loss \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--bf16 \
--cache_dir ./cache \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--deepspeed ./deepspeed_zero_stage2_config.json
The base model is llama3-8b.
The finetuning framework is llama-factory.
The dataset is the deepctrl Chinese dataset, the 10 GB+ one.
During finetuning, pyarrow threw an error. From searching around, some people say the datasets library hits this problem when loading large datasets. How did the repo owner solve it?
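One common workaround for pyarrow errors on very large single files (an assumption here, not necessarily what the repo owner did, whose answer was to sample the dataset) is to split the JSONL into smaller shards before loading, so no single file forces pyarrow to build an oversized Arrow chunk. `shard_jsonl` and the shard size are hypothetical names chosen for illustration:

```python
import json

def shard_jsonl(src_path, dst_prefix, shard_size):
    """Split a large JSONL file into shards of at most shard_size lines.

    Hypothetical helper: each shard is written as
    {dst_prefix}-NNNNN.jsonl and the list of shard paths is returned.
    A directory of smaller files can then be passed to a dataset loader
    instead of one 10 GB+ file.
    """
    shard, idx, paths = [], 0, []

    def flush():
        nonlocal shard, idx
        path = f"{dst_prefix}-{idx:05d}.jsonl"
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(shard)
        paths.append(path)
        shard, idx = [], idx + 1

    with open(src_path, encoding="utf-8") as f:
        for line in f:
            shard.append(line)
            if len(shard) >= shard_size:
                flush()
    if shard:
        flush()
    return paths
```

Another option worth trying (again an assumption about your setup) is streaming mode in the `datasets` library, i.e. `load_dataset(..., streaming=True)`, which iterates records lazily instead of materializing the full Arrow table.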