[Bug]: Error when training the uie-base model on Ascend NPU #8355

Open
1 task done
wangyu1984 opened this issue Apr 30, 2024 · 1 comment
Labels: bug (Something isn't working)

@wangyu1984

Software environment

- paddle-custom-npu   0.0.0
- paddle2onnx         1.0.5
- paddlefsl           1.1.0
- paddlenlp           2.6.1
- paddlepaddle        0.0.0 (built from source on the develop branch; image: registry.baidubce.com/device/paddle-npu:cann80T2-910B-ubuntu18-aarch64)
(py39) λ user /work/PaddleNLP/model_zoo/uie {develop} npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.0                   Version: 23.0.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 94.4        39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 91.6        37                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 92.3        38                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 92.6        39                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3315 / 65536

Duplicate issues

  • I have searched the existing issues

Error description

Error log:
Traceback (most recent call last):
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 262, in <module>
    main()
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 193, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 888, in train
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 1024, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 2544, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in distributed_concat
    output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in <listcomp>
    output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/opt/py39/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/utils/inplace_utils.py", line 45, in __impl__
    return func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 4635, in reshape_
    out = _C_ops.reshape_(x, shape)
OSError: (External)  ACL error, the error code is : 100000.  (at /work/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:223)
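The traceback ends in the in-place `reshape_` call that `distributed_concat` applies to 0-D tensors (helper.py, line 41). The sketch below is not a confirmed fix. It is one way to check whether the in-place kernel is what triggers the ACL error, by running the same list comprehension with the out-of-place `reshape` instead. The helper names `ensure_at_least_1d` and `check_on_device` are hypothetical, and the device string `"npu:0"` is an assumption about this setup.

```python
def ensure_at_least_1d(tensors):
    """Out-of-place variant of the list comprehension shown in the traceback.

    Any tensor whose shape is empty (0-D) is reshaped to [1] with the
    out-of-place `reshape`; all other tensors are returned unchanged.
    """
    return [t if len(t.shape) > 0 else t.reshape([-1]) for t in tensors]


def check_on_device(device="npu:0"):
    """Hypothetical helper: run the variant above on a single 0-D tensor.

    Paddle is imported lazily so the helper above stays usable without it.
    The default device string is an assumption, not taken from the report.
    """
    import paddle

    paddle.set_device(device)
    scalar = paddle.to_tensor(1.0)  # 0-D tensor, like the gathered loss
    (out,) = ensure_at_least_1d([scalar])
    return out.shape
```

If `check_on_device()` returns normally on the NPU host while `paddle.to_tensor(1.0).reshape_([-1])` raises the same `OSError`, that would point at the in-place reshape kernel in the NPU backend rather than at the trainer logic. Neither outcome has been verified here.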

Steps to reproduce & code

Launch script
python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 21 \
    --per_device_train_batch_size 32 \
    --num_train_epochs 50 \
    --learning_rate 1e-2 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1

@wangyu1984 wangyu1984 added the bug Something isn't working label Apr 30, 2024
@w5688414
Contributor

w5688414 commented May 3, 2024

Hello, we have limited staff and no hardware available to reproduce this. Contributions from developers are welcome.
