[Issue] latest run_pseudo_labelling.py #106

Open

ckcraig01 opened this issue Apr 1, 2024 · 0 comments

Dear Author,

Thanks for your great work. Pseudo-labelling worked fine with the previous implementation (from about a month ago), but after updating the codebase to the latest main branch and following the README, I ran into the issue below.

1. My command line:

accelerate launch run_pseudo_labelling.py \
  --model_name_or_path "openai/whisper-medium" \
  --dataset_name "mozilla-foundation/common_voice_16_1" \
  --dataset_config_name "zh-TW" \
  --dataset_split_name "train+validation+test" \
  --text_column_name "sentence" \
  --id_column_name "path" \
  --output_dir "./common_voice_16_1_zh_tw_pseudo_labelled" \
  --wandb_project "distil-whisper-labelling" \
  --per_device_eval_batch_size 8 \
  --dtype "bfloat16" \
  --attn_implementation "sdpa" \
  --logging_steps 500 \
  --max_label_length 256 \
  --concatenate_audio \
  --preprocessing_batch_size 500 \
  --preprocessing_num_workers 8 \
  --dataloader_num_workers 8 \
  --language "zh" \
  --task "transcribe" \
  --return_timestamps \
  --streaming False \
  --generation_num_beams 1

2. Error message:
04/01/2024 09:12:23 - INFO - __main__ - ***** Running Labelling *****
04/01/2024 09:12:23 - INFO - __main__ -   Instantaneous batch size per device = 8
04/01/2024 09:12:23 - INFO - __main__ -   Total eval batch size (w. parallel & distributed) = 16
04/01/2024 09:12:23 - INFO - __main__ -   Predict labels with timestamps = True
Evaluating train...:   0%|                                                                                                                                           | 0/52 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
    main()
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
    eval_step_with_save(split=split)
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
    for step, batch in enumerate(batches):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1169, in __iter__
    for obj in iterable:
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
    return tensor.to(device)
           ^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Traceback (most recent call last):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
    main()
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
    eval_step_with_save(split=split)
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
    for step, batch in enumerate(batches):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
    return tensor.to(device)
           ^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/myenv/distil_whisper/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/myenv/distil_whisper/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /alghome/craig.hsin/framework/distil-whisper/training/wandb/offline-run-20240401_091200-0oe1zyh2
wandb: Find logs at: ./wandb/offline-run-20240401_091200-0oe1zyh2/logs
[2024-04-01 09:12:31,572] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2703427) of binary: /myenv/distil_whisper/bin/python
Traceback (most recent call last):
  File "/myenv/distil_whisper/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_pseudo_labelling.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-01_09:12:31
  host      : alg4
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2703428)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-01_09:12:31
  host      : alg4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2703427)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

3. Some of my environment info:

   Name          Version       Build        Channel
   python        3.11.8        h955ad1f_0
   torch         2.1.1+cu118   pypi_0       pypi
   transformers  4.39.1        pypi_0       pypi
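
For context, here is a minimal sketch of what I think triggers the TypeError (this is only my guess from the traceback, not code from the repository; the field names and shapes are made up): BatchFeature.to() calls torch.is_floating_point() on every value in the batch, and that call only accepts tensors, so any field that is still a plain Python list fails in exactly this way:

import torch
from transformers.feature_extraction_utils import BatchFeature

# Hypothetical batch: "labels" is left as a plain Python list instead of a
# tensor, which is what send_to_device() appears to be receiving above.
batch = BatchFeature({
    "input_features": torch.randn(1, 80, 3000),
    "labels": [[50258, 50359, 50363]],  # list, not a tensor
})

# BatchFeature.to() iterates over the values and calls torch.is_floating_point(v),
# which only accepts tensors ->
# TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
batch.to("cpu")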

Could you provide some suggestions on how I might proceed with the investigation? Thanks.
