dynamically set data name for auxiliary asr tasks #5697

Open
wants to merge 1 commit into
base: master

@wietsedv wietsedv commented Mar 8, 2024

What?

Auxiliary ASR data tags cause an error because they all get the data name "text", which is already used for the regular ASR output. After this change, the data name is taken from the tag given in the argument.

Before this change, I received the error below when I tried to run the fleurs recipe. After the change, I can successfully run it without making any adaptations to the recipe.

Why?

The asr.sh script of the asr1 task accepts a --auxiliary_data_tags argument in order to define auxiliary text data inputs. Specifically, the fleurs example makes use of this for an auxiliary language identification task. Currently, this argument is broken because the data name is hardcoded to "text" instead of the tag given in the argument. The "text" data name is already used for the ASR output text, so training fails with a duplicated data name error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/workspace/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/workspace/espnet2/tasks/abs_task.py", line 1120, in main
    cls.main_worker(args)
  File "/workspace/espnet2/tasks/abs_task.py", line 1368, in main_worker
    train_iter_factory = cls.build_iter_factory(
  File "/workspace/espnet2/tasks/abs_task.py", line 1585, in build_iter_factory
    return cls.build_sequence_iter_factory(
  File "/workspace/espnet2/tasks/abs_task.py", line 1617, in build_sequence_iter_factory
    dataset = ESPnetDataset(
  File "/workspace/espnet2/train/dataset.py", line 462, in __init__
    raise RuntimeError(f'"{name}" is duplicated for data-key')
RuntimeError: "text" is duplicated for data-key
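
The failure can be reproduced in isolation. The following is a minimal sketch of a duplicate-name check, not the actual ESPnetDataset code: the function name register_data_names, the dump/... paths, and the lid_utt tag are assumptions made for illustration. Only the error message is taken from the traceback above.

```python
from typing import Dict, List, Tuple


def register_data_names(
    path_name_type_list: List[Tuple[str, str, str]],
) -> Dict[str, Tuple[str, str]]:
    """Map each data name to its (path, type), rejecting repeated names."""
    registry: Dict[str, Tuple[str, str]] = {}
    for path, name, data_type in path_name_type_list:
        if name in registry:
            # Same message as the RuntimeError in the traceback.
            raise RuntimeError(f'"{name}" is duplicated for data-key')
        registry[name] = (path, data_type)
    return registry


# Hypothetical entries; "lid_utt" stands in for an auxiliary data tag.
broken = [
    ("dump/train/wav.scp", "speech", "sound"),
    ("dump/train/text", "text", "text"),
    ("dump/train/lid_utt", "text", "text"),  # hardcoded name collides with ASR text
]
fixed = [
    ("dump/train/wav.scp", "speech", "sound"),
    ("dump/train/text", "text", "text"),
    ("dump/train/lid_utt", "lid_utt", "text"),  # name taken from the tag
]
```

With these inputs, register_data_names(broken) raises the RuntimeError shown in the traceback, while register_data_names(fixed) returns one entry per data name.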

See also

The broken feature was introduced over a year ago in #4756. It is not clear to me whether the issue was caused by untested code or by a later internal ESPnet change.

@mergify mergify bot added the ESPnet2 label Mar 8, 2024

codecov bot commented Mar 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.10%. Comparing base (d004740) to head (a9ddbc6).

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5697       +/-   ##
===========================================
+ Coverage   23.30%   70.10%   +46.80%     
===========================================
  Files         746      746               
  Lines       69369    69369               
===========================================
+ Hits        16163    48634    +32471     
+ Misses      53206    20735    -32471     
Flag                         Coverage Δ
test_configuration_espnet2   ∅ <ø> (∅)
test_integration_espnet1     62.92% <ø> (ø)
test_python_espnet1          18.32% <ø> (ø)
test_python_espnet2          52.05% <ø> (?)

@sw005320 (Contributor) commented

@wanchichen, can you confirm it?

@wanchichen (Contributor) commented

Thanks @wietsedv, this change should be correct. I also think another line may need to be added to do the same for the validation set.
Maybe in line 1394?

_opts+="--valid_data_path_and_name_and_type ${_asr_train_dir}/${aux_dset},${aux_dset},text "
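
To make the suggestion concrete, here is a hedged sketch in Python of how the training and validation options could be built in parallel for each auxiliary tag. It models the shell logic rather than reproducing asr.sh: the helper name aux_data_opts and the example directories are hypothetical, and it assumes the validation files live in their own directory, whereas the line quoted above uses ${_asr_train_dir}.

```python
from typing import List


def aux_data_opts(train_dir: str, valid_dir: str, aux_tags: List[str]) -> List[str]:
    """Build one train and one valid option per auxiliary tag.

    Each option has the form "<flag> <path>,<name>,<type>", with the tag
    itself used as the data name instead of a hardcoded "text".
    """
    opts: List[str] = []
    for tag in aux_tags:
        opts.append(
            f"--train_data_path_and_name_and_type {train_dir}/{tag},{tag},text"
        )
        opts.append(
            f"--valid_data_path_and_name_and_type {valid_dir}/{tag},{tag},text"
        )
    return opts


# Hypothetical directories and tag, for illustration only.
for opt in aux_data_opts("dump/train", "dump/valid", ["lid_utt"]):
    print(opt)
```

Building both options in the same loop keeps the data names of the training and validation sets in sync, which is the point of the suggested extra line.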

@sw005320 sw005320 added the Bugfix and ASR (Automatic speech recognition) labels Mar 17, 2024
@sw005320 (Contributor) commented

@wietsedv, can you confirm @wanchichen's suggestion?

@sw005320 sw005320 added this to the v.202405 milestone Mar 28, 2024