Pseudo-labelling librispeech_asr (train.360): KeyError train-360 when not streaming. #96

Open
guynich opened this issue Mar 10, 2024 · 1 comment

guynich commented Mar 10, 2024

When not streaming, this line raises a KeyError for train-360, so the pseudo-labelled dataset was not saved after hours of compute.

I think this KeyError might be caused by this code line that changes the split name.

My bash script uses the librispeech_asr split name train.360 as defined here.

accelerate launch distil-whisper/training/run_pseudo_labelling.py \
  --model_name_or_path "openai/whisper-large-v2" \
  --dataset_name "librispeech_asr" \
  --dataset_config_name "clean" \
  --dataset_split_name "train.360+validation+test" \
  --text_column_name "text" \
  --id_column_name "id" \
  --output_dir "./datasets_distil_whisper/librispeech_asr_clean_en_medium_en_pseudo_labelled" \
  --per_device_eval_batch_size 64 \
  --dtype "bfloat16" \
  --dataloader_num_workers 16 \
  --preprocessing_num_workers 16 \
  --logging_steps 2000 \
  --max_label_length 128 \
  --task "transcribe" \
  --return_timestamps \
  --attn_type "flash_attn" \
  --streaming False \
  --generation_num_beams 1 \
  --decode_token_ids False \
  --push_to_hub False
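
The suspected mismatch can be reproduced in isolation. This is a minimal sketch, not the script's actual code: `raw_datasets` is a plain dict standing in for the non-streaming dataset, and `lookup_after_rename` is a hypothetical helper. Only the `replace`/`split` expression is taken from the line quoted in this issue.

```python
# Stand-in for the non-streaming dataset, keyed by the original split names.
# (Assumption: the real object is indexed by split name in a similar way.)
raw_datasets = {
    "train.360": ["sample_a"],
    "validation": ["sample_b"],
    "test": ["sample_c"],
}


def prettify(split: str) -> str:
    # Same transformation as the code line quoted in this issue.
    return split.replace(".", "-").split("/")[-1]


def lookup_after_rename(datasets: dict, split: str):
    # Hypothetical helper: rename first, then index with the renamed key.
    split = prettify(split)
    return datasets[split]


for name in "train.360+validation+test".split("+"):
    try:
        lookup_after_rename(raw_datasets, name)
        print(f"{name}: ok")
    except KeyError as err:
        print(f"{name}: KeyError {err}")
```

Only train.360 fails in this sketch, because validation and test contain no "." and are left unchanged by the rename.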

guynich commented Mar 10, 2024

I commented out the code line and my bash script ran to completion.

e.g.:

# make the split name pretty for librispeech etc
# split = split.replace(".", "-").split("/")[-1]
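
A possible alternative to commenting the line out is to keep the original split name for lookups and use the prettified name only for output. This is a rough sketch under that assumption; `pretty_split_name` and `plan_outputs` are hypothetical names, and I have not checked how the script uses the split name downstream.

```python
def pretty_split_name(split: str) -> str:
    # Same transformation as the commented-out line above.
    return split.replace(".", "-").split("/")[-1]


def plan_outputs(datasets: dict, split_names: list) -> dict:
    """Map each original split key to the pretty name used for output."""
    plan = {}
    for split in split_names:
        # Index with the ORIGINAL key so names like "train.360" still resolve.
        _ = datasets[split]
        # Use the pretty name only for file or log naming.
        plan[split] = pretty_split_name(split)
    return plan


# Plain dict standing in for the non-streaming dataset (assumption).
raw_datasets = {"train.360": [], "validation": [], "test": []}
print(plan_outputs(raw_datasets, ["train.360", "validation", "test"]))
```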
