Problems in concatenate_dataset #129

Open
George0828Zhang opened this issue May 1, 2024 · 0 comments
In `concatenate_dataset()`:

```python
for idx in range(1, len(audio)):
    prev_speaker = speaker_id[idx - 1]
    speaker = speaker_id[idx]

    if len(audio_sample) + input_lengths[idx] < max_input_length:
        if speaker == prev_speaker:
            # we have no information about whether the segments follow on sequentially
            # so we just ensure the same speaker as we concatenate across files
            audio_sample = np.append(audio_sample, audio[idx])
            # extra spaces in the text transcription don't matter, since we only use it for the WER computation
            text_sample += " " + text[idx]
        else:
            # speakers do not follow sequentially, save the audio and start looping again
            concatenated_audio.append(audio_sample)
            concatenated_text.append(text_sample)
            concatenated_speaker.append(speaker)
            condition_on_prev.append(0)
            audio_sample = audio[idx]
            text_sample = text[idx]
    else:
        # concatenated audio exceeds max length, save the audio and start looping again
        concatenated_audio.append(audio_sample)
        concatenated_text.append(text_sample)
        concatenated_speaker.append(speaker)
        condition_on_prev.append(1)
        audio_sample = audio[idx]
        text_sample = text[idx]
```

From my understanding, the logic in the for loop is:

- If either
  1. adding the current utterance to `audio_sample` would exceed the 30s maximum, or
  2. the current speaker differs from the previous one (`prev_speaker`),
- then save the concatenation accumulated up to the previous utterance (`audio_sample`), which excludes the current utterance.

Since the saved sample does not contain the current utterance:

  1. The appended speaker should be `prev_speaker` rather than `speaker`.
  2. `condition_on_prev` signifies continuity at the start of the current utterance, i.e. whether a sample continues the one before it, so the flags should be shifted right by one (e.g. initialize as `condition_on_prev = [0]`).

Meanwhile, it seems the very last accumulated sample in each batch never gets appended: when the for loop exits, there is a leftover `(audio_sample, text_sample)` pair of <= 30s that should have been saved but wasn't.
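Both effects are easy to reproduce by running the quoted loop on toy inputs (the initialization of `audio_sample`/`text_sample` before the loop is assumed here, since it isn't shown in the snippet):

```python
import numpy as np

# toy inputs: two utterances from speaker "A", then one from "B"
audio = [np.zeros(5), np.zeros(5), np.zeros(5)]
text = ["a1", "a2", "b1"]
speaker_id = ["A", "A", "B"]
input_lengths = [5, 5, 5]
max_input_length = 100  # large enough that only the speaker change triggers a save

concatenated_audio, concatenated_text = [], []
concatenated_speaker, condition_on_prev = [], []
audio_sample, text_sample = audio[0], text[0]

for idx in range(1, len(audio)):
    prev_speaker = speaker_id[idx - 1]
    speaker = speaker_id[idx]
    if len(audio_sample) + input_lengths[idx] < max_input_length:
        if speaker == prev_speaker:
            audio_sample = np.append(audio_sample, audio[idx])
            text_sample += " " + text[idx]
        else:
            concatenated_audio.append(audio_sample)
            concatenated_text.append(text_sample)
            concatenated_speaker.append(speaker)
            condition_on_prev.append(0)
            audio_sample, text_sample = audio[idx], text[idx]
    else:
        concatenated_audio.append(audio_sample)
        concatenated_text.append(text_sample)
        concatenated_speaker.append(speaker)
        condition_on_prev.append(1)
        audio_sample, text_sample = audio[idx], text[idx]

print(concatenated_text)     # ['a1 a2'] — the saved sample is entirely speaker "A"
print(concatenated_speaker)  # ['B']    — but it is recorded under speaker "B"
print(text_sample)           # 'b1'     — the final utterance is never saved
```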

These issues may not seem significant, but when fine-tuning on a custom dataset with diverse speakers, where `condition_on_prev` is expected to be true a lot, they will produce incorrect training signals.
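Taken together, a corrected version of the loop could look like the sketch below. This is only an illustration of the three proposed fixes, not the repo's actual code; the wrapper function `concatenate_utterances` and its signature are hypothetical:

```python
import numpy as np

def concatenate_utterances(audio, text, speaker_id, input_lengths, max_input_length):
    """Hypothetical helper sketching the three fixes described above."""
    concatenated_audio, concatenated_text = [], []
    concatenated_speaker, condition_on_prev = [], []

    audio_sample, text_sample = audio[0], text[0]
    # Fix 2: the first saved sample can never be conditioned on a previous one,
    # so the flags list starts with a 0 (shifting all flags right by one).
    condition_on_prev.append(0)

    for idx in range(1, len(audio)):
        prev_speaker = speaker_id[idx - 1]
        speaker = speaker_id[idx]
        if len(audio_sample) + input_lengths[idx] < max_input_length and speaker == prev_speaker:
            audio_sample = np.append(audio_sample, audio[idx])
            text_sample += " " + text[idx]
        else:
            # Fix 1: the saved sample ends with the *previous* utterance,
            # so record prev_speaker rather than speaker.
            concatenated_audio.append(audio_sample)
            concatenated_text.append(text_sample)
            concatenated_speaker.append(prev_speaker)
            # Fix 2 (shifted flag): the *next* sample is a continuation only
            # when the break was caused by length, not by a speaker change.
            condition_on_prev.append(1 if speaker == prev_speaker else 0)
            audio_sample, text_sample = audio[idx], text[idx]

    # Fix 3: flush the final accumulated sample once the loop exits.
    concatenated_audio.append(audio_sample)
    concatenated_text.append(text_sample)
    concatenated_speaker.append(speaker_id[-1])

    return concatenated_audio, concatenated_text, concatenated_speaker, condition_on_prev
```

For example, three 10-sample utterances from the same speaker with `max_input_length = 25` yield two concatenated samples, both attributed to that speaker, with `condition_on_prev == [0, 1]` since the second sample is a length-induced continuation of the first.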
