
Diarization precision - is there way to improve it? #804

Closed
nikola1975 opened this issue May 14, 2024 · 4 comments

Comments

@nikola1975

I am running speaker diarization with Pyannote 3.0.1 and am struggling to improve the results. Speaker changes are recognized relatively well in English, but the alignment is a bit hit and miss. Sometimes whole sentences spoken by the next speaker are left in the previous speaker's segment, and similar.

Here is the audio file:
https://s3.eu-central-2.wasabisys.com/qira/12/2024/5/inourtime-hobsbawm_6min_1_1715693802051/inourtime-hobsbawm_6min_1_1715693802051.mp3

Here is an example from the beginning of the file: the last sentence in the first segment is spoken by SPEAKER_02, but it stays within SPEAKER_01's segment, and a new segment only starts at the beginning of the next sentence.

Any way to improve this?

    {
        "text": "In the first couple of sentences of your lecture, you say the fundamental assumption behind the various movements of the avant-garde in the arts, which dominated the past century, was that relations between art and society had changed fundamentally, that old ways of looking at the world were inadequate, and new ways must be found. This assumption was correct. Can you tell us why you think that assumption is correct? Largely, it seems to me, because the world in which we live, which is determined by.",
        "start": 0.009,
        "end": 30.009,
        "sentence_spans": [
            [
                0,
                485
            ]
        ],
        "speaker": "SPEAKER_01"
    },
    {
        "text": "enormous changes in technology, by enormous changes in industrialization and in the consequences of industrialization, really produce a number of both experiences and realities which simply cannot be adequately expressed in the old idiom unless that idiom is suitable to expressing something of the 20th century type.",
        "start": 30.811,
        "end": 60.794,
        "sentence_spans": [
            [
                0,
                319
            ]
        ],
        "speaker": "SPEAKER_02"
    },
@nikola1975
Author

I have tried upgrading to Pyannote 3.1, and the problem persists. The alignment is pretty useless: even in a very controlled environment (i.e. a studio-recorded BBC podcast with 3 speakers), it misses quite a bit.

Has anyone had success in making this better?

@nikola1975
Author

OK, I figured out what I was doing wrong. I will leave the explanation here in case someone runs into a similar problem, and will then close the issue.

When sending audio to diarization, I was using the segments created by the transcription process. Those segments were too long (i.e. 3-5 sentences), which meant that the speaker sometimes changed mid-segment and the model picked whichever speaker was most common in that segment. I have now switched to sending the segments created by the alignment process, which are much shorter, and the result is much better.
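The failure mode described above can be sketched in a few lines: if speakers are assigned per multi-sentence segment, the majority speaker wins the whole segment, whereas assigning per word (or per short aligned segment) by maximum temporal overlap avoids that. This is a minimal illustrative sketch of that overlap-based assignment, not the actual whisperx `assign_word_speakers` implementation; the dict shapes and function names are assumptions for the example.

```python
# Illustrative sketch: label each word with the diarization speaker whose
# turn overlaps it the most. Assumes word timestamps from alignment and
# speaker turns from a diarization pipeline such as pyannote.

def overlap(a_start, a_end, b_start, b_end):
    """Duration of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """Return words labeled with the speaker of the best-overlapping turn."""
    labeled = []
    for w in words:
        best = max(turns, key=lambda t: overlap(w["start"], w["end"],
                                                t["start"], t["end"]))
        # Only keep the label if the word actually overlaps that turn.
        has_overlap = overlap(w["start"], w["end"],
                              best["start"], best["end"]) > 0
        labeled.append({**w, "speaker": best["speaker"] if has_overlap else None})
    return labeled
```

With whole transcription segments as the unit, a segment spanning a speaker change gets the majority speaker for all of its text; with word-sized units, only the words inside each turn get that turn's label, which is why the shorter alignment segments fixed the problem.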

@drstuggels

@nikola1975 I am having the same issue, but your solution (the default code example in the README) doesn't solve it. Here's my code:

options = {
    "max_new_tokens": None,
    "clip_timestamps": None,
    "hallucination_silence_threshold": None
}

# 1. Transcribe
model = whisperx.load_model("large-v3", device, compute_type=compute_type, download_root=model_dir, language=language, asr_options=options)
audio = whisperx.load_audio(file_path)
result = model.transcribe(audio, batch_size=batch_size, chunk_size=10, print_progress=True)

# 2. Align to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# 3. Diarize and assign speakers to words
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio, min_speakers=min_speakers)
result = whisperx.assign_word_speakers(diarize_segments, result)

@nikola1975
Author

Are you getting poor results from the diarization overall, or is it wrongly recognizing speakers? My results are not 100% precise now, but they are relatively close to it. I am not sure what your expectations are :)

I suppose you are using the Pyannote 3.1 model? Try running diarization through this demo and check whether you get the same results:
https://huggingface.co/spaces/pyannote/pretrained-pipelines
