
Diarization precision - is there way to improve it? #804

Closed
nikola1975 opened this issue May 14, 2024 · 4 comments

Comments

@nikola1975

I am running speaker diarization with Pyannote 3.0.1 and am struggling to improve the results. Speaker changes are recognized relatively well in English, but the alignment is a bit hit and miss. Sometimes whole sentences spoken by the next speaker are left in the previous speaker's segment, and similar.

Here is the audio file:
https://s3.eu-central-2.wasabisys.com/qira/12/2024/5/inourtime-hobsbawm_6min_1_1715693802051/inourtime-hobsbawm_6min_1_1715693802051.mp3

Here is an example from the beginning of the file: the last sentence in the first segment is spoken by SPEAKER_02, but it stays within SPEAKER_01's segment, and a new segment only starts at the beginning of the next sentence.

Any way to improve this?

    {
        "text": "In the first couple of sentences of your lecture, you say the fundamental assumption behind the various movements of the avant-garde in the arts, which dominated the past century, was that relations between art and society had changed fundamentally, that old ways of looking at the world were inadequate, and new ways must be found. This assumption was correct. Can you tell us why you think that assumption is correct? Largely, it seems to me, because the world in which we live, which is determined by.",
        "start": 0.009,
        "end": 30.009,
        "sentence_spans": [
            [
                0,
                485
            ]
        ],
        "speaker": "SPEAKER_01"
    },
    {
        "text": "enormous changes in technology, by enormous changes in industrialization and in the consequences of industrialization, really produce a number of both experiences and realities which simply cannot be adequately expressed in the old idiom unless that idiom is suitable to expressing something of the 20th century type.",
        "start": 30.811,
        "end": 60.794,
        "sentence_spans": [
            [
                0,
                319
            ]
        ],
        "speaker": "SPEAKER_02"
    },
@nikola1975
Author

I have tried upgrading to Pyannote 3.1, and the problem persists. The alignment is pretty useless: even in a very controlled environment (i.e. a studio-recorded BBC podcast with 3 speakers), it misses quite a bit.

Has anyone had success in making this better?

@nikola1975
Author

OK, I figured out what I was doing wrong. I will leave the explanation here in case someone runs into a similar problem, and will then close the issue.

When sending audio to diarization, I was using the segments created by the transcription process. Those segments were too long (i.e. 3-5 sentences), which meant that the speaker sometimes changed mid-segment and the model picked whichever speaker was most common in that segment. I have now switched to sending the segments created by the alignment process, which are much shorter, and the result is much better.
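The failure mode described above can be sketched in a few lines: if speakers are assigned per multi-sentence segment, the majority speaker wins the whole segment, whereas assigning per word (or per short aligned segment) by maximum temporal overlap avoids that. This is a minimal illustrative sketch of that overlap-based assignment, not the actual whisperx `assign_word_speakers` implementation; the dict shapes and function names are assumptions for the example.

```python
# Illustrative sketch: label each word with the diarization speaker whose
# turn overlaps it the most. Assumes word timestamps from alignment and
# speaker turns from a diarization pipeline such as pyannote.

def overlap(a_start, a_end, b_start, b_end):
    """Duration of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """Return words labeled with the speaker of the best-overlapping turn."""
    labeled = []
    for w in words:
        best = max(turns, key=lambda t: overlap(w["start"], w["end"],
                                                t["start"], t["end"]))
        # Only keep the label if the word actually overlaps that turn.
        has_overlap = overlap(w["start"], w["end"],
                              best["start"], best["end"]) > 0
        labeled.append({**w, "speaker": best["speaker"] if has_overlap else None})
    return labeled
```

With whole transcription segments as the unit, a segment spanning a speaker change gets the majority speaker for all of its text; with word-sized units, only the words inside each turn get that turn's label, which is why the shorter alignment segments fixed the problem.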

@drstuggels

@nikola1975 I am having the same issue, but your solution (the default code example in the README) doesn't solve it. Here's my code:

options = {
    "max_new_tokens": None,
    "clip_timestamps": None,
    "hallucination_silence_threshold": None
}

# 1. Transcribe
model = whisperx.load_model("large-v3", device, compute_type=compute_type, download_root=model_dir, language=language, asr_options=options)
audio = whisperx.load_audio(file_path)
result = model.transcribe(audio, batch_size=batch_size, chunk_size=10, print_progress=True)

# 2. Align to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# 3. Diarize and assign speakers to words
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio, min_speakers=min_speakers)
result = whisperx.assign_word_speakers(diarize_segments, result)

@nikola1975
Author

Are you getting poor results from the diarization overall, or is it wrongly recognizing speakers? My results are not 100% precise now, but they are relatively close to it. I am not sure what your expectations are :)

I suppose you are using the Pyannote 3.1 model? Try running diarization through this demo and check whether you get the same results:
https://huggingface.co/spaces/pyannote/pretrained-pipelines
