Initial transcription support #494

FlakM · 2023-01-15T21:19:25Z

Hi! This is the initial MR to get the ball rolling on incorporating the transcriptions created for issue #301. The idea is that each transcription is a plain JSON file, and that it is displayed only on pages where the relevant transcription is already present.
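A minimal sketch of the "display only where a transcription is present" rule, written in Python for illustration rather than in the site's own templates. The `<transcripts_dir>/<episode_slug>.json` layout and the function name `load_transcript` are assumptions, not what this MR implements.

```python
import json
from pathlib import Path
from typing import Optional


def load_transcript(transcripts_dir: Path, episode_slug: str) -> Optional[dict]:
    """Return the parsed transcript for an episode, or None if none exists.

    Assumed (hypothetical) layout: <transcripts_dir>/<episode_slug>.json
    """
    path = transcripts_dir / f"{episode_slug}.json"
    if not path.is_file():
        # No transcript yet: the caller should simply skip the section.
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A page renderer would call this once per episode and emit the transcript block only when the result is not `None`.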

This is by no means finished code, just an initial setup; maybe someone will have an easier time picking it up now 👍

Features I'd like to see:

- links at each timestamp that set playback of the local web player to the given time
- a link to the GitHub repo pointing at the corresponding transcription JSON, to enable easy edits
- some nice formatting of the text
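To make the first and last items concrete, here is a rough Python sketch of turning one transcript segment into a paragraph with a timestamp link. It assumes each segment carries a `start` offset in seconds and a `text` field; the `data-start` attribute and the `#t=` link target are hypothetical hooks for player JavaScript, not something the current code provides.

```python
from html import escape


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as H:MM:SS."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def render_segment(segment: dict) -> str:
    """Render one segment as an HTML paragraph with a timestamp link.

    Assumes "start" (seconds) and "text" keys; data-start is a
    hypothetical hook that player JavaScript could read.
    """
    start = int(segment["start"])
    label = format_timestamp(start)
    text = escape(segment["text"].strip())
    return f'<p><a href="#t={start}" data-start="{start}">{label}</a> {text}</p>'
```

Escaping the text matters here because transcripts are meant to be user-editable.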

Unfortunately, Whisper currently cuts sentences strangely; this should be fixed sometime in the future.
I'd be happy to rerun the transcriptions then and backport modifications.

gerbrent · 2023-01-17T18:02:12Z

Sounds as though this PR should be marked WIP?

and... very exciting!!!!

FlakM · 2023-01-17T19:03:04Z

I've added the WIP flag, but this is misleading since I'm not currently able to work on it much. With the limited time I have, I'd rather focus on improving the transcriptions and maybe preparing a POC with search, which was my initial goal.

The current code needs some love to improve the looks (a little HTML, some CSS, and maybe JavaScript to set the correct time in a web player).

It seems like a perfect opportunity for someone to pick up a nice task. I'd be more than happy to "mentor" as much as I can

ChanceM · 2023-01-17T19:44:13Z

Just a thought: do we want this to be on a separate page, or should it ideally be embedded into the episode page? That would also be better for the JS interaction. I was thinking maybe tabs: "show notes" and "transcript".

FlakM · 2023-01-17T20:13:45Z

Currently, it is embedded inside the episode site:

Whisper is sadly cutting them strangely for some files.

FlakM · 2023-01-22T10:34:33Z

Since this has not received much attention, I've picked it up myself. For now, setting the playback time from a timestamp is not possible, but the folks at Podverse will soon add it: podverse/podverse-web#1071 (reply in thread) 💪

FlakM · 2023-01-23T21:30:46Z

Over the weekend I gave it a run and uploaded fresh transcripts (90ish) for different episodes.
Here are the screens:

As mentioned above, it is currently impossible to set the current time in the Podverse online player (well, unless we proxy the Podverse player on the same domain, but that would open a can of worms). The transcripts are imperfect but are easily editable by users, even with an online GitHub client 👍 For the newest ones I'll definitely want to run the large.en model.

@gerbrent @ChanceM @pagdot @noblepayne (people mentioned in other issues) please provide feedback 😄

pagdot · 2023-01-24T08:59:06Z

Can you also upload the code to run the transcriptions? Could imagine it also being in another repository, but it would allow others to also work on it or just be inspired :)

FlakM · 2023-01-24T16:56:53Z

> Can you also upload the code to run the transcriptions? Could imagine it also being in another repository, but it would allow others to also work on it or just be inspired :)

The sources are available in my repository, Jupiter search. I still have a lot of work to do there, but on an x86 machine it should be as simple as downloading a model and running inference using the Docker image.

elreydetoda · 2023-02-19T13:14:42Z

I definitely agree with @ChanceM (src):

> Just a thought: do we want this to be on a separate page, or should it ideally be embedded into the episode page? That would also be better for the JS interaction. I was thinking maybe tabs: "show notes" and "transcript".

For a final solution we should have some type of separate page or tabbed area. For now, though, I think having it inline is fine for an initial PoC.

@FlakM, once this is merged (or even before then) could you collect a list of enhancements that we could do for transcriptions? Maybe this one we'll consider as closing #301 and we have another one for enhancing the transcription experience. Then we can reference the old issue in the new one, so anyone that wants to make that leap (from PoC -> enhanced) has that link. We can eventually break each of those tasks out in their own GH issues (to allow individuals to work on them separately), but for now I think just doing a single issue would be nice (till we do some spring cleaning on some of these issues 😅 )

FlakM · 2023-02-19T19:17:10Z

Hello @elreydetoda as I've mentioned in the matrix (hehe) this work has been taken over by JB crew and if I'm not mistaken they have different ideas about how the transcripts are to be generated.

If there is an actual decision to host transcripts on S3, not in a GitHub repo, then this MR should probably be closed and a new one created to ingest data from S3. As for further improvements, here is the list of my personal acceptance criteria that I would add:

1. Transcripts should be open for edits: mistakes are bound to happen, and some might be offensive.
2. There should be a single source of truth, so that edits automatically show up everywhere (RSS feed, website, search index, etc.).
3. The format should be open to future extension. For instance, Whisper does not currently support speaker diarization, but it is entirely possible that some framework will add support for it in the future. AFAIK there might be some work on this in the whisper.cpp project: "whisper : mark speakers/voices (diarization)", ggerganov/whisper.cpp#64 (comment). The quality of the tooling will definitely improve, and it would be awesome to be able to update the transcripts then.
4. On the web version, clicking a timestamp should set the current playback time and append the time to the URL. Hugo should also honour URLs with timestamps, to enable sharing a particular moment.
5. There should be public documentation on how to run the whole s2t pipeline, and hopefully it should be automated.
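For the timestamp-in-URL criterion above, a small Python sketch of both directions: appending a time to an episode URL and reading it back when the page is opened. Using a `#t=<seconds>` fragment (the form used by Media Fragments URIs) is my assumption, as are the function names and the example URL; nothing in this MR fixes the URL format yet.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def with_timestamp(url: str, seconds: int) -> str:
    """Return url with its fragment replaced by t=<seconds>."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment=f"t={seconds}"))


def timestamp_from_url(url: str) -> Optional[int]:
    """Return the start time from a #t=<seconds> fragment, or None."""
    fragment = urlsplit(url).fragment
    value = fragment[2:]
    if fragment.startswith("t=") and value.isascii() and value.isdecimal():
        return int(value)
    # Missing or malformed fragment: play from the beginning.
    return None
```

For example, `with_timestamp("https://example.com/show/1/", 90)` gives a shareable link, and `timestamp_from_url` on that link returns the offset the player should seek to.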

elreydetoda · 2023-02-21T11:40:20Z

> Hello @elreydetoda as I've mentioned in the matrix (hehe) this work has been taken over by JB crew and if I'm not mistaken they have different ideas about how the transcripts are to be generated.

Hello @FlakM 😁

Yep, no problem, I remember seeing it (thank you for the reminder 🙃). I just wanted my feedback about the longer-term goal to be added to the PR/issue about this feature.

> If there is an actual decision to host transcripts on S3, not in a GitHub repo, then this MR should probably be closed and a new one created to ingest data from S3. As for further improvements, here is the list of my personal acceptance criteria that I would add:
>
> 1. Transcripts should be open for edits: mistakes are bound to happen, and some might be offensive.
> 2. There should be a single source of truth, so that edits automatically show up everywhere (RSS feed, website, search index, etc.).
> 3. The format should be open to future extension. For instance, Whisper does not currently support speaker diarization, but it is entirely possible that some framework will add support for it in the future. AFAIK there might be some work on this in the whisper.cpp project: "whisper : mark speakers/voices (diarization)", ggerganov/whisper.cpp#64 (comment). The quality of the tooling will definitely improve, and it would be awesome to be able to update the transcripts then.
> 4. On the web version, clicking a timestamp should set the current playback time and append the time to the URL. Hugo should also honour URLs with timestamps, to enable sharing a particular moment.
> 5. There should be public documentation on how to run the whole s2t pipeline, and hopefully it should be automated.

I completely agree with all of these criteria/features! Whatever I can do to convey these points, I'll definitely try to get them all included (if I'm asked/consulted about this feature). I can't guarantee it'll happen (since in the end it's up to the JB team), but IMO 1, 2, & 5 (of the points you listed above) should be considered critical (even for an MVP) to ensure a transcript doesn't offend someone and reflect badly on JB. If one did, anyone in the community could quickly fix it, and that would fix it everywhere and for everyone.

Honestly (just thinking out loud here), do you think a good alternative to an S3 bucket could be something like GH Pages, hosting just the raw text of the transcripts (in whatever format they need to be in)?
That way the transcripts could live in a repo (probably a separate one, to simplify separation of concerns) and be published via a GH Actions workflow.

Plus, IIRC, GH Pages already has some type of CDN in front of it. If it doesn't, since it's just text, we could put Cloudflare in front of it too.

feature: initial transcription support

f46b3ba

FlakM changed the title ~~WIP: initial transcription support~~ Initial transcription support Jan 15, 2023

FlakM changed the title ~~Initial transcription support~~ WIP: Initial transcription support Jan 17, 2023

chore: add initial bunch of transcripts

5521570

FlakM changed the title ~~WIP: Initial transcription support~~ Initial transcription support Jan 23, 2023

CGBassPlayer added the labels enhancement (New feature, enhancement, or request), research (Only doing research, and might not be implemented), and P2.0 (podcasting 2.0 feature) Jan 7, 2024
