Initial transcription support #494

Open
wants to merge 2 commits into
base: develop

Conversation


@FlakM FlakM commented Jan 15, 2023

Hi! So this is the initial MR for getting the ball rolling on incorporating the transcriptions created for issue #301. The idea is that the transcriptions should be a plain json file and they should be displayed only for the pages where the relevant transcription is already present.

This is in no way ready code, just an initial setup; maybe someone will have an easier time picking it up now 👍

Features I'd like to see:

  • links at timestamp to set playback of local web player to given time
  • link to GitHub repo to the corresponding JSON with transcriptions to enable easy edits
  • some nice formatting of the text
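
The timestamp links from the first item could be generated roughly as in the sketch below. It assumes each transcript JSON holds a list of segments with `start` (seconds) and `text` keys, which resembles Whisper's JSON output but is not confirmed as the format used in this PR. The `t` query parameter and the example URL are hypothetical as well.

```python
from urllib.parse import urlencode


def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS for display next to a segment."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def timestamp_link(episode_url: str, seconds: float) -> str:
    """Build a link that carries the offset in a hypothetical `t` parameter."""
    return f"{episode_url}?{urlencode({'t': int(seconds)})}"


# Assumed segment shape; the real JSON in the PR may differ.
segments = [
    {"start": 0.0, "text": "Welcome to the show."},
    {"start": 3725.4, "text": "Let's talk about transcripts."},
]

for seg in segments:
    label = format_timestamp(seg["start"])
    link = timestamp_link("https://example.com/show/episode/", seg["start"])
    print(f"[{label}]({link}) {seg['text']}")
```

Truncating to whole seconds keeps the links short; a player that supports sub-second seeking could keep the fraction instead.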

Unfortunately, Whisper is currently cutting the sentences strangely; this should be fixed sometime in the future.
I'd be happy to rerun the transcriptions then and backport any modifications.

@FlakM changed the title from "WIP: initial transcription support" to "Initial transcription support" on Jan 15, 2023
@gerbrent
Collaborator

Sounds as though this PR should be marked WIP?

and... very exciting!!!!

@FlakM changed the title from "Initial transcription support" to "WIP: Initial transcription support" on Jan 17, 2023
@FlakM
Author

FlakM commented Jan 17, 2023

I've added the WIP flag, but this is misleading since I'm not currently able to work on it a lot. With the limited time I get, I'd rather focus on improving the transcriptions and maybe preparing a POC with search, which was my initial goal.

The current code requires some love to improve the looks (a little HTML, some CSS, and maybe JavaScript to set the correct time in a web player).

It seems like a perfect opportunity for someone to pick up a nice task. I'd be more than happy to "mentor" as much as I can.

@ChanceM
Contributor

ChanceM commented Jan 17, 2023

Just a thought: do we want this to be on a separate page, or should it ideally be embedded into the episode page? That would also be better for the JS interaction. I was thinking maybe tabs: "show notes" and "transcript".

@FlakM
Author

FlakM commented Jan 17, 2023

Currently, it is embedded inside the episode page:

[screenshot: transcripts]

Whisper is sadly cutting them strangely for some files.
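
One possible workaround until the model improves, which this PR does not implement, is to merge adjacent segments until the text ends with sentence punctuation. The sketch below assumes segments are dicts with `start`, `end`, and `text` keys; that shape is an assumption, not the confirmed format of the uploaded JSON files.

```python
SENTENCE_END = (".", "?", "!")


def merge_into_sentences(segments):
    """Join consecutive segments until the text ends with sentence punctuation."""
    merged = []
    current = None
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments
        if current is None:
            current = {"start": seg["start"], "end": seg["end"], "text": text}
        else:
            current["text"] = f"{current['text']} {text}"
            current["end"] = seg["end"]
        if text.endswith(SENTENCE_END):
            merged.append(current)
            current = None
    if current is not None:  # keep a trailing fragment rather than drop it
        merged.append(current)
    return merged


# Hypothetical input that is cut mid-sentence.
raw = [
    {"start": 0.0, "end": 2.0, "text": " So this week we"},
    {"start": 2.0, "end": 4.5, "text": " looked at transcripts."},
    {"start": 4.5, "end": 6.0, "text": " Any thoughts?"},
]
sentences = merge_into_sentences(raw)
```

This is purely a display-side heuristic: abbreviations such as "e.g." would still split sentences early, so it does not replace rerunning the model later.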

@FlakM
Author

FlakM commented Jan 22, 2023

Since this has not received much attention, I've picked it up myself. For now, setting the playback time from a timestamp is not possible, but the folks at Podverse will soon add it: podverse/podverse-web#1071 (reply in thread) 💪

@FlakM
Author

FlakM commented Jan 23, 2023

Over the weekend I gave it a run and uploaded fresh transcripts (90ish) for different episodes.
Here are the screenshots:

[screenshot: 01]
[screenshot: 02]

As mentioned above, it is currently impossible to set the current time in the Podverse online player (well, unless we proxy the Podverse player on the same domain, but this would open a can of worms). Transcripts are imperfect but are easily editable by users, even using an online GitHub client 👍 For the newest ones I'll definitely want to run the large.en model.

@gerbrent @ChanceM @pagdot @noblepayne (people mentioned in other issues) please provide feedback 😄

@FlakM changed the title from "WIP: Initial transcription support" to "Initial transcription support" on Jan 23, 2023
@pagdot
Contributor

pagdot commented Jan 24, 2023

Can you also upload the code to run the transcriptions? Could imagine it also being in another repository, but it would allow others to also work on it or just be inspired :)

@FlakM
Author

FlakM commented Jan 24, 2023

Can you also upload the code to run the transcriptions? Could imagine it also being in another repository, but it would allow others to also work on it or just be inspired :)

The sources are available in my repository, Jupiter search. I still have a lot of work to do there, but on an x86 machine it should be as simple as downloading a model and running inference using the Docker image.

@elreydetoda
Collaborator

I definitely agree with @ChanceM (src):

Just a thought: do we want this to be on a separate page, or should it ideally be embedded into the episode page? That would also be better for the JS interaction. I was thinking maybe tabs: "show notes" and "transcript".

For a final solution we should have some type of separate page or tabbed area. For now, though, I think having it inline is fine for an initial PoC.

@FlakM, once this is merged (or even before then), could you collect a list of enhancements that we could do for transcriptions? Maybe we'll consider this one as closing #301 and open another one for enhancing the transcription experience. Then we can reference the old issue in the new one, so anyone who wants to make that leap (from PoC -> enhanced) has that link. We can eventually break each of those tasks out into their own GH issues (to allow individuals to work on them separately), but for now I think a single issue would be nice (till we do some spring cleaning on some of these issues 😅)

@FlakM
Author

FlakM commented Feb 19, 2023

Hello @elreydetoda, as I've mentioned in the Matrix (hehe), this work has been taken over by the JB crew, and if I'm not mistaken they have different ideas about how the transcripts are to be generated.

If there is an actual decision to host transcripts on s3, not in a GitHub repo, then this MR should probably get closed and a new one should be created to ingest data from s3. As for further improvements, here is the list of my personal acceptance criteria that I would add:

  • transcripts should be open for edits - mistakes are bound to happen. Some might be offensive
  • there should be only a single source of truth - edits are automatically shown in all places (RSS feed, web site, search index etc)
  • format should be open for future extension - for instance, Whisper currently does not support speaker diarization, but it is entirely possible that some framework will add support for it in the future. AFAIK there might be some work on this in the whisper.cpp project: "whisper : mark speakers/voices (diarization)", ggerganov/whisper.cpp#64 (comment). There will definitely be progress in the quality of the tooling, and it would be awesome to be able to update the transcripts then
  • on the web version clicking on the timestamp should set the current playback time and append time to URL. Hugo should also honour URLs with timestamps to enable sharing a particular moment
  • there should be public documentation on how to run a whole s2t pipeline and hopefully it should be automated
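
The timestamp-in-URL criterion amounts to a round trip: write the offset into the URL on click, and read it back when the page loads. A minimal sketch of that logic follows, assuming a `t` query parameter holding whole seconds; the parameter name and URLs are hypothetical, and on the actual site this would more likely live in the page's JavaScript than in Python.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_timestamp(url: str, seconds: int) -> str:
    """Return `url` with its `t` parameter set to `seconds`, keeping other parameters."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["t"] = [str(seconds)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def timestamp_from_url(url: str) -> int:
    """Read the start offset from `t`; fall back to 0 when missing or invalid."""
    values = parse_qs(urlsplit(url).query).get("t", [])
    try:
        return max(int(values[0]), 0)
    except (IndexError, ValueError):
        return 0


shared = with_timestamp("https://example.com/show/episode/?ref=rss", 754)
start_at = timestamp_from_url(shared)
```

Falling back to 0 means a malformed or missing parameter simply starts playback from the beginning instead of breaking the page.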

@elreydetoda
Collaborator

elreydetoda commented Feb 21, 2023

Hello @elreydetoda, as I've mentioned in the Matrix (hehe), this work has been taken over by the JB crew, and if I'm not mistaken they have different ideas about how the transcripts are to be generated.

Hello @FlakM 😁

Yep, no problem, I remember seeing it (thank you for the reminder 🙃). I just wanted to get my feedback about the longer-term goal added to the PR/issue about this feature.

If there is an actual decision to host transcripts on s3, not in a GitHub repo, then this MR should probably get closed and a new one should be created to ingest data from s3. As for further improvements, here is the list of my personal acceptance criteria that I would add:

  • transcripts should be open for edits - mistakes are bound to happen. Some might be offensive
  • there should be only a single source of truth - edits are automatically shown in all places (RSS feed, web site, search index etc)
  • format should be open for future extension - for instance, Whisper currently does not support speaker diarization, but it is entirely possible that some framework will add support for it in the future. AFAIK there might be some work on this in the whisper.cpp project: "whisper : mark speakers/voices (diarization)", ggerganov/whisper.cpp#64 (comment). There will definitely be progress in the quality of the tooling, and it would be awesome to be able to update the transcripts then
  • on the web version clicking on the timestamp should set the current playback time and append time to URL. Hugo should also honour URLs with timestamps to enable sharing a particular moment
  • there should be public documentation on how to run a whole s2t pipeline and hopefully it should be automated

I completely agree with all of these criteria/features! Whatever I can do to convey these points, I'll definitely try to get them all included (if I'm asked/consulted about this feature). I can't guarantee it'll happen (since in the end it's up to the JB team), but IMO 1, 2, & 5 (of the points you listed above) should be considered critical (even for an MVP) to ensure the transcript doesn't offend someone and reflect badly on JB. If it did, anyone in the community could quickly fix it, and that would fix it for everything/one.

Honestly (just thinking out loud here), do you think a good alternative to an s3 bucket could be something like GH Pages to host just the raw text of the transcripts (in whatever format they need to be in)?
That way the transcripts could just be hosted in a repo (probably a separate one, to simplify separation of concerns) and then published via a GH Actions workflow.

Plus, IIRC, GH Pages already has some type of CDN in front of it. If it doesn't, since it's just text, we could put Cloudflare in front of it too.

@CGBassPlayer added the enhancement, research, and P2.0 labels on Jan 7, 2024
Labels

  • enhancement: New feature, enhancement, or request
  • P2.0: podcasting 2.0 feature
  • research: Only doing research, and might not be implemented
6 participants