[Feature Enhancement] Use Mozilla Readability for content extraction #16
Comments
@Dhravya Are you accepting pull requests and enhancements? If so, what scope is acceptable for those changes? Some enhancements may not be possible to integrate without structural changes. 🤔
We accept any kind of pull request. Structural changes can be discussed here!
This is very helpful. One problem we had was reliable extraction of text from the website and tbh, we just gave up on it. This would actually be very helpful.
Can this also be used by our browser rendering agent? Shouldn't be too big of a change, Readability looks like a one-line API. We can also use this: https://r.jina.ai/dhravya.dev
Text extraction is difficult, and this is one of the few tools I've found that does it somewhat reliably. I was doing light work on a personal knowledge archiver (automatically extracting the text of webpages I visit, with ArchiveBox archiving, for later vectorization and second-brain retrieval). You can probably tell why your project excites me so much :)
haha! Thanks douglas! I am looking at JinaAI and FireCrawl by Mendable https://x.com/mendableai/status/1780289422644109686 Readability and Jina are genuinely the only two good solutions I could find
Great insights @Dhravya !! I'll start working on this
Alright @jayeshp19, I'll assign it to you
The task is basically to call the function in https://github.com/mozilla/readability while getting content, both in the extension and in the getPageContent API route.
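A minimal sketch of what that call could look like, not the project's actual code. The helper name `extractReadableText` and the fallback behaviour are assumptions for illustration. The only library behaviour relied on is that `new Readability(doc).parse()` returns an article object with `title` and `textContent`, or `null` when nothing could be extracted. The class is passed in as a parameter so the same helper could be used from the extension and from the API route.

```typescript
// Fields read from the result of Readability's parse().
interface Article {
  title?: string | null;
  textContent?: string | null;
}

// Structural type for the Readability class.
// In real code: import { Readability } from "@mozilla/readability";
interface ReaderCtor<Doc> {
  new (doc: Doc): { parse(): Article | null };
}

// Hypothetical helper: prefer Readability's text, otherwise fall back
// to whatever raw text the caller already has.
function extractReadableText<Doc>(
  Reader: ReaderCtor<Doc>,
  doc: Doc,
  fallbackText: string,
): { title: string; text: string; usedReadability: boolean } {
  const article = new Reader(doc).parse();
  const text = article?.textContent?.trim();
  if (!text) {
    // parse() returned null or an empty article
    return { title: "", text: fallbackText, usedReadability: false };
  }
  return { title: article?.title ?? "", text, usedReadability: true };
}
```

In the extension this might be called as `extractReadableText(Readability, document.cloneNode(true) as Document, document.body.innerText)`. Cloning matters because Readability modifies the document it parses.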
Thanks, I'll start working on it
Firefox reader mode simplifies many websites by extracting just the content. The library that does this is open source: https://github.com/mozilla/readability
Page extraction could use this to better target page content. If the backend can accept content directly (instead of trying to extract it itself), the browser extension could conceivably use Readability to extract that content early and send it along.
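One way the "backend accepts content directly" idea could be sketched is below. The request shape (`url`, optional `content`) and the function name `pickContentSource` are assumptions, not the project's existing API. The point is only the decision: use text the extension already extracted when it is present, and otherwise signal that the server still has to fetch and extract the page itself.

```typescript
// Hypothetical request body for a page-content route.
interface PageContentRequest {
  url: string;
  content?: string; // text already extracted in the browser, if any
}

// Either the client supplied usable text, or the server must extract it.
type ContentSource =
  | { kind: "client"; content: string }
  | { kind: "server"; url: string };

function pickContentSource(req: PageContentRequest): ContentSource {
  const supplied = req.content?.trim();
  if (supplied) {
    return { kind: "client", content: supplied };
  }
  // Missing or whitespace-only content: fall back to server-side extraction.
  return { kind: "server", url: req.url };
}
```

A route handler would then branch on `kind`, running its own extraction only in the `"server"` case.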