[Feature Enhancement] Use Mozilla Readability for content extraction #16

douglasg14b · 2024-04-16T16:18:53Z

Firefox reader mode simplifies many websites by extracting just the content. The library that does this is open source: https://github.com/mozilla/readability

Page extraction could use this to better target page content. If the backend can accept content directly (instead of extracting it itself), the browser extension could conceivably use Readability to extract that content up front.
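A rough sketch of what "the backend accepts content directly" could look like. Everything here is hypothetical (the `PageContentRequest` shape, the `content` field, and `planExtraction` are illustrative names, not the project's actual API); it only shows the decision between using text sent by the extension and falling back to server-side extraction.

```typescript
// Hypothetical request shape: the extension may attach text it already extracted.
interface PageContentRequest {
  url: string;
  content?: string; // optional pre-extracted text (assumed field name)
}

// What the backend should do with a given request.
type ExtractionPlan =
  | { kind: "use-supplied"; text: string }
  | { kind: "extract-on-server"; url: string };

function planExtraction(req: PageContentRequest): ExtractionPlan {
  // Treat missing or whitespace-only content as "nothing supplied".
  const supplied = req.content?.trim();
  if (supplied) {
    return { kind: "use-supplied", text: supplied };
  }
  // Otherwise the backend still has to fetch and extract the page itself.
  return { kind: "extract-on-server", url: req.url };
}
```

Keeping the server-side path as a fallback means callers that send only a URL would continue to work.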

douglasg14b · 2024-04-16T16:21:18Z

@Dhravya Are you accepting pull requests and enhancements? If so, what scope is acceptable for those changes? Some enhancements may not be possible to integrate without structural changes. 🤔

Dhravya · 2024-04-16T19:04:22Z

We accept any kind of pull requests. Structural changes can be discussed here!

Dhravya · 2024-04-16T19:05:34Z

This is very helpful. One problem we had was reliable extraction of text from the website and tbh, we just gave up on it.

This would actually be very helpful.

Dhravya · 2024-04-16T19:07:29Z

Can this also be used by our browser rendering agent? It shouldn't be too big of a change; Readability looks like a one-line API.

We can also use this https://r.jina.ai/dhravya.dev

douglasg14b · 2024-04-16T21:55:19Z

Text extraction is difficult; this is one of the few tools I've found that does it somewhat reliably. I was doing light work on a personal knowledge archiver (automatically extracting text from webpages I visit, with ArchiveBox archiving, for later vectorization and second-brain retrieval). You can probably tell why your project excites me so much :)

Dhravya · 2024-04-18T08:37:25Z

haha! Thanks douglas!

I am looking at JinaAI and FireCrawl by Mendable: https://x.com/mendableai/status/1780289422644109686
It looks like most are just using turndown. Firecrawl is even server-side, so it doesn't work with sites that have a captcha, etc.

Readability and Jina are genuinely the only two good solutions I could find.

jayeshp19 · 2024-04-26T11:05:43Z

Great insights @Dhravya !! I'll start working on this

Dhravya · 2024-04-27T02:57:00Z

Alright @jayeshp19, I'll assign it to you

Dhravya · 2024-04-27T02:57:50Z

The task is basically to call the function in https://github.com/mozilla/readability when getting content.

Both in the extension and the getPageContent API route.
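A minimal sketch of how that call could be wrapped so the extension and the API route share one code path. The helper `extractText` and its types are hypothetical; the parser is passed in as a factory so the snippet stays self-contained, on the assumption that in real use the factory would be `(doc) => new Readability(doc)` from `@mozilla/readability`, whose `parse()` returns an article object (with fields such as `title` and `textContent`) or `null`.

```typescript
// Subset of the fields Readability's parse() result exposes (others omitted here).
interface ParsedArticle {
  title?: string | null;
  textContent?: string | null;
}

// Anything with a parse() method that may fail and return null.
interface ArticleParser {
  parse(): ParsedArticle | null;
}

// Assumed real-world value: (doc) => new Readability(doc)
type ParserFactory<Doc> = (doc: Doc) => ArticleParser;

function extractText<Doc>(
  doc: Doc,
  makeParser: ParserFactory<Doc>,
  fallbackText: string,
): { title: string; text: string } {
  const article = makeParser(doc).parse();
  const text = article?.textContent?.trim();
  // If parsing fails or yields no text, keep whatever the caller already had.
  if (!text) {
    return { title: "", text: fallbackText };
  }
  return { title: article?.title ?? "", text };
}
```

In the extension, the document would likely be passed as a clone (`document.cloneNode(true)`), since Readability modifies the DOM it is given. In the getPageContent route there is no browser DOM, so one would first have to be built from the fetched HTML, for example with jsdom.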

jayeshp19 · 2024-04-27T04:48:55Z

Thanks I'll start working on it

Dhravya assigned jayeshp19 Apr 27, 2024
