[Feat] Strip non-content tags, headers, footers #1

Open
oliviermills opened this issue Apr 16, 2024 · 5 comments · May be fixed by #16

Comments

@oliviermills

oliviermills commented Apr 16, 2024

The markdown would be much more useful if it stripped headers, footers, and other elements (filters, etc.) that are not core content, i.e. low value for RAG/context. This could be done with tag- or class-based removal from the HTML, with something like Mozilla's Readability, or both! Highly opinionated class-based removal is risky, but it produces higher-value content with less noise.

For example, a language selector in a header ends up in the output and should be stripped:

[Skip to main content](#main-content)

Select LanguageEnglishAfrikaansAlbanianArabicArmenianAzerbaijaniBasqueBelarusianBengaliBosnianBulgarianCatalanCebuanoChinese (Simplified)Chinese (Traditional)CroatianCzechDanishDutchEsperantoEstonianFilipinoFinnishFrenchGalicianGeorgianGermanGreekGujaratiHaitian CreoleHausaHebrewHindiHmongHungarianIcelandicIgboIndonesianIrishItalianJapaneseJavaneseKannadaKhmerKoreanLaoLatinLatvianLithuanianMacedonianMalayMalteseMaoriMarathiMongolianNepaliNorwegianPersianPolishPortuguesePunjabiRomanianRussianSerbianSlovakSlovenianSomaliSpanishSwahiliSwedishTamilTeluguThaiTurkishUkrainianUrduVietnameseWelshYiddishYorubaZulu

Here is a starter list. It should probably be tested against a couple thousand random pages, using an LLM like Haiku with vision as the judge.

const exclude = [
  'header', '.header', '.top', '.navbar', '#header',
  'footer', '.footer', '.bottom', '#footer',
  '.sidebar', '.side', '.aside', '#sidebar',
  '.modal', '.popup', '#modal', '.overlay',
  '.ad', '.ads', '.advert', '#ad',
  '.lang-selector', '.language', '#language-selector',
  '.social', '.social-media', '.social-links', '#social',
  '.menu', '.navigation', 'nav', '#nav',
  '.breadcrumbs', '#breadcrumbs',
  '.form', 'form', '#search-form',
  'script', 'noscript'
];
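
As a rough illustration of how a list like this could be applied, here is a minimal sketch. `stripExcluded` is a hypothetical helper name, not something in this project, and `root` is assumed to be any DOM `Document` or `Element` (browser, jsdom, or similar) that supports the standard `querySelectorAll` and `remove` methods.

```javascript
// Hypothetical helper (not part of the project): removes every element under
// `root` that matches at least one of the given CSS selectors.
// Assumes `root` exposes the standard DOM querySelectorAll API.
function stripExcluded(root, selectors) {
  // An empty selector string would make querySelectorAll throw, so bail out early.
  if (selectors.length === 0) return 0;

  // Joining with commas builds one selector list that matches any entry.
  const matches = root.querySelectorAll(selectors.join(', '));

  let removed = 0;
  for (const el of matches) {
    el.remove(); // detaches the element (and its subtree) from its parent
    removed += 1;
  }
  return removed; // number of matched elements, useful for logging or tests
}
```

Something like `stripExcluded(document, exclude)` would then run before the HTML is converted to markdown. Returning the count makes it easier to spot pages where the list removes suspiciously many elements.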
@oliviermills oliviermills changed the title Strip non-content tags, headers, footers [Feat] Strip non-content tags, headers, footers Apr 16, 2024
@calebpeffer
Contributor

So, we've defaulted towards removing less because (like you said) highly opinionated removal is risky, and it's easy to do further cleaning on the output with regex.
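
A minimal sketch of that kind of regex post-cleaning, applied to the markdown output. The function name `cleanMarkdown` and both patterns are illustrative assumptions based on the language-selector example earlier in this thread, not project code:

```javascript
// Illustrative patterns only; each one is an assumption about what counts as noise.
const noisePatterns = [
  /^\[Skip to [^\]]*\]\(#[^)]*\)$/i, // in-page "skip to content" links
  /^Select Language/i,               // flattened language-selector dropdown
];

// Hypothetical helper: drops lines matching any noise pattern, then tidies
// the blank lines left behind.
function cleanMarkdown(markdown) {
  return markdown
    .split('\n')
    .filter((line) => !noisePatterns.some((re) => re.test(line.trim())))
    .join('\n')
    .replace(/\n{3,}/g, '\n\n') // collapse runs of blank lines
    .trim();
}
```

Line-based patterns like these are easy to audit and extend, but they only catch noise that lands on its own line in the output.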

Like the idea of Readability as an option. Great suggestion!

@nickscamara
Member

nickscamara commented Apr 18, 2024

@oliviermills thank you for this. Just merged an option to remove non-content tags: #14

This is just a start and I think there is room for other improvements here.

@nickscamara
Member

Let me know if you have any feedback!

@oliviermills oliviermills linked a pull request Apr 18, 2024 that will close this issue
@oliviermills
Author

oliviermills commented Apr 18, 2024

I suggest a cleaner function, per my PR #16. It's slightly less aggressive, but it needs integration testing (#15) to see whether it affects the md conversion. I checked Turndown and any customizations within the code base here, and it doesn't use style, so that should be OK.

@nickscamara
Member

Awesome, thanks @oliviermills! Will be checking it out soon.
