Remove 'cookies' text when removing headers/footers, etc #78

Open
tractorjuice opened this issue Apr 27, 2024 · 4 comments
Labels
bug (Something isn't working), enhancement (New feature or request), question (Further information is requested)

Comments

@tractorjuice

Remove any cookie-consent text when removing headers and footers.
Many sites in Europe display a cookie acceptance message.
Sometimes this message is the only text returned.

Sometimes it captures something like:

"Skip to main content\n\nCookies \n------------------------------\n\nWe use some essential cookies to make this service work.\n\nWe\u2019d also like to use analytics cookies so we can understand how you use the service and make improvements.\n\nAccept analytics cookies Reject analytics cookies How we use cookies\n\nYou can change your cookie settings\n at any time.\n\nHide cookie message\n\n"

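One way to picture the requested behaviour is a post-processing pass over the extracted text. The sketch below is not how the project does it; `strip_cookie_blocks` and `COOKIE_HINT` are hypothetical names, and the keyword match is an assumption that would need tuning per site and language.

```python
import re

# Hypothetical keyword pattern; real banners vary by site and language.
COOKIE_HINT = re.compile(r"\bcookies?\b", re.IGNORECASE)


def strip_cookie_blocks(text: str) -> str:
    """Drop blank-line-separated blocks that mention cookies."""
    # Split the extracted text into paragraphs on blank lines.
    blocks = re.split(r"\n\s*\n", text)
    # Keep only non-empty blocks without a cookie keyword.
    kept = [b for b in blocks if b.strip() and not COOKIE_HINT.search(b)]
    return "\n\n".join(kept)


sample = (
    "Skip to main content\n\nCookies \n------------------------------\n\n"
    "We use some essential cookies to make this service work.\n\n"
    "We\u2019d also like to use analytics cookies so we can understand how "
    "you use the service and make improvements.\n\n"
    "Accept analytics cookies Reject analytics cookies How we use cookies\n\n"
    "You can change your cookie settings\n at any time.\n\n"
    "Hide cookie message\n\n"
)
print(repr(strip_cookie_blocks(sample)))
```

A filter this blunt would also remove legitimate paragraphs that happen to mention cookies, so it is probably only safe on short blocks near the top or bottom of a page.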
@nickscamara
Member

Huge! @tractorjuice can you send us an example of a URL where this shows up?

@nickscamara added the enhancement and question labels on Apr 27, 2024
@tractorjuice
Author

tractorjuice commented Apr 28, 2024

@nickscamara
Member

@tractorjuice Thanks! That's very helpful.

@nickscamara added the bug label on May 6, 2024
@fhederdos

Actually, cookie banners prevent the crawler from accessing and crawling certain websites at all. I have seen this on multiple sites, including public institutions and news websites.
For example:
Public institution: https://www.salzburg.gv.at/
Newspaper: https://www.derstandard.at/
When the crawler accesses these sites, it hits a cookie consent banner that blocks further actions. As a result, it cannot navigate past the initial page and gathers no content from the website.
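
A crawler could at least notice when it has been stopped this way. The sketch below is only a heuristic: `looks_like_consent_wall`, the keyword list (English plus a few German stems, since the sites above are Austrian), and the 0.5 threshold are all assumptions, not part of the project.

```python
import re

# Hypothetical keyword stems; extend per language as needed.
CONSENT_WORDS = re.compile(r"cookie|consent|zustimm|einwillig", re.IGNORECASE)


def looks_like_consent_wall(text: str, threshold: float = 0.5) -> bool:
    """Return True when most non-empty lines mention consent keywords."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if CONSENT_WORDS.search(ln))
    return hits / len(lines) >= threshold


banner_only = "Cookies\nWe use cookies to improve this service.\nAccept all cookies"
article = "Salzburg news\nThe city council met today.\nWeather: sunny.\nCookie settings"
print(looks_like_consent_wall(banner_only), looks_like_consent_wall(article))
```

A page flagged this way could then be retried with a different strategy, such as dismissing the banner in a headless browser before extracting content.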
