Ignore pages that have a 404 status code #82
Comments
We could keep a single 404 error page per website.
As long as you're aware that this is an opinionated choice :) Some sites have custom 404s per section of the site, some keep the original URL like in my screenshot, some redirect to a dedicated 404 URL, and some show a 404 page with a 200 response. Web crawling is messy! Perhaps this could be a configuration option, but that's up to you :)
A good solution could be to hash every 404 or 200 page. That way, if the page is specific to its URL it is saved; if not, we could make a symbolic link to the generic one.
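To make the hashing idea concrete, here is a minimal sketch in Python. It is not suckit's code: `save_deduplicated` and the `seen` map are hypothetical names, SHA-256 is just one reasonable choice of hash, and it assumes a filesystem where creating symlinks is allowed.

```python
import hashlib
import os


def save_deduplicated(root, rel_path, body, seen):
    """Write body under root/rel_path, or symlink to an identical page.

    `seen` maps a SHA-256 digest to the path of the first file saved with
    that content. Returns "saved" or "linked".
    """
    digest = hashlib.sha256(body).hexdigest()
    target = os.path.join(root, rel_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)

    first = seen.get(digest)
    if first is None:
        # First time this exact content is seen: store a real file.
        with open(target, "wb") as f:
            f.write(body)
        seen[digest] = target
        return "saved"

    # Same bytes as an earlier page (e.g. the generic 404): link to it
    # with a relative path so the mirror stays relocatable.
    os.symlink(os.path.relpath(first, os.path.dirname(target)), target)
    return "linked"
```

Only byte-identical pages collapse into one file here. A 404 page that embeds the requested URL or a timestamp would hash differently and still be saved separately.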
Yeah, I think it's tricky. If it's legitimately just a bad link to a page that never existed, or an href that was relative when it shouldn't have been, you might hit an infinite loop (I've seen this in practice).
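One possible guard against that kind of loop, sketched in Python under my own assumptions (the `next_links` helper and the `max_segments` cap are hypothetical and not something suckit provides): don't follow links found on a 404 page, remember visited URLs, and refuse URLs whose path has grown suspiciously deep.

```python
from urllib.parse import urljoin, urlsplit


def next_links(page_url, status, hrefs, visited, max_segments=8):
    """Return the links worth queueing from one fetched page."""
    if status == 404:
        # A relative href on an error page tends to resolve to yet
        # another missing URL, so stop here.
        return []

    queued = []
    for href in hrefs:
        url = urljoin(page_url, href)
        segments = [s for s in urlsplit(url).path.split("/") if s]
        # Skip URLs already seen, and paths nested deeper than the cap
        # (this also catches "soft 404" pages that answer with a 200).
        if url in visited or len(segments) > max_segments:
            continue
        visited.add(url)
        queued.append(url)
    return queued
```

The depth cap is a blunt heuristic: a legitimately deep site would need a larger `max_segments`.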
Hmm, OK. We have more serious issues and very little time at the moment; we will give this a try later.
Yeah, no rush :)
Currently suckit saves pages even when the web server reports them as not found. I think this is erroneous behaviour.
E.g. this page on my site, which returns a 404, was saved to disk.
Chrome dev tools:
File explorer:
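For illustration, the behaviour I'd expect could look roughly like the Python sketch below. This is not suckit's code or any of its options: `should_save`, `pages_to_write`, and the `skipped` set are made-up names, and the configurable set is only one way to keep the choice optional.

```python
from http import HTTPStatus

# Assumed default: only 404 responses are skipped.
DEFAULT_SKIPPED = frozenset({HTTPStatus.NOT_FOUND})


def should_save(status, skipped=DEFAULT_SKIPPED):
    """Return True if a response with this status code should be written."""
    return status not in skipped


def pages_to_write(responses, skipped=DEFAULT_SKIPPED):
    """Filter (url, status, body) tuples down to the ones worth saving."""
    return [(url, body) for url, status, body in responses
            if should_save(status, skipped)]
```

Passing `skipped=frozenset()` would restore the current save-everything behaviour.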