Ignore pages that have a 404 status code #82

tbillington · 2020-05-10T10:02:59Z

Currently suckit will save pages even if they are indicated as not found by the webserver. I think this is erroneous behaviour.

E.g. this page on my site, which returns a 404, was saved to disk.

Chrome dev tools: (screenshot)

File explorer: (screenshot)

Skallwar · 2020-05-10T14:32:47Z

We could keep a single 404 error page per website.

tbillington · 2020-05-12T01:51:08Z

As long as you're aware that is an opinionated choice :) Some sites have custom 404s per section of the site, some keep the original URL like in my screenshot, some redirect to a dedicated 404 URL, and some show a 404 page with a 200 response. Web crawling is messy!

Perhaps this could be a configuration thing, but that's up to you :)
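To make the configuration idea concrete, here is a minimal sketch of a switch that decides whether a response body is written to disk. `NotFoundPolicy` and `should_save` are hypothetical names invented for illustration and are not part of suckit. The sketch only looks at the status code, so it would not catch a 404 page served with a 200 response.

```rust
/// Hypothetical policy for pages answered with a 404 status (not a suckit type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NotFoundPolicy {
    /// Save the body anyway, which is the behaviour reported in this issue.
    Save,
    /// Do not write the page to disk.
    Skip,
}

/// Decide whether a response with the given HTTP status should be saved.
/// Only a 404 combined with `Skip` is rejected; every other case is saved.
fn should_save(status: u16, policy: NotFoundPolicy) -> bool {
    !(status == 404 && policy == NotFoundPolicy::Skip)
}
```

With `NotFoundPolicy::Skip`, `should_save(404, ...)` returns `false` while `should_save(200, ...)` still returns `true`. Keeping `Save` as the default would preserve the current behaviour for users who rely on it.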

Skallwar · 2021-01-04T19:02:41Z

A good solution could be to hash the body of each 404 or 200 page. That way, if the page is specific to its URL it is saved; if not, we could make a symbolic link to the generic one.
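The hashing idea could look roughly like the sketch below. `ContentIndex`, `Action`, and `classify` are hypothetical names, not suckit code. The sketch assumes the standard library's `DefaultHasher`, which is a 64-bit non-cryptographic hash, so a real implementation would probably want a stronger digest or a byte-for-byte comparison on a match.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// What to do with a downloaded page body.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// First time this content is seen: write it to this path.
    Save(String),
    /// Identical content was already saved at `target`: link to it instead.
    LinkTo { target: String },
}

/// Hypothetical index from content hash to the first path saved with that content.
#[derive(Default)]
struct ContentIndex {
    seen: HashMap<u64, String>,
}

impl ContentIndex {
    /// Hash `body` and report whether `path` should be saved or linked.
    fn classify(&mut self, path: &str, body: &[u8]) -> Action {
        let mut hasher = DefaultHasher::new();
        body.hash(&mut hasher);
        let digest = hasher.finish();

        if let Some(first) = self.seen.get(&digest) {
            return Action::LinkTo { target: first.clone() };
        }
        self.seen.insert(digest, path.to_string());
        Action::Save(path.to_string())
    }
}
```

A generic error page served under many URLs would be saved once and linked from every later path, while a page whose body differs by even one byte (for example, one that echoes the requested URL) would still be saved separately.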

tbillington · 2021-01-04T22:49:59Z

Yeah, I think it's tricky. If it's legitimately just a bad link to a page that never existed, or an href that was relative when it shouldn't have been, you might hit an infinite loop (I've seen this in practice).
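One way to picture the loop: a 404 page at `/a/` contains a relative link `b/`, which resolves to `/a/b/`, which is again a 404 with the same link, and so on. Every URL is new, so a visited set alone does not stop it. A minimal sketch of a guard, assuming the crawler tracks the depth and the status of the page a link was found on (`Frontier` and `should_enqueue` are hypothetical names, not suckit code):

```rust
use std::collections::HashSet;

/// Hypothetical crawl frontier guard (not a suckit type).
struct Frontier {
    visited: HashSet<String>,
    max_depth: usize,
}

impl Frontier {
    fn new(max_depth: usize) -> Self {
        Frontier { visited: HashSet::new(), max_depth }
    }

    /// Returns true if `url`, found at `depth` on a page that answered with
    /// `parent_status`, should be queued for download.
    fn should_enqueue(&mut self, url: &str, depth: usize, parent_status: u16) -> bool {
        // Links found on a 404 page are not followed, which cuts the chain early.
        if parent_status == 404 {
            return false;
        }
        // A depth cap bounds the crawl even when the error page answers 200.
        if depth > self.max_depth {
            return false;
        }
        // `insert` returns false when the URL was already present.
        self.visited.insert(url.to_string())
    }
}
```

Refusing to follow links from 404 pages handles the common case, and the depth cap is a fallback for sites that serve their error page with a 200 response.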

Skallwar · 2021-01-05T08:53:07Z

Hmm, OK. We have more serious issues and very little time currently; we will give this a try later.

tbillington · 2021-01-05T22:47:18Z

Yea no rush :)

Skallwar added the enhancement and good first issue labels on May 10, 2020.

Skallwar added the help wanted label and removed the good first issue label on May 28, 2020.
