Save screenshot/HTML on first occurrence of error in error statistics #2280

metalwarrior665 · 2024-01-11T18:36:41Z

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

There is already a robust system for organizing error statistics. For some scrapers, I use an "ErrorSnapshotter" approach where, on the first occurrence of each type of error, a screenshot and/or HTML is stored in the key-value (KV) store for further analysis. We should also store a link (either an Apify URL or a local path) to the snapshot KV records next to the error statistics count.

Here is an example of how the stats file looks now
https://api.apify.com/v2/key-value-stores/L7DclFFX3fHuPCne9/records/SDK_CRAWLER_STATISTICS_0
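
To make the "link next to the count" idea concrete, here is a minimal sketch. The `ErrorStat` shape and the `FirstOccurrenceTracker` class are hypothetical names invented for illustration; they are not the actual structure used by Statistics or ErrorTracker.

```ts
// Hypothetical shape for one error entry: a count plus optional links
// to the snapshot taken on the first occurrence (not the real stats format).
interface ErrorStat {
    count: number;
    firstSnapshot?: { screenshotUrl?: string; htmlUrl?: string };
}

class FirstOccurrenceTracker {
    private readonly stats = new Map<string, ErrorStat>();

    /** Counts the error and returns true only the first time this key is seen. */
    record(errorKey: string): boolean {
        const existing = this.stats.get(errorKey);
        if (existing) {
            existing.count += 1;
            return false;
        }
        this.stats.set(errorKey, { count: 1 });
        return true;
    }

    /** Stores the snapshot links (Apify URL or local path) next to the count. */
    attachSnapshot(errorKey: string, links: { screenshotUrl?: string; htmlUrl?: string }): void {
        const stat = this.stats.get(errorKey);
        if (stat) stat.firstSnapshot = links;
    }

    /** Plain-object view, suitable for persisting alongside other statistics. */
    toJSON(): Record<string, ErrorStat> {
        const out: Record<string, ErrorStat> = {};
        this.stats.forEach((stat, key) => {
            out[key] = stat;
        });
        return out;
    }
}
```

A caller would take a snapshot only when `record()` returns `true`, then call `attachSnapshot()` with whatever links the storage layer returns.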

Motivation

Useful for faster default debugging, especially for "one of thousands" type of errors or when other users are running the scraper.

Ideal solution or implementation, and any additional constraints

Implementation ideas
The current implementation is hidden under several function calls, so it is a bit tricky to add completely new functionality.
The main classes are Statistics and ErrorTracker.

The dirty solution would be to pass the crawling context through the function calls and then dynamically figure out whether it is Puppeteer, Playwright, or an HTML body, and use the appropriate snapshotting method from the context.
The more proper way would probably be to use generics all the way down, but I haven't explored that option.
Or do a larger refactor.
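
A rough sketch of the "dirty" dynamic approach, for discussion only. `PageLike`, `SnapshotContext`, and `captureSnapshot` are hypothetical names; the only assumption about the browser libraries is that both Puppeteer and Playwright pages expose `content()` and `screenshot()`. A binary body would need decoding first, which is left out here.

```ts
// Structural stand-ins for the real crawling contexts (assumptions, not Crawlee types).
interface PageLike {
    content(): Promise<string>;
    screenshot(): Promise<Uint8Array>;
}

interface SnapshotContext {
    page?: PageLike; // present for browser crawlers (Puppeteer, Playwright)
    body?: string; // present for HTTP-based crawlers once a response was received
}

interface Snapshot {
    html?: string;
    screenshot?: Uint8Array;
}

async function captureSnapshot(context: SnapshotContext): Promise<Snapshot> {
    if (context.page) {
        try {
            const [html, screenshot] = await Promise.all([
                context.page.content(),
                context.page.screenshot(),
            ]);
            return { html, screenshot };
        } catch {
            // The page may already be closed; fall through to the body, if any.
        }
    }
    if (context.body !== undefined) {
        return { html: context.body };
    }
    // Error happened before any page or response existed: nothing to capture.
    return {};
}
```

Returning an empty object rather than throwing keeps snapshotting from masking the original error.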

Keep in mind

Some errors happen before any page is created or opened, before navigation happens, or after the page is already closed (maybe then a response object is still available to store HTML?)
We need to generate unique filenames. I like it when filenames carry some information, so one idea is a hash of the full error object path from ErrorTracker plus the first 30 (or 50?) characters of the error for easy reading.
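
One possible way to build such a filename, as a sketch. The function name `snapshotKey`, the `ERROR_SNAPSHOT_` prefix, the choice of SHA-1 truncated to 8 hex characters, and the conservative character set are all assumptions made for illustration.

```ts
import { createHash } from 'node:crypto';

/**
 * Builds a key from a short hash of the error path plus a readable
 * prefix of the error message. All naming choices here are hypothetical.
 */
function snapshotKey(errorPath: string[], errorMessage: string, maxMessageLength = 30): string {
    // The hash keeps keys unique per error path; 8 hex chars keeps them short.
    const hash = createHash('sha1').update(errorPath.join('|')).digest('hex').slice(0, 8);

    // Keep only the first N characters and replace anything outside a
    // conservative character set so the key is safe as a filename.
    const readable = errorMessage
        .slice(0, maxMessageLength)
        .replace(/[^a-zA-Z0-9]+/g, '-')
        .replace(/^-+|-+$/g, '');

    return `ERROR_SNAPSHOT_${hash}_${readable}`;
}
```

The same path and message always produce the same key, so repeated occurrences of an error map to one record rather than creating new ones.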

Alternative solutions or implementations

No response

Other context

No response

B4nan · 2024-01-12T12:57:13Z

Sounds like a duplicate of #1771, maybe we should close that one?

metalwarrior665 · 2024-01-12T14:55:03Z

Haha, so I wasted some time :) Closed the old one.

This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling.

This functionality is opt-in, and can be enabled via the crawler options:

```ts
const crawler = new BasicCrawler({
    // ...
    statisticsOptions: {
        saveErrorSnapshots: true,
    },
});
```

Closes #2280

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>

metalwarrior665 added the feature Issues that represent new features or improvements to existing features. label Jan 11, 2024

B4nan added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 15, 2024

HamzaAlwan self-assigned this Feb 6, 2024

B4nan mentioned this issue Feb 12, 2024

feat: implement ErrorSnapshotter for error context capture #2332

Merged

B4nan closed this as completed in #2332 May 16, 2024
