Labels: feature, t-tooling
Which package is the feature request for? If unsure which one to select, leave blank
None
Feature
There is already a robust system for organizing error statistics. For some scrapers, I use an "ErrorSnapshotter" approach where, on the first occurrence of each type of error, a screenshot and/or HTML is stored in the key-value (KV) store for further analysis. We should also store a link (either an Apify URL or a local path) to the snapshot KV records next to the error statistics count.
Here is an example of how the stats file looks now:
https://api.apify.com/v2/key-value-stores/L7DclFFX3fHuPCne9/records/SDK_CRAWLER_STATISTICS_0
Motivation
Useful for faster default debugging, especially for "one of thousands" type errors or when other users are running the scraper.
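For concreteness, a hedged sketch of what an entry in that stats file could look like once it carries the snapshot links (these field names are hypothetical, not the current crawlee format):
```ts
// Hypothetical shape: the existing per-error count extended with links to
// the first-occurrence snapshots stored in the key-value store.
interface ErrorStatEntry {
    count: number;
    firstErrorScreenshotUrl?: string; // Apify KV-store URL or local path
    firstErrorHtmlUrl?: string;
}
```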
Ideal solution or implementation, and any additional constraints
Implementation ideas
The current implementation is hidden under several function calls, so it is a bit tricky to add completely new functionality.
The main classes are Statistics and ErrorTracker.
The dirty solution would be to pass the crawling context down through the function calls and then dynamically figure out whether it is Puppeteer, Playwright, or a plain HTML body, and use the appropriate snapshotting method from the context (see the sketch after this list).
The more proper way would probably be to use generics all the way down, but I haven't explored that option.
Or do a larger refactor.
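A minimal sketch of that "dirty" runtime detection, assuming a generic crawling context is threaded down to the error handler. The context shape and the function below are hypothetical illustrations, not the actual crawlee API:
```ts
import { writeFile } from 'node:fs/promises';

interface SnapshotContext {
    // Puppeteer and Playwright pages both expose screenshot() and content()
    // with structurally compatible signatures, so one check covers both.
    page?: {
        screenshot(options: { path: string }): Promise<unknown>;
        content(): Promise<string>;
    };
    // HTTP-only crawlers (e.g. CheerioCrawler) only have the raw body.
    body?: string | Buffer;
}

async function saveErrorSnapshot(context: SnapshotContext, baseName: string): Promise<void> {
    if (context.page) {
        // Browser context: capture both a screenshot and the rendered HTML.
        await context.page.screenshot({ path: `${baseName}.png` });
        await writeFile(`${baseName}.html`, await context.page.content());
    } else if (context.body !== undefined) {
        // No page available: store the raw HTML body only.
        await writeFile(`${baseName}.html`, context.body);
    }
}
```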
Keep in mind
Some errors happen before any page is created or opened, before navigation happens, or after the page is already closed (maybe a response object is still available then, so its HTML could be stored?).
We need to generate unique filenames. I like it when filenames carry some information, so one idea is a hash of the full error object path from ErrorTracker plus the first 30 (or 50?) characters of the error message for easy reading.
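A hedged sketch of that naming scheme: a hash of the error's path in the ErrorTracker tree combined with a readable prefix of the message. The function name, prefix, and the 30-character cutoff are illustrative only:
```ts
import { createHash } from 'node:crypto';

function errorSnapshotFilename(errorPath: string[], message: string): string {
    // Short hash of the error's position in the ErrorTracker tree keeps names unique.
    const hash = createHash('sha1').update(errorPath.join('/')).digest('hex').slice(0, 8);
    // Keep only filesystem- and KV-store-safe characters in the readable part.
    const readable = message.slice(0, 30).replace(/[^a-zA-Z0-9]+/g, '-');
    return `ERROR_SNAPSHOT_${hash}_${readable}`;
}

// Usage: errorSnapshotFilename(['TimeoutError'], 'Navigation timeout of 30000 ms exceeded')
// yields something like 'ERROR_SNAPSHOT_<8-char-hash>_Navigation-timeout-of-30000-m'
```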
Alternative solutions or implementations
No response
Other context
No response
This commit introduces the ErrorSnapshotter class to the crawlee
package, providing functionality to capture screenshots and HTML
snapshots when an error occurs during web crawling.
This functionality is opt-in, and can be enabled via the crawler
options:
```ts
const crawler = new BasicCrawler({
// ...
statisticsOptions: {
saveErrorSnapshots: true,
},
});
```
Closes #2280
---------
Co-authored-by: Martin Adámek <banan23@gmail.com>