Save screenshot/HTML on first occurrence of error in error statistics #2280

Closed
metalwarrior665 opened this issue Jan 11, 2024 · 2 comments · Fixed by #2332
Assignees
Labels
feature Issues that represent new features or improvements to existing features. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@metalwarrior665
Member

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

There is already a robust system for organizing error statistics. For some scrapers, I use an "ErrorSnapshotter" approach: on the first occurrence of each type of error, a screenshot and/or the HTML is stored in the key-value store for further analysis. We should also store a link (either an Apify URL or a local path) to the snapshot KV records next to the error statistics count.

Here is an example of how the stats file currently looks:
https://api.apify.com/v2/key-value-stores/L7DclFFX3fHuPCne9/records/SDK_CRAWLER_STATISTICS_0

Motivation

This would make debugging faster by default, especially for "one in thousands" errors or when other users are running the scraper.

Ideal solution or implementation, and any additional constraints

Implementation ideas
The current implementation is hidden under several function calls, so it is a bit tricky to add completely new functionality.
The main classes are Statistics and ErrorTracker.

  1. The dirty solution would be to pass the crawling context through the function calls, dynamically determine whether it is Puppeteer, Playwright, or an HTML body, and use the appropriate snapshotting method from the context.
  2. A more proper way would probably be to use generics all the way down, but I haven't explored that option.
  3. Or do a larger refactor.
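
To make the first idea more concrete, here is a minimal sketch of the dynamic detection step. All names in it (`PageLike`, `SnapshotContext`, `detectSnapshotSource`, `captureSnapshot`) are hypothetical and not part of Crawlee; it only assumes that browser contexts expose a `page` with `content()` and `screenshot()` methods, as both Puppeteer and Playwright pages do, and that HTTP contexts expose a `body`.

```ts
// Hypothetical structural types; these are NOT Crawlee's real context types.
interface PageLike {
  content(): Promise<string>;
  screenshot(): Promise<Uint8Array>;
}

interface SnapshotContext {
  page?: PageLike;            // assumed present for Puppeteer/Playwright crawlers
  body?: string | Uint8Array; // assumed present for plain HTTP crawlers
}

interface Snapshot {
  html?: string;
  screenshot?: Uint8Array;
}

type SnapshotSource = 'page' | 'body' | 'none';

// Decide which snapshotting method the context can support.
function detectSnapshotSource(ctx: SnapshotContext): SnapshotSource {
  if (ctx.page !== undefined) return 'page';
  if (ctx.body !== undefined) return 'body';
  return 'none';
}

// Capture whatever is available; returns undefined when nothing can be captured.
async function captureSnapshot(ctx: SnapshotContext): Promise<Snapshot | undefined> {
  if (detectSnapshotSource(ctx) === 'page') {
    try {
      const page = ctx.page as PageLike;
      const [html, screenshot] = await Promise.all([page.content(), page.screenshot()]);
      return { html, screenshot };
    } catch {
      // The page may already be closed; fall back to the body if there is one.
    }
  }
  if (ctx.body !== undefined) {
    const html = typeof ctx.body === 'string' ? ctx.body : new TextDecoder().decode(ctx.body);
    return { html };
  }
  return undefined;
}
```

Structural typing keeps the sketch free of browser-library imports, at the cost of the type safety that the generics approach would give.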

Keep in mind

  1. Some errors happen before any page is created or opened, before navigation happens, or after the page is already closed (maybe then a response object is still available to store HTML?)
  2. We need to generate unique filenames. I like it when filenames carry some information, so one idea is a hash of the full error object path from ErrorTracker plus the first 30 (or 50?) characters of the error message for easy reading.
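
The filename idea could look roughly like the sketch below. The helper `snapshotKey`, the 8-character SHA-1 prefix, the `|` path separator, and the conservative character set are all assumptions made for illustration; the error path is modeled as a plain array of strings rather than ErrorTracker's actual structure.

```ts
import { createHash } from 'node:crypto';

// Hypothetical helper, not part of Crawlee.
// errorPath: the nested keys under which the error is grouped (assumed shape).
// message:   the error message; only its first maxChars characters are kept.
function snapshotKey(errorPath: string[], message: string, maxChars = 30): string {
  // A short hash of the full path keeps keys unique per error group.
  const hash = createHash('sha1')
    .update(errorPath.join('|'))
    .digest('hex')
    .slice(0, 8);

  // Keep a readable prefix of the message, restricted to a conservative
  // character set so it is safe as a key or a local filename.
  const readable = message
    .slice(0, maxChars)
    .replace(/[^a-zA-Z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');

  return `${hash}_${readable || 'error'}`;
}

const key = snapshotKey(
  ['TimeoutError', 'Navigation timed out'],
  'Navigation timed out after 30000 ms',
);
console.log(key);
```

The same path always produces the same key, so checking whether the key already exists is enough to tell whether a snapshot for that error group has been saved.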

Alternative solutions or implementations

No response

Other context

No response

@metalwarrior665 metalwarrior665 added the feature Issues that represent new features or improvements to existing features. label Jan 11, 2024
@B4nan
Member

B4nan commented Jan 12, 2024

Sounds like a duplicate of #1771, maybe we should close that one?

@metalwarrior665
Member Author

Haha, so I wasted some time :) Closed the old one.

@B4nan B4nan added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 15, 2024
@HamzaAlwan HamzaAlwan self-assigned this Feb 6, 2024
B4nan added a commit that referenced this issue May 16, 2024
This commit introduces the ErrorSnapshotter class to the crawlee
package, providing functionality to capture screenshots and HTML
snapshots when an error occurs during web crawling.

This functionality is opt-in, and can be enabled via the crawler
options:

```ts
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
  // ...
  statisticsOptions: {
    saveErrorSnapshots: true,
  },
});
```

Closes #2280

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>