Only non-relevant page components returned #20

Open
RamXX opened this issue Apr 16, 2024 · 4 comments

Comments

@RamXX

RamXX commented Apr 16, 2024

Fantastic project. Thank you!

Here is a page that (one would think) is straightforward to parse: https://access.redhat.com/security/cve/CVE-2023-45853 . However, none of the relevant information in the page makes it to the parsed version, only the corporate links and "scaffolding".

I figured I'd report it in case this can highlight some areas of improvement. Thanks again!

@hanxiao
Member

hanxiao commented Apr 16, 2024

Thanks for reporting, will dig in.

@hanxiao
Member

hanxiao commented Apr 16, 2024

image

Found the problem: this site doesn't even show its content in Chrome's view-source (view-source:https://access.redhat.com/security/cve/CVE-2023-45853), because it requires JavaScript to be running.

So using stream mode solves the problem:

curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853

Pay attention to the last chunk in the event stream; it should give you:

[image: screenshot of the expected output]
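For anyone who wants to script this rather than read the stream by eye, here is a minimal Python sketch of the same idea: send the two headers from the curl command above and keep only the last chunk of the event stream. The helper names `last_sse_data` and `fetch_last_chunk` are made up for illustration and are not part of the project. The parsing assumes the standard server-sent events layout (`data:` lines, events separated by a blank line); the exact payload format of each chunk is not specified in this thread, so the sketch returns it as a raw string.

```python
import urllib.request
from typing import Iterable, Optional

# URL taken from the curl command in this thread.
READER_URL = "https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853"


def last_sse_data(lines: Iterable[str]) -> Optional[str]:
    """Return the data payload of the last event in an SSE stream (hypothetical helper)."""
    last = None
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                last = "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the payload.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comment lines are ignored here.
    if buffer:
        # Tolerate a stream that ends without a trailing blank line.
        last = "\n".join(buffer)
    return last


def fetch_last_chunk(url: str = READER_URL) -> Optional[str]:
    """Request the page in stream mode and return the last chunk (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        headers={"Accept": "text/event-stream", "x-no-cache": "true"},
    )
    with urllib.request.urlopen(request) as response:
        # The response is iterated line by line as bytes; decode before parsing.
        return last_sse_data(line.decode("utf-8") for line in response)


if __name__ == "__main__":
    print(fetch_last_chunk())
```

Earlier chunks are simply overwritten as the stream progresses, which matches the advice above that the last chunk is the one to look at.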

@Joelokon

Thanks 🙏

@RamXX
Author

RamXX commented Apr 21, 2024

Thanks a lot! Whenever I can't parse a site, I'll make a note to try this mechanism. I wonder if we should keep this open to make sure it gets into the documentation; otherwise we can just close it. Thanks again!
