URL loses information in the conversion #41

Open
tomshaffner opened this issue Apr 26, 2024 · 2 comments

Comments

@tomshaffner
Converting the URL https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317 with Jina reader loses several headings (Date and Time, Location, Refund Policy, etc.). The rendered result is:

Title: Brian McLaren | “Wisdom and Courage for a World Falling Apart"

URL Source: https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317

Markdown Content:
Brian McLaren, author, speaker, activist, and public theologian notes that the challenge of living well and maintaining resilience in turbulent times requires new ways of thinking, becoming, and belonging. Facing nations, ecosystems, economies, religions, and other institutions in disarray, we are called to a spiritual transformation in our own lives that will express itself in transformation in our world.

Childcare for ages 2-12 will be provided by reservation through April 15.
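
One way to make this kind of loss checkable is to compare the reader output against the headings that are visible on the original page. The sketch below is a minimal Python illustration and not part of Jina reader: `missing_headings` is a hypothetical helper, and `reader_output` is only a shortened excerpt of the result quoted above.

```python
def missing_headings(markdown: str, expected: list[str]) -> list[str]:
    """Return the expected headings that never appear in the reader output.

    Hypothetical helper for illustration; the match is a simple
    case-insensitive substring test, so it can miss reworded headings.
    """
    haystack = markdown.casefold()
    return [h for h in expected if h.casefold() not in haystack]


# Shortened excerpt of the rendered result quoted in this issue.
reader_output = (
    "Markdown Content:\n"
    "Childcare for ages 2-12 will be provided by reservation through April 15.\n"
)

# Headings reported as lost in this issue.
expected = ["Date and Time", "Location", "Refund Policy"]
print(missing_headings(reader_output, expected))
```

A substring test is deliberately crude; it is enough to turn "some headings are gone" into a repeatable check for a given URL.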

@Manamama
Manamama commented May 4, 2024

Ditto, testing with a random page that sits behind a JavaScript adblock wall:

curl_jina https://wyborcza.pl/7,75399,30940180,waza-sie-losy-donbasu-kramatorsk-i-slowiansk-czekaja-na-uderzenie.html#S.MT-K.C-B.1-

->

Title: Wyborcza.pl

URL Source: https://wyborcza.pl/7,75399,30940180,waza-sie-losy-donbasu-kramatorsk-i-slowiansk-czekaja-na-uderzenie.html

Markdown Content:
Wyborcza.pl body { font-family: Arial, sans-serif; font-size: 13px; } h1 { font-size: 16px; } a { color: #146cb4; text-decoration: none; } a:hover, a:focus { color: #b00126; } body .msg-container { position: absolute; top: 0px; bottom: 0px; left: 0px; right: 0px; } body #message { margin: 10% auto; width: 60%; background: #ededed; padding-bottom: 16px; text-align: center; } body #message img { margin-left: 16px; } body #message h1 { margin: 10px 16px 16px 16px; } body #message p { margin: 0 16px 10px 16px; } #message, #info-adblock, #info-ups { display: none; } Image 1: a red and black logo on a black background

Wyłącz AdBlocka/uBlocka
=======================

etc.

So, a quick suggestion without having read the code: render the page with Puppeteer, Selenium, or similar, which is probably how archive.vn and the like do it.
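
Independently of how the page gets rendered, output like the above could at least be flagged automatically. Below is a rough Python sketch, assuming that several CSS-like rule bodies inside the Markdown content signal a stylesheet leaking through; `looks_like_css_leak`, the regular expression, and the threshold of 3 are illustrative choices, not anything taken from the reader's code.

```python
import re

# Matches a CSS-like rule body such as "{ font-size: 16px; }".
CSS_RULE = re.compile(r"\{[^{}]*:[^{}]*;\s*\}")


def looks_like_css_leak(markdown: str, min_rules: int = 3) -> bool:
    """Heuristic: True when at least `min_rules` CSS-like rule bodies appear."""
    return len(CSS_RULE.findall(markdown)) >= min_rules


# Shortened excerpt of the Wyborcza.pl output quoted above.
leaked = (
    "Wyborcza.pl body { font-family: Arial, sans-serif; font-size: 13px; } "
    "h1 { font-size: 16px; } a { color: #146cb4; text-decoration: none; }"
)
clean = "Childcare for ages 2-12 will be provided by reservation through April 15."

print(looks_like_css_leak(leaked), looks_like_css_leak(clean))
```

The threshold keeps a single brace-delimited snippet in a legitimate article from triggering the check; pages that genuinely discuss CSS would still need a different signal.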

@Manamama
Manamama commented May 5, 2024

Ditto on:

curl -H "Accept: text/event-stream"  https://r.jina.ai/https://www.wsj.com/us-news/education/student-campus-protests-veteran-activist-groups-17ccd094?mod=us-news_lead_story  

Title: wsj.com

URL Source: https://www.wsj.com/us-news/education/student-campus-protests-veteran-activist-groups-17ccd094?mod=us-news_lead_story

Markdown Content:
wsj.com#cmsg{animation: A 1.5s;}@Keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}var dd={'rt':'c','cid':'AHrlqAAAAAMAztT8fNUhJjMA_0SGaw==','hsh':'D428D51E28968797BC27FB9153435D','t':'fe','s':47129,'e':'c1fd16f5873ded61d7412c9885980cebcfe2371a0ffa409288dd419843c15c12','host':'geo.captcha-delivery.com'}

.../Audio/Recordings $

Adding e.g. -H "Accept: text/event-stream" does not change the results, probably because of the server-side captcha trap visible in the output: 'host':'geo.captcha-delivery.com'.

Yet even a lynx dump handles it:

please contact Dow Jones Reprints at 1-800-843-0008 or visit www.djreprints.com.
https://www.wsj.com/us-news/education/student-campus-protests-veteran-activist-groups-17ccd094
Activist Groups Trained Students for Months Before Campus Protests
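
The captcha case is easy to spot mechanically, because the interstitial names its own host. Here is a minimal Python sketch, assuming the only known marker is the host string seen in the output above; `BLOCK_MARKERS` and `find_block_markers` are hypothetical names, and other anti-bot services would need their own entries.

```python
# Marker taken from the output quoted above; this list is an assumption
# and is certainly incomplete for other anti-bot services.
BLOCK_MARKERS = ("geo.captcha-delivery.com",)


def find_block_markers(reader_output: str, markers=BLOCK_MARKERS) -> list[str]:
    """Return every known block-page marker that occurs in the output."""
    return [m for m in markers if m in reader_output]


# Shortened version of the wsj.com output quoted above.
blocked = "wsj.com var dd={'rt':'c','host':'geo.captcha-delivery.com'}"
article = "Activist Groups Trained Students for Months Before Campus Protests"

print(find_block_markers(blocked))
print(find_block_markers(article))
```

A non-empty result means the converter received a challenge page rather than the article, so the content should be reported as blocked instead of being returned as Markdown.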
