URL loses information in the conversion #41
Ditto, testing with a random JavaScript page that detects ad blockers:
Title: Wyborcza.pl
URL Source: https://wyborcza.pl/7,75399,30940180,waza-sie-losy-donbasu-kramatorsk-i-slowiansk-czekaja-na-uderzenie.html
Markdown Content:
Wyborcza.pl body { font-family: Arial, sans-serif; font-size: 13px; } h1 { font-size: 16px; } a { color: #146cb4; text-decoration: none; } a:hover, a:focus { color: #b00126; } body .msg-container { position: absolute; top: 0px; bottom: 0px; left: 0px; right: 0px; } body #message { margin: 10% auto; width: 60%; background: #ededed; padding-bottom: 16px; text-align: center; } body #message img { margin-left: 16px; } body #message h1 { margin: 10px 16px 16px 16px; } body #message p { margin: 0 16px 10px 16px; } #message, #info-adblock, #info-ups { display: none; } Wyłącz AdBlocka/uBlocka ======================= etc.
(The heading is Polish for "Turn off AdBlock/uBlock".)
So, a quick piece of advice without having read the code: render pages with a browser automation tool (Puppeteer, Selenium, ...), the way archive.vn and similar services probably do it.
Ditto on:
Title: wsj.com
URL Source: https://www.wsj.com/us-news/education/student-campus-protests-veteran-activist-groups-17ccd094?mod=us-news_lead_story
Markdown Content:
wsj.com#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}var dd={'rt':'c','cid':'AHrlqAAAAAMAztT8fNUhJjMA_0SGaw==','hsh':'D428D51E28968797BC27FB9153435D','t':'fe','s':47129,'e':'c1fd16f5873ded61d7412c9885980cebcfe2371a0ffa409288dd419843c15c12','host':'geo.captcha-delivery.com'}
.../Audio/Recordings $
Adding e.g. -H "Accept: text/event-stream" does not change the results, probably because of the server-side trap visible in the output: 'host':'geo.captcha-delivery.com'. Yet even a lynx dump handles it.
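Both outputs above are block pages rather than articles, so a caller could at least flag them before using the result. The sketch below is only a heuristic under stated assumptions: the marker strings are copied from the two outputs quoted in this thread, and `looks_like_block_page`, `BLOCK_MARKERS`, and the `min_css_rules` threshold are hypothetical names and values, not part of Jina Reader.

```python
import re

# Assumed markers, taken from the two outputs quoted above; not an exhaustive list.
BLOCK_MARKERS = ("geo.captcha-delivery.com", "Wyłącz AdBlocka")

# A CSS rule looks like "selector { prop: value; }"; Markdown prose rarely does.
CSS_RULE = re.compile(r"[^{}]+\{\s*[a-zA-Z-]+\s*:[^{}]*\}")


def looks_like_block_page(markdown: str, min_css_rules: int = 3) -> bool:
    """Heuristically flag converted output that is an interstitial, not an article."""
    # A known captcha or ad-block marker is treated as conclusive.
    if any(marker in markdown for marker in BLOCK_MARKERS):
        return True
    # Otherwise, several leaked CSS rules suggest the page body was never rendered.
    return len(CSS_RULE.findall(markdown)) >= min_css_rules
```

A flagged result could then be retried through a real browser session, as suggested above, instead of being returned as Markdown.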
Converting the URL https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317 with Jina Reader drops several headings (Date and Time, Location, Refund Policy, etc.). The rendered result is:
URL Source: https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317
Markdown Content:
Brian McLaren, author, speaker, activist, and public theologian notes that the challenge of living well and maintaining resilience in turbulent times requires new ways of thinking, becoming, and belonging. Facing nations, ecosystems, economies, religions, and other institutions in disarray, we are called to a spiritual transformation in our own lives that will express itself in transformation in our world.
Childcare for ages 2-12 will be provided by reservation through April 15.
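One way to make this kind of loss reproducible is to compare the headings in the page's HTML with the converted Markdown. The following is a minimal sketch, assuming the missing sections are marked up as `h1` to `h6` elements (the actual Eventbrite markup was not checked); `HeadingCollector` and `missing_headings` are hypothetical helper names.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the visible text of h1-h6 elements from an HTML document."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.headings: list[str] = []
        self._in_heading = False
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._in_heading = True
            self._parts = []

    def handle_data(self, data):
        if self._in_heading:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            # Collapse internal whitespace so "Date  and\nTime" becomes "Date and Time".
            text = " ".join("".join(self._parts).split())
            if text:
                self.headings.append(text)
            self._in_heading = False


def missing_headings(html: str, markdown: str) -> list[str]:
    """Return headings found in the HTML whose text does not appear in the Markdown."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return [heading for heading in collector.headings if heading not in markdown]
```

Run against the saved page source and the converted output, the returned list would name the dropped sections directly. The check is a plain substring match, so it does not verify that a heading survived as a heading, only that its text is present.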