How to scrape a dynamic website? #71

vChavezB · 2022-02-04T11:54:19Z

I am trying to export a localhost website that is generated with this project:

The project serves a website on localhost. Each time the user clicks a link, the project receives a GET request and generates the HTML for that page, so the HTML is produced on demand whenever a link is opened in the browser. At the moment the project cannot export the website to HTML or PDF. For this reason I would like to know how I could recursively collect all the hyperlinks and then save the HTML of each page. Would this be possible with autoscraper?

yafethtb · 2022-11-23T11:16:43Z

It seems no one has answered this yet, and I don't know whether the developers have seen it, so let me try to help. Judging from the scraper file they wrote, autoscraper relies on static scraping libraries such as requests and BeautifulSoup. A dynamic website needs a browser engine to execute its JavaScript. Python has libraries such as Selenium and Playwright that drive a browser engine to render the JavaScript of a dynamic site and extract the resulting HTML, but autoscraper does not appear to use them. Maybe it will in the future, maybe not. As of November 23rd, 2022, I don't see any dynamic scraping libraries used in the core file of this program.
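To illustrate the point about static parsers, here is a minimal sketch using only the standard library. The page in `STATIC_HTML` is a made-up example, and `VisibleText` is a hypothetical helper, not part of autoscraper. It shows that a parser which does not execute JavaScript sees only the empty container, not the text the script would insert in a browser.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container plus a script
# that would fill it in only when a browser runs the JavaScript.
STATIC_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").textContent = "Hello from JavaScript";
  </script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collect text outside <script> and <style>, as a static parser sees it."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script source is delivered here as raw text; it is never executed.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the list of visible text fragments found in the raw HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(STATIC_HTML))
```

The list comes back empty because nothing ran the script. A browser-driven tool would instead return the HTML after the script has filled in the container.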

P.S: Correct me if I'm wrong.

lrq3000 · 2022-11-24T22:19:04Z

You can pass an `html` argument to `scraper.build()` to use the output of your preferred HTML fetcher, so it should be compatible with Selenium with a bit of manual programming.
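A rough sketch of that manual programming, combined with the recursive link collection asked about in the original post. Everything here is an assumption for illustration: `extract_links`, `crawl`, and `selenium_fetcher` are hypothetical helpers, the localhost port is made up, and the commented-out autoscraper call assumes that `build()` accepts `html` alongside `url` and `wanted_list`. The crawler itself uses only the standard library and takes the fetcher as a parameter, so any function that maps a URL to HTML can be plugged in.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == urlparse(base_url).netloc:
            links.append(absolute)
    return links


def crawl(start_url, fetch_html, max_pages=100):
    """Breadth-first crawl from start_url; returns a dict of {url: html}."""
    pages = {}
    queue = [start_url]
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in pages:
            continue
        html = fetch_html(url)
        pages[url] = html
        for link in extract_links(html, url):
            if link not in pages:
                queue.append(link)
    return pages


def selenium_fetcher():
    """Return (fetch_html, close) backed by a Selenium Chrome driver."""
    from selenium import webdriver  # imported lazily; needs selenium and Chrome

    driver = webdriver.Chrome()

    def fetch_html(url):
        driver.get(url)
        return driver.page_source  # HTML after the browser has run the scripts

    return fetch_html, driver.quit


# Possible usage (hypothetical URL and wanted_list, not run here):
#   from autoscraper import AutoScraper
#   fetch_html, close = selenium_fetcher()
#   pages = crawl("http://localhost:8000/", fetch_html)
#   scraper = AutoScraper()
#   for url, html in pages.items():
#       scraper.build(url=url, html=html, wanted_list=["text to look for"])
#   close()
```

The `pages` dictionary can also be written to disk one file per URL, which covers the export part of the question without autoscraper. `page_source` is read right after the page loads, so pages that fetch content asynchronously may need an explicit wait first.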

irthomasthomas mentioned this issue Sep 7, 2023

Scraping irthomasthomas/undecidability#30

Open
