How to scrape a dynamic website? #71

Open
vChavezB opened this issue Feb 4, 2022 · 2 comments
Comments

vChavezB commented Feb 4, 2022

I am trying to export a localhost website that is generated with this project:

https://github.com/HBehrens/puncover

The project serves a website on localhost. Each time the user clicks a link, the project receives a GET request and generates the HTML for that page, so the HTML is produced on demand whenever the user accesses a link through their browser. At the moment the project cannot export the website to HTML or PDF. For this reason I would like to know how I could recursively collect all the hyperlinks and then generate an HTML version of the site. Would this be possible with autoscraper?

yafethtb commented Nov 23, 2022

It seems no one has answered this yet, and I don't know whether the developers have seen it, so let me try to help. Judging from the scraper file, autoscraper uses static scraping libraries such as requests and BeautifulSoup. A dynamic website needs a browser engine to execute its JavaScript. Python has libraries such as Selenium and Playwright that drive a browser engine to render the JavaScript on dynamic sites and extract the resulting HTML, but autoscraper does not appear to use them. Maybe it will in the future, maybe not. As of November 23rd, 2022, I don't see any dynamic scraping libraries used in the core file of this project.

P.S: Correct me if I'm wrong.
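
If the pages are generated on the server for each GET request, as the question describes, a plain HTTP fetch may already return the complete HTML, and recursively collecting the links might not need a browser engine at all. Below is a minimal sketch using only the Python standard library. The names `LinkParser`, `fetch` and `crawl` are made up for this example and are not part of autoscraper or puncover, and the `max_pages` limit is an arbitrary safety cap.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Plain HTTP GET; no JavaScript is executed."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_html=fetch, max_pages=500):
    """Breadth-first crawl of same-host links; returns {url: html}."""
    host = urlparse(start_url).netloc
    pages = {}
    queue = deque([start_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue  # already fetched through another link
        html = fetch_html(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Stay on the same host so external links are not followed.
            if urlparse(absolute).netloc == host and absolute not in pages:
                queue.append(absolute)
    return pages
```

Calling `crawl(start_url)` with the address the local server prints on startup would return a dictionary mapping each visited URL to its HTML, which could then be written to files. If the pages turn out to need JavaScript after all, `fetch_html` can be swapped for a function that renders the page in a browser.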

lrq3000 commented Nov 24, 2022

You can supply an html argument to scraper.build() to use the output of your preferred HTML fetcher, so it should be compatible with Selenium with a bit of manual programming.
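
A rough sketch of that idea, assuming Selenium 4 with a local Chrome installation: the page is rendered in a headless browser and the resulting HTML is handed to `scraper.build()` through its `html` argument. The helper names `render_with_selenium` and `build_from_rendered` are invented for this example and are not part of autoscraper.

```python
def render_with_selenium(url):
    """Load the page in headless Chrome and return the rendered HTML."""
    # Imported lazily so the rest of the sketch works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def build_from_rendered(scraper, url, wanted_list, fetch_html=render_with_selenium):
    """Fetch HTML with any fetcher and pass it to scraper.build(html=...)."""
    html = fetch_html(url)
    # url is still passed so that relative links can be resolved against it.
    return scraper.build(url=url, html=html, wanted_list=wanted_list)
```

With autoscraper installed, usage would look something like `build_from_rendered(AutoScraper(), page_url, ["some text visible on the page"])`, where `page_url` and the wanted text are placeholders. Keeping the fetcher as a parameter means Playwright or a plain HTTP client could be substituted without touching the autoscraper call.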
