Provides an extension to Crawler4J for crawling dynamic web pages #236

Open · wants to merge 4 commits into master from feature/dynamic_crawler
Conversation

ngsoftwaredev

This pull request originates from a customer project in which we had to crawl one of their public websites (built with ReactJS) to power a search engine.

The initial approach was written in Python, but it was not easy to maintain or operate. So I turned to Crawler4J, since I really dislike that every time a dynamic website has to be crawled it ends up being done in Python. I'd much rather have a better, easier-to-operate Java implementation!

It took some effort to wire together Crawler4j (which I used extensively in past projects) and a headless browser.

The README mentions the possibility of extending the parser for such a usage scenario, so I thought it would be nice to make this much simpler for anyone wishing to crawl dynamic content with Crawler4j.

The implementation uses Selenium's WebDriver API to interface with a headless browser. Because WebDriver is not thread-safe, one headless browser is instantiated per crawler thread.

It also offers a programmatic way to wait for dynamic content to finish loading.
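
To make the idea concrete, here is a minimal sketch of the per-thread approach described above, assuming Selenium 4 and headless Firefox; the class name and the "#app" marker selector are illustrative, not the PR's actual API.

import java.time.Duration;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public final class PerThreadDriverSketch {

    // Each crawler thread lazily gets its own headless browser instance,
    // because a single WebDriver must not be shared across threads.
    private static final ThreadLocal<WebDriver> DRIVER = ThreadLocal.withInitial(() -> {
        FirefoxOptions options = new FirefoxOptions();
        options.addArguments("--headless", "--window-size=1920,1080");
        return new FirefoxDriver(options);
    });

    public static String fetchRenderedHtml(String url) {
        WebDriver driver = DRIVER.get();
        driver.get(url);
        // Wait until client-side rendering has populated the page, keyed here on a
        // hypothetical marker element; the PR exposes this as a programmable wait.
        new WebDriverWait(driver, Duration.ofSeconds(10))
                .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("#app")));
        return driver.getPageSource();
    }
}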

I'll be glad to at least have your feedback on this PR.

Things that could be done to enhance it:

  • update the README (I'll definitely do it if you consider merging this)
  • maybe package this feature as some sort of extension instead of having it in the core module?

Best regards from France,

Nicolas

@ngsoftwaredev ngsoftwaredev force-pushed the feature/dynamic_crawler branch 5 times, most recently from 440c6e1 to 57680fe on August 23, 2023 at 11:21
@ngsoftwaredev
Author

One criticism of this implementation is that it does not happen at the PageFetcher level. However, doing it there would require some refactoring, whereas my approach only uses extension.

One bug I noticed is that I'm not enforcing CrawlConfig#maxDownloadSize when fetching the page with WebDriver. I should fix this.
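
A minimal sketch (not part of the PR) of how CrawlConfig#maxDownloadSize could be honoured on the WebDriver side, assuming the configured limit is applied by truncating the rendered page source:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.openqa.selenium.WebDriver;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public final class MaxDownloadSizeGuard {

    // Truncate the rendered HTML to the configured maximum download size.
    public static byte[] limit(WebDriver driver, CrawlConfig config) {
        byte[] bytes = driver.getPageSource().getBytes(StandardCharsets.UTF_8);
        int max = config.getMaxDownloadSize();
        return bytes.length > max ? Arrays.copyOf(bytes, max) : bytes;
    }
}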

…g. using client-side rendering). The implementation uses Selenium's WebDriver API to interface with a headless browser.
Owner

@rzo1 rzo1 left a comment


Thanks for this contribution. I haven't had time to test it (yet) but will obviously run a little crawl with it :-)

I have added some comments / questions.

@rzo1
Owner

rzo1 commented Aug 23, 2023

Maybe we should add some (basic) tests, too.

@ngsoftwaredev
Author

Thanks for reviewing :-)

I've added the missing file headers, I'll start looking into your other comments.

@ngsoftwaredev
Author

ngsoftwaredev commented Aug 23, 2023

OK so I've fixed the defects you found. Let me know if the changes make sense to you.

I'll look into adding some basic tests. Not quite sure what I can unit test, though.

@rzo1
Owner

rzo1 commented Aug 23, 2023

I'll look into adding some basic tests. Not quite sure what I can unit test, though.

I was thinking more of integration testing, but it might be a bit hard to implement for the gain we would get from it; something like spinning up a webserver (Testcontainers maybe) plus some silly JS-based webpage. But I don't think it has any priority ;-)
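
For reference, a rough sketch of what such an integration test could look like, assuming Testcontainers with a plain nginx image serving a small JS page; the image, the resource path, and the one-off page load stand in for the real crawler-based assertions.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.MountableFile;

public class DynamicCrawlIT {

    public static void main(String[] args) {
        // Serve a small JS-rendered page from an nginx container.
        try (GenericContainer<?> web = new GenericContainer<>("nginx:alpine")
                .withExposedPorts(80)
                .withCopyFileToContainer(
                        MountableFile.forClasspathResource("js-page/index.html"),
                        "/usr/share/nginx/html/index.html")) {
            web.start();
            String seed = "http://" + web.getHost() + ":" + web.getMappedPort(80) + "/";

            // A real test would run the dynamic crawler against the seed URL and
            // assert on the rendered content; here the page is just loaded once.
            FirefoxOptions options = new FirefoxOptions();
            options.addArguments("--headless");
            WebDriver driver = new FirefoxDriver(options);
            try {
                driver.get(seed);
                System.out.println(driver.getPageSource());
            } finally {
                driver.quit();
            }
        }
    }
}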

@rzo1
Owner

rzo1 commented Aug 23, 2023

OK so I've fixed the defects you found. Let me know if the changes make sense to you.

The changes look good to me. I will try to find some time to run some crawls on our JS-based university website.

@ngsoftwaredev
Author

I see that the Java 17 build fails. I can reproduce the same problem locally on the master branch, so it doesn't seem to be related to my changes.

@rzo1
Owner

rzo1 commented Aug 23, 2023

It is a flaky test ;-) - from the unit test side everything is OK.

@ngsoftwaredev
Author

OK, I realized I should make the wait for dynamic content loading more generic; I'll make one last commit to address this.
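
One way to make the wait generic, sketched here as an assumption rather than the PR's actual API, is to let the caller supply an arbitrary condition that WebDriverWait can poll:

import java.time.Duration;
import java.util.function.Function;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public final class GenericWaitSketch {

    // The caller decides what "loaded" means by passing any WebDriver -> Boolean predicate.
    public static void waitForContent(WebDriver driver,
                                      Function<WebDriver, Boolean> condition,
                                      Duration timeout) {
        new WebDriverWait(driver, timeout).until(condition);
    }

    // Example condition: wait until the document has finished loading.
    public static final Function<WebDriver, Boolean> DOCUMENT_READY = d ->
            "complete".equals(((JavascriptExecutor) d).executeScript("return document.readyState"));
}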

@rzo1
Owner

rzo1 commented Aug 24, 2023

I downloaded https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz and tried to run the given example.
It resulted in a

2023-08-24 07:36:14,785 INFO  [main] frontier.SleepycatFrontierConfiguration (SleepycatFrontierConfiguration.java:66) - Deleted contents of: /tmp/crawler4j/frontier ( as you have configured resumable crawling to false )
2023-08-24 07:36:15,940 INFO  [main] crawler.CrawlController (CrawlController.java:248) - Crawler 1 started
2023-08-24 07:36:15,941 INFO  [main] crawler.CrawlController (CrawlController.java:248) - Crawler 2 started
Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 400. Message: binary is not a Firefox executable 
Host info: host: 'node-147', ip: '127.0.1.1'
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.2.0-26-generic', java.version: '11.0.20'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Command: [null, newSession {capabilities=[Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--disable-gpu, --user-agent=Mozilla/5.0 (X..., --headless, --window-size=1920,1080, --ignore-certificate-errors], binary: /usr/bin/firefox}}]}]
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:140)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:96)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:68)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:163)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.invokeExecute(DriverCommandExecutor.java:196)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:171)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:518)
	at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:232)
	at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:159)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:156)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:151)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:65)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:44)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more

Process finished with exit code 1

It seems that the default Ubuntu Firefox installation comes via snap. I switched to the official Mozilla PPA, but no luck either.

$ whereis firefox
firefox: /usr/bin/firefox /usr/lib/firefox /etc/firefox /snap/bin/firefox /usr/share/man/man1/firefox.1.gz

After this switch, I am getting

Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 400. Message: binary is not a Firefox executable 
Host info: host: 'node-147', ip: '127.0.1.1'
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.2.0-26-generic', java.version: '11.0.20'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Command: [null, newSession {capabilities=[Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--user-agent=Mozilla/5.0 (X..., --headless, --window-size=1920,1080, --ignore-certificate-errors, --disable-gpu], binary: /usr/lib/firefox/firefox.sh}}]}]
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:140)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:96)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:68)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:163)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.invokeExecute(DriverCommandExecutor.java:196)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:171)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:518)
	at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:232)
	at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:159)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:156)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:151)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:66)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:45)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more

This looks interesting because /usr/bin/firefox is just a symlink to the sh script in the default installation.

It seems that I am doing something obviously wrong. Any pointers?

@rzo1
Owner

rzo1 commented Aug 24, 2023

OK, seems to be Ubuntu 22 / snap / yada yada related ;-) - will test in a VM setup.


There seems to be a problem if more than one crawler thread is used (at least on Windows):

Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.openqa.selenium.remote.NoSuchDriverException: Unable to obtain: Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--window-size=1920,1080, --headless, --ignore-certificate-errors, --user-agent=Mozilla/5.0 (X..., --profile-root=/tmp/crawler4j, --disable-gpu]}}, error Command failed with code: 65, executed: [C:\Users\zowallar\AppData\Local\Temp\selenium-manager121128585160018224746421227753725\selenium-manager.exe, --browser, firefox, --output, json]
Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird. (os error 32) [i.e. the process cannot access the file because it is being used by another process]
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.1'
Driver info: driver.version: FirefoxDriver
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:25)
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:13)
	at org.openqa.selenium.firefox.FirefoxDriver.generateExecutor(FirefoxDriver.java:141)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:66)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:45)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more
Caused by: org.openqa.selenium.WebDriverException: Command failed with code: 65, executed: [C:\Users\zowallar\AppData\Local\Temp\selenium-manager121128585160018224746421227753725\selenium-manager.exe, --browser, firefox, --output, json]
Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird. (os error 32) [i.e. the process cannot access the file because it is being used by another process]
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.1'
Driver info: driver.version: FirefoxDriver
	at org.openqa.selenium.manager.SeleniumManager.runCommand(SeleniumManager.java:151)
	at org.openqa.selenium.manager.SeleniumManager.getDriverPath(SeleniumManager.java:273)
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:22)
	... 9 more

Process finished with exit code -1


Setting the number of crawler threads in the example to 1 will work on Windows.
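
A possible workaround, sketched under the assumption that the race comes from Selenium Manager being invoked by several threads at once: either point Selenium at a pre-installed geckodriver via the standard system property so Selenium Manager is never run, or serialize driver construction across crawler threads. The path below is illustrative.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

public final class SerializedDriverFactory {

    private static final Object INIT_LOCK = new Object();

    static {
        // Optional: bypass Selenium Manager entirely by naming the driver binary.
        System.setProperty("webdriver.gecko.driver", "C:\\tools\\geckodriver.exe");
    }

    public static WebDriver newFirefoxDriver(FirefoxOptions options) {
        // Only one thread at a time runs the driver-resolution/startup code.
        synchronized (INIT_LOCK) {
            return new FirefoxDriver(options);
        }
    }
}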

@rzo1
Owner

rzo1 commented Aug 24, 2023

Looks like I need to set up an additional Linux-based VM. Any distro suggestions for testing? ;-)

@ngsoftwaredev
Author

Actually, I did my testing with chromedriver; I have to look into geckodriver a bit more.

What I have already seen is that there is a way to explicitly set the Firefox binary. That might help; I should probably expose it. I'll let you know how I fare, though it might be a couple of days before I can resume work on this.
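
For reference, Selenium lets you point the driver at a specific Firefox executable via FirefoxOptions#setBinary; a minimal sketch, with an illustrative path instead of the snap wrapper script:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

public final class ExplicitFirefoxBinarySketch {

    public static WebDriver newDriver() {
        FirefoxOptions options = new FirefoxOptions();
        // Point Selenium at a real Firefox executable instead of the wrapper script.
        options.setBinary("/usr/lib/firefox/firefox");
        options.addArguments("--headless");
        return new FirefoxDriver(options);
    }
}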

@rzo1
Owner

rzo1 commented Aug 24, 2023

No problem. I was able to get it to work using chromedriver ;-) - I think we would need to mention https://googlechromelabs.github.io/chrome-for-testing/ in the README.
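
A sketch of what such a README example could show, assuming a Chrome for Testing browser and its matching chromedriver have been downloaded to illustrative local paths:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public final class ChromeForTestingSketch {

    public static WebDriver newDriver() {
        // Both paths are illustrative; they point at the downloaded Chrome for Testing
        // browser and the matching chromedriver.
        System.setProperty("webdriver.chrome.driver", "/opt/chrome-for-testing/chromedriver");
        ChromeOptions options = new ChromeOptions();
        options.setBinary("/opt/chrome-for-testing/chrome");
        options.addArguments("--headless=new", "--window-size=1920,1080");
        return new ChromeDriver(options);
    }
}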

@rzo1
Owner

rzo1 commented Aug 24, 2023

@schwzr If you want to give it a try and provide feedback, feel free ;-)

@rzo1 rzo1 mentioned this pull request Sep 20, 2023
/**
* Absolute path to the driver binary on the local filesystem.
*/
private Path webDriverPath;


@ngsoftwaredev
Author

ngsoftwaredev commented Nov 13, 2023 via email
