Provides an extension to Crawler4J for crawling dynamic web pages #236

Open · wants to merge 4 commits into master from feature/dynamic_crawler
Conversation

ngsoftwaredev

This pull request originates from a customer project in which we had to crawl one of their public websites (built with ReactJS) to power a search engine.

The initial approach was written in Python, but it was not easy to maintain or operate. So I turned to Crawler4J, since I really dislike that every time a dynamic website has to be crawled it ends up being done in Python. I'd much rather have a better, easier-to-operate Java implementation!

It took some effort to wire together Crawler4j (which I used extensively in past projects) and a headless browser.

The README mentions the possibility of extending the parser for such a usage scenario, so I thought it would be nice to make this much simpler for anyone wishing to crawl dynamic content with Crawler4j.

The implementation uses Selenium's WebDriver API to interface with a headless browser. Because WebDriver is not thread-safe, one headless browser is instantiated per crawler thread.

It also offers a programmatic way to wait for dynamic content to finish loading.
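
To make the idea concrete, here is a minimal sketch of the per-thread approach described above, assuming Selenium 4 and headless Firefox; the class name and the "#app" marker selector are illustrative, not the PR's actual API.

import java.time.Duration;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public final class PerThreadDriverSketch {

    // Each crawler thread lazily gets its own headless browser instance,
    // because a single WebDriver must not be shared across threads.
    private static final ThreadLocal<WebDriver> DRIVER = ThreadLocal.withInitial(() -> {
        FirefoxOptions options = new FirefoxOptions();
        options.addArguments("--headless", "--window-size=1920,1080");
        return new FirefoxDriver(options);
    });

    public static String fetchRenderedHtml(String url) {
        WebDriver driver = DRIVER.get();
        driver.get(url);
        // Wait until client-side rendering has populated the page, keyed here on a
        // hypothetical marker element; the PR exposes this as a programmable wait.
        new WebDriverWait(driver, Duration.ofSeconds(10))
                .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("#app")));
        return driver.getPageSource();
    }
}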

I'll be glad to at least have your feedback on this PR.

Things that could be done to enhance it:

  • update the README (I'll definitely do it if you consider merging this)
  • maybe package this feature as some sort of extension instead of having it in the core module?

Best regards from France,

Nicolas

@ngsoftwaredev ngsoftwaredev force-pushed the feature/dynamic_crawler branch 5 times, most recently from 440c6e1 to 57680fe on August 23, 2023 at 11:21
@ngsoftwaredev
Author

One criticism of this implementation is that it does not happen at the PageFetcher level. However, doing it there would require some refactoring, whereas my approach only uses extension.

One bug I noticed is that I'm not enforcing CrawlConfig#maxDownloadSize when fetching the page with WebDriver. I should fix this.
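
A minimal sketch (not part of the PR) of how CrawlConfig#maxDownloadSize could be honoured on the WebDriver side, assuming the configured limit is applied by truncating the rendered page source:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.openqa.selenium.WebDriver;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public final class MaxDownloadSizeGuard {

    // Truncate the rendered HTML to the configured maximum download size.
    public static byte[] limit(WebDriver driver, CrawlConfig config) {
        byte[] bytes = driver.getPageSource().getBytes(StandardCharsets.UTF_8);
        int max = config.getMaxDownloadSize();
        return bytes.length > max ? Arrays.copyOf(bytes, max) : bytes;
    }
}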

…g. using client-side rendering). The implementation uses Selenium's WebDriver API to interface with a headless browser.
Owner

@rzo1 rzo1 left a comment


Thanks for this contribution. I haven't had time to test it (yet) but will obviously run a little crawl with it :-)

I have added some comments / questions.

@rzo1
Owner

rzo1 commented Aug 23, 2023

Maybe we should add some (basic) tests, too.

@ngsoftwaredev
Author

Thanks for reviewing :-)

I've added the missing file headers, I'll start looking into your other comments.

@ngsoftwaredev
Author

ngsoftwaredev commented Aug 23, 2023

OK so I've fixed the defects you found. Let me know if the changes make sense to you.

I'll look into adding some basic tests. Not quite sure what I can unit test, though.

@rzo1
Owner

rzo1 commented Aug 23, 2023

I'll look into adding some basic tests. Not quite sure what I can unit test, though.

I was thinking more of integration testing, but it might be a bit hard to implement for the gain we would get from it; something like spinning up a webserver (Testcontainers maybe) plus some silly JS-based webpage. But I don't think it has any priority ;-)
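
For reference, a rough sketch of what such an integration test could look like, assuming Testcontainers with a plain nginx image serving a small JS page; the image, the resource path, and the one-off page load stand in for the real crawler-based assertions.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.MountableFile;

public class DynamicCrawlIT {

    public static void main(String[] args) {
        // Serve a small JS-rendered page from an nginx container.
        try (GenericContainer<?> web = new GenericContainer<>("nginx:alpine")
                .withExposedPorts(80)
                .withCopyFileToContainer(
                        MountableFile.forClasspathResource("js-page/index.html"),
                        "/usr/share/nginx/html/index.html")) {
            web.start();
            String seed = "http://" + web.getHost() + ":" + web.getMappedPort(80) + "/";

            // A real test would run the dynamic crawler against the seed URL and
            // assert on the rendered content; here the page is just loaded once.
            FirefoxOptions options = new FirefoxOptions();
            options.addArguments("--headless");
            WebDriver driver = new FirefoxDriver(options);
            try {
                driver.get(seed);
                System.out.println(driver.getPageSource());
            } finally {
                driver.quit();
            }
        }
    }
}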

@rzo1
Owner

rzo1 commented Aug 23, 2023

OK so I've fixed the defects you found. Let me know if the changes make sense to you.

The changes look good to me. I will try to find some time to run some crawls on our JS-based university website.

@ngsoftwaredev
Author

I see that the Java 17 build fails. I can reproduce the same problem locally on the master branch, so it doesn't seem to be related to my changes.

@rzo1
Owner

rzo1 commented Aug 23, 2023

It is a flaky test ;-) - from the unit test side everything is OK.

@ngsoftwaredev
Author

OK, I realized I should make the wait for dynamic content loading more generic; I'll make one last commit to address this.
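
One way to make the wait generic, sketched here as an assumption rather than the PR's actual API, is to let the caller supply an arbitrary condition that WebDriverWait can poll:

import java.time.Duration;
import java.util.function.Function;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public final class GenericWaitSketch {

    // The caller decides what "loaded" means by passing any WebDriver -> Boolean predicate.
    public static void waitForContent(WebDriver driver,
                                      Function<WebDriver, Boolean> condition,
                                      Duration timeout) {
        new WebDriverWait(driver, timeout).until(condition);
    }

    // Example condition: wait until the document has finished loading.
    public static final Function<WebDriver, Boolean> DOCUMENT_READY = d ->
            "complete".equals(((JavascriptExecutor) d).executeScript("return document.readyState"));
}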

@rzo1
Owner

rzo1 commented Aug 24, 2023

I downloaded https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz and tried to run the given example.
It resulted in a

2023-08-24 07:36:14,785 INFO  [main] frontier.SleepycatFrontierConfiguration (SleepycatFrontierConfiguration.java:66) - Deleted contents of: /tmp/crawler4j/frontier ( as you have configured resumable crawling to false )
2023-08-24 07:36:15,940 INFO  [main] crawler.CrawlController (CrawlController.java:248) - Crawler 1 started
2023-08-24 07:36:15,941 INFO  [main] crawler.CrawlController (CrawlController.java:248) - Crawler 2 started
Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 400. Message: binary is not a Firefox executable 
Host info: host: 'node-147', ip: '127.0.1.1'
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.2.0-26-generic', java.version: '11.0.20'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Command: [null, newSession {capabilities=[Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--disable-gpu, --user-agent=Mozilla/5.0 (X..., --headless, --window-size=1920,1080, --ignore-certificate-errors], binary: /usr/bin/firefox}}]}]
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:140)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:96)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:68)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:163)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.invokeExecute(DriverCommandExecutor.java:196)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:171)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:518)
	at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:232)
	at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:159)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:156)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:151)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:65)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:44)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more

Process finished with exit code 1

It seems that the default Ubuntu Firefox installation comes via snap. I switched to the official Mozilla PPA, but no luck either.

$ whereis firefox
firefox: /usr/bin/firefox /usr/lib/firefox /etc/firefox /snap/bin/firefox /usr/share/man/man1/firefox.1.gz

After this switch, I am getting

Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 400. Message: binary is not a Firefox executable 
Host info: host: 'node-147', ip: '127.0.1.1'
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.2.0-26-generic', java.version: '11.0.20'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Command: [null, newSession {capabilities=[Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--user-agent=Mozilla/5.0 (X..., --headless, --window-size=1920,1080, --ignore-certificate-errors, --disable-gpu], binary: /usr/lib/firefox/firefox.sh}}]}]
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:140)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:96)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:68)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:163)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.invokeExecute(DriverCommandExecutor.java:196)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:171)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:518)
	at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:232)
	at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:159)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:156)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:151)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:66)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:45)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more

This looks interesting because /usr/bin/firefox is just a symlink to the sh script in the default installation.

It seems that I am doing something obviously wrong. Any pointers?

@rzo1
Owner

rzo1 commented Aug 24, 2023

OK, seems to be Ubuntu 22 / snap / yada yada related ;-) - will test in a VM setup.


There seems to be a problem if more than one crawler thread is used (at least on Windows):

Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.openqa.selenium.remote.NoSuchDriverException: Unable to obtain: Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--window-size=1920,1080, --headless, --ignore-certificate-errors, --user-agent=Mozilla/5.0 (X..., --profile-root=/tmp/crawler4j, --disable-gpu]}}, error Command failed with code: 65, executed: [C:\Users\zowallar\AppData\Local\Temp\selenium-manager121128585160018224746421227753725\selenium-manager.exe, --browser, firefox, --output, json]
Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird. (os error 32) [i.e. the process cannot access the file because it is being used by another process]
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.1'
Driver info: driver.version: FirefoxDriver
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:25)
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:13)
	at org.openqa.selenium.firefox.FirefoxDriver.generateExecutor(FirefoxDriver.java:141)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:66)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:45)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more
Caused by: org.openqa.selenium.WebDriverException: Command failed with code: 65, executed: [C:\Users\zowallar\AppData\Local\Temp\selenium-manager121128585160018224746421227753725\selenium-manager.exe, --browser, firefox, --output, json]
Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird. (os error 32) [i.e. the process cannot access the file because it is being used by another process]
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.1'
Driver info: driver.version: FirefoxDriver
	at org.openqa.selenium.manager.SeleniumManager.runCommand(SeleniumManager.java:151)
	at org.openqa.selenium.manager.SeleniumManager.getDriverPath(SeleniumManager.java:273)
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:22)
	... 9 more

Process finished with exit code -1


Setting the number of crawler threads in the example to 1 will work on Windows.
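
A possible workaround, sketched under the assumption that the race comes from Selenium Manager being invoked by several threads at once: either point Selenium at a pre-installed geckodriver via the standard system property so Selenium Manager is never run, or serialize driver construction across crawler threads. The path below is illustrative.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

public final class SerializedDriverFactory {

    private static final Object INIT_LOCK = new Object();

    static {
        // Optional: bypass Selenium Manager entirely by naming the driver binary.
        System.setProperty("webdriver.gecko.driver", "C:\\tools\\geckodriver.exe");
    }

    public static WebDriver newFirefoxDriver(FirefoxOptions options) {
        // Only one thread at a time runs the driver-resolution/startup code.
        synchronized (INIT_LOCK) {
            return new FirefoxDriver(options);
        }
    }
}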

@rzo1
Owner

rzo1 commented Aug 24, 2023

Looks like I need to set up an additional Linux-based VM. Any distro suggestions for testing? ;-)

@ngsoftwaredev
Author

Actually, I did my testing with chromedriver; I have to look into geckodriver a bit more.

What I have already seen is that there is a way to explicitly set the Firefox binary. That might help; I should probably expose it. I'll let you know how I fare, though it might be a couple of days before I can resume work on this.
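
For reference, Selenium lets you point the driver at a specific Firefox executable via FirefoxOptions#setBinary; a minimal sketch, with an illustrative path instead of the snap wrapper script:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

public final class ExplicitFirefoxBinarySketch {

    public static WebDriver newDriver() {
        FirefoxOptions options = new FirefoxOptions();
        // Point Selenium at a real Firefox executable instead of the wrapper script.
        options.setBinary("/usr/lib/firefox/firefox");
        options.addArguments("--headless");
        return new FirefoxDriver(options);
    }
}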

@rzo1
Owner

rzo1 commented Aug 24, 2023

No problem. I was able to get it to work using chromedriver ;-) - I think we would need to mention https://googlechromelabs.github.io/chrome-for-testing/ in the README.
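
A sketch of what such a README example could show, assuming a Chrome for Testing browser and its matching chromedriver have been downloaded to illustrative local paths:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public final class ChromeForTestingSketch {

    public static WebDriver newDriver() {
        // Both paths are illustrative; they point at the downloaded Chrome for Testing
        // browser and the matching chromedriver.
        System.setProperty("webdriver.chrome.driver", "/opt/chrome-for-testing/chromedriver");
        ChromeOptions options = new ChromeOptions();
        options.setBinary("/opt/chrome-for-testing/chrome");
        options.addArguments("--headless=new", "--window-size=1920,1080");
        return new ChromeDriver(options);
    }
}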

@rzo1
Owner

rzo1 commented Aug 24, 2023

@schwzr If you want to give it a try and provide feedback, feel free ;-)

@rzo1 rzo1 mentioned this pull request Sep 20, 2023
/**
* Absolute path to the driver binary on the local filesystem.
*/
private Path webDriverPath;


@ngsoftwaredev
Author

ngsoftwaredev commented Nov 13, 2023 via email
