Provides an extension to Crawler4J for crawling dynamic web pages #236
(force-pushed from 440c6e1 to 57680fe)
One criticism of this implementation is that it's not happening at the PageFetcher level. However, doing it that way would require some refactoring, while my approach only uses extension. One bug I noticed is that I'm not enforcing
…g. using client-side rendering). The implementation uses Selenium's WebDriver API to interface with a headless browser.
(force-pushed from d076e1e to 4950572)
Thanks for this contribution. I haven't had time to test it (yet) but will obviously run a little crawl with it :-)
I have added some comments / questions.
Review comments were left on the following files (all since resolved, several marked outdated):
- crawler4j-commons/src/main/java/edu/uci/ics/crawler4j/crawler/DynamicCrawlConfig.java
- crawler4j-core/src/main/java/edu/uci/ics/crawler4j/crawler/dynamic/DynamicCrawlerHelper.java
- crawler4j-core/src/main/java/edu/uci/ics/crawler4j/crawler/dynamic/DynamicWebCrawler.java
- crawler4j-core/src/main/java/edu/uci/ics/crawler4j/parser/dynamic/DynamicParser.java
- crawler4j-core/src/main/java/edu/uci/ics/crawler4j/parser/dynamic/DynamicTikaHtmlParser.java
Maybe we should add some (basic) tests, too.
Thanks for reviewing :-) I've added the missing file headers; I'll start looking into your other comments.
OK, so I've fixed the defects you found. Let me know if the changes make sense to you. I'll look into adding some basic tests; not quite sure what I can unit test, though.
I was thinking more of integration testing, but that might be a bit hard to implement for the gain we'd get from it; something like spinning up a webserver (Testcontainers maybe) + some silly JS-based webpage. But I don't think it has any priority ;-)
The changes look good to me. I will try to find some time to run some crawls on our JS-based university website.
I see that the Java 17 build fails. I can reproduce the same problem locally on the master branch, so it doesn't seem to be related to my changes.
It is a flaky test ;-) - from a unit-test perspective everything is OK.
OK, I realized I should make the wait for dynamic content loading more generic; I'll make a last commit to address this.
I downloaded https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz and tried to run the given example:

```
2023-08-24 07:36:14,785 INFO [main] frontier.SleepycatFrontierConfiguration (SleepycatFrontierConfiguration.java:66) - Deleted contents of: /tmp/crawler4j/frontier ( as you have configured resumable crawling to false )
2023-08-24 07:36:15,940 INFO [main] crawler.CrawlController (CrawlController.java:248) - Crawler 1 started
2023-08-24 07:36:15,941 INFO [main] crawler.CrawlController (CrawlController.java:248) - Crawler 2 started
Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 400. Message: binary is not a Firefox executable
Host info: host: 'node-147', ip: '127.0.1.1'
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.2.0-26-generic', java.version: '11.0.20'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Command: [null, newSession {capabilities=[Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--disable-gpu, --user-agent=Mozilla/5.0 (X..., --headless, --window-size=1920,1080, --ignore-certificate-errors], binary: /usr/bin/firefox}}]}]
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:140)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:96)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:68)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:163)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.invokeExecute(DriverCommandExecutor.java:196)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:171)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:518)
	at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:232)
	at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:159)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:156)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:151)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:65)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:44)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more
Process finished with exit code 1
```
It seems that the default Ubuntu Firefox installation is coming via …
After this switch, I am getting:

```
Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 400. Message: binary is not a Firefox executable
Host info: host: 'node-147', ip: '127.0.1.1'
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.2.0-26-generic', java.version: '11.0.20'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Command: [null, newSession {capabilities=[Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--user-agent=Mozilla/5.0 (X..., --headless, --window-size=1920,1080, --ignore-certificate-errors, --disable-gpu], binary: /usr/lib/firefox/firefox.sh}}]}]
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:140)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:96)
	at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:68)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:163)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.invokeExecute(DriverCommandExecutor.java:196)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:171)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:518)
	at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:232)
	at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:159)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:156)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:151)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:66)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:45)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more
```

This looks interesting, because it seems that I am doing something obviously wrong. Any pointers?
OK, seems to be Ubuntu 22 / snap yada yada related ;-) - will test in a VM setup.

There seems to be a problem if more than one crawler thread is used (at least on Windows):

```
Exception in thread "main" java.lang.RuntimeException: error on thread [Crawler 1]
	at edu.uci.ics.crawler4j.crawler.CrawlController.lambda$start$0(CrawlController.java:280)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.openqa.selenium.remote.NoSuchDriverException: Unable to obtain: Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {args: [--window-size=1920,1080, --headless, --ignore-certificate-errors, --user-agent=Mozilla/5.0 (X..., --profile-root=/tmp/crawler4j, --disable-gpu]}}, error Command failed with code: 65, executed: [C:\Users\zowallar\AppData\Local\Temp\selenium-manager121128585160018224746421227753725\selenium-manager.exe, --browser, firefox, --output, json]
Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird. (os error 32)
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.1'
Driver info: driver.version: FirefoxDriver
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:25)
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:13)
	at org.openqa.selenium.firefox.FirefoxDriver.generateExecutor(FirefoxDriver.java:141)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:132)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:127)
	at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:112)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.newWebDriverInstance(DynamicWebCrawler.java:66)
	at edu.uci.ics.crawler4j.crawler.dynamic.DynamicWebCrawler.onStart(DynamicWebCrawler.java:45)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:294)
	... 1 more
Caused by: org.openqa.selenium.WebDriverException: Command failed with code: 65, executed: [C:\Users\zowallar\AppData\Local\Temp\selenium-manager121128585160018224746421227753725\selenium-manager.exe, --browser, firefox, --output, json]
Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird. (os error 32)
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.1'
Driver info: driver.version: FirefoxDriver
	at org.openqa.selenium.manager.SeleniumManager.runCommand(SeleniumManager.java:151)
	at org.openqa.selenium.manager.SeleniumManager.getDriverPath(SeleniumManager.java:273)
	at org.openqa.selenium.remote.service.DriverFinder.getPath(DriverFinder.java:22)
	... 9 more
Process finished with exit code -1
```

(The German OS error translates to: "The process cannot access the file because it is being used by another process." (os error 32))

Setting the number of crawler threads in the example to 1 will work on Windows.
Looks like I need to set up an additional Linux-based VM. Any distro suggestions for testing? ;-)
Actually, I did my testing with ChromeDriver; I have to look into geckodriver a bit more. What I have already seen is that there is a way to explicitly set the Firefox binary. That might help, and I should probably expose it. I'll let you know how I fare; it might be a couple of days, though, before I can resume work on this.
No problem. I was able to get it to work using ChromeDriver ;-) - I think we would need to mention https://googlechromelabs.github.io/chrome-for-testing/ in the README.
@schwzr If you want to give it a try to provide feedback, feel free ;-)
```java
/**
 * Absolute path to the driver binary on the local filesystem.
 */
private Path webDriverPath;
```
AFAIK it's not needed nowadays: https://www.selenium.dev/documentation/selenium_manager/#automated-driver-management
Thanks for the heads-up! It's been quite a while since I issued the PR, and more work is to be done on it. I've lacked the time so far, but fingers crossed I can get back to it soon.
The comment referred to this hunk in crawler4j-commons/src/main/java/edu/uci/ics/crawler4j/crawler/DynamicCrawlConfig.java:

```java
public enum WebDriverType {
    chrome,
    firefox
}

private WebDriverType webDriverType = WebDriverType.firefox;

/**
 * If dynamic content crawling is used, the maximum time to wait for dynamic content loading in seconds.
 */
private int maxWaitForDynamicContentInSeconds = 2;

/**
 * Absolute path to the driver binary on the local filesystem.
 */
private Path webDriverPath;
```
This pull request originates from a project I did for a customer where we had to crawl one of their public websites (made with ReactJS) for building a search engine on it.
An initial approach in Python wasn't easy to maintain or operate, so I went on to take advantage of Crawler4j. I actually hate that every time some dynamic website has to be crawled, it's done in Python; I'd rather have a much better and easier-to-operate Java implementation!
It took some effort to wire together Crawler4j (which I used extensively in past projects) and a headless browser.
The README mentions the possibility to extend the parser for such a usage scenario, so I thought it would actually be nice to make this much simpler for anyone wishing to address dynamic content crawling with Crawler4j.
The implementation makes use of Selenium's WebDriver API to interface with a headless browser. It deals with the fact that WebDriver is not thread-safe: we instantiate one headless browser per crawler thread.
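The one-browser-per-thread idea can be sketched with a `ThreadLocal`. This is a minimal, hedged illustration in plain JDK Java, not the PR's actual code: `WebDriverStandIn` is a hypothetical stand-in for a Selenium `WebDriver` (which is not thread-safe), so the pattern can be shown without a browser installed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerThreadDriverSketch {

    // Hypothetical stand-in for a headless-browser driver (not thread-safe,
    // hence never shared between crawler threads).
    static class WebDriverStandIn {
        String get(String url) { return "<html>rendered " + url + "</html>"; }
        void quit() { /* a real driver would close the browser here */ }
    }

    // Counts how many drivers were actually created.
    static final AtomicInteger driversCreated = new AtomicInteger();

    // One driver per crawler thread, created lazily on first use.
    static final ThreadLocal<WebDriverStandIn> DRIVER = ThreadLocal.withInitial(() -> {
        driversCreated.incrementAndGet();
        return new WebDriverStandIn();
    });

    public static void main(String[] args) throws InterruptedException {
        ExecutorService crawlers = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 4; i++) {
            final int n = i;
            // Each task reuses the driver belonging to its worker thread.
            crawlers.submit(() -> System.out.println(
                Thread.currentThread().getName() + " -> "
                    + DRIVER.get().get("https://example.com/" + n)));
        }
        crawlers.shutdown();
        crawlers.awaitTermination(5, TimeUnit.SECONDS);
        // With a pool of 2 threads, at most 2 drivers exist regardless of page count.
        System.out.println("drivers created: " + driversCreated.get());
    }
}
```

With two worker threads and four pages, no more than two drivers are ever created; each thread keeps reusing its own instance, which mirrors how the PR avoids sharing a WebDriver across crawler threads.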
It also offers a programmatic approach for waiting for dynamic content to load.
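Such a "wait for dynamic content" hook generally boils down to polling a condition until a timeout, in the spirit of Selenium's `WebDriverWait`. Here is a minimal pure-JDK sketch of that idea; the names (`DynamicContentWait`, `waitFor`) are illustrative and not the API exposed by this PR:

```java
import java.util.function.BooleanSupplier;

public class DynamicContentWait {

    /**
     * Polls {@code condition} every {@code pollMillis} until it returns true
     * or {@code timeoutMillis} elapses. Returns whether the condition was met.
     */
    public static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollMillis);
        }
        // One last check at the deadline before giving up.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Simulated page: "content" appears ~200 ms after the request.
        boolean loaded = waitFor(() -> System.currentTimeMillis() - start > 200, 2000, 50);
        System.out.println("loaded=" + loaded);

        // A condition that never becomes true exhausts the timeout.
        boolean neverLoads = waitFor(() -> false, 300, 50);
        System.out.println("neverLoads=" + neverLoads);
    }
}
```

In the real extension the condition would presumably inspect the rendered DOM through the WebDriver (for instance, checking that a marker element exists), with `maxWaitForDynamicContentInSeconds` from `DynamicCrawlConfig` as the timeout.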
I'll be glad to at least have your feedback on this PR.
Things that could be done to enhance it:
Best regards from France,
Nicolas