Support for crawling from secondary IP address #2409

Open
teammakdi opened this issue Apr 8, 2024 · 1 comment
Labels
feature Issues that represent new features or improvements to existing features. t-c&c Team covering store and finance matters. t-console Issues with this label are in the ownership of the console team. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@teammakdi

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Feature

Hi, I see that both HttpCrawler and PuppeteerCrawler support ProxyConfiguration, which requires an HTTP proxy server. However, my use case is to crawl from a secondary IP address that is configured on the machine itself.

Motivation

Plain axios supports sending requests from a secondary IP address present on the machine. Example:

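import https from 'node:https';
import axios from 'axios';

// Bind outgoing connections to a specific local (secondary) IP address.
const httpsAgent = new https.Agent({
    localAddress: 'x.x.x.x',
    localPort: xxxx, // optional; placeholder for a fixed source port
});

try {
    const response = await axios.get('https://api.ipify.org', { httpsAgent });
    console.log('HTTPS Agent: ', response.data); // prints secondary IP address
} catch (err) {
    console.error(err);
}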

I was wondering whether this is possible with Crawlee's HttpCrawler, i.e. through the got library. I'm not sure whether it would be feasible with PuppeteerCrawler.
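Before binding to a secondary address, it can help to confirm that the address is actually assigned to a local interface, since binding to an address the machine does not own typically fails with EADDRNOTAVAIL. A minimal sketch using only Node's built-in `os` module follows; `listLocalIPv4Addresses` is a hypothetical helper name, not something provided by Crawlee, got, or axios.

```javascript
const os = require('node:os');

// Hypothetical helper: collect the non-loopback IPv4 addresses
// assigned to this machine's network interfaces.
function listLocalIPv4Addresses() {
    const addresses = [];
    for (const infos of Object.values(os.networkInterfaces())) {
        for (const info of infos ?? []) {
            // `family` is the string 'IPv4' in most Node versions;
            // some Node 18 releases reported the number 4 instead.
            const isIPv4 = info.family === 'IPv4' || info.family === 4;
            if (isIPv4 && !info.internal) {
                addresses.push(info.address);
            }
        }
    }
    return addresses;
}
```

Calling `listLocalIPv4Addresses()` returns an array of address strings; the secondary IP should appear in it before it is used as `localAddress`. The sketch is written as CommonJS; in an ES module project the `require` line would become an `import`.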

Ideal solution or implementation, and any additional constraints

Alternative solutions or implementations

No response

Other context

No response

@teammakdi teammakdi added the feature Issues that represent new features or improvements to existing features. label Apr 8, 2024
@mtrunkat mtrunkat added t-tooling Issues with this label are in the ownership of the tooling team. t-console Issues with this label are in the ownership of the console team. t-c&c Team covering store and finance matters. labels Apr 10, 2024
@teammakdi
Author

teammakdi commented Apr 22, 2024

For HttpCrawler, this turned out to be relatively easy.

preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        gotOptions.localAddress = secondaryIpAddress;
    },
],

Setting gotOptions.localAddress works.
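For reuse across crawlers, the same idea could be wrapped in a small factory that validates the address up front. This is only a sketch: `createLocalAddressHook` is a hypothetical helper, not part of Crawlee, and it assumes the hook receives `(crawlingContext, gotOptions)` as in the snippet above.

```javascript
const net = require('node:net');

// Hypothetical helper (not part of Crawlee): returns a pre-navigation hook
// that pins outgoing requests to one local IP address.
function createLocalAddressHook(localAddress) {
    // net.isIP returns 0 for anything that is not a valid IPv4/IPv6 literal.
    if (net.isIP(localAddress) === 0) {
        throw new TypeError(`Not a valid IP address: ${localAddress}`);
    }
    // Same hook shape as above: (crawlingContext, gotOptions).
    return async (_crawlingContext, gotOptions) => {
        gotOptions.localAddress = localAddress;
    };
}
```

It would then be passed to the crawler as `preNavigationHooks: [createLocalAddressHook(secondaryIpAddress)]`. Failing early on a malformed address avoids discovering the mistake only when the first request is sent.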

I'm still looking for a solution for PuppeteerCrawler.

I was able to make it work with Squid by setting up an HTTP proxy server; however, I'm looking for an approach that uses the secondary IP address directly.
