Get current URL in customCrawl() #364

Open
popstas opened this issue Apr 27, 2020 · 3 comments

popstas commented Apr 27, 2020

What is the current behavior?
There is no information about the current URL inside customCrawl().

What is the motivation / use case for changing the behavior?
I want to skip the request but still add the URL to the CSV for some file types like zip, doc, and pdf.
My code that does this: https://github.com/viasite/sites-scraper/blob/59449b1b03/src/scrap-site.js#L240-L255

Proposal
Add a crawler argument to customCrawl:
customCrawl: async (page, crawl, crawler)

I tried to store currentUrl via the requeststarted event, but it fails when concurrency > 1.

What do you think? I can make a PR.
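
For illustration, a rough sketch of how the proposed third argument might be used. The crawler parameter, the crawler.options.url accessor, and the appendToCsv helper are assumptions for this example, not part of the current API:

customCrawl: async (page, crawl, crawler) => {
    // Hypothetical: read the URL of the request currently being processed
    // from the crawler instance instead of tracking it via requeststarted.
    const url = crawler.options.url; // assumed accessor, not in the current API
    if (/\.(zip|docx?|pdf)$/i.test(url)) {
        appendToCsv(url); // user-defined helper that records the URL
        return null;      // assumed way to skip the normal page crawl
    }
    return crawl();
}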

popstas added a commit to popstas/headless-chrome-crawler that referenced this issue May 5, 2020
@kulikalov (Contributor) commented

Hey @popstas
This is a valid proposal. I had the same issue. Yeah, please do the PR, and don't forget to add the related info to the docs. It's been a while since you posted this, so please let me know if you are still willing to do it.

@iamprageeth commented

We can use the preRequest option to skip URLs; you can persist the URL or do anything else with it there.
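
For reference, a minimal sketch of that approach: preRequest can return false to skip a request, and appendToCsv here is a user-defined helper, not part of the library:

const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
    preRequest: (options) => {
        if (/\.(zip|docx?|pdf)$/i.test(options.url)) {
            appendToCsv(options.url); // persist the URL however you like
            return false;             // returning false skips the request
        }
        return true;
    },
    onSuccess: (result) => console.log(result.options.url),
}).then(async (crawler) => {
    await crawler.queue('https://example.com/');
    await crawler.onIdle();
    await crawler.close();
});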

@JacksonSabol commented

It's been 2 years since the issue was opened, but if others are looking to get the current URL in the future, it is available in the result object of a customCrawl, specifically result.options.url. Something like this should do the trick:

customCrawl: async (page, crawl) => {
    await page.setRequestInterception(true);
    page.on('request', request => request.continue());
    page.on('error', err => console.error(err));

    const result = await crawl();
    const currentUrl = result.options.url;
    // ... whatever logic you want
    return result;
}
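
For context, that callback plugs into the launch options roughly like this; the queued URL and the onSuccess handler are placeholders:

const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
    customCrawl: async (page, crawl) => {
        const result = await crawl();
        console.log('crawling:', result.options.url); // the current URL
        return result;
    },
    onSuccess: (result) => console.log('finished:', result.options.url),
}).then(async (crawler) => {
    await crawler.queue('https://example.com/');
    await crawler.onIdle();
    await crawler.close();
});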
