Get current URL in customCrawl() #364

Open
popstas opened this issue Apr 27, 2020 · 3 comments

popstas commented Apr 27, 2020

What is the current behavior?
There is no information about the current URL inside customCrawl().

What is the motivation / use case for changing the behavior?
I want to skip the request but still add the URL to the CSV for some file types like zip, doc, and pdf.
My code that does this: https://github.com/viasite/sites-scraper/blob/59449b1b03/src/scrap-site.js#L240-L255

Proposal
Add a crawler argument to customCrawl:
customCrawl: async (page, crawl, crawler)

I tried to store currentUrl via the requeststarted event, but it fails when concurrency > 1.

What do you think? I can make a PR.
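
For illustration, a rough sketch of how the proposed third argument might be used. The crawler parameter, the crawler.options.url accessor, and the appendToCsv helper are assumptions for this example, not part of the current API:

customCrawl: async (page, crawl, crawler) => {
    // Hypothetical: read the URL of the request currently being processed
    // from the crawler instance instead of tracking it via requeststarted.
    const url = crawler.options.url; // assumed accessor, not in the current API
    if (/\.(zip|docx?|pdf)$/i.test(url)) {
        appendToCsv(url); // user-defined helper that records the URL
        return null;      // assumed way to skip the normal page crawl
    }
    return crawl();
}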

popstas added a commit to popstas/headless-chrome-crawler that referenced this issue May 5, 2020
@kulikalov (Contributor) commented

Hey @popstas
This is a valid proposal. I had the same issue. Yeah, please do the PR, and don't forget to add the related info to the docs. It's been a while since you posted this, so please let me know if you are still willing to do it.

@iamprageeth commented

We can use the preRequest option to skip URLs; you can persist the URL or do anything else with it there.
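
For reference, a minimal sketch of that approach: preRequest can return false to skip a request, and appendToCsv here is a user-defined helper, not part of the library:

const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
    preRequest: (options) => {
        if (/\.(zip|docx?|pdf)$/i.test(options.url)) {
            appendToCsv(options.url); // persist the URL however you like
            return false;             // returning false skips the request
        }
        return true;
    },
    onSuccess: (result) => console.log(result.options.url),
}).then(async (crawler) => {
    await crawler.queue('https://example.com/');
    await crawler.onIdle();
    await crawler.close();
});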

@JacksonSabol commented

It's been 2 years since the issue was opened, but if others are looking to get the current URL in the future, it is available in the result object of a customCrawl, specifically result.options.url. Something like this should do the trick:

customCrawl: async (page, crawl) => {
    await page.setRequestInterception(true);
    page.on('request', request => request.continue());
    page.on('error', err => console.error(err));

    const result = await crawl();
    const currentUrl = result.options.url;
    // ... whatever logic you want
    return result;
}
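
For context, that callback plugs into the launch options roughly like this; the queued URL and the onSuccess handler are placeholders:

const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
    customCrawl: async (page, crawl) => {
        const result = await crawl();
        console.log('crawling:', result.options.url); // the current URL
        return result;
    },
    onSuccess: (result) => console.log('finished:', result.options.url),
}).then(async (crawler) => {
    await crawler.queue('https://example.com/');
    await crawler.onIdle();
    await crawler.close();
});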
