Stuck for no reason? #314

Open
crapthings opened this issue Aug 31, 2018 · 6 comments

crapthings commented Aug 31, 2018

I have a list of about 1.8k URLs, but when I call

await crawler.queue(urls)

it seems to get stuck at random, without ever timing out.

const fs = require('fs')
const _ = require('lodash')
const writeJsonFile = require('write-json-file')
const HCCrawler = require('headless-chrome-crawler')
const RedisCache = require('headless-chrome-crawler/cache/redis')

// keep the request cache in Redis (persisted via persistCache below)
const cache = new RedisCache({ host: '127.0.0.1', port: 6379 })

let urls = getUrls() // not shown here; returns the ~1.8k URLs
let count = urls.length

async function p1() {
  const crawler = await HCCrawler.launch({
    cache,
    persistCache: true,

    // runs in the page context; $ is jQuery, injected by the crawler by default
    evaluatePage: (() => ({
      title: $('#litZQMC').text(),
      html: $('#divScroll').html()
    })),

    // skip pages already saved to disk, otherwise write the scraped html
    onSuccess: async resp => {
      const { result: { title, html } } = resp
      if (fs.existsSync(`files/${title}.txt`)) {
        console.log('skip', count--, title)
      } else {
        await writeJsonFile(`files/${title}.txt`, html)
        console.log('done', count--, title)
      }
    },

    onError: err => {
      console.log(err)
    }
  })

  // enqueue everything, wait for the queue to drain, then shut down
  await crawler.queue(urls)
  await crawler.onIdle()
  await crawler.close()
}

async function queue() {
  await p1()
}

queue()
  • Version: 1.8.0
  • Platform / OS version: macOS
  • Node.js version: v8.11.3
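
For anyone trying to narrow down where it stalls, here is a small debugging sketch on top of the snippet above (it assumes onSuccess receives the queued options with a url field, as in the library's README examples): keep a set of outstanding URLs and periodically print whatever has not completed.

// not part of the library – just bookkeeping around the snippet above
const pending = new Set(urls)

// inside onSuccess, mark the URL as handled:
//   pending.delete(resp.options.url)

// once a minute, print whatever has not completed yet
const report = setInterval(() => {
  if (pending.size > 0 && pending.size <= 20) {
    console.log('still pending:', [...pending])
  }
}, 60 * 1000)

// remember to clearInterval(report) after `await crawler.onIdle()`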
@bloody-ux

I have the same situation, but not randomly: it just gets stuck, with the Chrome process killed after several minutes.

@davidebaldini

Did anyone find a solution or workaround?

No exception is thrown and no error is printed. A few Chrome processes are still running when the script gets stuck.

davidebaldini commented Sep 25, 2018

I discovered what causes the hang in my case.

It happens when a tab is pointed (_page.goto()) at a page containing Flash. There the browser shows a warning dialog that is not detected by _handleDialog() in crawler.js, which causes an infinite wait in _collectLinks().

Solution (works for me): the first part of _collectLinks() needs to be changed to:

  /**
   * @param {!string} baseUrl
   * @return {!Promise<!Array<!string>>}
   * @private
   */
  async _collectLinks(baseUrl) {
    const links = [];
    // race exposeFunction against a 10 s timeout so a blocking dialog
    // can no longer hang the crawl indefinitely
    await Promise.race([
      new Promise(resolve => setTimeout(resolve, 10000)),
      this._page.exposeFunction('pushToLinks', link => {
        const _link = resolveUrl(link, baseUrl);
        if (_link) links.push(_link);
      })
    ]);
    console.log("PASSED");
    // ...rest of _collectLinks() unchanged
This modification may cause a memory leak, but it works for me.
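
If the dangling timer is the concern, a small variation (a sketch only; withTimeout is my own helper name, not part of crawler.js) clears the timer once the race settles:

function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(resolve, ms);
  });
  // resolve with whichever settles first, then always clear the timer
  return Promise.race([promise, timeout]).then(
    value => {
      clearTimeout(timer);
      return value;
    },
    error => {
      clearTimeout(timer);
      throw error;
    }
  );
}

// usage inside _collectLinks() (sketch):
//   await withTimeout(this._page.exposeFunction('pushToLinks', ...), 10000);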

@BubuAnabelas

Maybe @yujiosaka could look into this, since it's clearly an easy-to-reproduce and easy-to-fix bug.

popstas commented Mar 5, 2020

Try adding args: ['--no-sandbox'] to the crawler options.
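
For reference, the flag goes into the launch options; a minimal sketch based on the reporter's snippet (headless-chrome-crawler forwards Puppeteer launch options such as args):

const crawler = await HCCrawler.launch({
  // Chrome launch flag: disables the sandbox, which mainly matters when
  // running as root or inside containers/CI
  args: ['--no-sandbox'],
  // ...the rest of the original options (cache, evaluatePage, onSuccess, onError)
})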

kulikalov added the bug label on Oct 26, 2020
@kulikalov
Contributor

Is anyone willing to make a PR?
