
Very slow performance scraping coinmarketcap.com #2

Open
viktorius007 opened this issue Jul 29, 2018 · 1 comment

@viktorius007

The following example code executes in 1.3s on my MacBook...


use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler('https://coinmarketcap.com/');

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//table[@id="currencies"]');
$configuration->setRowXPath('.//tbody/tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Rank',
                'xpath' => '//td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Name',
                'xpath' => '//td[2]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Market Cap',
                'xpath' => '//td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Price',
                'xpath' => '//td[4]',
            ]
        ),
        new \Scraper\Structure\RegexField(
            [
                'name'  => '% Change',
                'xpath' => '//td[7]',
                'regex' => '/(.*)%/'
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);

However, this slightly tweaked version takes 4.5 minutes!


use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler('https://coinmarketcap.com/currencies/volume/monthly/');

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//table[@id="currencies-volume"]');
$configuration->setRowXPath('.//tbody/tr');
$configuration->setFields(
    [
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Rank',
              'xpath' => './/td[1]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Name',
              'xpath' => './/td[2]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Symbol',
              'xpath' => './/td[3]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Volume_1D',
              'xpath' => './/td[4]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Volume_7D',
              'xpath' => './/td[5]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Volume_30D',
              'xpath' => './/td[6]',
          ]
      ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();

print_r(array_slice($data, 0, 10));

Are you able to confirm this performance problem on your system, and if so, why is there such a performance hit?
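
To help narrow this down, here is a minimal timing sketch (my own diagnostic, using only the library calls already shown above plus plain PHP) that separates the raw HTTP download from the full crawl-and-extract run. It reuses the same target and row XPaths as the slow snippet, reduced to a single Rank field to keep it short.

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

$url = 'https://coinmarketcap.com/currencies/volume/monthly/';

// 1. Raw HTTP fetch only, no parsing, as a network baseline.
$start = microtime(true);
$html  = file_get_contents($url);
printf("Raw download: %.2fs (%d bytes)\n", microtime(true) - $start, strlen($html));

// 2. Full library run with the same target/row XPaths as above.
$start = microtime(true);
$crawler       = new GeneralCrawler($url);
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//table[@id="currencies-volume"]');
$configuration->setRowXPath('.//tbody/tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Rank',
                'xpath' => './/td[1]',
            ]
        ),
    ]
);
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data      = $extractor->extract();
printf("Crawl + extract: %.2fs (%d rows)\n", microtime(true) - $start, count($data));

If the raw download is already slow, the site is likely throttling or serving a very heavy page; if only the library run is slow, the time is going into the crawl/extraction step.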

@viktorius007 (Author)

@rajan61005co I would really appreciate your opinion on the above-mentioned problem. I am at a loss to understand why it is so slow. Today the same script takes 9.5 minutes!
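
In case it helps with debugging, here is a rough offline check using plain PHP DOM (not the Scraper library) against a hypothetical locally saved copy of the page, to see whether the XPath work itself is the expensive part:

// Hypothetical offline check: parse a previously saved copy of the page with
// plain DOMDocument/DOMXPath and time the same row/field XPaths.
$html = file_get_contents(__DIR__ . '/volume-monthly.html'); // saved by hand beforehand

$start = microtime(true);
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings caused by real-world markup

$xpath = new DOMXPath($doc);
$rows  = $xpath->query('//table[@id="currencies-volume"]//tbody/tr');

$data = [];
foreach ($rows as $row) {
    $data[] = [
        'Rank' => trim($xpath->evaluate('string(.//td[1])', $row)),
        'Name' => trim($xpath->evaluate('string(.//td[2])', $row)),
    ];
}
printf("Pure DOM/XPath pass: %.3fs for %d rows\n", microtime(true) - $start, count($data));

If this pass finishes in well under a second, the slowdown is almost certainly in the fetch/crawl layer rather than in the extraction itself.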
