This crawler crawls the blog URLs you give it, collects every URL found on those pages, and inserts them into MySQL.
- Clone this repository: `git clone https://github.com/clasense4/scrapy-blog-crawler.git`
- Edit `blog_crawler/settings.py` and change your Scrapy, redis, and MySQL settings (a rough settings sketch follows this list).
- Edit `DEPTH_LIMIT` if you want deeper crawling.
- Edit `blog_crawler/spiders/blog_spiders.py` and change the list of blogs in this line: `start_urls = ['http://tmcblog.com']`
- Run these queries to create the tables:

        CREATE TABLE `scrapy_blog` (
          `id` int(11) NOT NULL AUTO_INCREMENT,
          `url_from` varchar(255) NOT NULL,
          `url_found` varchar(255) NOT NULL,
          `url_referer` tinytext NOT NULL,
          PRIMARY KEY (`id`)
        ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

        CREATE TABLE `scrapy_blog_master` (
          `id` int(11) NOT NULL AUTO_INCREMENT,
          `url_master` varchar(255) NOT NULL,
          `class` varchar(50) DEFAULT 'UNIQ',
          PRIMARY KEY (`id`)
        ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='latin1_swedish_ci';
- Make sure your redis server is running: `$> src/redis-server`
- Start the crawler with this command: `$> scrapy crawl blog_spider`
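For reference, the values `blog_crawler/settings.py` expects look roughly like the sketch below. `DEPTH_LIMIT` is a standard Scrapy setting, but the MySQL and redis variable names shown here are only assumptions for illustration; check the file itself for the real names.

```python
# blog_crawler/settings.py -- illustrative values only.
# The MySQL/redis variable names are assumptions; check the actual file.

DEPTH_LIMIT = 1            # standard Scrapy setting; raise it for deeper crawling

# MySQL connection used when inserting found URLs (hypothetical names)
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'secret'
MYSQL_DB = 'scrapy'

# redis connection used for the unique-URL set (hypothetical names)
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
```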
On 20 December 2012, this script gave me 2070 rows with `DEPTH_LIMIT = 1`.

On 8 January 2013, this script gave me 749 unique URLs, saved in the redis server using the `sadd` command.
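If you want to inspect that unique URL set yourself, a minimal redis-py sketch looks something like this. The key name `scrapy_blog:urls` and the connection details are assumptions; the spider may use a different key.

```python
import redis

# Connection details and key name are assumptions; adjust to your setup.
r = redis.Redis(host='localhost', port=6379, db=0)

url = 'http://tmcblog.com/some-post'

# SADD returns 1 if the URL was new and 0 if it was already in the set,
# which is what makes it handy for de-duplicating found URLs.
if r.sadd('scrapy_blog:urls', url):
    print('new url stored:', url)
else:
    print('already seen:', url)

print('unique urls so far:', r.scard('scrapy_blog:urls'))
```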
This script is still rough and does not follow Scrapy standards, so use it at your own risk.
If you can improve it with Scrapy pipelines, I would really appreciate that (a rough pipeline sketch follows below).
Mail me at clasense4[at]gmail[dot]com
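For anyone who picks up the pipeline idea, here is a rough sketch of what a MySQL item pipeline could look like. It is not the project's actual implementation: it assumes MySQLdb, hard-coded connection values, and an item with `url_from`, `url_found`, and `url_referer` fields matching the `scrapy_blog` table above.

```python
# blog_crawler/pipelines.py -- a rough sketch, not the project's actual code.
import MySQLdb


class MysqlBlogPipeline(object):
    def open_spider(self, spider):
        # Connection values are placeholders; in practice read them from settings.
        self.conn = MySQLdb.connect(host='localhost', user='root',
                                    passwd='secret', db='scrapy')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one found URL per crawled item into the scrapy_blog table.
        self.cursor.execute(
            "INSERT INTO scrapy_blog (url_from, url_found, url_referer) "
            "VALUES (%s, %s, %s)",
            (item['url_from'], item['url_found'], item['url_referer'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

The pipeline would still have to be enabled through the `ITEM_PIPELINES` setting in `blog_crawler/settings.py`.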