Finally found a crawler I can actually understand... even though it has bugs #2

anmingyu11 · 2020-03-21T02:51:17Z

```python
        # Excerpt from the spider's list-page callback. `selector`, `stock_id`, and
        # `page` are defined earlier in the method (not shown here).
        posts = selector.xpath('//div[@class="articleh normal_post"]')  # + selector.xpath('//div[@class="articleh odd"]')

        for post in posts:
            # Build the absolute post URL and skip posts that were already crawled.
            link = post.xpath('span[@class="l3 a3"]/a/@href').extract()
            if link:
                if link[0].startswith('/'):
                    link = "http://guba.eastmoney.com/" + link[0][1:]
                else:
                    link = "http://guba.eastmoney.com/" + link[0]

                if link in self._existed_urls:
                    continue
            else:
                link = None  # no href found; no detail request is made below

            # Drop pinned (set-top), ad, and info posts.
            post_type = post.xpath('span[@class="l3 a3"]/em/@class').extract()
            if post_type:
                post_type = post_type[0]
                if post_type in ('ad', 'settop', 'hinfo'):
                    continue
            else:
                post_type = 'normal'

            read_count = post.xpath('span[@class="l1 a1"]/text()').extract()
            comment_count = post.xpath('span[@class="l2 a2"]/text()').extract()
            username = post.xpath('span[@class="l4 a4"]/a/font/text()').extract()
            updated_time = post.xpath('span[@class="l5 a5"]/text()').extract()
            print('read_count:', read_count)
            print('comment_count:', comment_count)
            print('username:', username)
            print('updated_time:', updated_time)
            # Skip rows where any field is missing (e.g. the markup did not match).
            if not read_count or not comment_count or not username or not updated_time:
                print('skip: missing field')
                continue

            item = PostItem()
            item['stock_id'] = stock_id
            item['read_count'] = int(read_count[0])
            item['comment_count'] = int(comment_count[0])
            item['username'] = username[0].strip('\r\n').strip()
            item['updated_time'] = updated_time[0]
            item['url'] = link

            if link:
                yield Request(url=link, meta={'item': item, 'PhantomJS': True}, callback=self.parse_post)

        # Queue the next list page until the last page is reached.
        if page < self.total_pages:
            stock_id = self.stock_id
            request = Request(LIST_URL.format(stock_id=self.stock_id, page=page + 1))
            request.meta['stock_id'] = stock_id
            request.meta['page'] = page + 1
            yield request
```
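
The snippet above builds the absolute post URL by checking for a leading slash by hand. As a possible simplification, the standard library's `urljoin` handles both cases. This is only a sketch: the `BASE_URL` constant, the `absolute_post_url` helper, and the sample hrefs are illustrative and not part of the original spider.

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the string used in the snippet above.
BASE_URL = "http://guba.eastmoney.com/"


def absolute_post_url(href):
    """Resolve a post href (with or without a leading slash) against BASE_URL."""
    return urljoin(BASE_URL, href)


# Illustrative hrefs only; real hrefs come from the list page.
print(absolute_post_url("/news,000001,1.html"))
print(absolute_post_url("news,000001,1.html"))
```

Both calls resolve to the same absolute URL, so the `startswith('/')` branch would no longer be needed.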

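The HTML tags on Eastmoney Guba have changed.
The LIST_URL you use also has a problem. As far as I can tell, only the SSE Composite Index follows the LIST_URL format written here. I tried the CSI 300 and its list URL is different, so it needs special handling.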
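
One way to handle boards whose list URL differs is a per-board override table with a default fallback. The sketch below is only an illustration: both URL templates are unverified placeholders, and `DEFAULT_LIST_URL`, `LIST_URL_OVERRIDES`, `build_list_url`, and the `example_index` key are hypothetical names. The real patterns have to be read off the live site.

```python
# Placeholder template for ordinary boards (NOT verified against the site).
DEFAULT_LIST_URL = "http://guba.eastmoney.com/list,{stock_id}_{page}.html"

# Boards that need special handling map to their own template.
# The key and the template here are made up for illustration.
LIST_URL_OVERRIDES = {
    "example_index": "http://guba.eastmoney.com/list,{stock_id},f_{page}.html",
}


def build_list_url(stock_id, page):
    """Return the list-page URL, using an override template when one exists."""
    template = LIST_URL_OVERRIDES.get(stock_id, DEFAULT_LIST_URL)
    return template.format(stock_id=stock_id, page=page)


print(build_list_url("600000", 2))
print(build_list_url("example_index", 3))
```

In the spider, the call `LIST_URL.format(stock_id=..., page=...)` could then be swapped for `build_list_url(...)`, keeping the special cases in one place.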


ZHANGM41 · 2020-04-12T12:56:05Z

Hi, is there a bug when crawling individual stock boards? I tried it and nothing gets saved to the database ...

anmingyu11 · 2020-04-14T01:39:17Z

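> Hi, is there a bug when crawling individual stock boards? I tried it and nothing gets saved to the database ...

@ZHANGM41

This can't be crawled as it stands. You have to modify it, because the Guba page structure has changed.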

ZHANGM41 · 2020-04-14T10:02:35Z

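> @ZHANGM41
>
> This can't be crawled as it stands. You have to modify it, because the Guba page structure has changed.

Thanks! After making the changes I found there seems to be anti-crawling: after crawling for a while I get redirected to an unrelated page… Would switching IPs solve this?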

anmingyu11 · 2020-04-15T02:08:33Z

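> Thanks! After making the changes I found there seems to be anti-crawling: after crawling for a while I get redirected to an unrelated page… Would switching IPs solve this?

@ZHANGM41

You're right, there is anti-crawling. I crawled through a paid IP proxy service; crawling from a single machine does not work.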
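
For readers wondering how a proxy pool could be wired into Scrapy, here is a minimal sketch of a downloader middleware that picks a random proxy per request. It is not the setup used in this thread: the class name `RandomProxyMiddleware`, the `PROXY_LIST` setting, and the proxy addresses are hypothetical. What it relies on is that Scrapy's built-in `HttpProxyMiddleware` honors `request.meta['proxy']`.

```python
import random


class RandomProxyMiddleware:
    """Sketch of a downloader middleware that assigns a random proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, e.g. in settings.py:
        # PROXY_LIST = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware routes the request through meta['proxy'].
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

The middleware would still need to be enabled under `DOWNLOADER_MIDDLEWARES` in the project settings, and it does not detect the redirect to an unrelated page mentioned above; that check would have to be added separately.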

ZHANGM41 · 2020-04-15T09:45:42Z

> @ZHANGM41
>
> You're right, there is anti-crawling. I crawled through a paid IP proxy service; crawling from a single machine does not work.

OK, thank you very much!

shizhu13 · 2021-07-26T14:17:03Z

> @ZHANGM41
>
> You're right, there is anti-crawling. I crawled through a paid IP proxy service; crawling from a single machine does not work.

What is your environment setup? Could you share it? I hope everyone can connect on WeChat to discuss; I'll set up a group dedicated to crawlers.

shizhu13 · 2021-07-26T14:18:12Z

I hope everyone can connect on WeChat to discuss. I'll set up a group dedicated to crawlers. My WeChat: 876983033

c976237222 · 2023-06-07T17:01:14Z

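> I hope everyone can connect on WeChat to discuss. I'll set up a group dedicated to crawlers. My WeChat: 876983033

Did you manage to solve this problem? Is it still OK to add you on WeChat?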

c976237222 · 2023-06-10T18:17:41Z

> OK, thank you very much!

Do you still have code that works?
