
[RFC] A new journey #203

Open · 4 of 6 tasks
whg517 opened this issue Oct 21, 2021 · 14 comments

whg517 commented Oct 21, 2021

fix #226

Hi, scrapy-redis is one of the most commonly used companion tools for Scrapy, but it seems to me that this project has not been maintained for a long time, and parts of it are now out of date.

Given the recent updates to Python and Scrapy, I would like to make some feature contributions to the project. If you are open to this, I will arrange the follow-up work.

Tasks:

Sm4o commented Oct 25, 2021

It would be super useful to also support feeding more context to the spiders: not just a list of start_urls, but a list of JSON objects, like so:

{
    "start_urls": [
        {
            "start_url": "https://example.com/",
            "sku": 1234
        }
    ]
}

This was already proposed a while back in #156.

whg517 (Author) commented Oct 26, 2021

Hello @Sm4o, I wrote an example based on your description. Does it achieve what you need?

from __future__ import annotations  # allow `list[str]` / `str | None` hints on Python 3.7+

import json

from scrapy import Request, Spider
from scrapy.http import Response

from scrapy_redis.spiders import RedisSpider


class SpiderError(Exception):
    """"""


class BaseParser:
    name = None

    def __init__(self, spider: Spider):
        # use log: self.spider.logger
        self.spider = spider

    def parse(
        self,
        *,
        response: Response,
        **kwargs
    ) -> list[str]:
        raise NotImplementedError('`parse()` must be implemented.')


class HtmlParser(BaseParser):
    name = 'html'

    def parse(
        self,
        *,
        response: Response,
        rows_rule: str | None = '//tr',
        row_start: int | None = 0,
        row_end: int | None = -1,
        cells_rule: str | None = 'td',
        field_rule: str | None = 'text()',
    ) -> list[str]:
        """"""
        raise NotImplementedError('`parse()` must be implemented.')


def parser_factory(name: str, spider: Spider) -> BaseParser:
    if name == 'html':
        return HtmlParser(spider)
    else:
        raise SpiderError(f'Can not find parser name of "{name}"')


class MySpider(RedisSpider):
    name = 'my_spider'

    def make_request_from_data(self, data):
        text = data.decode(encoding=self.redis_encoding)
        params = json.loads(text)
        return Request(
            params.get('url'),
            dont_filter=True,
            meta={
                'parser_name': params.get('parser_name'),
                'parser_params': {
                    'rows_rule': params.get('rows_rule'),  # e.g. '//tbody/tr'
                    'row_start': params.get('row_start'),  # e.g. 1
                    'row_end': params.get('row_end'),  # e.g. -1
                    'cells_rule': params.get('cells_rule'),  # e.g. 'td'
                    'field_rule': params.get('field_rule'),  # e.g. 'text()'
                }
            }
        )

    def parse(self, response: Response, **kwargs):
        name = response.meta.get('parser_name')
        params = response.meta.get('parser_params')
        parser = parser_factory(name, self)
        items = parser.parse(response=response, **params)
        for item in items:
            yield item
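
For reference, here is a minimal sketch of how a payload for this spider could be queued, assuming a local Redis instance and scrapy-redis's default key pattern '<spider name>:start_urls'; the URL and rule values are illustrative only:

import json

import redis

r = redis.Redis()  # assumes Redis on localhost:6379

payload = {
    'url': 'https://example.com/some-table-page',  # illustrative URL
    'parser_name': 'html',
    'rows_rule': '//tbody/tr',
    'row_start': 1,
    'row_end': -1,
    'cells_rule': 'td',
    'field_rule': 'text()',
}

# RedisSpider pops entries from the 'my_spider:start_urls' list by default,
# so each pushed JSON string reaches make_request_from_data() above.
r.lpush('my_spider:start_urls', json.dumps(payload))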

LuckyPigeon (Collaborator) commented

@rmax
Looks good to me. What do you think?
@Sm4o
@rmax has been a little busy recently, so if you don't mind, feel free to work on it!

rmax (Owner) commented Oct 26, 2021

Sounds perfect. Please take the lead!

@LuckyPigeon has been given permissions to the repo.


Sm4o commented Oct 29, 2021

That's exactly what I needed. Thanks a lot!

whg517 (Author) commented Nov 1, 2021

I'm still working on it...


Sm4o commented Nov 2, 2021

I'm trying to reach 1500 requests/min, but a single spider doesn't seem to be the best fit. I noticed that scrapy-redis reads URLs from redis in batches equal to the CONCURRENT_REQUESTS setting, so if I set CONCURRENT_REQUESTS=1000, scrapy-redis waits until the whole batch of 1000 is processed before fetching the next batch from redis. I feel like I'm using this tool wrong, so any tips or suggestions would be greatly appreciated.
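
As an aside, the fetch size can be decoupled from concurrency: scrapy-redis reads its batch size from the REDIS_START_URLS_BATCH_SIZE setting and only falls back to CONCURRENT_REQUESTS when that is unset. A minimal settings.py sketch, with illustrative values rather than a recommendation:

# settings.py
# How many requests Scrapy keeps in flight at once.
CONCURRENT_REQUESTS = 64

# How many entries scrapy-redis pops from redis per fetch;
# defaults to CONCURRENT_REQUESTS when unset.
REDIS_START_URLS_BATCH_SIZE = 64

The usual way to scale throughput with scrapy-redis is to run several identical spider processes against the same redis queue, each fetching its own batches.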

rmax (Owner) commented Nov 2, 2021 via email

LuckyPigeon (Collaborator) commented

@whg517
Please go ahead!
@Sm4o
Which feature are you working on?

whg517 (Author) commented Nov 3, 2021

So far, I have done:

  • Support Python 3.7-3.9 and Scrapy 2.0-2.5; all tests pass.
  • Add isort and flake8 code checks.
  • Add PEP-517 build support (see the sketch after this list).
  • Add a GitHub Actions workflow.
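
For the PEP-517 item, this is the kind of declaration involved; a minimal pyproject.toml sketch, where the backend choice and version pins are illustrative assumptions rather than the project's actual configuration:

# pyproject.toml -- illustrative PEP-517 build declaration
[build-system]
requires = ["setuptools>=40.8.0", "wheel"]
build-backend = "setuptools.build_meta"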

Now I'm having some problems with the documentation. I am a native Chinese speaker and my written English is not strong, so I would like someone to take over the documentation work.

I think the current documentation is too simple; perhaps we need to rearrange its structure and content.

LuckyPigeon (Collaborator) commented Nov 3, 2021

@whg517
Thanks for your contribution! Please file a PR for each feature, and I will review them.
Chinese documentation is also welcome; we can rearrange the structure and content in the Chinese version first,
and I can do the translation work.

whg517 (Author) commented Jan 5, 2022

Hello everyone, I will reorganize the features later and try to open a PR for each new feature. As the new year begins I have many plans, and I will get to them as soon as possible.

rmax (Owner) commented Jan 5, 2022

@whg517 thanks for the initiative. Could you also include the pros and cons of moving the project to scrapy-plugins org?

LuckyPigeon (Collaborator) commented

@whg517 any progress?
