
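p92 joke scraping: the regex on page 1 does not seem to match everything. It only matched 17 items, but the 菜鸟工具 (Runoob) online regex tester shows 20 matches #2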

Closed
Mathhub6 opened this issue Jan 15, 2024 · 1 comment


Mathhub6 commented Jan 15, 2024

https://xiaohua.zol.com.cn/baoxiaonannv/1.html

Code I ran:

# Import modules
import logging

# Pattern matching
import re

# HTTP requests
import requests

# Suppress warnings
logging.captureWarnings(True)
# Timing control
import time

# Set the request URL and request headers
url = "https://xiaohua.zol.com.cn/baoxiaonannv/%d.html"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
}


# Regular expression
pattern = re.compile(r'<div class="summary-text">(.*?)</div>')


duanzi = url % (1)
print(duanzi)
requests.packages.urllib3.disable_warnings()
# Fetch the page source; verify=False skips certificate verification
response = requests.get(url=duanzi, headers=header, verify=False, timeout=10).text
# Regex matching
item = pattern.findall(response, re.S)
time.sleep(2)

response
# print(item)


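With the regular expression <div class="summary-text">(.*?)</div>, all 20 items should in principle be matched, so why were these 3 missed? re.S seems to cover \n but not the tab character \t. Is that the problem? If so, how should the regular expression be changed so that \t is matched as well?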
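A small offline check may help narrow this down. In Python's `re` module, `.` already matches `\t` with or without `re.S`; the only character it skips by default is `\n`. In addition, the second positional argument of a compiled pattern's `findall` is `pos` (a start index), not a flags value, so `pattern.findall(response, re.S)` starts scanning at index 16 and never turns on dot-all matching. The sketch below uses a made-up `SAMPLE_HTML` string as a stand-in for the real page, which is an assumption and not the actual site markup.

```python
import re

# Hypothetical stand-in for the page source (not the real site HTML):
# one single-line block, one containing a tab, one spanning two lines.
SAMPLE_HTML = (
    '<html><head></head><body>\n'
    '<div class="summary-text">one line</div>\n'
    '<div class="summary-text">with\ta tab</div>\n'
    '<div class="summary-text">first line\nsecond line</div>\n'
    '</body></html>'
)

BLOCK = r'<div class="summary-text">(.*?)</div>'

plain = re.compile(BLOCK)         # "." does not match "\n"
dotall = re.compile(BLOCK, re.S)  # "." matches "\n" as well

# Same call shape as in the snippet above: re.S is read as pos=16.
as_in_issue = plain.findall(SAMPLE_HTML, re.S)
without_flag = plain.findall(SAMPLE_HTML)
with_flag = dotall.findall(SAMPLE_HTML)

print(int(re.S))     # the integer that ends up being used as pos
print(as_in_issue)   # the tab block is found, the two-line block is not
print(with_flag)     # all three blocks are found
```

If the three missing jokes on the real page contain line breaks inside the `<div>`, passing the flag at compile time, `re.compile(..., re.S)`, and calling `pattern.findall(response)` with no second argument would be the likely fix.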

Owner

sfvsfv commented Mar 9, 2024

Did you check which ones failed to match? Then compare them against the regular expression.
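One possible way to do that comparison is to run the pattern twice, once as written and once compiled with `re.S`, and print whatever only the second run finds. The helper name `unmatched_blocks` and the sample string are hypothetical, and the sketch assumes each `summary-text` block contains no nested `<div>`.

```python
import re

BLOCK = r'<div class="summary-text">(.*?)</div>'


def unmatched_blocks(html):
    """Return blocks found only when "." is allowed to match newlines."""
    strict = re.compile(BLOCK)         # same behaviour as the original pattern
    lenient = re.compile(BLOCK, re.S)  # dot-all variant
    already_matched = set(strict.findall(html))
    return [b for b in lenient.findall(html) if b not in already_matched]


# Hypothetical sample; on the real page, pass the downloaded HTML text instead.
sample = (
    '<div class="summary-text">short joke</div>\n'
    '<div class="summary-text">joke with\ta tab</div>\n'
    '<div class="summary-text">first line\nsecond line</div>\n'
)

missed = unmatched_blocks(sample)
for block in missed:
    # repr() makes \n and \t visible, which shows why a block was skipped.
    print(repr(block))
```

Printing with `repr` shows the whitespace characters explicitly, so it becomes clear whether the skipped blocks contain `\n`, `\t`, or something else.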
