Releases: code4craft/webmagic

WebMagic-0.10.0

05 Dec 04:51

Full Changelog: WebMagic-0.9.1...WebMagic-0.10.0

WebMagic-0.9.1

10 Sep 09:19

What's Changed

  • [Snyk] Security upgrade net.sourceforge.htmlcleaner:htmlcleaner from 2.26 to 2.29 by @code4craft in #1126
  • fix(sec): upgrade net.sourceforge.htmlcleaner:htmlcleaner to by @dack-su in #1127

Full Changelog: WebMagic-0.9.0...WebMagic-0.9.1

WebMagic-0.9.0

22 Jun 03:25

What's Changed

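  • Fix HtmlCleaner failing to parse tr and td tags correctly by @hooyantsing in #1107
  • Add several new APIs to the webmagic-saxon component, making it more elegant, flexible, and powerful by @hooyantsing in #1108
  • Bypass host verification for HTTPS by @Tanky-Zhang in #1112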
  • [Snyk] Security upgrade com.jayway.jsonpath:json-path from 2.7.0 to 2.8.0 by @snyk-bot in #1114
  • [Snyk] Security upgrade com.google.guava:guava from 31.1-jre to 32.0.0-jre by @code4craft in #1119

Full Changelog: WebMagic-0.8.0...WebMagic-0.9.0

WebMagic-0.8.0

23 Nov 16:52

WebMagic-0.7.6

24 Oct 14:59

What's Changed

  • perfect Spider.run to avoid some rare concurrent issue, change the Sp… by @carl-don-it in #1033
  • [Snyk] Security upgrade org.jruby:jruby from 9.2.14.0 to 9.3.0.0 by @snyk-bot in #1036
  • [Snyk] Security upgrade com.jayway.jsonpath:json-path from 2.5.0 to 2.6.0 by @snyk-bot in #1048
  • [Snyk] Security upgrade com.github.dreamhead:moco-core from 1.2.0 to 1.3.0 by @snyk-bot in #1050
  • [Snyk] Security upgrade net.sourceforge.htmlcleaner:htmlcleaner from 2.9 to 2.26 by @snyk-bot in #1056
  • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.0-rc1 to 2.13.0 by @snyk-bot in #1060
  • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.0 to 2.13.2 by @snyk-bot in #1063
  • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.2 to 2.13.2.1 by @snyk-bot in #1064
  • [Snyk] Security upgrade org.jetbrains.kotlin:kotlin-stdlib from 1.1.2-2 to 1.6.0 by @snyk-bot in #1065
  • change dependency versions into properties by @davidhsing in #1067
  • [Snyk] Security upgrade com.alibaba:fastjson from 1.2.75 to 1.2.83 by @code4craft in #1071
  • [Snyk] Security upgrade us.codecraft:xsoup from 0.3.2 to 0.3.4 by @snyk-bot in #1072
  • [Snyk] Security upgrade org.seleniumhq.selenium:selenium-java from 3.141.59 to 4.0.0 by @code4craft in #1075
  • Common the downloader status process and pass error information when … by @vioao in #1082
  • Revert "Common the downloader status process and pass error information when …" by @sutra in #1083
  • Common downloader error process by @vioao in #1085
  • Enhance Jsoup could parse tr td tag directly by @vioao in #1086
  • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.2.1 to 2.13.4 by @snyk-bot in #1087
  • [Snyk] Security upgrade us.codecraft:xsoup from 0.3.4 to 0.3.6 by @snyk-bot in #1088
  • ":" is an illegal character and cannot be used in a file name by @jialigit in #762
  • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.4 to 2.13.4.2 by @code4craft in #1089

Full Changelog: WebMagic-0.7.5...WebMagic-0.7.6

WebMagic-0.7.5

02 Sep 11:35

What's Changed

  • [Fix] #698 Fix Request losing its extra information when Redis is used by @jianyun8023 in #702
  • [Fix] Correct a wrong method name by @jianyun8023 in #703
  • fix the typo by @aristotll in #658
  • [Snyk] Fix for 2 vulnerabilities by @snyk-bot in #895
  • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in #897
  • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in #899
  • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in #908
  • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in #911
  • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in #914
  • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in #920
  • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in #922
  • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in #923
  • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in #925
  • [Snyk] Fix for 15 vulnerable dependencies by @snyk-bot in #889
  • [Snyk] Fix for 1 vulnerable dependencies by @snyk-bot in #882
  • Add unit tests for us.codecraft.webmagic.utils.NumberUtils by @ThomasPerkins1123 in #885
  • build: manage plugin version & remove build WARNING by @leeyazhou in #939
  • [Snyk] Security upgrade com.alibaba:fastjson from 1.2.68 to 1.2.69 by @snyk-bot in #946
  • [Snyk] Security upgrade org.apache.httpcomponents:httpclient from 4.5.12 to 4.5.13 by @snyk-bot in #955
  • [Snyk] Security upgrade junit:junit from 4.13 to 4.13.1 by @snyk-bot in #957
  • [Snyk] Security upgrade com.google.guava:guava from 29.0-jre to 30.0-android by @snyk-bot in #959
  • Subtasks can use different downloaders... by @itranlin in #974
  • Mainly additions and changes to the proxy functionality by @yaoqiangpersonal in #976
  • Remove useless imports to fix build. by @sutra in #977
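  • Add validation logic to getPagePerSecond() in SpiderStatus to avoid null pointer exceptions and division by zero. by @yqia182 in #993
  • Add a get method for the List property so that subclasses of SpiderMonitor can access it. by @thebirdandfish in #1000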
  • [Snyk] Security upgrade com.github.dreamhead:moco-core from 1.1.0 to 1.2.0 by @snyk-bot in #1011
  • Add an example of resumable crawling by @linweisen in #1013
  • Update to Jedis 3.6.0 by @gkorland in #1025
  • [Snyk] Security upgrade com.jayway.jsonpath:json-path from 2.4.0 to 2.6.0 by @snyk-bot in #1029

Full Changelog: WebMagic-0.7.3...WebMagic-0.7.5

WebMagic-0.7.3

30 Jul 07:43

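This release adds several features to the Downloader module.

#609 Fix a bug where HttpRequestBody could not be deserialized because it had no default constructor.
#631 The static factory methods of HttpRequestBody no longer throw the checked exception UnsupportedEncodingException.

#571 Add a bytes property to the Page object for retrieving binary data. When downloading a purely binary page, call request.setBinaryContent(true) so that the binary content is not converted to a String, which reduces overhead.

#629 HttpUriRequestConverter now automatically escapes or filters some characters that would cause URI exceptions.

#610 Automatic encoding detection now recognizes the case where charset in Content-Type is uppercase.
#627 Support setting the page encoding on an individual Request, to handle sites that use more than one encoding.
#613 Add a charset property to the Page object. Its value is the charset set on the request/site, or the automatically detected charset when none is set.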
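
The charset handling described in #610, #627 and #613 can be illustrated with a small standalone sketch. This is not WebMagic's code: the class name ContentTypeCharsetSketch and the methods detect and resolve are hypothetical, and the precedence of a request-level charset over a site-level one is an assumption based on the wording of #627.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of WebMagic.
class ContentTypeCharsetSketch {

    // CASE_INSENSITIVE lets "charset=", "Charset=" and "CHARSET=" all match.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value, or null if absent.
    static String detect(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher matcher = CHARSET.matcher(contentType);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Assumed precedence: request setting, then site setting, then detection.
    static String resolve(String requestCharset, String siteCharset, String contentType) {
        if (requestCharset != null) {
            return requestCharset;
        }
        if (siteCharset != null) {
            return siteCharset;
        }
        return detect(contentType);
    }

    public static void main(String[] args) {
        System.out.println(detect("text/html; CHARSET=GBK"));
        System.out.println(resolve(null, null, "text/html; charset=utf-8"));
        System.out.println(resolve("gb2312", "utf-8", "text/html; charset=utf-8"));
    }
}
```

Only when neither the request nor the site names a charset does the sketch fall back to the header, which mirrors the "auto-detected when none is set" behaviour described for the Page charset property.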

#606 Upgrade jsonpath to 2.4.0
#608 Upgrade jsoup to 1.10.3

WebMagic-0.7.2

17 Jun 08:15

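This release fixes several bugs in versions 0.7.0-0.7.1.

  1. #594 HttpRequestBody in Request now implements the serialization interface.
  2. #596 Fix proxy authentication not working correctly since 0.7.0.
  3. #601 Improve the error message reported when the page status is abnormal.
  4. #605 Fix onSuccess and onError being called repeatedly since 0.7.0, which caused monitoring errors.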

WebMagic-0.7.1

04 Jun 10:28

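This release includes several major bug fixes, as well as improvements to some long-standing issues.

  • Fix the bug introduced in 0.7.0 that made RedisScheduler unusable. #583
  • In annotation mode, JsonPath now defaults to RawText as its source, so <html> tags are no longer added automatically at the head and tail, which used to make parsing fail. #589
  • Earlier versions of RegexSelector matched group 1 by default, and wrapped regexes that had no capture group in parentheses so that content could be extracted uniformly. From 0.7.1 the regex is no longer modified; instead, the selector matches either group 0 or group 1, see #559. The new approach reduces the chance of errors in some special usages, such as zero-width assertions (#556).
  • Refactor the ObjectFormatter part and fix the bug where ObjectFormatter could not initialize its parameters. #570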
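
The group 0 versus group 1 choice described for RegexSelector can be sketched with plain java.util.regex. This is an illustration of the idea only, not the RegexSelector source; the class RegexGroupSketch and the method select are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of WebMagic.
class RegexGroupSketch {

    // Leaves the regex untouched and picks the group to extract:
    // group 1 if the pattern declares a capture group, else group 0 (the whole match).
    static String select(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        if (!matcher.find()) {
            return null;
        }
        int group = matcher.groupCount() >= 1 ? 1 : 0;
        return matcher.group(group);
    }

    public static void main(String[] args) {
        // One capture group: the captured digits are extracted.
        System.out.println(select("id=(\\d+)", "page?id=42"));       // prints 42
        // Zero-width lookahead, no capture group: the whole match is extracted.
        System.out.println(select("\\d+(?=px)", "width: 120px"));    // prints 120
    }
}
```

Because the pattern is compiled exactly as written, constructs such as lookaheads behave the way the regex author intended.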

WebMagic-0.7.0

29 May 06:14

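This release rewrites HttpClientDownloader, improves support for POST and other HTTP methods, and rewrites the proxy API to make it simpler and easier to extend.

POST support

  • New POST API with support for various kinds of RequestBody #513
    Request request = new Request("http://xxx/path");
    request.setMethod(HttpConstant.Method.POST);
    request.setRequestBody(HttpRequestBody.json("{\"id\":1}","utf-8"));
  • Removed the old approach of setting NameValuePair in request.extra
  • POST requests are no longer deduplicated #484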
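
The effect of #484 can be pictured with a minimal duplicate check that lets every POST through. This is a sketch under assumptions, not WebMagic's scheduler code: the class PostAwareDedupSketch and its isDuplicate method are hypothetical, and deduplicating on the URL string alone is a simplification.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not part of WebMagic.
class PostAwareDedupSketch {

    // Thread-safe set of URLs that have already been scheduled.
    private final Set<String> seenUrls = ConcurrentHashMap.newKeySet();

    // POST requests are never reported as duplicates, since the same URL
    // may be posted with different bodies. Other methods are deduplicated by URL.
    boolean isDuplicate(String method, String url) {
        if ("POST".equalsIgnoreCase(method)) {
            return false;
        }
        return !seenUrls.add(url);
    }

    public static void main(String[] args) {
        PostAwareDedupSketch dedup = new PostAwareDedupSketch();
        System.out.println(dedup.isDuplicate("GET", "http://xxx/path"));   // prints false
        System.out.println(dedup.isDuplicate("GET", "http://xxx/path"));   // prints true
        System.out.println(dedup.isDuplicate("POST", "http://xxx/path"));  // prints false
        System.out.println(dedup.isDuplicate("POST", "http://xxx/path"));  // prints false
    }
}
```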

Proxy support

  • New proxy API, ProxyProvider, which can be freely extended
  • The default implementation, SimpleProxyProvider, is a simple round-robin implementation to which any number of proxies can be added.
    HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
    SimpleProxyProvider proxyProvider = SimpleProxyProvider.from(new Proxy("127.0.0.1", 1087), new Proxy("127.0.0.1", 1088));
    httpClientDownloader.setProxyProvider(proxyProvider);
  • Removed setProxy and the other proxy settings from Site; proxy configuration now lives in HttpClientDownloader.
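
For readers unfamiliar with round-robin selection, the idea behind a provider like SimpleProxyProvider can be sketched as follows. This is a standalone illustration, not the library's implementation: RoundRobinSketch is a hypothetical class, and proxies are represented as plain "host:port" strings rather than Proxy objects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not part of WebMagic.
class RoundRobinSketch {

    private final List<String> proxies;
    // Starts at -1 so the first increment yields index 0.
    private final AtomicInteger pointer = new AtomicInteger(-1);

    RoundRobinSketch(String... proxies) {
        if (proxies.length == 0) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = Arrays.asList(proxies.clone());
    }

    // Returns the proxies in order, wrapping around at the end of the list.
    // floorMod keeps the index non-negative even after the counter overflows.
    String next() {
        int index = Math.floorMod(pointer.incrementAndGet(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch provider = new RoundRobinSketch("127.0.0.1:1087", "127.0.0.1:1088");
        System.out.println(provider.next());  // prints 127.0.0.1:1087
        System.out.println(provider.next());  // prints 127.0.0.1:1088
        System.out.println(provider.next());  // prints 127.0.0.1:1087
    }
}
```

An atomic counter is used so that several downloader threads can ask for a proxy concurrently without locking.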

New SimpleHttpClient

  • For simple one-off downloading and parsing, SimpleHttpClient is sufficient
    SimpleHttpClient simpleHttpClient = new SimpleHttpClient();
    GithubRepo model = simpleHttpClient.get("https://github.com/code4craft/webmagic",GithubRepo.class);

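Other changes

  • Add the status code and HTTP header information to Page #406
  • Support setting HTTP headers and cookies at the Request level
  • Remove Site.addStartRequest(); use Spider.addStartRequest() instead #494
  • Major refactoring of HttpClientDownloader: Request conversion is abstracted into HttpUriRequestConverter (implementations that previously extended HttpClientDownloader may need corresponding changes) #524
  • Move the CycleRetry and statusCode checks from Downloader to Spider #527
  • Download failure is now determined by Page.isDownloadSuccess rather than by whether the Page object itself is null
  • Add an option to PageModel to not discover new URLs #575
  • Add a disableCookieManagement property to Site, for use when cookies are not wanted #577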