keep() instead of addToResult() and sub crawlers #142

otsch · 2024-03-26T17:58:35Z

This PR adds `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()` and `Step::keepInputAs()` as simpler alternatives to `Step::addToResult()`, `Step::addLaterToResult()` and `Step::keepInputData()`, which are all deprecated now. The new keep methods add data to a `keep` array in IO objects. Because no Result object is created and potentially shared between many child outputs, the new keep functionality is less complex, and there is no need for something like `addLaterToResult()`. Kept properties can also be used with `useInputKey()`, which is pretty handy.
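To illustrate the idea of a per-object keep array, here is a minimal, self-contained sketch. It is not the library's implementation: the `Io` class, its `child()` method and the `$keepKeys` parameter are hypothetical names made up for this example. It only shows why passing kept data along from parent to child output removes the need for a shared Result object.

```php
<?php

declare(strict_types=1);

// Hypothetical, simplified stand-in for the library's IO objects:
// each one carries its own value plus an array of kept data.
final class Io
{
    /** @param array<string, mixed> $keep */
    public function __construct(
        public mixed $value,
        public array $keep = [],
    ) {}

    /**
     * Create a child output that inherits everything kept so far.
     * The keys listed in $keepKeys are copied from the (array) value
     * into the child's keep array as well.
     *
     * @param list<string> $keepKeys
     */
    public function child(mixed $value, array $keepKeys = []): self
    {
        $keep = $this->keep;

        foreach ($keepKeys as $key) {
            if (is_array($value) && array_key_exists($key, $value)) {
                $keep[$key] = $value[$key];
            }
        }

        return new self($value, $keep);
    }
}

// Three "steps" in a row: initial input, author page, book page.
$input  = new Io('https://www.example.com/authors/1');
$author = $input->child(['name' => 'Jane Doe', 'bookUrl' => '/books/1'], ['name']);
$book   = $author->child(['title' => 'Some Book'], ['title']);

// $book->keep now contains the author's name and the book's title.
```

Every output owns a copy of the data kept so far, so nothing has to be added "later" to an object shared by sibling outputs.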

Another cool new feature is sub crawlers. Any step can now create a sub crawler to fill a property. Example: you have a page about an author with multiple links to detail pages about their books. You can select those links and let a sub crawler fill the author's `books` property with data from the book detail pages.
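The author/books example can be modelled roughly like this. The sketch is plain PHP and runs offline: `PAGES`, `fillWithSubCrawler()` and the callable signature are assumptions for illustration only and do not reflect the library's actual API.

```php
<?php

declare(strict_types=1);

// Fake "pages" standing in for HTTP responses, so the sketch runs offline.
const PAGES = [
    '/books/1' => ['title' => 'First Book', 'year' => 2019],
    '/books/2' => ['title' => 'Second Book', 'year' => 2022],
];

/**
 * Hypothetical helper: replace the URLs found under $property with whatever
 * the sub crawler returns for each of them.
 *
 * @param array<string, mixed> $output
 * @param callable(string): array<string, mixed> $subCrawler
 * @return array<string, mixed>
 */
function fillWithSubCrawler(array $output, string $property, callable $subCrawler): array
{
    $urls = (array) ($output[$property] ?? []);

    $output[$property] = array_map($subCrawler, array_values($urls));

    return $output;
}

// Output of the step that scraped the author page: the books property
// only contains links to the detail pages so far.
$author = ['name' => 'Jane Doe', 'books' => ['/books/1', '/books/2']];

// The "sub crawler" loads each linked page and returns its data.
$author = fillWithSubCrawler(
    $author,
    'books',
    static fn (string $url): array => PAGES[$url],
);

// $author['books'] is now a list of book data arrays instead of URLs.
```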

This PR also introduces a new `Step::outputType()` method, which returns whether a step yields outputs that are associative arrays (or objects), scalar values, or potentially both (mixed). This helps to reduce potentially critical problems during a crawler run by validating the steps before the run and throwing an exception (or logging error messages).
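A rough sketch of how such a pre-run check could work is shown below. Everything in it is an assumption for illustration: the `OutputType` enum, the step description arrays and the rule "scalar output throws, mixed output only produces a message" are hypothetical and not taken from the library's code.

```php
<?php

declare(strict_types=1);

// Hypothetical enum mirroring the three kinds of output described above.
enum OutputType
{
    case Scalar;
    case AssociativeArrayOrObject;
    case Mixed;
}

/**
 * Validate step definitions before the crawler runs.
 *
 * Assumed rule: keeping output without naming a key only works for array
 * (or object) output. A step that definitely yields scalars throws, a step
 * that may yield both only produces a message that could be logged.
 *
 * @param list<array{name: string, outputType: OutputType, keepsWithoutKey: bool}> $steps
 * @return list<string> messages to log
 */
function validateBeforeRun(array $steps): array
{
    $messages = [];

    foreach ($steps as $step) {
        if (!$step['keepsWithoutKey']) {
            continue;
        }

        if ($step['outputType'] === OutputType::Scalar) {
            throw new InvalidArgumentException(
                "Step {$step['name']} yields scalar output, so a key is needed to keep it.",
            );
        }

        if ($step['outputType'] === OutputType::Mixed) {
            $messages[] = "Step {$step['name']} may yield scalar output, which can't be kept without a key.";
        }
    }

    return $messages;
}

$messages = validateBeforeRun([
    ['name' => 'extractAuthor', 'outputType' => OutputType::AssociativeArrayOrObject, 'keepsWithoutKey' => true],
    ['name' => 'customStep', 'outputType' => OutputType::Mixed, 'keepsWithoutKey' => true],
]);

// $messages holds one entry, for customStep. A step with OutputType::Scalar
// and keepsWithoutKey set to true would throw before any request is made.
```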

otsch added 5 commits March 26, 2024 18:56

Improve dealing with compression in cache files

125e136

Make reading a compressed cache file work even when `useCompression()` was not called on the `FileCache` instance.

Changes after updating PHP CS Fixer

3562374

Add trailing commas in multi-line function calls.

Fix changelog

7888186

The new method is `outputType()`. Method `outputKey()` is an existing method.

Update date in changelog

d65d83f

otsch merged commit 76c0f3c into main Jun 5, 2024
8 checks passed

otsch deleted the feature/keep-and-sub-crawling-procedures branch June 5, 2024 00:09
