keep() instead of addToResult() and sub crawlers #142

Merged: otsch merged 5 commits into main from feature/keep-and-sub-crawling-procedures on Jun 5, 2024

otsch (Member) commented on Mar 26, 2024

This PR adds the methods `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()` and `Step::keepInputAs()` as simpler alternatives to `Step::addToResult()`, `Step::addLaterToResult()` and `Step::keepInputData()`, which are all deprecated now. The new keep methods add data to a keep array in the IO objects. Because no `Result` object is created and potentially shared between many child outputs, the new keep functionality is less complex, and there is no need for something like `addLaterToResult()`. Kept properties can also be used with `useInputKey()`, which is pretty handy.
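
To illustrate the keep array idea, here is a minimal, self-contained sketch. It does not use the library: `SketchIo` and `makeOutput()` are hypothetical stand-ins, and the real IO classes and keep methods may differ in detail. It only shows how kept data can travel from parent to child outputs without a shared `Result` object.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the library's IO objects (not the real class).
final class SketchIo
{
    /** @param array<string, mixed> $keep Data kept so far along the chain of steps. */
    public function __construct(
        public readonly mixed $value,
        public readonly array $keep = [],
    ) {}
}

/**
 * Builds a child output from a parent IO object. The selected keys of the
 * step's output are merged into the kept data inherited from the parent.
 *
 * @param array<string, mixed> $output
 * @param list<string> $keepKeys
 */
function makeOutput(SketchIo $parent, array $output, array $keepKeys): SketchIo
{
    $kept = array_intersect_key($output, array_flip($keepKeys));

    return new SketchIo($output, array_merge($parent->keep, $kept));
}

$input = new SketchIo('https://www.example.com/authors/1');
$author = makeOutput($input, ['name' => 'Jane Doe', 'bio' => 'Writes books.'], ['name']);
$book = makeOutput($author, ['title' => 'Some Book', 'isbn' => '123'], ['title']);

// $book->keep now holds both 'name' (from the author step) and 'title'.
var_export($book->keep);
```

Each output carries its own copy of the kept data, so nothing has to be added "later" to an object shared between sibling outputs.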

Another cool new feature is sub crawlers. Any step can now create a sub crawler to fill a property. Example: you have a page about an author with multiple links to detail pages about their books. You can select those links and let a sub crawler fill the author's `books` property with data from the book detail pages.
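
The author and books example can be modelled roughly as follows. This is a conceptual sketch only: `fillWithSubCrawler()` and the `$pages` array are hypothetical and stand in for the sub crawler and for real HTTP responses; the method the library actually exposes for this is not shown here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not the library's API): replaces $output[$property],
 * a list of inputs such as URLs, with the results of running $subCrawler
 * on each of them.
 *
 * @param array<string, mixed> $output
 * @param callable(string): array<string, mixed> $subCrawler
 * @return array<string, mixed>
 */
function fillWithSubCrawler(array $output, string $property, callable $subCrawler): array
{
    $output[$property] = array_map($subCrawler, (array) $output[$property]);

    return $output;
}

// Fake book detail "pages" so the sketch runs without network access.
$pages = [
    '/books/1' => ['title' => 'First Book', 'year' => 2001],
    '/books/2' => ['title' => 'Second Book', 'year' => 2005],
];

// Output of the author page step: the books property only contains links.
$author = ['name' => 'Jane Doe', 'books' => ['/books/1', '/books/2']];

// The "sub crawler" turns each link into the data from the detail page.
$author = fillWithSubCrawler(
    $author,
    'books',
    fn (string $url): array => $pages[$url],
);

var_export($author);
```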

This PR also introduces a new `Step::outputType()` method, which returns whether a step yields outputs that are associative arrays (or objects), scalar values, or potentially both (mixed). This helps to avoid potentially critical problems during a crawler run, because steps can be validated before the run, throwing an exception (or logging error messages) when something does not fit.
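
A pre-run check based on the output type could look roughly like the sketch below. `SketchOutputType` and `validateKeepCall()` are hypothetical names, and the policy shown (exception for scalar outputs, warning message for mixed outputs) is an assumption made for illustration, not necessarily what the library does.

```php
<?php

declare(strict_types=1);

// Hypothetical enum for this sketch; it mirrors the three cases named above.
enum SketchOutputType
{
    case AssociativeArrayOrObject;
    case Scalar;
    case Mixed;
}

/**
 * Pre-run check for a step that is asked to keep named keys from its output.
 * Assumed policy: scalar outputs are a hard error, mixed outputs only a warning.
 *
 * @return string|null A warning message to log, or null if everything is fine.
 */
function validateKeepCall(SketchOutputType $type, string $stepName): ?string
{
    if ($type === SketchOutputType::Scalar) {
        throw new InvalidArgumentException(
            "{$stepName} yields scalar outputs, so there are no keys to keep.",
        );
    }

    if ($type === SketchOutputType::Mixed) {
        return "{$stepName} may yield scalar outputs; keeping keys could fail during the run.";
    }

    return null;
}

// Mixed output: the check passes, but returns a message that can be logged.
$warning = validateKeepCall(SketchOutputType::Mixed, 'Step 2');

// Scalar output: the problem is reported before any crawling starts.
try {
    validateKeepCall(SketchOutputType::Scalar, 'Step 3');
    $error = null;
} catch (InvalidArgumentException $exception) {
    $error = $exception->getMessage();
}
```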

Make reading a compressed cache file work, even when `useCompression` was
not called on the `FileCache` instance.
Add trailing commas in multi-line function calls.
The new method is `outputType()`. The method `outputKey()` already existed.
otsch merged commit 76c0f3c into main on Jun 5, 2024
8 checks passed
otsch deleted the feature/keep-and-sub-crawling-procedures branch on June 5, 2024 00:09