Releases · crwlrsoft/crawler

05 Jun 00:10

otsch

v1.8.0 (Latest)

76c0f3c

Added

New methods Step::keep() and Step::keepAs(), as well as Step::keepFromInput() and Step::keepInputAs(), as alternatives to Step::addToResult() (or Step::addLaterToResult()). The keep() method can be called without any argument to keep all of the output data, with a string to keep a certain key, or with an array to keep a list of keys. If the step yields scalar output values (not an associative array or an object with keys), you need to use the keepAs() method with the key that the output value should have in the kept data. The methods keepFromInput() and keepInputAs() work the same way, but use the input (not the output) that the step receives. They are most likely only needed with a first step, to keep data from initial inputs (or in a sub crawler, see below). Kept properties can also be accessed with the Step::useInputKey() method, so you can easily reuse properties from multiple steps ago as input.
New method Step::outputType() with a default implementation returning StepOutputType::Mixed. Please consider implementing this method yourself in all your custom steps, because it is going to be required in v2 of the library. It allows detecting (potential) problems in crawling procedures immediately when starting a run, instead of failing after the run has already been going for a while.
New method Step::subCrawlerFor(), which allows filling output properties from an actual full child crawling procedure. As the first argument, you give it a key from the step's output, whose value the child crawler uses as input(s). As the second argument, you need to provide a Closure that receives a clone of the current Crawler without steps and with initial inputs set from the current output. In the Closure you then define the crawling procedure by adding steps as you're used to, and return it. This makes it possible to build nested output data, scraped from different (sub-)pages, more flexibly and with less complexity than with the usual linear crawling procedure and Step::addToResult().
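The selection rules described for keep() and keepAs() can be modeled in a few lines of plain PHP. This is only a sketch of the behavior as described above, not the library's implementation: the function selectKept() and its parameters are hypothetical and do not exist in the library.

```php
<?php

// Hypothetical model of the keep()/keepAs() selection rules.
// Not part of the crwlr/crawler library.
function selectKept(mixed $output, string|array|null $keys = null, ?string $keepAs = null): array
{
    if (!is_array($output)) {
        // Scalar output has no keys, so a target key is required (as with keepAs()).
        if ($keepAs === null) {
            throw new InvalidArgumentException('Scalar output needs a key, as with keepAs().');
        }

        return [$keepAs => $output];
    }

    if ($keys === null) {
        // No argument: keep all of the output data.
        return $output;
    }

    // A single key (string) or a list of keys (array): keep only those.
    return array_intersect_key($output, array_flip((array) $keys));
}

$output = ['title' => 'Foo', 'author' => 'Bar', 'year' => 2024];

var_dump(selectKept($output));                    // all three keys
var_dump(selectKept($output, 'title'));           // only "title"
var_dump(selectKept($output, ['title', 'year'])); // "title" and "year"
var_dump(selectKept('Foo', keepAs: 'title'));     // scalar output stored under "title"
```

The scalar case throws when no key is given, which mirrors the rule that keepAs() is needed for outputs without keys.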

Deprecated

The Step::addToResult(), Step::addLaterToResult() and Step::keepInputData() methods. Please use the new keep methods instead. This can cause some migration work for v2, because the add-to-result methods in particular are pretty central functionality, but the new "keep" methodology (plus the new sub crawler feature) will make a lot of things easier and less complex, and the library will most likely work more efficiently in v2.

Fixed

When a cache file was generated with compression and you try to read it with a FileCache instance without compression enabled, it now works as well. When unserializing the file content fails, it tries decoding the string first and then unserializes the decoded content.
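The fallback order can be sketched roughly as follows. This is an illustration under assumptions, not the FileCache code: the helper readCacheContent() is hypothetical, and it assumes the compressed files were written with gzencode().

```php
<?php

// Hypothetical helper illustrating the fallback order; not the library's FileCache code.
// Assumption: compressed cache files were written with gzencode().
function readCacheContent(string $raw): mixed
{
    // First try to unserialize the raw file content as-is.
    $value = @unserialize($raw);

    // unserialize() returns false on failure, but also for a stored boolean false.
    if ($value !== false || $raw === serialize(false)) {
        return $value;
    }

    // Fallback: the content may be compressed, so decode it before unserializing.
    $decoded = @gzdecode($raw);

    if ($decoded === false) {
        throw new RuntimeException('Cache content is neither serialized nor compressed data.');
    }

    return unserialize($decoded);
}

$plain = serialize(['url' => 'https://www.example.com']);
$compressed = gzencode($plain);

// Both variants yield the same array.
var_dump(readCacheContent($plain) === readCacheContent($compressed)); // bool(true)
```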

19 Mar 11:40

otsch

v1.7.2

eafa3b6

Fixed

When the useInputKey() method is used on a step and the defined key does not exist in the input, it now logs a warning and does not invoke the step, instead of throwing an Exception.

11 Mar 12:46

otsch

v1.7.1

f2214f8

Fixed

A PHP error that happened when the loader returns null for the initial request in the Http::crawl() step.

04 Mar 13:04

otsch

v1.7.0

7e58744

Added

Allow getting the whole decoded JSON as an array with the new Json::all(). Also allow getting the whole decoded JSON inside a mapping when using Json::get(), by using either an empty string or * as the target. Example: Json::get(['all' => '*']). The * target only works when there is no key * in the decoded data.

Fixed

Make the Json step work with responses loaded by a headless browser. If decoding the input string fails, it now checks if it could be HTML. If that's the case, it extracts the text content of the <body> and tries to decode this instead.
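A headless browser typically wraps a raw JSON response in an HTML document, which is why plain decoding fails. The fallback can be sketched like this. The function decodeJsonOrHtmlBody() is hypothetical and only approximates the described behavior with a regular expression and strip_tags(); the library may extract the body text differently.

```php
<?php

// Hypothetical helper, not the library's Json step.
function decodeJsonOrHtmlBody(string $input): ?array
{
    // First attempt: the input is plain JSON.
    $data = json_decode($input, true);

    if (is_array($data)) {
        return $data;
    }

    // Could it be HTML? Look for a <body> element wrapping the payload.
    if (preg_match('/<body[^>]*>(.*)<\/body>/is', $input, $matches) !== 1) {
        return null;
    }

    // Reduce the body to its text content and try decoding again.
    $text = html_entity_decode(trim(strip_tags($matches[1])));
    $data = json_decode($text, true);

    return is_array($data) ? $data : null;
}

$html = '<html><head></head><body><pre>{"foo": "bar"}</pre></body></html>';

var_dump(decodeJsonOrHtmlBody($html) === ['foo' => 'bar']); // bool(true)
```

Note that strip_tags() would also mangle JSON string values containing angle brackets, so this sketch is only suitable for simple payloads.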

26 Feb 22:33

otsch

v1.6.2

95e5bdf

Fixed

When using HttpLoader::cacheOnlyWhereUrl() and a request was redirected (maybe even multiple times), previously all URLs in the chain had to match the filter rule. As this isn't really practical, now only one of the URLs has to match the rule.
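The difference between the old and the new rule is "all URLs match" versus "any URL matches". A minimal sketch of the new rule, using a hypothetical chainMatches() helper and a plain callable in place of the library's filter objects:

```php
<?php

// Hypothetical helper: true when at least one URL in the redirect chain passes the filter.
// Not part of the crwlr/crawler library.
function chainMatches(array $urls, callable $filter): bool
{
    foreach ($urls as $url) {
        if ($filter($url)) {
            return true;
        }
    }

    return false;
}

// A request that was redirected twice before reaching its final URL.
$chain = [
    'http://example.com/a',
    'https://example.com/a',
    'https://www.example.com/a',
];

// Example filter rule: only cache responses from the host www.example.com.
$onlyWww = fn (string $url): bool => parse_url($url, PHP_URL_HOST) === 'www.example.com';

var_dump(chainMatches($chain, $onlyWww)); // bool(true)
```

Under the previous "all must match" rule, this chain would not have been cached, because the first two URLs use a different host.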

16 Feb 22:28

otsch

v1.6.1

7a1633d

Changed

Make method HttpLoader::addToCache() public, so steps can update a cached response with an extended version.

13 Feb 02:04

otsch

v1.6.0

33b49fb

Added

Enable dot notation in Step::addToResult(), so you can get data from nested output, like: $step->addToResult(['url' => 'response.url', 'status' => 'response.status', 'foo' => 'bar']).
When a step adds output properties to the result, and the output contains objects, it tries to serialize those objects to arrays, by calling __serialize(). If you want an object to be serialized differently for that purpose, you can define a toArrayForAddToResult() method in that class. When that method exists, it's preferred to the __serialize() method.
Implemented the above-mentioned toArrayForAddToResult() method in the RespondedRequest class, so on every step that somehow yields a RespondedRequest object, you can use the keys url, uri, status, headers and body with the addToResult() method. Previously this only worked for Http steps, because they define output key aliases (HttpBase::outputKeyAliases()). Now, in combination with the ability to use dot notation when adding data to the result, if your custom step returns nested output like ['response' => RespondedRequest, 'foo' => 'bar'], you can add response data to the result like this: $step->addToResult(['url' => 'response.url', 'body' => 'response.body']).
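How dot notation and the object-to-array conversion interact can be illustrated with a small standalone sketch. Everything here is hypothetical and only models the behavior described above: FakeResponse stands in for a class like RespondedRequest, and toArray(), getByDotNotation() and mapResult() are illustrative helpers, not library functions.

```php
<?php

// Hypothetical stand-in for a response object such as RespondedRequest.
final class FakeResponse
{
    public function __construct(private string $url, private int $status) {}

    public function __serialize(): array
    {
        return ['url' => $this->url, 'status' => $this->status, 'internal' => true];
    }

    // Preferred over __serialize() when data is added to the result.
    public function toArrayForAddToResult(): array
    {
        return ['url' => $this->url, 'status' => $this->status];
    }
}

// Convert objects to arrays, preferring toArrayForAddToResult() over __serialize().
function toArray(mixed $value): mixed
{
    if (is_object($value)) {
        if (method_exists($value, 'toArrayForAddToResult')) {
            return $value->toArrayForAddToResult();
        }

        if (method_exists($value, '__serialize')) {
            return $value->__serialize();
        }
    }

    return $value;
}

// Resolve a path like "response.url" against nested output data.
function getByDotNotation(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        $current = toArray($current);

        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }

        $current = $current[$segment];
    }

    return $current;
}

// Build a result array from a mapping of result key => dot notation path.
function mapResult(array $output, array $mapping): array
{
    $result = [];

    foreach ($mapping as $resultKey => $path) {
        $result[$resultKey] = getByDotNotation($output, $path);
    }

    return $result;
}

$output = ['response' => new FakeResponse('https://www.example.com', 200), 'foo' => 'bar'];

var_dump(mapResult($output, ['url' => 'response.url', 'status' => 'response.status', 'foo' => 'foo']));
```

Because toArrayForAddToResult() takes precedence, the "internal" key from __serialize() is not reachable via dot notation in this sketch.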

Fixed

Improved the timing of when a store (Store class instance) is called by the crawler with a final crawling result. When a crawling step initiates a crawling result (so, addToResult() was called on the step instance), the crawler has to wait for all child outputs (resulting from one step input) before it calls the store, because the child outputs can all add data to the same final result object. Previously, the crawler waited not only for all child outputs starting from the step where addToResult() was called, but for all children of one initial crawler input. With this change, in a lot of cases, the store will be called earlier with finished Result objects and memory usage will be lower.

07 Feb 14:49

otsch

v1.5.3

882bc1e

Fixed

Merge HttpBaseLoader back to HttpLoader. It's probably not a good idea to have multiple loaders. At least not multiple loaders just for HTTP. It should be enough to publicly expose the HeadlessBrowserLoaderHelper via HttpLoader::browserHelper() for the extension steps. But keep the HttpBase step, to share the general HTTP functionality implemented there.

07 Feb 10:08

otsch

v1.5.2

9f04c17

Fixed

Issue in GetUrlsFromSitemap (Sitemap::getUrlsFromSitemap()) step when XML content has no line breaks.

06 Feb 22:40

otsch

v1.5.1

9152d00

Fixed

To make it easier to build a separate headless browser loader (in an extension package), extract the most basic HTTP loader functionality to a new HttpBaseLoader and the functionality that is important for the headless browser loader to a new HeadlessBrowserLoaderHelper. Further, also share functionality from the Http steps via a new abstract HttpBase step. It's considered a fix, because there's no new functionality, just refactoring of existing code for better extensibility.
