Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apify.call should "survive" migration #4

Open
metalwarrior665 opened this issue Feb 13, 2020 · 21 comments
Open

Apify.call should "survive" migration #4

metalwarrior665 opened this issue Feb 13, 2020 · 21 comments

Comments

@metalwarrior665
Copy link
Member

Right now, Apify.call doesn't save any state upon migration so if the user is not careful, he will loose reference to the called runs and they will be called again.

The current solution is to save the run ID into KV store and do some waiting loop manually.

Ideally Apify.call would do this automatically. Not sure if this can be done without any extra parameters automatically or we the user would have to specify it with a string param.

@jancurn
Copy link
Member

jancurn commented Feb 13, 2020

I can't much imagine how this would work. When you restart the actor, there is no way to make the execution jump right to Apify.call(). The only way would be to let the actor repeat everything else, and then when it performs Apify.call() with same parameters (???) as before, the function would just re-use an older call and return that instead? But what if the parameters are a bit different? And would the users understand this?

@metalwarrior665
Copy link
Member Author

metalwarrior665 commented Feb 13, 2020

Sure, that is why we discuss it here.

Yes, the actor would repeat everything else.

It doesn't have be to necessarily default behavior (if we find it confusing) of Apify.call. But in 99.9% cases, you would like to continue waiting for the current call (without changing the input or settings). For example,the queue and list are also persisted so if you have a crawler and then Apify.call, after restart it will most likely just jump directly into the waiting for Apify.call to return.

@mnmkng
Copy link
Member

mnmkng commented Feb 13, 2020

Technically, we could add an idempotencyKey parameter as we have with webhooks. Apify.call() could then check if such key is present in KVS and if so, load the info from KVS, otherwise start a new call.

@mnmkng
Copy link
Member

mnmkng commented Feb 13, 2020

(yet another use for hidden fields in KVS)

@jancurn
Copy link
Member

jancurn commented Feb 13, 2020

Good idea, we can even generate the idempotencyKey from actor run ID, so that we don't even need to store it.

@mnmkng
Copy link
Member

mnmkng commented Feb 13, 2020

@jancurn But then we could call only once for a given target actor. If I Apify.call multiple times, I need to be able to store those calls separately to be able to rerun them. Or am I missing something?

@jancurn
Copy link
Member

jancurn commented Feb 13, 2020

You can construct the idempotencyKey from the call input params and run ID

@pocesar
Copy link

pocesar commented May 4, 2020

and we need this more than ever now 😬 was going o post about call idempotencyKey, but here it is... so calling call with the same key will either call another actor if it's not running or return the running instance when it was already invoked? how it would behave with calls using { waitSecs: 0 }?

and thinking further, we can have an Apify.utils.waitCall(s) function that waits for runIds, so people can use it to wait manually or it can be used for this idempotencyKey approach internally for Apify.call/callTask

https://github.com/pocesar/actor-spawn-workers/blob/15e3efe2c4969de5af61f967315d0611f1ab5a5b/src/functions.js#L11-L25

@metalwarrior665
Copy link
Member Author

I agree to start with Apify.utils.waitCall (maybe a bit different name - awaitRun?, waitForRunFinish?). We are copy/pasting this over again.

Then we can think further if/how to integrate this into Apify.call.

@jancurn
Copy link
Member

jancurn commented May 6, 2020

Guys, you know there is already an internal function called waitForRunToFinish () that does exactly that, in a slightly smarter way than active polling? See https://github.com/apifytech/apify-js/blob/f2d6e5c1f7c72be9b39d4c05912dd50b89228433/src/actor.js#L34

We can start by documenting it making it public.

@pocesar
Copy link

pocesar commented May 6, 2020

exporting that right away would be nice.

right now we have 2 actors take literally takes a day to complete, and they wait on each others completion.
Apify.call() inside the main actor, that is currently calling the actor again on each migration, leading to many instances running unnecessarily. so we need to start it once, save the call run id to the KV, use { waitSecs: 0 } and actively poll it

@mnmkng
Copy link
Member

mnmkng commented May 6, 2020

Wouldn't that case be better served by a webhook?

@pocesar
Copy link

pocesar commented May 6, 2020

we could, would need some refactoring and moving code outside

@jancurn
Copy link
Member

jancurn commented May 6, 2020

I think there is no harm making waitForRunToFinish() publicly available, we just need to document it well. The taskId param is not very nice, it's only used for the error message, so perhaps we could get rid of it in the public function and catch and rethrow better error if needed.

@mnmkng
Copy link
Member

mnmkng commented May 7, 2020

apify/crawlee#675

@pocesar
Copy link

pocesar commented Sep 30, 2020

I think this could be tackled similarly to the statistic class, but instead of an autoincrement id, quickly hash the input to make it unique between calls, this way allows more than 1 call/callTask:

const run = await Apify.call('apify/send-mail', {
  to: 'email@example',
});
// if the run is done, it returns immediately, otherwise, waits.

the hash can be a simple CRC32 of the JSON.stringify input, for efficiency and unlikely to have collisions inside the same run. hashable string would be like ${nameOfActor}${JSON.stringify(actorInput)}${idempotencyKey} (add the other options as well?). to make it distinct using the same input (who knows), the user need to pass in the new idempotencyKey to options, otherwise it's the current actor run ID, so this would require 0 changes to the platform (and works properly locally)

Calls are then saved to SDK_CALLS KV, with the return metadata from Apify.client.tasks.runTask

{
   "hexHashOfInput": {
      "id": "HG7ML7M8z78YcAPEB", // the runId
      "actorTaskId": "KJHSKHausidyaJKHs", // ditto
      // save the bits we care about to be able to retrieve it later
   }
}

the lookup for the input should be straightforward. querying the state of the run and issuing a waitForRunToFinish in case { waitSecs: 0 }.

Good part of using hashes: doesn't leak information about the call/callTask on KV, hash is space efficient and stable
Bad part: unintelligible and can't know before hand which call is what key in the object, harder to debug (unless it attaches the id to the call function output)

@metalwarrior665
Copy link
Member Author

@pocesar I think this is an ideal candidate for the apify-utils repo :)

BTW: I think your solution might be quite bloated on the storage. You would need to clean-up the records in the call as well. On the other hand, a single JSON might suffer when you call thousands of times and grow too large. Maybe some compromise :) Let's discuss it on apify-utils :)

@pocesar
Copy link

pocesar commented Sep 30, 2020

@metalwarrior665 the thing is I plan to do a couple of utility PAs that other people can call from their actors (like a microservice). having this working at SDK level instead of having the person to require an external dependency on their code would be ideal

@mnmkng
Copy link
Member

mnmkng commented Sep 30, 2020

Guys, great points. Just letting you know that call will be a part of apify-client soon. But the persistence could still be part of SDK I guess. Dunno yet. Step by step. Obviously the client can do the persistence too.

@pocesar
Copy link

pocesar commented Oct 2, 2020

example: had to write all the boilerplate in the README https://apify.com/pocesar/sitemap-to-request-queue otherwise it will do the same things again on migration and waste a couple of minutes to add 0 requests to the queue

@B4nan B4nan transferred this issue from apify/crawlee Jul 20, 2022
@metalwarrior665
Copy link
Member Author

This will be resolved by the new apify-extra package, will close this once it is released

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants