
The default value of availableMemoryRatio is too low #2423

Closed
barjin opened this issue Apr 16, 2024 · 10 comments
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@barjin
Contributor

barjin commented Apr 16, 2024

In Apify Actor with 4 GB of available memory, the AutoscaledPool refuses to scale up, as it only sees ~1 GB of free memory.

[screenshot]

This slows down scrapers and might cause higher costs for Crawlee-based Actor users (user is billed per available memory/second, not used memory).

@barjin barjin added the bug Something isn't working. label Apr 16, 2024
@barjin barjin self-assigned this Apr 16, 2024
@barjin
Contributor Author

barjin commented Apr 16, 2024

This seems to be caused by the following snippet (line 178):

} else {
    const { totalBytes } = await this._getMemoryInfo();
    this.maxMemoryBytes = Math.ceil(totalBytes * this.config.get('availableMemoryRatio')!);
    this.log.debug(`Setting max memory of this run to ${Math.round(this.maxMemoryBytes / 1024 / 1024)} MB. `
        + 'Use the CRAWLEE_MEMORY_MBYTES or CRAWLEE_AVAILABLE_MEMORY_RATIO environment variable to override it.');
}

The availableMemoryRatio is by default 0.25, which checks out with our observations. This is probably alright for non-Apify users, but a bit dumb for Actors on Apify, where we should utilize all the available resources.
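As a sanity check (this is just the arithmetic from the snippet above with illustrative numbers, not Crawlee's code), the default ratio applied to a 4 GB Actor yields exactly the ~1 GB the pool sees:

```typescript
// Reproduce the maxMemoryBytes computation with a 4 GB Actor.
const totalBytes = 4 * 1024 ** 3;  // 4 GB of available Actor memory
const availableMemoryRatio = 0.25; // Crawlee's current default
const maxMemoryBytes = Math.ceil(totalBytes * availableMemoryRatio);
console.log(`${Math.round(maxMemoryBytes / 1024 / 1024)} MB`); // 1024 MB, i.e. the ~1 GB observed
```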

In such cases, this can be remedied by overriding the defaults with the CRAWLEE_AVAILABLE_MEMORY_RATIO envvar or by passing a customized Configuration instance to the crawler:

new PlaywrightCrawler(
    {},
    new Configuration({
        availableMemoryRatio: 1,
    }),
);
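For the environment-variable route, a single line in the Actor's run environment is enough (a sketch; on Apify you would typically set this in the Dockerfile or the run's environment settings):

```shell
export CRAWLEE_AVAILABLE_MEMORY_RATIO=1
```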

@barjin barjin changed the title Available memory info is not correct on Apify The default value of availableMemoryRatio is too low Apr 16, 2024
@janbuchar
Contributor

janbuchar commented Apr 16, 2024

Humph, it'd make sense to me if Actor.init set the ratio to 1. Or can we set the default value of the env var for all runs on the platform without forcing everybody to update dependencies?

@barjin
Contributor Author

barjin commented Apr 17, 2024

Perhaps we can set the default value of availableMemoryRatio in the Apify SDK's Configuration (here)?

We might use the APIFY_IS_AT_HOME envvar to switch between the original default and the new default ratio (~0.9?)
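A minimal sketch of that proposal (the helper name is invented; APIFY_IS_AT_HOME is the platform-set env var mentioned above, and 0.9 is the tentative platform ratio):

```typescript
// Hypothetical default selection for the Apify SDK's Configuration:
// use a high ratio on the platform, keep Crawlee's 0.25 default elsewhere.
function defaultAvailableMemoryRatio(env: Record<string, string | undefined> = process.env): number {
    const isAtHome = env.APIFY_IS_AT_HOME === '1' || env.APIFY_IS_AT_HOME === 'true';
    return isAtHome ? 0.9 : 0.25;
}
```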

@vladfrangu
Member

I'd say let's set it at the base image level, with cheerio and plain Node having a higher ratio than browser images, but what do you think would be better?

@janbuchar
Contributor

> I'd say let's set it at the base image level, with cheerio and plain Node having a higher ratio than browser images, but what do you think would be better?

I'm probably missing important info here - if I start a new crawlee project, I get a Dockerfile based on one of the base images, correct?

If I change the crawler type in my code (perfectly legit thing IMO), won't configuration done in the base image just stick? That seems hard to track down...

@vladfrangu
Member

> If I change the crawler type in my code (perfectly legit thing IMO), won't configuration done in the base image just stick? That seems hard to track down..

This is true, but you should also update the image in that case... I guess this is a rough thing to fix... Maybe we can find a middle ground? Expose an env variable from base images that specifies the image type, and have Actor.init decide on the default ratio based on it?
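A rough sketch of that middle ground (ACTOR_IMAGE_TYPE and the ratio table are entirely made up for illustration; the thread only proposes that base images expose something like them):

```typescript
// Hypothetical: base images export an identifier such as ACTOR_IMAGE_TYPE,
// and Actor.init maps it to a memory ratio.
const RATIO_BY_IMAGE_TYPE: Record<string, number> = {
    'node': 0.9,            // plain Node / Cheerio crawlers can use most of the memory
    'node-playwright': 0.6, // leave headroom for the browser processes
    'node-puppeteer': 0.6,
};

function ratioForImage(imageType: string | undefined): number {
    // Fall back to Crawlee's current default for unknown or missing types.
    return (imageType !== undefined && RATIO_BY_IMAGE_TYPE[imageType]) || 0.25;
}
```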

Or maybe I'm just high and there's a better solution! I'm just throwing ideas here :D

@B4nan
Member

B4nan commented Apr 21, 2024

I realized I haven't commented on this anywhere, we only discussed this with Jindra on Thursday - so here is the thing: we already set this value to 1 on the platform, and it worked just fine until recently. It's done in the SDK in Actor.init here:

https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L203

What I think might have happened is that a wrong config is resolved via AsyncLocalStorage (as by default all places use the global config which resolves to a scoped one via ALS). If that's the case, it could be caused by #2371.
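To make the hunch concrete, the resolution pattern being described looks roughly like this (a simplified sketch, not Crawlee's actual implementation):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

interface Config { availableMemoryRatio: number }

const als = new AsyncLocalStorage<Config>();
const globalConfig: Config = { availableMemoryRatio: 1 }; // what Actor.init sets on the platform

// "Global" lookups defer to an ALS-scoped config when one is active, so a
// scope entered with stale defaults would silently shadow Actor.init's value.
function getConfig(): Config {
    return als.getStore() ?? globalConfig;
}
```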

@janbuchar
Contributor

> What I think might have happened is that a wrong config is resolved via AsyncLocalStorage (as by default all places use the global config which resolves to a scoped one via ALS). If that's the case, it could be caused by #2371.

Could you elaborate how? That ALS is not even in place when you're not working with AdaptivePlaywrightCrawler. Or is this just a hunch that two supposedly independent instances of AsyncLocalStorage may interfere in weird ways?

@B4nan
Member

B4nan commented Apr 22, 2024

Yes, it's a hunch, based on years of experience working with ALS, seeing all the weird edge cases myself (been using it before it became stable).

What I am sure about:

  1. we were setting the ratio to 1 (inside Actor.init) since inception
  2. it was working just fine for a very long time (since the initial 3.0 release)
  3. the config resolution depends on ALS
  4. only recently we started adding more ALS usage

It could be as well about some other refactoring, but that particular PR sounds like the ideal first candidate to check.

I haven't tried to reproduce this yet, not sure if it's surfacing always or if it's just a fluke? If it's happening the same all the time, I would first try to revert that PR via patch-package to see if it helps.

Next time let's please at least add a link to slack discussions to the OP for more context.

@B4nan B4nan assigned B4nan and unassigned barjin Apr 22, 2024
@fnesveda fnesveda added t-tooling Issues with this label are in the ownership of the tooling team. t-console Issues with this label are in the ownership of the console team. labels May 2, 2024
@B4nan B4nan removed the t-console Issues with this label are in the ownership of the console team. label May 14, 2024
@B4nan
Member

B4nan commented May 14, 2024

I will close this since it's no longer surfacing in the current version and I haven't been able to confirm my hunch from above either (also the PR in question looks safe on a second look, it shouldn't affect more than just the adaptive crawler even if it would be the culprit).

@B4nan B4nan closed this as not planned May 14, 2024
Development

No branches or pull requests

5 participants