
Power-tuning involving only cold starts #176

Open
alexcasalboni opened this issue Aug 8, 2022 · 10 comments

alexcasalboni commented Aug 8, 2022

The tool could provide an option to power-tune a given function considering only cold start invocations.

The current logic is based on aliases in order to maximize parallelism and optimize the overall speed of the power-tuning process. Unfortunately, this also makes cold-start-only tuning hard to achieve.

We could implement an alternative logic such as:

When parameter forceColdStarts (or onlyColdStarts) is provided:
[Initializer] Do nothing (no alias or version needs to be created)
[Executor] Invoke $LATEST sequentially after forcing a cold start (by updating power config and an env variable)

This should work fine for all values of num and any power value.
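A minimal sketch of the Executor idea above, assuming that updating the function configuration (e.g. changing an environment variable) invalidates warm execution environments and forces a cold start on the next invocation. The variable name `LambdaPowerTuningForceColdStart` and the helper are hypothetical, not part of the tool:

```javascript
// Pure helper: merge the function's existing env vars with a unique
// marker, so that every UpdateFunctionConfiguration call is a real
// config change (which forces Lambda to spin up a fresh environment).
function buildColdStartEnv(existingEnv, iteration) {
  return {
    ...existingEnv,
    LambdaPowerTuningForceColdStart: String(iteration),
  };
}

// Usage with the AWS SDK for JavaScript v3 (not executed here):
//
// const { LambdaClient, UpdateFunctionConfigurationCommand, InvokeCommand } =
//   require('@aws-sdk/client-lambda');
// const lambda = new LambdaClient({});
// for (let i = 0; i < num; i++) {
//   await lambda.send(new UpdateFunctionConfigurationCommand({
//     FunctionName: functionName,
//     MemorySize: powerValue,
//     Environment: { Variables: buildColdStartEnv(currentEnv, i) },
//   }));
//   // wait until LastUpdateStatus is 'Successful', then invoke $LATEST:
//   await lambda.send(new InvokeCommand({ FunctionName: functionName }));
// }
```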

The only drawback is that nothing can be parallelized, which isn't a big issue as long as each invocation is short enough. For example, if the average cold invocation takes 5 seconds, with num=20 and 5 power values, the overall power-tuning process will take about 8 minutes. It's very easy to reach 40+ minutes with 10s invocations and num>50.
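The back-of-the-envelope math above can be sketched with a hypothetical helper (not part of the tool): with fully sequential execution, total time is simply the product of the three factors.

```javascript
// Estimate the duration of a fully sequential cold-start-only run:
// every invocation waits for the previous one to finish.
function estimateSequentialMinutes(avgColdSeconds, num, powerValues) {
  return (avgColdSeconds * num * powerValues) / 60;
}

// 5s average cold invocation, num=20, 5 power values → ~8 minutes
console.log(estimateSequentialMinutes(5, 20, 5));  // ≈ 8.3
// 10s invocations, num=50, 5 power values → 40+ minutes
console.log(estimateSequentialMinutes(10, 50, 5)); // ≈ 41.7
```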

alexcasalboni self-assigned this Aug 8, 2022
@alexcasalboni
Owner Author

This is somewhat related to the (closed) issue #123.

@Parro what do you think of this approach?

Twitter thread with Paul Johnston for reference: https://twitter.com/alex_casalboni/status/1556585120332759040


Parro commented Aug 9, 2022

I was thinking of a different approach: in the Initializer step we could create a set of different versions of the Lambda function, all with the same code. This way, every invocation of a distinct version should spawn a new execution environment with its own cold start. We could still use parallelization, and even add a new step to the state machine so that the report includes statistics for both Duration and Init Duration.
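The idea above could be sketched as one alias per (power value, invocation) pair, so that each parallel invocation hits its own version and gets its own cold start. The naming scheme and helper are hypothetical:

```javascript
// Pure helper: one alias name per (power value, invocation index) pair.
// Each parallel invocation targets a distinct version/alias,
// so every invocation experiences a cold start.
function coldStartAliasNames(powerValues, num) {
  const names = [];
  for (const power of powerValues) {
    for (let i = 0; i < num; i++) {
      names.push(`coldstart-${power}-${i}`);
    }
  }
  return names;
}

// Usage (not executed): for each name, publish a version and create the
// alias with the AWS SDK v3 (PublishVersionCommand / CreateAliasCommand).
// Note: PublishVersion only creates a new version when code or config
// changed, so a throwaway config tweak may be needed between publishes.

console.log(coldStartAliasNames([128, 256], 2));
// → [ 'coldstart-128-0', 'coldstart-128-1', 'coldstart-256-0', 'coldstart-256-1' ]
```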

What do you think about it?

@alexcasalboni
Owner Author

@Parro yes, that's what Paul proposed too.

Let's double-check the Lambda quotas :)

Is there any limit regarding the # of aliases per function? Or any API rate-limiting when creating new versions and aliases? I never encountered any limitations since we only create one version/alias per power value.

Let's assume there are no such limitations.

With x power values and num invocations, we'll need to create x * num versions and aliases, so that we can invoke them in parallel. I often run the tool with 5 power values and num=50 (or even 100), so that means 250+ versions and aliases created during initialization.

I'd agree this mechanism is better for the overall execution time, even if the initialization phase will take much longer. As far as I can remember, the creation of new versions/aliases cannot be parallelized. Initializing 4-5 versions currently takes 7-8 seconds. With num=50 it will take more than 6 minutes.
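The initialization estimate above can be sketched as follows, assuming version creation stays sequential at roughly 1.6 seconds per version (consistent with 7-8 seconds for 4-5 versions); the helper is hypothetical:

```javascript
// With x power values and num invocations, this approach needs
// x * num versions/aliases, created one after another.
function initializationEstimate(powerValues, num, secondsPerVersion) {
  const versions = powerValues * num;
  return { versions, minutes: (versions * secondsPerVersion) / 60 };
}

// 5 power values, num=50, ~1.6s per version → 250 versions, ~6.7 minutes
console.log(initializationEstimate(5, 50, 1.6));
```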


Parro commented Aug 9, 2022

Is there any limit regarding the # of aliases per function?

The only limit I am aware of is the code storage quota of 75 GB. In an account with few Lambda functions it should not be a problem; in an account with dozens of them we could hit the limit. And of course it depends on the size of the function under test. We could state clearly in the documentation that the test will use approximately lambdaSize * powerValues * num of storage.
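A quick sketch of that storage check (the helper name and the 50 MB example package size are hypothetical; the 75 GB code-storage quota is the default per account/region):

```javascript
// Each published version stores its own copy of the deployment package,
// so the test consumes roughly lambdaSize * powerValues * num of storage.
function codeStorageEstimateGB(lambdaSizeMB, powerValues, num) {
  return (lambdaSizeMB * powerValues * num) / 1024;
}

// e.g. a 50 MB package, 5 power values, num=50
// → ~12.2 GB of the 75 GB quota
console.log(codeStorageEstimateGB(50, 5, 50));
```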

As far as I can remember, the creation of new versions/aliases cannot be parallelized.

Is it a Lambda limitation? Would it fail even if we used a Map step in the state machine?

Anyway, even if the initialization time is long, we could state this in the docs to warn the user.

@alexcasalboni
Owner Author

Is it a Lambda limitation? Would it fail even if we used a Map step in the state machine?

Yes, because you're always working on $LATEST when creating new versions and aliases.

I've just implemented a first iteration of this (both the initializer and cleaner logic). I'm going to run some tests and share the WIP code in a new PR later today.

@alexcasalboni
Owner Author

@Parro it works :) Check out the PR #177

@ryancormack
Contributor

Hey @alexcasalboni @Parro, I was wondering if there's any movement on this? I noticed a few open PRs that seem to be working, but not much recent activity. Is there anything I could help with if there are some rough edges that need a hand?

If there's one PR that's likely the direction this will move in (assuming the feature is still planned), I could just clone that version and deploy it short term.

Thanks

@alexcasalboni
Owner Author

@ryancormack thanks for checking :) yes, we're definitely moving forward to find the ideal solution for this!

Currently, there are two open PRs using different approaches:

  • Add parameter to power-tune only cold starts #177 (a bit old at this point) creates num new versions/aliases for each memory configuration. It does work, but it's a bit extreme in the amount of overhead and the number of API calls it generates - you easily end up creating and destroying hundreds of versions/aliases - and it also imposes an upper bound of approximately 500 versions/aliases that can be created in 15 minutes (limiting the number of configurations * invocations you can test)
  • Add parameter to power-tune only cold starts #206 moves the version/alias creation into a state machine loop, removing the above constraint at the expense of making the state machine more complex and expensive to run (for all use cases, not only cold starts)

That said, I'm quite sure the second PR is closer to the direction we'll choose and I'd recommend you clone that version for the time being. I'm currently working with a few colleagues at AWS to speed up the maintenance of this tool, so I'd expect we'll settle on a final solution in the next 30-60 days 🚀

@ryancormack
Contributor

Thanks Alex, it's working mostly well. I've created that issue above - I know it was briefly mentioned earlier in this issue, but I don't know if it's more nuanced than that; neither PR currently accounts for it.

The tool was super helpful in actually spawning a huge number of cold starts for me, and I could use CloudWatch Logs queries to get the 'end user latency' times I was hoping to get.

@alexcasalboni
Owner Author

Quick update on this: we're continuing our work on #206 - it turns out that approach is also useful to solve a SnapStart-related problem.

Apologies for the delay, we should be able to finalize the current implementation in a matter of weeks.
