Prepare DLAMI for ParallelCluster using pcluster build-image #92
Conversation
```yaml
# Estimated build time: ~1h
InstanceType: g4dn.4xlarge

# Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240101 / us-west-2
```
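The lines above come from a `pcluster build-image` configuration. A minimal sketch of such a config might look like the following; the AMI id is a placeholder for illustration, not the real DLAMI id:

```yaml
# Minimal sketch of a `pcluster build-image` configuration (assumed values).
# ami-0123456789abcdef0 is a placeholder, not a real DLAMI id.
Build:
  InstanceType: g4dn.4xlarge
  # Estimated build time: ~1h
  # Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240101 / us-west-2
  ParentImage: ami-0123456789abcdef0
```

Such a config would be consumed with something like `pcluster build-image --image-configuration config.yaml --image-id dlami-pcluster` (the image id here is chosen for illustration).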
Feels very hardcoded. Why the OSS Nvidia driver and not the closed-source one?
I have a hard time understanding what the DLAMI brings compared to the ParallelCluster AMI.
> Feels very hardcoded.

Specifying an AMI id is much the same here as in the rest of the pcluster or packer examples. The example purposely includes a comment with the AMI name + region, to quickly communicate what exactly the parent AMI is, admittedly at the cost of a (minor) maintainability overhead.

> Why the OSS Nvidia driver and not the closed-source one?

DLAMI has included 'OSS' in the AMI name since Dec '23. The Jan '24 build uses https://github.com/NVIDIA/open-gpu-kernel-modules.git

> ... what the DLAMI brings compared to the ParallelCluster AMI?

- A different release cycle, decoupled from PCluster, allowing flexibility in which release cycle to follow.
- An approximation (though not exact) of HyperPod, which is based on DLAMI.
- It's for users who come from DLAMI and want to continue doing so: prebuilt NCCL, multiple CUDA versions, and other idiosyncrasies of DLAMI.
- Packer uses the latest AMI name, or one based on the PC version. It is not bound to an AMI id specific to a region.
- The OSS driver was added to address Linux kernel changes impacting EFA, https://docs.aws.amazon.com/dlami/latest/devguide/important-changes.html. It is fixed now.
- Based on DLAMI to use prebuilt NCCL etc. Agreed.
> 1/ Packer uses the latest AMI name, or one based on the PC version. It is not bound to an AMI id specific to a region.

This is definitely a plus point.
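The "latest AMI by name" resolution mentioned above can be illustrated with a toy snippet (not packer's actual code): DLAMI names end in a YYYYMMDD stamp, so taking the lexicographic maximum of the suffix picks the newest build. In practice the candidate names would come from an `aws ec2 describe-images` name filter rather than a hardcoded list.

```python
# Toy illustration of resolving "the latest AMI by name":
# DLAMI-style names end in a YYYYMMDD stamp, which sorts
# correctly as a plain string.
def latest_by_date_suffix(names):
    """Return the name with the newest trailing YYYYMMDD stamp."""
    return max(names, key=lambda n: n.rsplit(" ", 1)[-1])

names = [
    "Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20231215",
    "Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240101",
]
print(latest_by_date_suffix(names))  # the 20240101 build wins
```

This sidesteps region-specific AMI ids entirely, which is the plus point being acknowledged.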
```yaml
@@ -0,0 +1,30 @@
Build:
  # Estimated build time: ~1h
```
It seems long. Why not stick to packer?
> It seems long.

It's about 40-50ish minutes, and packer would take the same time with the default EBS setting (125 MB/s). Our packer example is faster because we raise the throughput very high (1000 MB/s), which also locks the resulting AMI to the build-time EBS throughput. `pcluster build-image` depends on Image Builder, and Image Builder seems to only support the default EBS throughput (125 MB/s), so at the moment that 40-50 min is the build time (hence being upfront about it in the commentary).
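A back-of-envelope check of those throughput numbers (the 100 GB volume size is an assumption for illustration, not from the repo): raw I/O alone for 100 GB takes roughly 13 minutes at gp3's default 125 MB/s, versus under 2 minutes at 1000 MB/s.

```python
# Back-of-envelope: time to stream a root volume at gp3's default
# 125 MB/s versus a provisioned 1000 MB/s. The 100 GB size is an
# assumption for illustration.
def transfer_minutes(size_gb: float, throughput_mb_s: float) -> float:
    return size_gb * 1000 / throughput_mb_s / 60

print(round(transfer_minutes(100, 125), 1))   # ~13.3 min of raw I/O
print(round(transfer_minutes(100, 1000), 1))  # ~1.7 min of raw I/O
```

The rest of the 40-50 min build is package installation and configuration, which the throughput setting does not speed up.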
> Why not stick to packer?

Because all the GPU DL stacks are already installed, and we only need to enrich the image with Slurm and other pcluster requirements (which is what `pcluster build-image` is about).

And this method supports alinux2 or ub2004 without having to write custom packer and ansible recipes. Whereas now, our packer example is written for alinux2 and does not work out-of-the-box with ub2004.

Lastly, installing the pcluster CLI is straightforward on multiple platforms, as it's a standard Python package installation. Packer (+ ansible) can have paper cuts on different platforms (e.g., on OSX, scp must use its own flag compared to when using Cloud9 as the packer client).
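For comparison, raising the EBS throughput in a packer `amazon-ebs` source looks roughly like the following. This is a hedged sketch: the device name, volume type, and elided settings are assumptions, not copied from the repo's packer example.

```hcl
# Sketch of an amazon-ebs packer source with raised EBS throughput
# (device name and other settings are assumed for illustration).
source "amazon-ebs" "dlami" {
  # ... instance type, source AMI filter, etc. elided ...
  launch_block_device_mappings {
    device_name = "/dev/sda1"
    volume_type = "gp3"
    throughput  = 1000 # MB/s; gp3 default is 125 MB/s
  }
}
```

As noted above, the throughput set at build time also becomes the resulting AMI's default, which is a trade-off rather than a free speedup.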
That's because packer+ansible was not supposed to SSH remotely; it should run locally.
The Ansible roles in this repo are completely deficient. I am working on a complete ansible role re-write + tests. ETA Q1.
2.ami_and_containers/3.pcluster_create_dlami/01.dlami-ub2004-base-gpu.yaml (outdated, resolved comment thread)
2.ami_and_containers/3.pcluster_create_dlami/01.dlami-ub2004-base-gpu.yaml (resolved comment thread)
Left comments
Co-authored-by: enrico-usai <10634438+enrico-usai@users.noreply.github.com>
Issue #, if available: N/A

Description of changes: Example to prepare DLAMI using `pcluster build-image`, which does not require additional community tools (`ansible` and `packer`).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.