
Prepare DLAMI for ParallelCluster using pcluster build-image #92

Open · wants to merge 6 commits into base: main
Conversation

@verdimrc (Contributor) commented Jan 5, 2024

Issue #, if available: N/A

Description of changes: Example to prepare a DLAMI using pcluster build-image, which does not require additional community tools (Ansible and Packer).
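For concreteness, a minimal sketch of what such an image configuration might look like, mirroring the fragment discussed in the review below. The ParentImage value and the comments are placeholders, not part of the actual PR: the real AMI ID must be looked up per region for the Deep Learning Base OSS Nvidia Driver GPU AMI.

```yaml
# Hypothetical sketch of a pcluster build-image configuration.
# ParentImage is a placeholder AMI ID, not a real one.
Build:
  InstanceType: g4dn.4xlarge
  ParentImage: ami-0123456789abcdef0  # DLAMI, looked up for the target region
```

It would then be built with something like `pcluster build-image --image-configuration image.yaml --image-id dlami-pcluster --region us-west-2`.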

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@verdimrc verdimrc added the enhancement New feature or request label Jan 5, 2024
@verdimrc verdimrc changed the title Prepare DLAMI using pcluster build-image Prepare DLAMI for ParallelCluster using pcluster build-image Jan 5, 2024
```yaml
# Estimated build time: ~1h
InstanceType: g4dn.4xlarge

# Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240101 / us-west-2
```
@mhuguesaws (Contributor) commented Jan 5, 2024

Feels very hardcoded. Why the OSS NVIDIA driver and not the closed-source one?

I have a hard time understanding what the DLAMI brings compared to the ParallelCluster AMI.

@verdimrc (Contributor, Author) commented Jan 8, 2024

Feels very hardcoded.

Specifying an AMI ID is much the same as in the rest of pcluster or Packer. The example purposely includes a comment with the AMI name and region to quickly communicate what exactly the parent AMI is, admittedly at the cost of a (minor) maintainability overhead.

Why the OSS NVIDIA driver and not the closed-source one?

DLAMI added 'OSS' to the AMI name in Dec '23. The Jan '24 build uses https://github.com/NVIDIA/open-gpu-kernel-modules.git

... what the DLAMI brings compared to the ParallelCluster AMI?

  • A different release cycle, decoupled from ParallelCluster, giving flexibility in which release cycle to follow.
  • An approximation (though not exact) of HyperPod, which is based on DLAMI.
  • It's for users who come from DLAMI and want to continue doing so: prebuilt NCCL, multiple CUDA versions, and the other idiosyncrasies of DLAMI.

@mhuguesaws (Contributor)

  1. Packer uses the latest AMI name, or one based on the ParallelCluster version. It is not bound to an AMI ID specific to a region.
  2. The OSS driver was added to address Linux kernel changes impacting EFA, https://docs.aws.amazon.com/dlami/latest/devguide/important-changes.html. It is fixed now.
  3. Based on DLAMI to use prebuilt NCCL etc. Agreed.

@verdimrc (Contributor, Author)

1/ Packer uses the latest AMI name, or one based on the ParallelCluster version. It is not bound to an AMI ID specific to a region.

This is definitely a plus point.

@@ -0,0 +1,30 @@
```yaml
Build:
# Estimated build time: ~1h
```
@mhuguesaws (Contributor)

It seems long. Why not stick to Packer?

@verdimrc (Contributor, Author) commented Jan 8, 2024

It seems long.

It's about 40-50 minutes, and Packer would take the same time with the default EBS setting (125 MB/s). Our Packer example is faster because we raise the throughput very high (1000 MB/s), which also locks the resulting AMI to the build-time EBS throughput.

pcluster build-image depends on EC2 Image Builder, and Image Builder seems to support only the default EBS throughput (125 MB/s), so at the moment that 40-50 min is the build time (hence being upfront about it in the comment).

Why not stick to Packer?

Because all the GPU DL stacks are already installed, and the image only needs to be enriched with Slurm and the other ParallelCluster requirements (which is exactly what pcluster build-image is about).

Also, this method supports alinux2 or ub2004 without having to write custom Packer and Ansible recipes. Whereas right now, our Packer example is written for alinux2 and does not work out of the box with ub2004.

Lastly, installing the pcluster CLI is straightforward on multiple platforms, as it's a standard Python package installation. Packer (+Ansible) may have paper cuts on different platforms (e.g., on macOS, scp needs its own flag compared to using Cloud9 as the Packer client).

@mhuguesaws (Contributor)

It's because Packer+Ansible was not supposed to SSH remotely; it should run locally.
The Ansible roles in this repo are completely deficient. I am working on a complete Ansible role rewrite, plus tests. ETA Q1.

@mhuguesaws (Contributor) left a comment

Left comments

Verdi March and others added 3 commits January 8, 2024 15:55
@KeitaW force-pushed the pcluster-build-image-dlami branch 2 times, most recently from fa037c7 to 6073a5b on June 4, 2024 02:26
@KeitaW force-pushed the main branch 3 times, most recently from 44e448e to 1209815 on June 4, 2024 02:30
Labels
enhancement New feature or request

3 participants