Add llama-recipes llama3 example #276
Conversation
You need to escape `$` with `\$` when you do `cat > .env ...`;
otherwise `$XYZ` values will be evaluated when you run cat and will be empty in the resulting .env file.
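The difference can be sketched in a few lines (variable and file names are illustrative, not from the test case):

```bash
# Sketch: why `$` must be escaped inside an unquoted here-doc.
unset XYZ
# Unescaped: $XYZ is expanded while the file is written -> empty value.
cat > /tmp/bad.env << EOF
VALUE=$XYZ
EOF
# Escaped: the literal text survives into the .env file.
cat > /tmp/good.env << EOF
VALUE=\$XYZ
EOF
cat /tmp/bad.env    # VALUE=
cat /tmp/good.env   # VALUE=$XYZ
```

Quoting the delimiter (`<< 'EOF'`) disables all expansion and avoids escaping each `$` individually.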
On the head/login node of the cluster, clone the repository and move to the test case directory.

```bash
git clone https://github.com/aws-samples/awsome-distributed-training ${FSX_PATH}
```
I assume that you meant:

`mkdir -p ${FSX_PATH}`
`cd ${FSX_PATH}`
`git clone https://github.com/aws-samples/awsome-distributed-training`

Or you need to change TEST_CASE_PATH to

`export TEST_CASE_PATH=\${FSX_PATH}/3.test_cases/19.llama-recipes`

because if you `git clone` into ${FSX_PATH}, the test case would be at ${FSX_PATH}/3.test_cases instead of ${FSX_PATH}/awsome-distributed-training/3.test_cases.
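To make the two layouts concrete, here is a small sketch (the `FSX_PATH` value and the derived paths are illustrative):

```bash
# Sketch: where the test case lands depending on how the repo is cloned.
FSX_PATH=/fsx
# `git clone <url> ${FSX_PATH}`: repo contents go directly into ${FSX_PATH}
CLONE_AS_TARGET="${FSX_PATH}/3.test_cases/19.llama-recipes"
# `cd ${FSX_PATH} && git clone <url>`: the repo directory adds one level
CLONE_INSIDE="${FSX_PATH}/awsome-distributed-training/3.test_cases/19.llama-recipes"
echo "${CLONE_AS_TARGET}"
echo "${CLONE_INSIDE}"
```

Either layout works as long as TEST_CASE_PATH points at the directory that actually exists.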
## 2. Build the container

Before running training jobs, you need to build a Docker container image. [Enroot](https://github.com/NVIDIA/enroot) will be used to turn the image into an unprivileged sandbox for Slurm.
You can build the image on your login node using option 1 below, but the build step may exceed the storage available on the head node, so we recommend building it on a compute node following the instructions in option 2.
where is option 2?
On the login (head) node, launch a Python process by running the following:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
```
Probably you missed:

`mkdir -p ${APPS_PATH}`
`enroot import -o ${ENROOT_IMAGE} dockerd://llama3:latest`
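A minimal sketch of that preparation step (the `APPS_PATH` value and image file name here are assumed for illustration; the actual import needs a running Docker daemon, so it is left commented out):

```bash
# Sketch: make sure the target directory exists before `enroot import`.
APPS_PATH=/tmp/apps_demo
ENROOT_IMAGE="${APPS_PATH}/llama3.sqsh"
mkdir -p "${APPS_PATH}"   # without this, the import fails on a missing directory
# enroot import -o "${ENROOT_IMAGE}" dockerd://llama3:latest   # requires Docker
echo "image will be written to ${ENROOT_IMAGE}"
```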
On the login (head) node, launch a Python process by running the following:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
```
Also, `enroot start` failed for me with `[ERROR] Command not found: squashfuse` on multiple clusters.
And `[ERROR] Command not found: fuse-overlayfs`.
Fixed after `sudo apt-get install squashfuse fuse-overlayfs`.
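A quick preflight check like the following (a sketch, using the two package names from the fix) can surface the missing fuse helpers before `enroot start` errors out:

```bash
# Sketch: check that the fuse helpers enroot relies on are installed.
MISSING=""
for cmd in squashfuse fuse-overlayfs; do
    command -v "$cmd" > /dev/null 2>&1 || MISSING="${MISSING} ${cmd}"
done
if [ -n "${MISSING}" ]; then
    echo "missing:${MISSING} -- try: sudo apt-get install${MISSING}"
else
    echo "all fuse helpers present"
fi
```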
If you set up a cluster with
- https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/docker/postinstall.sh
- https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/pyxis/postinstall.sh

these should have been installed by default. If you still have issues nonetheless, let's discuss offline.
On the login (head) node, launch a Python process by running the following:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
```
Maybe HF login can be done without `enroot start`?
You can, but that requires you to set up a Python env with HF locally. Since we have everything in the container, I prefer this way. But I agree it's not easy to follow, so I packaged it as a script here.
In this step, you will fine-tune the Llama model using the Alpaca dataset. Use the curl command to download the dataset:

```bash
curl https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
curl without an output option prints the content to the console; probably we need to store it somewhere instead?
Yes, I was missing `-o alpaca_data.json`.
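For the record, `-o` names the output file instead of letting curl stream to stdout. A sketch, demonstrated with a local `file://` URL (and illustrative /tmp paths) so it runs offline; the same flag applies to the raw.githubusercontent.com URL from the diff:

```bash
# Sketch: `-o` writes the response body to a file instead of stdout.
printf '{"demo": true}\n' > /tmp/alpaca_src.json
curl -s -o /tmp/alpaca_dest.json "file:///tmp/alpaca_src.json"
cat /tmp/alpaca_dest.json   # {"demo": true}
```

For a real HTTPS download, adding `-fL` (fail on HTTP errors, follow redirects) is also a common safeguard.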
The training process will create the following FSDP checkpoints.

```bash
$ ls /fsx/models/meta-llama/Meta-Llama-3-8B-tuned/fine-tuned-meta-llama/Meta-Llama-3-8B/
```
Sorry, I missed the point how and when was it created, I mean /fsx/models...
Co-authored-by: Pavel Belevich <belevich@amazon.com>
Thanks @pbelevich for your thoughtful reviews! As discussed offline, let's migrate the contents to
Issue #, if available:
Description of changes:
This test case illustrates different LLM model development steps: fine-tuning, evaluation, and deployment using Meta's llama-recipes.