
Add llama-recipes llama3 example #276

Closed · wants to merge 28 commits into from

Conversation

@KeitaW (Contributor) commented Apr 19, 2024

Issue #, if available:

Description of changes:

This test case illustrates the different steps of LLM model development, from fine-tuning through evaluation to deployment, using Meta's llama-recipes.

@KeitaW KeitaW self-assigned this Apr 19, 2024
@KeitaW KeitaW added the enhancement New feature or request label Apr 19, 2024
@KeitaW KeitaW changed the title [Draft, not ready for review] Add llama3 example [Draft, not ready for review] Add llama-recipes llama3 example Apr 25, 2024
@KeitaW KeitaW changed the title [Draft, not ready for review] Add llama-recipes llama3 example Add llama-recipes llama3 example May 7, 2024
@KeitaW KeitaW marked this pull request as ready for review May 7, 2024 02:27
@KeitaW KeitaW requested review from perifaws and awsankur May 7, 2024 02:27
@pbelevich (Collaborator) left a comment:


You need to escape $ as \$ when you do cat > .env ...; otherwise the $XYZ values will be evaluated when you run cat and will end up empty in the resulting .env file.
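
A minimal sketch of the difference (FSX_PATH here is just an illustrative variable):

```bash
# Unescaped: the current shell expands $FSX_PATH while running cat,
# so .env ends up with an empty value if the variable is not set yet.
cat > .env << EOF
export FSX_PATH=$FSX_PATH
EOF

# Escaped: the literal text ${FSX_PATH} is written into .env and is
# only expanded later, when .env is sourced.
cat > .env << EOF
export FSX_PATH=\${FSX_PATH}
EOF
```

Quoting the delimiter (cat > .env << 'EOF') disables expansion for the whole heredoc, which avoids escaping each $ individually.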

Review threads (outdated, resolved):
- 3.test_cases/19.llama-recipes/README.md (5 threads)
- 3.test_cases/19.llama-recipes/scripts/generate_env_vars.sh (5 threads)
On the head/login node of the cluster, clone the repository and move to the test case directory.

```bash
git clone https://github.com/aws-samples/awsome-distributed-training ${FSX_PATH}
```
@pbelevich (Collaborator) commented May 19, 2024:

I assume that you meant:

1. `mkdir -p ${FSX_PATH}`
2. `cd ${FSX_PATH}`
3. `git clone https://github.com/aws-samples/awsome-distributed-training`

Or you need to change TEST_CASE_PATH to

`export TEST_CASE_PATH=\${FSX_PATH}/3.test_cases/19.llama-recipes`

because if you git clone into ${FSX_PATH}, the test cases end up at ${FSX_PATH}/3.test_cases instead of ${FSX_PATH}/awsome-distributed-training/3.test_cases.
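
A consolidated sketch of the first suggestion, assuming FSX_PATH is already exported (e.g. /fsx):

```bash
# Clone into a subdirectory of the FSx mount so that the README's
# ${FSX_PATH}/awsome-distributed-training/... paths resolve as written.
mkdir -p ${FSX_PATH}
cd ${FSX_PATH}
git clone https://github.com/aws-samples/awsome-distributed-training
```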

## 2. Build the container

Before running training jobs, you need to build a Docker container image. [Enroot](https://github.com/NVIDIA/enroot) will be used to turn the image into an unprivileged sandbox for Slurm.
You can build the image on your login node using option 1 below, but the build step may exceed the storage available on the head node, so we recommend building it on a compute node following the instructions in option 2.
Collaborator: where is option 2?

On the head/login node, run the following to launch a Python process:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
    ...
```
Collaborator: Probably you missed:

    mkdir -p ${APPS_PATH}
    enroot import -o ${ENROOT_IMAGE} dockerd://llama3:latest
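
A sketch of the complete import step implied by this comment; the docker build line is an assumption about how the llama3:latest image would be produced, and APPS_PATH/ENROOT_IMAGE are taken from the test case's environment file:

```bash
# Build the image locally (hypothetical; the README's actual build
# command may differ), then convert it into a squashfs file that
# enroot/Slurm can run unprivileged.
docker build -t llama3 .
mkdir -p ${APPS_PATH}
enroot import -o ${ENROOT_IMAGE} dockerd://llama3:latest
```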

On the head/login node, run the following to launch a Python process:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
    ...
```
Collaborator: Also, enroot start failed for me with `[ERROR] Command not found: squashfuse` on multiple clusters.

Collaborator: And `[ERROR] Command not found: fuse-overlayfs`.

Collaborator: Fixed after `sudo apt-get install squashfuse fuse-overlayfs`.
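
A minimal sketch of that fix, assuming an apt-based node image:

```bash
# enroot relies on these FUSE helpers to mount squashfs images and
# overlay filesystems without root privileges, so install them on
# every node that runs enroot.
sudo apt-get update
sudo apt-get install -y squashfuse fuse-overlayfs
```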


On the head/login node, run the following to launch a Python process:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
    ...
```
Collaborator: Maybe HF login can be done without enroot start?

@KeitaW (Contributor, Author): You can, but that requires you to set up a Python env with HF locally. Since we have everything in the container, I prefer this way. But I agree it's not easy to follow, so I packaged it as a script here.
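
A hypothetical sketch of what such a wrapper script could look like; the container name (llama3) and the login command are assumptions, not the PR's actual script:

```bash
#!/bin/bash
# Run the Hugging Face CLI login inside the enroot container so no
# local Python environment is needed on the head node.
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
    llama3 \
    huggingface-cli login
```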

In this step, you will fine-tune the Llama model using the Alpaca dataset. Use the curl command to download the dataset:

```bash
curl https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
Collaborator: curl without args prints the content to the console; probably we need to store it somewhere instead?

@KeitaW (Contributor, Author): Yes, I was missing `-O`, which saves it as alpaca_data.json.
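
For reference, the corrected download (with -O, curl keeps the remote file name, alpaca_data.json):

```bash
# Download the Alpaca dataset to ./alpaca_data.json instead of stdout.
curl -O https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```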

The training process will create the following FSDP checkpoints.

```bash
$ ls /fsx/models/meta-llama/Meta-Llama-3-8B-tuned/fine-tuned-meta-llama/Meta-Llama-3-8B/
```
Collaborator: Sorry, I missed how and when it was created, I mean /fsx/models...

KeitaW and others added 10 commits May 20, 2024 21:19, each:
Co-authored-by: Pavel Belevich <belevich@amazon.com>
@KeitaW (Contributor, Author) commented May 20, 2024:

Thanks @pbelevich for your thoughtful reviews! As discussed offline, let's migrate the contents to the torchtitan/torchtune stack. I have created a draft PR here.

@KeitaW KeitaW closed this May 20, 2024