Add llama-recipes llama3 example #276
Conversation
You need to escape `$` with `\$` when you do `cat > .env ...`;
otherwise `$XYZ` values will be evaluated when you run cat and will be empty in the resulting .env file.
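The difference can be sketched in a few lines (variable and file names are illustrative, not from the test case):

```bash
# Sketch: why `$` must be escaped inside an unquoted here-doc.
unset XYZ
# Unescaped: $XYZ is expanded while the file is written -> empty value.
cat > /tmp/bad.env << EOF
VALUE=$XYZ
EOF
# Escaped: the literal text survives into the .env file.
cat > /tmp/good.env << EOF
VALUE=\$XYZ
EOF
cat /tmp/bad.env    # VALUE=
cat /tmp/good.env   # VALUE=$XYZ
```

Quoting the delimiter (`<< 'EOF'`) disables all expansion and avoids escaping each `$` individually.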
On the head/login node of the cluster, clone the repository and move to the test case directory.

```bash
git clone https://github.com/aws-samples/awsome-distributed-training ${FSX_PATH}
```
I assume that you meant:

`mkdir -p ${FSX_PATH}`
`cd ${FSX_PATH}`
`git clone https://github.com/aws-samples/awsome-distributed-training`

Or you need to change TEST_CASE_PATH to

`export TEST_CASE_PATH=\${FSX_PATH}/3.test_cases/19.llama-recipes`

because if you `git clone` into ${FSX_PATH}, the test case would be at ${FSX_PATH}/3.test_cases instead of ${FSX_PATH}/awsome-distributed-training/3.test_cases.
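To make the two layouts concrete, here is a small sketch (the `FSX_PATH` value and the derived paths are illustrative):

```bash
# Sketch: where the test case lands depending on how the repo is cloned.
FSX_PATH=/fsx
# `git clone <url> ${FSX_PATH}`: repo contents go directly into ${FSX_PATH}
CLONE_AS_TARGET="${FSX_PATH}/3.test_cases/19.llama-recipes"
# `cd ${FSX_PATH} && git clone <url>`: the repo directory adds one level
CLONE_INSIDE="${FSX_PATH}/awsome-distributed-training/3.test_cases/19.llama-recipes"
echo "${CLONE_AS_TARGET}"
echo "${CLONE_INSIDE}"
```

Either layout works as long as TEST_CASE_PATH points at the directory that actually exists.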
## 2. Build the container

Before running training jobs, you need to build a Docker container image. [Enroot](https://github.com/NVIDIA/enroot) will be used to turn the image into an unprivileged sandbox for Slurm.
You can build the image on your login node using option 1 below, but the build step may exceed the storage available on the head node, so we recommend building it on a compute node following the instructions in option 2.
where is option 2?
On the login (head) node, launch a Python process by running the following:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
```
Probably you missed:

`mkdir -p ${APPS_PATH}`
`enroot import -o ${ENROOT_IMAGE} dockerd://llama3:latest`
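A minimal sketch of that preparation step (the `APPS_PATH` value and image file name here are assumed for illustration; the actual import needs a running Docker daemon, so it is left commented out):

```bash
# Sketch: make sure the target directory exists before `enroot import`.
APPS_PATH=/tmp/apps_demo
ENROOT_IMAGE="${APPS_PATH}/llama3.sqsh"
mkdir -p "${APPS_PATH}"   # without this, the import fails on a missing directory
# enroot import -o "${ENROOT_IMAGE}" dockerd://llama3:latest   # requires Docker
echo "image will be written to ${ENROOT_IMAGE}"
```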
On the login (head) node, launch a Python process by running the following:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
```
Also, `enroot start` failed for me with `[ERROR] Command not found: squashfuse` on multiple clusters.
And `[ERROR] Command not found: fuse-overlayfs`.
Fixed after `sudo apt-get install squashfuse fuse-overlayfs`.
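A quick preflight check like the following (a sketch, using the two package names from the fix) can surface the missing fuse helpers before `enroot start` errors out:

```bash
# Sketch: check that the fuse helpers enroot relies on are installed.
MISSING=""
for cmd in squashfuse fuse-overlayfs; do
    command -v "$cmd" > /dev/null 2>&1 || MISSING="${MISSING} ${cmd}"
done
if [ -n "${MISSING}" ]; then
    echo "missing:${MISSING} -- try: sudo apt-get install${MISSING}"
else
    echo "all fuse helpers present"
fi
```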
If you set up a cluster with
- https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/docker/postinstall.sh
- https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/pyxis/postinstall.sh

these should have been installed by default. If you still have issues nonetheless, let's discuss offline.
On the login (head) node, launch a Python process by running the following:

```bash
enroot start --env NVIDIA_VISIBLE_DEVICES=void \
```
Maybe HF login can be done without `enroot start`?
You can, but that requires you to set up a Python env with HF locally. Since we have everything in the container, I prefer this way. But I agree it's not easy to follow, so I packaged it as a script here.
In this step, you will fine-tune the Llama model using the Alpaca dataset. Use the curl command to download the dataset:

```bash
curl https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
```
curl without an output option prints the content to the console; probably we need to store it somewhere instead?
Yes, I was missing `-o alpaca_data.json`.
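For the record, `-o` names the output file instead of letting curl stream to stdout. A sketch, demonstrated with a local `file://` URL (and illustrative /tmp paths) so it runs offline; the same flag applies to the raw.githubusercontent.com URL from the diff:

```bash
# Sketch: `-o` writes the response body to a file instead of stdout.
printf '{"demo": true}\n' > /tmp/alpaca_src.json
curl -s -o /tmp/alpaca_dest.json "file:///tmp/alpaca_src.json"
cat /tmp/alpaca_dest.json   # {"demo": true}
```

For a real HTTPS download, adding `-fL` (fail on HTTP errors, follow redirects) is also a common safeguard.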
The training process will create the following FSDP checkpoints.

```bash
$ ls /fsx/models/meta-llama/Meta-Llama-3-8B-tuned/fine-tuned-meta-llama/Meta-Llama-3-8B/
```
Sorry, I missed the point how and when was it created, I mean /fsx/models...
Co-authored-by: Pavel Belevich <belevich@amazon.com>
Thanks @pbelevich for your thoughtful reviews! As discussed offline, let's migrate the contents to
Issue #, if available:
Description of changes:
This test case illustrates different LLM model development steps: fine-tuning, evaluation, and deployment using Meta's llama-recipes.