Improve upon EFA versions script #266

Merged
merged 1 commit into from
May 22, 2024

sean-smith
Contributor

@sean-smith sean-smith commented Apr 15, 2024

This script adds the Libfabric, NVIDIA driver, and CUDA versions. It covers everything in efa-versions.sh, so I removed that script.

$ srun python3 efa-versions.py
+--------------------------+--------------+
|  Package                 |  Version     |
+--------------------------+--------------+
|  EFA installer version:  |  1.26.1      |
+--------------------------+--------------+
|  NCCL Version            |  2.18.5      |
+--------------------------+--------------+
|  Libfabric Version       |  1.18.2      |
+--------------------------+--------------+
|  AWS OFI NCCL version:   |  1.7.3-aws   |
+--------------------------+--------------+
|  Nvidia Driver           |  535.104.12  |
+--------------------------+--------------+
|  CUDA Version:           |  12.1.105    |
+--------------------------+--------------+
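
The PR text doesn't show how efa-versions.py collects each value. As a minimal sketch of how one entry could be gathered (an assumption, not the script's actual code): `first_line` and `nvidia_driver_version` are hypothetical names, and the `nvidia-smi` query is just one common way to read the driver version. Returning `None` when a tool is missing is also an assumption.

```python
import shutil
import subprocess


def first_line(cmd):
    """Return the first line of a command's stdout, or None if it cannot be run."""
    # Hypothetical helper: skip tools that are not installed on this node.
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=30
        )
    except (subprocess.SubprocessError, OSError):
        # Non-zero exit, timeout, or failure to launch: treat as unknown.
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def nvidia_driver_version():
    """Query the NVIDIA driver version via nvidia-smi (assumed to be on PATH)."""
    return first_line(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
```

On a node without `nvidia-smi`, `nvidia_driver_version()` returns `None` instead of raising.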

And with a container image:

$ srun python3 efa-versions.py --container-image megatron-training
+--------------------------+--------------+--------------+
|  Package                 |  Local       |  Container   |
+--------------------------+--------------+--------------+
|  EFA installer version:  |  1.26.1      |  1.30.0      |
+--------------------------+--------------+--------------+
|  NCCL Version            |  2.18.5      |  None        |
+--------------------------+--------------+--------------+
|  Libfabric Version       |  1.18.2      |  1.19.0      |
+--------------------------+--------------+--------------+
|  AWS OFI NCCL version:   |  1.7.3-aws   |  None        |
+--------------------------+--------------+--------------+
|  Nvidia Driver           |  535.104.12  |  535.104.12  |
+--------------------------+--------------+--------------+
|  CUDA Version:           |  12.1.105    |  12.2.128    |
+--------------------------+--------------+--------------+
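
To spot drift between host and container at a glance, the three-column output above could be post-processed along these lines. This is only a sketch: `version_mismatches` is a hypothetical helper, not part of the script, and the rows are copied by hand from the table above, with the printed `None` entries treated as unknown.

```python
def version_mismatches(rows):
    """Return (package, local, container) rows where both versions are known and differ."""
    return [
        (package, local, container)
        for package, local, container in rows
        if local is not None and container is not None and local != container
    ]


# Rows copied from the example output; None means no version was reported.
rows = [
    ("EFA installer version:", "1.26.1", "1.30.0"),
    ("NCCL Version", "2.18.5", None),
    ("Libfabric Version", "1.18.2", "1.19.0"),
    ("AWS OFI NCCL version:", "1.7.3-aws", None),
    ("Nvidia Driver", "535.104.12", "535.104.12"),
    ("CUDA Version:", "12.1.105", "12.2.128"),
]

for package, local, container in version_mismatches(rows):
    print(f"{package} local={local} container={container}")
```

For the example data this flags the EFA installer, Libfabric, and CUDA rows; the NVIDIA driver matches and the two `None` rows are skipped.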

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Signed-off-by: Sean Smith <seaam@amazon.com>
@sean-smith
Contributor Author

Not ready to merge.

@perifaws
Contributor

@sean-smith ready now?

@nghtm
Collaborator

nghtm commented May 1, 2024

@sean-smith when I try to run it, I get:

ubuntu@ip-10-1-22-213:~$ python3 check-efa.py
Traceback (most recent call last):
  File "check-efa.py", line 8, in <module>
    from prettytable import PrettyTable
ModuleNotFoundError: No module named 'prettytable'
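
The traceback comes from the import at the top of the script. One way to turn it into an actionable message is to check for the module before importing it. A sketch only: `install_hint` is a hypothetical helper and not part of the PR.

```python
import importlib.util


def install_hint(names):
    """Return a message naming the top-level modules that are not importable, or None."""
    missing = [name for name in names if importlib.util.find_spec(name) is None]
    if not missing:
        return None
    return (
        "Missing Python package(s): " + ", ".join(missing)
        + ". Try: pip install " + " ".join(missing)
    )


hint = install_hint(["prettytable"])
if hint:
    print(hint)
```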

@nghtm
Collaborator

nghtm commented May 1, 2024

Can we run without prettytable?

@sean-smith
Contributor Author

Can we run without prettytable?

No, you just need to run:

sudo apt install python3.8-venv
python3 -m venv venv && source venv/bin/activate
pip install prettytable
python3 efa-versions.py

@nghtm
Collaborator

nghtm commented May 2, 2024

If a customer wants to run this on a compute node (which they likely will), the packages must be installed on the compute node, which is suboptimal. Are there alternatives to prettytable that we can use without installing a package?
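
On the alternatives question: a table in the same layout can be produced with the standard library alone. A minimal sketch, assuming the two-space padding and a rule after every row as in the output shown in the PR description; `format_table` is a hypothetical name, not something in the merged script.

```python
def format_table(headers, rows, pad=2):
    """Render headers and rows as an ASCII grid using only the standard library."""
    table = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(r[i]) for r in table) for i in range(len(headers))]
    rule = "+" + "+".join("-" * (w + 2 * pad) for w in widths) + "+"
    lines = [rule]
    for r in table:
        cells = [" " * pad + c.ljust(w) + " " * pad for c, w in zip(r, widths)]
        lines.append("|" + "|".join(cells) + "|")
        lines.append(rule)  # rule after every row, matching the example output
    return "\n".join(lines)


print(format_table(
    ["Package", "Version"],
    [("EFA installer version:", "1.26.1"), ("Nvidia Driver", "535.104.12")],
))
```

The trade-off is a few extra lines in the script in exchange for no third-party dependency on the compute nodes.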

@sean-smith
Contributor Author

If a customer wants to run this on a compute node (which they likely will), the packages must be installed on the compute node, which is suboptimal. Are there alternatives to prettytable that we can use without installing a package?

It's a little more nuanced than that: the customer will set up their virtualenv on the head node in the FSx for Lustre filesystem and then use that virtualenv from the compute nodes.

sudo apt install python3.8-venv  # installs on head node
python3 -m venv venv && source venv/bin/activate  # installs on head node
pip install prettytable  # installs on head node, in FSx
srun python3 efa-versions.py  # runs on compute
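
When the venv lives on the shared filesystem, it can help to confirm from a compute node that the job is really using it. A small sketch, assuming the venv was created with the standard `venv` module; `in_virtualenv` and `describe_environment` are hypothetical names.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv created by the venv module."""
    # Inside a venv, sys.prefix points at the venv and sys.base_prefix at the base install.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_environment():
    """Return a one-line description of the interpreter in use."""
    kind = "virtualenv" if in_virtualenv() else "system interpreter"
    return f"{kind}: {sys.executable}"


print(describe_environment())
```

Launched through `srun` with the venv activated, the printed path should point into the venv directory on the shared filesystem rather than at the node's system Python.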

@sean-smith sean-smith merged commit 21648f6 into main May 22, 2024
@sean-smith sean-smith deleted the efa-versions branch May 22, 2024 20:49
KeitaW pushed a commit that referenced this pull request Jun 3, 2024
Signed-off-by: Sean Smith <seaam@amazon.com>
KeitaW pushed a commit that referenced this pull request Jun 4, 2024
Signed-off-by: Sean Smith <seaam@amazon.com>