Add DeepSpeed Example with MPI Operator #2091

Open
andreyvelich opened this issue Apr 29, 2024 · 8 comments

Comments

@andreyvelich
Member

Related: #2040

As we have discussed multiple times, the Kubeflow community is looking for examples of how to use the MPI Operator with DeepSpeed.

We should add an example to either the MPI Operator: https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1 or the Training Operator: https://github.com/kubeflow/training-operator/tree/master/examples.

Some pending PRs can be found here as reference:

/good-first-issue
/help
/area example

/cc @alculquicondor @kubeflow/wg-training-leads @kuizhiqing

@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

@tenzen-y
Member

I believe that examples for both training-operator and mpi-operator would be worthwhile. Specifically, I think we should add one example for PyTorchJob with DeepSpeed and torchrun, and one for MPIJob v2 with DeepSpeed and mpirun.

@andreyvelich
Member Author

Sure, that sounds great @tenzen-y!
It would be great to see benchmarks comparing mpirun and torchrun for running DeepSpeed on Kubernetes.

@tenzen-y
Member

That sounds great, but I expect no significant performance difference between the two approaches, since DeepSpeed uses the NCCL backend even when it is launched with mpirun.

@vsoch
vsoch commented Apr 29, 2024

I'm working on an equivalent example for the Flux Operator, but a quick question: will it work OK to test without a GPU? I've been trying to get just 3 nodes, each with one NVIDIA GPU, on Google Cloud, and I never get the allocation.

@vsoch
vsoch commented Apr 29, 2024

Ah - this looks more promising. https://github.com/kubeflow/mpi-operator/pull/567/files

@andreyvelich
Member Author

@tenzen-y Does DeepSpeed only support the NCCL backend? I.e., can't we run it on CPUs?

@tenzen-y
Member

tenzen-y commented May 7, 2024

TBH, I don't have any experience with CPU-only setups. But at first glance, DeepSpeed seems to support PyTorch without a GPU: https://github.com/microsoft/DeepSpeed/blob/master/.github/workflows/cpu-torch-latest.yml
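
For anyone who wants to try the CPU question locally, here is a minimal sketch of picking a communication backend by GPU availability. It is not from this thread and has not been verified on a cluster. It relies on two things: NCCL requires GPUs, and Gloo is the PyTorch distributed backend that runs on CPU. The helper name `pick_dist_backend` is hypothetical, and whether DeepSpeed trains correctly end to end over Gloo is an assumption to check, not a confirmed result.

```python
def pick_dist_backend(cuda_available: bool) -> str:
    """Return a torch.distributed backend name (hypothetical helper).

    NCCL only works with GPUs; Gloo is the CPU-capable fallback.
    """
    return "nccl" if cuda_available else "gloo"


def main() -> None:
    # Imported here so the helper above can be used without these packages.
    import torch
    import deepspeed

    backend = pick_dist_backend(torch.cuda.is_available())
    # Assumption: the launcher (mpirun, torchrun, or deepspeed) has already
    # set up the environment needed to discover rank and world size.
    deepspeed.init_distributed(dist_backend=backend)
    print(f"initialized distributed backend: {backend}")


if __name__ == "__main__":
    main()
```

On a node without a GPU this selects Gloo, which may be enough for a smoke test of the job wiring even if it says nothing about training performance.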
