Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core feature] Refactor distributed job using common ReplicaSpec #4408

Open
2 tasks done
troychiu opened this issue Nov 12, 2023 · 2 comments · May be fixed by #5355 or flyteorg/flytekit#2424
Open
2 tasks done

[Core feature] Refactor distributed job using common ReplicaSpec #4408

troychiu opened this issue Nov 12, 2023 · 2 comments · May be fixed by #5355 or flyteorg/flytekit#2424
Assignees
Labels
enhancement New feature or request waiting for reporter Used for when we need input from the bug reporter

Comments

@troychiu
Copy link
Member

Motivation: Why do you think this is important?

Right now, different types of distributed jobs such as tensorflow, PyTorch, ... all have their own ReplicaSpec. Based on this discussion thread #4179 (review), we can have a shared ReplicaSpec in common.proto so that all types of distributed jobs can leverage it.

Goal: What should the final outcome look like, ideally?

In common.proto, we have a ReplicaSpec like

# common.proto
message ReplicaSpec {
    // Number of replicas
    int32 replicas = 1;
    
    // Image used for the replica group
    string image = 2;
    
    // Resources required for the replica group
    core.Resources resources = 3;
    
    // Restart policy determines whether pods will be restarted when they exit
    RestartPolicy restart_policy = 4;
    ...
}

and all types of distributed jobs (tensorflow, PyTorch, ray, ...) share it.

Describe alternatives you've considered

Stay what we have now. That is, all types of distributed job have their own ReplicaSpec.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@troychiu troychiu added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Nov 12, 2023
@troychiu troychiu self-assigned this Nov 12, 2023
@hamersaw
Copy link
Contributor

IMO there needs to be more discussion here. There are certainly many configuration parameters that will differ between ReplicaSpec for each of the distributed environments, and even within each environment. Is the RestartPolicy you have mocked applied to every plugin? If not, this is another parameter that may result in unintended behavior. Is the proposal to use this common.pb message as a field of each replica spec? For example:

// spark.pb
message SparkWorkerReplicaSpec {
    ReplicaSpec replica_spec = 1;
    
    ... spark sepcific configuration
}

// dask.pb
message DaskScheduler {
   RepliaSpec replica_spec = 1;
   
   ...
}

message DaskWorker {
    ReplicaSpec replica_spec = 1;
    
    ....
}

@eapolinario eapolinario added waiting for reporter Used for when we need input from the bug reporter and removed untriaged This issues has not yet been looked at by the Maintainers labels Dec 8, 2023
@MortalHappiness
Copy link
Contributor

#take

MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 11, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 11, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 11, 2024
…uf changes

Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 16, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 17, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 17, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
…uf changes

Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
…uf changes

Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 21, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 21, 2024
Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 21, 2024
…uf changes

Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request waiting for reporter Used for when we need input from the bug reporter
Projects
None yet
4 participants