
# veScale Parallel Overview

The overview of veScale n-D parallelism is as follows:

[Figure: veScale 5D parallelism overview]

(* is under development)

The Auto-Parallelize block takes the untouched Model from the user together with a Parallel Plan (written manually, predefined for each model type, or automatically generated by Auto-Plan*), and then parallelizes the single-device model into n-D parallelism across a mesh of devices.

veScale's n-D parallelism follows a decoupled design in which each dimension of parallelism is handled by an independent sub-block (e.g., DModule handles only Tensor and Sequence Parallel, without coupling to other parallelisms). In contrast to the conventional coupled design that intertwines all parallelisms together, such decoupled n-D parallelism enjoys composability, debuggability, explainability, and extensibility, all of which are of great value for hyper-scale training in production.
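To make the Parallel Plan concrete, the sketch below shows what a manual Tensor/Sequence Parallel sharding plan might look like. The plan schema (a parameter plan plus a forward resharding plan built from Shard/Replicate placements), the import path, and the module names are assumptions for illustration, not a definitive veScale API reference; the predefined plans shipped with the examples are the authoritative templates.

```python
# Illustrative sketch of a manual TP/SP sharding plan.
# Assumptions: the plan schema ("parameter" / "forward" sections), the import
# path, and the module names ("mlp.fc1", ...) are hypothetical examples.
from vescale.dtensor.placement_types import Replicate, Shard  # assumed path

# How each parameter is sharded along the "TP_SP" mesh dimension.
param_sharding_plan = {
    "mlp.fc1.weight": [Shard(0)],  # column-parallel linear
    "mlp.fc2.weight": [Shard(1)],  # row-parallel linear
}

# How activations are resharded at module boundaries during the forward pass.
fwd_resharding_plan = {
    "mlp.input": [[Replicate()]],
    "mlp.output": [[Replicate()]],
}

sharding_plan = {"parameter": param_sharding_plan, "forward": fwd_resharding_plan}
```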

## 4D Parallelism API

The API for our 4D parallelism (Tensor, Sequence, Data, and ZeRO-2) is as follows:

```python
# zero model code change
from <HuggingFace> import <ModelCls>, <ModelArgs>

# create a fake model without actual memory usage (optional)
fake_model = deferred_init(<ModelCls>, <ModelArgs>)

# initialize the 4D device mesh
mesh = init_device_mesh("cuda", (dp_zero_size, tp_sp_size), mesh_dim_names=["DP_ZERO", "TP_SP"])

# parallelize the model in TP & SP
from <PredefinedPlan> import sharding_plan
real_tp_sp_model = parallelize_module(fake_model, mesh["TP_SP"], sharding_plan)

# parallelize the model in DP
ddp_model = DDP(real_tp_sp_model, mesh["DP_ZERO"])

# parallelize the optimizer with ZeRO-2
doptimizer = DistributedOptimizer(torch.optim.AdamW, models=[ddp_model])

# train the model as if on a single device
for x in dataset:
    loss = ddp_model(x)
    loss.backward()
    doptimizer.step()
    doptimizer.zero_grad()
```
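A note on the device mesh above: the four parallel dimensions run on a 2D mesh of size dp_zero_size × tp_sp_size, where Tensor and Sequence Parallel share the "TP_SP" dimension and Data Parallel and ZeRO-2 share the "DP_ZERO" dimension. The sketch below illustrates that layout with PyTorch's torch.distributed.device_mesh.init_device_mesh; whether veScale's init_device_mesh wraps or mirrors this exact function is an assumption.

```python
# Minimal sketch of the 2D ("DP_ZERO" x "TP_SP") device-mesh layout, shown with
# PyTorch's device-mesh API for illustration; veScale's init_device_mesh is
# assumed to produce an equivalent layout.
# Run with: torchrun --nproc_per_node=8 mesh_sketch.py   (8 GPUs assumed)
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dp_zero_size, tp_sp_size = 2, 4  # 2 x 4 = 8 devices in total

mesh = init_device_mesh(
    "cuda", (dp_zero_size, tp_sp_size), mesh_dim_names=("DP_ZERO", "TP_SP")
)

# Each rank belongs to one DP_ZERO group (size 2, for DP + ZeRO-2) and one
# TP_SP group (size 4, for Tensor & Sequence Parallel).
print(
    f"rank {dist.get_rank()}: "
    f"DP_ZERO sub-mesh size = {mesh['DP_ZERO'].size()}, "
    f"TP_SP sub-mesh size = {mesh['TP_SP'].size()}"
)
```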

More examples can be found in: <repo>/examples/.

## 5D Parallelism API

Coming Soon