Skip to content

Latest commit

 

History

History
37 lines (27 loc) · 2.47 KB

PREFLIGHT.md

File metadata and controls

37 lines (27 loc) · 2.47 KB

Optimization 1: Multihost recommended network settings

We included all the recommended network settings in rto_setup.sh.

preflight.sh will help you apply them based on GCE or GKE platform.

Before you run ML workload on Multihost with GCE or GKE, simply apply bash preflight.sh PLATFORM=[GCE or GKE] to leverage the best DCN network performance.

Here is an example for GCE:

bash preflight.sh PLATFORM=GCE && python3 MaxText/train.py MaxText/configs/base.yml run_name=$YOUR_JOB_NAME

Here is an example for GKE:

bash preflight.sh PLATFORM=GKE && python3 MaxText/train.py MaxText/configs/base.yml run_name=$YOUR_JOB_NAME

Optimization 2: Numa binding (You can only apply this to v4 and v5p)

NUMA binding is recommended for enhanced performance, as it reduces memory latency and maximizes data throughput, ensuring that your high-performance applications operate more efficiently and effectively.

For GCE, preflight.sh will help you install numactl dependency, so you can use it directly, here is an example:

bash preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 MaxText/train.py MaxText/configs/base.yml run_name=$YOUR_JOB_NAME

For GKE, numactl should be built into your docker image from maxtext_dependencies.Dockerfile, so you can use it directly if you built the maxtext docker image. Here is an example

bash preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 MaxText/train.py MaxText/configs/base.yml run_name=$YOUR_JOB_NAME
  1. numactl: This is the command-line tool used for controlling NUMA policy for processes or shared memory. It's particularly useful on multi-socket systems where memory locality can impact performance.
  2. --membind 0: This option binds the memory allocation of the process to node 0. It means the process will allocate memory only from the memory of node 0. If node 0's memory is exhausted and more is required, the process will fail rather than using memory from other nodes.
  3. --cpunodebind=0: This option binds the process to the CPUs of node 0. The process will only run on the CPUs of node 0, not on CPUs of other nodes. This can improve performance by ensuring that the process runs on CPUs that are close to the memory it is using.