Adding AxoNN's 3D tensor parallelism [WIP] #1086

Draft · wants to merge 6 commits into main
Conversation

@siddharth9820 commented Nov 28, 2023

Steps to run:

Install AxoNN (dependencies: PyTorch and mpi4py):

  • git clone git@github.com:axonn-ai/axonn.git
  • cd axonn
  • git checkout 45647ea
  • pip install -e .

Preparing a config file to use AxoNN:

First, set "use_axonn_model_parallelism": true.
Then set "depth_model_parallel_size", "row_model_parallel_size", and "column_model_parallel_size" as required by your model. The product of these should equal "model_parallel_size".

You can also set "optimize_axonn_communication": true to enable communication optimizations. These also require setting the environment variable export CUDA_DEVICE_MAX_CONNECTIONS=1.

At a high level, the matrix multiplications in your model will be sharded over "model_parallel_size" GPUs.
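For illustration, a minimal config fragment might look like the following; the specific sizes are hypothetical and should be chosen to match your model and GPU count, keeping their product equal to "model_parallel_size":

```yaml
# Hypothetical AxoNN settings in a NeoX .yml config (sketch, not a tuned example).
# 2 x 2 x 2 = 8, so model_parallel_size must be 8 here.
"use_axonn_model_parallelism": true,
"depth_model_parallel_size": 2,
"row_model_parallel_size": 2,
"column_model_parallel_size": 2,
"model_parallel_size": 8,

# Optional communication optimizations; these also require
# `export CUDA_DEVICE_MAX_CONNECTIONS=1` in the launch environment.
"optimize_axonn_communication": true,
```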

ToDos:

  • Implement AxoNN support for parallel_output=True (GPT-J residual)
  • Implement AxoNN support for LLaMAMLP
  • Integrate communication optimizations

@Quentin-Anthony (Member) commented

Under testing

Quentin-Anthony self-assigned this Nov 29, 2023
@siddharth9820 (Author) commented

Correctness check on 125M.yml with use_axonn_model_parallelism: true, column_model_parallel_size = 1, row_model_parallel_size = 1, depth_model_parallel_size = 2, and model_parallel_size = 2, run on 2 GPUs.
Dataset: enwik8
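For reference, expressed in the NeoX .yml config style, the AxoNN-related settings for this run would look roughly like this (a sketch based on the values above):

```yaml
# AxoNN settings used for the 2-GPU correctness check (sketch).
"use_axonn_model_parallelism": true,
"column_model_parallel_size": 1,
"row_model_parallel_size": 1,
"depth_model_parallel_size": 2,
"model_parallel_size": 2,
```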

[Image: loss curve, smoothed over 100 iterations]

@siddharth9820 (Author) commented

@Quentin-Anthony I have updated the install instructions to install axonn from a fixed commit - 3ebc34c

@Quentin-Anthony (Member) commented

> @Quentin-Anthony I have updated the install instructions to install axonn from a fixed commit - 3ebc34c

Thanks!

@siddharth9820 (Author) commented Jan 18, 2024

@Quentin-Anthony Pushed some communication optimizations and also updated the instructions to install AxoNN from a newer commit - 45647ea.

To enable these optimizations, set "optimize_axonn_communication": true in your NeoX config files. These also require setting the environment variable export CUDA_DEVICE_MAX_CONNECTIONS=1.
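As a minimal sketch, assuming the AxoNN settings from the PR description are already in place, enabling the optimizations only adds one config entry plus the environment variable:

```yaml
# Add to the NeoX .yml config, alongside the other AxoNN settings (sketch).
"optimize_axonn_communication": true,
# Also run `export CUDA_DEVICE_MAX_CONNECTIONS=1` in the shell before launching.
```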

Labels: feature request
Participants: 4