
Add llama3 and distributed checkpoint support in NeVA #9101

Merged
yaoyu-33 merged 67 commits into main from yuya/neva_llama3 on May 22, 2024
Conversation

Collaborator
@yaoyu-33 yaoyu-33 commented May 2, 2024

What does this PR do?

Add llama3 and distributed checkpoint support in NeVA

Collection: [multimodal]

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
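A hypothetical usage sketch (untested): only mm_cfg.llm.model_type and data.conv_template are read in this PR's diff; the config path and the "llama_3" value strings are assumptions.

from omegaconf import OmegaConf

# Hypothetical overrides for NeVA pretraining/fine-tuning; the "llama_3" values are assumptions.
overrides = OmegaConf.create(
    {
        "model": {
            "mm_cfg": {"llm": {"model_type": "llama_3"}},  # read via cfg.mm_cfg.llm.get("model_type", "nvgpt")
            "data": {"conv_template": "llama_3"},          # read via data_cfg.get("conv_template", "nvgpt")
        }
    }
)

# Merge into the usual NeVA config (path assumed) before launching training; checkpoints
# are then saved/loaded in the distributed (sharded) format this PR adds.
base_cfg = OmegaConf.load("examples/multimodal/multimodal_llm/neva/conf/neva_config.yaml")
cfg = OmegaConf.merge(base_cfg, overrides)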

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

yaoyu-33 and others added 30 commits March 15, 2024 18:27
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
# Conflicts:
#	examples/multimodal/vision_language_foundation/clip/megatron_clip_pretrain.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
# Conflicts:
#	nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
yaoyu-33 and others added 8 commits May 9, 2024 19:52
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
# Conflicts:
#	nemo/collections/multimodal/data/neva/neva_dataset.py
#	nemo/collections/nlp/parts/nlp_overrides.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33 yaoyu-33 changed the title Add llama3 support in NeVA Add llama3 and distributed checkpoint support in NeVA May 13, 2024
@yaoyu-33 yaoyu-33 requested a review from mikolajblaz May 13, 2024 17:59
@@ -169,7 +169,7 @@ def eval_model(args):
parser.add_argument("--image-folder", type=str, default="")
parser.add_argument("--question-file", type=str, default="tables/question.json")
parser.add_argument("--answers-file", type=str, default="answer.jsonl")
-parser.add_argument("--conv-mode", type=str, default="llava_v0")
+parser.add_argument("--conv-mode", type=str, default="llava_v0")  # this flag has no use!
Collaborator

Then should we get rid of it?

@@ -487,6 +544,7 @@ def __init__(self, model):
is_multimodal=self.data_cfg.is_multimodal,
sep_image_conv_front=self.data_cfg.sep_image_conv_front,
conv_template=self.data_cfg.get("conv_template", "nvgpt"),
+model_type=self.cfg.mm_cfg.llm.get("model_type", "nvgpt"),
Collaborator

Should we rename this to nemotron?

Collaborator Author

That will break many old checkpoints; we might just keep it this way for a while...

yaoyu-33 and others added 3 commits May 13, 2024 16:24
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@@ -75,19 +80,24 @@ def load_checkpoint(
        else:
            sharded_strategy = None

        if not strict:
            for key in list(sharded_state_dict['state_dict'].keys()):
Collaborator

I assume this is a temporary implementation of the strict flag, because we also need to notify the user which keys are skipped.

Also, this will work only with the Zarr ckpt format; for PyT Distributed it will be different.

Collaborator

Also, I don't think it's correct, because we should use the sharded key, not the state dict key (it also doesn't account for nested dicts).
Can you try something like this (I didn't run this code, so there might be errors)?
This should work for all backends and nested dicts, and use the correct keys.


from megatron.core.dist_checkpointing.dict_utils import extract_matching_values
from megatron.core.dist_checkpointing.mapping import ShardedBase


    if not strict:
        sharded_state_dict = self.adjust_non_strict_load(path, sharded_state_dict)
    ...

    def adjust_non_strict_load(self, path: _PATH, sharded_state_dict: Dict[str, Any]):
        ckpt_sharded_metadata = dist_checkpointing.load_tensors_metadata(path)
        loaded_keys = []
        missing_keys = []
        unexpected_keys = []

        def should_remove_missing_sharded_base(x: Any):
            if isinstance(x, ShardedBase):
                if x.key in ckpt_sharded_metadata:
                    loaded_keys.append(x.key)
                    return False
                else:
                    unexpected_keys.append(x.key)
                    return True
            return False

        _, sharded_state_dict = extract_matching_values(sharded_state_dict, should_remove_missing_sharded_base)
        logging.info(f'The following keys are not in the checkpoint and will not be loaded: {unexpected_keys}')

        # TODO: compute missing_keys by:
        #  1. all_gather_object of loaded_keys
        #  2. missing_keys = ckpt_sharded_metadata.keys() - loaded_keys
        return sharded_state_dict
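For the missing_keys TODO above, a possible follow-up sketch (untested; compute_missing_keys is a hypothetical helper name, and it assumes torch.distributed is initialized):

import torch.distributed as dist

def compute_missing_keys(loaded_keys, ckpt_sharded_metadata):
    # 1. all_gather_object of loaded_keys so every rank sees what was loaded on any rank
    all_loaded = [None] * dist.get_world_size()
    dist.all_gather_object(all_loaded, loaded_keys)
    globally_loaded = {key for rank_keys in all_loaded for key in rank_keys}
    # 2. missing_keys = ckpt_sharded_metadata.keys() - loaded_keys (checkpoint keys no rank requested)
    return set(ckpt_sharded_metadata.keys()) - globally_loaded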

sharded_state_dict = super().sharded_state_dict(prefix=prefix, sharded_offsets=sharded_offsets, **kwargs)

state_dict = self.state_dict(prefix='', keep_vars=True)
state_dict.pop('weight')
Collaborator

Is weight not needed at all?

Collaborator Author

The weight is already taken care of in super().

Comment on lines 249 to 256
for layer_name in state_dict.keys():
    tensor = state_dict[layer_name]
    layer_key = f'{prefix}{layer_name}'
    sharded_state_dict[layer_key] = make_sharded_tensor_for_checkpoint(
        tensor,
        layer_key,
        prepend_offsets=sharded_offsets,
    )
Collaborator

I think this can be replaced with

from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint
     ...
     # prefix and sharded_offsets likely need to be forwarded as well
     sharded_state_dict.update(make_sharded_tensors_for_checkpoint(state_dict, prefix, sharded_offsets=sharded_offsets))

@yaoyu-33 yaoyu-33 removed the Run CICD label May 16, 2024
yaoyu-33 and others added 4 commits May 16, 2024 10:49
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
# Conflicts:
#	nemo/utils/callbacks/dist_ckpt_io.py
def adjust_non_strict_load(self, path: _PATH, sharded_state_dict: Dict[str, Any]):
    ckpt_sharded_metadata = dist_checkpointing.load_tensors_metadata(path)
    loaded_keys = []
    missing_keys = []

Check notice — Code scanning / CodeQL: Unused local variable — Variable missing_keys is not used.
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

+import os

Check notice — Code scanning / CodeQL: Unused import — Import of 'os' is not used.
# Conflicts:
#	nemo/collections/multimodal/parts/utils.py
@yaoyu-33 yaoyu-33 merged commit d7bb403 into main May 22, 2024
133 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya/neva_llama3 branch May 22, 2024 03:14