
Extend multimodal/speech_llm with lhotse, T5, and Bestow support #9169

Open
wants to merge 453 commits into base: main

Conversation

zhehuaichen (Collaborator)

What does this PR do?

In multimodal/speech_llm, add lhotse dataloader support and two models, SALM-T5 and Bestow-GPT. Include example configs.

Main features under speech_llm:

  • Lhotse dataloader support for speech SFT in speech_llm
  • SALM-style architecture with T5 LLM backbone
  • Bestow-style architecture (cross-attention based) with GPT LLM backbone

Minor edits in the nlp collection:

  • megatron_base_model.py: handle the case where tokenizer.type is not set
  • megatron_lm_encoder_decoder_model.py: handle the case where encoder_input is used
  • megatron_base_prompt_learning_model.py: group the LLM init code under an init_model function (following the pattern from megatron_gpt_prompt_learning_model.py) so that it can be overridden by subclasses when needed; see the sketch after this list
  • megatron/utils.py: in gradient accumulation, handle the case where the batch size from dynamic bucketing is not divisible by the number of microbatches. This happens when using the lhotse dataloader with batch_duration
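A minimal sketch of the init_model pattern referred to in the third bullet. The class is a simplified stand-in for the NeMo base class, and the attribute assignments are placeholders for illustration only:

```python
class MegatronBasePromptLearningModel:  # simplified stand-in for the NeMo base class
    def __init__(self, cfg, trainer):
        # Keep __init__ thin and delegate LLM setup to init_model(),
        # mirroring the pattern in megatron_gpt_prompt_learning_model.py.
        self.init_model(cfg, trainer)

    def init_model(self, cfg, trainer):
        """All frozen-LLM initialization lives here so subclasses can override it."""
        self.frozen_model = None  # placeholder for the actual LLM loading


class SpeechPromptModel(MegatronBasePromptLearningModel):
    def init_model(self, cfg, trainer):
        super().init_model(cfg, trainer)
        self.audio_encoder = None  # placeholder for modality-specific setup
```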

Collection: [common,nlp,multimodal]

PR Type:

  • New Feature
  • Bugfix
  • Documentation

@github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@zhehuaichen marked this pull request as ready for review May 11, 2024 04:03
@@ -0,0 +1,361 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator

could you put this file under conf/bestow/*?

# See the License for the specific language governing permissions and
# limitations under the License.

name: megatron_audio_gpt_salm_lhotse
Collaborator

could you put this file under conf/salm/*?

# See the License for the specific language governing permissions and
# limitations under the License.

name: megatron_audio_t5_salm_lhotse
Collaborator

could you put this file under conf/salm/*?

# See the License for the specific language governing permissions and
# limitations under the License.

import copy
Collaborator

maybe rename this file to modular_t5_models.py and the previous one to modular_gpt_models.py, or even change the modular prefix to something else

vectors = collate_vectors_lhotse(items, padding_value=padding_value)
if max_length > vectors.size(1):
    vectors = torch.cat(
        [vectors, padding_value * torch.ones(vectors.size(0), max_length - vectors.size(1), dtype=vectors.dtype)],
Collaborator

why do we need to enforce a static shape with padding for every example here?
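For context, a minimal sketch of the dynamic alternative the question implies: pad only to the longest item in each batch rather than to a global max_length. The function name is illustrative, not part of the PR:

```python
import torch

def collate_to_longest(items: list, padding_value: float = 0.0) -> torch.Tensor:
    """Pad a list of 1-D tensors only to the longest item in this batch,
    rather than to a global max_length."""
    max_len = max(t.size(0) for t in items)
    out = torch.full((len(items), max_len), padding_value, dtype=items[0].dtype)
    for i, t in enumerate(items):
        out[i, : t.size(0)] = t
    return out
```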

return (n + m - 1) // m * m


class TextProcessing:
@pzelasko (Collaborator) commented May 15, 2024

This class needs more documentation on what it is doing, how to use its API, and what the expected input and output formats are. Also, it only has private methods right now; the main API method should be public (no underscore at the beginning).

I'd expect a docstring along the lines of: this class is used to convert X to Y; in order to do so, it performs A, B, C, and D; the expected format of X is ...; the expected format of Y is ...

Since it's used to convert text to prompts to token ids, I'd like to see full documentation of the prompt template/schema.

The options to init also need documentation; if some are unused/unnecessary they may be removed.
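A sketch of the shape such documentation could take; the method name, field names, and docstring contents below are illustrative assumptions, not the PR's actual API:

```python
class TextProcessing:
    """Converts a (question, answer) text pair into token ids for speech SFT.

    To do so it (A) fills the prompt template with the question and any
    context, (B) tokenizes the prompt and the answer, (C) truncates to
    max_seq_length, and (D) builds the answer-only loss mask.

    Expected input: dict with 'question' and 'answer' strings.
    Expected output: dict with 'input_ids', 'labels', and 'loss_mask' tensors.
    """

    def process_example(self, example: dict) -> dict:
        """Public entry point (no leading underscore), as the review asks."""
        raise NotImplementedError  # illustrative stub
```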

return processed_example


def convert_canary_prompt_to_text(prompt, is_canary_tokens_augment):
Collaborator

I understand why this function is built the way it is, but for future experiments let's try to move away from canary special-token conversion and design a configurable prompting setup instead.
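As a strawman for such a configurable setup; the template keys, slot names, and template syntax here are invented for illustration and are not an existing NeMo API:

```python
# Hypothetical prompt templates keyed by task, replacing hard-coded
# special-token conversion with a configurable mapping.
PROMPT_TEMPLATES = {
    "asr": "Transcribe the following audio in {source_lang}:",
    "ast": "Translate the following {source_lang} audio to {target_lang}:",
}

def build_prompt(task: str, **slots: str) -> str:
    """Fill the configured template for `task` with the given slot values."""
    return PROMPT_TEMPLATES[task].format(**slots)

# e.g. build_prompt("ast", source_lang="French", target_lang="English")
```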

by Lhotse samplers instead.
"""

def __init__(
Collaborator

init args are not documented

tokens_to_generate: int,
pad_to_max_length: bool,
max_seq_length: int,
noise_cuts: Optional = None,
Collaborator

this is a leftover from before the lhotse NeMo dataset was refactored; remove noise_cuts from this class.

conf['manifest_filepath'] = cur_manifest_filepath
question_file_set = data_cfg.get('question_file_set', None)
if question_file_set is not None:
    conf['question_file_set'] = [question_file_set[dataset_idx]]
Collaborator

unless you add question_file_set to LhotseDataLoadingConfig, it will be discarded at the beginning of get_lhotse_dataloader_from_config
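To make the point concrete, a sketch of what declaring the field might look like. The surrounding dataclass is NeMo's; the exact field type chosen here is an assumption:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LhotseDataLoadingConfig:
    # ... existing lhotse dataloading fields ...
    # Keys absent from this structured config are dropped when the user
    # config is merged in get_lhotse_dataloader_from_config, so a new key
    # must be declared here to survive (type is illustrative).
    question_file_set: Optional[list] = None
```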

from nemo.collections.common.data.lhotse import get_lhotse_dataloader_from_config

# for eval, we need to create separate datasets so as to report split numbers
if data_cfg.get('is_tarred', False) or (is_eval == False and is_predict == False):
Collaborator

you shouldn't rely on the is_tarred flag to choose the lhotse dataloader; lhotse supports more formats than NeMo JSON and NeMo tar, and auto-deduces is_tarred
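A sketch of the suggested gating, keyed on an explicit use_lhotse option (already present in this PR's configs) rather than is_tarred, and assuming the get_lhotse_dataloader_from_config signature imported above; the non-lhotse branch is deliberately elided:

```python
from nemo.collections.common.data.lhotse import get_lhotse_dataloader_from_config

def build_dataloader(data_cfg, dataset, global_rank: int, world_size: int):
    # Gate on an explicit option; lhotse itself auto-detects tarred vs.
    # plain manifests, so is_tarred need not drive this decision.
    if data_cfg.get('use_lhotse', False):
        return get_lhotse_dataloader_from_config(
            data_cfg, global_rank=global_rank, world_size=world_size, dataset=dataset
        )
    raise NotImplementedError("non-lhotse path elided in this sketch")
```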

freeze_modality_adapter: False
load_audio_encoder: True

global_batch_size: 128
Collaborator

how do these settings of global/micro batch size work with lhotse dynamic batch sizes?
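For reference, the standard Megatron-style relation these settings imply; with lhotse's batch_duration the realized per-step batch size is dynamic, so this arithmetic is exactly where a mismatch can appear. A sketch only, not NeMo's implementation:

```python
def infer_num_microbatches(global_batch_size: int, micro_batch_size: int, data_parallel_size: int) -> int:
    """global = micro * num_microbatches * data_parallel; with dynamic
    lhotse batches the realized micro batch size can differ per step."""
    assert global_batch_size % (micro_batch_size * data_parallel_size) == 0
    return global_batch_size // (micro_batch_size * data_parallel_size)
```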

bucketing_batch_size: null
use_lhotse: True
duration_bins: [2,4,6,8,10,12,14,16,18]
lhotse:
Collaborator

this is an old config from before we merged lhotse dataloading into main; please update the lhotse-related options here (and in other configs if needed)

average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported.
num_classes: null

# test_ds:
Collaborator

remove?

@stevehuang52 (Collaborator) left a comment

Thanks for the great work, please address the CodeQL issues and see the minor comments.

from nemo.utils import logging


def build_salm_dataset(model_instance, data_cfg, is_train):
Collaborator

better not to include salm in the function name, to keep it more general

return (n + m - 1) // m * m


class TextProcessing:
Collaborator

Is this a copy or a modified version of the TextProcessing class in `audio_text_dataset`? If it's a modified version, we should inherit from the parent class and only override the necessary functions.
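If it is a modified copy, the inheritance could look roughly like this; the import path and overridden method name are assumptions based on the comment, not verified against the PR:

```python
# Hypothetical sketch: reuse the existing class and override only what the
# lhotse path needs, instead of keeping a diverging copy.
from nemo.collections.multimodal.speech_llm.data.audio_text_dataset import (
    TextProcessing as BaseTextProcessing,
)

class LhotseTextProcessing(BaseTextProcessing):
    def _process_example(self, context: str, output: str) -> dict:
        # lhotse-specific adjustments go here; everything else stays inherited.
        return super()._process_example(context, output)
```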

return processed_example


def convert_canary_prompt_to_text(prompt, is_canary_tokens_augment):
Collaborator

  1. can we make this function directly scalable to more languages, and flexible enough to adapt to new changes in the canary prompt?
  2. where is the format "<|fr|>" and "<|transcribe|>" defined?

random_context_prob: float = 0.0,
random_context_positive_percent: float = 0.1,
):
from lhotse.dataset import AudioSamples, CutMix
Collaborator

please move the import to the top of the file

self.random_context_prob = random_context_prob
self.random_context_positive_percent = random_context_positive_percent

def _inject_random_context_into_question(self, cut, random_context_num=8, random_context_positive_percent=0.1):
Collaborator

shall we remove random context? its usage is limited to word boosting and might hurt performance when doing multi-task training

*args,
**kwargs,
):
assert input_embeds.shape[-1] == encoder_states.shape[-1]
Collaborator

please add docstrings

@@ -155,3 +155,15 @@ def align_feat_seq_list(
new_seq_list.append(new_seq)
new_seq_len_list.append(new_seq_len)
return new_seq_list, new_seq_len_list


def to_cuda(inputs, non_blocking=True):
@@ -417,8 +417,8 @@ def split_list(inputs, num_chunks):
"""
Split a list into equal sized chunks
"""
# if len(inputs) % chunk_size != 0, round down the chunk size
Collaborator

remove this line if it's not used

@@ -442,8 +442,11 @@ def get_iterator_k_split(batch: Union[Dict, List[torch.Tensor]], num_microbatche

# Split tensor items
items = list(tensor_items.items())
assert items[0][1].shape[0] % num_microbatches == 0, "Issue with batch size configuration!"
Collaborator

need someone from NLP to review the change removing the constraint `assert items[0][1].shape[0] % num_microbatches == 0`
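For reference, one standard way to relax that constraint; a sketch, not necessarily this PR's exact change:

```python
import torch

def split_tensor_uneven(t: torch.Tensor, num_microbatches: int) -> list:
    """torch.tensor_split tolerates a leading dimension that is not divisible
    by num_microbatches: the first chunks get one extra element instead of
    the call failing."""
    return list(torch.tensor_split(t, num_microbatches, dim=0))

# e.g. a batch of 10 split 4 ways yields chunk sizes [3, 3, 2, 2]
```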

@@ -246,12 +246,12 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
self.use_fsdp = cfg.get('fsdp', False)

def setup_transformer_engine_tp_groups(self):
""" This should be called after model parallel groups have been initialized
Collaborator

please go through the changes and undo anything that's not necessary
