
[WIP][wenet/LLM] support LLMs #2460

Open · Mddct wants to merge 43 commits into main from Mddct-llm
Conversation

Mddct (Collaborator) commented Apr 7, 2024

Lay the groundwork for the upcoming SpeechLLM work.

TODO

  • make it work

    • finetune
    • pretrain
    • dataset
      • sft
      • pretrain
    • special tokens and stop tokens redesign
    • generate
  • convert some models

    • qwen
    • Llama 3
      • 8b
      • 8b-it
      • 70b
      • 70b-it
    • gemma
      • 2b
      • 7b
      • 2b-it
      • 7b-it
      • Code 2B && 7B
    • internlm
  • Llama 3 70B uses a model-parallel degree of 8: the attention q/k/v and the feed-forward weights are split column-wise and row-wise, so fairscale has to be introduced for model parallelism. The official release also ships 8 .pt shards, each around 16 GB.

TODO

  • introduce model_parallel and fairscale in a separate PR (a rough sketch of the fairscale column/row split follows below)
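A rough sketch of that column/row split with fairscale's parallel linear layers. This is illustrative only: ParallelAttentionProj is a hypothetical module (not part of this PR), and it assumes torch.distributed plus fairscale's model-parallel groups are already initialized, e.g. via fairscale.nn.model_parallel.initialize.initialize_model_parallel(8).

import torch.nn as nn
from fairscale.nn.model_parallel.layers import (ColumnParallelLinear,
                                                RowParallelLinear)


class ParallelAttentionProj(nn.Module):
    """Hypothetical module showing the col/row split used by the official Llama code."""

    def __init__(self, hidden_size: int, num_heads: int, head_dim: int):
        super().__init__()
        # q (and likewise k/v) is split column-wise: each of the 8 ranks holds 1/8 of the heads.
        self.wq = ColumnParallelLinear(hidden_size, num_heads * head_dim,
                                       bias=False, gather_output=False,
                                       init_method=lambda x: x)
        # the output projection is split row-wise and all-reduces its partial results.
        self.wo = RowParallelLinear(num_heads * head_dim, hidden_size,
                                    bias=False, input_is_parallel=True,
                                    init_method=lambda x: x)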

@Mddct Mddct mentioned this pull request Apr 7, 2024
Mddct (Collaborator, Author) commented Apr 12, 2024

Why not put the embedding and output layer inside DecoderOnly?

Injection of other modalities starts at the embedding, so DecoderOnly keeps the embedding as an input argument.

If the embedding and the output layer share weights, FSDP needs them to be wrapped at the same level.

We frequently extend the vocabulary (resizing the embedding and the output layer); keeping them at the outermost level leaves DecoderOnly untouched.
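A minimal sketch of that layout, with illustrative names rather than the exact wenet API: the embedding and output projection live in the outer causal-LM wrapper, the decoder-only backbone only consumes embeddings, and resizing the vocabulary never touches the backbone.

import torch
import torch.nn as nn


class CausalLM(nn.Module):
    """Illustrative layout only: embed/out stay outside the decoder-only backbone."""

    def __init__(self, vocab_size: int, hidden_size: int, decoder: nn.Module):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.decoder = decoder  # DecoderOnly: consumes embeddings, never token ids
        self.out = nn.Linear(hidden_size, vocab_size, bias=False)
        # shared weights must live at the same FSDP wrapping level, which is
        # straightforward when both sit side by side in the outer module
        self.out.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor, att_mask: torch.Tensor):
        # other modalities are injected at the embedding level, e.g. by
        # concatenating speech embeddings to self.embed(token_ids)
        hidden, _ = self.decoder(self.embed(token_ids), att_mask)
        return self.out(hidden)

    def resize_vocab(self, new_vocab_size: int):
        # extending the vocabulary only rebuilds embed/out; DecoderOnly is untouched
        old = self.embed
        self.embed = nn.Embedding(new_vocab_size, old.embedding_dim)
        self.embed.weight.data[:old.num_embeddings] = old.weight.data
        self.out = nn.Linear(old.embedding_dim, new_vocab_size, bias=False)
        self.out.weight = self.embed.weight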

Mddct changed the title from "[WIP text/LLM] support LLMs" to "[WIP][text/LLM] support LLMs" on Apr 12, 2024
Mddct (Collaborator, Author) commented Apr 14, 2024

Gemma numerical parity test:

# configs = {"decoder": "decoder_only", "output_dim": 256000, "model_conf": {}}

import torch
from wenet.text.LLM.script.convert_gemma_to_wenet_config_and_ckpt import (
    get_config_for_2b, get_config_for_7b)
from wenet.utils.init_model import init_model

from gemma.model import GemmaForCausalLM
from gemma.config import (get_config_for_2b as google_2b_config_fn,
                          get_config_for_7b as google_7b_config_fn)

import argparse


def get_args():
    parser = argparse.ArgumentParser(description='')
    parser.add_argument(
        '--gemma_ckpt',
        required=True,
        help='https://www.kaggle.com/models/google/gemma/frameworks/pyTorch')
    parser.add_argument(
        '--gemma_tokenizer',
        required=True,
        help='https://www.kaggle.com/models/google/gemma/frameworks/pyTorch')

    parser.add_argument(
        '--wenet_gemma_ckpt',
        required=True,
        help='checkpoint produced by convert_gemma_to_wenet_config_and_ckpt.py')
    parser.add_argument('--model_size', type=str, required=True)
    args = parser.parse_args()
    return args


args = get_args()
args.jit = False

layers = 18 if args.model_size == '2b' else 28
if args.model_size == '2b':
    config = get_config_for_2b()
else:
    config = get_config_for_7b()
model_conf = {
    'model': 'causal_lm',
    'output_dim': config.vocab_size,
    'decoder': 'decoder_only',
    'tokenizer_conf': {
        "special_tokens": {
            'sos': 0,
            'eos': 1
        }
    }
}
decoder_conf = {}
decoder_conf['n_kv_head'] = config.num_key_value_heads
decoder_conf['head_dim'] = config.head_dim
decoder_conf['hidden_size'] = config.hidden_size
decoder_conf['attention_heads'] = config.num_attention_heads
decoder_conf['linear_units'] = config.intermediate_size
decoder_conf['num_blocks'] = layers

decoder_conf['max_position_embeding'] = 8192
decoder_conf['activation_type'] = 'gelu'
decoder_conf['gelu_approximate'] = 'tanh'
decoder_conf['norm_eps'] = config.rms_norm_eps
decoder_conf['use_sdpa'] = True
model_conf['decoder_conf'] = decoder_conf
model_conf['model_conf'] = {}

args.checkpoint = args.wenet_gemma_ckpt
model, _ = init_model(args, model_conf)
model.eval()

# get google gemma model
if args.model_size == '2b':
    google_config = google_2b_config_fn()
else:
    google_config = google_7b_config_fn()
google_config.tokenizer = args.gemma_tokenizer

google_gemma = GemmaForCausalLM(google_config)
google_gemma.load_weights(args.gemma_ckpt)
google_gemma.eval()
scale = google_config.hidden_size

batch_size = torch.randint(2, 10, ())
seq_len = torch.randint(3, 20, ())
text = torch.randint(0, config.vocab_size, (batch_size, seq_len))


def google_forward(google_gemma,
                   batch_size,
                   token_ids,
                   seq_len,
                   scale,
                   layers=18):

    google_freqs_cis = google_gemma.freqs_cis
    google_emb = google_gemma.embedder
    google_gemma = google_gemma.model

    input_positions_tensor = torch.arange(0, seq_len)
    google_freqs_cis = google_freqs_cis.index_select(0, input_positions_tensor)
    google_hidden_states = google_emb(token_ids)
    google_hidden_states = google_hidden_states * (scale**0.5)
    # mask_tensor = torch.full((2, 1, 10, 10), -2.3819763e38).to(torch.float)
    mask_tensor = torch.full((batch_size, 1, seq_len, seq_len),
                             0).to(torch.float)
    kv_caches = []
    for _ in range(layers):
        size = (batch_size, seq_len, google_config.num_key_value_heads,
                google_config.head_dim)
        k_cache = torch.zeros(size=size)
        v_cache = torch.zeros(size=size)
        kv_caches.append((k_cache, v_cache))
    google_output = google_gemma(
        google_hidden_states,
        google_freqs_cis,
        input_positions_tensor,
        kv_caches,
        mask_tensor,
    )
    google_output = torch.matmul(google_output, google_emb.weight.T)
    return google_output


def wenet_forward(wenet_model, batch_size, token_ids, seq_len, layers=18):
    hidden_states = wenet_model.embed(token_ids)
    wenet_kv_caches = []
    for _ in range(layers):
        size = (0, 0, 0, 0)
        k_cache = torch.zeros(size=size)
        v_cache = torch.zeros(size=size)
        wenet_kv_caches.append((k_cache, v_cache))

    # full (non-causal) attention mask, matching the all-zero additive mask
    # used on the google side above
    att_mask_tensor = torch.ones(batch_size,
                                 seq_len,
                                 seq_len,
                                 dtype=torch.bool)
    wenet_output, _ = wenet_model.decoder(hidden_states,
                                          att_mask_tensor,
                                          kv_caches=wenet_kv_caches)

    wenet_output = wenet_model.out(wenet_output)
    return wenet_output


wenet_output = wenet_forward(model, batch_size, text, seq_len, layers)
google_output = google_forward(google_gemma, batch_size, text, seq_len, scale,
                               layers)

print(wenet_output)
print(google_output)
assert torch.allclose(wenet_output, google_output)

Mddct force-pushed the Mddct-llm branch 4 times, most recently from 1ea0839 to 64ff835 on April 25, 2024
Mddct (Collaborator, Author) commented Apr 26, 2024

SFT:
2B Gemma with FSDP ZeRO-3
(screenshot of the SFT training run)
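For reference, a ZeRO-3-style setup with PyTorch FSDP roughly looks like the sketch below. This is not wenet's actual training entry point; block_cls stands in for whatever decoder-layer class gets auto-wrapped.

import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def wrap_zero3(model: torch.nn.Module, block_cls: type) -> FSDP:
    """Wrap each decoder block; FULL_SHARD shards params, grads and optimizer state (ZeRO-3)."""
    policy = functools.partial(transformer_auto_wrap_policy,
                               transformer_layer_cls={block_cls})
    return FSDP(model,
                sharding_strategy=ShardingStrategy.FULL_SHARD,
                auto_wrap_policy=policy,
                mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
                device_id=torch.cuda.current_device())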

Mddct force-pushed the Mddct-llm branch 2 times, most recently from 31a8dcd to 782998d on April 27, 2024
Mddct (Collaborator, Author) commented Apr 28, 2024

Batched generation:

gemma:
(screenshot of batched generation output)

llama:
(screenshot of batched generation output)
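The batched path boils down to a loop like the following minimal greedy sketch, assuming the embed/decoder/out layout used in the parity test above; the actual generate method in this PR additionally handles sampling, ragged prompt padding, and kv caches.

import torch


@torch.no_grad()
def greedy_generate(model, token_ids: torch.Tensor, eos: int, max_new: int = 64):
    # token_ids: (batch, prompt_len); finished tracks sequences that already hit <eos>
    finished = torch.zeros(token_ids.size(0), dtype=torch.bool)
    for _ in range(max_new):
        bsz, seq = token_ids.shape
        # recompute over the full sequence each step (no kv cache in this sketch)
        att_mask = torch.ones(bsz, seq, seq, dtype=torch.bool)
        hidden, _ = model.decoder(model.embed(token_ids), att_mask)
        next_token = model.out(hidden[:, -1]).argmax(dim=-1)
        # once a sequence is finished, keep emitting eos so the batch stays rectangular
        next_token = torch.where(finished, torch.full_like(next_token, eos), next_token)
        token_ids = torch.cat([token_ids, next_token.unsqueeze(1)], dim=1)
        finished |= next_token.eq(eos)
        if finished.all():
            break
    return token_ids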

Mddct (Collaborator, Author) commented Apr 30, 2024

An explanation of why the shape is changed to [bs, seq_len, head, head_dim] here:

https://github.com/wenet-e2e/wenet/blob/9805ed68638f711b6fda17627efb7aa918ce6870/wenet/transformer/attention.py#L637-#L651

Explanation from GPT-4:
(screenshot of the GPT-4 explanation)

Measured: with [bs, seq_len, head, head_dim], applying the positional embedding (and similar ops) along head_dim is slower than with [bs, head, seq_len, head_dim]:

6s vs 2s at sequence length 300.

ref: https://github.com/google/gemma_pytorch/blob/main/gemma/model.py#L256

So do the other xxx attention variants need a corresponding change as well? (A rough benchmark sketch follows.)
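A rough, self-contained micro-benchmark of the two layouts; the rotation below is a generic RoPE-style elementwise op rather than the wenet implementation, and the numbers will differ from the 6s/2s quoted above, but the layout effect is the point.

import time

import torch

bs, head, seq_len, head_dim = 8, 16, 300, 128
cos = torch.rand(seq_len, head_dim)
sin = torch.rand(seq_len, head_dim)


def rotate(x, cos, sin):
    # generic rotary-style mix along the last (head_dim) axis
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin


# layout A: (bs, head, seq_len, head_dim) — cos/sin broadcast over contiguous rows
xa = torch.rand(bs, head, seq_len, head_dim)
t0 = time.time()
for _ in range(100):
    rotate(xa, cos, sin)
ta = time.time() - t0

# layout B: (bs, seq_len, head, head_dim) — cos/sin must be unsqueezed over the head dim
xb = torch.rand(bs, seq_len, head, head_dim)
cos_b, sin_b = cos.unsqueeze(1), sin.unsqueeze(1)
t0 = time.time()
for _ in range(100):
    rotate(xb, cos_b, sin_b)
tb = time.time() - t0

print(f'(bs, head, seq, dim): {ta:.3f}s   (bs, seq, head, dim): {tb:.3f}s')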

fclearner (Contributor) commented Apr 30, 2024

Zhou, PyTorch also has official llama fine-tuning code: https://github.com/pytorch/torchtune

Mddct (Collaborator, Author) commented Apr 30, 2024

> Zhou, PyTorch also has official llama fine-tuning code: https://github.com/pytorch/torchtune

Yes, I have looked at it. But our end goal is not the LLM itself; it is large models for speech understanding and speech synthesis.

Also, LLM training has its own design principles and tricks, and we need to bring the good components over.

Mddct (Collaborator, Author) commented May 31, 2024

This PR will be split into several follow-up PRs.
