stream_infer input_embeddings #889

Answered by irexyc
JidongZhang-THU asked this question in Q&A

For some models, such as qwen-vl or internlm-xcomposer, the decoding process is the same as for a normal LLM. The only difference is the embedding layer. A normal LLM uses the embedding layer to encode token_ids into input_embs. These multimodal models concatenate the image features with input_embs to form the final input.

To keep the code simple, we add dummy ids to token_ids, and after the embedding layer we replace the dummy embeddings with the real image features.
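The dummy-id approach can be illustrated with a small library-free sketch. This is not the project's actual code or API: `DUMMY_ID`, `HIDDEN`, `embed`, `merge_image_features`, and the `(begin, end)` range format are all hypothetical names chosen for illustration, and plain Python lists stand in for tensors.

```python
# Hypothetical sketch: names, sizes, and the dummy id are illustrative only.
DUMMY_ID = 0   # assumed placeholder token id marking image positions
HIDDEN = 4     # assumed embedding width


def embed(token_ids, table):
    """Stand-in for the embedding layer: one embedding row per token id."""
    return [list(table[t]) for t in token_ids]


def merge_image_features(token_ids, table, image_features, ranges):
    """Embed token_ids, then overwrite each [begin, end) span with image features."""
    embs = embed(token_ids, table)
    for feats, (begin, end) in zip(image_features, ranges):
        if end - begin != len(feats):
            raise ValueError("range length must match the number of feature rows")
        # Replace the dummy embeddings with the real image features.
        embs[begin:end] = [list(row) for row in feats]
    return embs


# Toy embedding table: row i is [i, i, i, i].
table = [[float(i)] * HIDDEN for i in range(10)]

# Prompt layout: two text tokens, three dummy ids for one image, one text token.
token_ids = [5, 7, DUMMY_ID, DUMMY_ID, DUMMY_ID, 9]
image_feats = [[[0.1] * HIDDEN, [0.2] * HIDDEN, [0.3] * HIDDEN]]

input_embs = merge_image_features(token_ids, table, image_feats, [(2, 5)])
```

Because the dummy ids already reserve the right number of positions, the sequence length never changes and the decoding path downstream of the embedding layer can stay identical to that of a text-only model.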

A web demo is available in #874.

Answer selected by JidongZhang-THU