Fix a bug in function "prepare_inputs_labels_for_multimodal" of "LlavaMetaForCausalLM" when there is more than one image in each conversation of a batch. #967
Dear author:
Thanks for your excellent work on LLaVA.
While reading your code, I found that this line has a conditional statement that, I suppose, handles the case where a conversation in the batch has more than one image. In that case, "images" is either a tensor of shape [N, M, 3, H, W], where N is the batch size and M is the number of images per sentence, or a list of N tensors of shape [m_n, 3, H, W], where m_n is the number of images in the n-th sentence and can differ across sentences in the batch.
But in this if block, the resulting "image_features" may have the wrong shape when at least one sentence contains more than one image, which raises an "IndexError" exception in this line of the file.
I annotated the shape of the result of each line in this if block for better understanding:
I don't understand why flatten(0, 1) is used to change the shape of x: it concatenates the features of multiple images into only ONE feature. As a result, image_features contains only N image features, when there should be N*M.
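To make the shape argument concrete, here is a minimal sketch that tracks only tensor shapes as plain tuples, so it runs without PyTorch. It assumes the if block encodes all images together, splits the result per conversation, and then calls flatten(0, 1) on each piece. The helper names and the example sizes (N=2, M=3, P=576 patches, hidden size D=4096) are hypothetical and chosen only for illustration.

```python
def split_shapes(encoded_shape, split_sizes):
    """Shapes that torch.split(t, split_sizes, dim=0) would produce."""
    total, *rest = encoded_shape
    assert sum(split_sizes) == total
    return [(m, *rest) for m in split_sizes]


def flatten01(shape):
    """Shape that tensor.flatten(0, 1) would produce: first two dims merged."""
    return (shape[0] * shape[1], *shape[2:])


# Hypothetical sizes: N conversations, M images each, P patches, D hidden dims.
N, M, P, D = 2, 3, 576, 4096

encoded = (N * M, P, D)                      # all images encoded together
per_conv = split_shapes(encoded, [M] * N)    # N pieces of shape (M, P, D)

# With flatten(0, 1): one merged entry per conversation.
merged = [flatten01(s) for s in per_conv]
print(len(merged), merged[0])                # 2 (1728, 4096)

# What a per-image consumer would need: one (P, D) entry per image.
per_image = [(P, D) for s in per_conv for _ in range(s[0])]
print(len(per_image), per_image[0])          # 6 (576, 4096)
```

The merged list has N entries while the per-image list has N*M, which is the mismatch described above.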
I wrote a simple script to reproduce the bug as follows:
This will raise the "IndexError" exception in this line of the file. You can modify "input_text" and "image_files" to make "image_files" a list of N tensors of shape [m_n, 3, H, W], for the case where the number of images m_n differs between input conversations (or sentences), and the exception is still raised.
So I modified the code in this if block. If I have made a mistake or misunderstood this code, please don't hesitate to correct me.
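The failure mode for differing image counts can be sketched without the model. This is only an illustration, assuming the downstream code takes one entry of the feature list per image. The names "split_sizes", "merged_features" and "take_one_feature_per_image" are hypothetical and are not from the LLaVA source.

```python
# Hypothetical m_n: the first conversation has 1 image, the second has 3.
split_sizes = [1, 3]

# After per-conversation merging there is one entry per conversation,
# not one per image. Strings stand in for the feature tensors.
merged_features = [f"conv{i}_merged" for i in range(len(split_sizes))]


def take_one_feature_per_image(features, num_images):
    """Consume one list entry per image, as a per-image index counter would."""
    return [features[idx] for idx in range(num_images)]


try:
    # 4 images in total, but only 2 entries are available.
    take_one_feature_per_image(merged_features, sum(split_sizes))
    raised = False
except IndexError:
    raised = True

print(raised)  # True
```

With one entry per image instead, the list would have sum(split_sizes) entries and the same loop would complete.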
Thank you.