
Bug in multi-image conversation (Only Supports Single-Image Conversation) #176

Open
BeiningWu opened this issue May 17, 2024 · 1 comment

@BeiningWu

Thanks for your great work!
I followed your tutorial at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-Int8 and found that the model only supports single-image conversation. I am using the Int8 model.

For example, I load three images into the model like this:

pixel_values_0 = load_image("./test_video/clip10/clip1000.png", max_num=6).to(torch.bfloat16).cuda()
pixel_values_1 = load_image("./test_video/clip10/clip1020.png", max_num=6).to(torch.bfloat16).cuda()
pixel_values_2 = load_image("./test_video/clip10/clip1040.png", max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values_0, pixel_values_1, pixel_values_2), dim=0)
question = "how many pictures did you see?"
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

And the model responds: "I saw one picture."
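
For reference, torch.cat simply stacks all the tiles from the three images along dim 0, so nothing in the resulting tensor marks where one image ends and the next begins; a quick shape check (added here only for illustration) makes this visible:

# Each load_image() call returns a stack of tiles (plus a thumbnail) for one image,
# and torch.cat merges them into a single tile stack with no per-image boundary.
print(pixel_values_0.shape)  # e.g. torch.Size([7, 3, 448, 448]) if the split yields 6 tiles + 1 thumbnail
print(pixel_values_1.shape)
print(pixel_values_2.shape)
print(pixel_values.shape)    # all tiles from all three images stacked along dim 0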

Then I tested the official code:

pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)

And I got a response like the following (translated from Chinese): "This picture shows a giant panda, which is China's national treasure. The panda is sitting on the ground, surrounded by green vegetation and bamboo. Its fur is mainly black and white, with very distinctive black eye patches and ears. The panda looks calm, as if it is enjoying its surroundings.

In the background you can see some wooden structures and rocks, which may be part of a zoo or a wildlife sanctuary. Overall, the picture conveys a sense of tranquility and nature, while also showing how this rare animal lives in its natural environment." (i.e., only one of the two images is described).

I don't know how to fix this bug.
Here is my full test code:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
from transformers import AutoTokenizer, AutoModel
import torch
import torchvision.transforms as T
from PIL import Image

from torchvision.transforms.functional import InterpolationMode


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


path = "./models/InternVL-Chat-V1-5-Int8/"
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    load_in_8bit=True).eval()


tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

pixel_values_0 = load_image("./test_video/clip10/clip1000.png", max_num=6).to(torch.bfloat16).cuda()
pixel_values_1 = load_image("./test_video/clip10/clip1020.png", max_num=6).to(torch.bfloat16).cuda()
pixel_values_2 = load_image("./test_video/clip10/clip1040.png", max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values_0, pixel_values_1, pixel_values_2), dim=0)


question = "how many pictures did you see?"
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)

print(question, response)
@czczup
Member

czczup commented May 30, 2024

Hi, this is because during training, the model only encountered single-image samples. The multi-image capability mainly relies on zero-shot generalization, and its performance is unstable. We plan to include interleaved multi-image data for training in the June version, which is expected to improve multi-image dialogue performance.
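
A minimal workaround sketch until then (an assumption on my part, not an official recipe; it reuses only the model.chat call already shown in this issue) is to keep each frame in the single-image setting the model was trained on, i.e. run one conversation per image and compare the answers outside the model:

# Workaround sketch: one single-image conversation per frame, since the model
# was trained on single-image samples only. Paths and question are illustrative.
descriptions = []
for image_path in ["./test_video/clip10/clip1000.png",
                   "./test_video/clip10/clip1020.png",
                   "./test_video/clip10/clip1040.png"]:
    pv = load_image(image_path, max_num=6).to(torch.bfloat16).cuda()
    question = "Describe this picture in detail."
    response, _ = model.chat(tokenizer, pv, question, generation_config,
                             history=None, return_history=True)
    descriptions.append(response)

for i, d in enumerate(descriptions, 1):
    print(f"Image {i}: {d}")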
