
A question about the resizing operation #310

Open
Richar-Du opened this issue Nov 22, 2023 · 1 comment
@Richar-Du

Thanks for your awesome work on Otter-HD! I have a question about the resizing operation in Fuyu. Since fuyu-8b claims the model can accept images of any resolution as input, why does the code first resize all images to [1080, 1920]? Why not just keep the original size?

One possible explanation is keeping the input length the same, but adding pad tokens after the text tokens would also accomplish that. Could you help answer this question? Thanks in advance :)

@gpantaz

gpantaz commented Feb 7, 2024

Hi, I was wondering the same thing. After some digging into the FuyuImageProcessor, I found this line in the resize method:

https://github.com/huggingface/transformers/blob/1c31b7aa3bb4e7ef24c77596d2a76f45a770159f/src/transformers/models/fuyu/image_processing_fuyu.py#L299-L303C25

Correct me if I am wrong, but I believe the image is not resized unless it is larger than 1080x1920; instead, it is padded to those dimensions. So an image of size (w, h) ends up at 1080x1920, with padding values filling the extra space in each dimension. Then, to handle variable-sized images, the processor finds the smallest integer w1 (or h1) that is greater than or equal to w (or h) and divisible by the patch size (30):
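For reference, the resize-vs-pad branch can be sketched roughly like this (a minimal Python sketch of the behavior described above, not the actual HuggingFace implementation; the 1080x1920 target matches the dimensions discussed here):

```python
def resize_if_needed(height, width, max_height=1080, max_width=1920):
    """Keep the original size unless the image exceeds the target canvas.

    Images already within 1080x1920 are left untouched (they get padded
    to the canvas size later); larger images are scaled down so that
    both dimensions fit, preserving the aspect ratio.
    """
    if height <= max_height and width <= max_width:
        return height, width
    scale = min(max_height / height, max_width / width)
    return int(height * scale), int(width * scale)
```

So a 500x600 image keeps its size (and is padded afterwards), while a 4K image is scaled down to fit the canvas.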

https://github.com/huggingface/transformers/blob/1c31b7aa3bb4e7ef24c77596d2a76f45a770159f/src/transformers/models/fuyu/image_processing_fuyu.py#L632C1-L644C65

Since the image has been padded to 1080x1920, it is safe to take the first w1 columns and h1 rows from the padded tensor.
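The rounding-up step can be illustrated as follows (again a sketch under my reading of the linked code; the patch size of 30 comes from Fuyu, but the helper name is hypothetical):

```python
import math

PATCH_SIZE = 30  # Fuyu's patch size, per the linked image processor code

def round_up_to_patch(x, patch=PATCH_SIZE):
    # Smallest multiple of `patch` that is >= x.
    return math.ceil(x / patch) * patch

# A 500x600 image sits in the top-left of the padded 1080x1920 canvas:
h1 = round_up_to_patch(500)  # 510 rows
w1 = round_up_to_patch(600)  # 600 columns
# Cropping padded[:h1, :w1] is always valid, since h1 <= 1080 and w1 <= 1920.
```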

However, doesn't this mean that on any benchmark where no image exceeds 1080x1920, the model's performance should be exactly the same in the fixed 1080x1920 setting and the "variable size" setting? 🤷
