
[FEATURE] Add Hiera #2083

Closed
raulcarlomagno opened this issue Jan 22, 2024 · 9 comments
Labels: enhancement (New feature or request)

Comments


raulcarlomagno commented Jan 22, 2024

Add a vision model from Meta

"Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles"
https://github.com/facebookresearch/hiera/tree/main


raulcarlomagno added the enhancement (New feature or request) label Jan 22, 2024
@rwightman (Collaborator)

@raulcarlomagno I like this model quite a bit, neat ideas, but they've marked both the code and weights as non-commercial. I can deal with the weights, since I treat them with separate licenses on the HF hub, but I cannot bring NC code into timm...

Given that, it takes more effort to do a clean-room impl from first principles, and I have a lot of things in progress right now. Or you could bug them to drop the NC license on the code and keep it just for the weights...

@chayryali

@rwightman @raulcarlomagno Hi, we've made the license for Hiera code Apache 2.0. (We cannot do anything about the model licenses unfortunately.) Would love to support integration into timm!

@rwightman (Collaborator)

@chayryali that's great! I think it shouldn't be too hard to get it in, the style is pretty much in line with timm already ... just a number of timm-specific additions for the model builder, and some extra functionality. I'll have to take another look at the impl ...

The weight license will be handled with the appropriate license on the HF model hub, plus a comment / tag in the implementation where the pretrained weight links are.

@rwightman (Collaborator)

@chayryali so, I've been juggling a few things lately, but I do have this model working locally in timm.

I've been trying to add support for changing resolution though, either at init (a different input / img size passed to the model) or on the fly in forward.

As soon as the resolution is changed, the model accuracy drops off a cliff. I haven't had this issue resizing vanilla ViTs and related models, or any of the windowed variants like Swin, MaxViT, etc ...
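
For reference, a minimal sketch of the plain absolute position embedding resize that works for vanilla ViTs and is implied here; the helper and its signature are illustrative, not timm's actual API:

    import torch
    import torch.nn.functional as F

    def resize_abs_pos_embed(pos_embed: torch.Tensor, old_hw, new_hw):
        # pos_embed: (1, H*W, C) learned absolute position embedding (no class token)
        _, n, c = pos_embed.shape
        pe = pos_embed.reshape(1, old_hw[0], old_hw[1], c).permute(0, 3, 1, 2)
        # bicubic-resize the embedding grid to the new token grid
        pe = F.interpolate(pe, size=new_hw, mode='bicubic', align_corners=False)
        return pe.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], c)

This kind of plain resize is exactly what breaks down for Hiera, as discussed below.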

@rwightman (Collaborator)

If I hold the patch stride vs img size ratio constant it appears to work, but that constrains the possibilities significantly...

@chayryali

@rwightman Great to hear it's working locally!

Regarding changing the resolution, it turns out (see the paper) that the drop in performance is due to the interaction between window attention and absolute positional encoding. It also affects ViT (though typically in detection settings, e.g. ViTDet, where it's more common to use window attention).

The fix is really simple: we make the abs position embeddings "window-aware" by maintaining two position embeddings, a window embedding (e.g. 8x8) and a global embedding (e.g. 7x7). The global embedding is interpolated to 56x56 (for 224x224 res), the window embedding is tiled to 56x56, and the two are added together to form the final position encoding. We are actually about to release the corresponding "absolute win" image and video models soon.

[Figure: "absolute win" window-aware position embedding scheme]
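
A rough sketch of the combination described above, assuming one learned per-window embedding and one learned coarse global embedding; the helper name and the example sizes are illustrative, not the released Hiera code:

    import torch
    import torch.nn.functional as F

    def window_aware_pos_embed(global_pe: torch.Tensor, window_pe: torch.Tensor, feat_hw):
        # global_pe: (1, C, 7, 7) coarse global embedding
        # window_pe: (1, C, 8, 8) per-window embedding (mask-unit / window size)
        h, w = feat_hw  # e.g. (56, 56) token grid for 224x224 input
        g = F.interpolate(global_pe, size=(h, w), mode='bicubic', align_corners=False)
        wh, ww = window_pe.shape[-2:]
        # tile the window embedding across the grid (assumes h, w divisible by the window size)
        t = window_pe.repeat(1, 1, h // wh, w // ww)
        return g + t  # final position encoding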

@rwightman (Collaborator)

@chayryali nice, I hadn't seen that paper, will have a read. I was working through an idea to add different RoPE pos embeddings to the windowed and global stages to see if that'd work, but this appears simpler :)

Also, did a quick ablation while fiddling: instead of projecting for the residual shortcut, since it's a 2x expansion by default, concatenating avg + max pool seems to provide similar, if not slightly faster, learning progress comparing initial steps on a supervised learning task. Might have a different outcome for MAE pretraining though ...

if self.do_expand:
    if self.proj is not None:
        # original shortcut: project, then max-pool over the q_stride tokens
        x = self.proj(x_norm)
        x = x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1]).amax(dim=1)  # max-pool
    else:
        # ablation: skip the projection, concat max-pool + avg-pool for the 2x channel expansion
        x = torch.cat([
            x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1]).amax(dim=1),  # max-pool
            x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1]).mean(dim=1),  # avg-pool
        ], dim=-1)

@rwightman (Collaborator)

could make that view less redundant there, but just fiddling :)
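
For example, the de-duplicated version could look roughly like this (same behavior, the strided view computed once):

    # compute the strided view once, then pool it twice
    pooled = x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1])
    x = torch.cat([pooled.amax(dim=1), pooled.mean(dim=1)], dim=-1)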

@rwightman (Collaborator)

@chayryali read the paper, makes sense. Are the updated code / models coming anytime soon?

In the comparison tables you have numbers for fine-tuning at higher res. Definitely want to see those increases, but even just validating the same model at a higher res is a useful check: if everything is working well, you should see the same or improved val numbers (from the train-test resolution discrepancy) when you increase the res up to roughly 20-30% above the original, before it drops off (at which point fine-tuning is needed). That appears to hold for most other ViT / ViT-hybrid archs pretrained with heavy augmentation. MAE might have a different impact there.
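
For the kind of quick check described above, something like the following works with timm models that accept an img_size override (shown here with a ViT; whether the Hiera port exposes the same argument is an assumption):

    import timm

    # same pretrained weights, evaluated at the native res and a ~14% higher res
    model_224 = timm.create_model('vit_base_patch16_224', pretrained=True)
    model_256 = timm.create_model('vit_base_patch16_224', pretrained=True, img_size=256)
    # run the usual eval loop on both; with heavy-aug pretraining the 256 eval
    # should match or slightly beat the 224 number before accuracy falls off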
