
[FEATURE] Add Hiera #2083

Closed
raulcarlomagno opened this issue Jan 22, 2024 · 9 comments
Labels: enhancement (New feature or request)

Comments


raulcarlomagno commented Jan 22, 2024

Add a vision model from Meta

"Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles"
https://github.com/facebookresearch/hiera/tree/main


raulcarlomagno added the enhancement (New feature or request) label Jan 22, 2024
@rwightman (Collaborator)

@raulcarlomagno I like this model quite a bit, neat ideas, but they've marked both the code and weights as non-commercial. I can deal with the weights, since I treat them with separate licenses on the HF hub, but I cannot bring NC code into timm...

Given that, it takes more effort to do a clean-room impl from first principles, and I have a lot of things in progress right now. Or you could bug them to drop the NC license on the code and keep it just for the weights...

@chayryali

@rwightman @raulcarlomagno Hi, we've made the license for Hiera code Apache 2.0. (We cannot do anything about the model licenses unfortunately.) Would love to support integration into timm!

@rwightman (Collaborator)

@chayryali that's great! I think it shouldn't be too hard to get it in, the style is pretty much in line with timm already ... just a number of timm-specific additions for the model builder, and some extra functionality. I'll have to take another look at the impl ...

The weight license will be handled with the appropriate license on the HF model hub, plus a comment / tag in the implementation where the pretrained weight links are.

@rwightman (Collaborator)

@chayryali so, I've been juggling a few things lately, but I do have this model working locally in timm.

I've been trying to add support for changing resolution though, either at init (a different input / img size passed to the model) or on the fly in forward.

As soon as the resolution is changed, the model accuracy drops off a cliff. I haven't had this issue resizing vanilla ViTs and related models, or any of the windowed variants like Swin, MaxViT, etc ...
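
For reference, a minimal sketch of the plain absolute position embedding resize that works for vanilla ViTs and is implied here; the helper and its signature are illustrative, not timm's actual API:

    import torch
    import torch.nn.functional as F

    def resize_abs_pos_embed(pos_embed: torch.Tensor, old_hw, new_hw):
        # pos_embed: (1, H*W, C) learned absolute position embedding (no class token)
        _, n, c = pos_embed.shape
        pe = pos_embed.reshape(1, old_hw[0], old_hw[1], c).permute(0, 3, 1, 2)
        # bicubic-resize the embedding grid to the new token grid
        pe = F.interpolate(pe, size=new_hw, mode='bicubic', align_corners=False)
        return pe.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], c)

This kind of plain resize is exactly what breaks down for Hiera, as discussed below.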

@rwightman (Collaborator)

If I hold the patch stride vs img size ratio constant it appears to work, but that constrains the possibilities significantly...

@chayryali

@rwightman Great to hear it's working locally!

Regarding changing the resolution, it turns out (see the paper) that the drop in performance is due to the interaction between window attention and absolute positional encoding. It also affects ViT (though typically in detection settings, e.g. ViTDet, where it's more common to use window attention).

The fix is really simple: we make the abs position embeddings "window-aware" by maintaining two position embeddings, a window embedding (e.g. 8x8) and a global embedding (e.g. 7x7). The global embedding is interpolated to 56x56 (for 224x224 res), the window embedding is tiled to 56x56, and the two are added together to form the final position encoding. We are actually about to release the corresponding "absolute win" image and video models soon.

[Figure: "absolute win" window-aware position embedding scheme]
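
A rough sketch of the combination described above, assuming one learned per-window embedding and one learned coarse global embedding; the helper name and the example sizes are illustrative, not the released Hiera code:

    import torch
    import torch.nn.functional as F

    def window_aware_pos_embed(global_pe: torch.Tensor, window_pe: torch.Tensor, feat_hw):
        # global_pe: (1, C, 7, 7) coarse global embedding
        # window_pe: (1, C, 8, 8) per-window embedding (mask-unit / window size)
        h, w = feat_hw  # e.g. (56, 56) token grid for 224x224 input
        g = F.interpolate(global_pe, size=(h, w), mode='bicubic', align_corners=False)
        wh, ww = window_pe.shape[-2:]
        # tile the window embedding across the grid (assumes h, w divisible by the window size)
        t = window_pe.repeat(1, 1, h // wh, w // ww)
        return g + t  # final position encoding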

@rwightman (Collaborator)

@chayryali nice, I hadn't seen that paper, will have a read. I was working through an idea to add different RoPE pos embeddings to the windowed and global stages to see if that'd work, but this appears simpler :)

Also, did a quick ablation while fiddling: instead of projecting for the residual shortcut, since it's a 2x expansion by default, concatenating avg + max pool seems to provide similar, if not slightly faster, learning progress comparing initial steps on a supervised learning task. Might have a different outcome for MAE pretraining though ...

if self.do_expand:
    if self.proj is not None:
        # original shortcut: project, then max-pool over the q_stride tokens
        x = self.proj(x_norm)
        x = x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1]).amax(dim=1)  # max-pool
    else:
        # ablation: skip the projection, concat max-pool + avg-pool for the 2x channel expansion
        x = torch.cat([
            x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1]).amax(dim=1),  # max-pool
            x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1]).mean(dim=1),  # avg-pool
        ], dim=-1)

@rwightman (Collaborator)

could make that view less redundant there, but just fiddling :)
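
For example, the de-duplicated version could look roughly like this (same behavior, the strided view computed once):

    # compute the strided view once, then pool it twice
    pooled = x.view(x.shape[0], self.attn.q_stride, -1, x.shape[-1])
    x = torch.cat([pooled.amax(dim=1), pooled.mean(dim=1)], dim=-1)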

@rwightman (Collaborator)

@chayryali read the paper, makes sense. Are the updated code / models coming anytime soon?

In the comparison tables you have numbers for fine-tuning at higher res. Definitely want to see those increases, but even just validating the same model at a higher res is a useful check: if everything is working well, you should see the same or improved val numbers (from the train-test resolution discrepancy) when you increase the res up to roughly 20-30% above the original, before it drops off (at which point fine-tuning is needed). That appears to hold for most other ViT / ViT-hybrid archs pretrained with heavy augmentation. MAE might have a different impact there.
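
For the kind of quick check described above, something like the following works with timm models that accept an img_size override (shown here with a ViT; whether the Hiera port exposes the same argument is an assumption):

    import timm

    # same pretrained weights, evaluated at the native res and a ~14% higher res
    model_224 = timm.create_model('vit_base_patch16_224', pretrained=True)
    model_256 = timm.create_model('vit_base_patch16_224', pretrained=True, img_size=256)
    # run the usual eval loop on both; with heavy-aug pretraining the 256 eval
    # should match or slightly beat the 224 number before accuracy falls off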
