
Add FlashInternImage models #2167

Open
wants to merge 13 commits into main

Conversation

@IridescentPig commented May 3, 2024

Paper:

Adapted from official impl at https://github.com/OpenGVLab/DCNv4

Some clarifications:

  • FlashInternImage is the InternImage model that uses DCNv4 as its core operator. Since DCNv4 needs CUDA support, a pure PyTorch implementation of DCNv3 is integrated as a substitute in the FlashInternImage implementation. To use DCNv4, users need to install it via pip install DCNv4 (this might take some time). A warning is raised when a user creates a FlashInternImage model and DCNv4 is not available; a sketch of this fallback appears after this list.
  • The pure PyTorch implementation of DCNv3 uses torch.linspace(), which raises errors when creating an fx model of FlashInternImage. I tried to fix this but failed, so I excluded the FlashInternImage models from the fx-related tests.
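A minimal sketch of that fallback behaviour, not the PR's actual code: the import path of the compiled DCNv4 op and the DCNv4PyTorch fallback name are assumptions.

```python
import warnings

try:
    import DCNv4  # compiled CUDA extension, installed via `pip install DCNv4`
    _has_dcnv4 = True
except ImportError:
    DCNv4 = None
    _has_dcnv4 = False


class DCNv4PyTorch:  # placeholder for the pure-PyTorch substitute module
    def __init__(self, channels, **kwargs):
        ...


def build_core_op(channels, **kwargs):
    """Return the CUDA DCNv4 op if available, else the pure-PyTorch substitute."""
    if _has_dcnv4:
        return DCNv4.DCNv4(channels, **kwargs)  # constructor name is an assumption
    warnings.warn(
        'DCNv4 is not installed; falling back to a slower pure-PyTorch '
        'implementation. Install it with `pip install DCNv4` for full speed.'
    )
    return DCNv4PyTorch(channels, **kwargs)
```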

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fffffgggg54
Contributor

FYI, the InternImage links in your PR, at the top of the implementation, and in the model class all point to the Swin transformer paper.

A few things from looking at the cross attention implementation: the projection bias would be simpler and match other models if you set bias=qkv_bias for each of the linear layers and remap the weights. It should also be possible to use fused attention here, similar to the ViT block.
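A hedged sketch of that suggestion (not the PR's actual code): separate linear projections that simply take bias=qkv_bias, plus fused attention via F.scaled_dot_product_attention as used in timm's ViT blocks.

```python
import torch.nn as nn
import torch.nn.functional as F


class CrossAttention(nn.Module):
    """Illustrative cross attention with plain bias=qkv_bias projections and fused SDPA."""

    def __init__(self, dim, num_heads=8, qkv_bias=True):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # separate projections, each simply taking bias=qkv_bias
        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, context):
        B, Nq, C = query.shape
        Nk = context.shape[1]
        q = self.q(query).reshape(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(context).reshape(B, Nk, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv.unbind(0)
        # fused attention kernel, as in timm's ViT blocks
        x = F.scaled_dot_product_attention(q, k, v)
        x = x.transpose(1, 2).reshape(B, Nq, C)
        return self.proj(x)
```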

Most other hierarchical models are structured as stages of blocks rather than blocks of layers; the change in naming threw me off at first. The InternImage paper also uses the stages-and-blocks naming scheme.

The seq_out forward for detection or segmentation is taken care of by the feature extraction wrappers in timm, although I'm not sure it is applicable here given the way the clip forwards work and the potential for a throughput penalty. I'm not sure about hardcoding this approach into the model forward, but using the feature extraction wrappers on other models usually incurs a ~10% throughput penalty IME.
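For reference, a usage sketch of timm's generic feature extraction wrapper mentioned here; 'resnet50' is just a stand-in model name.

```python
import timm
import torch

# features_only wraps the model to return per-stage feature maps
model = timm.create_model('resnet50', features_only=True, out_indices=(1, 2, 3, 4))
feats = model(torch.randn(1, 3, 224, 224))
for f in feats:
    print(f.shape)  # one feature map per selected stage, at decreasing resolution
```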

@sahilqure

sahilqure commented May 7, 2024

Please merge this. FlashIntern is really good

@IridescentPig
Author

Thanks for the information. The issue with the links was a mistake I made when copying the paper link. As for the implementation-specific issues, I need to double-check the code and try to contact the original authors to determine whether those implementations serve a special purpose.

@IridescentPig
Author

@rwightman @fffffgggg54 Hi, based on the information provided and some of my own thoughts, I have updated the code as follows:

  • Updated the InternImage links; they now point to the correct paper.
  • Optimized the CrossAttention implementation; it now uses a simpler projection bias by setting bias=qkv_bias for each of the linear layers.
  • Switched to the stages-of-blocks naming scheme instead of blocks of layers.
  • The forward_features_seq_out function outputs the feature map of every stage before its downsampling, which is slightly different from the feature extraction wrappers in timm, so I keep it in the current implementation (see the sketch after this list).
  • Reimplemented the DCNv3_pytorch module and renamed it DCNv4_pytorch; it is now almost the same as the CUDA version of DCNv4 except for the forward implementation. The DCN version is now differentiated between CUDA and PyTorch rather than between v3 and v4.
  • Removed some unused implementations, such as those for InternImage-H/G, to keep the code tidy.
  • Fixed some typos and bugs.
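A minimal sketch of the per-stage collection described in the forward_features_seq_out bullet, assuming an illustrative stage layout of blocks followed by an optional downsample (names are hypothetical, not the PR's code):

```python
import torch.nn as nn


class Stage(nn.Module):
    """Illustrative stage: blocks first, optional downsample last."""

    def __init__(self, blocks, downsample=None):
        super().__init__()
        self.blocks = nn.Sequential(*blocks)
        self.downsample = downsample


def forward_features_seq_out(stages, x):
    # collect the feature map of every stage *before* its downsample
    seq_out = []
    for stage in stages:
        x = stage.blocks(x)
        seq_out.append(x)
        if stage.downsample is not None:
            x = stage.downsample(x)
    return seq_out
```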

If you find anything else that needs improvement, please point it out in this PR and I will update the code accordingly.

@rwightman
Collaborator

rwightman commented May 13, 2024

@IridescentPig thanks for the PR and info... I took a brief look and quality is good, haven't had a chance to dig in... been trying to wrap up a few other models/features to get a release out so this will probably fall into next release if everything looks okay.

Just a note on feature extraction though: for models with hierarchical feature maps, we do want the 'deepest feature at each feature map resolution' to be extracted by default. I've often ended up remapping models that don't make this easy (i.e. ones that have a downsample right at the end of a stage/block instead of at the start)...
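As an illustration of that layout preference (illustrative names, not timm internals): downsampling at the start of a stage means the stage output is the deepest feature at its resolution, which is what default feature extraction expects.

```python
import torch.nn as nn


class Stage(nn.Module):
    def __init__(self, in_chs, out_chs, blocks):
        super().__init__()
        # downsample at the *start* of the stage
        self.downsample = nn.Conv2d(in_chs, out_chs, kernel_size=3, stride=2, padding=1)
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        # the stage output is the deepest feature at this resolution
        return self.blocks(self.downsample(x))
```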

I also implemented a new forward_intermediates feature extraction API for models that don't have a straightforward nn.Sequential of stages that works easily with the original helpers. It exists as its own method on the model and can also be used by a new feature extraction wrapper... it's being used for vit models and some others that were difficult to handle with the original wrappers...
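A hedged usage sketch of forward_intermediates as exposed on timm's ViT models around that time; the (final, intermediates) return shape is assumed from that API and may differ for other models.

```python
import timm
import torch

model = timm.create_model('vit_base_patch16_224', pretrained=False)
x = torch.randn(1, 3, 224, 224)
# returns the final features plus a list of intermediate feature maps
final, intermediates = model.forward_intermediates(x)
print(len(intermediates), intermediates[0].shape)
```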

EDIT: will probably look at merging & testing this + #2169 + maybe an initial MobileNetV4 as the next push after the one I'm currently wrapping up.
