
Add FlashInternImage models #2167

Open
wants to merge 13 commits into main

Conversation

@IridescentPig commented May 3, 2024

Paper:

Adapted from official impl at https://github.com/OpenGVLab/DCNv4

Some clarifications:

  • FlashInternImage is the InternImage model that uses DCNv4 as its core operator. Since DCNv4 needs CUDA support, a pure PyTorch implementation of DCNv3 is integrated as a substitute in the FlashInternImage implementation. To use DCNv4, users need to install it via pip install DCNv4 (this might take some time). A warning is raised when a user creates a FlashInternImage model and DCNv4 is not available; a sketch of this fallback appears after this list.
  • The pure PyTorch implementation of DCNv3 uses torch.linspace(), which raises errors when creating an fx model of FlashInternImage. I tried to fix this but failed, so I excluded the FlashInternImage models from the fx-related tests.
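A minimal sketch of that fallback behaviour, not the PR's actual code: the import path of the compiled DCNv4 op and the DCNv4PyTorch fallback name are assumptions.

```python
import warnings

try:
    import DCNv4  # compiled CUDA extension, installed via `pip install DCNv4`
    _has_dcnv4 = True
except ImportError:
    DCNv4 = None
    _has_dcnv4 = False


class DCNv4PyTorch:  # placeholder for the pure-PyTorch substitute module
    def __init__(self, channels, **kwargs):
        ...


def build_core_op(channels, **kwargs):
    """Return the CUDA DCNv4 op if available, else the pure-PyTorch substitute."""
    if _has_dcnv4:
        return DCNv4.DCNv4(channels, **kwargs)  # constructor name is an assumption
    warnings.warn(
        'DCNv4 is not installed; falling back to a slower pure-PyTorch '
        'implementation. Install it with `pip install DCNv4` for full speed.'
    )
    return DCNv4PyTorch(channels, **kwargs)
```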

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fffffgggg54
Contributor

FYI, the InternImage links in your PR, at the top of the implementation, and in the model class all point to the Swin transformer paper.

A few things from looking at the cross attention implementation: the projection bias would be simpler and match other models if you set bias=qkv_bias for each of the linear layers and remap the weights. It should also be possible to use fused attention here, similar to the ViT block.
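A hedged sketch of that suggestion (not the PR's actual code): separate linear projections that simply take bias=qkv_bias, plus fused attention via F.scaled_dot_product_attention as used in timm's ViT blocks.

```python
import torch.nn as nn
import torch.nn.functional as F


class CrossAttention(nn.Module):
    """Illustrative cross attention with plain bias=qkv_bias projections and fused SDPA."""

    def __init__(self, dim, num_heads=8, qkv_bias=True):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # separate projections, each simply taking bias=qkv_bias
        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, context):
        B, Nq, C = query.shape
        Nk = context.shape[1]
        q = self.q(query).reshape(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(context).reshape(B, Nk, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv.unbind(0)
        # fused attention kernel, as in timm's ViT blocks
        x = F.scaled_dot_product_attention(q, k, v)
        x = x.transpose(1, 2).reshape(B, Nq, C)
        return self.proj(x)
```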

Most other hierarchical models are structured as stages of blocks rather than blocks of layers; the change in naming threw me off at first. The InternImage paper also uses the stages-and-blocks naming scheme.

The seq_out forward for detection or segmentation is taken care of by the feature extraction wrappers in timm, although I'm not sure it is applicable here given the way the clip forwards work and the potential for a throughput penalty. I'm not sure about hardcoding this approach into the model forward, but using the feature extraction wrappers on other models usually incurs a ~10% throughput penalty IME.
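For reference, a usage sketch of timm's generic feature extraction wrapper mentioned here; 'resnet50' is just a stand-in model name.

```python
import timm
import torch

# features_only wraps the model to return per-stage feature maps
model = timm.create_model('resnet50', features_only=True, out_indices=(1, 2, 3, 4))
feats = model(torch.randn(1, 3, 224, 224))
for f in feats:
    print(f.shape)  # one feature map per selected stage, at decreasing resolution
```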

@sahilqure

sahilqure commented May 7, 2024

Please merge this. FlashIntern is really good

@IridescentPig
Author

Thanks for the information. The issue with the links was a mistake I made when copying the paper link. As for the implementation-specific issues, I need to double-check the code and try to contact the original authors to determine whether those implementations serve a special purpose.

@IridescentPig
Author

@rwightman @fffffgggg54 Hi, based on the information provided and some of my own thoughts, I have updated the code as follows:

  • Updated the InternImage links; they now point to the correct paper.
  • Optimized the CrossAttention implementation; it now uses a simpler projection bias by setting bias=qkv_bias for each of the linear layers.
  • Switched to the stages-of-blocks naming scheme instead of blocks of layers.
  • The forward_features_seq_out function outputs the feature map of every stage before its downsampling, which is slightly different from the feature extraction wrappers in timm, so I keep it in the current implementation (see the sketch after this list).
  • Reimplemented the DCNv3_pytorch module and renamed it DCNv4_pytorch; it is now almost the same as the CUDA version of DCNv4 except for the forward implementation. The DCN version is now differentiated between CUDA and PyTorch rather than between v3 and v4.
  • Removed some unused implementations, such as those for InternImage-H/G, to keep the code tidy.
  • Fixed some typos and bugs.
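A minimal sketch of the per-stage collection described in the forward_features_seq_out bullet, assuming an illustrative stage layout of blocks followed by an optional downsample (names are hypothetical, not the PR's code):

```python
import torch.nn as nn


class Stage(nn.Module):
    """Illustrative stage: blocks first, optional downsample last."""

    def __init__(self, blocks, downsample=None):
        super().__init__()
        self.blocks = nn.Sequential(*blocks)
        self.downsample = downsample


def forward_features_seq_out(stages, x):
    # collect the feature map of every stage *before* its downsample
    seq_out = []
    for stage in stages:
        x = stage.blocks(x)
        seq_out.append(x)
        if stage.downsample is not None:
            x = stage.downsample(x)
    return seq_out
```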

If you find anything else that needs improvement, please point it out in this PR and I will update the code accordingly.

@rwightman
Collaborator

rwightman commented May 13, 2024

@IridescentPig thanks for the PR and info... I took a brief look and quality is good, haven't had a chance to dig in... been trying to wrap up a few other models/features to get a release out so this will probably fall into next release if everything looks okay.

Just a note on feature extraction though: for models with hierarchical feature maps, we do want the 'deepest feature at each feature map resolution' to be extracted by default. I've often ended up remapping models that don't make this easy (i.e. ones that have a downsample right at the end of a stage/block instead of at the start)...
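As an illustration of that layout preference (illustrative names, not timm internals): downsampling at the start of a stage means the stage output is the deepest feature at its resolution, which is what default feature extraction expects.

```python
import torch.nn as nn


class Stage(nn.Module):
    def __init__(self, in_chs, out_chs, blocks):
        super().__init__()
        # downsample at the *start* of the stage
        self.downsample = nn.Conv2d(in_chs, out_chs, kernel_size=3, stride=2, padding=1)
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        # the stage output is the deepest feature at this resolution
        return self.blocks(self.downsample(x))
```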

I also implemented a new forward_intermediates feature extraction API for models that don't have a straightforward nn.Sequential of stages that works easily with the original helpers. It exists as its own method on the model and can also be used by a new feature extraction wrapper... it's being used for vit models and some others that were difficult to handle with the original wrappers...
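A hedged usage sketch of forward_intermediates as exposed on timm's ViT models around that time; the (final, intermediates) return shape is assumed from that API and may differ for other models.

```python
import timm
import torch

model = timm.create_model('vit_base_patch16_224', pretrained=False)
x = torch.randn(1, 3, 224, 224)
# returns the final features plus a list of intermediate feature maps
final, intermediates = model.forward_intermediates(x)
print(len(intermediates), intermediates[0].shape)
```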

EDIT: will probably look at merging & testing this + #2169 + maybe an initial MobileNetV4 as the next push after the one I'm currently wrapping up.
