Support open_clip with NPU backend #813

Open

MengqingCao wants to merge 2 commits into main
Conversation

@MengqingCao (Author) commented Feb 5, 2024

open_clip performs great on CLIP model training and inference, but unfortunately it seems to support only GPU and CPU at the moment. I have noticed that there is a need for other backends:

This PR adds Ascend NPU backend support. I tested the NPU support by evaluating the ViT-L-14 model on the ImageNet-1k dataset, and everything works well.
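For context, here is a rough sketch of the kind of device selection an NPU backend needs, assuming the torch_npu plugin is installed; the `pick_device` helper is purely illustrative and is not the code in this PR:

```python
# Illustrative sketch only (not the PR's diff): select an Ascend NPU when the
# torch_npu plugin is present, otherwise fall back to CUDA or CPU.
import torch

try:
    import torch_npu  # noqa: F401  # registers the "npu" device type with PyTorch
    _HAS_NPU = torch.npu.is_available()
except ImportError:
    _HAS_NPU = False


def pick_device(preferred: str = "npu") -> torch.device:
    """Hypothetical helper: prefer NPU, then CUDA, then CPU."""
    if preferred == "npu" and _HAS_NPU:
        return torch.device("npu:0")
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    return torch.device("cpu")
```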

Evaluation on NPU was run with:

python3 -m training.main \
    --model ViT-L-14 \
    --pretrained "./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin" \
    --seed 0 \
    --imagenet-val './data/ImageNet-1000/val'

The pretrained weights were downloaded from laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.
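As a side note, the same checkpoint can also be exercised through the open_clip Python API. A minimal sketch, assuming torch_npu is installed; "example.jpg" and the checkpoint path are placeholders:

```python
# Minimal sketch: zero-shot classification on NPU via the open_clip Python API.
import torch
import torch_npu  # noqa: F401  # makes the "npu" device available
import open_clip
from PIL import Image

device = "npu:0"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin",
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = tokenizer(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```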

The evaluation results of ViT-L-14 on NPU:

  • imagenet-zeroshot-val-top1: 78.89%
  • imagenet-zeroshot-val-top5: 95.46%

These results are close to those on GPU (top-1 acc: 79.2%).

Detailed logs of the evaluation run:

2024-02-05,08:00:10 | INFO | Running with a single process. Device npu:0.
2024-02-05,08:00:10 | INFO | Loaded ViT-L-14 model config.
2024-02-05,08:00:17 | INFO | Loading pretrained ViT-L-14 weights (./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin).
2024-02-05,08:00:21 | INFO | Model:
2024-02-05,08:00:21 | INFO | CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-23): 24 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (ls_1): Identity()
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
        (ls_2): Identity()
      )
    )
  )
  (token_embedding): Embedding(49408, 768)
  (ln_final): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2024-02-05,08:00:21 | INFO | Params:
2024-02-05,08:00:21 | INFO |   accum_freq: 1
2024-02-05,08:00:21 | INFO |   aug_cfg: {}
2024-02-05,08:00:21 | INFO |   batch_size: 64
2024-02-05,08:00:21 | INFO |   beta1: 0.9
2024-02-05,08:00:21 | INFO |   beta2: 0.98
2024-02-05,08:00:21 | INFO |   checkpoint_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/checkpoints
2024-02-05,08:00:21 | INFO |   coca_caption_loss_weight: 2.0
2024-02-05,08:00:21 | INFO |   coca_contrastive_loss_weight: 1.0
2024-02-05,08:00:21 | INFO |   copy_codebase: False
2024-02-05,08:00:21 | INFO |   csv_caption_key: title
2024-02-05,08:00:21 | INFO |   csv_img_key: filepath
2024-02-05,08:00:21 | INFO |   csv_separator: 	
2024-02-05,08:00:21 | INFO |   dataset_resampled: False
2024-02-05,08:00:21 | INFO |   dataset_type: auto
2024-02-05,08:00:21 | INFO |   ddp_static_graph: False
2024-02-05,08:00:21 | INFO |   debug: False
2024-02-05,08:00:21 | INFO |   delete_previous_checkpoint: False
2024-02-05,08:00:21 | INFO |   device: npu:0
2024-02-05,08:00:21 | INFO |   dist_backend: nccl
2024-02-05,08:00:21 | INFO |   dist_url: env://
2024-02-05,08:00:21 | INFO |   distill: False
2024-02-05,08:00:21 | INFO |   distill_model: None
2024-02-05,08:00:21 | INFO |   distill_pretrained: None
2024-02-05,08:00:21 | INFO |   distributed: False
2024-02-05,08:00:21 | INFO |   epochs: 32
2024-02-05,08:00:21 | INFO |   epochs_cooldown: None
2024-02-05,08:00:21 | INFO |   eps: 1e-06
2024-02-05,08:00:21 | INFO |   force_custom_text: False
2024-02-05,08:00:21 | INFO |   force_image_size: None
2024-02-05,08:00:21 | INFO |   force_patch_dropout: None
2024-02-05,08:00:21 | INFO |   force_quick_gelu: False
2024-02-05,08:00:21 | INFO |   gather_with_grad: False
2024-02-05,08:00:21 | INFO |   grad_checkpointing: False
2024-02-05,08:00:21 | INFO |   grad_clip_norm: None
2024-02-05,08:00:21 | INFO |   horovod: False
2024-02-05,08:00:21 | INFO |   image_interpolation: None
2024-02-05,08:00:21 | INFO |   image_mean: None
2024-02-05,08:00:21 | INFO |   image_resize_mode: None
2024-02-05,08:00:21 | INFO |   image_std: None
2024-02-05,08:00:21 | INFO |   imagenet_v2: None
2024-02-05,08:00:21 | INFO |   imagenet_val: ./data/ImageNet-1000/val
2024-02-05,08:00:21 | INFO |   local_loss: False
2024-02-05,08:00:21 | INFO |   local_rank: 0
2024-02-05,08:00:21 | INFO |   lock_image: False
2024-02-05,08:00:21 | INFO |   lock_image_freeze_bn_stats: False
2024-02-05,08:00:21 | INFO |   lock_image_unlocked_groups: 0
2024-02-05,08:00:21 | INFO |   lock_text: False
2024-02-05,08:00:21 | INFO |   lock_text_freeze_layer_norm: False
2024-02-05,08:00:21 | INFO |   lock_text_unlocked_layers: 0
2024-02-05,08:00:21 | INFO |   log_every_n_steps: 100
2024-02-05,08:00:21 | INFO |   log_level: 20
2024-02-05,08:00:21 | INFO |   log_local: False
2024-02-05,08:00:21 | INFO |   log_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/out.log
2024-02-05,08:00:21 | INFO |   logs: ./logs/
2024-02-05,08:00:21 | INFO |   lr: 0.0005
2024-02-05,08:00:21 | INFO |   lr_cooldown_end: 0.0
2024-02-05,08:00:21 | INFO |   lr_cooldown_power: 1.0
2024-02-05,08:00:21 | INFO |   lr_scheduler: cosine
2024-02-05,08:00:21 | INFO |   model: ViT-L-14
2024-02-05,08:00:21 | INFO |   name: 2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp
2024-02-05,08:00:21 | INFO |   no_set_device_rank: False
2024-02-05,08:00:21 | INFO |   precision: amp
2024-02-05,08:00:21 | INFO |   pretrained: ./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
2024-02-05,08:00:21 | INFO |   pretrained_image: False
2024-02-05,08:00:21 | INFO |   rank: 0
2024-02-05,08:00:21 | INFO |   remote_sync: None
2024-02-05,08:00:21 | INFO |   remote_sync_frequency: 300
2024-02-05,08:00:21 | INFO |   remote_sync_protocol: s3
2024-02-05,08:00:21 | INFO |   report_to: 
2024-02-05,08:00:21 | INFO |   resume: None
2024-02-05,08:00:21 | INFO |   save_frequency: 1
2024-02-05,08:00:21 | INFO |   save_most_recent: False
2024-02-05,08:00:21 | INFO |   seed: 0
2024-02-05,08:00:21 | INFO |   siglip: False
2024-02-05,08:00:21 | INFO |   skip_scheduler: False
2024-02-05,08:00:21 | INFO |   tensorboard: False
2024-02-05,08:00:21 | INFO |   tensorboard_path: 
2024-02-05,08:00:21 | INFO |   torchcompile: False
2024-02-05,08:00:21 | INFO |   torchscript: False
2024-02-05,08:00:21 | INFO |   trace: False
2024-02-05,08:00:21 | INFO |   train_data: None
2024-02-05,08:00:21 | INFO |   train_data_upsampling_factors: None
2024-02-05,08:00:21 | INFO |   train_num_samples: None
2024-02-05,08:00:21 | INFO |   use_bn_sync: False
2024-02-05,08:00:21 | INFO |   use_bnb_linear: None
2024-02-05,08:00:21 | INFO |   val_data: None
2024-02-05,08:00:21 | INFO |   val_frequency: 1
2024-02-05,08:00:21 | INFO |   val_num_samples: None
2024-02-05,08:00:21 | INFO |   wandb: False
2024-02-05,08:00:21 | INFO |   wandb_notes: 
2024-02-05,08:00:21 | INFO |   wandb_project_name: open-clip
2024-02-05,08:00:21 | INFO |   warmup: 10000
2024-02-05,08:00:21 | INFO |   wd: 0.2
2024-02-05,08:00:21 | INFO |   workers: 4
2024-02-05,08:00:21 | INFO |   world_size: 1
2024-02-05,08:00:21 | INFO |   zeroshot_frequency: 2
2024-02-05,08:00:21 | INFO | Starting zero-shot imagenet.
2024-02-05,08:00:21 | INFO | Building zero-shot classifier
2024-02-05,08:01:13 | INFO | Using classifier
2024-02-05,08:02:09 | INFO | Finished zero-shot imagenet.
2024-02-05,08:02:09 | INFO | Eval Epoch: 0 imagenet-zeroshot-val-top1: 0.7889	imagenet-zeroshot-val-top5: 0.9546

@rom1504 (Collaborator) commented Feb 5, 2024 via email

@MengqingCao (Author)

> Cool! How is the inference and training speed?

Your speed of reply is amazing! :)
As the screenshot below shows, it takes around 55 s to run ViT-L-14 inference on the ImageNet-1k validation dataset (with batch size 64 and a single NPU device).
[screenshot of the evaluation timing]

So I think it is fast, but I haven't measured the exact FLOPS. Are FLOPS numbers required?

@rom1504 (Collaborator) commented Feb 5, 2024 via email

@MengqingCao (Author)

> A metric we usually look at is the sample/s per accelerator. Some baselines: on one 3080 GPU, B/32 inference speed is about 1300 sample/s and L/14 is about 300 sample/s. Usually increasing the batch size to values like 256 helps. For training on one A100 it looks like 250 sample/s for B/32 (can be more if using fewer accelerators, hence having less of an interconnect bottleneck) and 80 sample/s for L/14, usually with batch sizes around 128 per GPU. I think it would be very interesting to have similar numbers on NPU.

Sorry for the late reply, and thanks for your explanation.

I've noticed that this metric is already implemented in the training pipeline, where it is named samples_per_second_per_gpu in src/training/train.py.
I tested the sample/s metric on NPU in the training pipeline, with the following results:

  • 341.550 samples/s/npu for B/32 (with batch size 128)
  • 52.5704 samples/s/npu for L/14 (with batch size 64, the largest batch size my NPU supports)

I'm a bit confused about whether the inference speed you mentioned refers to evaluating the CLIP model, or to the inference step of using the CLIP model for zero-shot image classification.
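For reference, here is roughly how such a samples/s-per-accelerator number can be obtained. This is only a simplified sketch of the idea, not the exact samples_per_second_per_gpu computation in src/training/train.py, and it assumes torch_npu for the synchronize call; `measure_throughput` is a hypothetical helper name:

```python
# Simplified, illustrative throughput measurement (not the train.py implementation).
import time
import torch
import torch_npu  # noqa: F401  # provides torch.npu.synchronize on Ascend


def measure_throughput(model, loader, device="npu:0", warmup_batches=5):
    """Return image-encoding samples/s on a single accelerator."""
    model = model.to(device).eval()
    n_samples, elapsed = 0, 0.0
    with torch.no_grad():
        for i, (images, _) in enumerate(loader):
            images = images.to(device, non_blocking=True)
            start = time.perf_counter()
            model.encode_image(images)
            torch.npu.synchronize()  # wait for queued NPU kernels before stopping the timer
            if i >= warmup_batches:  # ignore the first few warm-up batches
                elapsed += time.perf_counter() - start
                n_samples += images.size(0)
    return n_samples / elapsed
```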

Screenshots

B/32:
[screenshot of B/32 training throughput]

L/14:
[screenshot of L/14 training throughput]

@MengqingCao (Author)

@rom1504 Hi, a few weeks have passed. If there are any suggestions or concerns, please let me know and I will address them as soon as possible.

@MengqingCao (Author) commented Mar 25, 2024

Could anyone help with reviewing? Thanks 👍 @rom1504 @rwightman @gabrielilharco @bryant1410 @mitchellnw
