Support open_clip with NPU backend #813

Open

MengqingCao wants to merge 2 commits into main
Conversation

@MengqingCao (Author) commented Feb 5, 2024

open_clip performs great on CLIP model training and inference, but unfortunately it seems to support only GPU and CPU at the moment. I have noticed that there is a need for other backends:

This PR adds Ascend NPU backend support. I tested the NPU support by evaluating the ViT-L-14 model on the ImageNet-1k dataset, and everything works well.
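For context, here is a rough sketch of the kind of device selection an NPU backend needs, assuming the torch_npu plugin is installed; the `pick_device` helper is purely illustrative and is not the code in this PR:

```python
# Illustrative sketch only (not the PR's diff): select an Ascend NPU when the
# torch_npu plugin is present, otherwise fall back to CUDA or CPU.
import torch

try:
    import torch_npu  # noqa: F401  # registers the "npu" device type with PyTorch
    _HAS_NPU = torch.npu.is_available()
except ImportError:
    _HAS_NPU = False


def pick_device(preferred: str = "npu") -> torch.device:
    """Hypothetical helper: prefer NPU, then CUDA, then CPU."""
    if preferred == "npu" and _HAS_NPU:
        return torch.device("npu:0")
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    return torch.device("cpu")
```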

Evaluation on NPU was run with:

python3 -m training.main \
    --model ViT-L-14 \
    --pretrained "./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin" \
    --seed 0 \
    --imagenet-val './data/ImageNet-1000/val'

The pretrained weights were downloaded from laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.
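As a side note, the same checkpoint can also be exercised through the open_clip Python API. A minimal sketch, assuming torch_npu is installed; "example.jpg" and the checkpoint path are placeholders:

```python
# Minimal sketch: zero-shot classification on NPU via the open_clip Python API.
import torch
import torch_npu  # noqa: F401  # makes the "npu" device available
import open_clip
from PIL import Image

device = "npu:0"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin",
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = tokenizer(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```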

The evaluation results of ViT-L-14 on NPU:

  • imagenet-zeroshot-val-top1: 78.89%
  • imagenet-zeroshot-val-top5: 95.46%

These results are close to those on GPU (top-1 acc: 79.2%).

Detailed logs of the evaluation run:

2024-02-05,08:00:10 | INFO | Running with a single process. Device npu:0.
2024-02-05,08:00:10 | INFO | Loaded ViT-L-14 model config.
2024-02-05,08:00:17 | INFO | Loading pretrained ViT-L-14 weights (./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin).
2024-02-05,08:00:21 | INFO | Model:
2024-02-05,08:00:21 | INFO | CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-23): 24 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (ls_1): Identity()
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
        (ls_2): Identity()
      )
    )
  )
  (token_embedding): Embedding(49408, 768)
  (ln_final): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2024-02-05,08:00:21 | INFO | Params:
2024-02-05,08:00:21 | INFO |   accum_freq: 1
2024-02-05,08:00:21 | INFO |   aug_cfg: {}
2024-02-05,08:00:21 | INFO |   batch_size: 64
2024-02-05,08:00:21 | INFO |   beta1: 0.9
2024-02-05,08:00:21 | INFO |   beta2: 0.98
2024-02-05,08:00:21 | INFO |   checkpoint_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/checkpoints
2024-02-05,08:00:21 | INFO |   coca_caption_loss_weight: 2.0
2024-02-05,08:00:21 | INFO |   coca_contrastive_loss_weight: 1.0
2024-02-05,08:00:21 | INFO |   copy_codebase: False
2024-02-05,08:00:21 | INFO |   csv_caption_key: title
2024-02-05,08:00:21 | INFO |   csv_img_key: filepath
2024-02-05,08:00:21 | INFO |   csv_separator: 	
2024-02-05,08:00:21 | INFO |   dataset_resampled: False
2024-02-05,08:00:21 | INFO |   dataset_type: auto
2024-02-05,08:00:21 | INFO |   ddp_static_graph: False
2024-02-05,08:00:21 | INFO |   debug: False
2024-02-05,08:00:21 | INFO |   delete_previous_checkpoint: False
2024-02-05,08:00:21 | INFO |   device: npu:0
2024-02-05,08:00:21 | INFO |   dist_backend: nccl
2024-02-05,08:00:21 | INFO |   dist_url: env://
2024-02-05,08:00:21 | INFO |   distill: False
2024-02-05,08:00:21 | INFO |   distill_model: None
2024-02-05,08:00:21 | INFO |   distill_pretrained: None
2024-02-05,08:00:21 | INFO |   distributed: False
2024-02-05,08:00:21 | INFO |   epochs: 32
2024-02-05,08:00:21 | INFO |   epochs_cooldown: None
2024-02-05,08:00:21 | INFO |   eps: 1e-06
2024-02-05,08:00:21 | INFO |   force_custom_text: False
2024-02-05,08:00:21 | INFO |   force_image_size: None
2024-02-05,08:00:21 | INFO |   force_patch_dropout: None
2024-02-05,08:00:21 | INFO |   force_quick_gelu: False
2024-02-05,08:00:21 | INFO |   gather_with_grad: False
2024-02-05,08:00:21 | INFO |   grad_checkpointing: False
2024-02-05,08:00:21 | INFO |   grad_clip_norm: None
2024-02-05,08:00:21 | INFO |   horovod: False
2024-02-05,08:00:21 | INFO |   image_interpolation: None
2024-02-05,08:00:21 | INFO |   image_mean: None
2024-02-05,08:00:21 | INFO |   image_resize_mode: None
2024-02-05,08:00:21 | INFO |   image_std: None
2024-02-05,08:00:21 | INFO |   imagenet_v2: None
2024-02-05,08:00:21 | INFO |   imagenet_val: ./data/ImageNet-1000/val
2024-02-05,08:00:21 | INFO |   local_loss: False
2024-02-05,08:00:21 | INFO |   local_rank: 0
2024-02-05,08:00:21 | INFO |   lock_image: False
2024-02-05,08:00:21 | INFO |   lock_image_freeze_bn_stats: False
2024-02-05,08:00:21 | INFO |   lock_image_unlocked_groups: 0
2024-02-05,08:00:21 | INFO |   lock_text: False
2024-02-05,08:00:21 | INFO |   lock_text_freeze_layer_norm: False
2024-02-05,08:00:21 | INFO |   lock_text_unlocked_layers: 0
2024-02-05,08:00:21 | INFO |   log_every_n_steps: 100
2024-02-05,08:00:21 | INFO |   log_level: 20
2024-02-05,08:00:21 | INFO |   log_local: False
2024-02-05,08:00:21 | INFO |   log_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/out.log
2024-02-05,08:00:21 | INFO |   logs: ./logs/
2024-02-05,08:00:21 | INFO |   lr: 0.0005
2024-02-05,08:00:21 | INFO |   lr_cooldown_end: 0.0
2024-02-05,08:00:21 | INFO |   lr_cooldown_power: 1.0
2024-02-05,08:00:21 | INFO |   lr_scheduler: cosine
2024-02-05,08:00:21 | INFO |   model: ViT-L-14
2024-02-05,08:00:21 | INFO |   name: 2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp
2024-02-05,08:00:21 | INFO |   no_set_device_rank: False
2024-02-05,08:00:21 | INFO |   precision: amp
2024-02-05,08:00:21 | INFO |   pretrained: ./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
2024-02-05,08:00:21 | INFO |   pretrained_image: False
2024-02-05,08:00:21 | INFO |   rank: 0
2024-02-05,08:00:21 | INFO |   remote_sync: None
2024-02-05,08:00:21 | INFO |   remote_sync_frequency: 300
2024-02-05,08:00:21 | INFO |   remote_sync_protocol: s3
2024-02-05,08:00:21 | INFO |   report_to: 
2024-02-05,08:00:21 | INFO |   resume: None
2024-02-05,08:00:21 | INFO |   save_frequency: 1
2024-02-05,08:00:21 | INFO |   save_most_recent: False
2024-02-05,08:00:21 | INFO |   seed: 0
2024-02-05,08:00:21 | INFO |   siglip: False
2024-02-05,08:00:21 | INFO |   skip_scheduler: False
2024-02-05,08:00:21 | INFO |   tensorboard: False
2024-02-05,08:00:21 | INFO |   tensorboard_path: 
2024-02-05,08:00:21 | INFO |   torchcompile: False
2024-02-05,08:00:21 | INFO |   torchscript: False
2024-02-05,08:00:21 | INFO |   trace: False
2024-02-05,08:00:21 | INFO |   train_data: None
2024-02-05,08:00:21 | INFO |   train_data_upsampling_factors: None
2024-02-05,08:00:21 | INFO |   train_num_samples: None
2024-02-05,08:00:21 | INFO |   use_bn_sync: False
2024-02-05,08:00:21 | INFO |   use_bnb_linear: None
2024-02-05,08:00:21 | INFO |   val_data: None
2024-02-05,08:00:21 | INFO |   val_frequency: 1
2024-02-05,08:00:21 | INFO |   val_num_samples: None
2024-02-05,08:00:21 | INFO |   wandb: False
2024-02-05,08:00:21 | INFO |   wandb_notes: 
2024-02-05,08:00:21 | INFO |   wandb_project_name: open-clip
2024-02-05,08:00:21 | INFO |   warmup: 10000
2024-02-05,08:00:21 | INFO |   wd: 0.2
2024-02-05,08:00:21 | INFO |   workers: 4
2024-02-05,08:00:21 | INFO |   world_size: 1
2024-02-05,08:00:21 | INFO |   zeroshot_frequency: 2
2024-02-05,08:00:21 | INFO | Starting zero-shot imagenet.
2024-02-05,08:00:21 | INFO | Building zero-shot classifier
2024-02-05,08:01:13 | INFO | Using classifier
2024-02-05,08:02:09 | INFO | Finished zero-shot imagenet.
2024-02-05,08:02:09 | INFO | Eval Epoch: 0 imagenet-zeroshot-val-top1: 0.7889	imagenet-zeroshot-val-top5: 0.9546

@rom1504 (Collaborator) commented Feb 5, 2024 via email

@MengqingCao (Author)

> Cool! How is the inference and training speed?

Your speed of reply is amazing! :)
As the screenshot below shows, it takes around 55 s to run ViT-L-14 inference on the ImageNet-1k validation dataset (with batch size 64 and a single NPU device).
[screenshot of the evaluation timing]

So I think it is fast, but I haven't measured the exact FLOPS. Are FLOPS numbers required?

@rom1504 (Collaborator) commented Feb 5, 2024 via email

@MengqingCao (Author)

> A metric we usually look at is the sample/s per accelerator. Some baselines: on one 3080 GPU, B/32 inference speed is about 1300 sample/s and L/14 is about 300 sample/s. Usually increasing the batch size to values like 256 helps. For training on one A100 it looks like 250 sample/s for B/32 (can be more if using fewer accelerators, hence having less of an interconnect bottleneck) and 80 sample/s for L/14, usually with batch sizes around 128 per GPU. I think it would be very interesting to have similar numbers on NPU.

Sorry for the late reply, and thanks for your explanation.

I've noticed that this metric is already implemented in the training pipeline, where it is named samples_per_second_per_gpu in src/training/train.py.
I tested the sample/s metric on NPU in the training pipeline, with the following results:

  • 341.550 samples/s/npu for B/32 (with batch size 128)
  • 52.5704 samples/s/npu for L/14 (with batch size 64, the largest batch size my NPU supports)

I'm a bit confused about whether the inference speed you mentioned refers to evaluating the CLIP model, or to the inference step of using the CLIP model for zero-shot image classification.
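For reference, here is roughly how such a samples/s-per-accelerator number can be obtained. This is only a simplified sketch of the idea, not the exact samples_per_second_per_gpu computation in src/training/train.py, and it assumes torch_npu for the synchronize call; `measure_throughput` is a hypothetical helper name:

```python
# Simplified, illustrative throughput measurement (not the train.py implementation).
import time
import torch
import torch_npu  # noqa: F401  # provides torch.npu.synchronize on Ascend


def measure_throughput(model, loader, device="npu:0", warmup_batches=5):
    """Return image-encoding samples/s on a single accelerator."""
    model = model.to(device).eval()
    n_samples, elapsed = 0, 0.0
    with torch.no_grad():
        for i, (images, _) in enumerate(loader):
            images = images.to(device, non_blocking=True)
            start = time.perf_counter()
            model.encode_image(images)
            torch.npu.synchronize()  # wait for queued NPU kernels before stopping the timer
            if i >= warmup_batches:  # ignore the first few warm-up batches
                elapsed += time.perf_counter() - start
                n_samples += images.size(0)
    return n_samples / elapsed
```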

Screenshots

B/32:
[screenshot of B/32 training throughput]

L/14:
[screenshot of L/14 training throughput]

@MengqingCao (Author)

@rom1504 Hi, a few weeks have passed. If there are any suggestions or concerns, please let me know and I will address them as soon as possible.

@MengqingCao (Author) commented Mar 25, 2024

Could anyone help with reviewing? Thanks 👍 @rom1504 @rwightman @gabrielilharco @bryant1410 @mitchellnw
