Encountered freezing during start training at iteration 0 #5281

Open
NatchapolShinno opened this issue May 8, 2024 · 2 comments

NatchapolShinno commented May 8, 2024

I'm attempting to implement ViTGaze, but I've run into an issue at a specific line of code. Upon investigation, I noticed that the process is not utilizing GPU resources at all and freezes at this point. Below are my logs: several hours have passed, yet "Starting training from iteration 0" is still the last line printed. I'm training on the VideoAttentionTarget dataset.


[05/08 16:59:38 detectron2]: Model:
GazeAttentionMapper(
  (backbone): ViT(
    (patch_embed): PatchEmbed(
      (proj): Conv2d(3, 384, kernel_size=(14, 14), stride=(14, 14))
    )
    (extra_pos_embed): Identity()
    (blocks): ModuleList(
      (0-11): 12 x Block(
        (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=384, out_features=1152, bias=True)
          (proj): Linear(in_features=384, out_features=384, bias=True)
        )
        (ls1): LayerScale()
        (drop_path): Identity()
        (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=384, out_features=1536, bias=True)
          (act): GELU(approximate='none')
          (drop1): Dropout(p=0.0, inplace=False)
          (norm): Identity()
          (fc2): Linear(in_features=1536, out_features=384, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
        (ls2): LayerScale()
      )
    )
    (norm): Identity()
  )
  (pam): PatchPAM(
    (patch_embed): Sequential(
      (patch_embed): Conv2d(3, 8, kernel_size=(14, 14), stride=(14, 14))
      (act_layer): ReLU(inplace=True)
    )
    (embed): Conv2d(8, 1, kernel_size=(1, 1), stride=(1, 1))
    (aux_embed): Conv2d(8, 1, kernel_size=(1, 1), stride=(1, 1))
  )
  (regressor): UpSampleConv(
    (pre_norm): Identity()
    (conv): Identity()
    (decoder): Sequential(
      (upsample1): Upsample(scale_factor=2.0, mode='bilinear')
      (conv1): Conv2d(24, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu1): ReLU(inplace=True)
      (upsample2): Upsample(scale_factor=2.0, mode='bilinear')
      (conv2): Conv2d(16, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu2): ReLU(inplace=True)
      (upsample3): Upsample(scale_factor=2.0, mode='bilinear')
      (conv3): Conv2d(8, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn3): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu3): ReLU(inplace=True)
      (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
    )
  )
  (classifier): SimpleMlp(
    (classifier): Sequential(
      (dropout0): Dropout(p=0, inplace=False)
      (linear0): Linear(in_features=384, out_features=384, bias=True)
      (relu): ReLU()
      (dropout1): Dropout(p=0, inplace=False)
      (linear1): Linear(in_features=384, out_features=1, bias=True)
    )
  )
  (criterion): GazeMapperCriterion(
    (heatmap_loss): MSELoss()
    (aux_loss): BCEWithLogitsLoss()
  )
)
[05/08 16:59:40 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /home/slab/ViTGaze/output/gazefollow_518/model_final.pth ...
[05/08 16:59:40 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/slab/ViTGaze/output/gazefollow_518/model_final.pth ...
[05/08 16:59:40 d2.engine.train_loop]: Starting training from iteration 0

Environment:

[05/08 16:59:38 detectron2]: Environment info:
-------------------------------  -----------------------------------------------------------------------
sys.platform                     linux
Python                           3.8.10 (default, Jun  4 2021, 15:09:15) [GCC 7.5.0]
numpy                            1.23.5
detectron2                       0.6 @/home/slab/.local/lib/python3.8/site-packages/detectron2
Compiler                         GCC 9.4
CUDA compiler                    CUDA 11.7
detectron2 arch flags            7.5
DETECTRON2_ENV_MODULE            <not set>
PyTorch                          2.0.1+cu117 @/home/slab/.local/lib/python3.8/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA TITAN RTX (arch=7.5)
Driver version                   515.43.04
CUDA_HOME                        /usr/local/cuda-11.7
Pillow                           10.3.0
torchvision                      0.15.2+cu117 @/home/slab/.local/lib/python3.8/site-packages/torchvision
torchvision arch flags           3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.8.1
-------------------------------  -----------------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

Could you please help me? Thank you in advance.
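
For what it's worth, a minimal way to see where a run like this is stuck (a sketch only, assuming the training entry point is a Python script that can be edited) is to arm the standard-library faulthandler before training starts, so the hanging call shows up in the dumped thread stacks:

# Debugging sketch (not part of the original logs): dump the stack of every
# thread to stderr every 5 minutes, so a hang at "Starting training from
# iteration 0" reveals which call never returns (data loading, checkpoint I/O, ...).
import faulthandler
import sys

faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

Alternatively, running py-spy dump --pid <PID> against the stuck process gives the same information without modifying any code.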

github-actions bot added the needs-more-info label on May 8, 2024

github-actions bot commented May 8, 2024

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

github-actions bot removed the needs-more-info label on May 8, 2024
Programmer-RD-AI (Contributor) commented:

Hi,
This may be caused by the size of the dataset as well as the size of the model. I would recommend trying to train a smaller, basic model first and checking whether it runs.
Thank you
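
A hang exactly at "Starting training from iteration 0" can also mean the first batch never arrives from the dataloader. A cheap standalone check is sketched below; build_train_dataset is a placeholder for however ViTGaze actually constructs its VideoAttentionTarget dataset, not the project's real API:

import time
from torch.utils.data import DataLoader

# Placeholder import: substitute the project's own dataset construction here.
from vitgaze_data import build_train_dataset

dataset = build_train_dataset()  # VideoAttentionTarget training split
# num_workers=0 rules out worker deadlocks / shared-memory issues.
loader = DataLoader(dataset, batch_size=2, num_workers=0)

start = time.time()
for i, batch in enumerate(loader):
    print(f"batch {i} loaded after {time.time() - start:.1f}s")
    if i == 2:
        break

If this loop also stalls, the problem is in data loading (dataset paths, annotation parsing, or worker shared memory) rather than in the model; if it runs fine, lowering the dataloader worker count in the training config is a reasonable next step.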
