Encountered freezing during start training at iteration 0 #5281

Open
NatchapolShinno opened this issue May 8, 2024 · 2 comments

NatchapolShinno commented May 8, 2024

I'm attempting to implement ViTGaze, but I've run into an issue at a specific line of code. Upon investigation, I noticed that the process is not utilizing GPU resources at all and freezes at this point. Below are my logs: several hours have passed, yet "Starting training from iteration 0" is still the last line printed. I'm training on the VideoAttentionTarget dataset.


[05/08 16:59:38 detectron2]: Model:
GazeAttentionMapper(
  (backbone): ViT(
    (patch_embed): PatchEmbed(
      (proj): Conv2d(3, 384, kernel_size=(14, 14), stride=(14, 14))
    )
    (extra_pos_embed): Identity()
    (blocks): ModuleList(
      (0-11): 12 x Block(
        (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=384, out_features=1152, bias=True)
          (proj): Linear(in_features=384, out_features=384, bias=True)
        )
        (ls1): LayerScale()
        (drop_path): Identity()
        (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=384, out_features=1536, bias=True)
          (act): GELU(approximate='none')
          (drop1): Dropout(p=0.0, inplace=False)
          (norm): Identity()
          (fc2): Linear(in_features=1536, out_features=384, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
        (ls2): LayerScale()
      )
    )
    (norm): Identity()
  )
  (pam): PatchPAM(
    (patch_embed): Sequential(
      (patch_embed): Conv2d(3, 8, kernel_size=(14, 14), stride=(14, 14))
      (act_layer): ReLU(inplace=True)
    )
    (embed): Conv2d(8, 1, kernel_size=(1, 1), stride=(1, 1))
    (aux_embed): Conv2d(8, 1, kernel_size=(1, 1), stride=(1, 1))
  )
  (regressor): UpSampleConv(
    (pre_norm): Identity()
    (conv): Identity()
    (decoder): Sequential(
      (upsample1): Upsample(scale_factor=2.0, mode='bilinear')
      (conv1): Conv2d(24, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu1): ReLU(inplace=True)
      (upsample2): Upsample(scale_factor=2.0, mode='bilinear')
      (conv2): Conv2d(16, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu2): ReLU(inplace=True)
      (upsample3): Upsample(scale_factor=2.0, mode='bilinear')
      (conv3): Conv2d(8, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn3): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu3): ReLU(inplace=True)
      (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
    )
  )
  (classifier): SimpleMlp(
    (classifier): Sequential(
      (dropout0): Dropout(p=0, inplace=False)
      (linear0): Linear(in_features=384, out_features=384, bias=True)
      (relu): ReLU()
      (dropout1): Dropout(p=0, inplace=False)
      (linear1): Linear(in_features=384, out_features=1, bias=True)
    )
  )
  (criterion): GazeMapperCriterion(
    (heatmap_loss): MSELoss()
    (aux_loss): BCEWithLogitsLoss()
  )
)
[05/08 16:59:40 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /home/slab/ViTGaze/output/gazefollow_518/model_final.pth ...
[05/08 16:59:40 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/slab/ViTGaze/output/gazefollow_518/model_final.pth ...
[05/08 16:59:40 d2.engine.train_loop]: Starting training from iteration 0

Environment:

[05/08 16:59:38 detectron2]: Environment info:
-------------------------------  -----------------------------------------------------------------------
sys.platform                     linux
Python                           3.8.10 (default, Jun  4 2021, 15:09:15) [GCC 7.5.0]
numpy                            1.23.5
detectron2                       0.6 @/home/slab/.local/lib/python3.8/site-packages/detectron2
Compiler                         GCC 9.4
CUDA compiler                    CUDA 11.7
detectron2 arch flags            7.5
DETECTRON2_ENV_MODULE            <not set>
PyTorch                          2.0.1+cu117 @/home/slab/.local/lib/python3.8/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA TITAN RTX (arch=7.5)
Driver version                   515.43.04
CUDA_HOME                        /usr/local/cuda-11.7
Pillow                           10.3.0
torchvision                      0.15.2+cu117 @/home/slab/.local/lib/python3.8/site-packages/torchvision
torchvision arch flags           3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.8.1
-------------------------------  -----------------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

Could you please help me? Thank you in advance.
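
For what it's worth, a minimal way to see where a run like this is stuck (a sketch only, assuming the training entry point is a Python script that can be edited) is to arm the standard-library faulthandler before training starts, so the hanging call shows up in the dumped thread stacks:

# Debugging sketch (not part of the original logs): dump the stack of every
# thread to stderr every 5 minutes, so a hang at "Starting training from
# iteration 0" reveals which call never returns (data loading, checkpoint I/O, ...).
import faulthandler
import sys

faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

Alternatively, running py-spy dump --pid <PID> against the stuck process gives the same information without modifying any code.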

github-actions bot added the needs-more-info label on May 8, 2024

github-actions bot commented May 8, 2024

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

github-actions bot removed the needs-more-info label on May 8, 2024
Programmer-RD-AI (Contributor) commented:

Hi,
This may be caused by the size of the dataset as well as the size of the model. I would recommend trying to train a smaller, basic model first and checking whether it runs.
Thank you
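
A hang exactly at "Starting training from iteration 0" can also mean the first batch never arrives from the dataloader. A cheap standalone check is sketched below; build_train_dataset is a placeholder for however ViTGaze actually constructs its VideoAttentionTarget dataset, not the project's real API:

import time
from torch.utils.data import DataLoader

# Placeholder import: substitute the project's own dataset construction here.
from vitgaze_data import build_train_dataset

dataset = build_train_dataset()  # VideoAttentionTarget training split
# num_workers=0 rules out worker deadlocks / shared-memory issues.
loader = DataLoader(dataset, batch_size=2, num_workers=0)

start = time.time()
for i, batch in enumerate(loader):
    print(f"batch {i} loaded after {time.time() - start:.1f}s")
    if i == 2:
        break

If this loop also stalls, the problem is in data loading (dataset paths, annotation parsing, or worker shared memory) rather than in the model; if it runs fine, lowering the dataloader worker count in the training config is a reasonable next step.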
