Flaky test_memory_format with nn.BatchNorm2d when running with inductor #125967
Skipping the test in the context of #125967 until the issue is root caused and fixed properly. Pull Request resolved: #125970 Approved by: https://github.com/clee2000
Marking high priority as it is easily reproducible
I'll blindly suspect it's due to RNG state. I tried to repro, but this command:
just shows:
What's the correct way to run a disabled test?
Oh, I skipped the test in #125970 to keep trunk sane. Please revert my change before trying to reproduce it; otherwise it shows up as a skipped test.
I at least figured out why test order matters here: it's the dynamo compilation cache. If we run the BatchNorm1d test first, we already have a few compiled functions in the dynamo cache. Later, when we test BatchNorm2d, the cache limit is reached, so we fall back to eager and bypass the issue. To verify this, I added torch._dynamo.reset() at the beginning of the test; now I can repro the issue even if BatchNorm2d is tested after BatchNorm1d.
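The cache-limit fallback described above can be sketched without torch at all. This is a toy, stdlib-only model (the names `toy_compile`, `CACHE_LIMIT`, and the length-based "shape" key are all illustrative, not dynamo's real machinery): the first few distinct input shapes get a cached "compiled" entry, and once the limit is hit, new shapes silently fall back to "eager" — which is exactly how an earlier test's cache entries can mask a compiled-path bug in a later test.

```python
from functools import wraps

CACHE_LIMIT = 2  # stand-in for torch._dynamo.config.cache_size_limit

def toy_compile(fn):
    """Toy model of dynamo's cache-limit fallback: the first CACHE_LIMIT
    distinct argument shapes get a 'compiled' cache entry; after that,
    new shapes fall back to 'eager', bypassing any compiled-path bug."""
    cache = {}

    @wraps(fn)
    def wrapper(x):
        key = len(x)  # crude stand-in for a shape guard
        if key not in cache and len(cache) >= CACHE_LIMIT:
            return ("eager", fn(x))      # cache limit hit: eager fallback
        cache.setdefault(key, fn)        # pretend we compiled a specialization
        return ("compiled", cache[key](x))

    wrapper.cache = cache
    return wrapper

@toy_compile
def total(xs):
    return sum(xs)

print(total([1]))        # → ('compiled', 1)
print(total([1, 2]))     # → ('compiled', 3)   second cache entry
print(total([1, 2, 3]))  # → ('eager', 6)      cache limit reached
total.cache.clear()      # analogue of torch._dynamo.reset()
print(total([1, 2, 3]))  # → ('compiled', 6)   recompiles after the reset
```

Clearing the cache mid-run, as with `torch._dynamo.reset()`, is what exposes the compiled path again regardless of which shapes ran earlier.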
Fix #125967. The test actually fails for empty 4D or 5D tensors when checking the memory format. I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?) I just skip the check for empty tensors. [ghstack-poisoned]
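The logic of that fix can be illustrated with plain shape/stride tuples (the helper names here are hypothetical, not the actual test code): compare strides against the expected channels-last layout, but wave empty tensors through, since a tensor with zero elements has no observable layout.

```python
def channels_last_strides_2d(shape):
    """Expected NCHW strides for a channels-last (NHWC-packed) 4D tensor."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)

def check_memory_format(shape, stride):
    """Sketch of the fix: skip the stride comparison for empty tensors,
    whose strides carry no observable layout information."""
    if 0 in shape:  # empty 4D tensor: nothing meaningful to check
        return True
    return tuple(stride) == channels_last_strides_2d(shape)

# A non-empty channels-last tensor passes the check...
print(check_memory_format((2, 3, 8, 8), (192, 1, 24, 3)))  # → True
# ...and an empty tensor passes regardless of its strides.
print(check_memory_format((0, 3, 8, 8), (192, 64, 8, 1)))  # → True
```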
In #125967, we found test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests. This PR clears the dynamo cache before each unit test so we get more deterministic unit-test results. Pull Request resolved: #126586 Approved by: https://github.com/jansel
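The per-test reset that the PR adds follows the standard unittest `setUp` pattern. A minimal stdlib-only sketch (a plain dict stands in for the global dynamo cache; in the real PR the reset is `torch._dynamo.reset()`):

```python
import unittest

class CacheResettingTestCase(unittest.TestCase):
    cache = {}  # stand-in for the global dynamo compilation cache

    def setUp(self):
        # Analogue of calling torch._dynamo.reset() before each test:
        # every test starts with a cold cache, so its result no longer
        # depends on which tests ran earlier in the same process.
        type(self).cache.clear()

class ExampleTest(CacheResettingTestCase):
    def test_a_populates_cache(self):
        self.cache["compiled_fn"] = object()  # simulate a compilation
        self.assertEqual(len(self.cache), 1)

    def test_b_starts_cold(self):
        # Without the setUp reset, test_a's entry would leak in here.
        self.assertEqual(len(self.cache), 0)

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
)
```

This is also why the BatchNorm2d failure became reproducible in isolation: with a cold cache every run, the compiled path is exercised rather than the eager fallback.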
This test started to fail in trunk for nn.BatchNorm2d recently. I think it's another example of #125239, where the order of the tests matters. On devgpu, running the test alone fails the same way it fails on CI:
But if I run it after nn.BatchNorm1d, it passes:
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @malfet @pytorch/pytorch-dev-infra @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @ColinPeppler @amjames @desertfire