'CUDNN_STATUS_EXECUTION_FAILED' occurs #35

Open
zhangmytf opened this issue Oct 2, 2019 · 1 comment
Comments

@zhangmytf

Hi, when I run the code on my server (4× V100, CUDA 9.0, cuDNN 7.0), I get the error below.
Could you please help me?
Which versions of CUDA and cuDNN do you use?
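
For anyone comparing environments, here is a minimal sketch of one way to read the installed cuDNN version. It assumes the cuDNN 7 layout, where `cudnn.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` (newer releases keep these in `cudnn_version.h`). The helper name `parse_cudnn_version` and the sample header text are made up for illustration; the header location depends on the installation, so the function takes the header contents rather than a path.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cuDNN header text, or None if a macro is missing."""
    values = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7"
        match = re.search(
            r"^\s*#\s*define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)


# Illustrative header fragment (hypothetical, not taken from the reporter's machine).
SAMPLE_HEADER = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""

if __name__ == "__main__":
    print(parse_cudnn_version(SAMPLE_HEADER))
```

To use it on a real machine, read the header file from wherever cuDNN is installed and pass its text to the function.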

`/home/admin/algomodule/test/kaggle-web-traffic# python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500
WARNING:tensorflow:From /home/admin/algomodule/test/kaggle-web-traffic/model.py:144: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-10-02 06:00:37.510047: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-10-02 06:00:37.909980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:37.911006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.047527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.048568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:09.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.179680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.180730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0a.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.319747: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.320794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0b.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.320867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-10-02 06:00:40.205535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-02 06:00:40.205600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-10-02 06:00:40.205610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y Y
2019-10-02 06:00:40.205616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y Y
2019-10-02 06:00:40.205631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N Y
2019-10-02 06:00:40.205641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: Y Y Y N
2019-10-02 06:00:40.205992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14941 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
2019-10-02 06:00:40.508989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14941 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
2019-10-02 06:00:40.811745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14941 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0a.0, compute capability: 7.0)
2019-10-02 06:00:41.114312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14941 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0b.0, compute capability: 7.0)
1: 0%| | 0/566 [00:00<?, ?it/s]2019-10-02 06:00:47.758076: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.770054: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.782300: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "trainer.py", line 786, in <module>
train(**param_dict)
File "trainer.py", line 599, in train
step = trainer.train_step(sess, epoch)
File "trainer.py", line 251, in train_step
results = self._metric_step(Stage.TRAIN, ops, sess, epoch, summary_every=20)
File "trainer.py", line 235, in _metric_step
results = sess.run(ops)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'm_0/cudnn_gru/CudnnRNN', defined at:
File "trainer.py", line 786, in <module>
train(**param_dict)
File "trainer.py", line 520, in train
all_models.append(create_model(scope, i, prefix=prefix, seed=seed + i))
File "trainer.py", line 474, in create_model
train_model = Model(pipe, hparams, is_train=True, graph_prefix=prefix, asgd_decay=asgd_decay, seed=seed)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 342, in __init__
transpose_output=False)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 65, in make_encoder
rnn_out, (rnn_state,) = cuda_model(inputs=rnn_time_input)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call
outputs = self.call(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 412, in call
training)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 487, in _forward
seed=self._seed)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 922, in _cudnn_rnn
outputs, output_h, output_c, _ = gen_cudnn_rnn_ops.cudnn_rnn(**args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 115, in cudnn_rnn
is_training=is_training, name=name)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]`
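
The log above repeats the same error several times. As a triage aid, the following rough sketch pulls out the parts that matter: the cuDNN status, the failing cuDNN call with its source location, and the attributes of the graph node. The function name `summarize_cudnn_error` is hypothetical, and the regular expressions are only assumed to fit the message layout seen in this log; they are not a general TensorFlow log parser.

```python
import re

# Error text copied from the log above (one occurrence of the repeated message).
ERROR_TEXT = """tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
"""


def summarize_cudnn_error(text):
    """Extract status, failing call, and node attributes from an error message like the one above."""
    summary = {}

    status = re.search(r"\bCUDNN_STATUS_[A-Z_]+\b", text)
    if status:
        summary["status"] = status.group(0)

    # e.g. "in path/to/file.cc(918): 'cudnnSomething("
    site = re.search(r"in (\S+)\((\d+)\): '(\w+)\(", text)
    if site:
        summary["source_file"] = site.group(1)
        summary["source_line"] = int(site.group(2))
        summary["cudnn_call"] = site.group(3)

    # e.g. "[[Node: name = Op[attr=value, ...](inputs)]]"
    node = re.search(r"\[\[Node: (\S+) = (\w+)\[(.*?)\]\(", text)
    if node:
        summary["node"] = node.group(1)
        summary["op"] = node.group(2)
        pairs = re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', node.group(3))
        summary["attrs"] = {key: value.strip('"') for key, value in pairs}

    return summary


if __name__ == "__main__":
    for key, value in summarize_cudnn_error(ERROR_TEXT).items():
        print(key, "=", value)
```

For this log, the summary shows the failure is raised from `cudnnSetDropoutDescriptor` while setting up the `CudnnRNN` node on GPU:0, with a non-zero `dropout` attribute.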

@allen20200111

Did you fix it? I'm using TensorFlow 1.14, CUDA 10.0 and cuDNN 7.6, and I have the same problem.
