Error reproducing competition results #32

ndkulkarni · 2019-04-25T17:38:15Z

I am trying to reproduce the competition results based on the instructions in the README.

1. I download and unzip the files from the Kaggle competition into the data/ folder.
2. I run the command python make_features.py data/vars --add_days=63, which creates the following pickle files: 2017-08-15_2017-09-11.pkl, all.pkl, train_2.pkl and the directory vars/ in the data/ folder.
3. I run the trainer python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500 and receive the following error:

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(944): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'

I am using a p3.2xlarge AWS instance with the Deep Learning AMI, Python 3.6.5, and tensorflow-gpu==1.12.0.

If I downgrade to TF-GPU 1.10, I still get the same error.

How can I resolve this?
Full output from train command
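
Before looking at the trainer itself, it can help to rule out an incomplete make_features.py run. The sketch below only checks that the files and the directory named in this report exist. The helper name missing_feature_outputs and the default data location are assumptions made for illustration; they are not part of the repository.

```python
from pathlib import Path

# Names taken from the report above; everything else here is hypothetical.
EXPECTED_FILES = ("2017-08-15_2017-09-11.pkl", "all.pkl", "train_2.pkl")
EXPECTED_DIRS = ("vars",)


def missing_feature_outputs(data_dir="data"):
    """Return the expected make_features.py outputs that are absent from data_dir."""
    root = Path(data_dir)
    # Files first, in the order they are listed in the report.
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Directories are reported with a trailing slash to tell them apart.
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    missing = missing_feature_outputs()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All expected feature outputs are present.")
```

An empty result only says the outputs exist, not that their contents are valid, so it narrows the problem down rather than explaining the cuDNN error.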


limu1928 · 2019-07-25T23:16:19Z

I have the same problem. Did you figure it out?

limu1928 · 2019-07-25T23:52:16Z

Simply starting over on a new instance will work.
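
The thread does not establish why a fresh instance helps. One possibility, and it is only an assumption, is that the GPU on the old instance was left in a bad state or was still held by an earlier process. A minimal sketch for checking the second case is below: it asks nvidia-smi which processes currently hold the GPU. The function names are hypothetical, and the code avoids anything newer than Python 3.6.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, process_name, used_memory` CSV rows into dictionaries."""
    apps = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        apps.append({
            "pid": int(parts[0]),
            # Rejoin the middle fields in case the process name contains a comma.
            "name": ",".join(parts[1:-1]),
            "memory": parts[-1],
        })
    return apps


def gpu_compute_apps():
    """List processes holding the GPU, or [] if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; works on Python 3.6
        check=False,
    )
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for app in gpu_compute_apps():
        print(app["pid"], app["name"], app["memory"])
```

If this lists a stale python process from an earlier run, stopping it before retrying may be worth trying first; if it lists nothing and the error persists, moving to a new instance as described above is the fix reported in this thread.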
