GPU error #73

Open
fredzfm opened this issue Oct 10, 2018 · 9 comments

@fredzfm

fredzfm commented Oct 10, 2018

I tried to run it with a GPU and got the following error. Can anyone help me with this?

(Python36) D:\chess\chess-alpha-zero>python src/chess_zero/run.py self
2018-10-10 11:45:55,139@chess_zero.manager INFO # config type: mini
Using TensorFlow backend.
2018-10-10 11:45:59,436@chess_zero.agent.model_chess DEBUG # loading model from D:\chess\chess-alpha-zero\data\model\model_best_config.json
2018-10-10 11:45:59.478648: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-10-10 11:45:59.695745: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-10-10 11:45:59.790370: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-10-10 11:45:59.795932: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-10-10 11:48:20.448740: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-10 11:48:20.451530: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929] 0 1
2018-10-10 11:48:20.453788: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0: N N
2018-10-10 11:48:20.455816: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 1: N N
2018-10-10 11:48:20.458363: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8795 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-10 11:48:20.834375: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8795 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1567, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "src/chess_zero/run.py", line 20, in <module>
manager.start()
File "src\chess_zero\manager.py", line 64, in start
return self_play.start(config)
File "src\chess_zero\worker\self_play.py", line 25, in start
return SelfPlayWorker(config).start()
File "src\chess_zero\worker\self_play.py", line 45, in __init__
self.current_model = self.load_model()
File "src\chess_zero\worker\self_play.py", line 85, in load_model
if self.config.opts.new or not load_best_model_weight(model):
File "src\chess_zero\lib\model_helper.py", line 15, in load_best_model_weight
return model.load(model.config.resource.model_best_config_path, model.config.resource.model_best_weight_path)
File "src\chess_zero\agent\model_chess.py", line 145, in load
self.model = Model.from_config(json.load(f))
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\engine\network.py", line 1032, in from_config
process_node(layer, node_data)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\engine\network.py", line 991, in process_node
layer(unpack_singleton(input_tensors), **kwargs)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\engine\base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\layers\normalization.py", line 206, in call
training=training)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 3123, in in_train_phase
x = switch(training, x, alt)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 3058, in switch
else_expression_fn)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\util\deprecation.py", line 432, in new_func
return func(*args, **kwargs)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2072, in cond
orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 1913, in BuildCondBranch
original_result = fn()
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\layers\normalization.py", line 167, in normalize_inference
epsilon=self.epsilon)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 1908, in batch_normalization
mean = tf.reshape(mean, (-1))
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 6112, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
op_def=op_def)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1734, in __init__
control_input_ops)
File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1570, in _create_c_op
raise ValueError(str(e))
ValueError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].
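
A note on the failing line: the traceback ends at mean = tf.reshape(mean, (-1)) inside the Keras TensorFlow backend. In Python, (-1) is just the integer -1 in parentheses, not a one-element tuple, so the shape argument is a scalar (rank 0). That would be consistent with the "must be rank 1 but is rank 0" message. The plain-Python sketch below only illustrates that difference; shape_rank is a hypothetical helper, not TensorFlow or Keras code.

```python
def shape_rank(shape):
    """Return 0 for a bare scalar and 1 for a flat list or tuple (hypothetical helper)."""
    return 1 if isinstance(shape, (list, tuple)) else 0

scalar_shape = (-1)    # parentheses only group: this is the int -1
vector_shape = (-1,)   # the trailing comma makes a one-element tuple

print(type(scalar_shape).__name__, shape_rank(scalar_shape))
print(type(vector_shape).__name__, shape_rank(vector_shape))
```

If this is indeed the cause, the problem would sit in the installed Keras package rather than in this project's code.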

@brianprichardson
Collaborator

It has been a while, but for a "self" play run, try without any weight file; it should create one to start with.
The best weights are for "uci" play.

@fredzfm
Author

fredzfm commented Oct 10, 2018

Thanks brianprichardson.

Do you mean it never runs with a GPU?
I have tried "self" without any JSON model or weight file and still got the error.

@fredzfm
Author

fredzfm commented Oct 10, 2018

tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

@brianprichardson
Collaborator

Depending on the situation, the weights (.h5) and model (.json) files must match the net architecture in the config file (typically mini.py). The stronger ones that I uploaded do not match the current config files.

IIRC, when running "self", if there are no .h5 and .json files, they will be created first. You can add
self.model.summary()
at the end of def build(self) in class ChessModel (model_chess.py in the agent dir) to see whether it is creating a new model from the specs in the mini.py file.
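
One rough way to compare a saved model file with the config, without loading TensorFlow at all, is to read the .json file directly and list its convolution layers. This is only a sketch: it assumes the file holds a Keras-style model config with a list of layers, each carrying a class_name and a config dict. The names list_conv_filters and load_and_list are hypothetical.

```python
import json

def list_conv_filters(model_config):
    """Return (layer name, filter count) for each Conv2D layer in a config dict.

    Assumption: the layer list lives under "layers", either at the top level
    or nested under a "config" dict. Any other layout yields an empty list.
    """
    layers = model_config.get("layers")
    if layers is None:
        nested = model_config.get("config")
        layers = nested.get("layers", []) if isinstance(nested, dict) else []
    result = []
    for layer in layers:
        if layer.get("class_name") == "Conv2D":
            cfg = layer.get("config", {})
            result.append((cfg.get("name"), cfg.get("filters")))
    return result

def load_and_list(path):
    """Read a model config JSON file and list its Conv2D layers."""
    with open(path) as f:
        return list_conv_filters(json.load(f))
```

For example, the result of load_and_list("data/model/model_best_config.json") could be compared by eye with the filter settings in mini.py.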

When running "uci", it just tries to read the best files. Other params in the config file can still be set, but most are ignored for uci; playouts, for example, is 1,200 (sort of like a fixed number of nodes).

The first output you posted shows it is trying to run with the GPU. As slow as it is, it will be far too slow to run without a GPU, and your 1080 Ti is a very good one.

I would try a clean download and just try to run with "uci" and enter the "uci" and "isready" (remember to wait for the readyok), and then "go". You should get a bestmove after some time. If that works, then your packages and gpu are all working ok and we can work from there.

What are you trying to do, in general? Self-play training is extremely slow and takes a lot of disk space for the intermediate input plane files. That's why I have a tweaked version that takes pgn input and trains directly from that.

@reikdas

reikdas commented Nov 11, 2018

This issue might be related to #75 and #76.

I would try a clean download and just try to run with "uci" and enter the "uci" and "isready" (remember to wait for the readyok), and then "go". You should get a bestmove after some time. If that works, then your packages and gpu are all working ok and we can work from there.

What is the command to run this?
python src/chess_zero/run.py uci --isready does not work.

@brianprichardson
Collaborator

brianprichardson commented Nov 11, 2018

First only do:
python src/chess_zero/run.py uci

Then, after it loads enter:
uci [wait for uciok]
isready [wait for readyok]
go [should see some bestmove output but may take some time with cpu and gpu busy]
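
The same three-step handshake can be scripted instead of typed by hand. The following is a minimal sketch, assuming the engine reads commands from stdin and writes its replies to stdout as UCI engines normally do; wait_for and uci_smoke_test are hypothetical helper names, and there is no timeout, so it blocks until the engine answers or exits.

```python
import subprocess

def wait_for(stream, token):
    """Read lines from stream until one starts with token; return that line."""
    for line in stream:
        line = line.strip()
        if line.startswith(token):
            return line
    raise EOFError("stream ended before %r was seen" % token)

def uci_smoke_test(cmd):
    """Start the engine given by cmd and run uci -> isready -> go in order."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            universal_newlines=True, bufsize=1)
    try:
        steps = (("uci", "uciok"), ("isready", "readyok"), ("go", "bestmove"))
        for command, expected in steps:
            proc.stdin.write(command + "\n")
            proc.stdin.flush()
            reply = wait_for(proc.stdout, expected)
        return reply  # the final "bestmove ..." line
    finally:
        proc.kill()
        proc.wait()

# Example call from the repository root (not executed here):
# uci_smoke_test(["python", "src/chess_zero/run.py", "uci"])
```

If the model fails to load, the engine exits and wait_for raises EOFError, which separates a load failure from a slow search.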

@reikdas

reikdas commented Nov 11, 2018

I get the following error logs when I issue isready:

Using TensorFlow backend.
2018-11-11 18:52:28.546655: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-11-11 18:52:28.667415: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-11 18:52:28.667979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: 
name: GeForce GTX 1070 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.2655
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.51GiB
2018-11-11 18:52:28.667993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-11-11 18:52:29.920908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-11 18:52:29.920943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 
2018-11-11 18:52:29.920952: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N 
2018-11-11 18:52:29.921135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7243 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1626, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/chess_zero/run.py", line 20, in <module>
    manager.start()
  File "src/chess_zero/manager.py", line 76, in start
    return uci.start(config)
  File "src/chess_zero/play_game/uci.py", line 31, in start
    me_player = get_player(config)
  File "src/chess_zero/play_game/uci.py", line 67, in get_player
    if not load_best_model_weight(model):
  File "src/chess_zero/lib/model_helper.py", line 15, in load_best_model_weight
    return model.load(model.config.resource.model_best_config_path, model.config.resource.model_best_weight_path)
  File "src/chess_zero/agent/model_chess.py", line 145, in load
    self.model = Model.from_config(json.load(f))
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1032, in from_config
    process_node(layer, node_data)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 991, in process_node
    layer(unpack_singleton(input_tensors), **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/layers/normalization.py", line 206, in call
    training=training)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3123, in in_train_phase
    x = switch(training, x, alt)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3058, in switch
    else_expression_fn)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2087, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1920, in BuildCondBranch
    original_result = fn()
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/layers/normalization.py", line 167, in normalize_inference
    epsilon=self.epsilon)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1908, in batch_normalization
    mean = tf.reshape(mean, (-1))
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6296, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1790, in __init__
    control_input_ops)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1629, in _create_c_op
    raise ValueError(str(e))
ValueError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

@adangert

I got the same errors:
ValueError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

@brianprichardson
Collaborator

See #75; there is a link to a fork with a working version.
