This repository has been archived by the owner on Feb 25, 2022. It is now read-only.

TPU device does not support heartbeats. #272

Open
iliemihai opened this issue Feb 1, 2022 · 0 comments
Labels
bug Something isn't working.

Comments

iliemihai commented Feb 1, 2022

Hello,

When I try to train on a v3-32 TPU with the tpu-vm-tf-2.6.0-pod image version, I receive the following error:

Creating heartbeat manager for ['/job:worker/replica:0/task:0/device:CPU:0', '/job:worker/replica:0/task:2/device:CPU:0', '/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:3/device:CPU:0']
Configuring worker heartbeat: shutdown_mode: WAIT_FOR_COORDINATOR

TPU device does not support heartbeats. Failure handling will be disabled.
training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
  File "/home/dumitrescu_stefan/dev/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/home/dumitrescu_stefan/dev/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/dumitrescu_stefan/dev/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.PermissionDeniedError: From /job:worker/replica:0/task:2:
/home/dumitrescu_stefan; Permission denied
	 [[{{node create_file_writer/CreateSummaryFileWriter}}]]
Recent warning and error logs:
  OP_REQUIRES failed at summary_kernels.cc:50 : Permission denied: /home/dumitrescu_stefan; Permission denied
  OP_REQUIRES failed at summary_kernels.cc:50 : Permission denied: /home/dumitrescu_stefan; Permission denied
  OP_REQUIRES failed at summary_kernels.cc:50 : Permission denied: /home/dumitrescu_stefan; Permission denied
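Note that the heartbeat message in the title is logged before the failure, while the exception actually raised is the PermissionDeniedError. As a reading aid, here is a minimal sketch that pulls the failing worker and the offending path out of an error message of that shape. The regular expression and the helper name `parse_permission_denied` are illustrative assumptions fitted to the log above; they are not part of TensorFlow or gpt-neo.

```python
import re

# Pattern fitted to the traceback quoted above (an assumption, not a TensorFlow format spec).
PERMISSION_DENIED_RE = re.compile(
    r"PermissionDeniedError: From (?P<task>/job:\S+?):\s*(?P<path>\S+); Permission denied"
)


def parse_permission_denied(log_text):
    """Return (task, path) for the first PermissionDeniedError in log_text, or None."""
    match = PERMISSION_DENIED_RE.search(log_text)
    if match is None:
        return None
    return match.group("task"), match.group("path")


if __name__ == "__main__":
    sample = (
        "tensorflow.python.framework.errors_impl.PermissionDeniedError: "
        "From /job:worker/replica:0/task:2:\n"
        "/home/dumitrescu_stefan; Permission denied\n"
    )
    print(parse_permission_denied(sample))
```

For the log above this points at worker task 2 and the path /home/dumitrescu_stefan, which is the parent of the `model_path` given in the configuration.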

To Reproduce
Run the training script:

  1. python3 main.py --model gpt3_XL_256_Pile --steps_per_checkpoint 40000 --tpu TPU_NAME

Environment (please complete the following information):

  • TPUs: v3-32 with tpu-vm-tf-2.6.0-pod image
  • Configs:
    {
      "n_head": 32,
      "n_vocab": 64000,
      "embed_dropout": 0,
      "lr": 0.0002,
      "lr_decay": "cosine",
      "warmup_steps": 3000,
      "beta1": 0.9,
      "beta2": 0.95,
      "epsilon": 1e-8,
      "opt_name": "adam",
      "weight_decay": 0.1,
      "train_batch_size": 512,
      "attn_dropout": 0,
      "train_steps": 286150,
      "eval_steps": 10,
      "predict_steps": 1,
      "res_dropout": 0,
      "eval_batch_size": 512,
      "predict_batch_size": 1,
      "iterations": 500,
      "n_embd": 2048,
      "datasets": [["example", 25, "documents_random", 1.0]],
      "model_path": "/home/dumitrescu_stefan/gpt-neo/neo-models/GPT3_1.3B",
      "n_ctx": 2048,
      "n_layer": 24,
      "scale_by_depth": true,
      "scale_by_in": false,
      "attention_types": [[["global"], 24]],
      "mesh_shape": "x:16,y:2",
      "layout": "batch:x,memory_length:y,embd:y",
      "activation_function": "gelu",
      "recompute_grad": true,
      "gradient_clipping": 1.0,
      "tokens_per_mb_per_replica": 2048,
      "precision": "bfloat16"
    }
    Dataset config is:
    { "n_vocab": 64000, "path": "/home/dumitrescu_stefan/gpt-neo/data_tfrecords/train_shard_*.tfrecords", "eval_path": "", "tokenizer_path": "/home/dumitrescu_stefan/gpt-neo/tokenizer/tokenizer.json", "eos_id": 1, "padding_id": 0 }
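Both configs point at paths under /home/dumitrescu_stefan, the same directory the summary writer on task 2 reports as "Permission denied". One possible reading, which is an assumption and not confirmed in this issue, is that the other pod workers cannot write to a directory that is local to the host where the script was launched. The sketch below lists every config value that looks like a local absolute path so it can be reviewed; the helper name `find_local_paths` is hypothetical and not part of gpt-neo.

```python
def find_local_paths(config, prefix=""):
    """Recursively collect (key, value) pairs whose string value starts with '/'.

    Assumption: a value starting with '/' is a local filesystem path, as opposed
    to a remote URI such as 'gs://...'.
    """
    hits = []
    if isinstance(config, dict):
        for key, value in config.items():
            hits += find_local_paths(value, f"{prefix}{key}.")
    elif isinstance(config, list):
        for index, value in enumerate(config):
            hits += find_local_paths(value, f"{prefix}{index}.")
    elif isinstance(config, str) and config.startswith("/"):
        hits.append((prefix.rstrip("."), config))
    return hits


if __name__ == "__main__":
    # Excerpt of the model config from this issue.
    model_config = {
        "model_path": "/home/dumitrescu_stefan/gpt-neo/neo-models/GPT3_1.3B",
        "mesh_shape": "x:16,y:2",
        "datasets": [["example", 25, "documents_random", 1.0]],
    }
    for key, path in find_local_paths(model_config):
        print(f"{key}: {path}")
```

Run against the two configs above, this would flag `model_path`, `path`, and `tokenizer_path`. Whether moving them to storage that every worker can reach resolves the error is untested here.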