
TensorBoard files get deleted, Profiler returns 0 ms for communication time! #550

Open
orwa-te opened this issue Mar 11, 2021 · 2 comments

Comments

@orwa-te

orwa-te commented Mar 11, 2021

Environment:

  • Python version: 3.7.7
  • Spark version: 3.0.0
  • TensorFlow version: 2.3.0
  • TensorFlowOnSpark version: 2.2.2
  • Cluster: Standalone

Describe the bug:
I have two issues regarding TensorBoard when training my model on 2 worker nodes:

1. After the training process completes, the TensorBoard files are deleted immediately on worker 1, while they are kept on worker 0. (I can still use TensorBoard to check details while training is running.)
2. I am trying to profile batches 3 to 5 of the training run on the Profiler page, but I get 0 ms for communication time, specifically Device Collective Communication and Device to Device Time, even though the Average Step Time shows reasonable values like 19,368.9 ms (a sketch of this profiling configuration follows the screenshot below). From the Hosts drop-down list I can see only one detected host in the cluster, not 2. Why does this happen?

[Screenshot: TensorBoard Profiler overview page]
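For reference, a minimal sketch of how a profiling range like batches 3 to 5 is typically selected with the tf.keras TensorBoard callback; the log directory and callback wiring here are assumptions, not the actual contents of train_file.py:

```python
import tensorflow as tf

# Hypothetical log directory; the real script may write elsewhere.
log_dir = "/tmp/tb_logs"

# profile_batch="3,5" asks the Profiler to trace batches 3 through 5,
# matching the range described in the report above.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, profile_batch="3,5")
# ... passed to model.fit(..., callbacks=[tb_callback]) in the training function.
```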

Logs:
If applicable, add logs to help explain your problem. Note: errors may not be fully described in the driver/console logs. Make sure to check the executor logs for possible root causes.

Spark Submit Command Line:
spark-submit --master spark://master:7077 train_file.py --cluster_size 2 --epochs 1
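For context, train_file.py presumably wires up the cluster along the lines of the TFoS 2.x Keras examples. A hedged sketch, where main_fun and the argument parsing are assumptions rather than the actual script:

```python
import argparse
from pyspark import SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Per-executor training function: build the model, set up the
    # TensorBoard callback, and call model.fit() here.
    pass

if __name__ == "__main__":
    sc = SparkContext()
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster_size", type=int, default=2)
    parser.add_argument("--epochs", type=int, default=1)
    args = parser.parse_args()

    cluster = TFCluster.run(sc, main_fun, args, args.cluster_size,
                            num_ps=0,
                            tensorboard=True,   # hosts the built-in TB server on the chief
                            input_mode=TFCluster.InputMode.TENSORFLOW,
                            master_node='chief')
    cluster.shutdown()
```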

@leewyang
Contributor

  1. When using the "built-in" TensorBoard server in TFoS (triggered by supplying tensorboard=True), the TB server is hosted on the "chief" worker, so it has the same lifecycle as the "chief" worker; that is, it is killed when the Spark job completes. If you want visibility after the job completes, you can write the TB events to a shared/distributed filesystem and then spawn your own TB process pointing to that location (a minimal sketch follows this list).
  2. This sounds like more of a question for the TensorFlow team since TFoS has nothing to do with these metrics. Regardless, I'm assuming that your environment somehow isn't set up to capture this information. For example, I'm guessing that "Device Collective Communication Time" is referring to something like NCCL, which you may not have (enabled) in your setup.
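As a minimal sketch of option 1 above, assuming a Keras model and a shared mount visible to all workers (the /shared/tb_logs path is hypothetical; an HDFS URI also works if your TF build has HDFS support), write the events there and point your own TensorBoard at that location after the job finishes:

```python
import numpy as np
import tensorflow as tf

# Hypothetical shared location (NFS mount, HDFS URI, etc.) that outlives the executors.
log_dir = "/shared/tb_logs"

# Toy model and data just to make the sketch self-contained.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")

# Events written here survive after the chief worker (and its built-in TB server) exits.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
model.fit(x, y, epochs=1, callbacks=[tb_callback])

# After the Spark job completes, start a standalone TensorBoard:
#   tensorboard --logdir /shared/tb_logs
```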

@orwa-te
Author

orwa-te commented Mar 11, 2021

There are no GPUs in the cluster, so the worker nodes rely only on CPUs to process the data. As I understand from your answer, the Device Collective Communication time is limited to GPUs and NCCL. Is there any way to capture this value while using only CPUs?
