Get stuck at "Added broadcast_0_piece0 in memory on" while running on a Spark standalone cluster #580

Open · icszhr opened this issue Dec 27, 2021 · 1 comment

icszhr commented Dec 27, 2021

Environment:

  • Python version 3.6.9
  • Spark version 3.0.0
  • TensorFlow version 2.6.2
  • TensorFlowOnSpark version 2.2.4
  • Cluster version Standalone, Hadoop 3.1.3

Describe the bug:
Get stuck at "INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on" while runing Spark standalone cluster while Training MNIST using Keras.

Driver logs:
2021-12-27 10:51:01,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-12-27 10:51:03,012 INFO spark.SparkContext: Running Spark version 3.0.0
2021-12-27 10:51:03,044 INFO resource.ResourceUtils: ==============================================================
2021-12-27 10:51:03,045 INFO resource.ResourceUtils: Resources for spark.driver:

2021-12-27 10:51:03,045 INFO resource.ResourceUtils: ==============================================================
2021-12-27 10:51:03,045 INFO spark.SparkContext: Submitted application: mnist_keras
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls to: amax
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls to: amax
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls groups to:
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls groups to:
2021-12-27 10:51:03,081 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set()
2021-12-27 10:51:03,232 INFO util.Utils: Successfully started service 'sparkDriver' on port 45175.
2021-12-27 10:51:03,255 INFO spark.SparkEnv: Registering MapOutputTracker
2021-12-27 10:51:03,275 INFO spark.SparkEnv: Registering BlockManagerMaster
2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
2021-12-27 10:51:03,291 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat
2021-12-27 10:51:03,298 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1a9cd977-b478-4add-bb24-887f1cb3e057
2021-12-27 10:51:03,311 INFO memory.MemoryStore: MemoryStore started with capacity 14.2 GiB
2021-12-27 10:51:03,320 INFO spark.SparkEnv: Registering OutputCommitCoordinator
2021-12-27 10:51:03,401 INFO util.log: Logging initialized @2665ms to org.sparkproject.jetty.util.log.Slf4jLog
2021-12-27 10:51:03,443 INFO server.Server: jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_212-b10
2021-12-27 10:51:03,457 INFO server.Server: Started @2722ms
2021-12-27 10:51:03,474 INFO server.AbstractConnector: Started ServerConnector@1627cd3b{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2021-12-27 10:51:03,475 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
2021-12-27 10:51:03,495 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5064c620{/jobs,null,AVAILABLE,@spark}
2021-12-27 10:51:03,496 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a32f47a{/jobs/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,497 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3508d47e{/jobs/job,null,AVAILABLE,@spark}
2021-12-27 10:51:03,498 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7264cdd7{/jobs/job/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a3d89f5{/stages,null,AVAILABLE,@spark}
2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7da95106{/stages/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,500 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46687b37{/stages/stage,null,AVAILABLE,@spark}
2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@656b0411{/stages/stage/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@60ad7cfe{/stages/pool,null,AVAILABLE,@spark}
2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46482ae5{/stages/pool/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1ef317fc{/storage,null,AVAILABLE,@spark}
2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15492f0a{/storage/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@10ab3ccf{/storage/rdd,null,AVAILABLE,@spark}
2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@57093dec{/storage/rdd/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4debbe0{/environment,null,AVAILABLE,@spark}
2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@327dac75{/environment/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b91ba46{/executors,null,AVAILABLE,@spark}
2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c385c3{/executors/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@74bf8e9e{/executors/threadDump,null,AVAILABLE,@spark}
2021-12-27 10:51:03,506 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@232553eb{/executors/threadDump/json,null,AVAILABLE,@spark}
2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b24a17f{/static,null,AVAILABLE,@spark}
2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1c7efc23{/,null,AVAILABLE,@spark}
2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@eeaa0a8{/api,null,AVAILABLE,@spark}
2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1823b1f1{/jobs/job/kill,null,AVAILABLE,@spark}
2021-12-27 10:51:03,512 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c274337{/stages/stage/kill,null,AVAILABLE,@spark}
2021-12-27 10:51:03,513 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://amax11:4040
2021-12-27 10:51:03,654 INFO client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.11:7077...
2021-12-27 10:51:03,711 INFO client.TransportClientFactory: Successfully created connection to /192.168.1.11:7077 after 36 ms (0 ms spent in bootstraps)
2021-12-27 10:51:03,793 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20211227105103-0014
2021-12-27 10:51:03,794 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/0 on worker-20211226140545-192.168.1.11-43273 (192.168.1.11:43273) with 1 core(s)
2021-12-27 10:51:03,796 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/0 on hostPort 192.168.1.11:43273 with 1 core(s), 27.0 GiB RAM
2021-12-27 10:51:03,796 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/1 on worker-20211226140545-192.168.1.7-36031 (192.168.1.7:36031) with 1 core(s)
2021-12-27 10:51:03,797 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/1 on hostPort 192.168.1.7:36031 with 1 core(s), 27.0 GiB RAM
2021-12-27 10:51:03,797 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/2 on worker-20211226140545-192.168.1.5-36787 (192.168.1.5:36787) with 1 core(s)
2021-12-27 10:51:03,798 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/2 on hostPort 192.168.1.5:36787 with 1 core(s), 27.0 GiB RAM
2021-12-27 10:51:03,801 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42229.
2021-12-27 10:51:03,801 INFO netty.NettyBlockTransferService: Server created on amax11:42229
2021-12-27 10:51:03,803 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2021-12-27 10:51:03,812 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,815 INFO storage.BlockManagerMasterEndpoint: Registering block manager amax11:42229 with 14.2 GiB RAM, BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,818 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,819 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/0 is now RUNNING
2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/2 is now RUNNING
2021-12-27 10:51:03,823 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/1 is now RUNNING
2021-12-27 10:51:03,934 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ab79e0c{/metrics/json,null,AVAILABLE,@spark}
2021-12-27 10:51:04,320 INFO history.SingleEventLogFileWriter: Logging events to hdfs://amax11:8020/spark-sa-history/app-20211227105103-0014.inprogress
2021-12-27 10:51:04,521 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
args: Namespace(batch_size=64, cluster_size=3, epochs=3, export_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_export', images_labels='/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train', model_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_model', tensorboard=False)
2021-12-27 10:51:04,712 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KiB, free 14.2 GiB)
2021-12-27 10:51:04,759 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB)
2021-12-27 10:51:04,762 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on amax11:42229 (size: 27.4 KiB, free: 14.2 GiB)
2021-12-27 10:51:04,765 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:0
2021-12-27 10:51:04,916 INFO (MainThread-11873) Reserving TFSparkNodes
2021-12-27 10:51:04,917 INFO (MainThread-11873) cluster_template: {'chief': [0], 'worker': [1, 2]}
2021-12-27 10:51:04,919 INFO (MainThread-11873) Reservation server binding to port 0
2021-12-27 10:51:04,919 INFO (MainThread-11873) listening for reservations at ('192.168.1.11', 37823)
2021-12-27 10:51:04,919 INFO (MainThread-11873) Starting TensorFlow on executors
2021-12-27 10:51:04,924 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 27648, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2021-12-27 10:51:04,928 INFO (MainThread-11873) Waiting for TFSparkNodes to start
2021-12-27 10:51:04,928 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:04,983 INFO spark.SparkContext: Starting job: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327
2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Got job 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) with 3 output partitions
2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327)
2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Parents of final stage: List()
2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Missing parents: List()
2021-12-27 10:51:05,003 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327), which has no missing parents
2021-12-27 10:51:05,012 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB)
2021-12-27 10:51:05,014 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB)
2021-12-27 10:51:05,015 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on amax11:42229 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,015 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1200
2021-12-27 10:51:05,025 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) (first 15 tasks are for partitions Vector(0, 1, 2))
2021-12-27 10:51:05,026 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
2021-12-27 10:51:05,245 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.5:52680) with ID 2
2021-12-27 10:51:05,279 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.7:40504) with ID 1
2021-12-27 10:51:05,286 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.11:44508) with ID 0
2021-12-27 10:51:05,343 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.5:33393 with 14.2 GiB RAM, BlockManagerId(2, 192.168.1.5, 33393, None)
2021-12-27 10:51:05,361 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.7:38801 with 14.2 GiB RAM, BlockManagerId(1, 192.168.1.7, 38801, None)
2021-12-27 10:51:05,363 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:35421 with 14.2 GiB RAM, BlockManagerId(0, 192.168.1.11, 35421, None)
2021-12-27 10:51:05,391 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.5, executor 2, partition 0, PROCESS_LOCAL, 7337 bytes)
2021-12-27 10:51:05,398 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.7, executor 1, partition 1, PROCESS_LOCAL, 7337 bytes)
2021-12-27 10:51:05,399 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.11, executor 0, partition 2, PROCESS_LOCAL, 7337 bytes)
2021-12-27 10:51:05,559 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.7:38801 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,560 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.11:35421 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,566 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.5:33393 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,930 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:06,931 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:07,932 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:08,934 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:09,935 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:10,936 INFO (MainThread-11873) all reservations completed
2021-12-27 10:51:10,937 INFO (MainThread-11873) All TFSparkNodes started
2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'}
2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'}
2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'}
2021-12-27 10:51:10,937 INFO (MainThread-11873) Feeding training data
2021-12-27 10:51:11,023 INFO mapred.FileInputFormat: Total input files to process : 10
2021-12-27 10:51:11,061 INFO spark.SparkContext: Starting job: collect at PythonRDD.scala:168
2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Got job 1 (collect at PythonRDD.scala:168) with 30 output partitions
2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at PythonRDD.scala:168)
2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Parents of final stage: List()
2021-12-27 10:51:11,065 INFO scheduler.DAGScheduler: Missing parents: List()
2021-12-27 10:51:11,066 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53), which has no missing parents
2021-12-27 10:51:11,078 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB)
2021-12-27 10:51:11,079 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB)
2021-12-27 10:51:11,080 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on amax11:42229 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:11,080 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1200
2021-12-27 10:51:11,081 INFO scheduler.DAGScheduler: Submitting 30 missing tasks from ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2021-12-27 10:51:11,081 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 30 tasks
2021-12-27 10:51:15,330 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, 192.168.1.5, executor 2, partition 0, ANY, 7536 bytes)
2021-12-27 10:51:15,333 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 9956 ms on 192.168.1.5 (executor 2) (1/3)
2021-12-27 10:51:15,341 INFO python.PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 45655
2021-12-27 10:51:15,361 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.5:33393 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,387 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.5:33393 (size: 27.4 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, 192.168.1.11, executor 0, partition 1, ANY, 7536 bytes)
2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 10251 ms on 192.168.1.11 (executor 0) (2/3)
2021-12-27 10:51:15,691 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.11:35421 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, 192.168.1.7, executor 1, partition 2, ANY, 7536 bytes)
2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10314 ms on 192.168.1.7 (executor 1) (3/3)
2021-12-27 10:51:15,714 INFO scheduler.DAGScheduler: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) finished in 10.703 s
2021-12-27 10:51:15,714 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
2021-12-27 10:51:15,715 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.11:35421 (size: 27.4 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,719 INFO scheduler.DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
2021-12-27 10:51:15,719 INFO scheduler.TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
2021-12-27 10:51:15,721 INFO scheduler.DAGScheduler: Job 0 finished: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327, took 10.737965 s
2021-12-27 10:51:15,749 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.7:38801 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,779 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.7:38801 (size: 27.4 KiB, free: 14.2 GiB)

Executor logs (from the Spark UI):
Spark Executor Command: "/opt/module/jdk1.8.0_212/bin/java" "-cp" "/opt/module/spark-standalone/conf/:/opt/module/spark-standalone/jars/*" "-Xmx27648M" "-Dspark.driver.port=45175" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@amax11:45175" "--executor-id" "0" "--hostname" "192.168.1.11" "--cores" "1" "--app-id" "app-20211227105103-0014" "--worker-url" "spark://Worker@192.168.1.11:43273"

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/12/27 10:51:04 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 11957@amax11
21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for TERM
21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for HUP
21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for INT
21/12/27 10:51:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to:
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to:
21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set()
21/12/27 10:51:04 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 54 ms (0 ms spent in bootstraps)
21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to:
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to:
21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set()
21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 1 ms (0 ms spent in bootstraps)
21/12/27 10:51:05 INFO DiskBlockManager: Created local directory at /tmp/spark-66944227-f0bf-4829-b870-f5d85feae2cd/executor-2c04267d-7763-4e33-b90c-a6c1ead9f50b/blockmgr-6ac139bf-9f9c-496c-82de-a559a0202fad
21/12/27 10:51:05 INFO MemoryStore: MemoryStore started with capacity 14.2 GiB
21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@amax11:45175
21/12/27 10:51:05 INFO WorkerWatcher: Connecting to worker spark://Worker@192.168.1.11:43273
21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to /192.168.1.11:43273 after 1 ms (0 ms spent in bootstraps)
21/12/27 10:51:05 INFO WorkerWatcher: Successfully connected to spark://Worker@192.168.1.11:43273
21/12/27 10:51:05 INFO ResourceUtils: ==============================================================
21/12/27 10:51:05 INFO ResourceUtils: Resources for spark.executor:

21/12/27 10:51:05 INFO ResourceUtils: ==============================================================
21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/12/27 10:51:05 INFO Executor: Starting executor ID 0 on host 192.168.1.11
21/12/27 10:51:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35421.
21/12/27 10:51:05 INFO NettyBlockTransferService: Server created on 192.168.1.11:35421
21/12/27 10:51:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/12/27 10:51:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.168.1.11, 35421, None)
21/12/27 10:51:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.168.1.11, 35421, None)
21/12/27 10:51:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.168.1.11, 35421, None)
21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Got assigned task 2
21/12/27 10:51:05 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
21/12/27 10:51:05 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:42229 after 2 ms (0 ms spent in bootstraps)
21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB)
21/12/27 10:51:05 INFO TorrentBroadcast: Reading broadcast variable 1 took 90 ms
21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB)
2021-12-27 10:51:10,558 INFO (MainThread-12074) Available GPUs: ['0']
2021-12-27 10:51:10,559 INFO (MainThread-12074) Proposed GPUs: ['0']
2021-12-27 10:51:10,559 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0
2021-12-27 10:51:10,578 INFO (MainThread-12074) connected to server at ('192.168.1.11', 37823)
2021-12-27 10:51:10,578 INFO (MainThread-12074) TFSparkNode.reserve: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) export TF_CONFIG: {"cluster": {"chief": ["192.168.1.5:44093"], "worker": ["192.168.1.7:36945", "192.168.1.11:45615"]}, "task": {"type": "worker", "index": 1}, "environment": "cloud"}
2021-12-27 10:51:15,616 INFO (MainThread-12074) Available GPUs: ['0']
2021-12-27 10:51:15,616 INFO (MainThread-12074) Proposed GPUs: ['0']
2021-12-27 10:51:15,616 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0
2021-12-27 10:51:15,616 INFO (MainThread-12074) Starting TensorFlow worker:1 as worker on cluster node 2 on background process
21/12/27 10:51:15 INFO PythonRunner: Times: total = 9778, boot = 316, init = 1021, finish = 8441
2021-12-27 10:51:15,627 WARNING (MainThread-12152) From /home/amax/TOS/TensorFlowOnSpark/examples/mnist/keras/mnist_spark.py:11: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
21/12/27 10:51:15 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1549 bytes result sent to driver
21/12/27 10:51:15 INFO CoarseGrainedExecutorBackend: Got assigned task 4
21/12/27 10:51:15 INFO Executor: Running task 1.0 in stage 1.0 (TID 4)
21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
21/12/27 10:51:15 INFO TransportClientFactory: Successfully created connection to /192.168.1.5:33393 after 1 ms (0 ms spent in bootstraps)
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB)
21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 2 took 25 ms
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB)
21/12/27 10:51:15 INFO HadoopRDD: Input split: hdfs://amax11:8020/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train/part-00001:0+11232549
21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB)
21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 0 took 9 ms
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 426.4 KiB, free 14.2 GiB)
2021-12-27 10:51:16.542093: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-27 10:51:16.980995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6
2021-12-27 10:51:16.989052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:1/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6
2021-12-27 10:51:16.994323: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 192.168.1.5:44093}
2021-12-27 10:51:16.994344: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.1.7:36945, 1 -> 192.168.1.11:45615}
2021-12-27 10:51:16.995505: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://192.168.1.11:45615
2021-12-27 10:51:17,003 INFO (MainThread-12152) Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:1/device:GPU:0']
2021-12-27 10:51:17,104 INFO (MainThread-12152) Waiting for the cluster, timeout = inf
2021-12-27 10:51:17,194 INFO (MainThread-12169) Connected to TFSparkNode.mgr on 192.168.1.11, executor=2, state='running'
2021-12-27 10:51:17,201 INFO (MainThread-12169) mgr.state='running'
2021-12-27 10:51:17,201 INFO (MainThread-12169) Feeding partition <itertools.chain object at 0x7f8091c66d30> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f7fe8a757f0>
21/12/27 10:51:17 INFO PythonRunner: Times: total = 772, boot = -382, init = 438, finish = 716

Spark Submit Command Line:
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--executor-memory 27G \
${TFoS_HOME}/examples/mnist/keras/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images_labels ${TFoS_HOME}/data/mnist/csv/train \
--model_dir ${TFoS_HOME}/mnist_model \
--export_dir ${TFoS_HOME}/mnist_export
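
For reference, the shell variables in this command appear to resolve roughly as follows for the run above. These values are inferred from the driver log and the args Namespace, not copied from the actual launch script, so treat them as assumptions:

export MASTER=spark://192.168.1.11:7077            # from "Connecting to master ..." in the driver log
export TOTAL_CORES=3                               # three 1-core executors were granted
export CORES_PER_WORKER=1                          # matches task resources (cpus -> 1.0)
export SPARK_WORKER_INSTANCES=3                    # matches cluster_size=3 in the args Namespace
export TFoS_HOME=/home/amax/TOS/TensorFlowOnSpark  # matches the paths in the args Namespace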

leewyang (Contributor) commented Jan 3, 2022

My guess is that your model_dir needs to fully specify the HDFS path, e.g. hdfs://default/...

Note that the example uses spark-local mode (for simplicity).
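
For example, a minimal sketch of the same submit command with fully-qualified HDFS paths. The hdfs://amax11:8020 prefix is taken from the HadoopRDD input split in the executor log above; the exact model_dir and export_dir locations are assumptions mirroring the local paths:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--executor-memory 27G \
${TFoS_HOME}/examples/mnist/keras/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images_labels hdfs://amax11:8020/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train \
--model_dir hdfs://amax11:8020/home/amax/TOS/TensorFlowOnSpark/mnist_model \
--export_dir hdfs://amax11:8020/home/amax/TOS/TensorFlowOnSpark/mnist_export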
