(raylet) socket.gaierror: [Errno -2] Name or service not known #8

Open
xunaichao opened this issue May 31, 2022 · 9 comments

Comments

@xunaichao

When I run the TensorFlow 2 example from https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html, I get the following error.
############
Error:
(raylet) Traceback (most recent call last):
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 334, in <module>
(raylet) raise e
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 323, in <module>
(raylet) loop.run_until_complete(agent.run())
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/asyncio/base_events.py", line 568, in run_until_complete
(raylet) return future.result()
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 138, in run
(raylet) modules = self._load_modules()
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
(raylet) c = cls(self)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
(raylet) self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/metrics_agent.py", line 76, in __init__
(raylet) namespace="ray", port=metrics_export_port)))
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
(raylet) options=option, gatherer=option.registry, collector=collector)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 266, in __init__
(raylet) self.serve_http()
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 321, in serve_http
(raylet) port=self.options.port, addr=str(self.options.address))
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet) TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet) infos = socket.getaddrinfo(address, port)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/socket.py", line 753, in getaddrinfo
(raylet) for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
##############

Hosts file: [screenshot]

After running the example, session files are generated under /tmp/ray/ on the system: [screenshot]

Runtime environment: a Docker container, with Analytics Zoo (AZ) and Ray installed through Miniconda:

conda create -n zoo python=3.7
conda activate zoo
pip install --pre --upgrade analytics-zoo
pip install analytics-zoo[ray]
pip install tensorflow==2.3.0

conda list

Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.0.0 pypi_0 pypi
aiohttp 3.7.0 pypi_0 pypi
aiohttp-cors 0.7.0 pypi_0 pypi
aioredis 1.1.0 pypi_0 pypi
analytics-zoo 0.12.0b2022052501 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
async-timeout 3.0.1 pypi_0 pypi
attrs 21.4.0 pypi_0 pypi
bigdl 0.13.1.dev1 pypi_0 pypi
blessings 1.7 pypi_0 pypi
ca-certificates 2022.4.26 h06a4308_0
cachetools 5.1.0 pypi_0 pypi
certifi 2022.5.18.1 py37h06a4308_0
chardet 3.0.4 pypi_0 pypi
charset-normalizer 2.0.12 pypi_0 pypi
click 8.1.3 pypi_0 pypi
colorama 0.4.4 pypi_0 pypi
colorful 0.5.4 pypi_0 pypi
conda-pack 0.3.1 pypi_0 pypi
deprecated 1.2.13 pypi_0 pypi
filelock 3.7.0 pypi_0 pypi
gast 0.3.3 pypi_0 pypi
google-api-core 2.8.0 pypi_0 pypi
google-auth 2.6.6 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
googleapis-common-protos 1.56.1 pypi_0 pypi
gpustat 0.6.0 pypi_0 pypi
grpcio 1.46.3 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
hiredis 1.1.0 pypi_0 pypi
idna 3.3 pypi_0 pypi
importlib-metadata 4.11.4 pypi_0 pypi
importlib-resources 5.7.1 pypi_0 pypi
jsonschema 4.5.1 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
libedit 3.1.20210910 h7f8727e_0
libffi 3.2.1 hf484d3e_1007
libgcc-ng 11.2.0 h1234567_0
libgomp 11.2.0 h1234567_0
libstdcxx-ng 11.2.0 h1234567_0
markdown 3.3.7 pypi_0 pypi
msgpack 1.0.3 pypi_0 pypi
multidict 6.0.2 pypi_0 pypi
ncurses 6.3 h7f8727e_2
numpy 1.18.5 pypi_0 pypi
nvidia-ml-py3 7.352.0 pypi_0 pypi
oauthlib 3.2.0 pypi_0 pypi
opencensus 0.9.0 pypi_0 pypi
opencensus-context 0.1.2 pypi_0 pypi
opencv-python 4.5.5.64 pypi_0 pypi
openssl 1.0.2u h7b6447c_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 21.3 pypi_0 pypi
pip 21.2.2 py37h06a4308_0
prometheus-client 0.14.1 pypi_0 pypi
protobuf 3.20.1 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
py-spy 0.3.12 pypi_0 pypi
py4j 0.10.7 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.18.1 pypi_0 pypi
pyspark 2.4.6 pypi_0 pypi
python 3.7.0 h6e4f718_3
pyyaml 6.0 pypi_0 pypi
ray 1.2.0 pypi_0 pypi
readline 7.0 h7b6447c_5
redis 4.1.4 pypi_0 pypi
requests 2.27.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rsa 4.8 pypi_0 pypi
scipy 1.4.1 pypi_0 pypi
setproctitle 1.2.3 pypi_0 pypi
setuptools 61.2.0 py37h06a4308_0
six 1.16.0 pypi_0 pypi
sqlite 3.33.0 h62c20be_0
tensorboard 2.9.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tensorflow 2.3.0 pypi_0 pypi
tensorflow-estimator 2.3.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tk 8.6.11 h1ccaba5_1
typing-extensions 4.2.0 pypi_0 pypi
urllib3 1.26.9 pypi_0 pypi
werkzeug 2.1.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
wrapt 1.14.1 pypi_0 pypi
xz 5.2.5 h7f8727e_1
yarl 1.7.2 pypi_0 pypi
zipp 3.8.0 pypi_0 pypi
zlib 1.2.12 h7f8727e_2

————————————————————
1. Check Python:
from zoo.util.utils import detect_python_location
detect_python_location()
[screenshot]

2. Check the Ray installation:
/usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --head --include-dashboard true --dashboard-host 172.27.0.2 --port 35413 --redis-password 123456 --num-cpus 1
[screenshot]

/usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --address 172.27.0.2:35413 --redis-password 123456 --num-cpus 1
[screenshot]

ray start --address='172.27.0.2:35413' --redis-password='0'

[screenshot]

Related documents.zip

@xunaichao
Author

Please help solve it. Thank you

I'm going crazy

@hkvision
Contributor

hkvision commented Jun 1, 2022

Hi @xunaichao

I checked the code and ran it on Google Colab, and I get this error as well. However, the error does not seem to affect or interrupt the run; you can find the training and evaluation results in your log. The error appears to come from the Ray dashboard, and I am not sure whether it is caused by the outdated Ray version.

As mentioned above, we strongly recommend switching to the latest version of BigDL. I ran the same example with BigDL on Google Colab and there was no such error: https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html
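
For anyone who wants to narrow down the `socket.gaierror`: the traceback ends in `socket.getaddrinfo(address, port)`, so the address handed to the metrics exporter could not be resolved inside the container. Below is a minimal sketch for checking name resolution from Python. The helper name `can_resolve` is made up for illustration, and checking the container's own hostname is only an assumption about which name is failing.

```python
import socket


def can_resolve(host, port=0):
    """Return True if getaddrinfo can resolve host, False on socket.gaierror."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Assumption: the loopback address and this machine's hostname are the
    # names worth checking; replace them with whatever your setup uses.
    for name in ("127.0.0.1", socket.gethostname()):
        status = "resolves" if can_resolve(name) else "does NOT resolve"
        print(name, status)
```

If the hostname does not resolve, an entry for it in the hosts file is one thing to look at.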

@xunaichao
Author

xunaichao commented Jun 1, 2022

@jason-dai @hkvision
Thanks for your response. I have followed the instructions you gave: https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html
I am now running my yolov3.py and get an exception.

run logs:

2022-06-01 10:01:10.069315: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-06-01 10:01:10.074183: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-06-01 10:01:10.074198: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Initializing orca context
Current pyspark location is : /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is: --driver-class-path /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.1.0-20220314.094552-2.jar:/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.0.0-jar-with-dependencies.jar:/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.0.0-jar-with-dependencies.jar pyspark-shell
2022-06-01 10:01:13 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-06-01 10:01:14,896 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,898 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,899 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,899 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-06-01 10:01:14 [Thread-4] INFO Engine$:121 - Auto detect executor number and executor cores number
22-06-01 10:01:14 [Thread-4] INFO Engine$:123 - Executor number is 1 and executor cores number is 4

User settings:

KMP_AFFINITY=granularity=fine,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=1

Effective settings:

KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=416
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_MWAIT_HINTS=0
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=8M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=104
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_USE_YIELD=1
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEBUG=disabled
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='1'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=8M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='noverbose,warnings,respect,granularity=fine,compact,1,0'

22-06-01 10:01:15 [Thread-4] INFO ThreadPool$:95 - Set mkl threads to 1 on thread 30
2022-06-01 10:01:15 WARN SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-06-01 10:01:15 [Thread-4] INFO Engine$:446 - Find existing spark context. Checking the spark conf...
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Successfully got a SparkContext
2022-06-01 10:01:18,220 INFO services.py:1340 -- View the Ray dashboard at http://172.27.0.2:8265
2022-06-01 10:01:18,225 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
{'node_ip_address': '172.27.0.2', 'raylet_ip_address': '172.27.0.2', 'redis_address': '172.27.0.2:15812', 'object_store_address': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868/sockets/raylet', 'webui_url': '172.27.0.2:8265', 'session_dir': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868', 'metrics_export_port': 47074, 'node_id': 'a6dd76c71c04c32df5e009bc951165e1b0e85486a8a75d23fb5ab9ed'}
(Worker pid=1704437) 2022-06-01 10:01:19.629608: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
(Worker pid=1704437) 2022-06-01 10:01:19.634737: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/cv2/../../lib64:
(Worker pid=1704437) 2022-06-01 10:01:19.634753: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=1704437) WARNING:tensorflow:From /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/tf_runner.py:317: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
(Worker pid=1704437) Instructions for updating:
(Worker pid=1704437) use distribute.MultiWorkerMirroredStrategy instead
(Worker pid=1704437) 2022-06-01 10:01:21.270040: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/cv2/../../lib64:
(Worker pid=1704437) 2022-06-01 10:01:21.270095: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=1704437) 2022-06-01 10:01:21.270135: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (816d2073a24f): /proc/driver/nvidia/version does not exist
(Worker pid=1704437) 2022-06-01 10:01:21.271364: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
(Worker pid=1704437) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=1704437) 2022-06-01 10:01:21.297690: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 172.27.0.2:53169}
(Worker pid=1704437) 2022-06-01 10:01:21.297883: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 172.27.0.2:53169}
(Worker pid=1704437) 2022-06-01 10:01:21.299556: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://172.27.0.2:53169
(raylet) /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/dashboard/agent.py:152: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet) if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
(raylet) /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/dashboard/agent.py:152: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet) if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
Traceback (most recent call last):
File "yolov3.py", line 656, in <module>
main()
File "yolov3.py", line 643, in main
trainer = Estimator.from_keras(model_creator=model_creator)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/estimator.py", line 69, in from_keras
cpu_binding=cpu_binding)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/ray_estimator.py", line 96, in __init__
for i, worker in enumerate(self.remote_workers)])
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(Worker pid=1704437) 2022-06-01 10:01:27.086318: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ./yolov3/yolov3.weights: DATA_LOSS: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
return func(*args, **kwargs)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::Worker.setup_distributed() (pid=1704437, ip=172.27.0.2, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7faab3e7fcd0>)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/tf_runner.py", line 321, in setup_distributed
self.model = self.model_creator(self.config)
File "yolov3.py", line 571, in model_creator
model_pretrained.load_weights(options.weights)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/h5py/_hl/files.py", line 394, in __init__
swmr=swmr)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/h5py/_hl/files.py", line 170, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (file signature not found)
Stopping orca context

The code I used is attached here:
yolov3.py.zip

conda list:

Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.0.0 pypi_0 pypi
aiohttp 3.8.1 pypi_0 pypi
aiohttp-cors 0.7.0 pypi_0 pypi
aioredis 1.3.1 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
anyio 3.6.1 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
async-timeout 4.0.1 pypi_0 pypi
asynctest 0.13.0 pypi_0 pypi
attrs 21.4.0 pypi_0 pypi
bigdl 2.1.0b202205302 pypi_0 pypi
bigdl-chronos 2.1.0b202205302 pypi_0 pypi
bigdl-core 2.1.0b20220321 pypi_0 pypi
bigdl-dllib 2.1.0b202205302 pypi_0 pypi
bigdl-friesian 2.1.0b202205302 pypi_0 pypi
bigdl-math 0.14.0.dev1 pypi_0 pypi
bigdl-nano 2.1.0b202205302 pypi_0 pypi
bigdl-orca 2.1.0b202205302 pypi_0 pypi
bigdl-serving 2.1.0b202205302 pypi_0 pypi
bigdl-tf 0.14.0.dev1 pypi_0 pypi
blessed 1.19.1 pypi_0 pypi
ca-certificates 2022.4.26 h06a4308_0
cachetools 5.2.0 pypi_0 pypi
certifi 2022.5.18.1 py37h06a4308_0
chardet 3.0.4 pypi_0 pypi
charset-normalizer 2.0.12 pypi_0 pypi
click 8.1.3 pypi_0 pypi
cloudpickle 2.1.0 pypi_0 pypi
colorful 0.5.4 pypi_0 pypi
conda-pack 0.3.1 pypi_0 pypi
deprecated 1.2.13 pypi_0 pypi
filelock 3.7.1 pypi_0 pypi
flatbuffers 1.12 pypi_0 pypi
frozenlist 1.3.0 pypi_0 pypi
fsspec 2022.5.0 pypi_0 pypi
future 0.18.2 pypi_0 pypi
gast 0.4.0 pypi_0 pypi
google-api-core 2.8.1 pypi_0 pypi
google-auth 2.6.6 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
googleapis-common-protos 1.56.2 pypi_0 pypi
gpustat 1.0.0b1 pypi_0 pypi
grpcio 1.46.3 pypi_0 pypi
h11 0.12.0 pypi_0 pypi
h5py 3.7.0 pypi_0 pypi
hiredis 2.0.0 pypi_0 pypi
httpcore 0.13.7 pypi_0 pypi
httpx 1.0.0b0 pypi_0 pypi
idna 3.3 pypi_0 pypi
importlib-metadata 4.11.4 pypi_0 pypi
importlib-resources 5.7.1 pypi_0 pypi
intel-openmp 2022.1.0 pypi_0 pypi
joblib 1.1.0 pypi_0 pypi
jsonschema 4.5.1 pypi_0 pypi
kafka-python 2.0.2 pypi_0 pypi
keras 2.9.0 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libclang 14.0.1 pypi_0 pypi
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_0
libgomp 11.2.0 h1234567_0
libstdcxx-ng 11.2.0 h1234567_0
markdown 3.3.7 pypi_0 pypi
msgpack 1.0.3 pypi_0 pypi
multidict 4.7.6 pypi_0 pypi
ncurses 6.3 h7f8727e_2
numpy 1.21.6 pypi_0 pypi
nvidia-ml-py3 7.352.0 pypi_0 pypi
oauthlib 3.2.0 pypi_0 pypi
onnx 1.11.0 pypi_0 pypi
onnxruntime 1.11.1 pypi_0 pypi
opencensus 0.9.0 pypi_0 pypi
opencensus-context 0.1.2 pypi_0 pypi
opencv-python 4.5.5.64 pypi_0 pypi
opencv-python-headless 4.5.5.64 pypi_0 pypi
opencv-transforms 0.0.6 pypi_0 pypi
openssl 1.1.1o h7f8727e_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 21.3 pypi_0 pypi
pandas 1.2.5 pypi_0 pypi
patsy 0.5.2 pypi_0 pypi
pillow 9.1.1 pypi_0 pypi
pip 21.2.2 py37h06a4308_0
prometheus-client 0.14.1 pypi_0 pypi
protobuf 3.19.4 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
py-spy 0.3.12 pypi_0 pypi
py4j 0.10.7 pypi_0 pypi
pyarrow 8.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pydeprecate 0.3.1 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.18.1 pypi_0 pypi
pyspark 2.4.6 pypi_0 pypi
python 3.7.13 h12debd9_0
python-dateutil 2.8.2 pypi_0 pypi
pytorch-lightning 1.4.2 pypi_0 pypi
pyturbojpeg 1.6.6 pypi_0 pypi
pytz 2022.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 23.0.0 pypi_0 pypi
ray 1.9.2 pypi_0 pypi
readline 8.1.2 h7f8727e_1
redis 4.1.4 pypi_0 pypi
requests 2.27.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rfc3986 1.5.0 pypi_0 pypi
rsa 4.8 pypi_0 pypi
scikit-learn 1.0.2 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
setproctitle 1.2.3 pypi_0 pypi
setuptools 61.2.0 py37h06a4308_0
six 1.16.0 pypi_0 pypi
smart-open 6.0.0 pypi_0 pypi
sniffio 1.2.0 pypi_0 pypi
sqlite 3.38.3 hc218d9a_0
statsmodels 0.13.2 pypi_0 pypi
tensorboard 2.9.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tensorflow 2.9.1 pypi_0 pypi
tensorflow-estimator 2.9.0 pypi_0 pypi
tensorflow-io-gcs-filesystem 0.26.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
threadpoolctl 3.1.0 pypi_0 pypi
tk 8.6.11 h1ccaba5_1
torch 1.9.0 pypi_0 pypi
torchmetrics 0.7.2 pypi_0 pypi
torchvision 0.10.0 pypi_0 pypi
tqdm 4.64.0 pypi_0 pypi
typing-extensions 4.2.0 pypi_0 pypi
urllib3 1.26.9 pypi_0 pypi
wcwidth 0.2.5 pypi_0 pypi
werkzeug 2.1.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
wrapt 1.14.1 pypi_0 pypi
xz 5.2.5 h7f8727e_1
yarl 1.7.2 pypi_0 pypi
zipp 3.8.0 pypi_0 pypi
zlib 1.2.12 h7f8727e_2

Thank you for your help!

@shanyu-sys
Contributor

It seems you may be loading weights in the wrong format:

./yolov3/yolov3.weights: DATA_LOSS: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

You may need to convert the pre-trained Darknet weights first, as is done in the YOLOv3 example.

You can always refer to our YOLOv3 example in BigDL. Hope that helps.
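
In the log above, both the TensorFlow checkpoint reader ("not an sstable") and h5py ("file signature not found") reject `./yolov3/yolov3.weights`, which is what you would expect from raw Darknet weights. A quick way to confirm this before calling `load_weights` is to look at the first bytes of the file. The sketch below is a simplified check with a made-up helper name (`looks_like_hdf5`); it only tests for the HDF5 signature at offset 0, which is where it normally sits, although the format also allows later offsets.

```python
# The 8-byte signature that begins an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

If `looks_like_hdf5("./yolov3/yolov3.weights")` returns False, the file is not Keras HDF5 weights and needs to be converted first.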

@shanyu-sys
Contributor

May I ask whether you hit the same error with your plain TensorFlow code (without BigDL), i.e. in your tflocal mode?

@xunaichao
Author

We used the example at https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html to save the model, and changed the save step to:
[screenshot]
We now get the .pb file successfully, but hit an exception when using the OpenVINO Model Optimizer to convert the model to IR format. The error looks like this:

Model Optimizer arguments:
Common parameters:
- Path to the Input Model: /az/test1/saved_model.pb
- Path for generated IR: /opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/.
- IR output name: saved_model
- Log level: ERROR
- Batch: Not specified, inherited from the model
- Input layers: Not specified, inherited from the model
- Output layers: Not specified, inherited from the model
- Input shapes: [1,120,120,3]
- Mean values: Not specified
- Scale values: Not specified
- Scale factor: Not specified
- Precision of IR: FP32
- Enable fusing: True
- Enable grouped convolutions fusing: True
- Move mean values to preprocess section: None
- Reverse input channels: False
TensorFlow specific parameters:
- Input model in text protobuf format: False
- Path to model dump for TensorBoard: None
- List of shared libraries with TensorFlow custom layers implementation: None
- Update the configuration file with input/output node names: None
- Use configuration file used to generate the model with Object Detection API: None
- Use the config file: None
- Inference Engine found in: /opt/intel/openvino_2021.4.752/python/python3.6/openvino
Inference Engine version: 2021.4.2-3974-e2a469a3450-releases/2021/4
Model Optimizer version: 2021.4.2-3974-e2a469a3450-releases/2021/4
2022-06-02 09:03:05.880400: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/mo/utils/../../../inference_engine/lib/intel64:/opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/mo/utils/../../../inference_engine/external/tbb/lib:/opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/mo/utils/../../../ngraph/lib
2022-06-02 09:03:05.880451: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py:22: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
[ FRAMEWORK ERROR ] Cannot load input model: TensorFlow cannot read the model file: "/az/test1/saved_model.pb" is incorrect TensorFlow model file.
The file should contain one of the following TensorFlow graphs:

  1. frozen graph in text or binary format
  2. inference graph for freezing with checkpoint (--input_checkpoint) in text or binary format
  3. meta graph

Make sure that --input_model_is_text is provided for a model in text format. By default, a model is interpreted in binary format. Framework error details: Error parsing message.
For more information please refer to Model Optimizer FAQ, question #43. (https://docs.openvinotoolkit.org/latest/openvino_docs_MO_DG_prepare_model_Model_Optimizer_FAQ.html?question=43#question-43)
Can you help us? Thank you very much!
@yushan111 Thank you for the example, it helps a lot!

@shanyu-sys
Contributor

You will get a tf.keras model with est.get_model(), and you can then save it with the tf.saved_model API.

After that, how you use your TensorFlow model is up to you.

As for using OpenVINO to convert your TensorFlow model, you may want to open an issue in the OpenVINO project.
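
One thing that may be worth checking first: the tf.saved_model API writes a SavedModel directory (a `saved_model.pb` plus a `variables/` folder), and that `saved_model.pb` is not a frozen graph, which is what the error message asks for. As far as I know, Model Optimizer takes a SavedModel through its `--saved_model_dir` option rather than as an input model file, but please verify that against the OpenVINO documentation for your version. A small sketch, with a made-up helper name, for checking what a directory such as `/az/test1` actually contains:

```python
import os


def is_saved_model_dir(path):
    """Return True if path has the basic layout of a TensorFlow SavedModel."""
    has_graph = any(
        os.path.isfile(os.path.join(path, name))
        for name in ("saved_model.pb", "saved_model.pbtxt")
    )
    has_variables = os.path.isdir(os.path.join(path, "variables"))
    return has_graph and has_variables
```

If `is_saved_model_dir("/az/test1")` returns True, the `.pb` file is part of a SavedModel and should be handled as a directory, not as a standalone frozen graph.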

@xunaichao
Author

@yushan111 thanks for your help

@liu-shaojun liu-shaojun transferred this issue from intel-analytics/BigDL-2.x Mar 5, 2024