[BigDL2.0] autoestimator_pytorch hdfs path can not save model on k8s #22

Open
Le-Zheng opened this issue Nov 4, 2021 · 2 comments

Le-Zheng commented Nov 4, 2021

http://10.112.231.51:18888/view/BigDL-2.0-NB/job/BigDL-NB-K8s-ExampleTests/152/console

(pid=244, ip=172.30.27.4) /opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/model/base_pytorch_model.py:180: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
(pid=244, ip=172.30.27.4)   return torch.from_numpy(inp)
(pid=244, ip=172.30.27.4)
  0%|          | 0/16 [00:00<?, ?it/s]/usr/local/envs/pytf1/lib/python3.7/site-packages/torch/autograd/__init__.py:132: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
(pid=244, ip=172.30.27.4)   allow_unreachable=True)  # allow_unreachable flag
(pid=244, ip=172.30.27.4)
Loss: 0.6922382116317749:   0%|          | 0/16 [00:00<?, ?it/s]
Loss: 0.4504893720149994:   6%|▋         | 1/16 [00:00<00:00, 50.22it/s]
(pid=244, ip=172.30.27.4)
Loss: 0.27864789962768555:  12%|█▎        | 2/16 [00:00<00:00, 82.55it/s]
Loss: 0.18915259838104248:  19%|█▉        | 3/16 [00:00<00:00, 106.19it/s]
Loss: 0.112899050116539:  25%|██▌       | 4/16 [00:00<00:00, 124.31it/s]  
Loss: 0.09547075629234314:  31%|███▏      | 5/16 [00:00<00:00, 138.47it/s]
Loss: 0.029641583561897278:  38%|███▊      | 6/16 [00:00<00:00, 150.55it/s]
Loss: 0.056755051016807556:  44%|████▍     | 7/16 [00:00<00:00, 160.61it/s]
Loss: 0.019430123269557953:  50%|█████     | 8/16 [00:00<00:00, 170.19it/s]
Loss: 0.002557608764618635:  56%|█████▋    | 9/16 [00:00<00:00, 178.60it/s]
Loss: 0.004579346626996994:  62%|██████▎   | 10/16 [00:00<00:00, 185.35it/s]
Loss: 0.0019340637372806668:  69%|██████▉   | 11/16 [00:00<00:00, 192.40it/s]
Loss: 0.00223898165859282:  75%|███████▌  | 12/16 [00:00<00:00, 198.61it/s]  
Loss: 0.005255652591586113:  81%|████████▏ | 13/16 [00:00<00:00, 200.80it/s]
Loss: 0.00018203322542831302:  88%|████████▊ | 14/16 [00:00<00:00, 206.26it/s]
Loss: 0.055765699595212936:  94%|█████████▍| 15/16 [00:00<00:00, 212.25it/s]  
Loss: 0.055765699595212936: 100%|██████████| 16/16 [00:00<00:00, 225.74it/s]
(pid=245, ip=172.30.27.4) /opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/model/base_pytorch_model.py:180: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
(pid=245, ip=172.30.27.4)   return torch.from_numpy(inp)
(pid=245, ip=172.30.27.4)
  0%|          | 0/16 [00:00<?, ?it/s]/usr/local/envs/pytf1/lib/python3.7/site-packages/torch/autograd/__init__.py:132: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
(pid=245, ip=172.30.27.4)   allow_unreachable=True)  # allow_unreachable flag
(pid=245, ip=172.30.27.4)
Loss: 0.6456587314605713:   0%|          | 0/16 [00:00<?, ?it/s]
(pid=244, ip=172.30.27.4) 2021-11-04 00:35:35,556	ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=244, ip=172.30.27.4) Traceback (most recent call last):
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=244, ip=172.30.27.4)     self._entrypoint()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=244, ip=172.30.27.4)     self._status_reporter.get_checkpoint())
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=244, ip=172.30.27.4)     output = fn()
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 325, in train_func
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
(pid=244, ip=172.30.27.4)     if remote_ckpt_basename not in get_remote_list(remote_dir):
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 46, in get_remote_list
(pid=244, ip=172.30.27.4)     s_output, _ = process(args)
(pid=244, ip=172.30.27.4) TypeError: cannot unpack non-iterable NoneType object
(pid=244, ip=172.30.27.4) Exception in thread Thread-2:
(pid=244, ip=172.30.27.4) Traceback (most recent call last):
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(pid=244, ip=172.30.27.4)     self.run()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=244, ip=172.30.27.4)     raise e
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=244, ip=172.30.27.4)     self._entrypoint()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=244, ip=172.30.27.4)     self._status_reporter.get_checkpoint())
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=244, ip=172.30.27.4)     output = fn()
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 325, in train_func
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
(pid=244, ip=172.30.27.4)     if remote_ckpt_basename not in get_remote_list(remote_dir):
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 46, in get_remote_list
(pid=244, ip=172.30.27.4)     s_output, _ = process(args)
(pid=244, ip=172.30.27.4) TypeError: cannot unpack non-iterable NoneType object
(pid=244, ip=172.30.27.4)
(pid=245, ip=172.30.27.4)
Loss: 0.4749995172023773:   6%|▋         | 1/16 [00:00<00:00, 48.86it/s]
Loss: 0.3644247055053711:  12%|█▎        | 2/16 [00:00<00:00, 81.42it/s]
Loss: 0.19700123369693756:  19%|█▉        | 3/16 [00:00<00:00, 105.65it/s]
Loss: 0.15083497762680054:  25%|██▌       | 4/16 [00:00<00:00, 123.93it/s]
Loss: 0.1125955805182457:  31%|███▏      | 5/16 [00:00<00:00, 138.76it/s] 
Loss: 0.07053384184837341:  38%|███▊      | 6/16 [00:00<00:00, 150.92it/s]
Loss: 0.04681260883808136:  44%|████▍     | 7/16 [00:00<00:00, 161.47it/s]
Loss: 0.02035798318684101:  50%|█████     | 8/16 [00:00<00:00, 170.66it/s]
Loss: 0.012909774668514729:  56%|█████▋    | 9/16 [00:00<00:00, 178.95it/s]
Loss: 0.0078040556982159615:  62%|██████▎   | 10/16 [00:00<00:00, 186.17it/s]
Loss: 0.04752806946635246:  69%|██████▉   | 11/16 [00:00<00:00, 192.78it/s]  
Loss: 0.019220085814595222:  75%|███████▌  | 12/16 [00:00<00:00, 198.82it/s]
Loss: 0.010350744239985943:  81%|████████▏ | 13/16 [00:00<00:00, 200.81it/s]
Loss: 0.0005109629710204899:  88%|████████▊ | 14/16 [00:00<00:00, 206.25it/s]
(pid=244, ip=172.30.27.4)
(pid=244, ip=172.30.27.4) /bin/sh: hdfs: command not found
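
Reading the log: the final line (`/bin/sh: hdfs: command not found`) suggests the `hdfs` CLI is not on the worker's PATH, and `s_output, _ = process(args)` fails because `process(args)` returned `None`. Below is a minimal sketch of a guard that would turn this into a clearer error. It is not the BigDL implementation: `run_command` and `list_remote_dir` are hypothetical names, and the parsing assumes the usual eight-column `hdfs dfs -ls` output with no spaces in paths.

```python
import shutil
import subprocess


def run_command(args):
    """Run a command and always return (stdout, stderr) as text.

    Hypothetical helper: raises instead of returning None, so callers
    can safely unpack the result.
    """
    if shutil.which(args[0]) is None:
        raise FileNotFoundError(
            f"'{args[0]}' was not found on PATH; "
            "remote checkpointing needs it on every worker")
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{' '.join(args)} failed: {result.stderr.strip()}")
    return result.stdout, result.stderr


def list_remote_dir(remote_dir):
    """Hypothetical stand-in for get_remote_list: basenames under remote_dir."""
    stdout, _ = run_command(["hdfs", "dfs", "-ls", remote_dir])
    names = []
    for line in stdout.splitlines():
        parts = line.split()
        # Skip the "Found N items" header; entry lines are assumed
        # to have eight whitespace-separated fields ending in the path.
        if len(parts) >= 8:
            names.append(parts[-1].rsplit("/", 1)[-1])
    return names
```

With a guard like this, a worker image without the Hadoop client would fail with a `FileNotFoundError` naming the missing `hdfs` binary rather than the `TypeError` above.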

Le-Zheng commented Nov 4, 2021

@yushan111

@shanyu-sys

AutoEstimator currently supports distributed mode only on clusters with HDFS, so it doesn't support K8s for now.
