deepspeed_optimize_model_gpu Qwen/Qwen-7B-Chat #10763

Open · kevin-t-tang opened this issue Apr 15, 2024 · 1 comment

@kevin-t-tang:
Running Qwen-7B-Chat with ipex-llm + DeepSpeed fails with the following error:
[0] RuntimeError: shape '[1, 1024, 16, 128]' is invalid for input of size 4194304
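A quick sanity check on the numbers (a back-of-envelope sketch, not code from the repro; Qwen-7B's hidden_size=4096, 32 attention heads, and head_dim=128 are taken from its config):

```python
# Back-of-envelope check of the failing view() in qwen_attention_forward_quantized.
bsz, q_len, head_dim = 1, 1024, 128
hidden_size = 4096                    # Qwen-7B hidden size
tp_size = 2                           # world_size=2 under DeepSpeed AutoTP (see log below)
num_heads_per_rank = 32 // tp_size    # 16, matching the failing shape [1, 1024, 16, 128]

actual = bsz * q_len * hidden_size                       # 4194304 -- the reported input size
expected = bsz * q_len * num_heads_per_rank * head_dim   # 2097152 -- what the view expects
assert actual == tp_size * expected
# The mismatch is a factor of exactly tp_size: num_heads appears to have been halved
# per rank while the q/k/v projection output was not sharded accordingly.
```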

accelerate 0.29.2
mpi4py 3.1.6
bigdl-core-xe-21 2.5.0b20240411
bigdl-core-xe-esimd-21 2.5.0b20240411
ipex-llm 2.1.0b20240411
transformers 4.37.2
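For context, here is a minimal sketch of the setup the log below corresponds to. This is assembled from the log, not the benchmark's actual code: the model path comes from the log, and the `mp_size`/`replace_method` arguments are inferred from the deprecation warnings it prints.

```python
# Minimal sketch of the setup implied by the log below; not the benchmark's real code.
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

model = AutoModelForCausalLM.from_pretrained(
    "/opt/WD/900-Model-Data/Model/Qwen-7B-Chat",
    torch_dtype=torch.float16, trust_remote_code=True)

# Shard across two ranks with AutoTP (the log shows world_size=2 plus
# deprecation warnings for the replace_method/mp_size parameters).
model = deepspeed.init_inference(model, mp_size=2, replace_method="auto",
                                 replace_with_kernel_inject=False)

# ipex-llm low-bit conversion ("Converting the current model to sym_int4" in
# the log), then move to the XPU device.
model = optimize_model(model.module, low_bit="sym_int4").to("xpu")
```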

:: initializing oneAPI environment ...
run-deepspeed-arc_dg2.sh: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for oneapi-vars.sh arguments: --force
:: advisor -- processing etc/advisor/vars.sh
:: ccl -- processing etc/ccl/vars.sh
:: compiler -- processing etc/compiler/vars.sh
:: dal -- processing etc/dal/vars.sh
:: debugger -- processing etc/debugger/vars.sh
:: dnnl -- processing etc/dnnl/vars.sh
:: dpct -- processing etc/dpct/vars.sh
:: dpl -- processing etc/dpl/vars.sh
:: ipp -- processing etc/ipp/vars.sh
:: ippcp -- processing etc/ippcp/vars.sh
:: mkl -- processing etc/mkl/vars.sh
:: mpi -- processing etc/mpi/vars.sh
:: tbb -- processing etc/tbb/vars.sh
:: vtune -- processing etc/vtune/vars.sh
:: oneAPI environment initialized ::

[0] /home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
[0] warnings.warn(
[1] /home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
[1] warnings.warn(
[1] 2024-04-15 13:41:37,444 - INFO - intel_extension_for_pytorch auto imported
[0] 2024-04-15 13:41:37,457 - INFO - intel_extension_for_pytorch auto imported
[1] [2024-04-15 13:41:37,865] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[0] [2024-04-15 13:41:37,881] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[1] [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[1] [WARNING] async_io: please install the libaio-dev package with apt
[1] [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[1] model_path: /opt/WD/900-Model-Data/Model/Qwen-7B-Chat
[1] [2024-04-15 13:41:38,252] [INFO] [real_accelerator.py:211:set_accelerator] Setting ds_accelerator to cpu (model specified)
[0] [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[0] [WARNING] async_io: please install the libaio-dev package with apt
[0] [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[0] model_path: /opt/WD/900-Model-Data/Model/Qwen-7B-Chat
[0] [2024-04-15 13:41:38,265] [INFO] [real_accelerator.py:211:set_accelerator] Setting ds_accelerator to cpu (model specified)
Loading checkpoint shards: 100%|██████████| 8/8 [10:07<00:00, 75.90s/it][0]
Loading checkpoint shards: 100%|██████████| 8/8 [10:07<00:00, 75.91s/it][1]
[1] [2024-04-15 13:51:46,932] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.1+ed8aed57, git-hash=ed8aed57, git-branch=HEAD
[0] [2024-04-15 13:51:46,932] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.1+ed8aed57, git-hash=ed8aed57, git-branch=HEAD
[1] [2024-04-15 13:51:46,932] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[1] [2024-04-15 13:51:46,932] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[1] [2024-04-15 13:51:46,933] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[0] [2024-04-15 13:51:46,933] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[0] [2024-04-15 13:51:46,933] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[0] [2024-04-15 13:51:46,933] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[1] Using /home/ubuntu/.cache/torch_extensions/py311_cpu as PyTorch extensions root...
[0] Using /home/ubuntu/.cache/torch_extensions/py311_cpu as PyTorch extensions root...
[1] Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cpu/deepspeed_ccl_comm/build.ninja...
[1] Building extension module deepspeed_ccl_comm...
[1] Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1] ninja: no work to do.
[1] Loading extension module deepspeed_ccl_comm...
[0] Loading extension module deepspeed_ccl_comm...
[0] Time to load deepspeed_ccl_comm op: 0.2021636962890625 seconds
[0] DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
[0] [2024-04-15 13:51:55,218] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[1] Time to load deepspeed_ccl_comm op: 0.1423795223236084 seconds
[1] DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
[1] [2024-04-15 13:51:55,218] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[1] [2024-04-15 13:51:55,218] [INFO] [comm.py:637:init_distributed] cdb=<deepspeed.comm.ccl.CCLBackend object at 0x796d96a8fb50>
[0] [2024-04-15 13:51:55,218] [INFO] [comm.py:637:init_distributed] cdb=<deepspeed.comm.ccl.CCLBackend object at 0x75eb9c38d4d0>
[1] [2024-04-15 13:51:55,218] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[0] [2024-04-15 13:51:55,218] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[0] [2024-04-15 13:51:55,709] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=2, master_addr=172.16.215.160, master_port=29500
[1] [2024-04-15 13:51:55,709] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=1, local_rank=1, world_size=2, master_addr=172.16.215.160, master_port=29500
[0] [2024-04-15 13:51:55,709] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[0] 2024-04-15 13:51:58,306 - INFO - Converting the current model to sym_int4 format......
[1] 2024-04-15 13:51:58,386 - INFO - Converting the current model to sym_int4 format......
[1] AutoTP: [(<class 'transformers_modules.Qwen-7B-Chat.modeling_qwen.QWenBlock'>, ['mlp.c_proj', 'attn.c_proj'])]
[1] >> loading of model costs 620.1323246199172s
[1] [2024-04-15 13:52:09,164] [INFO] [real_accelerator.py:211:set_accelerator] Setting ds_accelerator to xpu (model specified)
[0] AutoTP: [(<class 'transformers_modules.Qwen-7B-Chat.modeling_qwen.QWenBlock'>, ['attn.c_proj', 'mlp.c_proj'])]
[0] >> loading of model costs 620.0398639050545s
[0] [2024-04-15 13:52:09,836] [INFO] [real_accelerator.py:211:set_accelerator] Setting ds_accelerator to xpu (model specified)
[1] [2024-04-15 13:52:10,415] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[1] [2024-04-15 13:52:10,416] [INFO] [comm.py:637:init_distributed] cdb=None
[1] /home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: do_sample is set to False. However, top_p is set to 0.8 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
[1] warnings.warn(
[1] /home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:407: UserWarning: do_sample is set to False. However, top_k is set to 0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_k.
[1] warnings.warn(
[0] [2024-04-15 13:52:11,125] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[0] [2024-04-15 13:52:11,125] [INFO] [comm.py:637:init_distributed] cdb=None
[0] /home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: do_sample is set to False. However, top_p is set to 0.8 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
[0] warnings.warn(
[0] /home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:407: UserWarning: do_sample is set to False. However, top_k is set to 0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_k.
[0] warnings.warn(
[1] <class 'transformers_modules.Qwen-7B-Chat.modeling_qwen.QWenLMHeadModel'>
[1] Traceback (most recent call last):
[1] File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 1721, in
[1] run_model(model, api, in_out_pairs, conf['local_model_hub'], conf['warm_up'], conf['num_trials'], conf['num_beams'],
[1] File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 111, in run_model
[1] result = run_deepspeed_optimize_model_gpu(repo_id, local_model_hub, in_out_pairs, warm_up, num_trials, num_beams, low_bit, batch_size, cpu_embedding)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 1525, in run_deepspeed_optimize_model_gpu
[1] output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[1] return func(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^
[1] File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
[1] return self.greedy_search(
[1] ^^^^^^^^^^^^^^^^^^^
[1] File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
[1] outputs = self(
[1] ^^^^^
[1] File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in call
[1] return self.model(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[1] return self._call_impl(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[1] return forward_call(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 1043, in forward
[1] transformer_outputs = self.transformer(
[1] ^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[1] return self._call_impl(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[1] return forward_call(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/ipex_llm/transformers/models/qwen.py", line 715, in qwen_model_forward
[1] outputs = block(
[1] ^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[1] return self._call_impl(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[1] return forward_call(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 610, in forward
[1] attn_outputs = self.attn(
[1] ^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[1] return self._call_impl(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[1] return forward_call(*args, **kwargs)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/ipex_llm/transformers/models/qwen.py", line 109, in qwen_attention_forward
[1] return forward_function(
[1] ^^^^^^^^^^^^^^^^^
[1] File "/home/ubuntu/miniconda3/envs/llm-deepspeed/lib/python3.11/site-packages/ipex_llm/transformers/models/qwen.py", line 369, in qwen_attention_forward_quantized
[1] query = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1] RuntimeError: shape '[1, 1024, 16, 128]' is invalid for input of size 4194304
[0] (rank 0 prints an identical traceback, ending in the same RuntimeError: shape '[1, 1024, 16, 128]' is invalid for input of size 4194304)

@plusbang plusbang self-assigned this Apr 15, 2024
@plusbang plusbang mentioned this issue Apr 15, 2024

@plusbang (Contributor):
Hi @kevin-t-tang, I could reproduce this error with Qwen-7B-Chat and Qwen-14B-Chat using AutoTP, and PR #10766 should fix it. Please give it a try after the PR is merged into the nightly version.
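(Since ipex-llm nightlies are published as pre-releases, something like `pip install --pre --upgrade ipex-llm` should pull the newest build once the fix lands; the exact extras or index URL for the XPU variant depends on your install, so check the ipex-llm install docs.)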
