
LLM: update split tensor conditions. #10872

Merged

Conversation

lalalapotter
Contributor

@lalalapotter lalalapotter commented Apr 24, 2024

Description

This PR updates the split tensor conditions:

  1. enable splitting when `IPEX_LLM_SPLIT_QKV` is enabled.
  2. enable splitting when the intermediate attn_weights size exceeds the memory block limitation.
  3. update the quantized KV cache conditions.

@lalalapotter lalalapotter self-assigned this Apr 24, 2024
# split tensor for memory block limitation
# support fp16 and set input length threshold at 5000 for now
if query_layer.element_size()*bsz*n_head*seq_len*seq_len >= 4*1024*1024*1024:
    return True
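For context, the gating logic described in the PR could be sketched roughly as below. This is a minimal illustration, not ipex-llm's actual implementation: the function name, the `block_limit` parameter, and the env-var handling are assumptions; only the size check mirrors the line above.

```python
import os

def should_split_qkv_tensor(query_layer, bsz, n_head, seq_len,
                            block_limit=4 * 1024 * 1024 * 1024):
    # Condition 1 (assumed behavior): user forces splitting via env var.
    if os.environ.get("IPEX_LLM_SPLIT_QKV", "0") == "1":
        return True
    # Condition 2: split when the intermediate attn_weights tensor of shape
    # (bsz, n_head, seq_len, seq_len) would exceed the memory block limit.
    # element_size() is 2 for fp16, so this is just a few integer multiplies.
    if query_layer.element_size() * bsz * n_head * seq_len * seq_len >= block_limit:
        return True
    return False
```

With fp16 (`element_size() == 2`), a batch of 1, and 32 heads, the size condition trips once `seq_len` reaches 8192, since 2 * 1 * 32 * 8192 * 8192 equals exactly 4 GiB.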
Contributor

Will this check hurt performance? Calling `query_layer.element_size()` should be negligible, no?

Contributor Author

Seems not; the benchmark numbers are essentially identical.
Using the condition with element_size():
0,meta-llama/Llama-2-7b-chat-hf,4839.14,27.14,0.0,8192-512,1,8192-512,1,fp8,False,4.49,11.966796875,N/A,N/A
Using the condition with shape:
0,meta-llama/Llama-2-7b-chat-hf,4837.71,27.13,0.0,8192-512,1,8192-512,1,fp8,False,4.48,11.966796875,N/A,N/A

@lalalapotter lalalapotter merged commit 75dbf24 into intel-analytics:main Apr 30, 2024
16 of 18 checks passed