Implement kv cache sparsity like H2O with attention score #30758

Open
HarryWu99 opened this issue May 11, 2024 · 2 comments
Labels: Cache, Feature request

Comments


HarryWu99 commented May 11, 2024

Feature request

Hello!

It is a bit like #26553, which implemented SinkCache. I would love to see a KV cache sparsity method like H2O implemented, as proposed in http://arxiv.org/abs/2405.04434.

The authors have released the code here: https://github.com/FMInference/H2O.

People can use it like:

from transformers import AutoModelForCausalLM, AutoTokenizer, H2O_Cache

cache = H2O_Cache(recent_length=512, HH_length=512)
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=3000, past_key_values=cache)

Motivation


From the H2O paper:

"Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores."
"... a KV cache eviction policy that dynamically retains a balance of recent and H2 (heavy hitter) tokens."

Your contribution

I would love to help implement this into transformers.

This means not only implementing an H2OCache in src/transformers/cache_utils.py, but also reordering some code in the LlamaAttention#forward function so that Cache#update can receive the attention scores, which other KV cache sparsity methods such as SnapKV (and future work) also need.
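To make the idea concrete, here is a rough, self-contained sketch of the eviction logic (the H2OCache class name, its constructor arguments, and the extra attn_scores argument to update are hypothetical, not an existing transformers API). Each cached token accumulates the attention mass it has received; when the cache exceeds its budget, the most recent tokens plus the highest-scoring heavy hitters are kept and everything else is evicted. The real method keeps per-head statistics; this sketch averages over batch and heads for brevity.

import torch

class H2OCache:
    """Rough sketch of H2O-style eviction; hypothetical, not the transformers Cache API."""

    def __init__(self, recent_length=512, hh_length=512):
        self.recent_length = recent_length  # always keep the most recent tokens
        self.hh_length = hh_length          # budget for "heavy hitter" tokens
        self.keys = None                    # [batch, heads, seq_len, head_dim]
        self.values = None
        self.acc_scores = None              # accumulated attention mass per cached token

    def update(self, key_states, value_states, attn_scores):
        # attn_scores: [batch, heads, q_len, kv_len] attention weights over the cache
        # after the new keys are appended -- the extra information this issue proposes
        # exposing from LlamaAttention.forward.
        if self.keys is None:
            self.keys, self.values = key_states, value_states
        else:
            self.keys = torch.cat([self.keys, key_states], dim=2)
            self.values = torch.cat([self.values, value_states], dim=2)

        new_scores = attn_scores.sum(dim=2)  # attention mass received per cached position
        if self.acc_scores is None:
            self.acc_scores = new_scores
        else:
            padded = torch.zeros_like(new_scores)
            padded[..., : self.acc_scores.shape[-1]] += self.acc_scores
            self.acc_scores = padded + new_scores

        self._evict()
        return self.keys, self.values

    def _evict(self):
        seq_len = self.keys.shape[2]
        if seq_len <= self.recent_length + self.hh_length:
            return
        # Keep the most recent tokens unconditionally; from the older ones, keep the
        # positions with the largest accumulated attention scores (the heavy hitters).
        history = seq_len - self.recent_length
        hh_scores = self.acc_scores[..., :history].mean(dim=(0, 1))
        hh_idx = hh_scores.topk(self.hh_length).indices.sort().values
        recent_idx = torch.arange(history, seq_len, device=self.keys.device)
        keep = torch.cat([hh_idx, recent_idx])
        self.keys = self.keys[:, :, keep]
        self.values = self.values[:, :, keep]
        self.acc_scores = self.acc_scores[..., keep]

With a cache shaped like this, the generate call in the example above would work unchanged once the attention layer forwards its weights to the cache.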

amyeroberts added the Feature request and Cache labels on May 13, 2024
amyeroberts (Collaborator) commented:

cc @ArthurZucker @gante

gante (Member) commented May 21, 2024

Hey @HarryWu99 👋

Techniques that improve memory utilization with LLMs are always exciting! At first glance, it seems like a good candidate to be added to transformers with the API you showcased in your example. Two additional points for consideration:

  1. Benchmarks need to be run before merging, to confirm the implementation is working as expected;
  2. You mentioned changes in LlamaAttention.forward, to use the attention scores. We may need a new function for that, like Cache.post_process(), and we may need to iterate on the design throughout the PR; a rough sketch of the idea follows below.
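A very rough sketch of what such a hook could look like (Cache.post_process, its signature, and the call order are all hypothetical, purely for discussion):

class Cache:
    def update(self, key_states, value_states, layer_idx):
        """Append the new key/value states and return the full cached tensors (as today)."""
        raise NotImplementedError

    def post_process(self, attn_weights, layer_idx):
        """Optional hook called after the attention weights are computed.

        Score-based eviction policies (H2O, SnapKV, ...) would accumulate
        per-token statistics and drop cache entries here; the default is a no-op.
        """
        pass

# Inside the attention forward pass, the call order would roughly become:
#   key_states, value_states = past_key_values.update(key_states, value_states, layer_idx)
#   attn_weights = softmax(query_states @ key_states.transpose(-2, -1) / sqrt(head_dim))
#   past_key_values.post_process(attn_weights, layer_idx)
#   attn_output = attn_weights @ value_states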

If you're happy with these two points, we'd be happy to take your PR and guide you in the process 🤗

(P.S. your first link is to the DeepSeek-V2 paper, I'm assuming you meant the H2O paper :) )
