Implement kv cache sparsity like H2O with attention score #30758

Open
HarryWu99 opened this issue May 11, 2024 · 2 comments
Labels: Cache, Feature request

Comments


HarryWu99 commented May 11, 2024

Feature request

Hello!

It is a bit like #26553, which implemented SinkCache. I would love to see a KV cache sparsity method like H2O implemented, as proposed in http://arxiv.org/abs/2405.04434.

The authors have released the code here: https://github.com/FMInference/H2O.

People can use it like:

from transformers import AutoModelForCausalLM, AutoTokenizer, H2O_Cache

cache = H2O_Cache(recent_length=512, HH_length=512)
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=3000, past_key_values=cache)

Motivation


From the H2O paper:

"Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores."
"... a KV cache eviction policy that dynamically retains a balance of recent and H2 (heavy hitter) tokens."

Your contribution

I would love to help implement this into transformers.

This means not only implementing an H2OCache in src/transformers/cache_utils.py, but also reordering some code in the LlamaAttention#forward function so that Cache#update can receive the attention scores, which other KV cache sparsity methods such as SnapKV (and future work) also need.
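To make the idea concrete, here is a rough, self-contained sketch of the eviction logic (the H2OCache class name, its constructor arguments, and the extra attn_scores argument to update are hypothetical, not an existing transformers API). Each cached token accumulates the attention mass it has received; when the cache exceeds its budget, the most recent tokens plus the highest-scoring heavy hitters are kept and everything else is evicted. The real method keeps per-head statistics; this sketch averages over batch and heads for brevity.

import torch

class H2OCache:
    """Rough sketch of H2O-style eviction; hypothetical, not the transformers Cache API."""

    def __init__(self, recent_length=512, hh_length=512):
        self.recent_length = recent_length  # always keep the most recent tokens
        self.hh_length = hh_length          # budget for "heavy hitter" tokens
        self.keys = None                    # [batch, heads, seq_len, head_dim]
        self.values = None
        self.acc_scores = None              # accumulated attention mass per cached token

    def update(self, key_states, value_states, attn_scores):
        # attn_scores: [batch, heads, q_len, kv_len] attention weights over the cache
        # after the new keys are appended -- the extra information this issue proposes
        # exposing from LlamaAttention.forward.
        if self.keys is None:
            self.keys, self.values = key_states, value_states
        else:
            self.keys = torch.cat([self.keys, key_states], dim=2)
            self.values = torch.cat([self.values, value_states], dim=2)

        new_scores = attn_scores.sum(dim=2)  # attention mass received per cached position
        if self.acc_scores is None:
            self.acc_scores = new_scores
        else:
            padded = torch.zeros_like(new_scores)
            padded[..., : self.acc_scores.shape[-1]] += self.acc_scores
            self.acc_scores = padded + new_scores

        self._evict()
        return self.keys, self.values

    def _evict(self):
        seq_len = self.keys.shape[2]
        if seq_len <= self.recent_length + self.hh_length:
            return
        # Keep the most recent tokens unconditionally; from the older ones, keep the
        # positions with the largest accumulated attention scores (the heavy hitters).
        history = seq_len - self.recent_length
        hh_scores = self.acc_scores[..., :history].mean(dim=(0, 1))
        hh_idx = hh_scores.topk(self.hh_length).indices.sort().values
        recent_idx = torch.arange(history, seq_len, device=self.keys.device)
        keep = torch.cat([hh_idx, recent_idx])
        self.keys = self.keys[:, :, keep]
        self.values = self.values[:, :, keep]
        self.acc_scores = self.acc_scores[..., keep]

With a cache shaped like this, the generate call in the example above would work unchanged once the attention layer forwards its weights to the cache.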

amyeroberts added the Feature request and Cache labels on May 13, 2024
amyeroberts (Collaborator) commented:

cc @ArthurZucker @gante

gante (Member) commented May 21, 2024

Hey @HarryWu99 👋

Techniques that improve memory utilization with LLMs are always exciting! At first glance, it seems like a good candidate to be added to transformers with the API you showcased in your example. Two additional points for consideration:

  1. Benchmarks need to be run before merging, to confirm the implementation is working as expected;
  2. You mentioned changes in LlamaAttention.forward, to use the attention scores. We may need a new function for that, like Cache.post_process(), and we may need to iterate on the design throughout the PR; a rough sketch of the idea follows below.
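A very rough sketch of what such a hook could look like (Cache.post_process, its signature, and the call order are all hypothetical, purely for discussion):

class Cache:
    def update(self, key_states, value_states, layer_idx):
        """Append the new key/value states and return the full cached tensors (as today)."""
        raise NotImplementedError

    def post_process(self, attn_weights, layer_idx):
        """Optional hook called after the attention weights are computed.

        Score-based eviction policies (H2O, SnapKV, ...) would accumulate
        per-token statistics and drop cache entries here; the default is a no-op.
        """
        pass

# Inside the attention forward pass, the call order would roughly become:
#   key_states, value_states = past_key_values.update(key_states, value_states, layer_idx)
#   attn_weights = softmax(query_states @ key_states.transpose(-2, -1) / sqrt(head_dim))
#   past_key_values.post_process(attn_weights, layer_idx)
#   attn_output = attn_weights @ value_states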

If you're happy with these two points, we'd be happy to take your PR and guide you in the process 🤗

(P.S. your first link is to the DeepSeek-V2 paper, I'm assuming you meant the H2O paper :) )
