
Integrating ReFT #1654

Open · raven38 opened this issue Apr 15, 2024 · 6 comments

raven38 commented Apr 15, 2024

Hello all, and thank you for your great work!

ReFT, a representation finetuning framework that is more parameter-efficient than popular PEFT methods like LoRA, was announced earlier this month.

I implemented LoReFT, an instance of ReFT, on top of your PEFT library, referencing the original implementation.
My implementation is available at https://github.com/raven38/peft

Would you be interested in integrating ReFT into PEFT? I would be happy to work on this if there is interest from you and the community.

BenjaminBossan (Member)

Hi, thanks for bringing this to our attention. We had already looked at (Lo)ReFT internally and had some discussion about whether it would be a good addition to PEFT. IIRC, the ReFT repo relies heavily on pyvene; does your fork do that too, or are you integrating the pyvene code? Maybe you can open a draft PR so that we can discuss the changes more easily.

Regarding the paper itself, I've only skimmed it, so I don't have the full picture. I'll quote myself on what I had to say internally:

One thing that makes me wonder is how much of the benefits of ReFT can be attributed to the fact that they only apply it selectively to certain layers and token positions. At first glance, I didn't see an ablation study to check how well and efficiently ReFT works without this.
This makes me wonder whether similar gains couldn't be obtained by applying LoRA/DoRA/VeRA to only selected layers. Maybe that's something we could add to PEFT and test out. All this also reminds me of LISA, which works by training just a random subset of layers (randomized at each step).

Do you have any further insights into that?
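
(As a quick illustration of the selective-layers idea above, and not something from the paper: to my knowledge LoraConfig already exposes a layers_to_transform option, so a sketch of such an ablation could look like the following; the model name and layer indices are arbitrary.)

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Plain LoRA, but restricted to a handful of layers, to see how much of the
# gain comes from layer selection alone.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=[4, 12, 20, 28],   # adapt only these decoder layers
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```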

raven38 (Author) commented Apr 16, 2024

Thank you for the feedback.
My code does not depend on pyvene; it works by integrating a small amount of code from pyreft and pyvene. I think ReFT mainly consists of overwriting layers, adding weights, and selecting the intervention tokens. In my implementation, I use the PEFT library to overwrite layers, borrow the weight operations from pyreft and pyvene, and use PyTorch's gather and scatter_ to select the tokens for intervention.
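
For illustration, a minimal sketch of that selection step (the shapes, names, and the LoReFT-style edit below are my own assumptions for this example, not the code from the PR):

```python
import torch

batch, seq_len, hidden, rank = 2, 8, 16, 4
hidden_states = torch.randn(batch, seq_len, hidden)

# Hypothetical per-example intervention positions (e.g. first/last prompt tokens).
positions = torch.tensor([[0, 1, 6, 7],
                          [0, 1, 4, 5]])                 # (batch, num_interventions)
idx = positions.unsqueeze(-1).expand(-1, -1, hidden)     # broadcast over hidden dim

selected = hidden_states.gather(dim=1, index=idx)        # pull out the intervened tokens

# LoReFT-style edit h + R^T (W h + b - R h), with R, W of shape (rank, hidden).
R = torch.nn.init.orthogonal_(torch.empty(rank, hidden))
W = torch.randn(rank, hidden)
b = torch.zeros(rank)
edited = selected + (selected @ W.T + b - selected @ R.T) @ R

hidden_states.scatter_(dim=1, index=idx, src=edited)     # write them back in place
```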

I will make a draft PR for discussion.

raven38 mentioned this issue Apr 16, 2024
frankaging commented Apr 21, 2024

@raven38 @BenjaminBossan Hey! I was randomly browsing GitHub and found this ticket; it's super exciting to see the PEFT library potentially supporting ReFT.

Although I think the current pyreft + pyvene stack can support more schematic ReFT designs, integrating with the PEFT library could scale up simple ReFT experiments very effectively (i.e., different levels of parallelism, checkpointing)!

One input that might be helpful concerns batching: using gather and scatter_ to support different intervention locations and different numbers of interventions across the examples in a batch. Currently, we intervene based on a token's relative position (i.e., the first n tokens of the prompt + the last n tokens of the prompt). As a result, some shorter sequences (e.g., in GLUE tasks) may need special handling of positions in order to batch them. We find that gather and scatter_ may not work well if we gather and scatter with the same index multiple times (sort of like index-wise padding with redundant interventions). Happy to provide more context if needed!
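
To make that caveat concrete, here is a toy example (my reading of the issue, not pyreft code): when a shorter example is padded by repeating a position index, scatter_ writes to the same location more than once, and which write wins is unspecified.

```python
import torch

hidden = 4
h = torch.zeros(1, 3, hidden)

# One real intervention, padded to length 2 by repeating position 0.
positions = torch.tensor([[0, 0]])                            # duplicate index from padding
idx = positions.unsqueeze(-1).expand(-1, -1, hidden)
src = torch.stack([torch.full((hidden,), 1.0),                # the "real" intervention
                   torch.full((hidden,), 2.0)]).unsqueeze(0)  # the padded duplicate

h.scatter_(dim=1, index=idx, src=src)
print(h[0, 0])  # may be all 1s or all 2s: with duplicate indices the result is unspecified
```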

Another comment that might be helpful concerns the KV cache. We currently only intervene on the prompt tokens, so the intervened KV for the prompt tokens can be cached and there should be no inference overhead when generating (this is different from adapters, depending on their implementation). You might already have taken care of that, though.
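
A minimal sketch of that idea, assuming the intervention is applied via a forward hook on a decoder layer whose output starts with a (batch, seq_len, hidden) tensor (the hook and names are illustrative, not pyreft's API): only the prefill pass is touched, so the cached prompt KV is reused during decoding at no extra cost.

```python
import torch

def make_prompt_only_hook(intervention):
    """Apply `intervention` only during prefill (seq_len > 1); single-token decode
    steps that reuse the KV cache pass through untouched."""
    def hook(module, inputs, output):
        hidden_states = output[0] if isinstance(output, tuple) else output
        if hidden_states.shape[1] > 1:               # prefill: prompt tokens present
            hidden_states = intervention(hidden_states)
            if isinstance(output, tuple):
                return (hidden_states,) + output[1:]
            return hidden_states
        return output
    return hook

# Usage (illustrative): layer.register_forward_hook(make_prompt_only_hook(my_intervention))
```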

Thanks again!

BenjaminBossan (Member)

Thanks for sharing that information @frankaging. I haven't checked the PR in detail yet or compared it to the pyreft/pyvene code; @raven38 should be better positioned to answer your question.

One thing I wondered: from your perspective, do you think that pyreft/pyvene is structured in a way that we could add it as an optional dependency to PEFT and re-use its code, or would it be easier to re-implement it from scratch as the PR currently does?

raven38 (Author) commented Apr 28, 2024

@frankaging Thank you for the feedback.
One of the motivations for integrating ReFT into the PEFT library is to allow ReFT to support more architectures. IIUC, although PEFT supports any PyTorch model, pyreft + pyvene only support the specific models defined at https://github.com/stanfordnlp/pyvene/blob/7d94cdd4841834e079edba4de83410f6b91d254c/pyvene/models/intervenable_modelcard.py#L37

I am also thinking about the batching issue. In pyreft, intervention locations are set on the dataset side, but considering compatibility with the other adapters' APIs, I don't think that solution is appropriate. I don't have a solution for this yet, but I would like to hear any ideas you have.
https://github.com/stanfordnlp/pyreft/blob/77970c1d7c4f46e5148f1caff2974db76ec5bf4f/pyreft/dataset.py#L67

I'm also interested in the KV cache. Where can I learn about the KV cache implementations of other adapters?

BenjaminBossan (Member)

I am also thinking about the batching issue. In pyreft, intervention locations are set on the dataset side, but considering compatibility with the other adapters' APIs, I don't think that solution is appropriate. I don't have a solution for this yet, but I would like to hear any ideas you have.

Good point. We could put the burden of providing the intervention information on the user, but ideally it would be great if they only need to set the parameters and we apply them automatically. I'm not sure how generalizable that is (beyond language models, for instance); perhaps we would want an option to intervene on all of the input data?
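
Purely as a strawman for the "only set the parameters" idea (every name below is invented for illustration; nothing like this exists in PEFT today), a user-facing config could look roughly like:

```python
from dataclasses import dataclass, field

@dataclass
class ReftConfigSketch:                          # hypothetical, for discussion only
    r: int = 4                                   # rank of the low-rank intervention
    target_layers: list = field(default_factory=lambda: [8, 16, 24])
    first_n_tokens: int = 4                      # intervene on the first n prompt tokens
    last_n_tokens: int = 4                       # ...and on the last n prompt tokens
    intervene_on_all_inputs: bool = False        # fallback when positions don't apply (non-LM inputs)
```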

I'm also interested in the KV cache. Where can I learn about the KV cache implementations of other adapters?

If you mean inside of PEFT, note that we don't have any PEFT-specific KV cache. If transformers adds it to a model, we want to make sure it can be used properly, but we don't have anything of our own. If you see potential for performance gains by adding something to PEFT, let us know.
