Weird Loss with LISA #806

Open

harry7171 opened this issue May 4, 2024 · 1 comment

Comments

@harry7171

Hi,

Some background first:
I am using Mistral 7B with the HF Trainer for finetuning on domain-specific data, where the task is causal LM, i.e. next-word prediction.
For data preparation I use the data collator for causal LM, with a context size of 1000 tokens per data point. I have about 9k data points in total, of which 5-10% is Wiki data mixed in with the domain data to avoid catastrophic forgetting.
The test data is a subset of the training data, since I want the model to learn the specific data.
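For reference, a minimal sketch of this kind of data preparation with the HF causal-LM data collator; the checkpoint name and the "text" field are assumptions, not my exact setup:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

def tokenize(batch):
    # Truncate each example to the 1000-token context mentioned above
    return tokenizer(batch["text"], truncation=True, max_length=1000)

# mlm=False makes the collator build labels for next-word prediction (causal LM)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```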

I am using the DynamicLayerActivationCallback from LMFlow as a training callback in my Trainer; roughly as sketched below.
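
The wiring looks roughly like this; the constructor arguments for the callback (n_layers, interval_steps, model) and the dataset name are assumptions and should be checked against the LMFlow version the class was copied from:

```python
from transformers import Trainer, TrainingArguments

# Assumed constructor arguments; verify against the LMFlow source the class comes from
lisa_callback = DynamicLayerActivationCallback(
    n_layers=2,          # lisa_activated_layers
    interval_steps=50,   # lisa_interval_steps
    model=model,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=8),
    train_dataset=tokenized_dataset,  # hypothetical name for the tokenized 9k set
    data_collator=collator,
    callbacks=[lisa_callback],
)
```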

I tried multiple experiments with -

  • lisa_activated_layers = 2, lisa_interval_steps = 50, 8 epochs
  • lisa_activated_layers = 2, lisa_interval_steps = 50, 10 epochs

For both runs the loss starts around 8 and drops to around 5-6, but then plateaus and does not come below 5.

I find this a little strange; maybe further experimentation is needed on -

  • changing lisa_activated_layers
  • changing lisa_interval_steps (I think this can be an important factor too)

I would also like to know the ideal or recommended hyperparameters for this type of finetuning with around 10K data points.

Thanks in Advance

@research4pan
Contributor

Thanks for your interest in LMFlow! We have fixed several bugs in the LISA implementation in LMFlow; it would be good to check whether your implementation matches our latest version.

If the implementation is correct, it is worth trying:

  • A smaller lisa_interval_steps, such as 5 instead of 50, since more frequent sampling allows more layers to be covered.
  • If that doesn't work, you may try a larger lisa_activated_layers. We have observed that in some cases, such as llama-2-70b, a deeper architecture requires a larger lisa_activated_layers. This may also be the case when the data distribution is harder to learn. (See the sketch after this list.)
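
In terms of the callback sketch above, the retry would only change the two arguments (again, the constructor arguments are assumptions, not the confirmed LMFlow API):

```python
# Hypothetical retry with the suggested settings
lisa_callback = DynamicLayerActivationCallback(
    n_layers=4,         # a larger lisa_activated_layers if the loss still plateaus
    interval_steps=5,   # smaller lisa_interval_steps -> layers are resampled more often
    model=model,
)
```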

Hope this information is helpful. Please feel free to let us know if you encounter further problems 😄
