Added code for Full fine tune #645

Open · wants to merge 3 commits into main

Conversation

ziozzang commented Apr 3, 2024

The code presented here is derived from the original lora.py file, with minimal modifications. The primary addition is the inclusion of full fine-tuning functionality, while preserving the core structure of the original code. This revised version offers a potential starting point for testing the training process on more powerful Mac devices.

Efforts were made to avoid altering any code within the tuner/* directory, ensuring that this update does not introduce any conflicts with the legacy codebase.

The code has been successfully tested on a Mac M2 Studio model with 192GB of memory, demonstrating its compatibility with high-performance hardware configurations.
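
For readers comparing this with lora.py: the essential difference is small. The LoRA path freezes the base model and injects trainable adapter layers, whereas a full fine-tune leaves every parameter trainable and hands them all to the optimizer. A minimal sketch of that idea (not the exact diff in this PR; the commented-out LoRA lines only loosely mirror what lora.py does):

import mlx.optimizers as optim
from mlx_lm import load

model, tokenizer = load("microsoft/phi-2")

# LoRA path (roughly what lora.py does): freeze everything, then add
# adapter layers so only the adapters receive gradients.
#   model.freeze()
#   linear_to_lora_layers(model, lora_layers, lora_parameters)

# Full fine-tune path (the idea in this PR): skip the freeze and the
# adapter injection entirely, so every weight stays trainable.
model.train()

optimizer = optim.Adam(learning_rate=1e-5)
# The existing trainer loop from tuner/trainer.py can then be reused,
# since it optimizes whatever parameters are currently trainable.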

N8python commented Apr 4, 2024

Do you perform your full fine-tune in float32?

ziozzang closed this Apr 4, 2024

Due to compatibility with 'tuner/trainer.py', I fixed the save and load functions in the code.
I tested it fully. =)

But there seems to be a memory leak issue in the memory handling of mlx-examples or mlx, maybe?
ziozzang reopened this Apr 4, 2024

ziozzang (Author) commented Apr 4, 2024

Fixed the load/save functions; fully tested with the Phi-2 2.8B model.

  • Model file saving works well, and resuming of model training also works well.

Do you perform your full fine-tune in float32?

No. I just copied the code from mlx-lm/lora.py and adapted it to run as a full fine-tune, keeping it compatible with the original tuner/* code.
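
For context, a rough sketch of what saving and resuming full model weights can look like with MLX's safetensors utilities; this is an illustration using mlx.core / mlx.utils helpers, not the exact functions in this PR:

import mlx.core as mx
from mlx.utils import tree_flatten

def save_full_weights(model, path="model.safetensors"):
    # Flatten the nested parameter tree into {name: array} and write safetensors.
    weights = dict(tree_flatten(model.parameters()))
    mx.save_safetensors(path, weights)

def load_full_weights(model, path="model.safetensors"):
    # nn.Module.load_weights accepts a safetensors (or npz) path and updates
    # the module's parameters in place, which is enough to resume training.
    model.load_weights(path)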

ziozzang (Author) commented Apr 4, 2024

Test example:

  • Hardware: M2 Studio / 192GB
  • Model: phi-2 2.8B
  • Training set: chat completion / 400 items (iters set to only 10, to show a running demo)
$ python -m mlx_lm.full -c full_config.yaml

Loading configuration file full_config.yaml
Loading pretrained model
model file loaded: model.safetensors
Loading datasets
Training
Starting training..., iters: 10
Iter 1: Val loss 1.405, Val took 2.793s
Iter 5: Train loss 1.094, Learning Rate 1.000e-05, It/sec 0.956, Tokens/sec 779.347, Trained Tokens 4077, Peak mem 22.747 GB
Iter 10: Train loss 1.311, Learning Rate 1.000e-05, It/sec 0.668, Tokens/sec 589.665, Trained Tokens 8490, Peak mem 26.198 GB
Iter 10: Val loss 1.542, Val took 2.936s
Saved final adapter weights to adapter.npz.
Saved final model weights to model.safetensors.

N8python commented Apr 5, 2024

Tried training qwen-1.8b. NaN loss immediately. Will try phi-2.

ziozzang (Author) commented Apr 5, 2024

Tried training qwen-1.8b. NaN loss immediately. Will try phi-2.

When I tried Gemma-2b, I got the same NaN loss. Maybe it's an issue in the foundation code, perhaps in models/*? I didn't check.

N8python commented Apr 5, 2024

I think it's the float16.
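
If float16 precision is indeed the culprit, a common way to test that hypothesis (an illustration only, not something this PR does) is to upcast the parameters to float32 before training. A minimal sketch using MLX's tree utilities:

import mlx.core as mx
from mlx.utils import tree_map

def cast_params_to_float32(model):
    # Map every parameter array to float32 and push the result back into
    # the module, so subsequent training math runs in full precision.
    fp32_params = tree_map(lambda p: p.astype(mx.float32), model.parameters())
    model.update(fp32_params)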

N8python commented Apr 5, 2024

Just checked - NaN w/ phi.

chimezie (Contributor) commented Apr 5, 2024

I was also receiving NaN using Qwen 14B against my dataset, but I couldn't reproduce it with the test data in lora/data. I tried again with updates on main for both mlx/mlx_lm this morning and have reached 4K iterations so far without NaNs.

In the past it had been a float16 issue for me. I don't remember whether I quantized this one at 32 or 16 bits, but the config.json of the locally converted model has:

{
    "architectures": [
        "Qwen2ForCausalLM"
    ],
    [..]
    "quantization": {
        "group_size": 64,
        "bits": 4
    },
    [..]
    "torch_dtype": "bfloat16",
    [..]
    "use_bfloat16": false,
}

chimezie (Contributor) commented Apr 9, 2024

I've opened an older issue (#620) regarding NaN values in the training loss.

awni (Member) commented Apr 17, 2024

This is cool, and I think it would be nice to support. We might be able to do it with a far smaller diff, however. Something like:

  • Have a training type field in the config
  • If it's full_fine_tune then don't freeze the model / don't use LoRA layers

Everything else should be the same. Wdyt?
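
A sketch of what that switch could look like in the existing setup code, assuming a hypothetical training_type config key and the existing linear_to_lora_layers helper from mlx_lm.tuner.utils (its signature is shown loosely here):

from mlx_lm.tuner.utils import linear_to_lora_layers

def setup_trainable_parameters(model, args):
    # Hypothetical switch: "training_type" is an assumed config key.
    if getattr(args, "training_type", "lora") == "full_fine_tune":
        # Full fine-tune: leave every parameter trainable, no LoRA layers.
        return
    # Default LoRA path: freeze the base model and inject adapter layers,
    # so only the adapters receive gradients.
    model.freeze()
    linear_to_lora_layers(model, args.lora_layers, args.lora_parameters)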
