Option to split during conversion #6942

Draft · wants to merge 7 commits into master

Conversation

@christianazinn (Contributor) commented Apr 27, 2024

This PR introduces additional options to convert.py that allow users to split a model into shards while converting rather than having to do it after conversion, including a default small first shard as outlined in #6463.

Other functionality we ought to have includes --split-max-size (so far it's just --split-max-tensors), displaying estimated shard sizes, dry running, and adding sharding for the other convert-*-to-*.py scripts. This will be considered a draft until those are worked out. It also needs considerable testing, but since it only touches the Python scripts, it is easy to test.
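
For reference, here is a minimal sketch of how the flags described above might be wired into convert.py's argument parser. The flag names come from this PR's description; the function and its exact wiring are illustrative assumptions, not the PR's actual code.

    # Hypothetical sketch only; the PR may register these flags differently.
    import argparse

    def add_split_args(parser: argparse.ArgumentParser) -> None:
        parser.add_argument("--split", action="store_true",
                            help="split the converted model into multiple shards")
        parser.add_argument("--split-max-tensors", type=int,
                            help="maximum number of tensors per shard")
        parser.add_argument("--split-max-size", type=str,
                            help="maximum shard size, e.g. 64M or 4G")
        parser.add_argument("--dry-run", action="store_true",
                            help="only print the planned shards, do not write files")
        parser.add_argument("--large-first-shard",
                            action="store_true",
                            help="include tensors in the first shard instead of metadata only")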

Usage

(examples are using zephyr-smol_llama-100m-sft-full)

Example, --split-max-size

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 64M

Output: the same as what master prints to stdout, followed by

Writing the following files:
    /path/to/outfile-00001-of-00005.gguf: n_tensors = 0, total_size = negligible - metadata only
    /path/to/outfile-00002-of-00005.gguf: n_tensors = 1, total_size = 47.1M
    /path/to/outfile-00003-of-00005.gguf: n_tensors = 11, total_size = 63.6M
    /path/to/outfile-00004-of-00005.gguf: n_tensors = 32, total_size = 63.4M
    /path/to/outfile-00005-of-00005.gguf: n_tensors = 13, total_size = 19.1M

Writing shard 2/5 with 1/57 tensors remaining (of 57 total)
[1/1] Writing tensor output.weight                          | size  32128 x    768  | type F16  | T+   2

Writing shard 3/5 with 11/56 tensors remaining (of 57 total)
[ 1/11] Writing tensor token_embd.weight                      | size  32128 x    768  | type F16  | T+   2
[ 2/11] Writing tensor blk.0.attn_norm.weight                 | size    768           | type F32  | T+   3
[ 3/11] Writing tensor blk.0.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   3
[ 4/11] Writing tensor blk.0.ffn_gate.weight                  | size   3072 x    768  | type F16  | T+   3
[ 5/11] Writing tensor blk.0.ffn_up.weight                    | size   3072 x    768  | type F16  | T+   3
[ 6/11] Writing tensor blk.0.ffn_norm.weight                  | size    768           | type F32  | T+   3
[ 7/11] Writing tensor blk.0.attn_k.weight                    | size    256 x    768  | type F16  | T+   3
[ 8/11] Writing tensor blk.0.attn_output.weight               | size    768 x    768  | type F16  | T+   3
[ 9/11] Writing tensor blk.0.attn_q.weight                    | size    768 x    768  | type F16  | T+   3
[10/11] Writing tensor blk.0.attn_v.weight                    | size    256 x    768  | type F16  | T+   3
[11/11] Writing tensor blk.1.attn_norm.weight                 | size    768           | type F32  | T+   3

Writing shard 4/5 with 32/45 tensors remaining (of 57 total)
[ 1/32] Writing tensor blk.1.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   0
[etc...]

With --split-max-size 200M (or any number greater than the total resultant size), it gives:

Model has smaller size than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

[the rest of output is the same as in master]
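
Conceptually, the shard plan above can be reproduced with a greedy grouping of tensors by cumulative byte size. The sketch below is only an illustration under that assumption (parse_size and plan_shards are made-up helper names, not the PR's real code), and it ignores the default metadata-only first shard, which would simply be prepended to the plan.

    # Illustrative only: greedy grouping of tensor sizes into shards under a byte budget.
    def parse_size(s: str) -> int:
        units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
        return int(float(s[:-1]) * units[s[-1]]) if s[-1] in units else int(s)

    def plan_shards(tensor_sizes: list[int], max_size: str) -> list[list[int]]:
        budget = parse_size(max_size)
        shards, current, used = [], [], 0
        for size in tensor_sizes:
            # Start a new shard once adding this tensor would exceed the budget.
            if current and used + size > budget:
                shards.append(current)
                current, used = [], 0
            current.append(size)
            used += size
        if current:
            shards.append(current)
        return shards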

Example, --split-max-tensors with --dry-run, --large-first-shard

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-tensors 20 --dry-run --large-first-shard

Output: the same as what master prints to stdout, followed by

Writing the following files:
    /path/to/outfile-00001-of-00003.gguf: n_tensors = 20, total_size = 127.1M
    /path/to/outfile-00002-of-00003.gguf: n_tensors = 20, total_size = 37.5M
    /path/to/outfile-00003-of-00003.gguf: n_tensors = 17, total_size = 28.5M

Dry run, not writing files

With --split-max-tensors 64 (or any number greater than the total tensor count), it gives:

Model has fewer tensors than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

Dry run, not writing files
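
The shard filenames in these examples follow an outfile-%05d-of-%05d.gguf pattern. A small sketch of how such names could be derived from the requested --outfile (illustrative; not necessarily the PR's exact helper):

    import os

    # Illustrative helper: derive per-shard paths from the requested output path.
    def shard_path(outfile: str, index: int, total: int) -> str:
        base, ext = os.path.splitext(outfile)   # "/path/to/outfile", ".gguf"
        return f"{base}-{index:05d}-of-{total:05d}{ext}"

    # e.g. shard_path("/path/to/outfile.gguf", 2, 5) -> "/path/to/outfile-00002-of-00005.gguf"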

@christianazinn christianazinn marked this pull request as draft April 27, 2024 04:24
@christianazinn (Contributor Author) commented:

I've added support for --split-max-size and --dry-run, taking a page out of gguf-split.cpp. Faced with adding split functionality to the convert-*-to-*.py scripts, I wonder whether this should be added to the GGUFWriter class itself rather than to the convert scripts, since it would be tedious to rewrite every write_tensors method in convert-hf-to-gguf.py.

The counterpoint I can see to doing this is that GGUFWriter should only write one file, since it's GGUFWriter and not GGMLWriter. It would also be very annoying to rewrite GGUFWriter, and I'm hesitant to touch the gguf package as a novice. But it's also likely nobody thought of this scenario when creating the file, so perhaps there's good reason to make these changes in the GGUFWriter class. @phymbert thoughts?

@phymbert (Collaborator) commented:

This is already a good start. Could you add an end-to-end usage example in the summary?

@christianazinn (Contributor Author) commented Apr 28, 2024

Sure thing (I assume you mean examples of usage and expected outputs).

I also plan to rework the implementation by consolidating code into a new GGUFManager class that handles multiple file writes via multiple GGUFWriter instances, so GGUFWriter still only writes to one file. This is because each Model in convert-hf-to-gguf.py has only one instance of GGUFWriter, so splitting would be nearly impossible there. Usage should remain the same, but the code will be fundamentally altered. (I also imagine this could affect memory usage, so that will need to be tested heavily.)
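
As a rough illustration of that idea (purely hypothetical; the names, signatures, and buffering below are simplified assumptions and not this PR's final API), a GGUFManager could expose a GGUFWriter-like interface while fanning tensors out to one GGUFWriter per shard:

    # Hypothetical sketch of the GGUFManager concept; the real class wraps
    # gguf.GGUFWriter with a richer interface and avoids buffering everything.
    import numpy as np
    import gguf

    class GGUFManager:
        def __init__(self, path_fmt: str, arch: str, max_tensors_per_shard: int):
            # path_fmt is e.g. "/path/to/outfile-{:05d}-of-{:05d}.gguf"
            self.path_fmt = path_fmt
            self.arch = arch
            self.max_tensors = max_tensors_per_shard
            self.tensors: list[tuple[str, np.ndarray]] = []

        def add_tensor(self, name: str, data: np.ndarray) -> None:
            # Buffer tensors so the shard plan is known before any file is written
            # (buffered in memory here only for brevity).
            self.tensors.append((name, data))

        def write_all(self) -> None:
            groups = [self.tensors[i:i + self.max_tensors]
                      for i in range(0, len(self.tensors), self.max_tensors)]
            for i, group in enumerate(groups, start=1):
                writer = gguf.GGUFWriter(self.path_fmt.format(i, len(groups)), self.arch)
                for name, data in group:
                    writer.add_tensor(name, data)
                writer.write_header_to_file()
                writer.write_kv_data_to_file()
                writer.write_tensors_to_file()
                writer.close()

The point of the design is that each Model keeps a single writer-like object, and only that object knows it is producing several files.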

@christianazinn (Contributor Author) commented:

I'll need to implement for convert-llama-ggml-to-gguf.py and convert-persimmon-to-gguf.py soon - what are some models that require those scripts for conversion, so I can test? Also, I see convert-lora-to-ggml.py doesn't even use GGUFWriter - is that just for converting LoRA adapters? Is that something we should even add splitting for, considering the small size of LoRA adapters?

Anyway, GGUFManager is implemented as a near drop-in replacement for GGUFWriter that supports file splitting, so far only in convert.py (migrated from my previous commits); support for convert-hf-to-gguf.py is next up.

@slaren (Collaborator) commented Apr 28, 2024

convert-llama-ggml-to-gguf.py is for conversion of pre-gguf models. At this point it could be removed. convert-lora-to-ggml.py doesn't export to gguf format. convert-persimmon-to-gguf.py should probably be integrated into convert-hf-to-gguf.py, but I don't think it needs to be updated.

@christianazinn (Contributor Author) commented:

Got it - will only implement for convert-hf-to-gguf.py. Remind me to watch memory usage while converting. Since I'm making changes to the gguf package, how will I push those?

@slaren (Collaborator) commented Apr 29, 2024

You can modify the gguf package in the gguf-py directory in this repository. There are instructions for publishing new releases in https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md.

@christianazinn (Contributor Author) commented:

> You can modify the gguf package in the gguf-py directory in this repository

That's what I've been doing so far; will check out instructions to contribute, thanks!

@christianazinn (Contributor Author) commented:

Testing on Mistral 7B Instruct, this branch's convert.py uses approximately the same amount of memory as master's. Will need to check on larger models, since the discrepancy was around 6% (3.6G vs. 3.4G at peak). Obviously memory plays a major role in splitting larger files, which is the entire point of this PR.
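
For anyone reproducing this comparison, one way to capture peak memory of a conversion run (assuming GNU time is available; this is not part of the PR itself) is:

/usr/bin/time -v python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 4G

The "Maximum resident set size" line in its report gives the peak memory in kilobytes.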

@christianazinn (Contributor Author) commented:

Running tests on my side for all convert-hf-to-gguf.py supported model architectures. What models fall under QWenLMHeadModel - is that just plain QWen 1?

@christianazinn (Contributor Author) commented May 2, 2024

Will keep track of tests here as I go. Picking one model from each architecture in convert-hf-to-gguf.py as it exists in my branch and testing; will need assistance testing, say, vision models, which I'm not as familiar with. Also note that I went with smaller models to test the architecture; larger models should act the same, but again, tests will be needed.

It also seems like the current convert-hf-to-gguf.py doesn't print tensor status as it goes, which I intend to change.

  • GPTNeoX: EleutherAI/gpt-neox-20b - FAILED LOADING with "unknown architecture" (failed on master as well)
  • Bloom: bigscience/bloom-7b1 - WORKS
  • MPT: mosaicml/mpt-7b - WORKS
  • Orion: OrionStarAI/Orion-14B-Chat - WORKS
  • Baichuan: baichuan-inc/Baichuan2-7B-Chat - WORKS
  • Xverse: xverse/XVERSE-7B-Chat - FAILED CONVERSION with "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 78 column 3" (failed on master as well)
  • Falcon: tiiuae/falcon-7b-instruct - WORKS
  • GPTBigCode: bigcode/gpt_bigcode-santacoder - FAILED LOADING with "tensor output.weight not found" (failed on master as well)
  • GPTRefact: smallcloudai/Refact-1_6B-fim - WORKS (incoherent code but I assume that's what it's used for)
  • Persimmon: adept/persimmon-8b-chat - Strictly "WORKS" but is incoherent - I assume this has to do with prompt formatting on master so I won't look further. It loads and generates.
  • StableLM: stabilityai/stablelm-2-1_6b-chat - WORKS
  • Mistral: mistralai/Mistral-7B-Instruct-v0.2
  • Llama2: meta-llama/Llama-2-7b-chat-hf
  • DBRX: databricks/dbrx-instruct
  • MiniCPM: openbmb/MiniCPM-V-2
  • Qwen1: Qwen/Qwen-1_8B
  • Qwen2: Qwen/Qwen1.5-1.8B
  • Qwen MoE: Qwen/Qwen1.5-MoE-A2.7B-Chat
  • GPT2: openai-community/gpt2
  • Phi2: microsoft/phi-2
  • Phi3: microsoft/Phi-3-mini-4k-instruct
  • Plamo: pfnet/plamo-13b-instruct
  • CodeShell: WisdomShell/CodeShell-7B-Chat
  • InternLM: internlm/internlm2-chat-7b
  • BERT: avsolatorio/GIST-Embedding-v0
  • NomicBERT: nomic-ai/nomic-embed-text-v1.5
  • Gemma: google/gemma-1.1-2b-it
  • StarCoder2: bigcode/starcoder2-3b
  • Mamba: TRI-ML/mamba-7b-rw
  • Cohere: CohereForAI/c4ai-command-r-v01
  • OLMo: allenai/OLMo-7B-Instruct

@christianazinn (Contributor Author) commented:

Leaving a note for myself to watch merge conflicts with #6511. Development on this branch has slowed down as I'm pretty busy.

@christianazinn (Contributor Author) commented:

Noting time to convert baichuan-inc/Baichuan2-7B-Chat.

New branch, --split, --split-max-size 4G:
real 6m27.788s
user 1m15.914s
sys 0m46.017s

New branch, no split:
real 7m17.661s
user 1m18.516s
sys 0m44.285s

master:
real 5m57.387s
user 1m14.567s
sys 0m48.403s

Note that these conversions were done writing the outfile over 2.5GbE, so there was considerable time spent just saving the file. Will test more later, but it doesn't seem like the change increases conversion time too significantly.

@mofosyne added the "review complexity : medium", "python", and "enhancement" labels May 9, 2024
@mofosyne (Collaborator) commented May 9, 2024

Merge attempted. Some lines were ambiguous, so @christianazinn should look this over to make sure the intent is still correct.

@christianazinn (Contributor Author) commented May 9, 2024

I'll check in a few hours and fix conflicts.

@christianazinn (Contributor Author) commented:

The new get-vocab-base-pre functionality introduced to convert-hf-to-gguf.py by #6920 is throwing me off, but things look fine for the most part. Push incoming for conflict resolution; testing on Refact for convert-hf-to-gguf.py worked and no fundamental changes are required to convert.py. This will remain approximately dormant for another two weeks or so while I focus on finals, but since the code is already almost all implemented, if other people want to pick up and take this PR to the finish line I'd more than appreciate it.

@mofosyne added the "help wanted" label May 10, 2024