Option to split during conversion #6942

Draft · wants to merge 7 commits into master

Conversation

@christianazinn (Contributor) commented Apr 27, 2024

This PR introduces additional options to convert.py that allow users to split a model into shards while converting rather than having to do it after conversion, including a default small first shard as outlined in #6463.

Other functionality we ought to have includes --split-max-size (so far it's just --split-max-tensors), displaying estimated shard sizes, dry running, and adding sharding for the other convert-*-to-*.py scripts. This will be considered a draft until those are worked out. It also needs considerable testing, but since it only touches the Python scripts, it is easy to test.
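
For reference, here is a minimal sketch of how the flags described above might be wired into convert.py's argument parser. The flag names come from this PR's description; the function and its exact wiring are illustrative assumptions, not the PR's actual code.

    # Hypothetical sketch only; the PR may register these flags differently.
    import argparse

    def add_split_args(parser: argparse.ArgumentParser) -> None:
        parser.add_argument("--split", action="store_true",
                            help="split the converted model into multiple shards")
        parser.add_argument("--split-max-tensors", type=int,
                            help="maximum number of tensors per shard")
        parser.add_argument("--split-max-size", type=str,
                            help="maximum shard size, e.g. 64M or 4G")
        parser.add_argument("--dry-run", action="store_true",
                            help="only print the planned shards, do not write files")
        parser.add_argument("--large-first-shard",
                            action="store_true",
                            help="include tensors in the first shard instead of metadata only")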

Usage

(examples are using zephyr-smol_llama-100m-sft-full)

Example, --split-max-size

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 64M

Output: the same as what master prints to stdout, followed by

Writing the following files:
    /path/to/outfile-00001-of-00005.gguf: n_tensors = 0, total_size = negligible - metadata only
    /path/to/outfile-00002-of-00005.gguf: n_tensors = 1, total_size = 47.1M
    /path/to/outfile-00003-of-00005.gguf: n_tensors = 11, total_size = 63.6M
    /path/to/outfile-00004-of-00005.gguf: n_tensors = 32, total_size = 63.4M
    /path/to/outfile-00005-of-00005.gguf: n_tensors = 13, total_size = 19.1M

Writing shard 2/5 with 1/57 tensors remaining (of 57 total)
[1/1] Writing tensor output.weight                          | size  32128 x    768  | type F16  | T+   2

Writing shard 3/5 with 11/56 tensors remaining (of 57 total)
[ 1/11] Writing tensor token_embd.weight                      | size  32128 x    768  | type F16  | T+   2
[ 2/11] Writing tensor blk.0.attn_norm.weight                 | size    768           | type F32  | T+   3
[ 3/11] Writing tensor blk.0.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   3
[ 4/11] Writing tensor blk.0.ffn_gate.weight                  | size   3072 x    768  | type F16  | T+   3
[ 5/11] Writing tensor blk.0.ffn_up.weight                    | size   3072 x    768  | type F16  | T+   3
[ 6/11] Writing tensor blk.0.ffn_norm.weight                  | size    768           | type F32  | T+   3
[ 7/11] Writing tensor blk.0.attn_k.weight                    | size    256 x    768  | type F16  | T+   3
[ 8/11] Writing tensor blk.0.attn_output.weight               | size    768 x    768  | type F16  | T+   3
[ 9/11] Writing tensor blk.0.attn_q.weight                    | size    768 x    768  | type F16  | T+   3
[10/11] Writing tensor blk.0.attn_v.weight                    | size    256 x    768  | type F16  | T+   3
[11/11] Writing tensor blk.1.attn_norm.weight                 | size    768           | type F32  | T+   3

Writing shard 4/5 with 32/45 tensors remaining (of 57 total)
[ 1/32] Writing tensor blk.1.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   0
[etc...]

With --split-max-size 200M (or any number greater than the total resultant size), it gives:

Model has smaller size than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

[the rest of output is the same as in master]
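
Conceptually, the shard plan above can be reproduced with a greedy grouping of tensors by cumulative byte size. The sketch below is only an illustration under that assumption (parse_size and plan_shards are made-up helper names, not the PR's real code), and it ignores the default metadata-only first shard, which would simply be prepended to the plan.

    # Illustrative only: greedy grouping of tensor sizes into shards under a byte budget.
    def parse_size(s: str) -> int:
        units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
        return int(float(s[:-1]) * units[s[-1]]) if s[-1] in units else int(s)

    def plan_shards(tensor_sizes: list[int], max_size: str) -> list[list[int]]:
        budget = parse_size(max_size)
        shards, current, used = [], [], 0
        for size in tensor_sizes:
            # Start a new shard once adding this tensor would exceed the budget.
            if current and used + size > budget:
                shards.append(current)
                current, used = [], 0
            current.append(size)
            used += size
        if current:
            shards.append(current)
        return shards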

Example, --split-max-tensors with --dry-run, --large-first-shard

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-tensors 20 --dry-run --large-first-shard

Output: the same as what master prints to stdout, followed by

Writing the following files:
    /path/to/outfile-00001-of-00003.gguf: n_tensors = 20, total_size = 127.1M
    /path/to/outfile-00002-of-00003.gguf: n_tensors = 20, total_size = 37.5M
    /path/to/outfile-00003-of-00003.gguf: n_tensors = 17, total_size = 28.5M

Dry run, not writing files

With --split-max-tensors 64 (or any number greater than the total tensor count), it gives:

Model has fewer tensors than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

Dry run, not writing files
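
The shard filenames in these examples follow an outfile-%05d-of-%05d.gguf pattern. A small sketch of how such names could be derived from the requested --outfile (illustrative; not necessarily the PR's exact helper):

    import os

    # Illustrative helper: derive per-shard paths from the requested output path.
    def shard_path(outfile: str, index: int, total: int) -> str:
        base, ext = os.path.splitext(outfile)   # "/path/to/outfile", ".gguf"
        return f"{base}-{index:05d}-of-{total:05d}{ext}"

    # e.g. shard_path("/path/to/outfile.gguf", 2, 5) -> "/path/to/outfile-00002-of-00005.gguf"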

@christianazinn christianazinn marked this pull request as draft April 27, 2024 04:24
@christianazinn (Contributor Author) commented:

I've added support for --split-max-size and --dry-run, taking a page out of gguf-split.cpp. Faced with adding split functionality to the convert-*-to-*.py scripts, I wonder whether this should be added to the GGUFWriter class itself rather than to the convert scripts, since it would be tedious to rewrite every write_tensors method in convert-hf-to-gguf.py.

The counterpoint I can see to doing this is that GGUFWriter should only write one file, since it's GGUFWriter and not GGMLWriter. It would also be very annoying to rewrite GGUFWriter, and I'm hesitant to touch the gguf package as a novice. But it's also likely nobody thought of this scenario when creating the file, so perhaps there's good reason to make these changes in the GGUFWriter class. @phymbert thoughts?

@phymbert (Collaborator) commented:

This is already a good start. Could you add an end-to-end usage example in the summary?

@christianazinn (Contributor Author) commented Apr 28, 2024

Sure thing (I assume you mean examples of usage and expected outputs).

I also plan to rework the implementation by consolidating code into a new GGUFManager class that handles multiple file writes via multiple GGUFWriter instances, so GGUFWriter still only writes to one file. This is because each Model in convert-hf-to-gguf.py has only one instance of GGUFWriter, so splitting would be nearly impossible there. Usage should remain the same, but the code will be fundamentally altered. (I also imagine this could affect memory usage, so that will need to be tested heavily.)
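
As a rough illustration of that idea (purely hypothetical; the names, signatures, and buffering below are simplified assumptions and not this PR's final API), a GGUFManager could expose a GGUFWriter-like interface while fanning tensors out to one GGUFWriter per shard:

    # Hypothetical sketch of the GGUFManager concept; the real class wraps
    # gguf.GGUFWriter with a richer interface and avoids buffering everything.
    import numpy as np
    import gguf

    class GGUFManager:
        def __init__(self, path_fmt: str, arch: str, max_tensors_per_shard: int):
            # path_fmt is e.g. "/path/to/outfile-{:05d}-of-{:05d}.gguf"
            self.path_fmt = path_fmt
            self.arch = arch
            self.max_tensors = max_tensors_per_shard
            self.tensors: list[tuple[str, np.ndarray]] = []

        def add_tensor(self, name: str, data: np.ndarray) -> None:
            # Buffer tensors so the shard plan is known before any file is written
            # (buffered in memory here only for brevity).
            self.tensors.append((name, data))

        def write_all(self) -> None:
            groups = [self.tensors[i:i + self.max_tensors]
                      for i in range(0, len(self.tensors), self.max_tensors)]
            for i, group in enumerate(groups, start=1):
                writer = gguf.GGUFWriter(self.path_fmt.format(i, len(groups)), self.arch)
                for name, data in group:
                    writer.add_tensor(name, data)
                writer.write_header_to_file()
                writer.write_kv_data_to_file()
                writer.write_tensors_to_file()
                writer.close()

The point of the design is that each Model keeps a single writer-like object, and only that object knows it is producing several files.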

@christianazinn (Contributor Author) commented:

I'll need to implement for convert-llama-ggml-to-gguf.py and convert-persimmon-to-gguf.py soon - what are some models that require those scripts for conversion, so I can test? Also, I see convert-lora-to-ggml.py doesn't even use GGUFWriter - is that just for converting LoRA adapters? Is that something we should even add splitting for, considering the small size of LoRA adapters?

Anyway, GGUFManager is implemented as a near drop-in replacement for GGUFWriter that supports file splitting, so far only in convert.py (migrated from my previous commits); support for convert-hf-to-gguf.py is next up.

@slaren (Collaborator) commented Apr 28, 2024

convert-llama-ggml-to-gguf.py is for conversion of pre-gguf models. At this point it could be removed. convert-lora-to-ggml.py doesn't export to gguf format. convert-persimmon-to-gguf.py should probably be integrated into convert-hf-to-gguf.py, but I don't think it needs to be updated.

@christianazinn (Contributor Author) commented:

Got it - will only implement for convert-hf-to-gguf.py. Remind me to watch memory usage while converting. Since I'm making changes to the gguf package, how will I push those?

@slaren (Collaborator) commented Apr 29, 2024

You can modify the gguf package in the gguf-py directory in this repository. There are instructions for publishing new releases in https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md.

@christianazinn (Contributor Author) commented:

> You can modify the gguf package in the gguf-py directory in this repository

That's what I've been doing so far; will check out instructions to contribute, thanks!

@christianazinn (Contributor Author) commented:

Testing on Mistral 7B Instruct, this branch's convert.py uses approximately the same amount of memory as master's. Will need to check on larger models, since the discrepancy was around 6% (3.6G vs. 3.4G at peak). Obviously memory plays a major role in splitting larger files, which is the entire point of this PR.
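
For anyone reproducing this comparison, one way to capture peak memory of a conversion run (assuming GNU time is available; this is not part of the PR itself) is:

/usr/bin/time -v python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 4G

The "Maximum resident set size" line in its report gives the peak memory in kilobytes.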

@christianazinn (Contributor Author) commented:

Running tests on my side for all convert-hf-to-gguf.py supported model architectures. What models fall under QWenLMHeadModel - is that just plain QWen 1?

@christianazinn (Contributor Author) commented May 2, 2024

Will keep track of tests here as I go. Picking one model from each architecture in convert-hf-to-gguf.py as it exists in my branch and testing; will need assistance testing, say, vision models, which I'm not as familiar with. Also note that I went with smaller models to test the architecture; larger models should act the same, but again, tests will be needed.

It also seems like the current convert-hf-to-gguf.py doesn't print tensor status as it goes, which I intend to change.

  • GPTNeoX: EleutherAI/gpt-neox-20b - FAILED LOADING with "unknown architecture" (failed on master as well)
  • Bloom: bigscience/bloom-7b1 - WORKS
  • MPT: mosaicml/mpt-7b - WORKS
  • Orion: OrionStarAI/Orion-14B-Chat - WORKS
  • Baichuan: baichuan-inc/Baichuan2-7B-Chat - WORKS
  • Xverse: xverse/XVERSE-7B-Chat - FAILED CONVERSION with "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 78 column 3" (failed on master as well)
  • Falcon: tiiuae/falcon-7b-instruct - WORKS
  • GPTBigCode: bigcode/gpt_bigcode-santacoder - FAILED LOADING with "tensor output.weight not found" (failed on master as well)
  • GPTRefact: smallcloudai/Refact-1_6B-fim - WORKS (incoherent code but I assume that's what it's used for)
  • Persimmon: adept/persimmon-8b-chat - Strictly "WORKS" but is incoherent - I assume this has to do with prompt formatting on master so I won't look further. It loads and generates.
  • StableLM: stabilityai/stablelm-2-1_6b-chat - WORKS
  • Mistral: mistralai/Mistral-7B-Instruct-v0.2
  • Llama2: meta-llama/Llama-2-7b-chat-hf
  • DBRX: databricks/dbrx-instruct
  • MiniCPM: openbmb/MiniCPM-V-2
  • Qwen1: Qwen/Qwen-1_8B
  • Qwen2: Qwen/Qwen1.5-1.8B
  • Qwen MoE: Qwen/Qwen1.5-MoE-A2.7B-Chat
  • GPT2: openai-community/gpt2
  • Phi2: microsoft/phi-2
  • Phi3: microsoft/Phi-3-mini-4k-instruct
  • Plamo: pfnet/plamo-13b-instruct
  • CodeShell: WisdomShell/CodeShell-7B-Chat
  • InternLM: internlm/internlm2-chat-7b
  • BERT: avsolatorio/GIST-Embedding-v0
  • NomicBERT: nomic-ai/nomic-embed-text-v1.5
  • Gemma: google/gemma-1.1-2b-it
  • StarCoder2: bigcode/starcoder2-3b
  • Mamba: TRI-ML/mamba-7b-rw
  • Cohere: CohereForAI/c4ai-command-r-v01
  • OLMo: allenai/OLMo-7B-Instruct

@christianazinn (Contributor Author) commented:

Leaving a note for myself to watch merge conflicts with #6511. Development on this branch has slowed down as I'm pretty busy.

@christianazinn (Contributor Author) commented:

Noting time to convert baichuan-inc/Baichuan2-7B-Chat.

New branch, --split, --split-max-size 4G:
real 6m27.788s
user 1m15.914s
sys 0m46.017s

New branch, no split:
real 7m17.661s
user 1m18.516s
sys 0m44.285s

master:
real 5m57.387s
user 1m14.567s
sys 0m48.403s

Note that these conversions were done writing the outfile over 2.5GbE, so there was considerable time spent just saving the file. Will test more later, but it doesn't seem like the change increases conversion time too significantly.

@mofosyne added the "review complexity : medium", "python", and "enhancement" labels May 9, 2024
@mofosyne (Collaborator) commented May 9, 2024

Merge attempted. Some lines were ambiguous, so @christianazinn should look this over to make sure the intent is still correct.

@christianazinn (Contributor Author) commented May 9, 2024

I'll check in a few hours and fix conflicts.

@christianazinn (Contributor Author) commented:

The new get-vocab-base-pre functionality introduced to convert-hf-to-gguf.py by #6920 is throwing me off, but things look fine for the most part. Push incoming for conflict resolution; testing on Refact for convert-hf-to-gguf.py worked and no fundamental changes are required to convert.py. This will remain approximately dormant for another two weeks or so while I focus on finals, but since the code is already almost all implemented, if other people want to pick up and take this PR to the finish line I'd more than appreciate it.

@mofosyne added the "help wanted" label May 10, 2024