Add safe tensor support to convert-llama.py #52

Conversation

DifferentialityDevelopment
Contributor

I haven't updated the other model conversion scripts yet, but this allows you to convert any Llama model that uses safetensors.

@b4rtaz
Owner

b4rtaz commented May 14, 2024

Please also update docs/LLAMA.md.

Comment on lines 203 to 206
if '/' in modelPath:
    modelName = modelPath.split('/')[-1]
else:
    modelName = modelPath.split('\\')[-1]
Owner


I think the os.path.basename function would be a better way to extract the filename.
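For example (just a sketch, using the modelPath variable from the diff above):

    import os

    # basename follows the separator rules of the platform the script runs on
    # (both '\\' and '/' on Windows), so the manual '/' check can be dropped.
    modelName = os.path.basename(modelPath)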

@DifferentialityDevelopment
Contributor Author

> Please also update docs/LLAMA.md.

I updated the usage section a bit, though it could probably also mention that this works with the Hugging Face repo for Llama.

@b4rtaz
Owner

b4rtaz commented May 15, 2024

@DifferentialityDevelopment I'm wondering about this part:

        with safetensors.safe_open(model_file, framework="pt") as f:
            for layer in f.keys():
                layers.append({
                    "name" : layer,
                    "file" : model_file
                })

Are you sure that the source model has all layers in the correct order that is expected by Distributed Llama?

@DifferentialityDevelopment
Contributor Author

> @DifferentialityDevelopment I'm wondering about this part:
>
>         with safetensors.safe_open(model_file, framework="pt") as f:
>             for layer in f.keys():
>                 layers.append({
>                     "name" : layer,
>                     "file" : model_file
>                 })
>
> Are you sure that the source model has all layers in the correct order that is expected by Distributed Llama?

Did not check yet; I will do a full convert of Llama-3 8B Instruct, test it with Distributed Llama, and report back.

@DifferentialityDevelopment
Contributor Author

The convert process itself does seem to work fine, but I will test once it finishes:

python converter/convert-llama.py J:\Llama-3\Meta-Llama-3-8B-Instruct J:\Llama-3\Meta-Llama-3-8B-Instruct-Distributed q40
Model name: Meta-Llama-3-8B-Instruct
Target float type: q40
Target file: dllama_meta-llama-3-8b-instruct_q40.bin
Total layers: 291
Total chunks: 7
Unknown header key: head_size
{'head_size': 128.0, 'n_layers': 32, 'n_heads': 32, 'n_kv_heads': 8, 'max_seq_len': 8192, 'rope_theta': 500000, 'arch_type': 11259136, 'n_experts': 0, 'n_active_experts': 0}
💿 Chunking model 1/7...
Loading tensors for model.embed_tokens.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.embed_tokens.weight torch.Size([128256, 4096])...
Saved q40 tensor in 123.95s, 295501824 bytes
Loading tensors for model.layers.0.input_layernorm.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.input_layernorm.weight torch.Size([4096])...
Saved q40 tensor in 0.00s, 2304 bytes
Loading tensors for model.layers.0.mlp.down_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.mlp.down_proj.weight torch.Size([4096, 14336])...
Saved q40 tensor in 14.69s, 33030144 bytes
Loading tensors for model.layers.0.mlp.gate_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.mlp.gate_proj.weight torch.Size([14336, 4096])...
Saved q40 tensor in 14.96s, 33030144 bytes
Loading tensors for model.layers.0.mlp.up_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.mlp.up_proj.weight torch.Size([14336, 4096])...
Saved q40 tensor in 14.95s, 33030144 bytes
Loading tensors for model.layers.0.post_attention_layernorm.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.post_attention_layernorm.weight torch.Size([4096])...
Saved q40 tensor in 0.00s, 2304 bytes
Loading tensors for model.layers.0.self_attn.k_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.self_attn.k_proj.weight torch.Size([1024, 4096])...
Saved q40 tensor in 1.08s, 2359296 bytes
Loading tensors for model.layers.0.self_attn.o_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.self_attn.o_proj.weight torch.Size([4096, 4096])...
Saved q40 tensor in 4.37s, 9437184 bytes
Loading tensors for model.layers.0.self_attn.q_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.self_attn.q_proj.weight torch.Size([4096, 4096])...
Saved q40 tensor in 4.27s, 9437184 bytes
Loading tensors for model.layers.0.self_attn.v_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.0.self_attn.v_proj.weight torch.Size([1024, 4096])...
Saved q40 tensor in 1.05s, 2359296 bytes
Loading tensors for model.layers.1.input_layernorm.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.1.input_layernorm.weight torch.Size([4096])...
Saved q40 tensor in 0.00s, 2304 bytes
Loading tensors for model.layers.1.mlp.down_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.1.mlp.down_proj.weight torch.Size([4096, 14336])...
Saved q40 tensor in 14.91s, 33030144 bytes
Loading tensors for model.layers.1.mlp.gate_proj.weight from: model-00001-of-00004.safetensors
🔶 Exporting model.layers.1.mlp.gate_proj.weight torch.Size([14336, 4096])...
Saved q40 tensor in 14.76s, 33030144 bytes

@b4rtaz
Owner

b4rtaz commented May 15, 2024

Please also consider that some models may have a different layer order for some reason.

@DifferentialityDevelopment
Contributor Author

> Please also consider that some models may have a different layer order for some reason.

I would think the order of the keys when loading a .safetensors model is the same as from the .pth file, but I could be wrong; I will do a bit of research.

@DifferentialityDevelopment
Contributor Author

You're absolutely right, the layers are not necessarily in the right order; see the output of their keys below, where I noticed that layer 9 only appears after layer 20.
So I will need to fix the ordering.
I'm not entirely sure where to place lm_head.weight and model.norm.weight; they appear near the end of the list.
The other thing I'm having trouble with is that I'm not sure which of the layers is the feed_forward layer, which is what the .pth conversion uses to get the hidden_dim size.
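For the hidden_dim part, a rough sketch of what I have in mind (untested; it assumes mlp.gate_proj plays the role of feed_forward.w1, so its first dimension is the feed-forward hidden size, and that model_file is the shard that contains layer 0):

    import safetensors

    # Rough sketch: if mlp.gate_proj corresponds to feed_forward.w1,
    # its shape is [hidden_dim, dim].
    with safetensors.safe_open(model_file, framework="pt") as f:
        gateProj = f.get_tensor('model.layers.0.mlp.gate_proj.weight')
        hiddenDim = gateProj.shape[0]  # 14336 for Llama-3 8B
        dim = gateProj.shape[1]        # 4096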

Additionally, they use a different naming convention, so I had to change a few more things.
Is this correct:
[safetensor] model.embed_tokens.weight -> [pth] tok_embeddings.weight
[safetensor] model.layers.0.mlp.gate_proj.weight -> [pth] layers.0.feed_forward.w1.weight
[safetensor] model.layers.0.mlp.up_proj.weight -> [pth] layers.0.feed_forward.w2.weight
[safetensor] model.layers.0.post_attention_layernorm.weight -> [pth] layers.0.attention_norm.weight
[safetensor] model.norm.weight -> [pth] norm.weight

Keys:

model.embed_tokens.weight => 128256
model.layers.0.input_layernorm.weight => 4096
model.layers.0.mlp.down_proj.weight => 4096
model.layers.0.mlp.gate_proj.weight => 14336
model.layers.0.mlp.up_proj.weight => 14336
model.layers.0.post_attention_layernorm.weight => 4096
model.layers.0.self_attn.k_proj.weight => 1024
model.layers.0.self_attn.o_proj.weight => 4096
model.layers.0.self_attn.q_proj.weight => 4096
model.layers.0.self_attn.v_proj.weight => 1024
model.layers.1.input_layernorm.weight => 4096
model.layers.1.mlp.down_proj.weight => 4096
model.layers.1.mlp.gate_proj.weight => 14336
model.layers.1.mlp.up_proj.weight => 14336
model.layers.1.post_attention_layernorm.weight => 4096
model.layers.1.self_attn.k_proj.weight => 1024
model.layers.1.self_attn.o_proj.weight => 4096
model.layers.1.self_attn.q_proj.weight => 4096
model.layers.1.self_attn.v_proj.weight => 1024
model.layers.2.input_layernorm.weight => 4096
model.layers.2.mlp.down_proj.weight => 4096
model.layers.2.mlp.gate_proj.weight => 14336
model.layers.2.mlp.up_proj.weight => 14336
model.layers.2.post_attention_layernorm.weight => 4096
model.layers.2.self_attn.k_proj.weight => 1024
model.layers.2.self_attn.o_proj.weight => 4096
model.layers.2.self_attn.q_proj.weight => 4096
model.layers.2.self_attn.v_proj.weight => 1024
model.layers.3.input_layernorm.weight => 4096
model.layers.3.mlp.down_proj.weight => 4096
model.layers.3.mlp.gate_proj.weight => 14336
model.layers.3.mlp.up_proj.weight => 14336
model.layers.3.post_attention_layernorm.weight => 4096
model.layers.3.self_attn.k_proj.weight => 1024
model.layers.3.self_attn.o_proj.weight => 4096
model.layers.3.self_attn.q_proj.weight => 4096
model.layers.3.self_attn.v_proj.weight => 1024
model.layers.4.input_layernorm.weight => 4096
model.layers.4.mlp.down_proj.weight => 4096
model.layers.4.mlp.gate_proj.weight => 14336
model.layers.4.mlp.up_proj.weight => 14336
model.layers.4.post_attention_layernorm.weight => 4096
model.layers.4.self_attn.k_proj.weight => 1024
model.layers.4.self_attn.o_proj.weight => 4096
model.layers.4.self_attn.q_proj.weight => 4096
model.layers.4.self_attn.v_proj.weight => 1024
model.layers.5.input_layernorm.weight => 4096
model.layers.5.mlp.down_proj.weight => 4096
model.layers.5.mlp.gate_proj.weight => 14336
model.layers.5.mlp.up_proj.weight => 14336
model.layers.5.post_attention_layernorm.weight => 4096
model.layers.5.self_attn.k_proj.weight => 1024
model.layers.5.self_attn.o_proj.weight => 4096
model.layers.5.self_attn.q_proj.weight => 4096
model.layers.5.self_attn.v_proj.weight => 1024
model.layers.6.input_layernorm.weight => 4096
model.layers.6.mlp.down_proj.weight => 4096
model.layers.6.mlp.gate_proj.weight => 14336
model.layers.6.mlp.up_proj.weight => 14336
model.layers.6.post_attention_layernorm.weight => 4096
model.layers.6.self_attn.k_proj.weight => 1024
model.layers.6.self_attn.o_proj.weight => 4096
model.layers.6.self_attn.q_proj.weight => 4096
model.layers.6.self_attn.v_proj.weight => 1024
model.layers.7.input_layernorm.weight => 4096
model.layers.7.mlp.down_proj.weight => 4096
model.layers.7.mlp.gate_proj.weight => 14336
model.layers.7.mlp.up_proj.weight => 14336
model.layers.7.post_attention_layernorm.weight => 4096
model.layers.7.self_attn.k_proj.weight => 1024
model.layers.7.self_attn.o_proj.weight => 4096
model.layers.7.self_attn.q_proj.weight => 4096
model.layers.7.self_attn.v_proj.weight => 1024
model.layers.8.input_layernorm.weight => 4096
model.layers.8.mlp.down_proj.weight => 4096
model.layers.8.mlp.gate_proj.weight => 14336
model.layers.8.mlp.up_proj.weight => 14336
model.layers.8.post_attention_layernorm.weight => 4096
model.layers.8.self_attn.k_proj.weight => 1024
model.layers.8.self_attn.o_proj.weight => 4096
model.layers.8.self_attn.q_proj.weight => 4096
model.layers.8.self_attn.v_proj.weight => 1024
model.layers.10.input_layernorm.weight => 4096
model.layers.10.mlp.down_proj.weight => 4096
model.layers.10.mlp.gate_proj.weight => 14336
model.layers.10.mlp.up_proj.weight => 14336
model.layers.10.post_attention_layernorm.weight => 4096
model.layers.10.self_attn.k_proj.weight => 1024
model.layers.10.self_attn.o_proj.weight => 4096
model.layers.10.self_attn.q_proj.weight => 4096
model.layers.10.self_attn.v_proj.weight => 1024
model.layers.11.input_layernorm.weight => 4096
model.layers.11.mlp.down_proj.weight => 4096
model.layers.11.mlp.gate_proj.weight => 14336
model.layers.11.mlp.up_proj.weight => 14336
model.layers.11.post_attention_layernorm.weight => 4096
model.layers.11.self_attn.k_proj.weight => 1024
model.layers.11.self_attn.o_proj.weight => 4096
model.layers.11.self_attn.q_proj.weight => 4096
model.layers.11.self_attn.v_proj.weight => 1024
model.layers.12.input_layernorm.weight => 4096
model.layers.12.mlp.down_proj.weight => 4096
model.layers.12.mlp.gate_proj.weight => 14336
model.layers.12.mlp.up_proj.weight => 14336
model.layers.12.post_attention_layernorm.weight => 4096
model.layers.12.self_attn.k_proj.weight => 1024
model.layers.12.self_attn.o_proj.weight => 4096
model.layers.12.self_attn.q_proj.weight => 4096
model.layers.12.self_attn.v_proj.weight => 1024
model.layers.13.input_layernorm.weight => 4096
model.layers.13.mlp.down_proj.weight => 4096
model.layers.13.mlp.gate_proj.weight => 14336
model.layers.13.mlp.up_proj.weight => 14336
model.layers.13.post_attention_layernorm.weight => 4096
model.layers.13.self_attn.k_proj.weight => 1024
model.layers.13.self_attn.o_proj.weight => 4096
model.layers.13.self_attn.q_proj.weight => 4096
model.layers.13.self_attn.v_proj.weight => 1024
model.layers.14.input_layernorm.weight => 4096
model.layers.14.mlp.down_proj.weight => 4096
model.layers.14.mlp.gate_proj.weight => 14336
model.layers.14.mlp.up_proj.weight => 14336
model.layers.14.post_attention_layernorm.weight => 4096
model.layers.14.self_attn.k_proj.weight => 1024
model.layers.14.self_attn.o_proj.weight => 4096
model.layers.14.self_attn.q_proj.weight => 4096
model.layers.14.self_attn.v_proj.weight => 1024
model.layers.15.input_layernorm.weight => 4096
model.layers.15.mlp.down_proj.weight => 4096
model.layers.15.mlp.gate_proj.weight => 14336
model.layers.15.mlp.up_proj.weight => 14336
model.layers.15.post_attention_layernorm.weight => 4096
model.layers.15.self_attn.k_proj.weight => 1024
model.layers.15.self_attn.o_proj.weight => 4096
model.layers.15.self_attn.q_proj.weight => 4096
model.layers.15.self_attn.v_proj.weight => 1024
model.layers.16.input_layernorm.weight => 4096
model.layers.16.mlp.down_proj.weight => 4096
model.layers.16.mlp.gate_proj.weight => 14336
model.layers.16.mlp.up_proj.weight => 14336
model.layers.16.post_attention_layernorm.weight => 4096
model.layers.16.self_attn.k_proj.weight => 1024
model.layers.16.self_attn.o_proj.weight => 4096
model.layers.16.self_attn.q_proj.weight => 4096
model.layers.16.self_attn.v_proj.weight => 1024
model.layers.17.input_layernorm.weight => 4096
model.layers.17.mlp.down_proj.weight => 4096
model.layers.17.mlp.gate_proj.weight => 14336
model.layers.17.mlp.up_proj.weight => 14336
model.layers.17.post_attention_layernorm.weight => 4096
model.layers.17.self_attn.k_proj.weight => 1024
model.layers.17.self_attn.o_proj.weight => 4096
model.layers.17.self_attn.q_proj.weight => 4096
model.layers.17.self_attn.v_proj.weight => 1024
model.layers.18.input_layernorm.weight => 4096
model.layers.18.mlp.down_proj.weight => 4096
model.layers.18.mlp.gate_proj.weight => 14336
model.layers.18.mlp.up_proj.weight => 14336
model.layers.18.post_attention_layernorm.weight => 4096
model.layers.18.self_attn.k_proj.weight => 1024
model.layers.18.self_attn.o_proj.weight => 4096
model.layers.18.self_attn.q_proj.weight => 4096
model.layers.18.self_attn.v_proj.weight => 1024
model.layers.19.input_layernorm.weight => 4096
model.layers.19.mlp.down_proj.weight => 4096
model.layers.19.mlp.gate_proj.weight => 14336
model.layers.19.mlp.up_proj.weight => 14336
model.layers.19.post_attention_layernorm.weight => 4096
model.layers.19.self_attn.k_proj.weight => 1024
model.layers.19.self_attn.o_proj.weight => 4096
model.layers.19.self_attn.q_proj.weight => 4096
model.layers.19.self_attn.v_proj.weight => 1024
model.layers.20.mlp.gate_proj.weight => 14336
model.layers.20.self_attn.k_proj.weight => 1024
model.layers.20.self_attn.o_proj.weight => 4096
model.layers.20.self_attn.q_proj.weight => 4096
model.layers.20.self_attn.v_proj.weight => 1024
model.layers.9.input_layernorm.weight => 4096
model.layers.9.mlp.down_proj.weight => 4096
model.layers.9.mlp.gate_proj.weight => 14336
model.layers.9.mlp.up_proj.weight => 14336
model.layers.9.post_attention_layernorm.weight => 4096
model.layers.9.self_attn.k_proj.weight => 1024
model.layers.9.self_attn.o_proj.weight => 4096
model.layers.9.self_attn.q_proj.weight => 4096
model.layers.9.self_attn.v_proj.weight => 1024
model.layers.20.input_layernorm.weight => 4096
model.layers.20.mlp.down_proj.weight => 4096
model.layers.20.mlp.up_proj.weight => 14336
model.layers.20.post_attention_layernorm.weight => 4096
model.layers.21.input_layernorm.weight => 4096
model.layers.21.mlp.down_proj.weight => 4096
model.layers.21.mlp.gate_proj.weight => 14336
model.layers.21.mlp.up_proj.weight => 14336
model.layers.21.post_attention_layernorm.weight => 4096
model.layers.21.self_attn.k_proj.weight => 1024
model.layers.21.self_attn.o_proj.weight => 4096
model.layers.21.self_attn.q_proj.weight => 4096
model.layers.21.self_attn.v_proj.weight => 1024
model.layers.22.input_layernorm.weight => 4096
model.layers.22.mlp.down_proj.weight => 4096
model.layers.22.mlp.gate_proj.weight => 14336
model.layers.22.mlp.up_proj.weight => 14336
model.layers.22.post_attention_layernorm.weight => 4096
model.layers.22.self_attn.k_proj.weight => 1024
model.layers.22.self_attn.o_proj.weight => 4096
model.layers.22.self_attn.q_proj.weight => 4096
model.layers.22.self_attn.v_proj.weight => 1024
model.layers.23.input_layernorm.weight => 4096
model.layers.23.mlp.down_proj.weight => 4096
model.layers.23.mlp.gate_proj.weight => 14336
model.layers.23.mlp.up_proj.weight => 14336
model.layers.23.post_attention_layernorm.weight => 4096
model.layers.23.self_attn.k_proj.weight => 1024
model.layers.23.self_attn.o_proj.weight => 4096
model.layers.23.self_attn.q_proj.weight => 4096
model.layers.23.self_attn.v_proj.weight => 1024
model.layers.24.input_layernorm.weight => 4096
model.layers.24.mlp.down_proj.weight => 4096
model.layers.24.mlp.gate_proj.weight => 14336
model.layers.24.mlp.up_proj.weight => 14336
model.layers.24.post_attention_layernorm.weight => 4096
model.layers.24.self_attn.k_proj.weight => 1024
model.layers.24.self_attn.o_proj.weight => 4096
model.layers.24.self_attn.q_proj.weight => 4096
model.layers.24.self_attn.v_proj.weight => 1024
model.layers.25.input_layernorm.weight => 4096
model.layers.25.mlp.down_proj.weight => 4096
model.layers.25.mlp.gate_proj.weight => 14336
model.layers.25.mlp.up_proj.weight => 14336
model.layers.25.post_attention_layernorm.weight => 4096
model.layers.25.self_attn.k_proj.weight => 1024
model.layers.25.self_attn.o_proj.weight => 4096
model.layers.25.self_attn.q_proj.weight => 4096
model.layers.25.self_attn.v_proj.weight => 1024
model.layers.26.input_layernorm.weight => 4096
model.layers.26.mlp.down_proj.weight => 4096
model.layers.26.mlp.gate_proj.weight => 14336
model.layers.26.mlp.up_proj.weight => 14336
model.layers.26.post_attention_layernorm.weight => 4096
model.layers.26.self_attn.k_proj.weight => 1024
model.layers.26.self_attn.o_proj.weight => 4096
model.layers.26.self_attn.q_proj.weight => 4096
model.layers.26.self_attn.v_proj.weight => 1024
model.layers.27.input_layernorm.weight => 4096
model.layers.27.mlp.down_proj.weight => 4096
model.layers.27.mlp.gate_proj.weight => 14336
model.layers.27.mlp.up_proj.weight => 14336
model.layers.27.post_attention_layernorm.weight => 4096
model.layers.27.self_attn.k_proj.weight => 1024
model.layers.27.self_attn.o_proj.weight => 4096
model.layers.27.self_attn.q_proj.weight => 4096
model.layers.27.self_attn.v_proj.weight => 1024
model.layers.28.input_layernorm.weight => 4096
model.layers.28.mlp.down_proj.weight => 4096
model.layers.28.mlp.gate_proj.weight => 14336
model.layers.28.mlp.up_proj.weight => 14336
model.layers.28.post_attention_layernorm.weight => 4096
model.layers.28.self_attn.k_proj.weight => 1024
model.layers.28.self_attn.o_proj.weight => 4096
model.layers.28.self_attn.q_proj.weight => 4096
model.layers.28.self_attn.v_proj.weight => 1024
model.layers.29.input_layernorm.weight => 4096
model.layers.29.mlp.down_proj.weight => 4096
model.layers.29.mlp.gate_proj.weight => 14336
model.layers.29.mlp.up_proj.weight => 14336
model.layers.29.post_attention_layernorm.weight => 4096
model.layers.29.self_attn.k_proj.weight => 1024
model.layers.29.self_attn.o_proj.weight => 4096
model.layers.29.self_attn.q_proj.weight => 4096
model.layers.29.self_attn.v_proj.weight => 1024
model.layers.30.input_layernorm.weight => 4096
model.layers.30.mlp.down_proj.weight => 4096
model.layers.30.mlp.gate_proj.weight => 14336
model.layers.30.mlp.up_proj.weight => 14336
model.layers.30.post_attention_layernorm.weight => 4096
model.layers.30.self_attn.k_proj.weight => 1024
model.layers.30.self_attn.o_proj.weight => 4096
model.layers.30.self_attn.q_proj.weight => 4096
model.layers.30.self_attn.v_proj.weight => 1024
model.layers.31.mlp.gate_proj.weight => 14336
model.layers.31.mlp.up_proj.weight => 14336
model.layers.31.self_attn.k_proj.weight => 1024
model.layers.31.self_attn.o_proj.weight => 4096
model.layers.31.self_attn.q_proj.weight => 4096
model.layers.31.self_attn.v_proj.weight => 1024
lm_head.weight => 128256
model.layers.31.input_layernorm.weight => 4096
model.layers.31.mlp.down_proj.weight => 4096
model.layers.31.post_attention_layernorm.weight => 4096
model.norm.weight => 4096
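
One way I'm thinking of fixing the ordering (rough sketch, not tested; allKeys here stands for the tensor names collected from all shards): sort numerically by layer index instead of relying on the order safe_open returns, and give the non-layer tensors fixed positions:

    import re

    def layerSortKey(name):
        # Embeddings first, numbered layers by index, lm_head / final norm last.
        # Where exactly lm_head.weight and model.norm.weight must go still has
        # to match what Distributed Llama expects.
        if name == 'model.embed_tokens.weight':
            return (0, 0, name)
        m = re.match(r'model\.layers\.(\d+)\.', name)
        if m:
            return (1, int(m.group(1)), name)
        return (2, 0, name)

    orderedKeys = sorted(allKeys, key=layerSortKey)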

@b4rtaz
Owner

b4rtaz commented May 15, 2024

I recommend using the same approach as in the convert_pth method: build a list of layer names and pass it to the loop. BTW, this loop could be extracted from the two functions.
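Something along these lines, for example (just a sketch, not tested; the per-layer order below is illustrative and must be kept in sync with what convert_pth exports, and nLayers would come from the model config):

    def buildLayerNames(nLayers):
        # Build the tensor names in the order the converter should export them,
        # then one shared loop can iterate over this list for both the .pth
        # and safetensors paths.
        names = ['model.embed_tokens.weight']
        for i in range(nLayers):
            for suffix in ('input_layernorm', 'self_attn.q_proj', 'self_attn.k_proj',
                           'self_attn.v_proj', 'self_attn.o_proj', 'mlp.gate_proj',
                           'mlp.down_proj', 'mlp.up_proj', 'post_attention_layernorm'):
                names.append(f'model.layers.{i}.{suffix}.weight')
        names += ['model.norm.weight', 'lm_head.weight']
        return names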

@b4rtaz
Owner

b4rtaz commented May 24, 2024

@DifferentialityDevelopment I'm closing this pull request. The convert-hf.py script introduced in version 0.7.0 supports the safetensors format and three model types.

@b4rtaz b4rtaz closed this May 24, 2024