
GGUF, DPO, packing + more

@danielhanchen released this 18 Jan 18:28 · b8b1eaf

Upgrade Unsloth via `pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git`. No dependencies will be updated.

  1. 6x faster GGUF conversion, plus support for merging QLoRA adapters to float16:
# To merge to 16bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_16bit")
# To merge to 4bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_4bit")
# To save to GGUF:
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "f16")
# All methods supported (listed below)

To push to the Hugging Face Hub:

model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_4bit")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
  2. 4x faster model downloading and >= 500MB less GPU memory fragmentation via pre-quantized models (a loading sketch follows the list of models):
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
  3. `packing = True` support, making training 5x faster via TRL (see the training sketch after this list).
  4. DPO support! 188% faster DPO training and no OOMs (see the DPO sketch after this list).
  5. Dropout and bias support for LoRA.
  6. RSLoRA (rank-stabilized LoRA) and LoftQ support.
  7. LLaMA-Factory support as a UI - https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison
  8. Tonnes of bug fixes.
  9. And if you can, please support our work via Ko-fi! https://ko-fi.com/unsloth
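
As referenced above, here is a rough end-to-end sketch of the new LoRA options plus packing. It assumes `model` and `tokenizer` were loaded as in the earlier sketch and that `dataset` has a "text" column; parameter names follow `FastLanguageModel.get_peft_model` and TRL's `SFTTrainer`, and the values are illustrative rather than recommendations:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,     # dropout is now supported (0 remains the fastest path)
    bias = "none",        # LoRA bias is now supported ("none" remains the fastest path)
    use_rslora = False,   # set True to enable rank-stabilized LoRA
    loftq_config = None,  # pass a LoftQ config to enable LoftQ initialization
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    packing = True,       # pack short examples together for up to 5x faster training
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        max_steps = 60,
        output_dir = "outputs",
    ),
)
trainer.train()

And a minimal DPO sketch, again with illustrative values, assuming `dpo_dataset` has prompt / chosen / rejected columns:

from trl import DPOTrainer
from transformers import TrainingArguments

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,     # with a LoRA/PEFT model, TRL reuses the base weights as the reference
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        max_steps = 60,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dpo_dataset,  # columns: prompt, chosen, rejected
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()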

GGUF:

Choose one of the following for `quantization_method`:
"not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
"f32"     : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16"     : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0"    : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k"    : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l"  : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m"  : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s"  : "Uses Q3_K for all tensors",
"q4_0"    : "Original quant method, 4-bit.",
"q4_1"    : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s"  : "Uses Q4_K for all tensors",
"q5_0"    : "Higher accuracy, higher resource usage and slower inference.",
"q5_1"    : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s"  : "Uses Q5_K for all tensors",
"q6_k"    : "Uses Q8_K for all tensors",