
llamafile v0.8

@jart released this 24 Apr 22:05 · 89 commits to main since this release · 82f87bd

[line drawing of a llama's head in front of a slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in November 2023. It offers superior performance and binary portability to the stock installs of six operating systems, with no installation required. For some use cases, such as CPU prompt evaluation, llamafile runs 2x faster than llama.cpp and 25x faster than ollama. It includes a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.
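
As a hedged illustration of the OpenAI API compatible server (the file name and the default port 8080 below are assumptions; check the README for your particular llamafile), a typical session looks something like this:

    # Mark the downloaded llamafile as executable (the file name is a placeholder),
    # then run it; by default it starts the web GUI and API server.
    chmod +x Meta-Llama-3-8B-Instruct.llamafile
    ./Meta-Llama-3-8B-Instruct.llamafile

    # From another terminal, query the OpenAI-compatible chat endpoint.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'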

This release further improves performance and introduces support for new models.

  • Support for LLaMA3 is now available
  • Support for Grok has been introduced
  • Support for Mixtral 8x22b has been introduced
  • Support for Command-R models has been introduced
  • MoE models (e.g. Mixtral, Grok) now go 2-5x faster on CPU 4db03a1
  • F16 is now 20% faster on Raspberry Pi 5 (TinyLLaMA 1.1b prompt eval improved 62 -> 75 tok/sec)
  • F16 is now 30% faster on Skylake (TinyLLaMA 1.1b prompt eval improved 171 -> 219 tok/sec)
  • F16 is now 60% faster on Apple M2 (Mistral 7b prompt eval improved 79 -> 128 tok/sec)
  • Added the ability to override the chat template in the web GUI when creating llamafiles da5cbe4
  • Improved markdown and syntax highlighting in the server (#88)
  • CPU feature detection has been improved

Downloads

You can download prebuilt llamafiles from:

Errata

  • The new web GUI chat template override feature isn't working as intended. If you want to use LLaMA3 8B, you need to manually copy and paste the chat templates from our README into the llamafile web GUI; a rough sketch of the LLaMA3 format appears after this list.
  • The llamafile-quantize program may fail with an assertion error when K-quantizing weights from an F32-converted file. You can work around this by asking llama.cpp's convert.py script to output an FP16 GGUF file and then running llamafile-quantize on that instead, as shown in the example after this list.
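
For reference, the LLaMA3 instruct chat format has roughly the following shape. This is an illustrative sketch only (the {{prompt}} placeholder is a stand-in, not necessarily the syntax the web GUI expects); prefer the exact templates published in the README:

    <|start_header_id|>system<|end_header_id|>

    You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

    {{prompt}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>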
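
A minimal sketch of that workaround, assuming a model checkout in ./model and the Q4_K_M quantization type (both the path and the quant type here are placeholders):

    # Convert the original weights directly to FP16 rather than F32.
    python3 convert.py ./model --outtype f16 --outfile model.f16.gguf

    # K-quantize the FP16 GGUF; this path avoids the assertion error.
    ./llamafile-quantize model.f16.gguf model.Q4_K_M.gguf Q4_K_M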