
llamafile v0.8

@jart released this 24 Apr 22:05 · 89 commits to main since this release · 82f87bd

[line drawing of a llama's head in front of a slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in November 2023. It offers superior performance and binary portability to the stock installs of six operating systems, with no installation required. For some use cases, such as CPU prompt evaluation, llamafile runs 2x faster than llama.cpp and 25x faster than ollama. It includes a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.
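
As a hedged illustration of the OpenAI API compatible server (the file name and the default port 8080 below are assumptions; check the README for your particular llamafile), a typical session looks something like this:

    # Mark the downloaded llamafile as executable (the file name is a placeholder),
    # then run it; by default it starts the web GUI and API server.
    chmod +x Meta-Llama-3-8B-Instruct.llamafile
    ./Meta-Llama-3-8B-Instruct.llamafile

    # From another terminal, query the OpenAI-compatible chat endpoint.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'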

This release further improves performance and introduces support for new models.

  • Support for LLaMA3 is now available
  • Support for Grok has been introduced
  • Support for Mixtral 8x22b has been introduced
  • Support for Command-R models has been introduced
  • MoE models (e.g. Mixtral, Grok) now go 2-5x faster on CPU 4db03a1
  • F16 is now 20% faster on Raspberry Pi 5 (TinyLLaMA 1.1b prompt eval improved 62 -> 75 tok/sec)
  • F16 is now 30% faster on Skylake (TinyLLaMA 1.1b prompt eval improved 171 -> 219 tok/sec)
  • F16 is now 60% faster on Apple M2 (Mistral 7b prompt eval improved 79 -> 128 tok/sec)
  • Added the ability to override the chat template in the web GUI when creating llamafiles da5cbe4
  • Improved markdown and syntax highlighting in the server (#88)
  • CPU feature detection has been improved

Downloads

You can download prebuilt llamafiles from:

Errata

  • The new web GUI chat template override feature isn't working as intended. If you want to use LLaMA3 8B, you need to manually copy and paste the chat templates from our README into the llamafile web GUI; a rough sketch of the LLaMA3 format appears after this list.
  • The llamafile-quantize program may fail with an assertion error when K-quantizing weights from an F32-converted file. You can work around this by asking llama.cpp's convert.py script to output an FP16 GGUF file and then running llamafile-quantize on that instead, as shown in the example after this list.
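
For reference, the LLaMA3 instruct chat format has roughly the following shape. This is an illustrative sketch only (the {{prompt}} placeholder is a stand-in, not necessarily the syntax the web GUI expects); prefer the exact templates published in the README:

    <|start_header_id|>system<|end_header_id|>

    You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

    {{prompt}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>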
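
A minimal sketch of that workaround, assuming a model checkout in ./model and the Q4_K_M quantization type (both the path and the quant type here are placeholders):

    # Convert the original weights directly to FP16 rather than F32.
    python3 convert.py ./model --outtype f16 --outfile model.f16.gguf

    # K-quantize the FP16 GGUF; this path avoids the assertion error.
    ./llamafile-quantize model.f16.gguf model.Q4_K_M.gguf Q4_K_M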