
More Neural Engine Transformers

Hardware-accelerated transformers on your Mac via CoreML. (Yes, that includes LLMs like GPT.)

🔋 Performance with near-zero CPU usage

🔌 Plug-n-play with preconverted CoreML models

🦍 Support for some of the largest Neural Engine models (up to 2.8B parameters)

🐍 Easy Python access to your Mac's hardware accelerators

Try It

Generate text with a base gpt2 model like this:

❯ ./setup.sh && source env/bin/activate
❯ python3 convert.py
❯ python3 generate.py
Loading model from path gpt2.mlpackage using ComputeUnit.CPU_AND_NE...
Loaded model in 790.604ms.

[Prompt] Before boarding your rocket to Mars, remember to pack these items:
...

That model is tiny—sometimes the results are a bit nonsensical. You can run larger models for better results:

Model        Parameters  Size   Download
gpt2         124M        250MB  link
gpt2-medium  350M        700MB  link
gpt2-large   774M        1.5GB  link
gpt2-xl      1558M       3GB    link
pythia-1b    1011M       2GB    link

You can also see evals/QUALITY.md for some example generations.
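The downloaded files are plain CoreML models, so you can also drive one directly from Python with coremltools. A minimal sketch (the input name and sequence length here are illustrative, not necessarily this repo's actual interface; check model.input_description or generate.py for the real ones):

import numpy as np
import coremltools as ct

# Load a converted model and ask CoreML to prefer the Neural Engine.
model = ct.models.MLModel(
    "gpt2.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

# Hypothetical fixed-length input; real names/shapes depend on the conversion.
tokens = np.zeros((1, 512), dtype=np.int32)
outputs = model.predict({"input_ids": tokens})
print(outputs.keys())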

Why CoreML?

Apple Silicon Macs have custom hardware built for machine learning (✨ the Neural Engine). It's fast and energy efficient, but the only way to use it is through Apple's CoreML framework. This repo makes that easy.

Is it fast?

The gpt2-xl model (1.5B parameters) generates ~5 words/sec (7.5 tokens/sec) running purely on the Neural Engine. Smaller models are faster: each is roughly 2x faster than the next-largest one.
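A rough way to reproduce that kind of number yourself is to time steady-state forward passes (one pass ≈ one token for greedy decoding without a KV cache). A sketch, again with illustrative input names:

import time
import numpy as np
import coremltools as ct

model = ct.models.MLModel(
    "gpt2-xl.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
tokens = np.zeros((1, 512), dtype=np.int32)  # illustrative input

model.predict({"input_ids": tokens})  # warm-up pays one-time load/compile cost

n = 20
start = time.perf_counter()
for _ in range(n):
    model.predict({"input_ids": tokens})
print(f"{n / (time.perf_counter() - start):.1f} tokens/sec")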

See evals/SPEED.md for device benchmarks.

[Demo video: gpt2-xl-demo.mov]

What about iOS?

Smaller models (gpt2, gpt2-medium) should run on most devices. Depending on how much memory the device has, larger models may also work. iOS 17 added support for runtime quantization, which in theory allows larger models on all devices; none of the models in this repo use it yet.
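For reference, one plausible route is coremltools' weight-compression APIs (coremltools 7+). A hedged sketch, not something this repo ships:

import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Quantize an existing model's weights to int8 so the on-device runtime
# can decompress them at load/inference time.
mlmodel = ct.models.MLModel("gpt2-xl.mlpackage")
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric")
)
compressed = linear_quantize_weights(mlmodel, config=config)
compressed.save("gpt2-xl-w8.mlpackage")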

Can it run LLaMa?

Maybe. The smallest official LLaMa model is 4.5x the size of gpt2-xl. With runtime quantization (iOS 17 / macOS Sonoma and later) and a newer device (the M1 seems to have a model size limit of ~4GB), it might be possible.
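Back-of-envelope arithmetic on those sizes (assuming fp16 weights at 2 bytes per parameter, and ~6.7B parameters for the smallest official LLaMa):

# fp16 = 2 bytes/param; 4-bit = 0.5 bytes/param. Sizes in GB.
gpt2_xl  = 1.558e9 * 2 / 1e9    # ~3.1 GB, matches the table above
llama    = 6.7e9 * 2 / 1e9      # ~13.4 GB, about 4.5x gpt2-xl
llama_4b = 6.7e9 * 0.5 / 1e9    # ~3.4 GB, squeaks under a ~4GB M1 limit
print(gpt2_xl, llama, llama_4b)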

Contribute

PRs welcome! New models ☑️ Fixing bugs ☑️ Speedups ☑️

Thanks

This project stitches together several previously open-sourced tools. Thanks y’all.