Running and orchestrating large language models (LLMs) on Kubernetes with macOS nodes.
To run and orchestrate LLMs on Kubernetes with macOS nodes, we need the following components:
- Virtual Kubelet: For running pods on macOS nodes (forked from virtual-kubelet/cri).
- Containerd: For pulling and running the Ollama LLM image (forked from containerd/containerd).
- Runm: A lightweight runtime derived from llama.cpp for running LLMs on macOS nodes (source code will be available soon); a sketch of how it would register with containerd follows this list.
- Bronze Willow: CNI Plugin for macOS (source code will be available soon).
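Runm is wired into containerd through the ShimV2 interface (see the abstract below). Since the runtime is not published yet, the following config.toml excerpt is only a sketch of how such a shim would conventionally be registered; the runtime name runm and the type io.containerd.runm.v2 are assumptions, not the project's actual configuration.

$ cat /etc/containerd/config.toml
version = 2
# Sketch: register a hypothetical "runm" ShimV2 runtime with the CRI plugin.
# By ShimV2 convention, this runtime_type resolves to a
# containerd-shim-runm-v2 binary on $PATH.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runm]
  runtime_type = "io.containerd.runm.v2"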
This project is inspired by llama.cpp, Ollama and kind.
- A Kubernetes cluster.
- A Mac with an Apple Silicon chip (see the quick check below).
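Both prerequisites can be sanity-checked from the Mac before starting; these commands are purely illustrative:

$ kubectl get nodes   # confirms the cluster is reachable from this machine
$ uname -m            # prints "arm64" on Apple Silicon
arm64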
$ make # optional
$ sudo ./bin/demo create
✓ Starting containerd 🚢
✓ Preparing virtual nodes 📦
✓ Creating network 🌐
$ kubectl get nodes
NAME            STATUS   ROLES           AGE    VERSION
bj-k8s01        Ready    control-plane   214d   v1.28.2
bj-k8s02        Ready    worker          214d   v1.28.2
bj-k8s03        Ready    worker          214d   v1.28.2
weiqiangt-mba   Ready    agent           23d    v1.15.2-vk-cri-fb9cc09-dev
xiaodong-m1     Ready    agent           23d    v1.15.2-vk-cri-fb9cc09-dev
After running the above commands, you should see the macOS nodes appear in the output of `kubectl get nodes`. In the example above, `weiqiangt-mba` and `xiaodong-m1` are the macOS nodes.
$ kubectl apply -f k8s/tinyllama.yml
$ kubectl apply -f k8s/mods.yaml
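The actual contents of k8s/tinyllama.yml are not reproduced here. As a rough sketch, a manifest for serving Ollama on a macOS node might pair a Deployment with the tinyllama-services Service used below; the labels, image, node selector, and toleration key in this sketch are all assumptions:

apiVersion: v1
kind: Service
metadata:
  name: tinyllama-services
spec:
  selector:
    app: tinyllama
  ports:
    - port: 11434        # Ollama's default API port
      targetPort: 11434
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tinyllama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tinyllama
  template:
    metadata:
      labels:
        app: tinyllama
    spec:
      nodeSelector:
        kubernetes.io/os: darwin             # assumed label on the macOS virtual nodes
      tolerations:
        - key: virtual-kubelet.io/provider   # conventional Virtual Kubelet taint; assumed here
          operator: Exists
      containers:
        - name: tinyllama
          image: ollama/ollama               # image assumed
          ports:
            - containerPort: 11434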
# Print the sed command that points the mods config at the tinyllama service.
$ echo "sed -i 's/localhost:11434/$(kubectl get svc tinyllama-services -o json | jq -r '.spec.clusterIP')/g' ~/.config/mods/mods.yml"
sed -i 's/localhost:11434/198.19.50.27/g' ~/.config/mods/mods.yml
# Copy the output.
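Alternatively, the copy-and-paste can be skipped by applying the substitution in one shot from outside the pod. This sketch assumes the config lives at /root/.config/mods/mods.yml inside the container:

# One-shot variant (sketch): edit the mods config without an interactive shell.
$ kubectl exec $(kubectl get pods -l app=mods -o jsonpath='{.items[0].metadata.name}') -- \
    sed -i "s/localhost:11434/$(kubectl get svc tinyllama-services -o jsonpath='{.spec.clusterIP}')/g" \
    /root/.config/mods/mods.yml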
$ kubectl exec -it $(kubectl get pods -l app=mods -o jsonpath='{.items[0].metadata.name}') -- bash
root@mods-deployment-77c464f4b8-zn6g5:/# sed -i 's/localhost:11434/198.19.50.27/g' ~/.config/mods/mods.yml  # the command copied above
root@mods-deployment-77c464f4b8-zn6g5:/# mods -f "What are some of the best ways to save money?"
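Independently of mods, the Ollama API can be queried directly to confirm the model is serving. The /api/generate endpoint is part of Ollama's standard HTTP API; the model name tinyllama is an assumption based on the manifest name, and the ClusterIP is the one retrieved above:

root@mods-deployment-77c464f4b8-zn6g5:/# curl http://198.19.50.27:11434/api/generate -d '{"model": "tinyllama", "prompt": "Why is the sky blue?"}'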
$ sudo ./bin/demo delete
✓ Deleting demo 🧹
- KCD Shanghai 2024 (Accepted)
- KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 (Under Evaluation)
Beyond Containers, Orchestrate LLMs with Kubernetes on macOS
With the growing popularity of generative AI, there is an increasing demand for large language model (LLM) inference capabilities. Kubernetes, being the most popular orchestration platform, is a natural fit for these inference needs. Although GPUs are expensive and often in short supply, Apple Silicon M-series chips (with their Unified Memory Architecture) have proven to be an effective alternative for running LLMs (see the ggerganov/llama.cpp performance discussion). However, the Kubernetes ecosystem is predominantly focused on Linux-based containers. In this presentation, we will showcase our efforts to facilitate LLM inference on Kubernetes using macOS nodes. We will demonstrate how to employ Virtual Kubelet, Containerd, ShimV2, and runm (derived from llama.cpp: ggerganov/llama.cpp) to deploy open-source foundation models such as gemma, llama2, and mistral on Kubernetes. Additionally, we will discuss our motivation and the challenges encountered during our development journey. Our goal is to encourage the community to expand the Kubernetes ecosystem to inclusively support the execution of LLMs on macOS platforms.
- Enable running and orchestrating LLMs on Kubernetes with macOS nodes
- Provide an alternative solution for running LLMs on Kubernetes
- Inspire the community to build a more inclusive Kubernetes ecosystem that supports running LLMs on macOS