Running and orchestrating large language models (LLMs) on Kubernetes with macOS nodes.
To run and orchestrate LLMs on Kubernetes with macOS nodes, we need the following components:
- Virtual Kubelet: For running pods on macOS nodes (forked from virtual-kubelet/cri).
- Containerd: For pulling and running the Ollama LLM image (forked from containerd/containerd).
- Runm: A lightweight runtime derived from llama.cpp for running LLMs on macOS nodes (source code will be available soon); a sketch of how it would register with containerd follows this list.
- Bronze Willow: CNI Plugin for macOS (source code will be available soon).
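Runm is wired into containerd through the ShimV2 interface (see the abstract below). Since the runtime is not published yet, the following config.toml excerpt is only a sketch of how such a shim would conventionally be registered; the runtime name runm and the type io.containerd.runm.v2 are assumptions, not the project's actual configuration.

$ cat /etc/containerd/config.toml
version = 2
# Sketch: register a hypothetical "runm" ShimV2 runtime with the CRI plugin.
# By ShimV2 convention, this runtime_type resolves to a
# containerd-shim-runm-v2 binary on $PATH.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runm]
  runtime_type = "io.containerd.runm.v2"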
This project is inspired by llama.cpp, Ollama and kind.
- A Kubernetes cluster.
- A Mac with an Apple Silicon chip (see the quick check below).
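Both prerequisites can be sanity-checked from the Mac before starting; these commands are purely illustrative:

$ kubectl get nodes   # confirms the cluster is reachable from this machine
$ uname -m            # prints "arm64" on Apple Silicon
arm64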
$ make # optional
$ sudo ./bin/demo create
✓ Starting containerd 🚢
✓ Preparing virtual nodes 📦
✓ Creating network 🌐
$ kubectl get nodes
NAME            STATUS   ROLES           AGE    VERSION
bj-k8s01        Ready    control-plane   214d   v1.28.2
bj-k8s02        Ready    worker          214d   v1.28.2
bj-k8s03        Ready    worker          214d   v1.28.2
weiqiangt-mba   Ready    agent           23d    v1.15.2-vk-cri-fb9cc09-dev
xiaodong-m1     Ready    agent           23d    v1.15.2-vk-cri-fb9cc09-dev
After running the above commands, you should see the macOS nodes appear in the output of `kubectl get nodes`. In the example above, `weiqiangt-mba` and `xiaodong-m1` are the macOS nodes.
$ kubectl apply -f k8s/tinyllama.yml
$ kubectl apply -f k8s/mods.yaml
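The actual contents of k8s/tinyllama.yml are not reproduced here. As a rough sketch, a manifest for serving Ollama on a macOS node might pair a Deployment with the tinyllama-services Service used below; the labels, image, node selector, and toleration key in this sketch are all assumptions:

apiVersion: v1
kind: Service
metadata:
  name: tinyllama-services
spec:
  selector:
    app: tinyllama
  ports:
    - port: 11434        # Ollama's default API port
      targetPort: 11434
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tinyllama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tinyllama
  template:
    metadata:
      labels:
        app: tinyllama
    spec:
      nodeSelector:
        kubernetes.io/os: darwin             # assumed label on the macOS virtual nodes
      tolerations:
        - key: virtual-kubelet.io/provider   # conventional Virtual Kubelet taint; assumed here
          operator: Exists
      containers:
        - name: tinyllama
          image: ollama/ollama               # image assumed
          ports:
            - containerPort: 11434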
# Print the sed command that points the mods config at the tinyllama service.
$ echo "sed -i 's/localhost:11434/$(kubectl get svc tinyllama-services -o json | jq -r '.spec.clusterIP')/g' ~/.config/mods/mods.yml"
sed -i 's/localhost:11434/198.19.50.27/g' ~/.config/mods/mods.yml
# Copy the output.
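Alternatively, the copy-and-paste can be skipped by applying the substitution in one shot from outside the pod. This sketch assumes the config lives at /root/.config/mods/mods.yml inside the container:

# One-shot variant (sketch): edit the mods config without an interactive shell.
$ kubectl exec $(kubectl get pods -l app=mods -o jsonpath='{.items[0].metadata.name}') -- \
    sed -i "s/localhost:11434/$(kubectl get svc tinyllama-services -o jsonpath='{.spec.clusterIP}')/g" \
    /root/.config/mods/mods.yml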
$ kubectl exec -it $(kubectl get pods -l app=mods -o jsonpath='{.items[0].metadata.name}') -- bash
root@mods-deployment-77c464f4b8-zn6g5:/# sed -i 's/localhost:11434/198.19.50.27/g' ~/.config/mods/mods.yml  # the command copied above
root@mods-deployment-77c464f4b8-zn6g5:/# mods -f "What are some of the best ways to save money?"
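Independently of mods, the Ollama API can be queried directly to confirm the model is serving. The /api/generate endpoint is part of Ollama's standard HTTP API; the model name tinyllama is an assumption based on the manifest name, and the ClusterIP is the one retrieved above:

root@mods-deployment-77c464f4b8-zn6g5:/# curl http://198.19.50.27:11434/api/generate -d '{"model": "tinyllama", "prompt": "Why is the sky blue?"}'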
$ sudo ./bin/demo delete
✓ Deleting demo 🧹
- KCD Shanghai 2024 (Accepted)
- KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 (Under Evaluation)
Beyond Containers, Orchestrate LLMs with Kubernetes on macOS
With the growing popularity of generative AI, there is an increasing demand for large language model (LLM) inference capabilities. Kubernetes, being the most popular orchestration platform, is a natural fit for these inference needs. Although GPUs are expensive and often in short supply, Apple Silicon M-series chips (with their Unified Memory Architecture) have proven to be an effective alternative for running LLMs (see the ggerganov/llama.cpp performance discussion). However, the Kubernetes ecosystem is predominantly focused on Linux-based containers. In this presentation, we will showcase our efforts to facilitate LLM inference on Kubernetes using macOS nodes. We will demonstrate how to employ Virtual Kubelet, Containerd, ShimV2, and runm (derived from llama.cpp: ggerganov/llama.cpp) to deploy open-source foundation models such as gemma, llama2, and mistral on Kubernetes. Additionally, we will discuss our motivation and the challenges encountered during our development journey. Our goal is to encourage the community to expand the Kubernetes ecosystem to inclusively support the execution of LLMs on macOS platforms.
- Enable running and orchestrating LLMs on Kubernetes with macOS nodes
- Provide an alternative solution for running LLMs on Kubernetes
- Inspire the community to build a more inclusive Kubernetes ecosystem that supports running LLMs on macOS