Files
homelab/active/software_ai_stack/ai_stack.md
ducoterra 31739320aa
All checks were successful
Podman DDNS Image / build-and-push-ddns (push) Successful in 1m4s
add apple m4 max benchmark
2026-02-25 16:01:08 -05:00

528 lines
16 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Self Hosted AI Stack
- [Self Hosted AI Stack](#self-hosted-ai-stack)
- [Notes](#notes)
- [Podman Volume Locations](#podman-volume-locations)
- [Setup](#setup)
- [Create the AI user](#create-the-ai-user)
- [Helper aliases](#helper-aliases)
- [Create the models dir](#create-the-models-dir)
- [Install the Hugging Face CLI](#install-the-hugging-face-cli)
- [Samba Model Storage](#samba-model-storage)
- [Download models](#download-models)
- [Text models](#text-models)
- [GPT-OSS](#gpt-oss)
- [Mistral](#mistral)
- [Qwen](#qwen)
- [GLM](#glm)
- [Gemma](#gemma)
- [Dolphin](#dolphin)
- [Image models](#image-models)
- [Z-Image](#z-image)
- [Flux](#flux)
- [Embedding Models](#embedding-models)
- [Qwen Embedding](#qwen-embedding)
- [Nomic Embedding](#nomic-embedding)
- [llama.cpp](#llamacpp)
- [stable-diffusion.cpp](#stable-diffusioncpp)
- [open-webui](#open-webui)
- [Install Services with Quadlets](#install-services-with-quadlets)
- [Internal and External Pods](#internal-and-external-pods)
- [Llama CPP Server](#llama-cpp-server)
- [Llama CPP Embedding Server](#llama-cpp-embedding-server)
- [Stable Diffusion CPP](#stable-diffusion-cpp)
- [Open Webui](#open-webui-1)
- [Install the update script](#install-the-update-script)
- [Install Guest Open Webui with Start/Stop Services](#install-guest-open-webui-with-startstop-services)
- [Benchmark Results](#benchmark-results)
## Notes
### Podman Volume Locations
`~/.local/share/containers/storage/volumes/`
## Setup
### Create the AI user
```bash
# Create your local ai user. This will be the user you launch podman processes from.
useradd -m ai
loginctl enable-linger ai
su -l ai
mkdir -p /home/ai/.config/containers/systemd/
mkdir -p /home/ai/.ssh
```
Models are big. You'll want some tools to help find large files quickly when space runs out.
### Helper aliases
Add these to your .bashrc:
```bash
# Calculate all folder sizes in current dir
alias {dudir,dud}='du -h --max-depth 1 | sort -h'
# Calculate all file sizes in current dir
alias {dufile,duf}='ls -lhSr'
# Restart llama-server / follow logs
alias llama-reload="systemctl --user daemon-reload && systemctl --user restart llama-server.service"
alias llama-logs="journalctl --user -fu llama-server"
# Restart stable diffusion gen and edit server / follow logs
alias sd-gen-reload='systemctl --user daemon-reload && systemctl --user restart stable-diffusion-gen-server'
alias sd-gen-logs='journalctl --user -xeu stable-diffusion-gen-server'
alias sd-edit-reload='systemctl --user daemon-reload && systemctl --user restart stable-diffusion-edit-server'
alias sd-edit-logs='journalctl --user -xeu stable-diffusion-edit-server'
```
### Create the models dir
```bash
mkdir -p /home/ai/models/{text,image,video,embedding,tts,stt}
```
### Install the Hugging Face CLI
<https://huggingface.co/docs/huggingface_hub/en/guides/cli#getting-started>
```bash
# Install
curl -LsSf https://hf.co/cli/install.sh | bash
# Login
hf auth login
```
### Samba Model Storage
I recommend adding network storage for keeping models offloaded. This mounts a samba share at `/srv/models`.
```bash
dnf install -y cifs-utils
# Add this to /etc/fstab
//driveripper.reeselink.com/smb_models /srv/models cifs _netdev,nofail,uid=1001,gid=1001,credentials=/etc/samba/credentials 0 0
# Then mount
systemctl daemon-reload
mount -a --mkdir
```
Here are some sync commands that I use to keep the samba share in sync with the home directory:
```bash
# Sync models from home dir to the samba share
rsync -av --progress /home/ai/models/ /srv/models/
```
### Download models
In general I try to run 8 bit quantized minimum.
#### Text models
<https://huggingface.co/ggml-org/collections>
##### GPT-OSS
<https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune#recommended-settings>
```bash
# gpt-oss-120b
mkdir gpt-oss-120b && cd gpt-oss-120b
hf download --local-dir . ggml-org/gpt-oss-120b-GGUF
# gpt-oss-20b
mkdir gpt-oss-20b && cd gpt-oss-20b
hf download --local-dir . ggml-org/gpt-oss-20b-GGUF
```
##### Mistral
```bash
# devstral-small-2-24b
mkdir devstral-small-2-24b && cd devstral-small-2-24b
hf download --local-dir . ggml-org/Devstral-Small-2-24B-Instruct-2512-GGUF Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf
# ministral-3-14b
mkdir ministral-3-14b && cd ministral-3-14b
hf download --local-dir . ggml-org/Ministral-3-14B-Reasoning-2512-GGUF
# ministral-3-3b-instruct
mkdir ministral-3-3b-instruct && cd ministral-3-3b-instruct
hf download --local-dir . ggml-org/Ministral-3-3B-Instruct-2512-GGUF
```
##### Qwen
```bash
# qwen3-30b-a3b-thinking
mkdir qwen3-30b-a3b-thinking && cd qwen3-30b-a3b-thinking
hf download --local-dir . ggml-org/Qwen3-30B-A3B-Thinking-2507-Q8_0-GGUF
# qwen3-30b-a3b-instruct
mkdir qwen3-30b-a3b-instruct && cd qwen3-30b-a3b-instruct
hf download --local-dir . ggml-org/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF
# qwen3-vl-30b-a3b-thinking
mkdir qwen3-vl-30b-a3b-thinking && cd qwen3-vl-30b-a3b-thinking
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF Qwen3VL-30B-A3B-Thinking-Q8_0.gguf
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF mmproj-Qwen3VL-30B-A3B-Thinking-F16.gguf
# qwen3-vl-30b-a3b-instruct
mkdir qwen3-vl-30b-a3b-instruct && cd qwen3-vl-30b-a3b-instruct
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF Qwen3VL-30B-A3B-Instruct-Q8_0.gguf
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF mmproj-Qwen3VL-30B-A3B-Instruct-F16.gguf
# qwen3-coder-30b-a3b-instruct
mkdir qwen3-coder-30b-a3b-instruct && cd qwen3-coder-30b-a3b-instruct
hf download --local-dir . ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF
# qwen3-coder-next
mkdir qwen3-coder-next && cd qwen3-coder-next
hf download --local-dir . unsloth/Qwen3-Coder-Next-GGUF --include "Q8_0/*.gguf"
```
##### GLM
```bash
# glm-4.7-flash-30b
mkdir glm-4.7-flash-30b && cd glm-4.7-flash-30b
hf download --local-dir . unsloth/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q8_0.gguf
```
##### Gemma
```bash
# Note "it" vs "pt" suffixes. "it" is instruction following, "pt" is the base model (not as good for out-of-the-box use)
# gemma-3-27b-it
mkdir gemma-3-27b-it && cd gemma-3-27b-it
hf download --local-dir . unsloth/gemma-3-27b-it-GGUF gemma-3-27b-it-Q8_0.gguf
hf download --local-dir . unsloth/gemma-3-27b-it-GGUF mmproj-F16.gguf
```
##### Dolphin
```bash
# dolphin-mistral-24b-venice
mkdir dolphin-mistral-24b-venice && cd dolphin-mistral-24b-venice
cd dolphin-mistral-24b-venice
hf download --local-dir . bartowski/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-Q8_0.gguf
```
#### Image models
##### Z-Image
```bash
# z-turbo
# Fastest image generation in 8 steps. Great a text and prompt following.
# Lacks variety.
mkdir /home/ai/models/image/z-turbo && cd /home/ai/models/image/z-turbo
hf download --local-dir . leejet/Z-Image-Turbo-GGUF z_image_turbo-Q8_0.gguf
hf download --local-dir . black-forest-labs/FLUX.1-schnell ae.safetensors
hf download --local-dir . unsloth/Qwen3-4B-Instruct-2507-GGUF Qwen3-4B-Instruct-2507-Q8_0.gguf
```
##### Flux
```bash
# flux2-klein
# Capable of editing images in 4 steps (though 5 is my recommended steps)
mkdir /home/ai/models/image/flux2-klein && cd /home/ai/models/image/flux2-klein
hf download --local-dir . leejet/FLUX.2-klein-9B-GGUF flux-2-klein-9b-Q8_0.gguf
hf download --local-dir . black-forest-labs/FLUX.2-dev ae.safetensors
hf download --local-dir . unsloth/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf
```
#### Embedding Models
##### Qwen Embedding
```bash
mkdir /home/ai/models/embedding/qwen3-vl-embed && cd /home/ai/models/embedding/qwen3-vl-embed
hf download --local-dir . dam2452/Qwen3-VL-Embedding-8B-GGUF Qwen3-VL-Embedding-8B-Q8_0.gguf
```
##### Nomic Embedding
```bash
# nomic-embed-text-v2
mkdir /home/ai/models/embedding/nomic-embed-text-v2
hf download --local-dir /home/ai/models/embedding/nomic-embed-text-v2 ggml-org/Nomic-Embed-Text-V2-GGUF
```
## llama.cpp
<https://github.com/ggml-org/llama.cpp/tree/master/tools/server>
```bash
# Build the llama.cpp container image
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")
# Vulkan (better performance as of Feb 2026)
podman build -f .devops/vulkan.Dockerfile -t llama-cpp-vulkan:${BUILD_TAG} -t llama-cpp-vulkan:latest .
# ROCM
podman build -f .devops/rocm.Dockerfile -t llama-cpp-rocm:${BUILD_TAG} -t llama-cpp-rocm:latest .
# Run llama demo server (Available on port 8000)
podman run \
--rm \
--name llama-server-demo \
--device=/dev/kfd \
--device=/dev/dri \
-v /home/ai/models/text:/models:z \
-p 8000:8000 \
localhost/llama-cpp-vulkan:latest \
--host 0.0.0.0 \
--port 8000 \
-c 32768 \
--perf \
--n-gpu-layers all \
--jinja \
--models-max 1 \
--models-dir /models
```
## stable-diffusion.cpp
Server: <https://github.com/leejet/stable-diffusion.cpp/tree/master/examples/server>
CLI: <https://github.com/leejet/stable-diffusion.cpp/tree/master/examples/cli>
```bash
git clone https://github.com/leejet/stable-diffusion.cpp.git
cd stable-diffusion.cpp
git submodule update --init --recursive
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")
# Vulkan
podman build -f Dockerfile.vulkan -t stable-diffusion-cpp:${BUILD_TAG} -t stable-diffusion-cpp:latest .
```
```bash
# Generate an image with z-turbo
podman run --rm \
-v /home/ai/models:/models:z \
-v /home/ai/output:/output:z \
--device /dev/kfd \
--device /dev/dri \
localhost/stable-diffusion-cpp:latest \
--diffusion-model /models/image/z-turbo/z_image_turbo-Q8_0.gguf \
--vae /models/image/z-turbo/ae.safetensors \
--llm /models/image/z-turbo/Qwen3-4B-Instruct-2507-Q8_0.gguf \
-v \
--cfg-scale 1.0 \
--vae-conv-direct \
--diffusion-conv-direct \
--fa \
--mmap \
--seed -1 \
--steps 8 \
-H 1024 \
-W 1024 \
-o /output/output.png \
-p "A photorealistic dragon"
# Edit the generated image with flux2-klein
podman run --rm \
-v /home/ai/models:/models:z \
-v /home/ai/output:/output:z \
--device /dev/kfd \
--device /dev/dri \
localhost/stable-diffusion-cpp:latest \
--diffusion-model /models/image/flux2-klein/flux-2-klein-9b-Q8_0.gguf \
--vae /models/image/flux2-klein/ae.safetensors \
--llm /models/image/flux2-klein/Qwen3-8B-Q8_0.gguf \
-v \
--cfg-scale 1.0 \
--sampling-method euler \
--vae-conv-direct \
--diffusion-conv-direct \
--fa \
--mmap \
--steps 5 \
-H 1024 \
-W 1024 \
-r /output/output.png \
-o /output/edit.png \
-p "Replace the dragon with an old car"
```
## open-webui
```bash
mkdir /home/ai/.env
# Create a file called open-webui-env with `WEBUI_SECRET_KEY="some-random-key"
scp active/software_ai_stack/secrets/open-webui-env deskwork-ai:.env/
# Will be available on port 8080
podman run \
-d \
-p 8080:8080 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
```
Use the following connections:
| Service | Endpoint |
| ------------------------- | ----------------------------------------- |
| llama.cpp server | <http://host.containers.internal:8000> |
| llama.cpp embed | <http://host.containers.internal:8001> |
| stable-diffusion.cpp | <http://host.containers.internal:1234/v1> |
| stable-diffusion.cpp edit | <http://host.containers.internal:1235/v1> |
## Install Services with Quadlets
### Internal and External Pods
These will be used to restrict internet access to our llama.cpp and
stable-diffusion.cpp services while allowing the frontend services to
communicate with those containers.
```bash
scp -r active/software_ai_stack/quadlets_pods/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user start ai-internal-pod.service ai-external-pod.service
```
### Llama CPP Server
Installs the llama.cpp server to run our text models.
```bash
scp -r active/software_ai_stack/quadlets_llama_server/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
```
### Llama CPP Embedding Server
Installs the llama.cpp server to run our embedding models
```bash
scp -r active/software_ai_stack/quadlets_llama_embed/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
```
### Stable Diffusion CPP
Installs the stable-diffusion.cpp server to run our image models.
```bash
scp -r active/software_ai_stack/quadlets_stable_diffusion/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
```
### Open Webui
Installs the open webui frontend.
```bash
scp -r active/software_ai_stack/quadlets_openwebui/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-external-pod.service
```
Note, all services will be available at `host.containers.internal`. So llama.cpp
will be up at `http://host.containers.internal:8000`.
### Install the update script
```bash
# 1. Builds the latest llama.cpp and stable-diffusion.cpp
# 2. Pulls the latest open-webui
# 3. Restarts all services
scp active/software_ai_stack/update-script.sh deskwork-ai:
ssh deskwork-ai
chmod +x update-script.sh
./update-script.sh
```
### Install Guest Open Webui with Start/Stop Services
Optionally install a guest openwebui service.
```bash
scp -r active/software_ai_stack/systemd/. deskwork-ai:.config/systemd/user/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user enable open-webui-guest-start.timer
systemctl --user enable open-webui-guest-stop.timer
```
## Benchmark Results
Benchmarks are run with [unsloth gpt-oss-20b Q8_0](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q8_0.gguf)
```bash
# Run the llama.cpp pod (AMD)
podman run -it --rm \
--device=/dev/kfd \
--device=/dev/dri \
-v /home/ai/models/text:/models:z \
--entrypoint /bin/bash \
ghcr.io/ggml-org/llama.cpp:full-vulkan
# Benchmark command
./llama-bench -m /models/benchmark/gpt-oss-20b-Q8_0.gguf
```
Framework Desktop
| model | size | params | backend | ngl | test | t/s |
| ---------------- | --------: | ------: | ------- | ---: | ----: | -------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | pp512 | 1128.50 ± 7.60 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | tg128 | 77.94 ± 0.08 |
| model | size | params | backend | ngl | test | t/s |
| ---------------- | --------: | ------: | ------- | ---: | ----: | ------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | pp512 | 526.05 ± 7.04 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | tg128 | 70.98 ± 0.01 |
AMD R9700
| model | size | params | backend | ngl | test | t/s |
| ---------------- | --------: | ------: | ------- | ---: | ----: | ---------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | pp512 | 3756.79 ± 203.97 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | tg128 | 174.24 ± 0.32 |
NVIDIA GeForce RTX 4080 SUPER
| model | size | params | backend | ngl | test | t/s |
| ---------------- | --------: | ------: | ------- | ---: | ----: | ------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 193.28 ± 1.03 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | tg256 | 193.55 ± 0.34 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | tg512 | 187.39 ± 0.10 |
NVIDIA GeForce RTX 3090
| model | size | params | backend | ngl | test | t/s |
| ---------------- | --------: | ------: | ------- | ---: | ----: | --------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | pp512 | 4297.72 ± 35.60 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 197.73 ± 0.62 |
Apple M4 max
| model | test | t/s |
| :---------------------------- | -----: | -------------: |
| unsloth/gpt-oss-20b-Q8_0-GGUF | pp2048 | 1579.12 ± 7.12 |
| unsloth/gpt-oss-20b-Q8_0-GGUF | tg32 | 113.00 ± 2.81 |