# Self Hosted AI Stack

- [Self Hosted AI Stack](#self-hosted-ai-stack)
  - [Notes](#notes)
    - [Podman Volume Locations](#podman-volume-locations)
    - [List of Internal Links](#list-of-internal-links)
  - [Quick Install](#quick-install)
    - [Text Stack](#text-stack)
    - [Image Stack](#image-stack)
  - [Setup](#setup)
    - [Create the AI user](#create-the-ai-user)
    - [Helper aliases](#helper-aliases)
    - [Create the models dir](#create-the-models-dir)
    - [Install the Hugging Face CLI](#install-the-hugging-face-cli)
    - [Samba Model Storage](#samba-model-storage)
    - [Download models](#download-models)
      - [Text models](#text-models)
        - [GPT-OSS](#gpt-oss)
        - [Mistral](#mistral)
        - [Qwen](#qwen)
        - [GLM](#glm)
        - [Gemma](#gemma)
        - [Dolphin](#dolphin)
        - [LiquidAI](#liquidai)
        - [Level 1 Techs](#level-1-techs)
      - [Image models](#image-models)
        - [Z-Image](#z-image)
        - [Flux](#flux)
      - [Embedding Models](#embedding-models)
        - [Qwen Embedding](#qwen-embedding)
        - [Nomic Embedding](#nomic-embedding)
  - [llama.cpp](#llamacpp)
  - [stable-diffusion.cpp](#stable-diffusioncpp)
  - [open-webui](#open-webui)
  - [lite-llm](#lite-llm)
  - [Install Services with Quadlets](#install-services-with-quadlets)
    - [Internal and External Pods](#internal-and-external-pods)
    - [Llama CPP Server (Port 8000)](#llama-cpp-server-port-8000)
    - [Llama CPP Embedding Server (Port 8001)](#llama-cpp-embedding-server-port-8001)
    - [Llama CPP Instruct Server (Port 8002)](#llama-cpp-instruct-server-port-8002)
    - [Stable Diffusion CPP (Port 1234 and 1235)](#stable-diffusion-cpp-port-1234-and-1235)
    - [Open Webui (Port 8080)](#open-webui-port-8080)
    - [Install the update script](#install-the-update-script)
    - [Install Guest Open Webui with Start/Stop Services](#install-guest-open-webui-with-startstop-services)
  - [Benchmark Results](#benchmark-results)
  - [Testing with Curl](#testing-with-curl)
    - [OpenAI API](#openai-api)
  - [Misc](#misc)
    - [Qwen3.5 Settings](#qwen35-settings)

## Notes

```bash
# Shortcut for downloading models
hf-download ()
{
    if [ $# -ne 3 ]; then
        echo "ERROR: Expected 3 arguments, but got $#" 1>&2
        return 1
    fi
    BASE_DIR='/opt/ai/models'
    mkdir -p "$BASE_DIR/$1"
    pushd "$BASE_DIR/$1" >/dev/null 2>&1
    hf download --local-dir . "$2" "$3"
    popd >/dev/null 2>&1
}
```

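For example, to pull the Qwen3-8B GGUF listed in the Qwen section into `/opt/ai/models/text/qwen3-8b`:

```bash
# Arguments: <subdir under the models dir> <HF repo> <file>
hf-download text/qwen3-8b Qwen/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf
```
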
### Podman Volume Locations

`~/.local/share/containers/storage/volumes/`

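If you need to find where a named volume (such as the `open-webui` volume used below) lives on disk, `podman volume inspect` will print the mount point:

```bash
# List volumes, then print the on-disk path of one of them
podman volume ls
podman volume inspect --format '{{ .Mountpoint }}' open-webui
```
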
### List of Internal Links

- llama-cpp
- llama-embed
- llama-instruct
- image-gen
- image-edit
- openwebui

## Quick Install

### Text Stack

```bash
ansible-playbook \
  -i ansible/inventory.yaml \
  active/software_ai_stack/install_ai_text_stack.yaml
```

### Image Stack

```bash
ansible-playbook \
  -i ansible/inventory.yaml \
  active/software_ai_stack/install_ai_image_stack.yaml
```

## Setup

### Create the AI user

```bash
# Create your local ai user. This will be the user you launch podman processes from.
useradd -m ai
loginctl enable-linger ai
su -l ai
mkdir -p /home/ai/.config/containers/systemd/
mkdir -p /home/ai/.ssh
```

Models are big. You'll want some tools to help find large files quickly when space runs out.

### Helper aliases

Add these to your `.bashrc`:

```bash
# Calculate all folder sizes in current dir
alias {dudir,dud}='du -h --max-depth 1 | sort -h'

# Calculate all file sizes in current dir
alias {dufile,duf}='ls -lhSr'

# Restart llama-server / follow logs
alias llama-reload="systemctl --user daemon-reload && systemctl --user restart llama-server.service"
alias llama-logs="journalctl --user -fu llama-server"

# Restart stable diffusion gen and edit server / follow logs
alias sd-gen-reload='systemctl --user daemon-reload && systemctl --user restart stable-diffusion-gen-server'
alias sd-gen-logs='journalctl --user -xeu stable-diffusion-gen-server'
alias sd-edit-reload='systemctl --user daemon-reload && systemctl --user restart stable-diffusion-edit-server'
alias sd-edit-logs='journalctl --user -xeu stable-diffusion-edit-server'
```

### Create the models dir

```bash
mkdir -p /home/ai/models/{text,image,video,embedding,tts,stt}
```

### Install the Hugging Face CLI

<https://huggingface.co/docs/huggingface_hub/en/guides/cli#getting-started>

```bash
# Install
curl -LsSf https://hf.co/cli/install.sh | bash

# Login
hf auth login
```

### Samba Model Storage

I recommend adding network storage for keeping models offloaded. This mounts a samba share at `/srv/models`.

```bash
dnf install -y cifs-utils

# Add this to /etc/fstab
//driveripper.reeselink.com/smb_models /srv/models cifs _netdev,nofail,uid=1001,gid=1001,credentials=/etc/samba/credentials 0 0

# Then mount
systemctl daemon-reload
mount -a --mkdir
```

Here is the sync command I use to keep the samba share in sync with the home directory:

```bash
# Sync models from home dir to the samba share
rsync -av --progress /home/ai/models/ /srv/models/
```

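The reverse direction works the same way if you want to pull models back off the share onto the faster local disk; same rsync semantics, just with source and destination swapped:

```bash
# Sync models from the samba share back to the home dir
rsync -av --progress /srv/models/ /home/ai/models/
```
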
### Download models

In general I try to run 8-bit quantization at a minimum.

#### Text models

<https://huggingface.co/ggml-org/collections>

##### GPT-OSS

<https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune#recommended-settings>

```bash
# gpt-oss-120b
mkdir gpt-oss-120b && cd gpt-oss-120b
hf download --local-dir . ggml-org/gpt-oss-120b-GGUF

# gpt-oss-20b
mkdir gpt-oss-20b && cd gpt-oss-20b
hf download --local-dir . ggml-org/gpt-oss-20b-GGUF
```

##### Mistral

```bash
# devstral-small-2-24b
mkdir devstral-small-2-24b && cd devstral-small-2-24b
hf download --local-dir . ggml-org/Devstral-Small-2-24B-Instruct-2512-GGUF Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf

# ministral-3-14b
mkdir ministral-3-14b && cd ministral-3-14b
hf download --local-dir . ggml-org/Ministral-3-14B-Reasoning-2512-GGUF

# ministral-3-3b-instruct
mkdir ministral-3-3b-instruct && cd ministral-3-3b-instruct
hf download --local-dir . ggml-org/Ministral-3-3B-Instruct-2512-GGUF
```

##### Qwen

```bash
# qwen3.5-4b
mkdir qwen3.5-4b && cd qwen3.5-4b
hf download --local-dir . unsloth/Qwen3.5-4B-GGUF Qwen3.5-4B-Q8_0.gguf
hf download --local-dir . unsloth/Qwen3.5-4B-GGUF mmproj-F16.gguf

# qwen3.5-35b-a3b
mkdir qwen3.5-35b-a3b && cd qwen3.5-35b-a3b
hf download --local-dir . unsloth/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q8_0.gguf
hf download --local-dir . unsloth/Qwen3.5-35B-A3B-GGUF mmproj-F16.gguf

# qwen3-30b-a3b-instruct
mkdir qwen3-30b-a3b-instruct && cd qwen3-30b-a3b-instruct
hf download --local-dir . ggml-org/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF

# qwen3-vl-30b-a3b-thinking
mkdir qwen3-vl-30b-a3b-thinking && cd qwen3-vl-30b-a3b-thinking
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF Qwen3VL-30B-A3B-Thinking-Q8_0.gguf
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF mmproj-Qwen3VL-30B-A3B-Thinking-F16.gguf

# qwen3-vl-30b-a3b-instruct
mkdir qwen3-vl-30b-a3b-instruct && cd qwen3-vl-30b-a3b-instruct
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF Qwen3VL-30B-A3B-Instruct-Q8_0.gguf
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF mmproj-Qwen3VL-30B-A3B-Instruct-F16.gguf

# qwen3-coder-30b-a3b-instruct
mkdir qwen3-coder-30b-a3b-instruct && cd qwen3-coder-30b-a3b-instruct
hf download --local-dir . ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

# qwen3-coder-next
mkdir qwen3-coder-next && cd qwen3-coder-next
hf download --local-dir . unsloth/Qwen3-Coder-Next-GGUF --include "Q8_0/*.gguf"

# qwen3-8b (benchmarks)
mkdir qwen3-8b && cd qwen3-8b
hf download --local-dir . Qwen/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf
```

##### GLM

```bash
# glm-4.7-flash-30b
mkdir glm-4.7-flash-30b && cd glm-4.7-flash-30b
hf download --local-dir . unsloth/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q8_0.gguf
```

##### Gemma

```bash
# Note the "it" vs "pt" suffixes: "it" is the instruction-tuned model, "pt" is the pretrained base model (not as good for out-of-the-box use)
# gemma-3-27b-it
mkdir gemma-3-27b-it && cd gemma-3-27b-it
hf download --local-dir . unsloth/gemma-3-27b-it-GGUF gemma-3-27b-it-Q8_0.gguf
hf download --local-dir . unsloth/gemma-3-27b-it-GGUF mmproj-F16.gguf
```

##### Dolphin

```bash
# dolphin-mistral-24b-venice
mkdir dolphin-mistral-24b-venice && cd dolphin-mistral-24b-venice
hf download --local-dir . bartowski/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-Q8_0.gguf
```

##### LiquidAI

```bash
# lfm2-24b
mkdir lfm2-24b && cd lfm2-24b
hf download --local-dir . LiquidAI/LFM2-24B-A2B-GGUF LFM2-24B-A2B-Q8_0.gguf
```

##### Level 1 Techs

```bash
# kappa-20b
# https://huggingface.co/eousphoros/kappa-20b-131k-GGUF-Q8_0/tree/main
mkdir kappa-20b && cd kappa-20b
hf download --local-dir . eousphoros/kappa-20b-131k-GGUF-Q8_0
```

#### Image models

##### Z-Image

```bash
# z-turbo
# Fastest image generation in 8 steps. Great at text and prompt following.
# Lacks variety.
mkdir /home/ai/models/image/z-turbo && cd /home/ai/models/image/z-turbo
hf download --local-dir . leejet/Z-Image-Turbo-GGUF z_image_turbo-Q8_0.gguf
hf download --local-dir . black-forest-labs/FLUX.1-schnell ae.safetensors
hf download --local-dir . unsloth/Qwen3-4B-Instruct-2507-GGUF Qwen3-4B-Instruct-2507-Q8_0.gguf
```

##### Flux

```bash
# flux2-klein
# Capable of editing images in 4 steps (though I recommend 5)
mkdir /home/ai/models/image/flux2-klein && cd /home/ai/models/image/flux2-klein
hf download --local-dir . leejet/FLUX.2-klein-9B-GGUF flux-2-klein-9b-Q8_0.gguf
hf download --local-dir . black-forest-labs/FLUX.2-dev ae.safetensors
hf download --local-dir . unsloth/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf
```

#### Embedding Models

##### Qwen Embedding

```bash
mkdir qwen3-embed-4b && cd qwen3-embed-4b
hf download --local-dir . Qwen/Qwen3-Embedding-4B-GGUF Qwen3-Embedding-4B-Q8_0.gguf
```

##### Nomic Embedding

```bash
# nomic-embed-text-v2
mkdir /home/ai/models/embedding/nomic-embed-text-v2
hf download --local-dir /home/ai/models/embedding/nomic-embed-text-v2 ggml-org/Nomic-Embed-Text-V2-GGUF
```

## llama.cpp

<https://github.com/ggml-org/llama.cpp/tree/master/tools/server>

```bash
# Build the llama.cpp container image
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")

# Vulkan (better performance as of Feb 2026)
podman build -f .devops/vulkan.Dockerfile -t llama-cpp-vulkan:${BUILD_TAG} -t llama-cpp-vulkan:latest .

# ROCM
podman build -f .devops/rocm.Dockerfile -t llama-cpp-rocm:${BUILD_TAG} -t llama-cpp-rocm:latest .

# Run llama demo server (published on host port 8010)
podman run \
  --rm \
  --name llama-server-demo \
  --device=/dev/kfd \
  --device=/dev/dri \
  -v /home/ai/models/text:/models:z \
  -p 8010:8000 \
  localhost/llama-cpp-vulkan:latest \
  --host 0.0.0.0 \
  --port 8000 \
  -c 16000 \
  --perf \
  --n-gpu-layers all \
  --jinja \
  --models-max 1 \
  --models-dir /models \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -m /models/qwen3.5-35b-a3b
```

Embedding models:

```bash
podman run \
  --rm \
  --name llama-server-demo \
  --device=/dev/kfd \
  --device=/dev/dri \
  -v /home/ai/models/text:/models:z \
  -p 8001:8001 \
  localhost/llama-cpp-vulkan:latest \
  --host 0.0.0.0 \
  --port 8001 \
  -c 512 \
  --perf \
  --n-gpu-layers all \
  --models-max 1 \
  --models-dir /models \
  --embedding
```

```bash
# Test with curl
curl -X POST "https://llama-embed.reeselink.com/embedding" --data '{"model": "qwen3-embed-4b", "content":"Star Wars is better than Star Trek"}'
```

## stable-diffusion.cpp

Server: <https://github.com/leejet/stable-diffusion.cpp/tree/master/examples/server>

CLI: <https://github.com/leejet/stable-diffusion.cpp/tree/master/examples/cli>

```bash
git clone https://github.com/leejet/stable-diffusion.cpp.git
cd stable-diffusion.cpp
git submodule update --init --recursive
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")

# Vulkan
podman build -f Dockerfile.vulkan -t stable-diffusion-cpp:${BUILD_TAG} -t stable-diffusion-cpp:latest .
```

```bash
# Generate an image with z-turbo
podman run --rm \
  -v /home/ai/models:/models:z \
  -v /home/ai/output:/output:z \
  --device /dev/kfd \
  --device /dev/dri \
  localhost/stable-diffusion-cpp:latest \
  --diffusion-model /models/image/z-turbo/z_image_turbo-Q8_0.gguf \
  --vae /models/image/z-turbo/ae.safetensors \
  --llm /models/image/z-turbo/Qwen3-4B-Instruct-2507-Q8_0.gguf \
  -v \
  --cfg-scale 1.0 \
  --vae-conv-direct \
  --diffusion-conv-direct \
  --fa \
  --mmap \
  --seed -1 \
  --steps 8 \
  -H 1024 \
  -W 1024 \
  -o /output/output.png \
  -p "A photorealistic dragon"

# Edit the generated image with flux2-klein
podman run --rm \
  -v /home/ai/models:/models:z \
  -v /home/ai/output:/output:z \
  --device /dev/kfd \
  --device /dev/dri \
  localhost/stable-diffusion-cpp:latest \
  --diffusion-model /models/image/flux2-klein/flux-2-klein-9b-Q8_0.gguf \
  --vae /models/image/flux2-klein/ae.safetensors \
  --llm /models/image/flux2-klein/Qwen3-8B-Q8_0.gguf \
  -v \
  --cfg-scale 1.0 \
  --sampling-method euler \
  --vae-conv-direct \
  --diffusion-conv-direct \
  --fa \
  --mmap \
  --steps 5 \
  -H 1024 \
  -W 1024 \
  -r /output/output.png \
  -o /output/edit.png \
  -p "Replace the dragon with an old car"

# Video generation with wan2.2
podman run --rm \
  -v /home/ai/models:/models:z \
  -v /home/ai/output:/output:z \
  --device /dev/kfd \
  --device /dev/dri \
  localhost/stable-diffusion-cpp:latest \
  -M vid_gen \
  --diffusion-model /models/video/wan2.2/Wan2.2-T2V-A14B-LowNoise-Q5_K_M.gguf \
  --high-noise-diffusion-model /models/video/wan2.2/Wan2.2-T2V-A14B-HighNoise-Q5_K_M.gguf \
  --vae /models/video/wan2.2/wan_2.1_vae.safetensors \
  --t5xxl /models/video/wan2.2/umt5-xxl-encoder-Q5_K_M.gguf \
  --cfg-scale 3.5 \
  --sampling-method euler \
  --steps 10 \
  --high-noise-cfg-scale 3.5 \
  --high-noise-sampling-method euler \
  --high-noise-steps 8 \
  --vae-conv-direct \
  --diffusion-conv-direct \
  --vae-tiling \
  -v \
  -n "Colorful tones, overexposed, static, blurred details, subtitles, style, artwork, painting, picture, still, overall graying, worst quality, low quality, JPEG compression residue, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, deformed limbs, finger fusion, still pictures, messy backgrounds, three legs, many people in the background, walking backwards" \
  -W 512 \
  -H 512 \
  --diffusion-fa \
  --video-frames 24 \
  --flow-shift 3.0 \
  -o /output/video_output \
  -p "A normal business meeting. People discuss business for 2 seconds. Suddenly, a horde of furries carrying assault rifles bursts into the room and causes a panic. Hatsune Miku leads the charge screaming in rage."
```

## open-webui

```bash
mkdir /home/ai/.env
# Create a file called open-webui-env containing: WEBUI_SECRET_KEY="some-random-key"
scp active/software_ai_stack/secrets/open-webui-env deskwork-ai:.env/

# Will be available on port 8080
podman run \
  -d \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

Use the following connections:

| Service                   | Endpoint                                   |
| ------------------------- | ------------------------------------------ |
| llama.cpp server          | <http://host.containers.internal:8000>     |
| llama.cpp embed           | <http://host.containers.internal:8001>     |
| stable-diffusion.cpp      | <http://host.containers.internal:1234/v1>  |
| stable-diffusion.cpp edit | <http://host.containers.internal:1235/v1>  |

## lite-llm

<https://docs.litellm.ai/docs/proxy/configs>

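The proxy is driven by a YAML config (see the link above). A minimal sketch, assuming a local `litellm-config.yaml` that points at the llama.cpp server; the model names here are placeholders:

```bash
# Hypothetical config sketch - adjust the model names to whatever llama.cpp is actually serving
cat > litellm-config.yaml <<'EOF'
model_list:
  - model_name: local-llama
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://host.containers.internal:8000/v1
      api_key: "none"
EOF
```
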
```bash
# Image name and config path follow the upstream LiteLLM proxy docs; mounts the config created above
podman run \
  --rm \
  --name litellm \
  -p 4000:4000 \
  -v ./litellm-config.yaml:/app/config.yaml:z \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```

## Install Services with Quadlets

### Internal and External Pods

These will be used to restrict internet access to our llama.cpp and stable-diffusion.cpp services while allowing the frontend services to communicate with those containers.

```bash
scp -r active/software_ai_stack/quadlets_pods/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user start ai-internal-pod.service ai-external-pod.service
```

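The actual quadlet files ship in `quadlets_pods`; as a rough sketch of the shape (file and network names here are assumptions inferred from the generated `ai-internal-pod.service` unit name), an internal pod pairs a `.network` quadlet marked `Internal=true` with a `.pod` quadlet that joins it. The external pod is the same idea without `Internal=true`.

```bash
# Hypothetical sketch of an internal (no internet) pod quadlet pair
cat > ~/.config/containers/systemd/ai-internal.network <<'EOF'
[Network]
NetworkName=ai-internal
Internal=true
EOF

cat > ~/.config/containers/systemd/ai-internal.pod <<'EOF'
[Pod]
PodName=ai-internal
Network=ai-internal.network
EOF
```
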
### Llama CPP Server (Port 8000)

Installs the llama.cpp server to run our text models.

```bash
scp -r active/software_ai_stack/quadlets_llama_think/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
```

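For reference, a `.container` quadlet for this service might look roughly like the sketch below; it reuses the image, devices, mount, and server flags from the demo `podman run` in the llama.cpp section, but the real file in `quadlets_llama_think` is authoritative.

```bash
# Hypothetical sketch of a llama-server container quadlet joined to the internal pod
cat > ~/.config/containers/systemd/llama-server.container <<'EOF'
[Container]
ContainerName=llama-server
Image=localhost/llama-cpp-vulkan:latest
Pod=ai-internal.pod
AddDevice=/dev/kfd
AddDevice=/dev/dri
Volume=/home/ai/models/text:/models:z
Exec=--host 0.0.0.0 --port 8000 -c 16000 --jinja --n-gpu-layers all --models-dir /models

[Service]
Restart=always

[Install]
WantedBy=default.target
EOF
```
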
### Llama CPP Embedding Server (Port 8001)

Installs the llama.cpp server to run our embedding models.

```bash
scp -r active/software_ai_stack/quadlets_llama_embed/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
```

### Llama CPP Instruct Server (Port 8002)

Installs the llama.cpp server to run an always-on instruct (non-thinking) model for quick replies.

```bash
scp -r active/software_ai_stack/quadlets_llama_instruct/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
```

### Stable Diffusion CPP (Port 1234 and 1235)

Installs the stable-diffusion.cpp server to run our image models.

```bash
scp -r active/software_ai_stack/quadlets_stable_diffusion/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
```

### Open Webui (Port 8080)

Installs the Open WebUI frontend.

```bash
scp -r active/software_ai_stack/quadlets_openwebui/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-external-pod.service
```

Note: all services will be available at `host.containers.internal`, so llama.cpp will be up at `http://host.containers.internal:8000`.

### Install the update script

```bash
# 1. Builds the latest llama.cpp and stable-diffusion.cpp
# 2. Pulls the latest open-webui
# 3. Restarts all services
scp active/software_ai_stack/update-script.sh deskwork-ai:
ssh deskwork-ai
chmod +x update-script.sh
./update-script.sh
```

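The script itself lives in the repo; in outline it does something like the sketch below. The build and restart commands are the ones from earlier sections, but the repo checkout locations are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical outline of update-script.sh
set -euo pipefail
BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")

# 1. Rebuild the backends from the latest sources
(cd ~/llama.cpp && git pull && \
  podman build -f .devops/vulkan.Dockerfile -t llama-cpp-vulkan:${BUILD_TAG} -t llama-cpp-vulkan:latest .)
(cd ~/stable-diffusion.cpp && git pull && git submodule update --init --recursive && \
  podman build -f Dockerfile.vulkan -t stable-diffusion-cpp:${BUILD_TAG} -t stable-diffusion-cpp:latest .)

# 2. Pull the latest open-webui
podman pull ghcr.io/open-webui/open-webui:main

# 3. Restart all services
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service ai-external-pod.service
```
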
### Install Guest Open Webui with Start/Stop Services

Optionally install a guest Open WebUI service.

```bash
scp -r active/software_ai_stack/systemd/. deskwork-ai:.config/systemd/user/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user enable open-webui-guest-start.timer
systemctl --user enable open-webui-guest-stop.timer
```

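As a sketch of what the start half of the pair could look like (the schedule and the guest service name are assumptions; the real units ship in `active/software_ai_stack/systemd/`, and the stop timer mirrors this with a later `OnCalendar` and a stop command):

```bash
# Hypothetical sketch of the start timer and its matching oneshot service
cat > ~/.config/systemd/user/open-webui-guest-start.timer <<'EOF'
[Unit]
Description=Start the guest Open WebUI in the morning

[Timer]
OnCalendar=*-*-* 08:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF

cat > ~/.config/systemd/user/open-webui-guest-start.service <<'EOF'
[Unit]
Description=Start the guest Open WebUI

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --user start open-webui-guest.service
EOF
```
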
## Benchmark Results

Benchmarks are run with [unsloth gpt-oss-20b Q8_0](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q8_0.gguf)

```bash
# Run the llama.cpp pod (AMD)
podman run -it --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  -v /home/ai/models/text:/models:z \
  --entrypoint /bin/bash \
  ghcr.io/ggml-org/llama.cpp:full-vulkan

# Benchmark command
./llama-bench -m /models/gpt-oss-20b/gpt-oss-20b-Q8_0.gguf -p 4096 -n 1024
```

Framework Desktop

| model            | size      | params  | backend | ngl | test   | t/s           |
| ---------------- | --------: | ------: | ------- | --: | -----: | ------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan  |  99 | pp4096 | 992.74 ± 6.07 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan  |  99 | tg1024 | 75.82 ± 0.07  |

AMD R9700

| model            | size      | params  | backend | ngl | test   | t/s            |
| ---------------- | --------: | ------: | ------- | --: | -----: | -------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan  |  99 | pp4096 | 3190.85 ± 8.24 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan  |  99 | tg1024 | 168.73 ± 0.15  |

NVIDIA GeForce RTX 4080 SUPER

| model            | size      | params  | backend | ngl | test  | t/s           |
| ---------------- | --------: | ------: | ------- | --: | ----: | ------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA    |  99 | tg128 | 193.28 ± 1.03 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA    |  99 | tg256 | 193.55 ± 0.34 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA    |  99 | tg512 | 187.39 ± 0.10 |

NVIDIA GeForce RTX 3090

| model            | size      | params  | backend     | ngl | test   | t/s             |
| ---------------- | --------: | ------: | ----------- | --: | -----: | --------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA,Vulkan |  99 | pp4096 | 3034.03 ± 80.36 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA,Vulkan |  99 | tg1024 | 181.05 ± 9.01   |

Apple M4 Max

| model                         | test   | t/s            |
| :---------------------------- | -----: | -------------: |
| unsloth/gpt-oss-20b-Q8_0-GGUF | pp2048 | 1579.12 ± 7.12 |
| unsloth/gpt-oss-20b-Q8_0-GGUF | tg32   | 113.00 ± 2.81  |

## Testing with Curl

### OpenAI API

```bash
export TOKEN=$(cat active/software_ai_stack/secrets/aipi-token)

# List Models
curl https://aipi.reeseapps.com/v1/models \
  -H "Authorization: Bearer $TOKEN" | jq

# Text
curl https://aipi.reeseapps.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "llama-instruct/instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }' | jq

# Completion
curl https://aipi.reeseapps.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "llama-instruct/instruct",
    "prompt": "Write a short poem about the ocean.",
    "temperature": 0.7,
    "max_tokens": 500,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0
  }' | jq

# Image Gen
curl https://aipi.reeseapps.com/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "sdd-gen/sd-cpp-local",
    "prompt": "A futuristic city with flying cars at sunset, digital art",
    "n": 1,
    "size": "1024x1024"
  }' | jq

# Image Edit
curl http://aipi.reeseapps.com/v1/images/edits \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "sdd-edit/sd-cpp-local",
    "image": "@path/to/your/image.jpg",
    "prompt": "Add a sunset background",
    "n": 1,
    "size": "1024x1024"
  }'

# Embed
curl \
  "https://aipi.reeseapps.com/v1/embeddings" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-embed/embed",
    "input": "This is the reason you ended up here:",
    "encoding_format": "float"
  }'
```

## Misc

### Qwen3.5 Settings

> We recommend using the following set of sampling parameters for generation

- Non-thinking mode for text tasks: temperature=1.0, top_p=1.00, top_k=20, min_p=0.0, presence_penalty=2.0, repetition_penalty=1.0
- Non-thinking mode for VL tasks: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
- Thinking mode for text tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
- Thinking mode for VL or precise coding (e.g. WebDev) tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

> Please note that the support for sampling parameters varies according to inference frameworks.

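As a sketch, these map onto llama.cpp's sampling flags. For example, the thinking-mode text preset applied to the demo server from the llama.cpp section might look like this (flag names per llama.cpp; the model path assumes the download layout from the Qwen section):

```bash
# Thinking-mode text preset for Qwen3.5 on the demo llama-server
podman run --rm \
  --device=/dev/kfd --device=/dev/dri \
  -v /home/ai/models/text:/models:z \
  -p 8010:8000 \
  localhost/llama-cpp-vulkan:latest \
  --host 0.0.0.0 --port 8000 --jinja --n-gpu-layers all \
  -m /models/qwen3.5-35b-a3b/Qwen3.5-35B-A3B-Q8_0.gguf \
  --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 1.5 --repeat-penalty 1.0
```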