homelab/active/software_ai_stack/ai_stack.md
ducoterra f2015e2c71
2026-05-05 06:26:40 -04:00

Self Hosted AI Stack

Notes

# Shortcut for downloading models
# Usage: hf-download <subdir> <repo> <file>
hf-download ()
{
    if [ $# -ne 3 ]; then
        echo "ERROR: Expected 3 arguments, but got $#" 1>&2
        return 1
    fi
    local BASE_DIR='/opt/ai/models'
    mkdir -p "$BASE_DIR/$1"
    # Stop here if the directory cannot be entered, so nothing lands in the wrong place
    pushd "$BASE_DIR/$1" >/dev/null || return 1
    hf download --local-dir . "$2" "$3"
    popd >/dev/null
}

Podman Volume Locations

~/.local/share/containers/storage/volumes/

  • llama-cpp
  • llama-embed
  • llama-instruct
  • image-gen
  • image-edit
  • openwebui

Quick Install

Text Stack

ansible-playbook \
-i ansible/inventory.yaml \
active/software_ai_stack/install_ai_text_stack.yaml

Image Stack

ansible-playbook \
-i ansible/inventory.yaml \
active/software_ai_stack/install_ai_image_stack.yaml

Setup

Create the AI user

# Create your local ai user. This will be the user you launch podman processes from.
useradd -m ai
loginctl enable-linger ai
su -l ai
mkdir -p /home/ai/.config/containers/systemd/
mkdir -p /home/ai/.ssh

Models are big. You'll want some tools to help find large files quickly when space runs out.

Helper aliases

Add these to your .bashrc:

# Calculate all folder sizes in current dir 
alias {dudir,dud}='du -h --max-depth 1 | sort -h'

# Calculate all file sizes in current dir
alias {dufile,duf}='ls -lhSr'

# Restart llama-server / follow logs
alias llama-reload="systemctl --user daemon-reload && systemctl --user restart llama-server.service"
alias llama-logs="journalctl --user -fu llama-server"

# Restart stable diffusion gen and edit server / follow logs
alias sd-gen-reload='systemctl --user daemon-reload && systemctl --user restart stable-diffusion-gen-server'
alias sd-gen-logs='journalctl --user -xeu stable-diffusion-gen-server'
alias sd-edit-reload='systemctl --user daemon-reload && systemctl --user restart stable-diffusion-edit-server'
alias sd-edit-logs='journalctl --user -xeu stable-diffusion-edit-server'
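The aliases above only look one level deep. To find individual large files anywhere below a directory, something like the following could sit next to them in .bashrc. This is a minimal sketch, assuming GNU find (for -printf); the function name bigfiles is hypothetical and sizes are printed in bytes.

```shell
# List the N largest files under a directory, biggest first.
# Usage: bigfiles [dir] [count]   (defaults: current dir, 10 files)
bigfiles ()
{
    local dir="${1:-.}"
    local count="${2:-10}"
    # %s = size in bytes, %p = path; sort numerically, largest first
    find "$dir" -type f -printf '%s\t%p\n' | sort -rn | head -n "$count"
}
```

For example, `bigfiles /home/ai/models 5` would list the five largest model files.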

Create the models dir

mkdir -p /home/ai/models/{text,image,video,embedding,tts,stt}

Install the Hugging Face CLI

https://huggingface.co/docs/huggingface_hub/en/guides/cli#getting-started

# Install
curl -LsSf https://hf.co/cli/install.sh | bash

# Login
hf auth login

Samba Model Storage

I recommend keeping models on network storage so they can be offloaded from local disk. The steps below mount a Samba share at /srv/models.

dnf install -y cifs-utils

# Add this to /etc/fstab
//driveripper.reeselink.com/smb_models /srv/models cifs _netdev,nofail,uid=1001,gid=1001,credentials=/etc/samba/credentials 0 0

# Then mount
systemctl daemon-reload
mount -a --mkdir
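The fstab entry points at /etc/samba/credentials, which mount.cifs reads as plain key=value lines. A minimal sketch for creating that file with owner-only permissions follows; the function name and the example username and password are placeholders, not values from this setup. The uid=1001 and gid=1001 options should match the ai user, which `id ai` will show.

```shell
# Write a mount.cifs credentials file readable only by its owner.
# Usage: write_cifs_credentials <file> <username> <password>
write_cifs_credentials ()
{
    local file="$1" user="$2" pass="$3"
    # The restrictive umask applies only inside this subshell
    ( umask 077; printf 'username=%s\npassword=%s\n' "$user" "$pass" > "$file" )
    # Also tighten the mode in case the file already existed
    chmod 600 "$file"
}

# Example (run as root, placeholder values):
# write_cifs_credentials /etc/samba/credentials smbuser 'change-me'
```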

Here is the sync command I use to keep the Samba share in sync with the home directory:

# Sync models from home dir to the samba share
rsync -av --progress /home/ai/models/ /srv/models/
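Because the fstab entry uses nofail, a failed mount leaves /srv/models as an ordinary local directory, and rsync would then quietly fill the root disk. A guarded version of the sync could look like this. It is a sketch, not the command I actually run, and the function name sync_models is hypothetical.

```shell
# Sync models to the share, but only if the share is actually mounted.
# Usage: sync_models [source_dir] [share_dir]
sync_models ()
{
    local src="${1:-/home/ai/models}"
    local dest="${2:-/srv/models}"
    if ! mountpoint -q "$dest"; then
        echo "ERROR: $dest is not a mount point, refusing to sync" 1>&2
        return 1
    fi
    # Trailing slashes: copy the contents of src into dest
    rsync -av --progress "$src/" "$dest/"
}
```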

Download models

In my completely subjective opinion, a 5-bit quant is usually the sweet spot for unsloth models. Q5_K_S is usually just fine.

I usually download the F16 mmproj files. This is also completely subjective. BF16 is fine. F32 is overkill.

Text models

https://huggingface.co/ggml-org/collections

GPT-OSS

https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune#recommended-settings

# gpt-oss-120b
mkdir gpt-oss-120b && cd gpt-oss-120b
hf download --local-dir . ggml-org/gpt-oss-120b-GGUF

# gpt-oss-20b
mkdir gpt-oss-20b && cd gpt-oss-20b
hf download --local-dir . ggml-org/gpt-oss-20b-GGUF

Mistral

# devstral-small-2-24b
mkdir devstral-small-2-24b && cd devstral-small-2-24b
hf download --local-dir . ggml-org/Devstral-Small-2-24B-Instruct-2512-GGUF Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf

# ministral-3-14b
mkdir ministral-3-14b && cd ministral-3-14b
hf download --local-dir . ggml-org/Ministral-3-14B-Reasoning-2512-GGUF

# ministral-3-3b-instruct
mkdir ministral-3-3b-instruct && cd ministral-3-3b-instruct
hf download --local-dir . ggml-org/Ministral-3-3B-Instruct-2512-GGUF

Qwen

# qwen3.6-35b-a3b
mkdir qwen3.6-35b-a3b && cd qwen3.6-35b-a3b
hf download --local-dir . unsloth/Qwen3.6-35B-A3B-GGUF Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
hf download --local-dir . unsloth/Qwen3.6-35B-A3B-GGUF mmproj-F16.gguf

# qwen3.5-27b-opus
mkdir qwen3.5-27b-opus && cd qwen3.5-27b-opus
hf download --local-dir . Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF Qwen3.5-27B.Q4_K_M.gguf
hf download --local-dir . Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF mmproj-BF16.gguf

# qwen3.5-4b
mkdir qwen3.5-4b && cd qwen3.5-4b
hf download --local-dir . unsloth/Qwen3.5-4B-GGUF Qwen3.5-4B-Q8_0.gguf
hf download --local-dir . unsloth/Qwen3.5-4B-GGUF mmproj-F16.gguf

# qwen3.5-35b-a3b
mkdir qwen3.5-35b-a3b && cd qwen3.5-35b-a3b
hf download --local-dir . unsloth/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q8_0.gguf
hf download --local-dir . unsloth/Qwen3.5-35B-A3B-GGUF mmproj-F16.gguf

# qwen3-30b-a3b-instruct
mkdir qwen3-30b-a3b-instruct && cd qwen3-30b-a3b-instruct
hf download --local-dir . ggml-org/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF

# qwen3-vl-30b-a3b-thinking
mkdir qwen3-vl-30b-a3b-thinking && cd qwen3-vl-30b-a3b-thinking
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF Qwen3VL-30B-A3B-Thinking-Q8_0.gguf
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF mmproj-Qwen3VL-30B-A3B-Thinking-F16.gguf

# qwen3-vl-30b-a3b-instruct
mkdir qwen3-vl-30b-a3b-instruct && cd qwen3-vl-30b-a3b-instruct
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF Qwen3VL-30B-A3B-Instruct-Q8_0.gguf
hf download --local-dir . Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF mmproj-Qwen3VL-30B-A3B-Instruct-F16.gguf

# qwen3-coder-30b-a3b-instruct
mkdir qwen3-coder-30b-a3b-instruct && cd qwen3-coder-30b-a3b-instruct
hf download --local-dir . ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

# qwen3-coder-next
mkdir qwen3-coder-next && cd qwen3-coder-next
hf download --local-dir . unsloth/Qwen3-Coder-Next-GGUF --include "Q8_0/*.gguf"

# qwen3-8b (benchmarks)
mkdir qwen3-8b && cd qwen3-8b
hf download --local-dir . Qwen/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf

GLM

# glm-4.7-flash-30b
mkdir glm-4.7-flash-30b && cd glm-4.7-flash-30b
hf download --local-dir . unsloth/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q8_0.gguf

Gemma

# Note "it" vs "pt" suffixes. "it" is instruction following, "pt" is the base model (not as good for out-of-the-box use)

# gemma-4-26b-a4b
mkdir gemma-4-26b-a4b && cd gemma-4-26b-a4b
hf download --local-dir . ggml-org/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-Q8_0.gguf
hf download --local-dir . ggml-org/gemma-4-26B-A4B-it-GGUF mmproj-gemma-4-26B-A4B-it-f16.gguf

# gemma-4-31b
mkdir gemma-4-31b && cd gemma-4-31b
hf download --local-dir . ggml-org/gemma-4-31B-it-GGUF gemma-4-31B-it-Q8_0.gguf
hf download --local-dir . ggml-org/gemma-4-31B-it-GGUF mmproj-gemma-4-31B-it-f16.gguf

# gemma-3-27b-it
mkdir gemma-3-27b-it && cd gemma-3-27b-it
hf download --local-dir . unsloth/gemma-3-27b-it-GGUF gemma-3-27b-it-Q8_0.gguf
hf download --local-dir . unsloth/gemma-3-27b-it-GGUF mmproj-F16.gguf

Dolphin

# dolphin-mistral-24b-venice
mkdir dolphin-mistral-24b-venice && cd dolphin-mistral-24b-venice
hf download --local-dir . bartowski/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-Q8_0.gguf

LiquidAI

# lfm2-24b
mkdir lfm2-24b && cd lfm2-24b
hf download --local-dir . LiquidAI/LFM2-24B-A2B-GGUF LFM2-24B-A2B-Q8_0.gguf

Level 1 Techs

# kappa-20b
# https://huggingface.co/eousphoros/kappa-20b-131k-GGUF-Q8_0/tree/main
mkdir kappa-20b && cd kappa-20b
hf download --local-dir . eousphoros/kappa-20b-131k-GGUF-Q8_0

Image models

Z-Image

# z-turbo
# Fastest image generation in 8 steps. Great at text and prompt following.
# Lacks variety.
mkdir /home/ai/models/image/z-turbo && cd /home/ai/models/image/z-turbo
hf download --local-dir . leejet/Z-Image-Turbo-GGUF z_image_turbo-Q8_0.gguf
hf download --local-dir . black-forest-labs/FLUX.1-schnell ae.safetensors
hf download --local-dir . unsloth/Qwen3-4B-Instruct-2507-GGUF Qwen3-4B-Instruct-2507-Q8_0.gguf

Flux

# flux2-klein
# Capable of editing images in 4 steps (though I recommend 5)
mkdir /home/ai/models/image/flux2-klein && cd /home/ai/models/image/flux2-klein
hf download --local-dir . leejet/FLUX.2-klein-9B-GGUF flux-2-klein-9b-Q8_0.gguf
hf download --local-dir . black-forest-labs/FLUX.2-dev ae.safetensors
hf download --local-dir . unsloth/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf

Embedding Models

Qwen Embedding

mkdir qwen3-embed-4b && cd qwen3-embed-4b
hf download --local-dir . Qwen/Qwen3-Embedding-4B-GGUF Qwen3-Embedding-4B-Q8_0.gguf

Nomic Embedding

# nomic-embed-text-v2
mkdir /home/ai/models/embedding/nomic-embed-text-v2
hf download --local-dir /home/ai/models/embedding/nomic-embed-text-v2 ggml-org/Nomic-Embed-Text-V2-GGUF

llama.cpp

https://github.com/ggml-org/llama.cpp/tree/master/tools/server

# Build the llama.cpp container image
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")

# Vulkan (better performance as of Feb 2026)
podman build -f .devops/vulkan.Dockerfile -t llama-cpp-vulkan:${BUILD_TAG} -t llama-cpp-vulkan:latest .

# ROCm
podman build -f .devops/rocm.Dockerfile -t llama-cpp-rocm:${BUILD_TAG} -t llama-cpp-rocm:latest .

# Run llama demo server (Available on port 8010)
podman run \
--rm \
--name llama-server-demo \
--device=/dev/kfd \
--device=/dev/dri \
-v /home/ai/models/text:/models:z \
-p 8010:8000 \
--ipc host \
localhost/llama-cpp-vulkan:latest \
--host 0.0.0.0 \
--port 8000 \
-c 128000 \
--perf \
--n-gpu-layers all \
--jinja \
--models-max 1 \
--models-dir /models \
--chat-template-kwargs '{"enable_thinking": false}' \
-m /models/qwen3.5-35b-a3b
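Loading a large model takes a while, and llama-server answers its /health endpoint with an error status until the model is ready. A small polling helper avoids guessing when the demo server is usable. This is a sketch; the function name wait_for_llama and the default of 60 tries are my own choices, and the URL assumes the port mapping from the command above.

```shell
# Poll a llama-server /health endpoint until it answers OK or we give up.
# Usage: wait_for_llama [base_url] [max_tries]
wait_for_llama ()
{
    local url="${1:-http://localhost:8010}"
    local tries="${2:-60}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors such as 503 (model still loading)
        if curl -sf -o /dev/null --max-time 2 "$url/health"; then
            echo "llama-server is ready at $url"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "ERROR: $url did not become healthy after $tries tries" 1>&2
    return 1
}
```

Once it returns, `curl -s http://localhost:8010/v1/models` should list the models found in /models.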

Embedding models

podman run \
--rm \
--name llama-server-demo \
--device=/dev/kfd \
--device=/dev/dri \
-v /home/ai/models/text:/models:z \
-p 8001:8001 \
localhost/llama-cpp-vulkan:latest \
--host 0.0.0.0 \
--port 8001 \
-c 512 \
--perf \
--n-gpu-layers all \
--models-max 1 \
--models-dir /models \
--embedding

# Test with curl
curl -X POST "https://llama-embed.reeselink.com/embedding" --data '{"model": "qwen3-embed-4b", "content":"Star Wars is better than Star Trek"}'

stable-diffusion.cpp

Server: https://github.com/leejet/stable-diffusion.cpp/tree/master/examples/server

CLI: https://github.com/leejet/stable-diffusion.cpp/tree/master/examples/cli

git clone https://github.com/leejet/stable-diffusion.cpp.git
cd stable-diffusion.cpp
git submodule update --init --recursive
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")

# Vulkan
podman build -f Dockerfile.vulkan -t stable-diffusion-cpp:${BUILD_TAG} -t stable-diffusion-cpp:latest .

# Generate an image with z-turbo
podman run --rm \
-v /home/ai/models:/models:z \
-v /home/ai/output:/output:z \
--device /dev/kfd \
--device /dev/dri \
localhost/stable-diffusion-cpp:latest \
--diffusion-model /models/image/z-turbo/z_image_turbo-Q8_0.gguf \
--vae /models/image/z-turbo/ae.safetensors \
--llm /models/image/z-turbo/Qwen3-4B-Instruct-2507-Q8_0.gguf \
-v \
--cfg-scale 1.0 \
--vae-conv-direct \
--diffusion-conv-direct \
--fa \
--mmap \
--seed -1 \
--steps 8 \
-H 1024 \
-W 1024 \
-o /output/output.png \
-p "A photorealistic dragon"

# Edit the generated image with flux2-klein
podman run --rm \
-v /home/ai/models:/models:z \
-v /home/ai/output:/output:z \
--device /dev/kfd \
--device /dev/dri \
localhost/stable-diffusion-cpp:latest \
--diffusion-model /models/image/flux2-klein/flux-2-klein-9b-Q8_0.gguf \
--vae /models/image/flux2-klein/ae.safetensors \
--llm /models/image/flux2-klein/Qwen3-8B-Q8_0.gguf \
-v \
--cfg-scale 1.0 \
--sampling-method euler \
--vae-conv-direct \
--diffusion-conv-direct \
--fa \
--mmap \
--steps 5 \
-H 1024 \
-W 1024 \
-r /output/output.png \
-o /output/edit.png \
-p "Replace the dragon with an old car"

# Video generation with wan2.2
podman run --rm \
-v /home/ai/models:/models:z \
-v /home/ai/output:/output:z \
--device /dev/kfd \
--device /dev/dri \
localhost/stable-diffusion-cpp:latest \
-M vid_gen \
--diffusion-model /models/video/wan2.2/Wan2.2-T2V-A14B-LowNoise-Q5_K_M.gguf \
--high-noise-diffusion-model /models/video/wan2.2/Wan2.2-T2V-A14B-HighNoise-Q5_K_M.gguf \
--vae /models/video/wan2.2/wan_2.1_vae.safetensors \
--t5xxl /models/video/wan2.2/umt5-xxl-encoder-Q5_K_M.gguf \
--cfg-scale 3.5 \
--sampling-method euler \
--steps 10 \
--high-noise-cfg-scale 3.5 \
--high-noise-sampling-method euler \
--high-noise-steps 8 \
--vae-conv-direct \
--diffusion-conv-direct \
--vae-tiling \
-v \
-n "Colorful tones, overexposed, static, blurred details, subtitles, style, artwork, painting, picture, still, overall graying, worst quality, low quality, JPEG compression residue, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, deformed limbs, finger fusion, still pictures, messy backgrounds, three legs, many people in the background, walking backwards" \
-W 512 \
-H 512 \
--diffusion-fa \
--video-frames 24 \
--flow-shift 3.0 \
-o /output/video_output \
-p "A normal business meeting. People discuss business for 2 seconds. Suddenly, a horde of furries carrying assault rifles bursts into the room and causes a panic. Hatsune Miku leads the charge screaming in rage."

open-webui

mkdir /home/ai/.env
# Create a file called open-webui-env containing `WEBUI_SECRET_KEY="some-random-key"`
scp active/software_ai_stack/secrets/open-webui-env deskwork-ai:.env/

# Will be available on port 8080
podman run \
-d \
-p 8080:8080 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main

Use the following connections:

Service                    Endpoint
llama.cpp server           http://host.containers.internal:8000
llama.cpp embed            http://host.containers.internal:8001
stable-diffusion.cpp       http://host.containers.internal:1234/v1
stable-diffusion.cpp edit  http://host.containers.internal:1235/v1

lite-llm

https://docs.litellm.ai/docs/proxy/configs

podman run \
--rm \
--name litellm \
-p 4000:4000

Install Services with Quadlets

API Keys

mkdir -p /home/ai/.llama-api
touch /home/ai/.llama-api/keys.env
chmod 600 /home/ai/.llama-api/keys.env
vim /home/ai/.llama-api/keys.env

LLAMA_API_KEY=

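# Generate keys and append them to the file, then edit it so LLAMA_API_KEY= is followed by the comma-separated keys
openssl rand -base64 48 >> /home/ai/.llama-api/keys.env
openssl rand -base64 48 >> /home/ai/.llama-api/keys.env
openssl rand -base64 48 >> /home/ai/.llama-api/keys.env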
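The manual edit can be skipped by joining the keys as they are generated. A minimal sketch, assuming one key per input line; the helper name format_llama_keys is hypothetical.

```shell
# Join one key per line from stdin into a single LLAMA_API_KEY=... line.
format_llama_keys ()
{
    # paste -sd, joins all input lines with commas
    printf 'LLAMA_API_KEY=%s\n' "$(paste -sd, -)"
}

# Example: generate three keys and write the env file in one step
# for i in 1 2 3; do openssl rand -base64 48; done \
#   | format_llama_keys > /home/ai/.llama-api/keys.env
```

Note that the example overwrites keys.env rather than appending to it.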

Internal and External Pods

These will be used to restrict internet access to our llama.cpp and stable-diffusion.cpp services while allowing the frontend services to communicate with those containers.

scp -r active/software_ai_stack/ai-internal.* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user start ai-internal-pod.service
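The service name differs from the file name because Quadlet derives the unit name from the file type: as far as I know, a .container file keeps its base name, while .pod, .volume and .network files get a suffix. The mapping can be sketched as a small helper, which is handy when figuring out which unit to restart. The function name quadlet_unit is hypothetical and only the common file types are covered.

```shell
# Print the systemd unit name Quadlet generates for a given Quadlet file.
# Usage: quadlet_unit <file>
quadlet_unit ()
{
    local file base
    file=$(basename "$1")
    base="${file%.*}"
    case "$file" in
        *.container|*.kube) echo "$base.service" ;;
        *.pod)              echo "$base-pod.service" ;;
        *.volume)           echo "$base-volume.service" ;;
        *.network)          echo "$base-network.service" ;;
        *) echo "ERROR: unknown Quadlet file type: $file" 1>&2; return 1 ;;
    esac
}

# Example:
# systemctl --user status "$(quadlet_unit ai-internal.pod)"
```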

Llama CPP Server (Port 8000)

Installs the llama.cpp server to run our text models.

scp -r active/software_ai_stack/llama-think.container deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service

Llama CPP Embedding Server (Port 8001)

Installs the llama.cpp server to run our embedding models.

scp -r active/software_ai_stack/llama-embed.container deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service

Llama CPP Instruct Server (Port 8002)

Installs the llama.cpp server to run a constant instruct (no thinking) model for quick replies.

scp -r active/software_ai_stack/llama-instruct.container deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service

Stable Diffusion CPP (Port 1234 and 1235)

Installs the stable-diffusion.cpp server to run our image models.

scp -r active/software_ai_stack/quadlets_stable_diffusion/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service

Open WebUI (Port 8080)

Installs the Open WebUI frontend.

scp -r active/software_ai_stack/quadlets_openwebui/* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-external-pod.service

Note: from inside a container, every service is reachable at host.containers.internal. For example, llama.cpp is at http://host.containers.internal:8000.

Install the update script

# 1. Builds the latest llama.cpp and stable-diffusion.cpp
# 2. Pulls the latest open-webui
# 3. Restarts all services
scp active/software_ai_stack/update-script.sh deskwork-ai:
ssh deskwork-ai
chmod +x update-script.sh
./update-script.sh

Install Guest Open WebUI with Start/Stop Services

Optionally install a guest Open WebUI service.

scp -r active/software_ai_stack/systemd/. deskwork-ai:.config/systemd/user/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user enable open-webui-guest-start.timer
systemctl --user enable open-webui-guest-stop.timer

Benchmark Results

Benchmarks are run with unsloth gpt-oss-20b Q8_0

# Run the llama.cpp pod (AMD)
podman run -it --rm \
--device=/dev/kfd \
--device=/dev/dri \
-v /home/ai/models/text:/models:z \
--entrypoint /bin/bash \
ghcr.io/ggml-org/llama.cpp:full-vulkan

# Benchmark command
./llama-bench -m /models/gpt-oss-20b/gpt-oss-20b-Q8_0.gguf -p 4096 -n 1024

Framework Desktop

model             size       params   backend  ngl  test    t/s
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  Vulkan   99   pp4096  992.74 ± 6.07
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  Vulkan   99   tg1024  75.82 ± 0.07

AMD R9700

model             size       params   backend  ngl  test    t/s
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  Vulkan   99   pp4096  3190.85 ± 8.24
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  Vulkan   99   tg1024  168.73 ± 0.15

NVIDIA GeForce RTX 4080 SUPER

model             size       params   backend  ngl  test   t/s
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  CUDA     99   tg128  193.28 ± 1.03
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  CUDA     99   tg256  193.55 ± 0.34
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  CUDA     99   tg512  187.39 ± 0.10

NVIDIA GeForce RTX 3090

model             size       params   backend      ngl  test    t/s
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  CUDA,Vulkan  99   pp4096  3034.03 ± 80.36
gpt-oss 20B Q8_0  11.27 GiB  20.91 B  CUDA,Vulkan  99   tg1024  181.05 ± 9.01

Apple M4 Max

model                          test    t/s
unsloth/gpt-oss-20b-Q8_0-GGUF  pp2048  1579.12 ± 7.12
unsloth/gpt-oss-20b-Q8_0-GGUF  tg32    113.00 ± 2.81
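To compare two machines, divide their t/s values for the same test. The sketch below does that with awk for the Framework Desktop and AMD R9700 rows above, which are the only pair that ran identical Vulkan tests; the helper name speedup is hypothetical.

```shell
# Ratio of two llama-bench t/s values, rounded to two decimals.
# Usage: speedup <faster_tps> <slower_tps>
speedup ()
{
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Token generation (tg1024): AMD R9700 vs Framework Desktop
speedup 168.73 75.82
# Prompt processing (pp4096): AMD R9700 vs Framework Desktop
speedup 3190.85 992.74
```

On these numbers the R9700 comes out at about 2.2x the Framework Desktop for token generation and about 3.2x for prompt processing. The other tables use different test sizes, so their ratios are not directly comparable.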

Testing with Curl

OpenAI API

export TOKEN=$(cat active/software_ai_stack/secrets/aipi-token)

# List Models
curl https://llama-instruct.reeseapps.com/v1/models \
-H "Authorization: Bearer $TOKEN" | jq '.data'

# Text
curl https://llama-instruct.reeseapps.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
  "model": "llama-instruct/instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "max_tokens": 500
}' | jq

# Completion
curl https://llama-instruct.reeseapps.com/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
  "model": "llama-instruct/instruct",
  "prompt": "Write a short poem about the ocean.",
  "max_tokens": 500
}' | jq

# Image Gen
curl https://image-gen.reeselink.com/v1/images/generations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
  "model": "sdd-gen/sd-cpp-local",
  "prompt": "A futuristic city with flying cars at sunset, digital art",
  "n": 1,
  "size": "1024x1024"
}' | jq

# Image Edit
# The OpenAI-style edits endpoint expects multipart form data, so the image is uploaded with -F
curl http://aipi.reeseapps.com/v1/images/edits \
-H "Authorization: Bearer $TOKEN" \
-F model="sdd-edit/sd-cpp-local" \
-F image="@path/to/your/image.jpg" \
-F prompt="Add a sunset background" \
-F n=1 \
-F size="1024x1024"

# Embed
curl \
"https://llama-embed.reeseapps.com/v1/embeddings" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "model": "deskwork-embed/embed",
  "input":"This is the reason you ended up here:",
  "encoding_format": "float"
}'

vLLM

Run vLLM with Podman

# 'latest' and 'nightly' are both viable tags
podman run --rm \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface:z \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8010:8000 \
--ipc=host \
-e ROCBLAS_USE_HIPBLASLT=1 \
-e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
-e VLLM_TARGET_DEVICE=rocm \
-e HIP_FORCE_DEV_KERNARG=1 \
-e RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 \
docker.io/vllm/vllm-openai-rocm:nightly \
--enable-offline-docs \

# Pick your model: append one of the following lines to the command above
Qwen/Qwen3.5-35B-A3B-FP8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
Qwen/Qwen3.5-9B --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
Qwen/Qwen3.5-35B-A3B-FP8
google/gemma-4-26B-A4B-it
openai/gpt-oss-120b
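Since the model line has to be pasted onto the end of the command, wrapping the command in a function that forwards its arguments saves some editing. This is a sketch that reuses the flags from above unchanged; the function name run_vllm and the PODMAN override (useful for printing the command instead of running it) are my own additions.

```shell
# Start vLLM with the model (and any extra flags) passed as arguments.
# Set PODMAN=echo to print the command instead of running it.
run_vllm ()
{
    ${PODMAN:-podman} run --rm \
        --device /dev/kfd \
        --device /dev/dri \
        -v "$HOME/.cache/huggingface:/root/.cache/huggingface:z" \
        --env "HF_TOKEN=${HF_TOKEN:-}" \
        -p 8010:8000 \
        --ipc=host \
        -e ROCBLAS_USE_HIPBLASLT=1 \
        -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
        -e VLLM_TARGET_DEVICE=rocm \
        -e HIP_FORCE_DEV_KERNARG=1 \
        -e RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 \
        docker.io/vllm/vllm-openai-rocm:nightly \
        --enable-offline-docs \
        "$@"
}

# Example:
# run_vllm Qwen/Qwen3.5-9B --max-model-len 262144 --reasoning-parser qwen3
```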

Misc

Quantizing your own Models

# Create a scratch dir for downloading models
mkdir scratch && cd scratch

# qwen 3.5 35b
mkdir qwen3.5-35b-a3b && cd qwen3.5-35b-a3b
hf download --local-dir . Qwen/Qwen3.5-35B-A3B

# nemotron cascade
mkdir nemotron-cascade-2-30b-a3b && cd nemotron-cascade-2-30b-a3b
hf download --local-dir . nvidia/Nemotron-Cascade-2-30B-A3B

# Run the full llama.cpp image
podman run -it --rm \
--device=/dev/kfd \
--device=/dev/dri \
-v $(pwd):/models:z \
--entrypoint /bin/bash \
ghcr.io/ggml-org/llama.cpp:full-vulkan

# Run ./llama-quantize to see available quants
# 7 = Q8_0
# 18 = Q6_K
# 17 = Q5_K_M
# 15 = Q4_K_M
./llama-quantize /models/$MODEL_NAME.gguf /models/$MODEL_NAME-Q6_K.gguf 18
./llama-quantize /models/$MODEL_NAME.gguf /models/$MODEL_NAME-Q8_0.gguf 7
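llama-quantize also accepts the type name in place of the numeric ID, which makes it easy to loop over several quants. The sketch below assumes a full-precision GGUF already exists at /models/<model_name>.gguf; the downloads above are Hugging Face checkpoints, so they would first need converting (llama.cpp ships convert_hf_to_gguf.py for that). MODEL_NAME is never set in these notes, so the helper takes the name as an argument. The function name quantize_all and the LLAMA_QUANTIZE override are hypothetical.

```shell
# Produce several quantizations of one GGUF file.
# Usage: quantize_all <model_name> <quant> [<quant> ...]
# Set LLAMA_QUANTIZE=echo to print the commands instead of running them.
quantize_all ()
{
    local model="$1" quant
    shift
    for quant in "$@"; do
        ${LLAMA_QUANTIZE:-./llama-quantize} \
            "/models/$model.gguf" "/models/$model-$quant.gguf" "$quant" || return 1
    done
}

# Example (inside the container, placeholder model name):
# quantize_all my-model Q6_K Q8_0
```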

Qwen3.5 Settings

We recommend using the following set of sampling parameters for generation:

  • Non-thinking mode for text tasks: temperature=1.0, top_p=1.00, top_k=20, min_p=0.0, presence_penalty=2.0, repetition_penalty=1.0
  • Non-thinking mode for VL tasks: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Thinking mode for text tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Thinking mode for VL or precise coding (e.g. WebDev) tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.
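These settings can be sent per request instead of being baked into the server. The sketch below builds a request body with the non-thinking text values and reuses the endpoint and model name from the curl tests earlier in these notes. It assumes a llama.cpp server, which to my knowledge calls the repetition setting repeat_penalty rather than repetition_penalty; other frameworks may name or support these fields differently. The function name qwen_payload is hypothetical.

```shell
# Print a chat completion request body using the non-thinking text settings.
# Usage: qwen_payload <prompt>   (the prompt must not contain double quotes)
qwen_payload ()
{
    cat <<EOF
{
  "model": "llama-instruct/instruct",
  "messages": [{"role": "user", "content": "$1"}],
  "temperature": 1.0,
  "top_p": 1.0,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 2.0,
  "repeat_penalty": 1.0
}
EOF
}

# Example:
# qwen_payload "Hello" | curl https://llama-instruct.reeseapps.com/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -H "Authorization: Bearer $TOKEN" \
#   -d @-
```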