# I refuse to pay for LLMs
But I want them anyway. And I don't just want LLMs, I want:
1. Image Generation
2. Image Editing
3. Speech to Text
4. Text to Speech
5. Web Searching
6. RAG Retrieval
7. Guest accounts with time-based access
8. Probably other things
On rootless podman with snapshots and backups and no compromises.
- [I refuse to pay for LLMs](#i-refuse-to-pay-for-llms)
- [Create your environment](#create-your-environment)
- [Local LLM First](#local-llm-first)
- [Ollama](#ollama)
- [LM Studio](#lm-studio)
- [llama.cpp](#llamacpp)
- [Ok, so you have a backend](#ok-so-you-have-a-backend)
- [What about llama-server?](#what-about-llama-server)
- [Anything LLM](#anything-llm)
- [Open Webui](#open-webui)
- [But we don't have image editing working](#but-we-dont-have-image-editing-working)
- [Stable Diffusion CPP](#stable-diffusion-cpp)
- [Making it Run with Quadlets](#making-it-run-with-quadlets)
## Create your environment
I created a user named `ai` to run all my AI services. Do that now:
```bash
useradd -m ai
loginctl enable-linger ai
su -l ai
mkdir -p /home/ai/.config/containers/systemd/
mkdir -p /home/ai/.ssh
```
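Lingering is what keeps the `ai` user's services alive after you log out, so it's worth confirming it actually took effect. A quick check, with a fallback message in case logind doesn't know the user yet:

```bash
# Check that lingering is enabled for the ai user; without lingering,
# user services stop as soon as the last session closes
loginctl show-user ai --property=Linger 2>/dev/null \
  || echo "ai user has no logind record yet"
```

Expect `Linger=yes` once `enable-linger` has run.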
## Local LLM First
On the Framework Desktop (or any AMD system) your options are ROCm or Vulkan drivers. Both are fine, with Vulkan pulling slightly ahead as of February 2026. Almost every backend you pick will support both, so pick a backend first.
### Ollama
Ollama is the natural place to start. Their "marketplace" is the best I've found for browsing models. They include short descriptions about what the models are good for, and (almost) all of them work out of the box!
Bonus points: Ollama's API is well supported by interfaces like Anything LLM, Open Webui, a litany of F-Droid apps, and many other services.
Honestly, Ollama is still where I'd recommend anyone start. The installer is easy, performance is decent, the API is great, they (the Ollama team) curate models that work well on their platform, what's not to like?
Performance, mostly. In my testing, llama.cpp just performs 20-30% better on models like gpt-oss-120b. Your mileage may vary; Ollama is still a great project.
### LM Studio
Everyone says to start with this. Ok, first of all, it's a GUI app. Yeah, there's a toggle to run an API server, but ain't no way I'm installing Wayland on my pure, uncompromising, headless Fedora server.
I do have to admit it's the fastest way to get started with LLMs on desktop. But we're not here for desktops, we're here for servers. It runs llama.cpp in the backend anyway so skip past this and go for the good stuff.
### llama.cpp
We've landed on the best choice. You'll browse Hugging Face for models, be confused, and like it. You'll struggle to read the logs and feel right at home. You'll wonder why there isn't an intuitive CLI like Ollama. And you'll be rewarded with the fastest, most flexible way to run LLMs.
You'll need the Hugging Face CLI (`hf`). Install that.
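A minimal way to get it, assuming Python and pip are already on the box (the Hugging Face docs list other install methods):

```bash
# Install the Hugging Face CLI via pip
python3 -m pip install -U "huggingface_hub[cli]"
# Confirm the `hf` entry point works
hf --help
```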
First, download qwen3-vl-8b. This is a good jack of all trades model that supports vision, which is nice.
```bash
# Create a directory to hold your text models
# I put mine at /home/ai/models/text
mkdir -p /home/ai/models/text/qwen3-vl-8b-instruct
# Download the model from hugging face
hf download --local-dir /home/ai/models/text/qwen3-vl-8b-instruct Qwen/Qwen3-VL-8B-Instruct-GGUF Qwen3VL-8B-Instruct-Q4_K_M.gguf
# Also download the "mmproj" file for this model
# "mmproj" files allow a model to see images
hf download --local-dir /home/ai/models/text/qwen3-vl-8b-instruct Qwen/Qwen3-VL-8B-Instruct-GGUF mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf
```
With our model locked and loaded, we can run the llama.cpp server. We do have to build the llama.cpp server container first though because making this any easier would be a crime.
```bash
# Build the llama.cpp container image
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")
# Vulkan
podman build -f .devops/vulkan.Dockerfile -t llama-cpp-vulkan:${BUILD_TAG} -t llama-cpp-vulkan:latest .
# Run llama server (Available on port 8000)
# For gpt-oss-120b, add `--n-cpu-moe 32` to keep only a minimal number of expert layers on the GPU
podman run \
--rm \
--name llama-server-demo \
--device=/dev/kfd \
--device=/dev/dri \
--pod systemd-ai-internal \
-v /home/ai/models/text:/models:z \
localhost/llama-cpp-vulkan:latest \
--port 8000 \
-c 16384 \
--perf \
--n-gpu-layers all \
--jinja \
--models-max 1 \
--models-dir /models
```
You should be able to access the llama.cpp server at http://{your-ip}:8000. From there you can select the only model you have downloaded (qwen3-vl-8b) and have a conversation.
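Beyond the web UI, llama-server speaks the OpenAI chat-completions API, so you can smoke-test it with curl. The model name here is an assumption; check `/v1/models` for what your server actually reports:

```bash
# Build the request body for an OpenAI-style chat completion
PAYLOAD='{"model": "qwen3-vl-8b-instruct", "messages": [{"role": "user", "content": "Describe yourself in one sentence."}]}'
# POST it to llama-server; fall back to a message if the server isn't up yet
curl -sf --max-time 60 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "llama-server not reachable on :8000"
```

Anything that can talk to OpenAI's API (SDKs, Open Webui, Anything LLM) can point at this endpoint the same way.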
## Ok, so you have a backend
Now we need a frontend. In my experience there are only 2 choices, but this is changing extremely fast.
### What about llama-server?
Good enough for testing. Honestly, if this meets your needs, more power to you.
### Anything LLM
I started here about a year ago. This is a fantastic frontend with RAG retrieval, speech to text, text to speech, web search, plugins, and decent user management. It supports Ollama, OpenAI, and a bunch of other backends.
Unfortunately, as of when I used it, there was no integrated image generation or image editing.
### Open Webui
This is, in my opinion, the best frontend experience you can get. The killer feature is side-by-side HTML rendering with your LLM response. If your LLM writes HTML/JavaScript/CSS, it'll render in real time next to your chat. That's ridiculously cool.
It also supports image generation as a tool that your LLM can call. Prompts like "Generate an image of a dragon" will trigger a call to the image generation tool. Generated images show up in the chat and can be edited with another message.
```bash
mkdir -p /home/ai/.env
vim /home/ai/.env/open-webui-env
# Add this to the file, then save and exit
WEBUI_SECRET_KEY="some-random-key"
# Will be available on port 8080
podman run \
-d \
-p 8080:8080 \
-v open-webui:/app/backend/data \
--env-file /home/ai/.env/open-webui-env \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
```
Use the following connections when configuring models/image editing:
| Service | Endpoint |
| -------------------- | ----------------------------------------- |
| llama.cpp | <http://host.containers.internal:8000> |
| stable-diffusion.cpp | <http://host.containers.internal:1234/v1> |
## But we don't have image editing working
In the past I used stable-diffusion-webui-forge. This project relied on a very
specific set of ROCM torch versions installed via pip from the nightly ROCM pip
repository. I had Stable Diffusion XL and Flux1.dev working on an AMD GPU, but I
couldn't get this working at all on the Framework Desktop.
I found out later this might be due to a ROCM driver bug, but we have bigger and better projects to work with.
### Stable Diffusion CPP
This project is the llama.cpp equivalent for image generation. OpenAI-compatible API, tons of model support, excellent documentation: it's the best.
```bash
# Clone and build the stable diffusion cpp container
git clone https://github.com/leejet/stable-diffusion.cpp.git
cd stable-diffusion.cpp
git submodule update --init --recursive
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")
podman build -f Dockerfile.vulkan -t stable-diffusion-cpp:${BUILD_TAG} -t stable-diffusion-cpp:latest .
```
Stable Diffusion CPP supports a CLI and a web server. Let's download a model and test out the CLI.
```bash
# z-turbo image model
# Fastest image generation in 8 steps. Great at text and prompt following.
# Lacks variety.
mkdir -p /home/ai/models/image/z-turbo
hf download --local-dir /home/ai/models/image/z-turbo QuantStack/FLUX.1-Kontext-dev-GGUF flux1-kontext-dev-Q4_K_M.gguf
hf download --local-dir /home/ai/models/image/z-turbo black-forest-labs/FLUX.1-schnell ae.safetensors
hf download --local-dir /home/ai/models/image/z-turbo unsloth/Qwen3-4B-Instruct-2507-GGUF Qwen3-4B-Instruct-2507-Q4_K_M.gguf
# Create our output directory
mkdir -p /home/ai/output
# Generate an image of a photorealistic dragon
# (make sure --diffusion-model below points at the GGUF file you downloaded)
podman run --rm \
-v /home/ai/models:/models:z \
-v /home/ai/output:/output:z \
--device /dev/kfd \
--device /dev/dri \
localhost/stable-diffusion-cpp:latest \
--diffusion-model /models/image/z-turbo/z_image_turbo-Q4_K.gguf \
--vae /models/image/z-turbo/ae.safetensors \
--llm /models/image/z-turbo/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 \
-v \
--seed -1 \
--steps 8 \
--vae-conv-direct \
-H 1024 \
-W 1024 \
-o /output/output.png \
-p "A photorealistic dragon"
```
With any luck you should have a picture of a dragon in your output folder.
Since we know it works, we can tie everything together.
## Making it Run with Quadlets
Now that we know our setup works, we can glue it all together with systemd.
Take a look at [the framework desktop docs](https://gitea.reeseapps.com/services/homelab/src/branch/main/active/device_framework_desktop/framework_desktop.md#install-the-whole-thing-with-quadlets-tm) for the relevant commands.
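To make that concrete, here's a rough sketch of what a llama-server quadlet could look like, dropped into the `systemd` directory we created earlier. The file name, description, and flag selection are my assumptions based on the `podman run` command above; the linked docs have the authoritative versions:

```ini
# /home/ai/.config/containers/systemd/llama-server.container (hypothetical)
[Unit]
Description=llama.cpp server (Vulkan)

[Container]
Image=localhost/llama-cpp-vulkan:latest
ContainerName=llama-server
AddDevice=/dev/kfd
AddDevice=/dev/dri
Volume=/home/ai/models/text:/models:z
Exec=--port 8000 -c 16384 --jinja --n-gpu-layers all --models-dir /models

[Service]
Restart=always

[Install]
WantedBy=default.target
```

After `systemctl --user daemon-reload`, the generated service starts with `systemctl --user start llama-server`.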