# I refuse to pay for LLMs

But I want them anyway. And I don't just want LLMs, I want:

1. Image Generation
2. Image Editing
3. Speech to Text
4. Text to Speech
5. Web Searching
6. RAG Retrieval
7. Guest accounts with time-based access
8. Probably other things

On rootless podman with snapshots and backups and no compromises.
- [I refuse to pay for LLMs](#i-refuse-to-pay-for-llms)
  - [Create your environment](#create-your-environment)
  - [Local LLM First](#local-llm-first)
    - [Ollama](#ollama)
    - [LM Studio](#lm-studio)
    - [llama.cpp](#llamacpp)
  - [Ok, so you have a backend](#ok-so-you-have-a-backend)
    - [What about llama-server?](#what-about-llama-server)
    - [Anything LLM](#anything-llm)
    - [Open Webui](#open-webui)
  - [But we don't have image editing working](#but-we-dont-have-image-editing-working)
    - [Stable Diffusion CPP](#stable-diffusion-cpp)
  - [Making it Run with Quadlets](#making-it-run-with-quadlets)
## Create your environment

I created a user named `ai` to run all my AI services. Do that now:

```bash
useradd -m ai
loginctl enable-linger ai
su -l ai
mkdir -p /home/ai/.config/containers/systemd/
mkdir -p /home/ai/.ssh
```
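The `podman run` examples later in this post attach containers to a pod called `systemd-ai-internal`. If you want systemd to manage that pod as well, a minimal quadlet sketch along these lines would create it; the file name, description, and published port are my assumptions (and quadlet `.pod` files need a reasonably recent podman), so adjust to taste:

```bash
# Hypothetical quadlet: ~/.config/containers/systemd/ai-internal.pod
# Quadlet names the resulting pod "systemd-ai-internal" by default.
cat > /home/ai/.config/containers/systemd/ai-internal.pod << 'EOF'
[Unit]
Description=Shared pod for AI services

[Pod]
# Expose the ports your services listen on inside the pod
PublishPort=8000:8000

[Install]
WantedBy=default.target
EOF

# Generate and start the pod service
systemctl --user daemon-reload
systemctl --user start ai-internal-pod.service
```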
## Local LLM First

On the Framework Desktop (or any AMD system) your options are the ROCm or Vulkan drivers. Both are fine, with Vulkan pulling slightly ahead as of February 2026. Almost every backend you pick will support both, so pick a backend first.
### Ollama

Ollama is the natural place to start. Their "marketplace" is the best I've found for browsing models. They include short descriptions of what the models are good for, and (almost) all of them work out of the box!

Bonus points: Ollama's API is well supported by interfaces like Anything LLM, Open Webui, a litany of F-Droid apps, and many other services.
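If you just want to kick the tires, the usual quick start looks roughly like this (the model tag is only an example; pick whatever fits your hardware):

```bash
# Install Ollama with their official script
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model and chat with it in the terminal
ollama run qwen3:8b

# The Ollama API listens on port 11434 by default
curl http://localhost:11434/api/tags
```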
Honestly, Ollama is still where I'd recommend anyone start. The installer is easy, performance is decent, the API is great, and the Ollama team curates models that work well on their platform. What's not to like?

Performance, mostly. llama.cpp just performs 20-30% better in my testing on models like gpt-oss-120b. Your mileage may vary; this is a great project.
### LM Studio

Everyone says to start with this. Ok, first of all, it's a GUI app. Yeah, there's a toggle to run an API server, but ain't no way I'm installing Wayland on my pure, uncompromising, headless Fedora server.

I do have to admit it's the fastest way to get started with LLMs on a desktop. But we're not here for desktops, we're here for servers. It runs llama.cpp as its backend anyway, so skip past this and go for the good stuff.
### llama.cpp

We've landed on the best choice. You'll browse Hugging Face for models, be confused, and like it. You'll struggle to read the logs and feel right at home. You'll wonder why there isn't an intuitive CLI like Ollama's. And you'll be rewarded with the fastest, most flexible way to run LLMs.

You'll need the Hugging Face CLI (`hf`). Install that.
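One way to get it is via pip; recent versions of the `huggingface_hub` package ship the `hf` entry point (how you install it is up to you, this is just a sketch):

```bash
# Install the Hugging Face CLI as the ai user
pip install --user -U "huggingface_hub[cli]"

# Optional: log in if you plan to pull gated models
hf auth login
```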
First, download qwen3-vl-8b. This is a good jack-of-all-trades model that supports vision, which is nice.
```bash
# Create a directory to hold your text models
# I put mine at /home/ai/models/text
mkdir -p /home/ai/models/text/qwen3-vl-8b-instruct

# Download the model from hugging face
hf download --local-dir /home/ai/models/text/qwen3-vl-8b-instruct Qwen/Qwen3-VL-8B-Instruct-GGUF Qwen3VL-8B-Instruct-Q4_K_M.gguf

# Also download the "mmproj" file for this model
# "mmproj" files allow a model to see images
hf download --local-dir /home/ai/models/text/qwen3-vl-8b-instruct Qwen/Qwen3-VL-8B-Instruct-GGUF mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf
```
With our model locked and loaded, we can run the llama.cpp server. We do have to build the llama.cpp server container first, though, because making this any easier would be a crime.
```bash
# Build the llama.cpp container image
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")

# Vulkan
podman build -f .devops/vulkan.Dockerfile -t llama-cpp-vulkan:${BUILD_TAG} -t llama-cpp-vulkan:latest .

# Run llama server (available on port 8000)
# Add `--n-cpu-moe 32` for gpt-oss-120b to keep a minimal number of experts on the GPU
podman run \
  --rm \
  --name llama-server-demo \
  --device=/dev/kfd \
  --device=/dev/dri \
  --pod systemd-ai-internal \
  -v /home/ai/models/text:/models:z \
  localhost/llama-cpp-vulkan:latest \
  --port 8000 \
  -c 16384 \
  --perf \
  --n-gpu-layers all \
  --jinja \
  --models-max 1 \
  --models-dir /models
```
You should be able to access the llama.cpp server at `http://{your-ip}:8000`. From there you can select the only model you have downloaded (qwen3-vl-8b) and have a conversation.
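The same server also exposes an OpenAI-compatible API, which is what the frontends below will talk to. A quick smoke test from the host might look like this (the `model` value is just the name of whatever you loaded; in single-model setups llama-server uses the active model regardless):

```bash
# Ask the OpenAI-compatible endpoint for a short reply
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-vl-8b-instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```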
## Ok, so you have a backend

Now we need a frontend. In my experience there are only two choices, but this is changing extremely fast.
### What about llama-server?

Good enough for testing. Honestly, if this meets your needs, more power to you.
### Anything LLM

I started here about a year ago. This is a fantastic frontend with RAG, speech to text, text to speech, web search, plugins, and decent user management. It supports Ollama, OpenAI, and a bunch of other backends.

Unfortunately, as of when I used it, there was no integrated image generation or image editing.
### Open Webui

This is, in my opinion, the best frontend experience you can get. The killer feature is side-by-side HTML rendering with your LLM response. If your LLM writes HTML/JavaScript/CSS, it'll render in real time next to your chat. That's ridiculously cool.

It also supports image generation as a tool that your LLM can call. Prompts like "Generate an image of a dragon" will trigger a call to the image generation tool. Generated images show up in the chat and can be edited with another message.
```bash
mkdir /home/ai/.env
vim /home/ai/.env/open-webui-env

# Add this to the file, then save and exit
WEBUI_SECRET_KEY="some-random-key"

# Will be available on port 8080
podman run \
  -d \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  --env-file /home/ai/.env/open-webui-env \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
Use the following connections when configuring models/image editing:

| Service              | Endpoint                                  |
| -------------------- | ----------------------------------------- |
| llama.cpp            | <http://host.containers.internal:8000>    |
| stable-diffusion.cpp | <http://host.containers.internal:1234/v1> |
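Before wiring those into Open Webui, it's worth checking that the backends answer at all. From the host you can hit llama-server's OpenAI-compatible model listing (from inside a container, the same service is reached via `host.containers.internal` instead of `localhost`):

```bash
# List the models llama-server currently knows about
curl http://localhost:8000/v1/models
```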
## But we don't have image editing working

In the past I used stable-diffusion-webui-forge. This project relied on a very specific set of ROCm torch versions installed via pip from the nightly ROCm pip repository. I had Stable Diffusion XL and Flux1.dev working on an AMD GPU, but I couldn't get this working at all on the Framework Desktop.

I found out later this might be due to a ROCm driver bug, but we have bigger and better projects to work with.
### Stable Diffusion CPP

This project is the llama.cpp equivalent for image generation. OpenAI-compatible API, tons of model support, excellent documentation: it's the best.
```bash
# Clone and build the stable diffusion cpp container
git clone https://github.com/leejet/stable-diffusion.cpp.git
cd stable-diffusion.cpp
git submodule update --init --recursive
export BUILD_TAG=$(date +"%Y-%m-%d-%H-%M-%S")
podman build -f Dockerfile.vulkan -t stable-diffusion-cpp:${BUILD_TAG} -t stable-diffusion-cpp:latest .
```
Stable Diffusion CPP supports a CLI and a web server. Let's download a model and test out the CLI.
```bash
# z-turbo image model
# Fastest image generation in 8 steps. Great at text and prompt following.
# Lacks variety.
mkdir -p /home/ai/models/image/z-turbo
hf download --local-dir /home/ai/models/image/z-turbo QuantStack/FLUX.1-Kontext-dev-GGUF flux1-kontext-dev-Q4_K_M.gguf
hf download --local-dir /home/ai/models/image/z-turbo black-forest-labs/FLUX.1-schnell ae.safetensors
hf download --local-dir /home/ai/models/image/z-turbo unsloth/Qwen3-4B-Instruct-2507-GGUF Qwen3-4B-Instruct-2507-Q4_K_M.gguf

# Create our output directory
mkdir /home/ai/output

# Generate an image of a photorealistic dragon.
# Note: point --diffusion-model at the diffusion GGUF you actually downloaded above.
podman run --rm \
  -v /home/ai/models:/models:z \
  -v /home/ai/output:/output:z \
  --device /dev/kfd \
  --device /dev/dri \
  localhost/stable-diffusion-cpp:latest \
  --diffusion-model /models/image/z-turbo/z_image_turbo-Q4_K.gguf \
  --vae /models/image/z-turbo/ae.safetensors \
  --llm /models/image/z-turbo/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
  --cfg-scale 1.0 \
  -v \
  --seed -1 \
  --steps 8 \
  --vae-conv-direct \
  -H 1024 \
  -W 1024 \
  -o /output/output.png \
  -p "A photorealistic dragon"
```
With any luck you should have a picture of a dragon in your output folder.

Since we know it works, we can tie everything together.
## Making it Run with Quadlets

Now that we know our setup works, we can glue it all together with systemd.

Take a look at [the framework desktop docs](https://gitea.reeseapps.com/services/homelab/src/branch/main/active/device_framework_desktop/framework_desktop.md#install-the-whole-thing-with-quadlets-tm) for the relevant commands.
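To give a flavor of what those look like, here is a minimal sketch of a `.container` quadlet for the llama.cpp server, mirroring the `podman run` flags used earlier and the hypothetical `ai-internal.pod` from the environment section. The file name, description, and restart policy are my assumptions; the linked docs are the source of truth for the real setup:

```bash
# Hypothetical quadlet: ~/.config/containers/systemd/llama-server.container
cat > /home/ai/.config/containers/systemd/llama-server.container << 'EOF'
[Unit]
Description=llama.cpp server (Vulkan)

[Container]
Image=localhost/llama-cpp-vulkan:latest
Pod=ai-internal.pod
AddDevice=/dev/kfd
AddDevice=/dev/dri
Volume=/home/ai/models/text:/models:z
Exec=--port 8000 -c 16384 --perf --n-gpu-layers all --jinja --models-max 1 --models-dir /models

[Service]
Restart=always

[Install]
WantedBy=default.target
EOF

# Generate and start the service
systemctl --user daemon-reload
systemctl --user start llama-server.service
```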