checkpoint commit

2026-05-05 06:26:40 -04:00
parent e43c534ceb
commit f2015e2c71
76 changed files with 4265 additions and 235 deletions

@@ -34,6 +34,7 @@
- [open-webui](#open-webui)
- [lite-llm](#lite-llm)
- [Install Services with Quadlets](#install-services-with-quadlets)
- [API Keys](#api-keys)
- [Internal and External Pods](#internal-and-external-pods)
- [Llama CPP Server (Port 8000)](#llama-cpp-server-port-8000)
- [Llama CPP Embedding Server (Port 8001)](#llama-cpp-embedding-server-port-8001)
@@ -179,7 +180,11 @@ rsync -av --progress /home/ai/models/ /srv/models/
### Download models
In general I try to run 8-bit quantization at minimum.
In my completely subjective opinion: 5 bit quant is usually the sweet spot for
unsloth models. Q5_K_S is usually just fine.
I usually download the F16 mmproj files. This is also completely subjective.
BF16 is fine. F32 is overkill.
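The quant choice above is mostly a size/quality trade-off; a rough size estimate (illustrative arithmetic only — real GGUF files add overhead for metadata and some higher-precision tensors):

```bash
# Back-of-the-envelope GGUF size: params (billions) x bits-per-weight / 8 ~= GB
params_b=35                                  # e.g. a 35B-parameter model
echo "Q5-ish: ~$(( params_b * 5 / 8 )) GB"   # 5-bit quant
echo "Q8-ish: ~$(( params_b * 8 / 8 )) GB"   # 8-bit quant
echo "F16:    ~$(( params_b * 16 / 8 )) GB"  # half precision
```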
#### Text models
@@ -218,8 +223,13 @@ hf download --local-dir . ggml-org/Ministral-3-3B-Instruct-2512-GGUF
##### Qwen
```bash
# qwen3.6-35b-a3b
mkdir qwen3.6-35b-a3b && cd qwen3.6-35b-a3b
hf download --local-dir . unsloth/Qwen3.6-35B-A3B-GGUF Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
hf download --local-dir . unsloth/Qwen3.6-35B-A3B-GGUF mmproj-F16.gguf
# qwen3.5-27b-opus
mkdir qwen3.5-27b-opus && cd qwen3.5-27b-opus
hf download --local-dir . Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF Qwen3.5-27B.Q4_K_M.gguf
hf download --local-dir . Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF mmproj-BF16.gguf
@@ -555,6 +565,22 @@ podman run \
## Install Services with Quadlets
### API Keys
```bash
mkdir -p /home/ai/.llama-api
touch /home/ai/.llama-api/keys.env
chmod 600 /home/ai/.llama-api/keys.env
vim /home/ai/.llama-api/keys.env
# In the editor, start the file with a single line: LLAMA_API_KEY=
# Generate keys, append them to the file, then join them comma-separated onto that line
openssl rand -base64 48 >> /home/ai/.llama-api/keys.env
openssl rand -base64 48 >> /home/ai/.llama-api/keys.env
openssl rand -base64 48 >> /home/ai/.llama-api/keys.env
```
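The generate-then-hand-edit step above can also be scripted; a sketch that builds the comma-separated key line in one shot (writes to `./keys.env` — adjust the path to `/home/ai/.llama-api/keys.env`):

```bash
# Generate three keys and emit a single comma-separated LLAMA_API_KEY line
keys=$(for i in 1 2 3; do openssl rand -base64 48 | tr -d '\n'; echo; done | paste -sd, -)
printf 'LLAMA_API_KEY=%s\n' "$keys" > keys.env
chmod 600 keys.env
```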
### Internal and External Pods
These will be used to restrict internet access to our llama.cpp and
@@ -562,10 +588,10 @@ stable-diffusion.cpp services while allowing the frontend services to
communicate with those containers.
```bash
scp -r active/software_ai_stack/ai-internal.* deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user start ai-internal-pod.service
```
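The internet restriction above comes from the pod's network being marked internal; a hypothetical sketch of what such quadlet files might contain (the actual files ship with the repo — names and keys here are assumptions):

```ini
# ai-internal.network (sketch -- assumed content)
[Network]
NetworkName=ai-internal
# Internal=true blocks outbound internet while keeping container-to-container traffic
Internal=true

# ai-internal.pod (sketch -- assumed content)
[Pod]
PodName=ai-internal
Network=ai-internal.network
```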
### Llama CPP Server (Port 8000)
@@ -573,7 +599,7 @@ systemctl --user start ai-internal-pod.service ai-external-pod.service
Installs the llama.cpp server to run our text models.
```bash
scp -r active/software_ai_stack/llama-think.container deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
@@ -584,7 +610,7 @@ systemctl --user restart ai-internal-pod.service
Installs the llama.cpp server to run our embedding models.
```bash
scp -r active/software_ai_stack/llama-embed.container deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
@@ -595,7 +621,7 @@ systemctl --user restart ai-internal-pod.service
Installs the llama.cpp server to run an always-on instruct (no-thinking) model for quick replies
```bash
scp -r active/software_ai_stack/llama-instruct.container deskwork-ai:.config/containers/systemd/
ssh deskwork-ai
systemctl --user daemon-reload
systemctl --user restart ai-internal-pod.service
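# After a restart the model can take a while to load; a hypothetical helper to
# poll the health endpoint (llama.cpp's server exposes GET /health) until ready:
wait_healthy() {  # usage: wait_healthy <url> [tries]
  url=$1; tries=${2:-30}
  for _ in $(seq "$tries"); do
    curl -sf "$url" >/dev/null && return 0
    sleep 1
  done
  return 1
}
# wait_healthy http://localhost:8000/health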
@@ -711,11 +737,11 @@ Apple M4 max
export TOKEN=$(cat active/software_ai_stack/secrets/aipi-token)
# List Models
curl https://llama-instruct.reeseapps.com/v1/models \
-H "Authorization: Bearer $TOKEN" | jq '.data'
# Text
curl https://llama-instruct.reeseapps.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
@@ -724,26 +750,21 @@ curl https://aipi.reeseapps.com/v1/chat/completions \
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello, how are you?"}
],
"temperature": 0.7,
"max_tokens": 500
}' | jq
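# The assistant text can be extracted with jq; shown here against a sample
# payload (illustrative, not live output):
resp='{"choices":[{"message":{"role":"assistant","content":"Hi there!"}}]}'
echo "$resp" | jq -r '.choices[0].message.content'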
# Completion
curl https://llama-instruct.reeseapps.com/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"model": "llama-instruct/instruct",
"prompt": "Write a short poem about the ocean.",
"temperature": 0.7,
"max_tokens": 500,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0
"max_tokens": 500
}' | jq
# Image Gen
curl https://image-gen.reeselink.com/v1/images/generations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
@@ -766,11 +787,11 @@ curl http://aipi.reeseapps.com/v1/images/edits \
# Embed
curl \
"https://aipi.reeseapps.com/v1/embeddings" \
"https://llama-embed.reeseapps.com/v1/embeddings" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-embed/embed",
"model": "deskwork-embed/embed",
"input":"This is the reason you ended up here:",
"encoding_format": "float"
}'
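# Embedding vectors from responses like the one above are usually compared by
# cosine similarity; a small awk sketch with stand-in vectors:
cos_sim() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); split(b, y, ",")
    for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]*x[i]; nb += y[i]*y[i] }
    printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}
cos_sim "1,0,0" "1,0,0"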
@@ -789,16 +810,20 @@ podman run --rm \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8010:8000 \
--ipc=host \
-e ROCBLAS_USE_HIPBLASLT=1 \
-e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
-e VLLM_TARGET_DEVICE=rocm \
-e HIP_FORCE_DEV_KERNARG=1 \
-e RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 \
docker.io/vllm/vllm-openai-rocm:nightly \
--enable-offline-docs \
# Pick your model
Qwen/Qwen3.5-35B-A3B --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
Qwen/Qwen3.5-35B-A3B-FP8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
Qwen/Qwen3.5-9B --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
Qwen/Qwen3.5-35B-A3B-FP8
google/gemma-4-26B-A4B-it
openai/gpt-oss-120b
```
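Once the vLLM container is serving, the OpenAI-compatible endpoint lists the loaded model; the jq filter can be checked against a sample payload (illustrative JSON, not live output — against the live server use `curl -s http://localhost:8010/v1/models` per the port mapping above):

```bash
# Pull model ids out of a /v1/models response (sample JSON stands in for the API)
resp='{"object":"list","data":[{"id":"Qwen/Qwen3.5-35B-A3B","object":"model"}]}'
echo "$resp" | jq -r '.data[].id'
```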
## Misc