Hoster Setup Guide
Everything you need to get your GPU server running and registered with LLMFinder. Takes about 15 minutes.
Quick Setup (recommended)
The setup wizard handles everything: installs Docker, detects your GPU, downloads a model, writes a docker-compose.yml, starts your server with a Cloudflare tunnel, and registers you, all in one run.
curl -O https://llmfinder.net/llmfinder-hoster.py && python3 llmfinder-hoster.py
What the wizard does (v4.0):
- Installs Docker if needed
- Detects your GPU (NVIDIA/AMD/Intel/CPU) and picks the right image
- Shows model suggestions with size + download time estimates
- Downloads your chosen model to `~/llmfinder-models/`
- Calculates optimal context size from GGUF metadata + available VRAM
- Writes a `docker-compose.yml` with llama-server + cloudflared tunnel
- Checks for port conflicts and stale processes before launching
- Starts everything with `docker compose up -d`
- Reads the tunnel URL from Docker logs, runs health + inference checks
- Registers your server and issues your API key
Manual Setup
If you prefer to configure things yourself, follow these steps.
Install Docker
All LLMFinder server backends run via Docker Compose: no native installs, no PATH issues, no missing .so files.
# Linux (Ubuntu/Debian)
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
# Verify
docker --version
docker compose version
For NVIDIA GPU support, also install the container toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Download a model
Models are stored on the host at ~/llmfinder-models/ and mounted into the container, so no re-download is needed when updating the image.
mkdir -p ~/llmfinder-models
# Download a GGUF model (example: Qwen 2.5 7B Q4)
curl -L "https://huggingface.co/bartowski/Qwen_Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
-o ~/llmfinder-models/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf
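A quick way to catch a truncated or HTML-error download: every valid GGUF file starts with the 4-byte ASCII magic `GGUF`. A minimal check (the `is_gguf` helper is just an illustration, not part of any toolchain):

```shell
# Every valid GGUF file begins with the ASCII magic "GGUF";
# a failed download (e.g. a saved HTML error page) will not.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example:
# is_gguf ~/llmfinder-models/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf && echo OK || echo "bad download"
```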
Note the exact .gguf filename; you will reference it in your compose file.

Model size guide:
| VRAM | Recommended model | Context |
|---|---|---|
| 8 GB | 7B Q4_K_M (~4.5 GB) | 8k–16k |
| 12 GB | 9B Q6_K (~7 GB) | 16k–32k |
| 16 GB | 13B Q8 or 9B FP16 | 32k |
| 24 GB | 26B Q4 (~15 GB) | 32k–64k |
| 48 GB+ | 70B Q4 | Full native ctx |
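If you script your setup, the table above can be expressed as a tiny helper. This is just a sketch mirroring the tiers listed; the function name and thresholds are illustrative, not part of the wizard:

```shell
# Map available VRAM (in whole GB) to the model tier from the table above.
recommend_model() {
  local vram_gb=$1
  if   [ "$vram_gb" -ge 48 ]; then echo "70B Q4"
  elif [ "$vram_gb" -ge 24 ]; then echo "26B Q4"
  elif [ "$vram_gb" -ge 16 ]; then echo "13B Q8 or 9B FP16"
  elif [ "$vram_gb" -ge 12 ]; then echo "9B Q6_K"
  else                             echo "7B Q4_K_M"
  fi
}

# On an NVIDIA box you could feed it a real number, e.g.:
# recommend_model "$(( $(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1) / 1024 ))"
```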
Create docker-compose.yml
Pick the variant below that matches your hardware. All of them include a cloudflared tunnel service.

NVIDIA (CUDA):
services:
llama-server:
image: ghcr.io/ggml-org/llama.cpp:server-cuda
container_name: llmfinder-node
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ~/llmfinder-models:/models:ro
environment:
- LLAMA_ARG_MODEL=/models/your-model.gguf
- LLAMA_ARG_CTX_SIZE=32768
- LLAMA_ARG_N_GPU_LAYERS=99
- LLAMA_ARG_PORT=8080
- LLAMA_ARG_HOST=0.0.0.0
- LLAMA_ARG_API_KEY=your-bearer-token
- LLAMA_ARG_CONT_BATCHING=1
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- llama-server
command: tunnel --no-autoupdate run --url http://llama-server:8080
AMD (ROCm):

services:
llama-server:
image: ghcr.io/ggml-org/llama.cpp:server-rocm
container_name: llmfinder-node
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ~/llmfinder-models:/models:ro
environment:
- LLAMA_ARG_MODEL=/models/your-model.gguf
- LLAMA_ARG_CTX_SIZE=32768
- LLAMA_ARG_N_GPU_LAYERS=99
- LLAMA_ARG_PORT=8080
- LLAMA_ARG_HOST=0.0.0.0
- LLAMA_ARG_API_KEY=your-bearer-token
- LLAMA_ARG_CONT_BATCHING=1
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
group_add:
- video
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- llama-server
command: tunnel --no-autoupdate run --url http://llama-server:8080
CPU only:

services:
llama-server:
image: ghcr.io/ggml-org/llama.cpp:server
container_name: llmfinder-node
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ~/llmfinder-models:/models:ro
environment:
- LLAMA_ARG_MODEL=/models/your-model.gguf
- LLAMA_ARG_CTX_SIZE=8192
- LLAMA_ARG_N_GPU_LAYERS=0
- LLAMA_ARG_THREADS=8
- LLAMA_ARG_PORT=8080
- LLAMA_ARG_HOST=0.0.0.0
- LLAMA_ARG_API_KEY=your-bearer-token
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 30s
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- llama-server
command: tunnel --no-autoupdate run --url http://llama-server:8080
Ollama (NVIDIA):

services:
ollama:
image: ollama/ollama
container_name: llmfinder-ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- ollama
command: tunnel --no-autoupdate run --url http://ollama:11434
volumes:
ollama_data:
After starting, pull a model into the Ollama container:

docker compose exec ollama ollama pull qwen2.5:7b

vLLM (NVIDIA):

services:
vllm:
image: vllm/vllm-openai:latest
container_name: llmfinder-vllm
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- hf_cache:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN:-}
command: >
--model Qwen/Qwen2.5-7B-Instruct
--port 8080
--host 0.0.0.0
--api-key your-bearer-token
--max-model-len 32768
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- vllm
command: tunnel --no-autoupdate run --url http://vllm:8080
volumes:
hf_cache:
For gated models, pass your Hugging Face token when starting:

export HF_TOKEN=hf_xxx && docker compose up -d

Start everything
docker compose up -d
Get your tunnel URL (needed for registration):
# Wait ~10 seconds for tunnel to connect, then:
docker logs llmfinder-tunnel 2>&1 | grep trycloudflare
# Take the last URL printed
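If you want the URL in a variable rather than eyeballing the logs, a small grep helper works. This is a sketch that assumes cloudflared's current quick-tunnel log format, which can change between versions:

```shell
# Pull the last trycloudflare quick-tunnel URL out of whatever text is piped in.
extract_tunnel_url() {
  grep -oE 'https://[a-z0-9-]+\.trycloudflare\.com' | tail -1
}

# Usage:
# TUNNEL_URL=$(docker logs llmfinder-tunnel 2>&1 | extract_tunnel_url)
# echo "$TUNNEL_URL"
```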
Verify the server is working:
# Health check
curl https://your-tunnel-url.trycloudflare.com/health
# Model list
curl https://your-tunnel-url.trycloudflare.com/v1/models
# Test inference
curl -X POST https://your-tunnel-url.trycloudflare.com/v1/chat/completions \
-H "Authorization: Bearer your-bearer-token" \
-H "Content-Type: application/json" \
-d '{"model":"your-model.gguf","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5}'
Useful commands: docker compose logs -f · docker compose ps · docker compose down · docker compose pull && docker compose up -d (update)

Register on LLMFinder
Once your server is running and the tunnel URL is working, register at:
Register as a Hoster →

Or via API:
curl -X POST https://api.llmfinder.net/hosters/register \
-H "Content-Type: application/json" \
-d '{
"name": "Your Name",
"email": "[email protected]",
"endpoint_url": "https://your-tunnel.trycloudflare.com",
"api_key": "your-bearer-token",
"invite_code": "YOUR_INVITE_CODE",
"models": [{
"model_id": "your-model.gguf",
"model_alias": "My Model",
"context_window": 32768
}]
}'
Choosing context size
The wizard calculates this automatically from your GGUF metadata and VRAM. For manual setup, see the full explanation:
Context Window & GPU Settings →

Quick rule of thumb: reserve ~20% of VRAM for the KV cache. A 7B model at 32k context needs ~4GB KV cache.
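To see where the rule of thumb comes from: fp16 KV-cache size is roughly 2 (K and V) × layers × context × KV heads × head dim × 2 bytes. A back-of-envelope calculation with illustrative numbers (the layer and head counts below are assumptions for a GQA 7B-class model, not read from any real file; check your model's GGUF metadata for the real values):

```shell
# Example shapes for a 7B-class model with grouped-query attention --
# illustrative only, NOT read from a real GGUF file.
n_layers=28
ctx=32768
n_kv_heads=4
head_dim=128
bytes_per_elem=2   # fp16

# 2x for the separate K and V tensors
kv_bytes=$((2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem))
echo "KV cache at ${ctx} ctx: $((kv_bytes / 1024 / 1024)) MiB"
# prints: KV cache at 32768 ctx: 1792 MiB
```

Models with more KV heads (less aggressive GQA) or longer contexts grow this figure quickly, which is why the guide budgets several GB for a 7B model at 32k.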
Security
The cloudflared tunnel URL is only known to LLMFinder (since you register it with us) โ customers can't guess or abuse it directly. Combined with a bearer token on the server, this provides solid protection.
Generate a secure bearer token:
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
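To keep the token out of docker-compose.yml itself, one common pattern is Compose's .env variable substitution. A sketch (the `LLAMA_API_TOKEN` variable name is just an example):

```shell
# Write the token to a .env file next to docker-compose.yml;
# Compose substitutes ${LLAMA_API_TOKEN} from it automatically.
TOKEN=$(python3 -c "import secrets; print(secrets.token_urlsafe(32))")
echo "LLAMA_API_TOKEN=${TOKEN}" > .env
chmod 600 .env

# Then reference it in docker-compose.yml:
#   - LLAMA_ARG_API_KEY=${LLAMA_API_TOKEN}
```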
Set this as LLAMA_ARG_API_KEY in your compose file. Never run an open endpoint.

Ready to start earning?
Your GPUs are sitting idle. Put them to work.
Register as a Hoster →
Operations Guide →