Hoster Setup Guide
Everything you need to get your GPU server running and registered with LLMFinder. Takes about 15 minutes.
Quick Setup (recommended)
The setup wizard handles everything: installs Docker, detects your GPU, downloads a model, writes a docker-compose.yml, starts your server with a Cloudflare tunnel, and registers you, all in one run.
curl -O https://llmfinder.net/llmfinder-hoster.py && python3 llmfinder-hoster.py
What the wizard does (v4.0):
- Installs Docker if needed
- Detects your GPU (NVIDIA/AMD/Intel/CPU) and picks the right image
- Shows model suggestions with size + download time estimates
- Downloads your chosen model to `~/llmfinder-models/`
- Calculates optimal context size from GGUF metadata + available VRAM
- Writes a `docker-compose.yml` with llama-server + cloudflared tunnel
- Checks for port conflicts and stale processes before launching
- Starts everything with `docker compose up -d`
- Reads the tunnel URL from Docker logs, runs health + inference checks
- Registers your server and issues your API key
Manual Setup
If you prefer to configure things yourself, follow these steps.
Install Docker
All LLMFinder server backends run via Docker Compose: no native installs, no PATH issues, no missing .so files.
# Linux (Ubuntu/Debian)
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
# Verify
docker --version
docker compose version
For NVIDIA GPU support, also install the container toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Download a model
Models are stored on the host at ~/llmfinder-models/ and mounted into the container, so no re-download is needed when updating the image.
mkdir -p ~/llmfinder-models
# Download a GGUF model (example: Qwen 2.5 7B Q4)
curl -L "https://huggingface.co/bartowski/Qwen_Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
-o ~/llmfinder-models/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf
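A quick way to catch a truncated or HTML-error download: every valid GGUF file starts with the 4-byte ASCII magic `GGUF`. A minimal check (the `is_gguf` helper is just an illustration, not part of any toolchain):

```shell
# Every valid GGUF file begins with the ASCII magic "GGUF";
# a failed download (e.g. a saved HTML error page) will not.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example:
# is_gguf ~/llmfinder-models/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf && echo OK || echo "bad download"
```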
Note the exact .gguf filename; you will reference it in your compose file.

Model size guide:
| VRAM | Recommended model | Context |
|---|---|---|
| 8 GB | 7B Q4_K_M (~4.5 GB) | 8k–16k |
| 12 GB | 9B Q6_K (~7 GB) | 16k–32k |
| 16 GB | 13B Q8 or 9B FP16 | 32k |
| 24 GB | 26B Q4 (~15 GB) | 32k–64k |
| 48 GB+ | 70B Q4 | Full native ctx |
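If you script your setup, the table above can be expressed as a tiny helper. This is just a sketch mirroring the tiers listed; the function name and thresholds are illustrative, not part of the wizard:

```shell
# Map available VRAM (in whole GB) to the model tier from the table above.
recommend_model() {
  local vram_gb=$1
  if   [ "$vram_gb" -ge 48 ]; then echo "70B Q4"
  elif [ "$vram_gb" -ge 24 ]; then echo "26B Q4"
  elif [ "$vram_gb" -ge 16 ]; then echo "13B Q8 or 9B FP16"
  elif [ "$vram_gb" -ge 12 ]; then echo "9B Q6_K"
  else                             echo "7B Q4_K_M"
  fi
}

# On an NVIDIA box you could feed it a real number, e.g.:
# recommend_model "$(( $(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1) / 1024 ))"
```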
Create docker-compose.yml
Pick the variant below that matches your hardware. All of them include a cloudflared tunnel service.

NVIDIA (CUDA):
services:
llama-server:
image: ghcr.io/ggml-org/llama.cpp:server-cuda
container_name: llmfinder-node
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ~/llmfinder-models:/models:ro
environment:
- LLAMA_ARG_MODEL=/models/your-model.gguf
- LLAMA_ARG_CTX_SIZE=32768
- LLAMA_ARG_N_GPU_LAYERS=99
- LLAMA_ARG_PORT=8080
- LLAMA_ARG_HOST=0.0.0.0
- LLAMA_ARG_API_KEY=your-bearer-token
- LLAMA_ARG_CONT_BATCHING=1
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- llama-server
command: tunnel --no-autoupdate run --url http://llama-server:8080
AMD (ROCm):

services:
llama-server:
image: ghcr.io/ggml-org/llama.cpp:server-rocm
container_name: llmfinder-node
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ~/llmfinder-models:/models:ro
environment:
- LLAMA_ARG_MODEL=/models/your-model.gguf
- LLAMA_ARG_CTX_SIZE=32768
- LLAMA_ARG_N_GPU_LAYERS=99
- LLAMA_ARG_PORT=8080
- LLAMA_ARG_HOST=0.0.0.0
- LLAMA_ARG_API_KEY=your-bearer-token
- LLAMA_ARG_CONT_BATCHING=1
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
group_add:
- video
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- llama-server
command: tunnel --no-autoupdate run --url http://llama-server:8080
CPU only:

services:
llama-server:
image: ghcr.io/ggml-org/llama.cpp:server
container_name: llmfinder-node
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ~/llmfinder-models:/models:ro
environment:
- LLAMA_ARG_MODEL=/models/your-model.gguf
- LLAMA_ARG_CTX_SIZE=8192
- LLAMA_ARG_N_GPU_LAYERS=0
- LLAMA_ARG_THREADS=8
- LLAMA_ARG_PORT=8080
- LLAMA_ARG_HOST=0.0.0.0
- LLAMA_ARG_API_KEY=your-bearer-token
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 30s
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- llama-server
command: tunnel --no-autoupdate run --url http://llama-server:8080
Ollama (NVIDIA):

services:
ollama:
image: ollama/ollama
container_name: llmfinder-ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- ollama
command: tunnel --no-autoupdate run --url http://ollama:11434
volumes:
ollama_data:
After starting, pull a model into the Ollama container:

docker compose exec ollama ollama pull qwen2.5:7b

vLLM (NVIDIA):

services:
vllm:
image: vllm/vllm-openai:latest
container_name: llmfinder-vllm
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- hf_cache:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN:-}
command: >
--model Qwen/Qwen2.5-7B-Instruct
--port 8080
--host 0.0.0.0
--api-key your-bearer-token
--max-model-len 32768
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
cloudflared:
image: cloudflare/cloudflared:latest
container_name: llmfinder-tunnel
restart: unless-stopped
depends_on:
- vllm
command: tunnel --no-autoupdate run --url http://vllm:8080
volumes:
hf_cache:
For gated models, pass your Hugging Face token when starting:

export HF_TOKEN=hf_xxx && docker compose up -d

Start everything
docker compose up -d
Get your tunnel URL (needed for registration):
# Wait ~10 seconds for tunnel to connect, then:
docker logs llmfinder-tunnel 2>&1 | grep trycloudflare
# Take the last URL printed
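If you want the URL in a variable rather than eyeballing the logs, a small grep helper works. This is a sketch that assumes cloudflared's current quick-tunnel log format, which can change between versions:

```shell
# Pull the last trycloudflare quick-tunnel URL out of whatever text is piped in.
extract_tunnel_url() {
  grep -oE 'https://[a-z0-9-]+\.trycloudflare\.com' | tail -1
}

# Usage:
# TUNNEL_URL=$(docker logs llmfinder-tunnel 2>&1 | extract_tunnel_url)
# echo "$TUNNEL_URL"
```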
Verify the server is working:
# Health check
curl https://your-tunnel-url.trycloudflare.com/health
# Model list
curl https://your-tunnel-url.trycloudflare.com/v1/models
# Test inference
curl -X POST https://your-tunnel-url.trycloudflare.com/v1/chat/completions \
-H "Authorization: Bearer your-bearer-token" \
-H "Content-Type: application/json" \
-d '{"model":"your-model.gguf","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5}'
Useful commands: docker compose logs -f · docker compose ps · docker compose down · docker compose pull && docker compose up -d (update)

Register on LLMFinder
Once your server is running and the tunnel URL is working, register at:
Register as a Hoster →

Or via API:
curl -X POST https://api.llmfinder.net/hosters/register \
-H "Content-Type: application/json" \
-d '{
"name": "Your Name",
"email": "[email protected]",
"endpoint_url": "https://your-tunnel.trycloudflare.com",
"api_key": "your-bearer-token",
"invite_code": "YOUR_INVITE_CODE",
"models": [{
"model_id": "your-model.gguf",
"model_alias": "My Model",
"context_window": 32768
}]
}'
Choosing context size
The wizard calculates this automatically from your GGUF metadata and VRAM. For manual setup, see the full explanation:
Context Window & GPU Settings →

Quick rule of thumb: reserve ~20% of VRAM for the KV cache. A 7B model at 32k context needs ~4GB KV cache.
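To see where the rule of thumb comes from: fp16 KV-cache size is roughly 2 (K and V) × layers × context × KV heads × head dim × 2 bytes. A back-of-envelope calculation with illustrative numbers (the layer and head counts below are assumptions for a GQA 7B-class model, not read from any real file; check your model's GGUF metadata for the real values):

```shell
# Example shapes for a 7B-class model with grouped-query attention --
# illustrative only, NOT read from a real GGUF file.
n_layers=28
ctx=32768
n_kv_heads=4
head_dim=128
bytes_per_elem=2   # fp16

# 2x for the separate K and V tensors
kv_bytes=$((2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem))
echo "KV cache at ${ctx} ctx: $((kv_bytes / 1024 / 1024)) MiB"
# prints: KV cache at 32768 ctx: 1792 MiB
```

Models with more KV heads (less aggressive GQA) or longer contexts grow this figure quickly, which is why the guide budgets several GB for a 7B model at 32k.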
Security
The cloudflared tunnel URL is only known to LLMFinder (since you register it with us) โ customers can't guess or abuse it directly. Combined with a bearer token on the server, this provides solid protection.
Generate a secure bearer token:
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
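To keep the token out of docker-compose.yml itself, one common pattern is Compose's .env variable substitution. A sketch (the `LLAMA_API_TOKEN` variable name is just an example):

```shell
# Write the token to a .env file next to docker-compose.yml;
# Compose substitutes ${LLAMA_API_TOKEN} from it automatically.
TOKEN=$(python3 -c "import secrets; print(secrets.token_urlsafe(32))")
echo "LLAMA_API_TOKEN=${TOKEN}" > .env
chmod 600 .env

# Then reference it in docker-compose.yml:
#   - LLAMA_ARG_API_KEY=${LLAMA_API_TOKEN}
```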
Set this as LLAMA_ARG_API_KEY in your compose file. Never run an open endpoint.

Ready to start earning?
Your GPUs are sitting idle. Put them to work.
Register as a Hoster →
Operations Guide →