
Hoster Setup Guide

Everything you need to get your GPU server running and registered with LLMFinder. Takes about 15 minutes.

🚧 LLMFinder is currently in beta. Access is invite-only. Request an invite →
⚡ Quick Setup (recommended)

The setup wizard handles everything: installs Docker, detects your GPU, downloads a model, writes a docker-compose.yml, starts your server with a Cloudflare tunnel, and registers you, all in one run.

curl -O https://llmfinder.net/llmfinder-hoster.py && python3 llmfinder-hoster.py
💡 You'll need an invite code: request one here.

What the wizard does (v4.0):

  • Installs Docker if needed
  • Detects your GPU (NVIDIA/AMD/Intel/CPU) and picks the right image
  • Shows model suggestions with size + download time estimates
  • Downloads your chosen model to ~/llmfinder-models/
  • Calculates optimal context size from GGUF metadata + available VRAM
  • Writes a docker-compose.yml with llama-server + cloudflared tunnel
  • Checks for port conflicts and stale processes before launching
  • Starts everything with docker compose up -d
  • Reads tunnel URL from Docker logs, runs health + inference checks
  • Registers your server and issues your API key
โš ๏ธ Beta access required. LLMFinder is currently invite-only. Request access โ†’.

Manual Setup

If you prefer to configure things yourself, follow these steps.

1. Install Docker

All LLMFinder server backends run via Docker Compose: no native installs, no PATH issues, no missing .so files.

# Linux (Ubuntu/Debian)
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker --version
docker compose version
💡 macOS/Windows: Install Docker Desktop.

For NVIDIA GPU support, also install the container toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
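To confirm the toolkit is wired up before going further, you can run a throwaway container with GPU access; this is the verification step NVIDIA's own install guide suggests:

```shell
# Should print the familiar nvidia-smi GPU table; if it errors,
# recheck the toolkit install and the docker restart above
docker run --rm --gpus all ubuntu nvidia-smi
```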
2. Download a model

Models are stored on the host at ~/llmfinder-models/ and mounted into the container โ€” no re-download needed when updating the image.

mkdir -p ~/llmfinder-models

# Download a GGUF model (example: Qwen 2.5 7B Q4)
curl -L "https://huggingface.co/bartowski/Qwen_Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
  -o ~/llmfinder-models/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf
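Before moving on, it's worth sanity-checking the download: every valid GGUF file begins with the four-byte ASCII magic `GGUF`, so a quick `head` catches truncated downloads or HTML error pages saved under a .gguf name (the filename below just matches the example above):

```shell
# A valid GGUF file starts with the ASCII magic "GGUF"
head -c 4 ~/llmfinder-models/Qwen_Qwen2.5-7B-Instruct-Q4_K_M.gguf && echo
# Compare the size against the repo listing (~4.5 GB for this quant)
ls -lh ~/llmfinder-models/
```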
💡 Browse GGUF model IDs: huggingface.co → Text Generation → GGUF; the model ID is the .gguf filename.

Model size guide:

VRAM      Recommended model        Context
8 GB      7B Q4_K_M (~4.5 GB)      8k–16k
12 GB     9B Q6_K (~7 GB)          16k–32k
16 GB     13B Q8 or 9B FP16        32k
24 GB     26B Q4 (~15 GB)          32k–64k
48 GB+    70B Q4                   full native ctx
3. Create docker-compose.yml

Pick the variant below for your GPU or backend. All variants include a cloudflared tunnel service.

🟢 NVIDIA

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: llmfinder-node
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - ~/llmfinder-models:/models:ro
    environment:
      - LLAMA_ARG_MODEL=/models/your-model.gguf
      - LLAMA_ARG_CTX_SIZE=32768
      - LLAMA_ARG_N_GPU_LAYERS=99
      - LLAMA_ARG_PORT=8080
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_API_KEY=your-bearer-token
      - LLAMA_ARG_CONT_BATCHING=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: llmfinder-tunnel
    restart: unless-stopped
    depends_on:
      - llama-server
    command: tunnel --no-autoupdate run --url http://llama-server:8080

🔴 AMD

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-rocm
    container_name: llmfinder-node
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - ~/llmfinder-models:/models:ro
    environment:
      - LLAMA_ARG_MODEL=/models/your-model.gguf
      - LLAMA_ARG_CTX_SIZE=32768
      - LLAMA_ARG_N_GPU_LAYERS=99
      - LLAMA_ARG_PORT=8080
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_API_KEY=your-bearer-token
      - LLAMA_ARG_CONT_BATCHING=1
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: llmfinder-tunnel
    restart: unless-stopped
    depends_on:
      - llama-server
    command: tunnel --no-autoupdate run --url http://llama-server:8080

💻 CPU

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    container_name: llmfinder-node
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - ~/llmfinder-models:/models:ro
    environment:
      - LLAMA_ARG_MODEL=/models/your-model.gguf
      - LLAMA_ARG_CTX_SIZE=8192
      - LLAMA_ARG_N_GPU_LAYERS=0
      - LLAMA_ARG_THREADS=8
      - LLAMA_ARG_PORT=8080
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_API_KEY=your-bearer-token
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: llmfinder-tunnel
    restart: unless-stopped
    depends_on:
      - llama-server
    command: tunnel --no-autoupdate run --url http://llama-server:8080

🦙 Ollama

services:
  ollama:
    image: ollama/ollama
    container_name: llmfinder-ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: llmfinder-tunnel
    restart: unless-stopped
    depends_on:
      - ollama
    command: tunnel --no-autoupdate run --url http://ollama:11434

volumes:
  ollama_data:
💡 After starting, pull your model: docker compose exec ollama ollama pull qwen2.5:7b

⚡ vLLM

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: llmfinder-vllm
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN:-}
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --port 8080
      --host 0.0.0.0
      --api-key your-bearer-token
      --max-model-len 32768
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: llmfinder-tunnel
    restart: unless-stopped
    depends_on:
      - vllm
    command: tunnel --no-autoupdate run --url http://vllm:8080

volumes:
  hf_cache:
💡 For gated models: export HF_TOKEN=hf_xxx && docker compose up -d
4. Start everything

docker compose up -d

Get your tunnel URL (needed for registration):

# Wait ~10 seconds for tunnel to connect, then:
docker logs llmfinder-tunnel 2>&1 | grep trycloudflare
# Take the last URL printed
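If you'd rather not copy the URL by hand, a small grep can capture it into a shell variable. This assumes the quick tunnel prints a standard *.trycloudflare.com URL in its logs, which matches the grep in the step above:

```shell
# Grab the most recent trycloudflare URL from the tunnel logs
TUNNEL_URL=$(docker logs llmfinder-tunnel 2>&1 \
  | grep -o 'https://[a-z0-9-]*\.trycloudflare\.com' | tail -1)
echo "$TUNNEL_URL"
```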

Verify the server is working:

# Health check
curl https://your-tunnel-url.trycloudflare.com/health

# Model list
curl https://your-tunnel-url.trycloudflare.com/v1/models

# Test inference
curl -X POST https://your-tunnel-url.trycloudflare.com/v1/chat/completions \
  -H "Authorization: Bearer your-bearer-token" \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model.gguf","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5}'
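To see just the model's reply instead of the full JSON, you can pipe the response through a short python3 one-liner. This assumes the standard OpenAI-style chat/completions response shape, which llama-server's /v1 endpoint follows:

```shell
# Same test request as above, but printing only the assistant's message text
curl -sX POST https://your-tunnel-url.trycloudflare.com/v1/chat/completions \
  -H "Authorization: Bearer your-bearer-token" \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model.gguf","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
```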
💡 Useful commands: docker compose logs -f · docker compose ps · docker compose down · docker compose pull && docker compose up -d (update)
5. Register on LLMFinder

Once your server is running and the tunnel URL is working, register at:

Register as a Hoster →

Or via API:

curl -X POST https://api.llmfinder.net/hosters/register \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Your Name",
    "email": "[email protected]",
    "endpoint_url": "https://your-tunnel.trycloudflare.com",
    "api_key": "your-bearer-token",
    "invite_code": "YOUR_INVITE_CODE",
    "models": [{
      "model_id": "your-model.gguf",
      "model_alias": "My Model",
      "context_window": 32768
    }]
  }'
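The registration response is JSON, so piping it through python3 -m json.tool makes the returned fields, including your issued API key, easier to read. The exact response schema isn't documented here, so inspect the output; the register.json filename below is just an arbitrary file holding the payload from the example above:

```shell
# Save the JSON payload from the registration example above as register.json
# (the filename is arbitrary), then pretty-print the response:
curl -sX POST https://api.llmfinder.net/hosters/register \
  -H "Content-Type: application/json" \
  -d @register.json \
  | python3 -m json.tool
```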
📝 Choosing context size

The wizard calculates this automatically from your GGUF metadata and VRAM. For manual setup, see the full explanation:

Context Window & GPU Settings →

Quick rule of thumb: reserve ~20% of VRAM for the KV cache. A 7B model at 32k context needs ~4GB KV cache.
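The actual KV-cache size depends on the model's attention layout, and you can estimate it from GGUF metadata: fp16 KV cache bytes ≈ 2 (K and V) × layers × context × KV heads × head_dim × 2 bytes per element. The numbers below are Qwen2.5-7B-style grouped-query-attention values (28 layers, 4 KV heads, head_dim 128) used purely as an example; models without GQA need several times more:

```shell
# fp16 KV cache = 2 (K+V) * n_layers * ctx * n_kv_heads * head_dim * 2 bytes/elem
# Example GQA values below -- substitute the real ones from your GGUF metadata
python3 -c "print(2 * 28 * 32768 * 4 * 128 * 2 / 2**30, 'GiB')"   # -> 1.75 GiB
```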

🔒 Security

The cloudflared tunnel URL is only known to LLMFinder (since you register it with us), so customers can't guess or abuse it directly. Combined with a bearer token on the server, this provides solid protection.

Generate a secure bearer token:

python3 -c "import secrets; print(secrets.token_urlsafe(32))"
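If Python isn't available on the box, OpenSSL produces an equivalently strong token:

```shell
# 32 random bytes, base64-encoded (44 characters)
openssl rand -base64 32
```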
โš ๏ธ Always set LLAMA_ARG_API_KEY in your compose file. Never run an open endpoint.

Ready to start earning?

Your GPUs are sitting idle. Put them to work.

Register as a Hoster → · Operations Guide →