Gaming PC Hermes Agent Setup — CUDA-Accelerated AI Workstation¶

Q: Can my gaming PC run Hermes Agent while I'm gaming?

Yes. Set OLLAMA_KEEP_ALIVE=10m so models unload from VRAM when idle. During gaming, Hermes Agent can use cloud fallback models via OpenRouter or wait until you're done.

Q: What VRAM do I need for local LLMs with Hermes Agent?

6-8GB VRAM handles 7B parameter models. 12-16GB handles 13-14B models. 24GB handles 32B models. Always pull nomic-embed-text for embeddings regardless of VRAM.

Q: Which GPU is best for Hermes Agent local inference?

Any RTX 30-series or newer GPU works well. The RTX 4090 (24GB VRAM) is ideal for running 32B+ parameter models. Multiple GPUs can be combined with NVLink for even larger models.

Got a gaming rig with an NVIDIA GPU? It's a perfect Hermes Agent host for developers and power users. Run models locally at full speed with CUDA acceleration — no per-token API costs, no latency, no rate limits. This gaming PC setup guide covers everything from NVIDIA drivers to persistent operation.

Overview¶

A gaming PC delivers the highest raw inference performance for local AI models. With 8–24GB of VRAM and CUDA acceleration, you can run models from 7B to 70B parameters locally at 50–200 tokens per second. Combined with Hermes Agent's model selection capabilities, you get a powerful AI workstation.

How It Works¶

Feature	Gaming PC Advantage
GPU memory	8–24GB VRAM (runs 7B–70B models)
Inference speed	50–200 tokens/sec with CUDA
Cost per token	$0.00 (electricity only)
Multi-GPU	Possible with NVLink
Existing hardware	You already own it

Minimum Specs¶

GPU: NVIDIA GTX 1060 (6GB) or better — RTX 3060+ recommended
RAM: 16GB system memory minimum, 32GB+ ideal
Storage: 50GB free for models and embeddings
OS: Linux (Ubuntu 22.04/24.04) or Windows 11 + WSL2

Step-by-Step Linux Setup (Recommended)¶

Step 1: NVIDIA Drivers + CUDA¶

# Ubuntu 24.04
sudo apt update
sudo apt install nvidia-driver-550 nvidia-cuda-toolkit

# Verify
nvidia-smi

Step 2: Install Ollama with GPU Acceleration¶

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2 "say hello"
# Check: nvidia-smi should show ollama using GPU memory

Step 3: Pull Models That Fit Your VRAM¶

# 6–8GB VRAM: 7B–8B models
ollama pull llama3.2          # ~4.7GB
ollama pull mistral:7b         # ~4.1GB

# 12–16GB VRAM: 13B–14B models
ollama pull qwen2.5:14b        # ~8.5GB
ollama pull phi4:14b           # ~9.1GB

# 20–24GB VRAM: 32B–34B models
ollama pull qwen2.5:32b        # ~20GB
ollama pull codestral:22b      # ~13GB

# Embeddings (always pull)
ollama pull nomic-embed-text   # ~274MB

Step 4: Install Hermes Agent¶

pip install hermes-agent

hermes profile create gpu-agent
hermes profile use gpu-agent
hermes config set model.default ollama/qwen2.5:14b
hermes config set model.fallback openrouter/anthropic/claude-sonnet-4

Step 5: Maximize GPU Performance¶

# Set GPU to persistence mode
sudo nvidia-smi -pm 1

# Ollama optimization
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_KEEP_ALIVE=10m

Step 6: Persistent Operation¶

sudo tee /etc/systemd/system/hermes-gateway.service << 'EOF'
[Unit]
Description=Hermes Agent Gateway
After=network.target

[Service]
Type=simple
User=your-username
ExecStart=/home/your-username/.local/bin/hermes gateway run
Restart=always
RestartSec=10
Environment=HOME=/home/your-username

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now hermes-gateway

Step 7: Browser Automation¶

pip install playwright
playwright install chromium
sudo apt install libnvidia-gl-550  # For headless GPU rendering

Benefits¶

Zero API costs: All inference runs locally on your GPU
Maximum speed: 50–200 tokens/sec with CUDA
Complete privacy: Models and data never leave your machine
No rate limits: Unlimited inference 24/7
Multi-GPU ready: Scale with NVLink for larger models
Flexible fallback: Add cloud models via OpenRouter for frontier capabilities

Windows + WSL2 Alternative¶

If you game on Windows, follow the WSL2 guide. GPU passthrough works — Ollama inside WSL2 sees your NVIDIA GPU via nvidia-smi.

FAQ¶

Can my gaming PC run Hermes Agent while I'm gaming?¶

Yes. Set OLLAMA_KEEP_ALIVE=10m so models unload from VRAM when idle. During gaming, Hermes Agent can use cloud fallback models or wait until you're done.

What VRAM do I need for local LLMs?¶

6–8GB VRAM handles 7B models. 12–16GB handles 13–14B models. 24GB handles 32B models. See our model selection guide for detailed recommendations.

Which GPU is best for Hermes Agent?¶

Any RTX 30-series or newer works well. The RTX 4090 (24GB VRAM) is ideal for running 32B+ models. Multiple GPUs can be combined with NVLink.

Hermes Agent Setup Overview — Compare all platforms
Windows WSL2 Setup — GPU passthrough on Windows
Mac Mini M4 Setup — Silent alternative
Model Selection Guide — Choose the right model
Troubleshooting Guide — Common GPU issues

Next: Cloud VPS Setup · Model Selection