Model Selection Guide — Choose the Right AI Model Every Time¶
Choosing the right model for each task is one of the highest-leverage decisions you make with Hermes Agent. A well-chosen model saves latency, cost, and context — missteps waste all three. This model selection guide covers task-to-model mapping, cost optimization, local vs cloud tradeoffs, and fallback chain patterns.
Overview¶
Hermes Agent supports 200+ models across Ollama (local), OpenRouter, and direct provider APIs. Following best practices for model selection means using the cheapest model that reliably handles each task, reserving frontier models for complex reasoning.
How It Works¶
The Decision Framework¶
Ask three questions before every model selection:
- What does the task require? — Reasoning depth, factual precision, creative generation, or tool orchestration?
- What is the latency budget? — Real-time interaction, batch processing, or background work?
- What is the cost tolerance? — Per-request budget, daily volume, or fixed monthly spend?
Task-to-Model Mapping¶
| Task Type | Model Tier | Examples |
|---|---|---|
| Classification & Routing | Small/Lightweight | Intent detection, email triage, field extraction |
| Data Extraction & Summarization | Mid-tier | Meeting transcripts, action items, structured data |
| Complex Reasoning & Code | Frontier | Multi-step debugging, architecture decisions |
| Creative & Narrative | Style-appropriate | Brand content, brainstorming (test with audience) |
Cost Optimization Strategies¶
Caching. If 40% of your queries are variants of the same 5 intents, cache results before hitting models.
Prompt compression. Summarize conversation history before expensive calls. Strip irrelevant tool output.
Task batching. Process similar records together — many providers offer 50% discounts on batch endpoints.
Model fallback chains. Try cheapest model first → if confidence low, escalate to mid-tier → if still uncertain, invoke frontier. This "progressive enhancement" gives frontier quality on hard cases while keeping average cost low.
Local vs Cloud: The Real Tradeoffs¶
Local models (Ollama) win when: data privacy is non-negotiable, latency must be sub-50ms, high volume makes fixed hardware cost cheaper than per-token pricing. See gaming PC setup or Mac Mini setup.
Cloud models (OpenRouter/Anthropic/OpenAI) win when: you need frontier reasoning, zero infrastructure maintenance, elastic scaling. See cloud VPS setup.
Hybrid for most teams: Classify/extract with local models (free, private); reason/generate with cloud models. Configure Hermes Agent with Ollama as primary and OpenRouter as fallback.
The Fallback Chain Pattern¶
- Check cache — identical query answered recently? Return cached.
- Try primary model — send to default model for this task class
- Validate output — check structure, completeness, confidence
- Escalate if needed — retry with stronger model
- Track fallback rate — if >10%, primary model is too weak
Practical Decision Tree¶
Classification/routing? → Small model
Data extraction from known formats? → Mid-tier
Summarization of medium-length text? → Mid-tier
Multi-step reasoning or debugging? → Frontier
Creative writing or brand content? → Style-appropriate (test!)
Background batch job? → Cheapest reliable model
Latency critical (sub-second)? → Local or fastest cloud
Data sensitive/regulated? → Local or approved cloud with BAA
Benefits¶
- 3-10x cost reduction: Use small models for 80% of routine tasks
- Lower latency: Local models for classification give sub-second responses
- Privacy by default: Sensitive data stays on local models
- Frontier quality on demand: Fallback chains escalate only when needed
FAQ¶
When should I use a local model vs a cloud model?¶
Use local models (Ollama) for classification, extraction, and data-sensitive tasks — they're free and private. Use cloud models for complex reasoning, code generation, and tasks requiring frontier capabilities. Most teams benefit from a hybrid approach.
How do I reduce my API costs with Hermes Agent?¶
Implement caching for repeated queries, use prompt compression before expensive calls, batch non-urgent tasks, and create fallback chains that try cheaper models first. Track cost per successful task, not per token.
What's the best model for code generation?¶
Frontier models like Claude Sonnet 4, GPT-4o, and DeepSeek V3 excel at code generation. For boilerplate or simple functions, smaller local models like CodeLlama or DeepSeek Coder handle it efficiently at zero cost.
Related Pages¶
- Best Practices Overview — All guides
- Setup Guides — Platform-specific model configuration
- Memory Management — Context optimization for models
- MCP Server Design — Design tools for model efficiency
Revisit model choices monthly — model quality and pricing change fast.