Reflexion — Self-Improving Agent Patterns¶

Reflexion is the pattern of having an agent evaluate its own output, detect issues, and adjust its approach before delivering a final result. Instead of a human catching mistakes, the agent catches them itself — then fixes them.

This guide covers production Reflexion patterns with Hermes: evaluation loop design, error detection strategies, strategy adjustment mechanisms, and concrete examples you can deploy.

The Core Loop¶

Every Reflexion implementation follows the same structure:

Task → Execute → Evaluate → Pass? → Return Result
                    ↑           │
                    │    No     │
                    └─ Adjust ──┘

Execute: The agent attempts the task. Evaluate: A separate evaluation step (often a different model or prompt) judges the output. Pass? If the output meets quality criteria, return it. If not, identify what's wrong. Adjust: Modify the strategy based on the evaluation feedback — then execute again.

The key insight: evaluation is cheaper than execution, and correction is cheaper than starting over. A $0.01 evaluation that catches a mistake saves the cost of a bad output plus the cost of human review.

Pattern 1: Output Quality Reflexion¶

The most common pattern. After generating output, evaluate it against criteria, and refine if needed.

Implementation¶

def reflexion_loop(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """
    Execute a task with Reflexion-based quality improvement.

    Args:
        task: The task description
        criteria: List of quality criteria to evaluate against
        max_iterations: Maximum refinement attempts
    """
    strategy_notes = ""  # Accumulated improvement notes

    for iteration in range(max_iterations):
        # Step 1: Execute with current strategy
        output = hermes_llm.complete(
            messages=[
                {"role": "system", "content": f"Complete this task: {task}\n\nStrategy notes from previous attempts:\n{strategy_notes}"},
                {"role": "user", "content": task}
            ],
            tier="standard"
        )

        # Step 2: Evaluate against criteria
        evaluation = hermes_llm.complete(
            messages=[
                {"role": "system", "content": f"""Evaluate this output against the following criteria.
                For each criterion, score PASS or FAIL. If FAIL, explain why.

                Criteria:
                {chr(10).join(f'- {c}' for c in criteria)}

                Output to evaluate:
                {output}"""},
            ],
            tier="lightweight"  # Evaluation is cheap
        )

        # Step 3: Check if all criteria passed
        if "FAIL" not in evaluation:
            return output  # Success!

        # Step 4: Extract improvement notes for next attempt
        strategy_notes += f"\nIteration {iteration + 1} feedback: {evaluation}"

    # Return best attempt if max iterations reached
    return output  # With a warning flag

Criteria Examples¶

For blog posts: - Contains a clear hook in the first paragraph - Supports claims with specific examples or data - Includes a call-to-action at the end - Appropriate reading level for target audience (8th-10th grade) - No grammatical errors or typos

For code generation: - Code runs without syntax errors - Includes error handling for edge cases - Follows project conventions (naming, structure) - Contains comments for non-obvious logic - Passes the provided test cases

For data analysis: - All claims are backed by the data shown - Calculations are explained step by step - Limitations or caveats are explicitly stated - Conclusions don't overstate the evidence - Format is appropriate for the audience (executive vs. technical)

Production Config¶

reflexion:
  quality:
    evaluator_model_tier: "lightweight"  # Cheap evaluation
    executor_model_tier: "standard"      # Main work
    max_iterations: 3
    early_exit_threshold: 0.85           # If 85%+ criteria pass, accept
    criteria_weights:                    # Not all criteria are equal
      accuracy: 3
      completeness: 2
      formatting: 1

Pattern 2: Factual Verification Reflexion¶

For outputs where factual accuracy is critical — claims must be verified against sources.

def factual_reflexion(claim: str, sources: list[str]) -> dict:
    """
    Verify claims against sources and refine if necessary.
    Returns: {"output": str, "verified": bool, "issues": list}
    """
    for iteration in range(3):
        # Step 1: Generate output with source citations
        output = hermes_llm.complete(
            messages=[
                {"role": "system", "content": "Generate analysis citing specific sources for each claim. Format: [Claim] (Source: [citation])"},
                {"role": "user", "content": f"Analyze: {claim}\nSources: {sources}"}
            ],
            tier="premium"
        )

        # Step 2: Cross-reference each claim against sources
        verification = hermes_llm.complete(
            messages=[
                {"role": "system", "content": """For each claim in the output, verify it against the original sources.
                Return JSON with 'verified' (bool) and 'issues' (array of problematic claims).

                Sources:
                """ + "\n---\n".join(sources)},
                {"role": "user", "content": output}
            ],
            tier="standard"
        )

        verification_data = json.loads(verification)

        if verification_data["verified"]:
            return {"output": output, "verified": True, "issues": []}

        # Step 3: Feed issues back for correction
        sources = sources + [f"CORRECTION NEEDED: {verification_data['issues']}"]

    return {"output": output, "verified": False, "issues": verification_data.get("issues", [])}

When to Use¶

Writing reports that cite specific data points
Generating documentation from API specs
Creating content that references statistics or studies
Any output where being wrong has real consequences

Cost-Benefit¶

Factual verification adds 1-2 additional LLM calls per iteration. For high-stakes content, this is negligible compared to the cost of publishing incorrect information.

Pattern 3: Strategy Adjustment Reflexion¶

Instead of just fixing the output, the agent changes how it approaches the task.

class StrategyReflexion:
    """
    The agent maintains a strategy stack. On failure, it doesn't just
    tweak the output — it changes the strategy.
    """

    STRATEGIES = [
        "direct_answer",        # Answer directly from knowledge
        "research_first",       # Research before answering
        "decompose",            # Break into sub-tasks
        "analogy",              # Use analogies to explain
        "step_by_step",         # Chain-of-thought reasoning
        "examples_driven",      # Lead with concrete examples
        "counterfactual",       # Explore what-if scenarios
    ]

    def execute(self, task: str) -> str:
        attempted_strategies = []
        best_output = None
        best_score = 0

        for iteration in range(5):
            # Select strategy based on what hasn't worked
            strategy = self._select_strategy(task, attempted_strategies)

            # Execute with strategy prompt
            output = hermes_llm.complete(
                messages=[
                    {"role": "system", "content": f"Strategy: {strategy}\n\n"
                     f"Previously attempted strategies and their issues:\n"
                     f"{self._format_history(attempted_strategies)}\n\n"
                     f"Try a different approach this time."},
                    {"role": "user", "content": task}
                ],
                tier="premium"
            )

            # Evaluate
            score, feedback = self._evaluate(task, output)
            attempted_strategies.append({
                "strategy": strategy,
                "score": score,
                "feedback": feedback
            })

            if score > best_score:
                best_score = score
                best_output = output

            if score >= 0.9:
                return output

        return best_output

Strategy Selection Logic¶

def _select_strategy(self, task: str, history: list) -> str:
    tried = {h["strategy"] for h in history}

    # Analyze the task to determine appropriate strategies
    task_analysis = hermes_llm.complete(
        messages=[
            {"role": "system", "content": "Classify this task: simple_factual, complex_reasoning, creative, analytical, instructional"},
            {"role": "user", "content": task}
        ],
        tier="lightweight"
    )

    # Map task types to preferred strategy order
    strategy_order = {
        "simple_factual": ["direct_answer", "research_first"],
        "complex_reasoning": ["step_by_step", "decompose", "analogy"],
        "creative": ["examples_driven", "analogy", "counterfactual"],
        "analytical": ["decompose", "step_by_step", "research_first"],
        "instructional": ["step_by_step", "examples_driven", "analogy"],
    }

    for strategy in strategy_order.get(task_analysis, self.STRATEGIES):
        if strategy not in tried:
            return strategy

    return self.STRATEGIES[len(tried) % len(self.STRATEGIES)]  # Cycle

Pattern 4: Multi-Evaluator Reflexion¶

Use multiple evaluators with different perspectives for robust quality assessment:

EVALUATORS = {
    "accuracy": "Evaluate factual accuracy. Are all claims correct and supported?",
    "clarity": "Evaluate clarity and readability. Would the target audience understand this?",
    "completeness": "Evaluate completeness. Does this fully answer the question?",
    "bias": "Evaluate for bias or one-sidedness. Are multiple perspectives fairly represented?",
    "actionability": "Evaluate actionability. Can the reader act on this information?",
}

def multi_evaluator_reflexion(task: str, active_evaluators: list[str]) -> str:
    """Run Reflexion with multiple evaluator perspectives."""
    for iteration in range(3):
        output = hermes_llm.complete(
            messages=[{"role": "user", "content": task}],
            tier="standard"
        )

        all_pass = True
        feedback = []

        for evaluator_name in active_evaluators:
            eval_prompt = EVALUATORS[evaluator_name]
            result = hermes_llm.complete(
                messages=[
                    {"role": "system", "content": f"{eval_prompt}\nScore PASS or FAIL with explanation."},
                    {"role": "user", "content": output}
                ],
                tier="lightweight"
            )

            if "FAIL" in result:
                all_pass = False
                feedback.append(f"[{evaluator_name}] {result}")

        if all_pass:
            return output

        task = f"{task}\n\nPrevious attempt feedback:\n" + "\n".join(feedback)

    return output

Evaluator Selection by Task Type¶

Task Type	Required Evaluators	Optional
Technical documentation	accuracy, clarity, completeness	-
Marketing copy	clarity, actionability, bias	accuracy
Analysis/report	accuracy, completeness, bias	clarity
Tutorial/how-to	clarity, completeness, actionability	-
Opinion/thought leadership	clarity, bias	accuracy, completeness

Production Deployment with Hermes¶

Configuration¶

# hermes/config/reflexion.yaml
reflexion:
  default_max_iterations: 3
  cost_ceiling_per_task: 0.50  # USD — hard stop

  evaluators:
    default_set: [accuracy, clarity, completeness]

  model_mapping:
    execution:
      creative: "standard"
      analytical: "premium"
      simple: "lightweight"
    evaluation:
      default: "lightweight"  # Always use cheapest capable model

  failure_handling:
    max_iterations_reached: "return_with_warning"  # or "raise_error"
    cost_ceiling_reached: "return_best_so_far"

  logging:
    trace_all_iterations: true
    store_intermediate_outputs: true  # For debugging

Hermes Cron Integration¶

Schedule Reflexion-based quality checks:

# hermes/cron/content_quality_check.yaml
name: content_quality_reflexion
schedule: "0 8,14,20 * * *"  # Three times daily
task: reflexion.quality_check.recent_content
params:
  lookback_hours: 6
  evaluators: [accuracy, clarity, bias]
  auto_fix: true  # Automatically apply corrections
timeout: 600
notify_on: ["failure"]

Monitoring Reflexion Performance¶

Track these metrics to optimize your Reflexion setup:

Pass rate on first attempt: High (>70%) means your initial prompts are good. Low means you're over-relying on Reflexion.
Average iterations to pass: Should be under 2. Higher means your evaluation criteria might be too strict or your initial execution is weak.
Cost per task (with vs. without Reflexion): Reflexion adds 30-80% to task cost but typically improves quality by 40-60%.
False pass rate: Spot-check outputs that passed evaluation. If >5% have errors, your evaluation criteria or model are insufficient.

Common Pitfalls¶

Over-reliance on Reflexion: If every task needs 3 iterations, fix your initial prompts. Reflexion is a safety net, not a crutch.

Too many criteria: Start with 3 criteria. More than 5 creates evaluation noise and slows convergence.

Same model for both: Don't use the same model for execution and evaluation — it's grading its own homework. Use a lightweight model for evaluation.

No cost ceiling: Always set a cost_ceiling_per_task. Reflexion loops without a ceiling can burn through budget.

Infinite loops with unclear criteria: "Make it better" isn't evaluable. Criteria must be specific and falsifiable: "Score PASS or FAIL."

Decision Tree¶

Need quality improvement on outputs?
├─ Yes, accuracy is critical → Factual Verification Reflexion
├─ Yes, general quality matters → Output Quality Reflexion
├─ Yes, approach keeps failing → Strategy Adjustment Reflexion
├─ Yes, for high-stakes content → Multi-Evaluator Reflexion
└─ No → Direct execution, no Reflexion overhead

Next: LangGraph Integration · CrewAI Integration · Architecture