Reflexion — Self-Improving Agent Patterns

Reflexion is the pattern of having an agent evaluate its own output, detect issues, and adjust its approach before delivering a final result. Instead of a human catching mistakes, the agent catches them itself — then fixes them.

This guide covers production Reflexion patterns with Hermes: evaluation loop design, error detection strategies, strategy adjustment mechanisms, and concrete examples you can deploy.

The Core Loop

Every Reflexion implementation follows the same structure:

Task → Execute → Evaluate → Pass? → Return Result
          ↑                   │
          │                   │ No
          └───── Adjust ──────┘

  • Execute: The agent attempts the task.
  • Evaluate: A separate evaluation step (often a different model or prompt) judges the output.
  • Pass?: If the output meets the quality criteria, return it. If not, identify what's wrong.
  • Adjust: Modify the strategy based on the evaluation feedback, then execute again.

The key insight: evaluation is cheaper than execution, and correction is cheaper than starting over. A $0.01 evaluation that catches a mistake saves the cost of a bad output plus the cost of human review.
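A rough back-of-the-envelope model makes the trade-off concrete. This is a sketch, not a pricing model: apart from the $0.01 evaluation cost mentioned above, every number here (execution cost, failure rate, evaluator catch rate, human review cost) is a made-up assumption, and it assumes a single retry that succeeds.

```python
def cost_without_reflexion(exec_cost: float, failure_rate: float, review_cost: float) -> float:
    """Expected cost per task when every failure ends up in human review."""
    return exec_cost + failure_rate * review_cost


def cost_with_reflexion(
    exec_cost: float,
    eval_cost: float,
    failure_rate: float,
    catch_rate: float,
    review_cost: float,
) -> float:
    """Expected cost per task with one evaluate-and-retry round."""
    caught = failure_rate * catch_rate        # failures the evaluator flags
    missed = failure_rate * (1 - catch_rate)  # failures that still reach a human
    retry = exec_cost + eval_cost             # one more execute + evaluate pass
    return exec_cost + eval_cost + caught * retry + missed * review_cost


# Hypothetical inputs: $0.10 execution, $0.01 evaluation, 30% failure rate,
# evaluator catches 80% of failures, $5.00 per human review.
baseline = cost_without_reflexion(0.10, 0.30, 5.00)
with_loop = cost_with_reflexion(0.10, 0.01, 0.30, 0.80, 5.00)
print(f"without: ${baseline:.2f}  with: ${with_loop:.2f}")
```

Under these assumptions the loop pays for itself because the expensive term is human review, not the extra model calls. Plug in your own rates before drawing conclusions.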

Pattern 1: Output Quality Reflexion

The most common pattern. After generating output, evaluate it against criteria, and refine if needed.

Implementation

def reflexion_loop(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """
    Execute a task with Reflexion-based quality improvement.

    Args:
        task: The task description
        criteria: List of quality criteria to evaluate against
        max_iterations: Maximum refinement attempts
    """
    strategy_notes = ""  # Accumulated improvement notes

    for iteration in range(max_iterations):
        # Step 1: Execute with current strategy
        output = hermes_llm.complete(
            messages=[
                {"role": "system", "content": f"Complete this task: {task}\n\nStrategy notes from previous attempts:\n{strategy_notes}"},
                {"role": "user", "content": task}
            ],
            tier="standard"
        )

        # Step 2: Evaluate against criteria
        evaluation = hermes_llm.complete(
            messages=[
                {"role": "system", "content": f"""Evaluate this output against the following criteria.
                For each criterion, score PASS or FAIL. If FAIL, explain why.

                Criteria:
                {chr(10).join(f'- {c}' for c in criteria)}

                Output to evaluate:
                {output}"""},
            ],
            tier="lightweight"  # Evaluation is cheap
        )

        # Step 3: Accept only if the evaluator reported verdicts and none is FAIL
        if "PASS" in evaluation and "FAIL" not in evaluation:
            return output

        # Step 4: Carry the evaluator's feedback into the next attempt
        strategy_notes += f"\nIteration {iteration + 1} feedback: {evaluation}"

    # Max iterations reached: this is the last attempt, not necessarily the best.
    # Callers that need a warning flag should return it alongside the output.
    return output
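Checking for the substring "FAIL" is fragile: an evaluator reply that contains no verdicts at all would count as a pass. One possible hardening, sketched below, is to count explicit verdict tokens and require one PASS per criterion. The helper names (`parse_verdicts`, `all_passed`) are hypothetical, and the approach assumes the evaluator writes the verdicts in upper case as the prompt requests.

```python
import re

# Match PASS / FAIL only as standalone upper-case words
# (so "FAILED" or "failure" in an explanation is ignored).
VERDICT_RE = re.compile(r"\b(PASS|FAIL)\b")


def parse_verdicts(evaluation: str) -> list[str]:
    """Return the PASS/FAIL tokens found in the evaluator text, in order."""
    return VERDICT_RE.findall(evaluation)


def all_passed(evaluation: str, expected_count: int) -> bool:
    """True only if there is exactly one verdict per criterion and all are PASS."""
    verdicts = parse_verdicts(evaluation)
    return len(verdicts) == expected_count and all(v == "PASS" for v in verdicts)
```

With this in place, the check in step 3 could become `all_passed(evaluation, len(criteria))`. A reply with a missing or extra verdict is treated as a failure, which is the conservative choice for a quality gate.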

Criteria Examples

For blog posts:

  • Contains a clear hook in the first paragraph
  • Supports claims with specific examples or data
  • Includes a call-to-action at the end
  • Appropriate reading level for target audience (8th-10th grade)
  • No grammatical errors or typos

For code generation:

  • Code runs without syntax errors
  • Includes error handling for edge cases
  • Follows project conventions (naming, structure)
  • Contains comments for non-obvious logic
  • Passes the provided test cases

For data analysis:

  • All claims are backed by the data shown
  • Calculations are explained step by step
  • Limitations or caveats are explicitly stated
  • Conclusions don't overstate the evidence
  • Format is appropriate for the audience (executive vs. technical)

Production Config

reflexion:
  quality:
    evaluator_model_tier: "lightweight"  # Cheap evaluation
    executor_model_tier: "standard"      # Main work
    max_iterations: 3
    early_exit_threshold: 0.85           # If 85%+ criteria pass, accept
    criteria_weights:                    # Not all criteria are equal
      accuracy: 3
      completeness: 2
      formatting: 1
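How `early_exit_threshold` and `criteria_weights` combine is not spelled out above. One plausible reading is a weighted pass ratio, sketched here. The function name and the default weight of 1 for unlisted criteria are assumptions, not documented Hermes behavior.

```python
def weighted_pass_score(results: dict[str, bool], weights: dict[str, float]) -> float:
    """Share of total criterion weight that passed, in the range 0.0 to 1.0."""
    total = sum(weights.get(name, 1) for name in results)
    if total == 0:
        return 0.0
    passed = sum(weights.get(name, 1) for name, ok in results.items() if ok)
    return passed / total


weights = {"accuracy": 3, "completeness": 2, "formatting": 1}
results = {"accuracy": True, "completeness": True, "formatting": False}

score = weighted_pass_score(results, weights)
accept = score >= 0.85  # early_exit_threshold from the config above
```

Note that with these three weights, failing only the lowest-weighted criterion gives 5/6 (about 0.83), which is still below 0.85. Under this reading the threshold only starts to matter once you have more criteria or a lower threshold.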

Pattern 2: Factual Verification Reflexion

For outputs where factual accuracy is critical — claims must be verified against sources.

import json

def factual_reflexion(claim: str, sources: list[str]) -> dict:
    """
    Verify claims against sources and refine if necessary.
    Returns: {"output": str, "verified": bool, "issues": list}
    """
    issues = []  # Problems the verifier reported on the previous attempt

    for iteration in range(3):
        # Step 1: Generate output with source citations (plus corrections, if any)
        corrections = f"\nCORRECTION NEEDED: {issues}" if issues else ""
        output = hermes_llm.complete(
            messages=[
                {"role": "system", "content": "Generate analysis citing specific sources for each claim. Format: [Claim] (Source: [citation])"},
                {"role": "user", "content": f"Analyze: {claim}\nSources: {sources}{corrections}"}
            ],
            tier="premium"
        )

        # Step 2: Cross-reference each claim against the original, unmodified sources
        verification = hermes_llm.complete(
            messages=[
                {"role": "system", "content": """For each claim in the output, verify it against the original sources.
                Return JSON with 'verified' (bool) and 'issues' (array of problematic claims).

                Sources:
                """ + "\n---\n".join(sources)},
                {"role": "user", "content": output}
            ],
            tier="standard"
        )

        try:
            verification_data = json.loads(verification)
        except json.JSONDecodeError:
            # An unparseable verifier reply counts as unverified, not as a crash
            verification_data = {"verified": False, "issues": ["verifier reply was not valid JSON"]}

        if verification_data.get("verified"):
            return {"output": output, "verified": True, "issues": []}

        # Step 3: Feed issues back to the generator for the next attempt
        issues = verification_data.get("issues", [])

    return {"output": output, "verified": False, "issues": issues}
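Models often wrap JSON in a code fence or add a sentence before it, which makes a bare `json.loads` call brittle. The helper below is one hedged way to parse the verifier reply defensively. `parse_verification` is a hypothetical name, and it assumes the reply contains at most one top-level JSON object.

```python
import json
import re


def parse_verification(raw: str) -> dict:
    """Extract {'verified': bool, 'issues': list} from a possibly noisy verifier reply."""
    # Grab everything from the first "{" to the last "}" so fences or prose are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return {"verified": False, "issues": ["verifier returned no JSON object"]}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"verified": False, "issues": ["verifier returned malformed JSON"]}

    issues = data.get("issues")
    if issues is None:
        issues = []
    elif not isinstance(issues, list):
        issues = [str(issues)]

    # Only a literal JSON true counts as verified; strings like "yes" do not.
    return {"verified": data.get("verified") is True, "issues": issues}
```

Anything that cannot be parsed is reported as unverified, so a formatting slip by the verifier triggers another iteration instead of an exception or a false pass.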

When to Use

  • Writing reports that cite specific data points
  • Generating documentation from API specs
  • Creating content that references statistics or studies
  • Any output where being wrong has real consequences

Cost-Benefit

Factual verification adds 1-2 additional LLM calls per iteration. For high-stakes content, this is negligible compared to the cost of publishing incorrect information.

Pattern 3: Strategy Adjustment Reflexion

Instead of just fixing the output, the agent changes how it approaches the task.

class StrategyReflexion:
    """
    The agent maintains a strategy stack. On failure, it doesn't just
    tweak the output — it changes the strategy.
    """

    STRATEGIES = [
        "direct_answer",        # Answer directly from knowledge
        "research_first",       # Research before answering
        "decompose",            # Break into sub-tasks
        "analogy",              # Use analogies to explain
        "step_by_step",         # Chain-of-thought reasoning
        "examples_driven",      # Lead with concrete examples
        "counterfactual",       # Explore what-if scenarios
    ]

    def execute(self, task: str) -> str:
        attempted_strategies = []
        best_output = None
        best_score = -1.0  # Below any valid score, so the first attempt always becomes the baseline

        for iteration in range(5):
            # Select strategy based on what hasn't worked
            strategy = self._select_strategy(task, attempted_strategies)

            # Execute with strategy prompt
            output = hermes_llm.complete(
                messages=[
                    {"role": "system", "content": f"Strategy: {strategy}\n\n"
                     f"Previously attempted strategies and their issues:\n"
                     f"{self._format_history(attempted_strategies)}\n\n"
                     f"Try a different approach this time."},
                    {"role": "user", "content": task}
                ],
                tier="premium"
            )

            # Evaluate
            score, feedback = self._evaluate(task, output)
            attempted_strategies.append({
                "strategy": strategy,
                "score": score,
                "feedback": feedback
            })

            if score > best_score:
                best_score = score
                best_output = output

            if score >= 0.9:
                return output

        return best_output
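The class above calls `self._evaluate` and `self._format_history` without showing them. The standalone functions below are a minimal sketch of what they might look like, not the actual Hermes helpers. The `complete` parameter is a hypothetical callable that takes a system prompt and a user message and returns the model's text, which lets the logic be tested with a stub.

```python
import re
from typing import Callable


def format_history(history: list[dict]) -> str:
    """Render attempted strategies as one bullet line per attempt."""
    if not history:
        return "(none yet)"
    return "\n".join(
        f"- {h['strategy']} (score {h['score']:.2f}): {h['feedback']}" for h in history
    )


def evaluate(task: str, output: str, complete: Callable[[str, str], str]) -> tuple[float, str]:
    """Ask an evaluator for a 0-1 score on the first line, feedback on the rest."""
    system = (
        "Rate how well the output completes the task. "
        "First line: a single number between 0 and 1. Following lines: brief feedback."
    )
    reply = complete(system, f"Task:\n{task}\n\nOutput:\n{output}")
    first_line, _, rest = reply.strip().partition("\n")
    match = re.search(r"\d*\.?\d+", first_line)
    score = float(match.group(0)) if match else 0.0  # no number found: treat as worst score
    score = min(max(score, 0.0), 1.0)                # clamp out-of-range replies
    return score, rest.strip() or first_line.strip()
```

Inside the class, `_evaluate` could delegate to `evaluate` with a small wrapper around `hermes_llm.complete` at the lightweight tier.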

Strategy Selection Logic

def _select_strategy(self, task: str, history: list) -> str:
    tried = {h["strategy"] for h in history}

    # Classify the task to determine appropriate strategies
    task_analysis = hermes_llm.complete(
        messages=[
            {"role": "system", "content": "Classify this task as exactly one of: simple_factual, complex_reasoning, creative, analytical, instructional. Reply with the label only."},
            {"role": "user", "content": task}
        ],
        tier="lightweight"
    )
    task_type = task_analysis.strip().lower()  # Normalize so the lookup below can match

    # Map task types to preferred strategy order
    strategy_order = {
        "simple_factual": ["direct_answer", "research_first"],
        "complex_reasoning": ["step_by_step", "decompose", "analogy"],
        "creative": ["examples_driven", "analogy", "counterfactual"],
        "analytical": ["decompose", "step_by_step", "research_first"],
        "instructional": ["step_by_step", "examples_driven", "analogy"],
    }

    # Preferred strategies first, then every remaining strategy as a fallback
    preferred = strategy_order.get(task_type, [])
    candidates = preferred + [s for s in self.STRATEGIES if s not in preferred]

    for strategy in candidates:
        if strategy not in tried:
            return strategy

    return self.STRATEGIES[len(history) % len(self.STRATEGIES)]  # Everything tried: cycle

Pattern 4: Multi-Evaluator Reflexion

Use multiple evaluators with different perspectives for robust quality assessment:

EVALUATORS = {
    "accuracy": "Evaluate factual accuracy. Are all claims correct and supported?",
    "clarity": "Evaluate clarity and readability. Would the target audience understand this?",
    "completeness": "Evaluate completeness. Does this fully answer the question?",
    "bias": "Evaluate for bias or one-sidedness. Are multiple perspectives fairly represented?",
    "actionability": "Evaluate actionability. Can the reader act on this information?",
}

def multi_evaluator_reflexion(task: str, active_evaluators: list[str]) -> str:
    """Run Reflexion with multiple evaluator perspectives."""
    prompt = task  # Original task, plus feedback from the latest failed attempt

    for iteration in range(3):
        output = hermes_llm.complete(
            messages=[{"role": "user", "content": prompt}],
            tier="standard"
        )

        feedback = []

        for evaluator_name in active_evaluators:
            eval_prompt = EVALUATORS[evaluator_name]
            result = hermes_llm.complete(
                messages=[
                    {"role": "system", "content": f"{eval_prompt}\nScore PASS or FAIL with explanation."},
                    # Include the task so evaluators such as completeness can judge against it
                    {"role": "user", "content": f"Task:\n{task}\n\nOutput:\n{output}"}
                ],
                tier="lightweight"
            )

            if "FAIL" in result:
                feedback.append(f"[{evaluator_name}] {result}")

        if not feedback:
            return output

        prompt = f"{task}\n\nPrevious attempt feedback:\n" + "\n".join(feedback)

    return output  # Last attempt; it did not pass every evaluator
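Because each evaluator judges the same output independently, the evaluation calls can run concurrently instead of one after another. The sketch below uses the standard library's thread pool. `judge` is a hypothetical callable standing in for the evaluation call, and whether the Hermes client is safe to call from multiple threads is an assumption you would need to confirm.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_evaluators(
    output: str,
    prompts: dict[str, str],
    judge: Callable[[str, str], str],
) -> list[str]:
    """Run every evaluator in parallel and return feedback lines for the FAILs."""
    if not prompts:
        return []  # ThreadPoolExecutor requires at least one worker

    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        futures = {
            name: pool.submit(judge, prompt, output) for name, prompt in prompts.items()
        }
    # Leaving the "with" block waits for all submitted calls to finish.

    return [
        f"[{name}] {future.result()}"
        for name, future in futures.items()
        if "FAIL" in future.result()
    ]
```

An empty return value means every evaluator passed. Wall-clock time drops to roughly the slowest single evaluator, while the number of calls and the cost stay the same.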

Evaluator Selection by Task Type

Task Type                  | Required Evaluators                  | Optional
Technical documentation    | accuracy, clarity, completeness      | -
Marketing copy             | clarity, actionability, bias         | accuracy
Analysis/report            | accuracy, completeness, bias         | clarity
Tutorial/how-to            | clarity, completeness, actionability | -
Opinion/thought leadership | clarity, bias                        | accuracy, completeness
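The table above translates directly into a lookup you can pass to `multi_evaluator_reflexion`. This is an illustrative sketch: the task-type keys and the `evaluators_for` helper are hypothetical names, and the fallback list is an assumption for task types the table does not cover.

```python
EVALUATORS_BY_TASK = {
    "technical_documentation": {"required": ["accuracy", "clarity", "completeness"], "optional": []},
    "marketing_copy": {"required": ["clarity", "actionability", "bias"], "optional": ["accuracy"]},
    "analysis_report": {"required": ["accuracy", "completeness", "bias"], "optional": ["clarity"]},
    "tutorial": {"required": ["clarity", "completeness", "actionability"], "optional": []},
    "opinion": {"required": ["clarity", "bias"], "optional": ["accuracy", "completeness"]},
}

# Assumed fallback for task types missing from the table.
FALLBACK_EVALUATORS = ["accuracy", "clarity", "completeness"]


def evaluators_for(task_type: str, include_optional: bool = False) -> list[str]:
    """Return the evaluator names to run for a task type."""
    entry = EVALUATORS_BY_TASK.get(task_type)
    if entry is None:
        return list(FALLBACK_EVALUATORS)
    selected = list(entry["required"])
    if include_optional:
        selected += entry["optional"]
    return selected
```

Keeping optional evaluators behind a flag lets you enable them only for higher-stakes runs, since each one adds an evaluation call per iteration.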

Production Deployment with Hermes

Configuration

# hermes/config/reflexion.yaml
reflexion:
  default_max_iterations: 3
  cost_ceiling_per_task: 0.50  # USD — hard stop

  evaluators:
    default_set: [accuracy, clarity, completeness]

  model_mapping:
    execution:
      creative: "standard"
      analytical: "premium"
      simple: "lightweight"
    evaluation:
      default: "lightweight"  # Always use cheapest capable model

  failure_handling:
    max_iterations_reached: "return_with_warning"  # or "raise_error"
    cost_ceiling_reached: "return_best_so_far"

  logging:
    trace_all_iterations: true
    store_intermediate_outputs: true  # For debugging
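To illustrate what a hard `cost_ceiling_per_task` implies for the loop, here is a minimal budget tracker. It is a sketch of the idea rather than how Hermes enforces the setting: the class, its method names, and the per-attempt cost estimates are all hypothetical.

```python
class BudgetTracker:
    """Tracks spend for one task against a hard ceiling in USD."""

    def __init__(self, ceiling_usd: float) -> None:
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def can_afford(self, estimated_usd: float) -> bool:
        """Check before starting an attempt, using an estimate of its cost."""
        return self.spent_usd + estimated_usd <= self.ceiling_usd

    def record(self, actual_usd: float) -> None:
        """Record the real cost once the attempt has finished."""
        self.spent_usd += actual_usd


def attempts_within_ceiling(attempt_costs: list[float], ceiling_usd: float = 0.50) -> int:
    """Stand-in for the Reflexion loop: count attempts that fit under the ceiling."""
    budget = BudgetTracker(ceiling_usd)
    completed = 0
    for cost in attempt_costs:
        if not budget.can_afford(cost):
            break  # ceiling reached: stop and return the best output so far
        budget.record(cost)
        completed += 1
    return completed
```

Checking before each attempt, rather than after, is what makes the ceiling a hard stop: the loop never starts an iteration it cannot pay for.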

Hermes Cron Integration

Schedule Reflexion-based quality checks:

# hermes/cron/content_quality_check.yaml
name: content_quality_reflexion
schedule: "0 8,14,20 * * *"  # Three times daily
task: reflexion.quality_check.recent_content
params:
  lookback_hours: 6
  evaluators: [accuracy, clarity, bias]
  auto_fix: true  # Automatically apply corrections
timeout: 600
notify_on: ["failure"]
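One thing worth checking in a schedule like this is whether the lookback window covers the gaps between runs. The small helper below (a hypothetical name, not part of Hermes) lists the hours of the day that no run looks back over, assuming the job only inspects content created inside its lookback window.

```python
def uncovered_hours(run_hours: list[int], lookback_hours: int) -> list[int]:
    """Hours of the day (0-23) whose content no scheduled run would look back over.

    Hour h stands for the interval from h:00 up to (h + 1):00.
    """
    covered = set()
    for run in run_hours:
        for back in range(1, lookback_hours + 1):
            covered.add((run - back) % 24)  # wrap around midnight
    return sorted(set(range(24)) - covered)


gaps = uncovered_hours([8, 14, 20], lookback_hours=6)
```

For the schedule above, `gaps` comes out as `[0, 1, 20, 21, 22, 23]`: content created between 20:00 and 02:00 falls outside every window. If that matters for your workload, a 12-hour lookback or a fourth run at 02:00 would close the gap. Whether Hermes tracks already-checked content some other way is not stated here.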

Monitoring Reflexion Performance

Track these metrics to optimize your Reflexion setup:

  • Pass rate on first attempt: High (>70%) means your initial prompts are good. Low means you're over-relying on Reflexion.
  • Average iterations to pass: Should be under 2. Higher means your evaluation criteria might be too strict or your initial execution is weak.
  • Cost per task (with vs. without Reflexion): Reflexion adds 30-80% to task cost but typically improves quality by 40-60%.
  • False pass rate: Spot-check outputs that passed evaluation. If >5% have errors, your evaluation criteria or model are insufficient.
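The first, second, and fourth metrics can be computed from per-task trace records. The sketch below assumes a simple record shape. `TaskTrace` and its fields are hypothetical, so adapt them to whatever your trace logging actually stores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskTrace:
    iterations: int                          # execute/evaluate rounds used
    passed: bool                             # did the final output pass evaluation?
    spot_check_error: Optional[bool] = None  # human spot-check result; None = not checked


def summarize(traces: list[TaskTrace]) -> dict[str, float]:
    """Compute first-attempt pass rate, average iterations to pass, and false pass rate."""
    passed = [t for t in traces if t.passed]
    first_try = [t for t in passed if t.iterations == 1]
    checked = [t for t in passed if t.spot_check_error is not None]
    return {
        "first_attempt_pass_rate": len(first_try) / len(traces) if traces else 0.0,
        "avg_iterations_to_pass": sum(t.iterations for t in passed) / len(passed) if passed else 0.0,
        "false_pass_rate": sum(t.spot_check_error for t in checked) / len(checked) if checked else 0.0,
    }
```

The false pass rate is computed only over outputs that both passed evaluation and were spot-checked by a human, so it is only as reliable as the size of that sample.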

Common Pitfalls

Over-reliance on Reflexion: If every task needs 3 iterations, fix your initial prompts. Reflexion is a safety net, not a crutch.

Too many criteria: Start with 3 criteria. More than 5 creates evaluation noise and slows convergence.

Same model for both: Don't use the same model for execution and evaluation — it's grading its own homework. Use a lightweight model for evaluation.

No cost ceiling: Always set a cost_ceiling_per_task. Reflexion loops without a ceiling can burn through budget.

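Unclear criteria: "Make it better" isn't evaluable. Each criterion must be specific enough that an evaluator can score it PASS or FAIL. Otherwise the loop never converges and simply runs until it hits max_iterations or the cost ceiling.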

Decision Tree

Need quality improvement on outputs?
├─ Yes, accuracy is critical → Factual Verification Reflexion
├─ Yes, general quality matters → Output Quality Reflexion
├─ Yes, approach keeps failing → Strategy Adjustment Reflexion
├─ Yes, for high-stakes content → Multi-Evaluator Reflexion
└─ No → Direct execution, no Reflexion overhead
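If you route tasks programmatically, the tree can be expressed as a small function. The tree does not say what to do when several branches apply at once, so the priority order below (accuracy, then high stakes, then failing approach) is an assumption, as are the function name and the returned labels.

```python
def choose_pattern(
    needs_quality_improvement: bool,
    accuracy_critical: bool = False,
    high_stakes: bool = False,
    approach_keeps_failing: bool = False,
) -> str:
    """Pick a Reflexion pattern following the decision tree above."""
    if not needs_quality_improvement:
        return "direct_execution"       # no Reflexion overhead
    if accuracy_critical:
        return "factual_verification"
    if high_stakes:
        return "multi_evaluator"
    if approach_keeps_failing:
        return "strategy_adjustment"
    return "output_quality"             # general quality matters
```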
