System Monitoring¶
Autonomous agents need continuous health monitoring. When agents run 24/7 without human supervision, monitoring is the safety net.
Monitoring Layers¶
Layer 1: Process Health
├─ Is the agent process running?
├─ CPU/Memory within bounds?
└─ Any zombie processes?
Layer 2: Operational Health
├─ Are crons firing on schedule?
├─ Are emails being sent/received?
├─ Are API keys still valid?
└─ Token usage within budget?
Layer 3: Quality Health
├─ Are outputs meeting quality thresholds?
├─ Are responses going to the right channels?
└─ Any silent failures detected?
Key Metrics¶
| Metric | Check | Alert Threshold |
|---|---|---|
| Cron completion rate | cronjob list success/fail ratio |
< 95% |
| API token validity | Test auth endpoint | Expired or revoked |
| Disk usage | df -h |
> 90% |
| Memory usage | free -m |
> 90% |
| Email deliverability | SMTP test send | Bounce > 10% |
| LLM token burn | Session token counts | > $50/day |
| Session DB size | SQLite file size | > 1GB |
| Skill staleness | Last-updated timestamp | > 30 days |
Drift Detection¶
CorpusIQ's metric spec system detects when two sources disagree:
# Two sources should agree within 1%
metric_spec_resolve("leads_this_week")
# Returns: {value: 47, drift: {source_a: 47, source_b: 44, delta_pct: 6.4}}
# Flagged: 6.4% > 1% tolerance → investigate
Alerting Channels¶
| Severity | Channel | Example |
|---|---|---|
| Critical | Telegram Topic 2 (dev) + DM | API key expired |
| Warning | Telegram Topic 2 (dev) | Cron failure |
| Info | Logged to activity-log.jsonl | Daily stats |
System Audit¶
Run corpusiq-system-audit skill to run a full six-category audit:
1. Configuration integrity
2. Connection health
3. Cron execution
4. Disk and memory
5. Token and cost
6. Skill freshness
Self-Monitoring Patterns¶
cron: health-check (every 30m)
→ script: check_processes.sh
→ silent if healthy
→ alerts only on threshold breach
cron: drift-report (daily at 6 AM)
→ metric_spec_drift_report
→ reports discrepancies across data sources
→ silent if all within tolerance
Dashboard Files¶
| File | Content |
|---|---|
post-log.jsonl |
All outbound posts |
activity-log.jsonl |
All agent actions |
lead-pipeline.jsonl |
Lead state transitions |
email-monitor.log |
Inbound email processing |
← Scheduling | Email Ops → ↑ Governance