Diagnosing Agent Coordination Failures
Your multi-agent system is failing. But how do you describe what's wrong?
"The agents are... not working together" doesn't help your debugging session. You need a diagnostic vocabulary—precise terms that map symptoms to root causes to solutions.
The Problem with Ad-Hoc Diagnosis
Most teams describe agent failures vaguely:
- "It's stuck"
- "The responses conflict"
- "It's too slow"
- "Something's wrong with the handoffs"
These descriptions don't point to solutions. They're symptoms, not diagnoses.
Every piece of automation you write is a future landmine. Without the right vocabulary, you'll never defuse them.
Musical Diagnostics
I developed a vocabulary borrowed from music, where coordinating 100+ performers in real time is a problem orchestras have been solving for centuries.
Here are the core diagnostic terms:
Timing Failures
| Symptom | Diagnosis | What's Happening |
|---|---|---|
| Deadlock / Hanging | Stuck Fermata | Agent is waiting for a signal that never comes |
| Skipping steps, hallucinating | Rushing | Agent proceeds without required context |
| Can't make decisions | Dragging | Over-deliberation, no urgency signal |
| Acts too early | False Entry | Missing barrier synchronization |
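False Entry is the one row in this table that names a standard concurrency primitive, so it is easy to sketch. The example below uses `threading.Barrier` from Python's standard library. The agent names and the two-phase "prepare, then act" split are assumptions made for illustration, not part of any agent framework.

```python
import threading

def run_with_barrier(agent_names):
    """Run one thread per agent; no agent acts until every agent has prepared."""
    barrier = threading.Barrier(len(agent_names))
    events = []              # ordered record of (phase, agent) tuples
    lock = threading.Lock()  # guards appends to `events`

    def agent(name):
        with lock:
            events.append(("prepare", name))
        # Block here until all agents arrive. The timeout keeps the barrier
        # itself from turning into a Stuck Fermata if one agent dies.
        barrier.wait(timeout=5)
        with lock:
            events.append(("act", name))

    threads = [threading.Thread(target=agent, args=(n,)) for n in agent_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

events = run_with_barrier(["finance", "legal", "product"])
phases = [phase for phase, _ in events]
```

Whatever order the threads are scheduled in, every "prepare" entry lands before the first "act" entry, which is exactly the guarantee a False Entry is missing.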
Coherence Failures
| Symptom | Diagnosis | What's Happening |
|---|---|---|
| Contradictory outputs | Harmonic Clash | No shared constraint on who "wins" |
| Repeating same error | Deaf Agent | Not receiving feedback from verifier |
| Context mismatch | Dissonance | Different agents have different reference frames |
| Wanders off-task | Improvisation Drift | Missing explicit constraints/boundaries |
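To make the Deaf Agent row concrete, here is a minimal sketch of a retry loop that actually feeds the verifier's complaint back into the next attempt. The `generate` and `verify` callables are hypothetical stand-ins for your own agent and verifier, and the toy versions at the bottom exist only so the example runs.

```python
from typing import Callable, Optional

def run_with_feedback(
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> tuple[str, list[str]]:
    """Retry `generate`, passing along every problem the verifier has raised."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        draft = generate(task, feedback)  # the agent "hears" prior feedback
        problem = verify(draft)           # None means the draft passed
        if problem is None:
            return draft, feedback
        feedback.append(problem)
    raise RuntimeError(f"No valid output after {max_attempts} attempts: {feedback}")

# Toy stand-ins (assumptions for illustration only).
def toy_generate(task: str, feedback: list[str]) -> str:
    # Only adds the currency once it has been told it was missing.
    return "Total: 42 USD" if feedback else "Total: 42"

def toy_verify(draft: str) -> Optional[str]:
    return None if "USD" in draft else "missing currency"

result, notes = run_with_feedback(toy_generate, toy_verify, "summarise invoice")
```

A Deaf Agent is the same loop with `feedback` dropped from the `generate` call: it retries, but nothing it learns ever reaches the next attempt.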
From Diagnosis to Fix
Each diagnosis maps to a pattern—a named solution you can implement.
Stuck Fermata → Escalation Pattern
Problem: Agent A sends a fermata (hold) to Agent B, but B never responds.
Fix: Add timeout with automatic escalation.
```python
from hct_mcp_signals import fermata

hold = fermata(
    source="legal_agent",
    reason="Compliance review required",
    hold_type="human",
    timeout_ms=300000,      # 5 minutes
    on_timeout="escalate",  # Auto-route to manager
)
```
That "impossible" race condition in your code? It's not impossible, it's inevitable at scale. Plan for it.
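If you want to see what timeout-with-escalation looks like without any signalling library, here is a minimal sketch built on the standard library's `asyncio.wait_for`. It is not how `hct_mcp_signals` works internally. The helper name `hold_with_escalation` and the three small coroutines are hypothetical and only illustrate the control flow.

```python
import asyncio

async def hold_with_escalation(wait_for_release, escalate, timeout_s):
    """Wait for a hold to be released; escalate if it is not released in time."""
    try:
        return await asyncio.wait_for(wait_for_release(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The release never arrived: hand off instead of hanging forever.
        return await escalate()

async def never_released():
    await asyncio.Event().wait()  # simulates a responder that never answers

async def released_quickly():
    return "approved"

async def route_to_manager():
    return "escalated"

stuck = asyncio.run(hold_with_escalation(never_released, route_to_manager, timeout_s=0.05))
normal = asyncio.run(hold_with_escalation(released_quickly, route_to_manager, timeout_s=1))
```

The important design choice is that the timeout path returns a result rather than raising, so the caller always gets a decision to act on.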
Harmonic Clash → Shared Progressions Pattern
Problem: Finance says "approve" while Legal says "reject".
Fix: Define a voice hierarchy in your reference frame.
```yaml
# reference_frame.yaml
shared_progressions:
  compliance: "Legal has final voice"
  budget: "Finance has final voice"
  product: "Product has final voice"
```
NEVER sacrifice design quality for speed of delivery. Get the hierarchy right first.
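Here is a minimal sketch of how a voice hierarchy could be applied at decision time. It assumes the hierarchy has been loaded into a plain dict that maps each domain to the agent with the final voice. The `FINAL_VOICE` table and the `resolve` helper are hypothetical names, and the lowercase agent ids are a simplification of the YAML strings.

```python
# Domain -> agent with the final voice (simplified from reference_frame.yaml).
FINAL_VOICE = {
    "compliance": "legal",
    "budget": "finance",
    "product": "product",
}

def resolve(domain: str, verdicts: dict[str, str]) -> str:
    """Return the verdict of whichever agent owns the given domain."""
    owner = FINAL_VOICE.get(domain)
    if owner is None:
        raise KeyError(f"No final voice defined for domain {domain!r}")
    if owner not in verdicts:
        raise LookupError(f"{owner!r} has the final voice on {domain!r} but gave no verdict")
    return verdicts[owner]

verdicts = {"finance": "approve", "legal": "reject"}
compliance_decision = resolve("compliance", verdicts)  # Legal wins
budget_decision = resolve("budget", verdicts)          # Finance wins
```

Failing loudly on an unknown domain is deliberate: a clash with no defined owner is a gap in the hierarchy, not something to settle by whoever answered last.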
Rushing → Tempo Control Pattern
Problem: Agent proceeds without the context it needs and fills the gaps with hallucinated output.
Fix: Use explicit urgency signals and Chain-of-Verification.
```python
from hct_mcp_signals import cue  # assumed to be exported alongside fermata

# Instead of just sending content, send with tempo
signal = cue(
    source="orchestrator",
    targets=["analyst"],
    urgency=5,        # Not urgent - take your time
    tempo="andante",  # Walking pace
)
```
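Tempo signals help, but Rushing is ultimately about missing context, so it can be worth gating dispatch on that directly. The sketch below is one way to do it. It is not a Chain-of-Verification implementation, and `REQUIRED_CONTEXT`, the key names, and `dispatch` are all hypothetical.

```python
# Context keys each agent must have before it is allowed to start (assumed).
REQUIRED_CONTEXT = {
    "analyst": {"dataset_id", "reporting_period"},
}

def missing_context(agent: str, context: dict) -> set[str]:
    """Return the required keys that are absent or empty for this agent."""
    required = REQUIRED_CONTEXT.get(agent, set())
    return {key for key in required if context.get(key) in (None, "")}

def dispatch(agent: str, context: dict) -> dict:
    """Hold the agent if context is incomplete; otherwise let it proceed."""
    missing = missing_context(agent, context)
    if missing:
        return {"status": "held", "missing": sorted(missing)}
    return {"status": "dispatched", "agent": agent}

held = dispatch("analyst", {"dataset_id": "q3"})
sent = dispatch("analyst", {"dataset_id": "q3", "reporting_period": "2024-Q3"})
```

Returning the list of missing keys, rather than a bare refusal, gives the orchestrator something specific to go and fetch before it retries.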
The Full Library
We've documented 15 diagnostic patterns covering:
- 5 Timing failures
- 5 Coherence failures
- 5 Resource failures
Each includes:
- How to spot it in your logs
- The root cause
- Code to fix it
Explore the Patterns Library →
Naming Things Matters
The act of naming a problem is half the battle. When you can say "we have a Stuck Fermata problem," your team immediately knows:
- What symptom to look for
- What the root cause is
- Which pattern to apply
That's the power of a shared diagnostic vocabulary.
Sometimes the best automation is no automation. But when you do automate, give yourself the language to debug it.
Next time your agents misbehave, don't say "it's broken." Say "it's a Harmonic Clash" and reach for the pattern.
Enjoy this? You might like SeekingSota — weekly essays on what happens when engineers stop programming and start conducting AI agents.