Diagnosing Agent Coordination Failures

A Vocabulary for What's Going Wrong

Your multi-agent system is failing. But how do you describe what's wrong?

"The agents are... not working together" doesn't help your debugging session. You need a diagnostic vocabulary: precise terms that map symptoms to root causes to solutions.

I spent three hours debugging why my AWS VPC peering connection was dropping packets despite the route tables being correctly propagated, only to discover the NACLs were silently blocking ephemeral ports.

Yesterday? I watched an agent fail for the exact same reason. The pizza was good. The debugging was better.

The Problem with Ad-Hoc Diagnosis

Most teams describe agent failures vaguely:

"It's stuck"
"The responses conflict"
"It's too slow"
"Something's wrong with the handoffs"

These descriptions don't point to solutions. They're symptoms, not diagnoses.

Every piece of automation you write is a future landmine. Without the right vocabulary, you'll never defuse them.

The system punishes your lies, eventually. Tell the truth about the system, or the system will lie to you.

Musical Diagnostics

I developed a vocabulary borrowed from music, a field that has been coordinating 100+ performers in real time for centuries.

Here are the core diagnostic terms:

Timing Failures

Symptom	Diagnosis	What's Happening
Deadlock / Hanging	Stuck Fermata	Agent is waiting for a signal that never comes
Skipping steps, hallucinating	Rushing	Agent proceeds without required context
Can't make decisions	Dragging	Over-deliberation, no urgency signal
Acts too early	False Entry	Missing barrier synchronization
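
To make one row concrete: a False Entry is an agent acting before its peers are ready. Here is a minimal sketch of barrier synchronization using Python's standard threading.Barrier. The run_section helper and the agent names are hypothetical, made up for illustration, and not part of any library mentioned in this post.

```python
import threading


def run_section(names):
    """Run one 'section' of agents; nobody acts until everyone is ready."""
    barrier = threading.Barrier(len(names))
    events = []  # list.append is thread-safe in CPython

    def agent(name):
        events.append(("ready", name))
        # Block here until every agent has reached the barrier.
        barrier.wait(timeout=5)
        events.append(("act", name))

    threads = [threading.Thread(target=agent, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for kind, name in run_section(["finance", "legal", "product"]):
        print(kind, name)
```

Every "ready" event lands before the first "act" event, whatever order the threads happen to be scheduled in. The timeout on wait() is there so a missing agent raises an error rather than turning a False Entry into a Stuck Fermata.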

Coherence Failures

Symptom	Diagnosis	What's Happening
Contradictory outputs	Harmonic Clash	No shared constraint on who "wins"
Repeating same error	Deaf Agent	Not receiving feedback from verifier
Context mismatch	Dissonance	Different agents have different reference frames
Wanders off-task	Improvisation Drift	Missing explicit constraints/boundaries

From Diagnosis to Fix

Each diagnosis maps to a pattern, a named solution you can implement.

Stuck Fermata → Escalation Pattern

Problem: Agent A sends a fermata (hold) to Agent B, but B never responds.

Fix: Add timeout with automatic escalation.

from hct_mcp_signals import fermata

hold = fermata(
    source="legal_agent",
    reason="Compliance review required",
    hold_type="human",
    timeout_ms=300000,  # 5 minutes
    on_timeout="escalate"  # Auto-route to manager
)
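
If you are not using a signals library, the same idea can be sketched with plain asyncio. This illustrates timeout-plus-escalation in general, not how hct_mcp_signals works internally. The names hold_with_escalation, silent_reviewer, and manager are hypothetical.

```python
import asyncio


async def hold_with_escalation(wait_for_release, timeout_s, escalate):
    """Wait for a hold to be released; escalate if nobody answers in time."""
    try:
        return await asyncio.wait_for(wait_for_release(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The hold was never released: hand off instead of hanging forever.
        return await escalate()


async def silent_reviewer():
    # Stand-in for a reviewer that never responds within the timeout.
    await asyncio.sleep(3600)
    return "approved"


async def manager():
    # Stand-in for the escalation target.
    return "escalated to manager"


if __name__ == "__main__":
    print(asyncio.run(hold_with_escalation(silent_reviewer, 0.05, manager)))
```

The important design choice is that the timeout lives with the waiter, not the responder. The agent that is stuck is the only one guaranteed to notice.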

That "impossible" race condition in your code? It's not impossible; it's inevitable at scale. Plan for it.

Every outage teaches you. Compound those lessons or die young in this career.

Harmonic Clash → Shared Progressions Pattern

Problem: Finance says "approve" while Legal says "reject".

Fix: Define a voice hierarchy in your reference frame.

# reference_frame.yaml
shared_progressions:
  compliance: "Legal has final voice"
  budget: "Finance has final voice"
  product: "Product has final voice"
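
One way to enforce that hierarchy at decision time, as a rough sketch. It assumes the YAML has been reduced to a plain mapping from domain to the agent holding the final voice. FINAL_VOICE and resolve are hypothetical names, not part of any library.

```python
# Assumed shape: domain -> agent whose decision wins in that domain.
FINAL_VOICE = {
    "compliance": "legal",
    "budget": "finance",
    "product": "product",
}


def resolve(domain: str, votes: dict[str, str]) -> str:
    """Return the decision of the agent that has the final voice for a domain."""
    owner = FINAL_VOICE.get(domain)
    if owner is None:
        raise KeyError(f"no final voice defined for domain {domain!r}")
    if owner not in votes:
        raise ValueError(f"{owner!r} has the final voice on {domain!r} but did not vote")
    return votes[owner]


if __name__ == "__main__":
    votes = {"finance": "approve", "legal": "reject"}
    print(resolve("compliance", votes))
    print(resolve("budget", votes))
```

Failing loudly when a domain has no owner, or when the owner has not voted, is deliberate: a silent default is how a Harmonic Clash comes back later.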

Sacrificing design quality for speed of delivery is a trap. Get the hierarchy right first.

Rushing → Tempo Control Pattern

Problem: Agent skips steps or hallucinates because it proceeds without the context it needs, as if it feels "rushed".

Fix: Use explicit urgency signals and Chain-of-Verification.

# Assumption: cue is exported by the same module as fermata
from hct_mcp_signals import cue

# Instead of just sending content, send with tempo
signal = cue(
    source="orchestrator",
    targets=["analyst"],
    urgency=5,  # Not urgent - take your time
    tempo="andante"  # Walking pace
)
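
The Chain-of-Verification half of the fix can be sketched independently of any signals library. This is a minimal outline of the usual draft, question, answer, revise loop. Here ask is a placeholder for whatever model call you use, and the prompt wording is an assumption rather than a tested recipe.

```python
from typing import Callable


def chain_of_verification(task: str, ask: Callable[[str], str]) -> str:
    """Draft an answer, question it, answer the questions, then revise."""
    draft = ask(f"Answer the task.\nTask: {task}")

    plan = ask(
        "List fact-checking questions for this draft, one per line.\n"
        f"Draft: {draft}"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # Answer each question without showing the draft, so the checker
    # cannot simply repeat the draft's mistakes.
    checks = [(q, ask(f"Answer concisely.\nQuestion: {q}")) for q in questions]
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)

    return ask(
        "Revise the draft using the verified answers.\n"
        f"Task: {task}\nDraft: {draft}\nVerified:\n{evidence}"
    )
```

Each verification question costs one extra model call, so this trades latency for accuracy. That fits an agent that has been told it is not in a hurry.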

The Full Library

We've documented 15 diagnostic patterns covering:

5 Timing failures
5 Coherence failures
5 Resource failures

Each includes:

How to spot it in your logs
The root cause
Code to fix it

Explore the Patterns Library →

Naming Things Matters

The act of naming a problem is half the battle. When you can say "we have a Stuck Fermata problem," your team immediately knows:

What symptom to look for
What the root cause is
Which pattern to apply

That's the power of a shared diagnostic vocabulary. That's specific knowledge, and you can't copy it.

Sometimes the best automation is no automation. But when you do automate, give yourself the language to debug it.

You become what you pretend to be. Act the diagnostician long enough and you become one, or the feeling of being a fraud wears you out. The ones who survive? They've absorbed the responsibility, and they've made the chaos their own.

Next time your agents misbehave, don't say "it's broken." Say "it's a Harmonic Clash" and reach for the pattern.
