Brain Nervous System — OpenTelemetry Observability
Shipped: 2026-04-13
Scope: Full 5-layer OTel pipeline + deterministic smoke test
Companion docs: docs/architecture/ARCHITECTURE.md
Why
The brain produces thoughts (marker emissions, broadcasts, injections). Before this work, to verify that a broadcast actually reached a session we had to ask the agent “did you see it?” — interrogating the brain instead of reading the nerve signal. This made validation slow, unreliable, and unscalable.
This doc describes the nervous system: the deterministic telemetry pipeline that observes every brain event as an OTel span or log, landing in ClickHouse + Langfuse, queryable from Grafana, and smoke-testable via a single CLI.
Architecture
LAYER 1 — Claude Code native OTel (zero code)
~/.claude/settings.json env vars
CLAUDE_CODE_ENABLE_TELEMETRY=1
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_HOST}:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=claude-code-home
Emits: claude_code.user_prompt, claude_code.tool_result,
claude_code.api_request, plus metrics (tokens, cost, active_time)
LAYER 2 — Agentihooks structured spans
agentihooks/hooks/telemetry.py
emit_span(name, attrs)
span_ctx(name, attrs) → wraps block with duration
emit_log(message, attrs) → OTLP /v1/logs fan-out
Three instrumented functions emit three span types:
brain.inject brain_adapter._publish_entries
brain.marker_write brain_writer_hook.write_markers
brain.delivery broadcast.check_and_inject_broadcasts (per message)
LAYER 3 — Hook log OTLP fan-out
agentihooks/hooks/common.py log() extended
Events matching brain_* / broadcast_* / outbox_* / amygdala_*
additionally POST to OTLP /v1/logs → ClickHouse otel_logs
LAYER 4 — Data plane
otel-collector (Docker Compose at ${OTEL_HOST}:4318)
↓
ClickHouse otel database (otel_traces, otel_logs, otel_metrics)
Langfuse via otlphttp/langfuse exporter (traces only)
LAYER 5 — Visualization
Grafana dashboard /d/<your-dashboard-slug>/brain-nervous-system
4 new panels: inject latency, marker write rate, delivery coverage, error rate
Langfuse <your-langfuse-host>
Per-session trace view: claude_code.user_prompt → brain.inject → brain.delivery(×N)
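The Layer 2 helpers named in the architecture above can be pictured with a short sketch. This is not the code in agentihooks/hooks/telemetry.py: it only assumes that emit_span sends a single span and that span_ctx times the wrapped block. The build_span helper, the sink parameter, and the in-memory SENT list are hypothetical stand-ins for the real OTLP POST; the field names follow the OTLP/JSON span encoding.

```python
import secrets
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real helper would POST to OTLP /v1/traces.
SENT = []


def build_span(name, attrs, start_ns, end_ns):
    """Build one span as an OTLP/JSON-style dict (all attributes as strings)."""
    return {
        "traceId": secrets.token_hex(16),  # 32 hex characters
        "spanId": secrets.token_hex(8),    # 16 hex characters
        "name": name,
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [
            {"key": k, "value": {"stringValue": str(v)}} for k, v in attrs.items()
        ],
    }


def emit_span(name, attrs, sink=SENT.append):
    """Emit a zero-duration span; sink errors never reach the calling hook."""
    now = time.time_ns()
    try:
        sink(build_span(name, attrs, now, now))
    except Exception:
        pass  # assumed fail-silent behaviour


@contextmanager
def span_ctx(name, attrs, sink=SENT.append):
    """Time the wrapped block and emit one span when it exits."""
    start = time.time_ns()
    try:
        yield attrs  # the caller may add attributes while the block runs
    finally:
        try:
            sink(build_span(name, attrs, start, time.time_ns()))
        except Exception:
            pass  # assumed fail-silent behaviour


# Usage: wrap the publish step and record how many entries were published.
with span_ctx("brain.inject", {"channel": "brain"}) as span_attrs:
    span_attrs["entry_count"] = 3
```

Yielding the attribute dict lets the instrumented function fill in counts such as entry_count only once they are known, while the duration is still measured around the whole block.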
Span taxonomy
| Span name | Fired by | Attributes | When |
|---|---|---|---|
| brain.inject | brain_adapter._publish_entries | channel, entry_count, total_bytes, published_count | Session start + every BRAIN_REFRESH_INTERVAL turns |
| brain.marker_write | brain_writer_hook.write_markers | session_id, transcript_path, source, markers_found, outbox_count, redis_count, marker_types | Stop hook (every turn) |
| brain.delivery | broadcast.check_and_inject_broadcasts | session_id, message_id, channel, severity, source, bytes, persistent | UserPromptSubmit, per broadcast message |
| agentihooks.session.stop | hook_manager.on_stop | session_id, tool_calls, errors | Stop hook (existing, pre-this work) |
Log taxonomy
| Log prefix | Source | Purpose |
|---|---|---|
| brain_* | brain_adapter, brain_writer_hook | Deterministic audit of brain lifecycle |
| broadcast_* | broadcast.py | Publish/deliver/expire events |
| outbox_* | brain_writer_hook | Marker file writes |
| amygdala_* | amygdala_hook | Emergency signal reads |
All logs carry the resource attribute service.name=agentihooks and land in otel.otel_logs.
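The routing rule behind the Layer 3 fan-out (only events with the four prefixes go to OTLP) can be sketched as follows. The function names should_fan_out and to_otlp_log are hypothetical and the real log() signature in common.py is not shown in this doc; the payload shape follows the OTLP/JSON logs encoding (resourceLogs, scopeLogs, logRecords).

```python
import json
import time
from fnmatch import fnmatchcase

# Prefix patterns taken from the log taxonomy table.
FAN_OUT_PATTERNS = ("brain_*", "broadcast_*", "outbox_*", "amygdala_*")


def should_fan_out(event: str) -> bool:
    """Return True when the event name matches one of the fan-out prefixes."""
    return any(fnmatchcase(event, pattern) for pattern in FAN_OUT_PATTERNS)


def to_otlp_log(event: str, attrs: dict, severity: str = "INFO") -> dict:
    """Wrap one hook log event in an OTLP/JSON logs payload."""
    record = {
        "timeUnixNano": str(time.time_ns()),
        "severityText": severity,
        "body": {"stringValue": event},
        "attributes": [
            {"key": k, "value": {"stringValue": str(v)}} for k, v in attrs.items()
        ],
    }
    resource = {
        "attributes": [
            {"key": "service.name", "value": {"stringValue": "agentihooks"}}
        ]
    }
    return {
        "resourceLogs": [
            {"resource": resource, "scopeLogs": [{"logRecords": [record]}]}
        ]
    }


# Usage: only matching events are serialized for the POST to /v1/logs.
if should_fan_out("broadcast_delivered"):
    body = json.dumps(to_otlp_log("broadcast_delivered", {"channel": "ops"}))
```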
Query cookbook (ClickHouse)
-- All brain spans in the last hour
SELECT SpanName, count() FROM otel.otel_traces
WHERE SpanName LIKE 'brain.%' AND Timestamp > now() - INTERVAL 1 HOUR
GROUP BY SpanName;
-- Find all brain events for a specific session
SELECT Timestamp, SpanName, SpanAttributes
FROM otel.otel_traces
WHERE SpanAttributes['session_id'] = 'smoke-abc123'
ORDER BY Timestamp;
-- Slowest brain.inject calls in last 24h
SELECT Timestamp, Duration/1e6 AS ms, SpanAttributes['entry_count'] AS entries
FROM otel.otel_traces
WHERE SpanName='brain.inject' AND Timestamp > now() - INTERVAL 24 HOUR
ORDER BY Duration DESC LIMIT 20;
-- Marker write rate by type in the last 7 days
SELECT toStartOfDay(Timestamp) AS day,
SpanAttributes['marker_types'] AS types,
count() AS writes
FROM otel.otel_traces
WHERE SpanName='brain.marker_write' AND Timestamp > now() - INTERVAL 7 DAY
GROUP BY day, types ORDER BY day DESC;
-- Delivery coverage — which channels reached sessions today
SELECT SpanAttributes['channel'] AS channel,
uniqExact(SpanAttributes['session_id']) AS sessions,
count() AS deliveries
FROM otel.otel_traces
WHERE SpanName='brain.delivery' AND toDate(Timestamp) = today()
GROUP BY channel ORDER BY sessions DESC;
-- Recent brain errors from the hook log
SELECT Timestamp, Body, LogAttributes
FROM otel.otel_logs
WHERE SeverityText='ERROR' AND Body LIKE '%brain%'
AND Timestamp > now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC LIMIT 20;
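The cookbook queries can also be run from a script over the ClickHouse HTTP interface. The sketch below assumes a URL in the same shape as the CLICKHOUSE_URL used by brain-smoke (credentials embedded before the host); the helper names build_request, parse_rows, and run_query are illustrative and not part of agentihooks. Credentials are moved into the X-ClickHouse-User and X-ClickHouse-Key headers because urllib does not handle user:password@host URLs on its own.

```python
import json
import urllib.request
from urllib.parse import unquote, urlsplit


def build_request(clickhouse_url: str, sql: str) -> urllib.request.Request:
    """Turn a credentialed URL plus SQL into a POST request for port 8123."""
    parts = urlsplit(clickhouse_url)
    base = f"{parts.scheme}://{parts.hostname}:{parts.port or 8123}/"
    headers = {}
    if parts.username:
        headers["X-ClickHouse-User"] = unquote(parts.username)
    if parts.password:
        headers["X-ClickHouse-Key"] = unquote(parts.password)
    # JSONEachRow returns one JSON object per result row, one per line.
    body = (sql.rstrip().rstrip(";") + " FORMAT JSONEachRow").encode("utf-8")
    return urllib.request.Request(base, data=body, headers=headers, method="POST")


def parse_rows(payload: bytes) -> list:
    """Decode a JSONEachRow response into a list of dicts."""
    lines = payload.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def run_query(clickhouse_url: str, sql: str, timeout: float = 10.0) -> list:
    """Execute the query and return the decoded rows (needs network access)."""
    request = build_request(clickhouse_url, sql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_rows(response.read())
```

For example, run_query(os.environ["CLICKHOUSE_URL"], "SELECT SpanName, count() FROM otel.otel_traces GROUP BY SpanName") would return one dict per span name, assuming the collector has written to the otel database.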
Langfuse usage
- Navigate to <your-langfuse-host>
- Filter service → your Claude Code service name (e.g. claude-code-home) or agentihooks (hook-emitted spans)
- Pick a session trace → expand to see the causal graph:
  - claude_code.user_prompt (root)
  - brain.inject (from SessionStart/refresh)
  - brain.delivery × N (one per message delivered this turn)
  - claude_code.tool_result × M (tool calls the agent made)
  - brain.marker_write (on Stop)
Use Langfuse for “what did this specific session experience” and ClickHouse for “what’s the aggregate system state”.
Smoke test — brain-smoke
Run after any change to brain code:
cd <path-to>/agentihooks
./scripts/brain-smoke # 4 core tests, offline
./scripts/brain-smoke --otel-check # + ClickHouse span verification (5 tests)
CLICKHOUSE_URL="http://default:$PASS@${CLICKHOUSE_HOST}:8123" \
./scripts/brain-smoke --otel-check
Tests:
- inject: SessionStart payload → broadcast.json has brain entries
- delivery: UserPromptSubmit → hook log grows
- marker_write: Stop + lesson marker → outbox gains a file
- error_path: Stop with empty transcript_path → handled gracefully
- otel_spans: ClickHouse has brain.* spans from this run
Exit 0 on all-pass, 1 on any fail. Runs in ~1.5 s. Wired into CI at agentihooks/.github/workflows/brain-smoke.yml.
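The exit-code contract (0 when every test passes, 1 when any test fails) can be illustrated with a minimal runner. This is a sketch of the pattern only, not the brain-smoke script itself; run_checks and the check callables are hypothetical names.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> int:
    """Run every named check and return 0 if all pass, otherwise 1."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing check counts as a failure, not an abort
        print(f"{'PASS' if ok else 'FAIL'} {name}")
        if not ok:
            failed.append(name)
    return 1 if failed else 0


# Usage: placeholder checks standing in for inject, delivery, and so on.
exit_code = run_checks({"inject": lambda: True, "delivery": lambda: True})
```

Running every check before deciding the exit code means one failure does not hide the others in CI output.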
Troubleshooting
Spans don’t show up in ClickHouse
- Check OTel collector is reachable: curl http://${OTEL_HOST}:4318 → expect 404 (endpoint up, wrong path)
- Check env vars: printenv | grep OTEL → should show OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_PROTOCOL
- Check agentihooks telemetry flag: printenv | grep OTEL_HOOKS_ENABLED → should be true
- Restart your Claude Code session: settings.json env is read at startup
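The first three checks above can be scripted. The sketch below is an assumption-laden convenience, not part of agentihooks: missing_env and collector_reachable are hypothetical names, and treating any HTTP response (including the expected 404) as "reachable" mirrors the curl check.

```python
import os
import urllib.error
import urllib.request

REQUIRED = ("OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_PROTOCOL")


def missing_env(env=None) -> list:
    """List the telemetry variables that are unset or not enabled."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if env.get("OTEL_HOOKS_ENABLED", "").lower() != "true":
        missing.append("OTEL_HOOKS_ENABLED")
    return missing


def collector_reachable(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if anything answers HTTP at the endpoint, even with 404."""
    try:
        urllib.request.urlopen(endpoint, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        return True  # an HTTP error status still proves the listener is up
    except (urllib.error.URLError, OSError):
        return False  # refused, unresolved host, or timeout
```

HTTPError is caught before URLError because it is a subclass of it; reversing the order would report a healthy collector as unreachable.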
Langfuse is empty
- Langfuse only receives from the otlphttp/langfuse exporter in the collector
- Verify collector config: ssh <otel-host> "grep -A3 langfuse <path-to-collector>/otelcol.yaml"
- Verify credentials env: ssh <otel-host> "docker exec <otel-collector-container> printenv | grep LANGFUSE"
- For K8s pods: configure your otel-collector chart’s langfuse exporter with credentials from your secret store (e.g. <your-prefix>/otel-collector-prod)
ClickHouse lag > 30s
- Check collector batch processor: default timeout: 5s, so spans may batch for up to 5s before flush
- Query otel.otel_traces with a wider time window (INTERVAL 5 MINUTE)
brain-smoke fails on delivery test
- Means hooks.log didn’t grow: either LOG_HOOKS=false or ~/.agentihooks/logs/ is read-only
- Fix: mkdir -p ~/.agentihooks/logs && echo > ~/.agentihooks/logs/hooks.log
Telemetry is blocking hook execution
- Should not happen (telemetry is fail-silent), but if it does: set OTEL_HOOKS_ENABLED=false
- The file log keeps working independently of the OTLP fan-out
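"Fail-silent" with a kill switch can be sketched as a decorator. This is one possible shape, not the implementation in telemetry.py; it assumes the flag is read on every call and that only the value true enables telemetry. The name fail_silent and the sample function are hypothetical.

```python
import functools
import os


def fail_silent(fn):
    """Skip the call when telemetry is disabled and swallow any exception."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Assumption: only the literal value "true" enables telemetry.
        if os.environ.get("OTEL_HOOKS_ENABLED", "").lower() != "true":
            return None
        try:
            return fn(*args, **kwargs)
        except Exception:
            return None  # telemetry must never break the hook itself

    return wrapper


@fail_silent
def post_span(payload: dict) -> int:
    """Hypothetical sender that fails, standing in for a network error."""
    raise ConnectionError("collector unreachable")
```

With this shape, a collector outage costs at most the time spent inside the wrapped call, so a short network timeout on the sender is still worth setting.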
Files
| Layer | Path |
|---|---|
| 1 | ~/.claude/settings.json (env block) |
| 2 | agentihooks/hooks/telemetry.py |
| 2 | agentihooks/hooks/context/brain_adapter.py:151 |
| 2 | agentihooks/hooks/context/brain_writer_hook.py:168 |
| 2 | agentihooks/hooks/context/broadcast.py:500 |
| 3 | agentihooks/hooks/common.py:102 (log() fan-out) |
| 4 | <your-platform-repo>/stacks/observability/otelcol.yaml (Docker collector) |
| 4 | <your-platform-repo>/k8s/charts/otel-collector/values-{dev,prod}.yaml (K8s collector) |
| 4 | <your-platform-repo>/k8s/charts/otel-collector/templates/external-secret.yaml |
| 5 | <your-platform-repo>/stacks/observability/provisioning/dashboards/<your-dashboard>/brain-health.json (4 new panels, ids 101-104) |
| 5 | agentihooks/scripts/brain-smoke |
| 5 | agentihooks/.github/workflows/brain-smoke.yml |
| 5 | <your-platform-repo>/.claude/skills/brain-smoke/SKILL.md |
One-time setup
- Restart your Claude Code session so the new settings.json env vars take effect.
- Populate your secret store (optional, for K8s pod telemetry to reach Langfuse) with these keys at whatever path your collector overlay references:
  CLICKHOUSE_USER=default
  CLICKHOUSE_PASSWORD=<password>
  LANGFUSE_OTEL_ENDPOINT=<your-langfuse-host>/api/public/otel
  LANGFUSE_OTEL_AUTH=<base64 of pk:sk>
- Sync the collector deployment (ArgoCD, Flux, or helm upgrade) to pick up the new exporter config.
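The LANGFUSE_OTEL_AUTH value is the base64 encoding of the public and secret keys joined by a colon. A small sketch for producing it follows; the function name is hypothetical, and whether your collector config expects this bare value or one prefixed with "Basic " depends on how the exporter header is templated, which this doc does not specify.

```python
import base64


def langfuse_otel_auth(public_key: str, secret_key: str) -> str:
    """Return base64("<public_key>:<secret_key>") as an ASCII string."""
    raw = f"{public_key}:{secret_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Usage with placeholder keys; substitute the real Langfuse project keys.
value = langfuse_otel_auth("pk", "sk")
```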