Brain Nervous System — OpenTelemetry Observability

Shipped: 2026-04-13 Scope: Full 5-layer OTel pipeline + deterministic smoke test Companion docs: docs/architecture/ARCHITECTURE.md

Why

The brain produces thoughts (marker emissions, broadcasts, injections). Before this work, to verify that a broadcast actually reached a session we had to ask the agent “did you see it?” — interrogating the brain instead of reading the nerve signal. This made validation slow, unreliable, and unscalable.

This doc describes the nervous system: the deterministic telemetry pipeline that observes every brain event as an OTel span or log, landing in ClickHouse + Langfuse, queryable from Grafana, and smoke-testable via a single CLI.

Architecture

LAYER 1 — Claude Code native OTel (zero code)
  ~/.claude/settings.json env vars
    CLAUDE_CODE_ENABLE_TELEMETRY=1
    CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1
    OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_HOST}:4318
    OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
    OTEL_SERVICE_NAME=claude-code-home
  Emits: claude_code.user_prompt, claude_code.tool_result,
         claude_code.api_request, plus metrics (tokens, cost, active_time)

LAYER 2 — Agentihooks structured spans
  agentihooks/hooks/telemetry.py
    emit_span(name, attrs)
    span_ctx(name, attrs) → wraps block with duration
    emit_log(message, attrs) → OTLP /v1/logs fan-out
  Three instrumented functions emit three span types:
    brain.inject         brain_adapter._publish_entries
    brain.marker_write   brain_writer_hook.write_markers
    brain.delivery       broadcast.check_and_inject_broadcasts (per message)

LAYER 3 — Hook log OTLP fan-out
  agentihooks/hooks/common.py log() extended
  Events matching brain_* / broadcast_* / outbox_* / amygdala_*
  additionally POST to OTLP /v1/logs → ClickHouse otel_logs

LAYER 4 — Data plane
  otel-collector (Docker Compose at ${OTEL_HOST}:4318)
    ↓
  ClickHouse otel database  (otel_traces, otel_logs, otel_metrics)
  Langfuse via otlphttp/langfuse exporter (traces only)

LAYER 5 — Visualization
  Grafana dashboard /d/<your-dashboard-slug>/brain-nervous-system
    4 new panels: inject latency, marker write rate, delivery coverage, error rate
  Langfuse <your-langfuse-host>
    Per-session trace view: claude_code.user_prompt → brain.inject → brain.delivery(×N)

Span taxonomy

Span name	Fired by	Attributes	When
`brain.inject`	`brain_adapter._publish_entries`	`channel, entry_count, total_bytes, published_count`	Session start + every BRAIN_REFRESH_INTERVAL turns
`brain.marker_write`	`brain_writer_hook.write_markers`	`session_id, transcript_path, source, markers_found, outbox_count, redis_count, marker_types`	Stop hook (every turn)
`brain.delivery`	`broadcast.check_and_inject_broadcasts`	`session_id, message_id, channel, severity, source, bytes, persistent`	UserPromptSubmit, per broadcast message
`agentihooks.session.stop`	`hook_manager.on_stop`	`session_id, tool_calls, errors`	Stop hook (existing, pre-this work)

Log taxonomy

Log prefix	Source	Purpose
`brain_*`	brain_adapter, brain_writer_hook	Deterministic audit of brain lifecycle
`broadcast_*`	broadcast.py	Publish/deliver/expire events
`outbox_*`	brain_writer_hook	Marker file writes
`amygdala_*`	amygdala_hook	Emergency signal reads

All logs carry service.name=agentihooks resource attribute and land in otel.otel_logs.

Query cookbook (ClickHouse)

-- All brain spans in the last hour
SELECT SpanName, count() FROM otel.otel_traces
WHERE SpanName LIKE 'brain.%' AND Timestamp > now() - INTERVAL 1 HOUR
GROUP BY SpanName;

-- Find all brain events for a specific session
SELECT Timestamp, SpanName, SpanAttributes
FROM otel.otel_traces
WHERE SpanAttributes['session_id'] = 'smoke-abc123'
ORDER BY Timestamp;

-- Slowest brain.inject calls in last 24h
SELECT Timestamp, Duration/1e6 AS ms, SpanAttributes['entry_count'] AS entries
FROM otel.otel_traces
WHERE SpanName='brain.inject' AND Timestamp > now() - INTERVAL 24 HOUR
ORDER BY Duration DESC LIMIT 20;

-- Marker write rate by type in the last 7 days
SELECT toStartOfDay(Timestamp) AS day,
       SpanAttributes['marker_types'] AS types,
       count() AS writes
FROM otel.otel_traces
WHERE SpanName='brain.marker_write' AND Timestamp > now() - INTERVAL 7 DAY
GROUP BY day, types ORDER BY day DESC;

-- Delivery coverage — which channels reached sessions today
SELECT SpanAttributes['channel'] AS channel,
       uniqExact(SpanAttributes['session_id']) AS sessions,
       count() AS deliveries
FROM otel.otel_traces
WHERE SpanName='brain.delivery' AND toDate(Timestamp) = today()
GROUP BY channel ORDER BY sessions DESC;

-- Recent brain errors from the hook log
SELECT Timestamp, Body, LogAttributes
FROM otel.otel_logs
WHERE SeverityText='ERROR' AND Body LIKE '%brain%'
AND Timestamp > now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC LIMIT 20;

Langfuse usage

Navigate to <your-langfuse-host>
Filter service → your Claude Code service name (e.g. claude-code-home) or agentihooks (hook-emitted spans)
Pick a session trace → expand to see the causal graph:
- claude_code.user_prompt (root)
  - brain.inject (from SessionStart/refresh)
  - brain.delivery × N (one per message delivered this turn)
  - claude_code.tool_result × M (tool calls the agent made)
  - brain.marker_write (on Stop)

Use Langfuse for “what did this specific session experience” — ClickHouse for “what’s the aggregate system state”.

Smoke test — `brain-smoke`

Run after any change to brain code:

cd <path-to>/agentihooks
./scripts/brain-smoke                # 4 core tests, offline
./scripts/brain-smoke --otel-check   # + ClickHouse span verification (5 tests)

CLICKHOUSE_URL="http://default:$PASS@${CLICKHOUSE_HOST}:8123" \
  ./scripts/brain-smoke --otel-check

Tests:

inject — SessionStart payload → broadcast.json has brain entries
delivery — UserPromptSubmit → hook log grows
marker_write — Stop + lesson marker → outbox gains a file
error_path — Stop with empty transcript_path → graceful
otel_spans — ClickHouse has brain.* spans from this run

Exit 0 on all-pass, 1 on any fail. Runs in ~1.5 s. Wired into CI at agentihooks/.github/workflows/brain-smoke.yml.

Troubleshooting

Spans don’t show up in ClickHouse

Check OTel collector is reachable: curl http://${OTEL_HOST}:4318 → expect 404 (endpoint up, wrong path)
Check env vars: printenv | grep OTEL → should show OTLP_ENDPOINT + PROTOCOL
Check agentihooks telemetry flag: printenv | grep OTEL_HOOKS_ENABLED → should be true
Restart your Claude Code session — settings.json env is read at startup

Langfuse is empty

Langfuse only receives from the otlphttp/langfuse exporter in the collector
Verify collector config: ssh <otel-host> "grep -A3 langfuse <path-to-collector>/otelcol.yaml"
Verify credentials env: ssh <otel-host> "docker exec <otel-collector-container> printenv | grep LANGFUSE"
For K8s pods: configure your otel-collector chart’s langfuse exporter with credentials from your secret store (e.g. <your-prefix>/otel-collector-prod)

ClickHouse lag > 30s

Check collector batch processor: default timeout: 5s — spans may batch for up to 5s before flush
Query otel.otel_traces with a wider time window (INTERVAL 5 MINUTE)

brain-smoke fails on delivery test

Means hooks.log didn’t grow — either LOG_HOOKS=false or ~/.agentihooks/logs/ is read-only
Fix: mkdir -p ~/.agentihooks/logs && echo > ~/.agentihooks/logs/hooks.log

Telemetry is blocking hook execution

Should not happen (telemetry is fail-silent) but if it does: set OTEL_HOOKS_ENABLED=false
The file log keeps working independently of the OTLP fan-out

Files

Layer	Path
1	`~/.claude/settings.json` (env block)
2	`agentihooks/hooks/telemetry.py`
2	`agentihooks/hooks/context/brain_adapter.py:151`
2	`agentihooks/hooks/context/brain_writer_hook.py:168`
2	`agentihooks/hooks/context/broadcast.py:500`
3	`agentihooks/hooks/common.py:102` (`log()` fan-out)
4	`<your-platform-repo>/stacks/observability/otelcol.yaml` (Docker collector)
4	`<your-platform-repo>/k8s/charts/otel-collector/values-{dev,prod}.yaml` (K8s collector)
4	`<your-platform-repo>/k8s/charts/otel-collector/templates/external-secret.yaml`
5	`<your-platform-repo>/stacks/observability/provisioning/dashboards/<your-dashboard>/brain-health.json` (4 new panels, ids 101-104)
5	`agentihooks/scripts/brain-smoke`
5	`agentihooks/.github/workflows/brain-smoke.yml`
5	`<your-platform-repo>/.claude/skills/brain-smoke/SKILL.md`

One-time setup

Restart your Claude Code session so the new settings.json env vars take effect.

Populate your secret store (optional, for K8s pod telemetry to reach Langfuse) with these keys at whatever path your collector overlay references:

CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=<password>
LANGFUSE_OTEL_ENDPOINT=<your-langfuse-host>/api/public/otel
LANGFUSE_OTEL_AUTH=<base64 of pk:sk>

Sync the collector deployment (ArgoCD, Flux, or helm upgrade) to pick up the new exporter config.