Troubleshooting

Top failure modes, ordered by how often they bite. Each entry: symptom, root cause, fix.


1. agent /feed returns connection refused / DNS failure

Symptom: kubectl -n <ns> exec <agent-pod> -- curl $BRAIN_URL/feed returns Could not resolve host or Connection refused.

Root cause: the brain-api pod isn’t running, the Service name is wrong, or the pod’s BRAIN_URL points at the wrong namespace.

Fix:

NS=<your-namespace>
kubectl -n $NS get pod -l 'app.kubernetes.io/instance=agentibrain-brain-api-prod'
kubectl -n $NS get svc agentibrain-brain-api
kubectl -n $NS exec <agent-pod> -- env | grep BRAIN_URL
  • If the pod is missing: check ArgoCD app status, the rollout, and image-pull issues (see #4).
  • If the env var is wrong: see DEPLOYMENT.md § “Agent fleet wiring”.
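
If DNS itself is in question, here is a quick check from inside the agent pod; the service FQDN is an assumption built from the Service name above:

# getent ships in most images; fall back to nslookup if it's missing
kubectl -n $NS exec <agent-pod> -- getent hosts agentibrain-brain-api.$NS.svc.cluster.local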

2. /feed returns HTTP 401

Symptom: brain-api responds, but with 401.

Root cause: bearer token mismatch — agent’s KB_ROUTER_TOKEN env var doesn’t match the one brain-api was started with.

Fix:

# what brain-api expects
kubectl -n <your-namespace> get secret agentibrain-router-secrets \
  -o jsonpath='{.data.KB_ROUTER_TOKEN}' | base64 -d | head -c 12

# what the agent has
kubectl -n <your-namespace> exec <agent-pod> -- sh -c 'echo "$KB_ROUTER_TOKEN" | head -c 12'

If they don’t match, restart the agent pod after fixing the chart values, OR update agentibrain-router-secrets to match what the chart deployed.
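
One way to bounce the agent after the values change; the Deployment name here is a placeholder for your agent workload:

kubectl -n <your-namespace> rollout restart deploy/<agent-deployment>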


3. ExternalSecret in SecretSyncedError

Symptom: kubectl get externalsecret embeddings-secrets -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' returns False.

Root causes (in rough order of likelihood):

  1. The path in your secret store doesn’t exist or has zero keys.
  2. ESO can’t authenticate to your store (auth token expired or missing).
  3. ESO can’t reach your store network-wise.

Fix:

# describe shows the actual error
kubectl -n <your-namespace> describe externalsecret embeddings-secrets | grep -E "Reason|Message" | head

# verify the path in your secret store using its native CLI/UI

# check ESO auth
kubectl -n external-secrets logs deploy/external-secrets --tail=50 | grep -i auth
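
Once the store path or credentials are fixed, you can force an immediate re-sync instead of waiting for the refresh interval (ESO documents a force-sync annotation for this):

kubectl -n <your-namespace> annotate externalsecret embeddings-secrets \
  force-sync=$(date +%s) --overwrite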

4. Pod stuck in CreateContainerConfigError

Symptom: kubectl get pod shows CreateContainerConfigError. Describe says “secret X not found”.

Root cause: the K8s Secret the pod’s envFrom references hasn’t been created yet (ESO not synced) or was deleted.

Fix:

NS=<your-namespace>
kubectl -n $NS describe pod <pod-name> | grep -E "Warning|Error" | head
kubectl -n $NS get secret embeddings-secrets
# if missing:
kubectl get externalsecret -A | grep embeddings
# wait 30s, then:
kubectl -n $NS delete pod <pod-name>
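
Instead of the fixed 30 s, you can wait on the ExternalSecret’s Ready condition before the delete (kubectl wait works on any resource with status conditions):

kubectl -n $NS wait externalsecret/embeddings-secrets --for=condition=Ready --timeout=60s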

If the Reloader controller is installed, it restarts the pod automatically once the Secret changes.


5. tick-drain Job pods all “Failed”

Symptom: kubectl -n <your-ops-namespace> get jobs | grep tick-drain shows recent jobs as Failed.

Root cause: the tick-engine image is missing a Python module (Dockerfile drift), or the script crashed on a malformed request file in ticks/requested/.

Fix:

# read the latest pod's log
POD=$(kubectl -n <your-ops-namespace> get pod -o name | grep tick-drain | tail -1 | sed 's|pod/||')
kubectl -n <your-ops-namespace> logs $POD | tail -50
  • ModuleNotFoundError: rebuild image (see PR #3 in kernel for the precedent — COPY *.py ./).
  • malformed file: move the offending file out of requested/ to failed/ manually, as sketched below.
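
A sketch of the manual move; the vault host and path layout are assumptions carried over from #8 and #10:

# adjust host and path to your vault mount
ssh <your-vault-host> "mv <your-vault-path>/brain-feed/ticks/requested/<bad-file> \
  <your-vault-path>/brain-feed/ticks/failed/"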

6. /tick request stays in requested/ forever

Symptom: POST /tick returns 202 with a job_id, but the file in brain-feed/ticks/requested/ never moves.

Root cause: tick-drain CronJob disabled or broken.

Fix:

kubectl -n <your-ops-namespace> get cronjob agentibrain-brain-ops-tick-drain
# SUSPEND should be False; is the last successful time recent?
kubectl -n <your-ops-namespace> describe cronjob agentibrain-brain-ops-tick-drain | tail -20

If suspended, unsuspend it:

kubectl -n <your-ops-namespace> patch cronjob agentibrain-brain-ops-tick-drain \
  -p '{"spec":{"suspend":false}}'
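
To drain immediately rather than wait for the schedule (same create-job pattern as the force-tick in #8):

kubectl -n <your-ops-namespace> create job --from=cronjob/agentibrain-brain-ops-tick-drain \
  tick-drain-manual-$(date +%s)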


7. ArgoCD app SyncError: “shared resource”

Symptom: app-of-apps-prod reports SharedResourceWarning or Application X is part of multiple app-of-apps.

Root cause: an ArgoCD Application CR has the same metadata.name in both k8s/argocd/dev/ and k8s/argocd/prod/. They fight.

Fix: rename one. Convention: prod CRs take a -prod suffix; dev CRs use the unsuffixed name.
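
A quick way to surface colliding names across the two trees; this assumes the usual two-space indent for metadata.name:

# crude but effective: metadata.name values that appear more than once
grep -rh '^  name:' k8s/argocd/dev/ k8s/argocd/prod/ | sort | uniq -d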


8. Vault file written via /marker but agents don’t see it in /feed

Symptom: /marker returned 201, the file is on NFS, but /feed doesn’t include it.

Root cause: /feed doesn’t read every vault file. It only reads brain-feed/hot-arcs.md, brain-feed/inject.md, brain-feed/intent.md, etc. Lessons and decisions land in their own directories and don’t show up in the feed until the next tick promotes a related arc.

Fix: wait for the next 2 h tick, or force one:

kubectl -n <your-ops-namespace> create job --from=cronjob/agentibrain-brain-ops \
  agentibrain-brain-ops-manual-$(date +%s)


9. Embeddings pod CrashLoop with “could not connect to Postgres”

Symptom: agentibrain-embeddings-0 restarts with Postgres connection error.

Root causes:

  1. POSTGRES_URL env var wrong — wrong host, wrong port, wrong creds.
  2. Postgres host unreachable from the cluster.
  3. pgvector extension not installed in the embeddings DB.

Fix:

NS=<your-namespace>
kubectl -n $NS exec agentibrain-embeddings-0 -- env | grep POSTGRES_URL | head -c 60
ssh <your-postgres-host> "docker exec <postgres-container> psql -U embeddings -d embeddings -c 'SELECT 1;'"
ssh <your-postgres-host> "docker exec <postgres-container> psql -U embeddings -d embeddings -c '\\dx vector'"
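
If \dx shows no vector extension, install it. CREATE EXTENSION is the standard pgvector setup statement, but the extension files must already be present in the Postgres image:

ssh <your-postgres-host> "docker exec <postgres-container> psql -U embeddings -d embeddings \
  -c 'CREATE EXTENSION IF NOT EXISTS vector;'"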

10. Amygdala broadcasting stale signal

Symptom: agentihooks brain_adapter injection contains an old signal that’s been resolved.

Root cause: the signal file in amygdala/ wasn’t deleted/updated. Tick auto-tombstoning didn’t trigger.

Fix:

ssh <your-vault-host> "ls <your-vault-path>/amygdala/"
# remove the resolved signal file:
ssh <your-vault-host> "rm <your-vault-path>/amygdala/<filename>.md"
# next /feed will refresh

11. brain-ops 2 h tick fails with “ImportError”

Symptom: brain-ops job fails. Pod log shows ModuleNotFoundError.

Root cause: tick-engine image stale — a Python module was added in source but the Dockerfile COPY line missed it.

Fix: a kernel PR with COPY *.py ./ (or explicit adds). Force docker-build to rebuild :latest; ArgoCD image-updater picks up the new digest.
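
A sketch of the Dockerfile change; the commented-out line is illustrative, the real file lives in kernel:

# before (drift-prone): modules listed one by one, so new files get missed
# COPY tick_drain.py ./
# after: glob every top-level module
COPY *.py ./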


12. Two agents writing the same marker collide

Symptom: duplicate entries in lessons-YYYY-MM-DD.md, or 409 from /marker.

Root cause: missing or duplicate X-Idempotency-Key.

Fix: clients MUST send a per-marker idempotency key. Pattern: <session_id>-<marker_index>. Replays return the original response with idempotent_replay: true — that’s the contract, not an error.
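
A hedged sketch of a conforming call; the Authorization header follows #2, and the body fields are illustrative only:

curl -X POST "$BRAIN_URL/marker" \
  -H "Authorization: Bearer $KB_ROUTER_TOKEN" \
  -H "X-Idempotency-Key: ${SESSION_ID}-${MARKER_INDEX}" \
  -H "Content-Type: application/json" \
  -d '{"type":"lesson","text":"..."}'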


13. blackbox probe shows agentibrain-embeddings-k8s-prod down

Symptom: Grafana / Alertmanager fires for the kernel embeddings probe.

Root cause: the probe URL drifted vs reality. Check k8s/charts/blackbox-exporter/values-targets.yaml; the expected target is port 8080, path /health, host agentibrain-embeddings.<your-namespace>.svc.cluster.local.

Fix: correct the URL in values-targets.yaml and redeploy blackbox.
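
What the target should resolve to; the exact values schema is chart-specific, so treat this as a sketch:

# sketch only; match your chart's actual targets schema
- name: agentibrain-embeddings-k8s-prod
  url: http://agentibrain-embeddings.<your-namespace>.svc.cluster.local:8080/health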


14. ArgoCD app stuck “OutOfSync” for hours

Symptom: the app’s source revision matches main HEAD, but the app reports OutOfSync.

Root causes:

  1. A finalizer on a deleted CR is blocking sync.
  2. A Helm template renders different output than what’s in cluster.

Fix:

# inspect operationState message
kubectl -n argocd get app <app-name> -o jsonpath='{.status.operationState.message}'

# common: "waiting for deletion of X"
kubectl -n argocd patch app <app-name> --type=merge \
  -p '{"metadata":{"finalizers":[]}}'

15. embeddings :8080/health timing out

Symptom: an external docker consumer can’t reach embeddings on its LoadBalancer IP.

Root cause: agentibrain-embeddings Service either lost its LoadBalancer type or your LB controller’s IP binding broke.

Fix:

kubectl -n <your-namespace> get svc agentibrain-embeddings -o jsonpath='type={.spec.type} ip={.status.loadBalancer.ingress[0].ip}'
# expect: type=LoadBalancer ip=<your-cluster-ip>

If it shows ClusterIP, the values-prod.yaml LB config didn’t apply — check ArgoCD sync state.
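
To isolate LB vs Service problems, bypass the LB with a port-forward; port 8080 is assumed from the probe in #13:

kubectl -n <your-namespace> port-forward svc/agentibrain-embeddings 8080:8080 &
sleep 2; curl -s localhost:8080/health
# success here means the pods are healthy and the LB layer is at fault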


When this doc isn’t enough