Measuring Agent Quality

Evaluation answers the question “is my agent good?” — not just “does it run?” It uses an LLM-as-judge to score agent behavior by reading the execution traces captured during real sessions. Evaluation never re-runs the agent — the trace record is sufficient.

The platform supports two evaluation backends: AgentCore Evaluation (the default, powered by AWS Bedrock) and Langfuse (provider-agnostic, dashboard-driven). Choose the backend with the evaluation.provider field in the blueprint.

How it works

Agent runs → OTEL traces captured → Evaluation reads traces → Judge model scores → Results stored

Observability must be enabled so traces are captured. Evaluation then reads those traces and runs them through the configured judge model. This decoupling means you can evaluate past sessions, investigate specific failures without recreating them, and run batch evaluations.
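
The decoupling described above can be pictured with a small, self-contained sketch. It is not the platform's implementation: `StoredTrace`, `Score`, `evaluate_stored`, and `toy_judge` are hypothetical names, and the judge is a trivial placeholder standing in for the LLM. The point it illustrates is that scoring only reads stored traces and never calls the agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StoredTrace:
    """Hypothetical stand-in for one captured trace."""
    session_id: str
    user_input: str
    agent_output: str


@dataclass(frozen=True)
class Score:
    session_id: str
    value: float
    explanation: str


# A judge is any callable that reads a trace and returns (value, explanation).
Judge = Callable[[StoredTrace], tuple[float, str]]


def evaluate_stored(traces: list[StoredTrace], judge: Judge) -> list[Score]:
    """Score already-captured traces; the agent itself is never invoked."""
    scores = []
    for trace in traces:
        value, explanation = judge(trace)
        scores.append(Score(trace.session_id, value, explanation))
    return scores


def toy_judge(trace: StoredTrace) -> tuple[float, str]:
    """Placeholder for the LLM judge: passes if the output is non-empty."""
    if trace.agent_output.strip():
        return 1.0, "agent produced a response"
    return 0.0, "agent produced no response"


# Made-up traces, standing in for sessions recorded earlier.
traces = [
    StoredTrace("sess-1", "reset my password", "Done, check your email."),
    StoredTrace("sess-2", "cancel my order", ""),
]
results = evaluate_stored(traces, toy_judge)
```

Because the input is a list of stored records, the same function covers a single problem session and a batch of past sessions.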

Evaluation providers

provider: agentcore (default)

Uses AWS Bedrock AgentCore Evaluation. The judge model must be a Bedrock-hosted model. Online config (continuous sampling) is fully functional via the AgentCore API.

Requirements: AWS_REGION env var; agent IAM role must have Bedrock evaluation permissions.

evaluation:
  provider: agentcore
  online:
    sampling_rate: 10
    evaluators:
      - Builtin.GoalSuccessRate
      - Builtin.Correctness

provider: langfuse

Uses LangfuseEvaluationClient — works with any inference provider. Requires LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY.

Key difference: Langfuse online evaluation is configured in the Langfuse dashboard, not at agent runtime. The create_online_config() call logs a reminder and returns a no-op — the loader wiring keeps working, but continuous evaluation runs are triggered from the Langfuse UI or REST API, not from blueprint env vars. On-demand scoring via EvaluationClient.run() is fully functional and writes scores to Langfuse traces via langfuse.score().

evaluation:
  provider: langfuse
  online:
    sampling_rate: 10     # noted in logs; configure evaluation runs in Langfuse dashboard
    evaluators:
      - Builtin.GoalSuccessRate

Install the required extra: pip install 'agent-core[evaluation_langfuse]'
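
A quick preflight check can catch a missing variable before the first evaluation run. The sketch below only encodes the environment variables listed in the requirements above. The function name `missing_env` is hypothetical and not part of agent-core, and IAM permissions are not checked.

```python
import os
from typing import Mapping

# Environment variables named in the provider requirements above.
REQUIRED_ENV = {
    "agentcore": ("AWS_REGION",),
    "langfuse": ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"),
}


def missing_env(provider: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty for a provider."""
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown evaluation provider: {provider!r}")
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
print(missing_env("langfuse", {"LANGFUSE_HOST": "https://langfuse.example"}))
# -> ['LANGFUSE_PUBLIC_KEY', 'LANGFUSE_SECRET_KEY']
```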

12 built-in evaluators

Response quality — TRACE level (7 evaluators)

| Evaluator | What it measures | Typical score labels |
|---|---|---|
| Builtin.Correctness | Factual accuracy of the response | Correct / Incorrect |
| Builtin.Completeness | Whether all aspects of the question were addressed | Complete / Incomplete |
| Builtin.Faithfulness | Whether claims are grounded in retrieved context (no hallucination) | Faithful / Unfaithful |
| Builtin.Helpfulness | Practical usefulness to the user | Helpful / Not Helpful |
| Builtin.Harmlessness | Absence of harmful, offensive, or dangerous content | Harmless / Harmful |
| Builtin.Coherence | Logical consistency and clarity of the response | Coherent / Incoherent |
| Builtin.Relevance | Whether the response is on-topic | Relevant / Irrelevant |

Task completion — SESSION level (1 evaluator)

| Evaluator | What it measures |
|---|---|
| Builtin.GoalSuccessRate | Whether the agent achieved the user's stated goal end-to-end |

GoalSuccessRate is usually the most valuable single metric because it evaluates the full conversation holistically, not just the final response.

Tool usage — SPAN level (2 evaluators)

| Evaluator | What it measures |
|---|---|
| Builtin.ToolSelectionAccuracy | Whether the agent selected the right tools for the task |
| Builtin.ToolParameterAccuracy | Whether tool inputs were correctly specified |

Low ToolParameterAccuracy often reveals prompt engineering issues where the agent misunderstands what a tool expects.

Safety — TRACE level (2 evaluators)

| Evaluator | What it measures |
|---|---|
| Builtin.Harmfulness | Detection of dangerous, illegal, or harmful content |
| Builtin.Stereotyping | Detection of biased or stereotyped outputs |

Scoring scale

All evaluators return a numeric score from 0.0 to 1.0 and a categorical label:

| Score | Typical label | Meaning |
|---|---|---|
| 1.0 | Achieved / Correct / Accurate / Harmless | Full success |
| 0.5 | Partial / Partially Compliant | Partial success |
| 0.0 | Failed / Incorrect / Inaccurate / Harmful | Failure |

The label vocabulary varies by evaluator but follows the same numeric scale. Every score includes an explanation field with the judge model’s reasoning.
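
Since the labels differ per evaluator, aggregating across evaluators is easier on the numeric score. The following is a minimal sketch, assuming that any value strictly between 0.0 and 1.0 counts as partial success. The scale above only lists 0.0, 0.5, and 1.0, so that bucketing rule is an assumption, and `outcome` and `tally` are hypothetical helper names.

```python
from collections import Counter


def outcome(score: float) -> str:
    """Map a numeric score onto the three meanings in the scoring scale."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score == 1.0:
        return "full success"
    if score == 0.0:
        return "failure"
    # Assumption: everything between the two extremes is treated as partial.
    return "partial success"


def tally(scores: list[float]) -> Counter:
    """Count how many scores fall into each outcome bucket."""
    return Counter(outcome(s) for s in scores)


summary = tally([1.0, 1.0, 0.5, 0.0])
print(summary["full success"], summary["partial success"], summary["failure"])
# -> 2 1 1
```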

On-demand evaluation

Score a specific session immediately after it completes:

from agent_core.evaluation.client import EvaluationClient

client = EvaluationClient(region="us-west-2")

result = client.run(
    agent_id="my-agent",
    session_id="sess-a1b2c3",
    evaluators=[
        "Builtin.GoalSuccessRate",
        "Builtin.Correctness",
        "Builtin.ToolSelectionAccuracy",
    ],
)

for score in result.scores:
    print(f"{score.evaluator_name}: {score.label} ({score.value:.2f})")
    print(f"  Explanation: {score.explanation}")

Or via CLI:

agentcli eval run \
  --agent-id my-agent \
  --session-id sess-a1b2c3 \
  --evaluators Builtin.GoalSuccessRate,Builtin.Correctness,Builtin.ToolSelectionAccuracy
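
After an on-demand run, a common next step is to surface only the scores that need attention. The sketch below is one way to do that. `ScoreRecord` is a hypothetical stand-in that mirrors the attributes printed in the Python example (`evaluator_name`, `label`, `value`, `explanation`), and the sample data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """Stand-in with the same attributes the example reads from each score."""
    evaluator_name: str
    label: str
    value: float
    explanation: str


def needs_review(scores: list[ScoreRecord], threshold: float = 1.0) -> list[ScoreRecord]:
    """Return scores below the threshold, worst first."""
    flagged = [s for s in scores if s.value < threshold]
    return sorted(flagged, key=lambda s: s.value)


# Illustrative data only, not real evaluation output.
sample = [
    ScoreRecord("Builtin.GoalSuccessRate", "Achieved", 1.0, "Goal met."),
    ScoreRecord("Builtin.Correctness", "Incorrect", 0.0, "Wrong refund amount."),
    ScoreRecord("Builtin.ToolSelectionAccuracy", "Partial", 0.5, "Extra lookup call."),
]

for score in needs_review(sample):
    print(f"{score.evaluator_name}: {score.label} ({score.value:.2f}) {score.explanation}")
```

The explanation field is what makes this useful for triage: the judge's reasoning travels with each flagged score.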

Online evaluation (continuous monitoring)

Configure continuous scoring of live sessions in the blueprint:

evaluation:
  provider: agentcore
  online:
    sampling_rate: 10      # evaluate 10% of sessions automatically
    evaluators:
      - Builtin.GoalSuccessRate
      - Builtin.Correctness
      - Builtin.ToolSelectionAccuracy

At 10% sampling you get a representative sample at reasonable cost. At 100% you get complete coverage, at roughly ten times the judge-model cost. Results feed into the CloudWatch GenAI Observability dashboard alongside latency and token metrics.
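
To reason about the trade-off, it helps to see how a percentage sampling rate could select sessions and how the rate drives judge-model volume. This is a sketch under stated assumptions, not how AgentCore samples internally: it hashes the session ID so the decision is repeatable, and it assumes one judge call per evaluator per sampled session. Both function names are hypothetical.

```python
import hashlib


def is_sampled(session_id: str, sampling_rate: int) -> bool:
    """Return True for roughly `sampling_rate` percent of session IDs.

    Hashing makes the decision deterministic: the same session ID always
    gets the same answer.
    """
    if not 0 <= sampling_rate <= 100:
        raise ValueError("sampling_rate must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < sampling_rate


def estimated_judge_calls(sessions: int, sampling_rate: int, evaluators: int) -> int:
    """Estimate judge calls, assuming one call per evaluator per sampled session."""
    sampled_sessions = sessions * sampling_rate // 100
    return sampled_sessions * evaluators


# 10,000 sessions with the three evaluators configured above:
print(estimated_judge_calls(10_000, 10, 3))   # -> 3000
print(estimated_judge_calls(10_000, 100, 3))  # -> 30000
```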

Custom LLM-as-judge evaluators

Define domain-specific evaluators when the built-in set is insufficient. The judge model receives the agent trace and scores it according to your instructions:

import os

from agent_core.evaluation.client import EvaluationClient
from agent_core.schemas.evaluation_config import CustomEvaluatorConfig, EvaluatorLevel

client = EvaluationClient(region="us-west-2")

config = CustomEvaluatorConfig(
    name="workflow_compliance",
    level=EvaluatorLevel.TRACE,
    model_id=os.environ["EVALUATOR_MODEL_ID"],  # read from the environment, never hardcoded
    max_tokens=512,
    temperature=0.0,
    instructions=(
        "Evaluate whether the agent followed the required workflow: "
        "1. Gathered all required information before acting. "
        "2. Confirmed the action with the user. "
        "3. Handled errors gracefully. "
        "Context: {context}  Agent response: {assistant_turn}"
    ),
    scale=[1, 5],
)

evaluator_id = client.create_evaluator(config)

result = client.run(
    agent_id="my-agent",
    session_id="sess-001",
    evaluators=[evaluator_id, "Builtin.Faithfulness"],
)

Custom evaluators can also be declared in the blueprint, where they are referenced by name exactly like built-in evaluators:

evaluation:
  provider: agentcore
  custom_evaluators:
    - name: workflow_compliance
      level: TRACE
      model_id: "${EVALUATOR_MODEL_ID}"
      max_tokens: 512
      temperature: 0.0
      instructions: "... {context} {assistant_turn}"
      scale: [1, 5]
  online:
    sampling_rate: 10
    evaluators:
      - Builtin.GoalSuccessRate
      - workflow_compliance
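
A custom evaluator with scale [1, 5] produces raw values that are not directly comparable with the 0.0 to 1.0 scores of the built-in evaluators. This page does not say whether the platform rescales custom scores, so the helper below is only a sketch of a linear rescaling you could apply yourself when charting both kinds of score together. `normalize` is a hypothetical name.

```python
def normalize(raw: float, scale: tuple[float, float]) -> float:
    """Linearly map a raw score on [low, high] onto the range 0.0 to 1.0."""
    low, high = scale
    if high <= low:
        raise ValueError("scale must be (low, high) with high > low")
    if not low <= raw <= high:
        raise ValueError(f"raw score {raw} is outside the scale {scale}")
    return (raw - low) / (high - low)


# With the [1, 5] scale used by workflow_compliance:
print(normalize(1, (1, 5)))  # -> 0.0
print(normalize(3, (1, 5)))  # -> 0.5
print(normalize(5, (1, 5)))  # -> 1.0
```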

Score persistence

Evaluation scores can be persisted to DynamoDB independently of the evaluation provider:

evaluation:
  persistence:
    enabled: true
    table_env: EVAL_TABLE_NAME   # env var holding the DynamoDB table name
    retention_days: 90

Each score is written as a DynamoDB item with a TTL. Query by session via the session_id-index GSI on the evaluation table.
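
DynamoDB TTL expects a numeric attribute holding a Unix epoch time in seconds, so a 90-day retention translates into a timestamp 90 days after the write. The sketch below shows that arithmetic. The attribute names (`expires_at`, `created_at`, and the rest) and the function `build_score_item` are assumptions for illustration, not the platform's actual item schema.

```python
import time
from decimal import Decimal
from typing import Optional

SECONDS_PER_DAY = 86_400


def build_score_item(
    session_id: str,
    evaluator_name: str,
    value: float,
    label: str,
    retention_days: int,
    now: Optional[int] = None,
) -> dict:
    """Build an item dict whose TTL attribute expires after retention_days."""
    created_at = int(time.time()) if now is None else now
    return {
        "session_id": session_id,
        "evaluator_name": evaluator_name,
        # boto3's Table resource rejects Python floats, so use Decimal.
        "value": Decimal(str(value)),
        "label": label,
        "created_at": created_at,
        # Hypothetical TTL attribute: epoch seconds at which the item expires.
        "expires_at": created_at + retention_days * SECONDS_PER_DAY,
    }


item = build_score_item(
    "sess-a1b2c3", "Builtin.Correctness", 1.0, "Correct",
    retention_days=90, now=1_700_000_000,
)
print(item["expires_at"])  # -> 1707776000
```

Reading scores back for one session would then be a boto3 `Table.query` call with `IndexName="session_id-index"` and a key condition on `session_id`, assuming `session_id` is that index's partition key.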

Which evaluators to use

| Scenario | Recommended evaluators |
|---|---|
| General quality baseline | GoalSuccessRate, Correctness, Helpfulness |
| Tool-heavy agents | ToolSelectionAccuracy, ToolParameterAccuracy, GoalSuccessRate |
| RAG / retrieval agents | Faithfulness, Correctness, Completeness |
| Safety-sensitive deployments | Harmlessness, Harmfulness, Stereotyping |
| Production continuous monitoring | GoalSuccessRate at 10% sampling |
| Investigating specific failures | Full set on the problem session |
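
The recommendations above use short names, while the evaluators argument and the blueprint expect the `Builtin.` prefix. A small lookup keeps the two in sync. The scenario keys and the function name `evaluators_for` are hypothetical, chosen only for this sketch. The evaluator lists are copied from the table.

```python
# Short evaluator names per scenario, taken from the recommendations above.
RECOMMENDED = {
    "general": ["GoalSuccessRate", "Correctness", "Helpfulness"],
    "tool_heavy": ["ToolSelectionAccuracy", "ToolParameterAccuracy", "GoalSuccessRate"],
    "rag": ["Faithfulness", "Correctness", "Completeness"],
    "safety": ["Harmlessness", "Harmfulness", "Stereotyping"],
    "monitoring": ["GoalSuccessRate"],
}


def evaluators_for(scenario: str) -> list[str]:
    """Return fully qualified built-in evaluator names for a scenario."""
    if scenario not in RECOMMENDED:
        raise KeyError(f"no recommendation for scenario: {scenario!r}")
    return [f"Builtin.{name}" for name in RECOMMENDED[scenario]]


print(evaluators_for("rag"))
# -> ['Builtin.Faithfulness', 'Builtin.Correctness', 'Builtin.Completeness']
```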

See also