Evaluation

The Evaluation subsystem measures agent output quality. It ships 12 built-in evaluators covering correctness, safety, tool usage, and task completion, and supports custom LLM-as-judge evaluators. Two evaluation backends are supported: AWS Bedrock AgentCore (`agentcore`) and Langfuse (`langfuse`).

Architecture guide: Observability & Evaluation

Key Classes

Class	Module	Purpose
`EvaluationClient`	`agent_core.evaluation.client`	AgentCore backend — wraps `bedrock_agentcore_starter_toolkit.Evaluation`
`LangfuseEvaluationClient`	`agent_core.evaluation.langfuse_client`	Langfuse backend — LLM-as-judge via Langfuse API
`BUILTIN_EVALUATORS`	`agent_core.evaluation.evaluators`	Dict of 12 built-in evaluator metadata
`EvaluationWiring`	`agent_core.evaluation.wiring`	Instantiates the correct backend from blueprint config

The backend is selected by `evaluation.provider` in the blueprint (`agentcore` is the default). `EvaluationWiring` is called automatically by `BlueprintLoader`, so you rarely need to instantiate it directly.
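The selection rule can be pictured as a small dispatch on the provider string. The sketch below is illustrative only: `select_backend`, `BACKENDS`, and `DEFAULT_PROVIDER` are hypothetical names and do not reflect how `EvaluationWiring` is implemented. Only the provider values, the default, and the class locations come from this page.

```python
# Hypothetical sketch of provider-based backend selection.
# Provider names, the default, and class paths come from the table above;
# everything else is an assumption made for illustration.
BACKENDS = {
    "agentcore": "agent_core.evaluation.client.EvaluationClient",
    "langfuse": "agent_core.evaluation.langfuse_client.LangfuseEvaluationClient",
}
DEFAULT_PROVIDER = "agentcore"


def select_backend(blueprint: dict) -> str:
    """Return the dotted path of the backend class for a parsed blueprint."""
    evaluation = blueprint.get("evaluation") or {}
    provider = evaluation.get("provider", DEFAULT_PROVIDER)
    if provider not in BACKENDS:
        raise ValueError(
            f"Unknown evaluation.provider {provider!r}; expected one of {sorted(BACKENDS)}"
        )
    return BACKENDS[provider]


print(select_backend({}))
print(select_backend({"evaluation": {"provider": "langfuse"}}))
```

Rejecting unknown provider values up front, rather than silently falling back to the default, makes a typo in the blueprint visible at load time.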

12 Built-in Evaluators

Response Quality (TRACE level):

`BuiltinEvaluator`	What It Measures
`Correctness`	Factual accuracy of agent responses
`Completeness`	Whether the response fully addresses the request
`Faithfulness`	Whether the response is grounded in retrieved context
`Helpfulness`	How useful the response is to the user
`Harmlessness`	Whether the response avoids harmful content
`Coherence`	Logical consistency and flow
`Relevance`	How relevant the response is to the query

Task Completion (SESSION level):

`BuiltinEvaluator`	What It Measures
`GoalSuccessRate`	Whether the agent achieved the stated goal

Tool Usage (SPAN level):

`BuiltinEvaluator`	What It Measures
`ToolSelectionAccuracy`	Whether the agent chose the right tools
`ToolParameterAccuracy`	Whether the agent passed correct parameters

Safety (TRACE level):

`BuiltinEvaluator`	What It Measures
`Harmfulness`	Detection of harmful or dangerous content
`Stereotyping`	Detection of stereotyping or biased content
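To see how the twelve evaluators split across levels, the tables above can be restated as data. This is a sketch under assumptions: the flat name-to-level mapping and the helper `evaluator_ids_by_level` are hypothetical and may not match the real layout of `BUILTIN_EVALUATORS`. The `Builtin.` prefix follows the evaluator IDs used in the examples on this page.

```python
from collections import defaultdict

# Names and levels copied from the tables above. The flat name -> level layout
# is an assumption for this sketch, not the real shape of BUILTIN_EVALUATORS.
EVALUATOR_LEVELS = {
    "Correctness": "TRACE",
    "Completeness": "TRACE",
    "Faithfulness": "TRACE",
    "Helpfulness": "TRACE",
    "Harmlessness": "TRACE",
    "Coherence": "TRACE",
    "Relevance": "TRACE",
    "GoalSuccessRate": "SESSION",
    "ToolSelectionAccuracy": "SPAN",
    "ToolParameterAccuracy": "SPAN",
    "Harmfulness": "TRACE",
    "Stereotyping": "TRACE",
}


def evaluator_ids_by_level(levels):
    """Group evaluator IDs (with the `Builtin.` prefix) by evaluation level."""
    grouped = defaultdict(list)
    for name, level in levels.items():
        grouped[level].append(f"Builtin.{name}")
    return dict(grouped)


grouped = evaluator_ids_by_level(EVALUATOR_LEVELS)
for level in ("TRACE", "SESSION", "SPAN"):
    print(level, len(grouped[level]), grouped[level])
```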

`EvaluationClient` (AgentCore Backend)

Requires `AWS_REGION` for the AgentCore Evaluation API, and `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY`, and `LANGFUSE_SECRET_KEY` for score export.

from agent_core.evaluation.client import EvaluationClient

client = EvaluationClient(region="us-west-2")

# Evaluate one recorded session with three built-in evaluators
result = client.run(
    agent_id="my-agent",
    session_id="sess-001",
    evaluators=[
        "Builtin.Faithfulness",
        "Builtin.Correctness",
        "Builtin.ToolSelectionAccuracy",
    ],
)

# Print each score with its label, numeric value, and explanation
for score in result.scores:
    print(f"{score.evaluator_name}: {score.label} = {score.value}  ({score.explanation})")
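A common follow-up is to flag low scores, for example in a CI gate. The sketch below does not call the SDK: `Score` is a stand-in dataclass exposing only the four fields printed above, and `below_threshold` is a hypothetical helper. It assumes `value` is numeric and that higher is better, which this page does not state. The sample values are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Score:
    """Stand-in exposing the four fields printed in the example above."""
    evaluator_name: str
    label: str
    value: float
    explanation: str


def below_threshold(scores: Iterable[Score], threshold: float) -> List[Score]:
    """Return the scores whose numeric value is strictly below `threshold`."""
    return [s for s in scores if s.value < threshold]


# Made-up sample values, for illustration only.
sample = [
    Score("Builtin.Faithfulness", "grounded", 0.9, "Claims match the retrieved context."),
    Score("Builtin.Correctness", "incorrect", 0.2, "The stated date is wrong."),
]
for s in below_threshold(sample, threshold=0.5):
    print(f"LOW {s.evaluator_name}: {s.value} ({s.explanation})")
```

With a real result, `result.scores` would take the place of `sample`, provided its items expose the same attributes.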

Online Evaluation

from agent_core.evaluation.client import EvaluationClient
from agent_core.schemas.evaluation_config import OnlineEvaluationConfig

client = EvaluationClient(region="us-west-2")

# Register an online config that evaluates a sample of live sessions
config_id = client.create_online_config(
    agent_id="my-agent",
    config_name="production-monitoring",
    config=OnlineEvaluationConfig(
        sampling_rate=10,   # 10% of sessions
        evaluators=["Builtin.Faithfulness", "Builtin.Harmfulness"],
    ),
)

# Fetch the results collected under that config
results = client.get_online_results(
    agent_id="my-agent",
    config_name="production-monitoring",
)
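To make the meaning of `sampling_rate` concrete, here is one way a percentage-based sampler can be written. This is not a description of how AgentCore selects sessions: `is_sampled` is a hypothetical helper, and hashing the session ID is an assumption chosen so that the same session always gets the same decision. The 1 to 100 range follows the blueprint comment on this page.

```python
import hashlib


def is_sampled(session_id: str, sampling_rate: int) -> bool:
    """Deterministically decide whether a session falls in the sampled share.

    `sampling_rate` is a percentage from 1 to 100.
    """
    if not 1 <= sampling_rate <= 100:
        raise ValueError("sampling_rate must be between 1 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an approximately uniform bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sampling_rate


# Roughly 10% of a large set of session IDs should be selected at rate 10.
hits = sum(is_sampled(f"sess-{i}", 10) for i in range(10_000))
print(f"sampled {hits} of 10000 sessions")
```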

`LangfuseEvaluationClient` (Langfuse Backend)

Use this client when `evaluation.provider: langfuse` is set. Requires `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY`, and `LANGFUSE_SECRET_KEY`.

import os

from agent_core.evaluation.langfuse_client import LangfuseEvaluationClient
from agent_core.schemas.evaluation_config import CustomEvaluatorConfig, EvaluatorLevel

client = LangfuseEvaluationClient(
    agent_id="my-agent",
    host=os.environ["LANGFUSE_HOST"],
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)

# Run evaluators against a trace
result = client.run(
    agent_id="my-agent",
    session_id="sess-001",
    evaluators=["Builtin.Faithfulness"],
)

Online config: `LangfuseEvaluationClient.create_online_config()` is a no-op stub that logs a reminder. Online evaluation for the Langfuse backend is configured in the Langfuse dashboard, not via the SDK.

Custom LLM-as-Judge

Use `CustomEvaluatorConfig` for both backends:

from agent_core.schemas.evaluation_config import CustomEvaluatorConfig, EvaluatorLevel

config = CustomEvaluatorConfig(
    name="domain_accuracy",
    level=EvaluatorLevel.TRACE,
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    max_tokens=1024,
    temperature=0.0,
    instructions="Evaluate factual accuracy for the given context. {context} {assistant_turn}",
    scale=[1, 5],
)

# `client` is an EvaluationClient or LangfuseEvaluationClient created as shown earlier
evaluator_id = client.create_evaluator(config)

result = client.run(
    agent_id="my-agent",
    session_id="sess-001",
    evaluators=[evaluator_id, "Builtin.Faithfulness"],
)

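The `model_id` must come from the blueprint or an explicit parameter and must never be hardcoded in application code. The literal in the example above is a placeholder for illustration.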
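One way to follow that rule is to resolve the model ID in a single place. This is a minimal sketch: `resolve_model_id` is a hypothetical helper, not part of `agent_core`, and it assumes the blueprint has already been parsed into a plain dict with the `evaluation.custom_evaluators` layout shown in the Blueprint Configuration section.

```python
from typing import Optional


def resolve_model_id(
    blueprint: dict, evaluator_name: str, override: Optional[str] = None
) -> str:
    """Return the judge model ID: an explicit override wins, else the blueprint entry."""
    if override:
        return override
    custom = (blueprint.get("evaluation") or {}).get("custom_evaluators") or []
    for entry in custom:
        if entry.get("name") == evaluator_name and entry.get("model_id"):
            return entry["model_id"]
    raise LookupError(f"No model_id configured for evaluator {evaluator_name!r}")


# Parsed blueprint fragment mirroring the YAML on this page.
blueprint = {
    "evaluation": {
        "custom_evaluators": [
            {"name": "domain_accuracy", "model_id": "anthropic.claude-3-haiku-20240307-v1:0"}
        ]
    }
}
print(resolve_model_id(blueprint, "domain_accuracy"))
```

The returned value can then be passed as `model_id=` when building `CustomEvaluatorConfig`. Raising on a missing entry avoids falling back to a hidden default.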

Blueprint Configuration

evaluation:
  provider: agentcore      # agentcore | langfuse

  online:
    sampling_rate: 10      # Percentage of sessions to evaluate (1–100)
    evaluators:
      - Builtin.Faithfulness
      - Builtin.Correctness
      - Builtin.Harmfulness

  custom_evaluators:
    - name: domain_accuracy
      level: TRACE
      model_id: anthropic.claude-3-haiku-20240307-v1:0
      max_tokens: 1024
      temperature: 0.0
      instructions: "Evaluate domain accuracy. {context} {assistant_turn}"
      scale: [1, 5]

  persistence:
    enabled: true
    table_env: EVAL_TABLE_NAME   # Env var name for the DynamoDB table
    retention_days: 90
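Before handing a blueprint to the loader, a quick sanity check of the `evaluation` section can catch common mistakes. This is a sketch under stated assumptions, not the validation `BlueprintLoader` performs: `validate_evaluation` is hypothetical, it takes the section as an already-parsed dict, and it assumes `sampling_rate` is an integer, `level` is one of the three levels named on this page, and `scale` is a `[min, max]` pair.

```python
VALID_PROVIDERS = {"agentcore", "langfuse"}
VALID_LEVELS = {"TRACE", "SESSION", "SPAN"}


def validate_evaluation(section: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    provider = section.get("provider", "agentcore")
    if provider not in VALID_PROVIDERS:
        problems.append(f"provider: unknown value {provider!r}")

    online = section.get("online") or {}
    if "sampling_rate" in online:
        rate = online["sampling_rate"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(rate, int) or isinstance(rate, bool) or not 1 <= rate <= 100:
            problems.append("online.sampling_rate: must be an integer from 1 to 100")

    for i, ev in enumerate(section.get("custom_evaluators") or []):
        if not ev.get("name"):
            problems.append(f"custom_evaluators[{i}]: missing name")
        if ev.get("level") not in VALID_LEVELS:
            problems.append(
                f"custom_evaluators[{i}].level: expected one of {sorted(VALID_LEVELS)}"
            )
        scale = ev.get("scale")
        if scale is not None and not (len(scale) == 2 and scale[0] < scale[1]):
            problems.append(f"custom_evaluators[{i}].scale: expected [min, max] with min < max")
    return problems


section = {
    "provider": "agentcore",
    "online": {"sampling_rate": 10, "evaluators": ["Builtin.Faithfulness"]},
    "custom_evaluators": [{"name": "domain_accuracy", "level": "TRACE", "scale": [1, 5]}],
}
print(validate_evaluation(section))
```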

Evaluation scores are published to CloudWatch under the `/Agents/{agent_name}/Evaluation` namespace and, when Langfuse is enabled, attached to the corresponding Langfuse trace.
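The namespace pattern can be illustrated by building a request in the shape CloudWatch `PutMetricData` accepts. This is a sketch only: `build_metric_request` is a hypothetical helper, and the choice of one metric per evaluator with no unit is an assumption, since this page documents only the namespace.

```python
def build_metric_request(agent_name: str, evaluator_name: str, value: float) -> dict:
    """Build keyword arguments in the shape accepted by CloudWatch PutMetricData."""
    return {
        "Namespace": f"/Agents/{agent_name}/Evaluation",
        "MetricData": [
            {
                "MetricName": evaluator_name,  # assumption: one metric per evaluator
                "Value": float(value),
                "Unit": "None",  # CloudWatch's unit for dimensionless values
            }
        ],
    }


request = build_metric_request("my-agent", "Faithfulness", 0.9)
print(request["Namespace"])
```

With boto3 installed and credentials configured, such a dict could be sent with `boto3.client("cloudwatch").put_metric_data(**request)`.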