agentcli eval
Run an on-demand evaluation against a specific agent session, or check the status of continuous online evaluation. The command uses the EvaluationClient from the agent-core SDK, which dispatches to either the AgentCore Evaluation backend (evaluation.provider: agentcore, the default) or the Langfuse backend (evaluation.provider: langfuse), depending on the provider the agent blueprint declares.
Synopsis
agentcli eval <subcommand> [OPTIONS]
Environment Variables
| Variable | Description |
|---|---|
| AWS_REGION | AWS region for the evaluation client |
Subcommands
| Subcommand | Description |
|---|---|
| run | Run on-demand evaluation for a specific agent and session |
| status | Check online evaluation status and fetch recent results |
agentcli eval run
Score a completed agent session using one or more evaluators. Evaluation reads the OTEL traces captured during the session — no need to re-run the agent.
agentcli eval run --agent-id AGENT_ID --session-id SESSION_ID --evaluators EVALUATORS
Options
| Option | Required | Description |
|---|---|---|
| --agent-id | Yes | AgentCore Runtime agent ID |
| --session-id | Yes | Session ID to evaluate |
| --evaluators | Yes | Comma-separated evaluator IDs |
Built-in Evaluator IDs
| Category | Evaluator ID | What It Measures |
|---|---|---|
| Response Quality | Builtin.Correctness | Factual accuracy of the response |
| Response Quality | Builtin.Completeness | Whether the response addresses all aspects of the question |
| Response Quality | Builtin.Faithfulness | Whether claims are grounded in retrieved context |
| Response Quality | Builtin.Helpfulness | Practical usefulness to the user |
| Response Quality | Builtin.Harmlessness | Absence of harmful content |
| Response Quality | Builtin.Coherence | Logical consistency and clarity |
| Response Quality | Builtin.Relevance | Relevance to the user’s question |
| Task Completion | Builtin.GoalSuccessRate | Whether the agent achieved the user’s stated goal |
| Tool Usage | Builtin.ToolSelectionAccuracy | Whether the agent selected appropriate tools |
| Tool Usage | Builtin.ToolParameterAccuracy | Whether tool inputs were correctly specified |
| Safety | Builtin.Harmfulness | Detection of dangerous or harmful content |
| Safety | Builtin.Stereotyping | Detection of biased or stereotyped outputs |
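When the evaluator list is assembled by a script, the comma-separated `--evaluators` value can be built from a plain list. The following is a minimal sketch, not part of agentcli: `build_eval_run_command` is a hypothetical helper, and it only composes the argument list from the flags documented above without invoking the CLI.

```python
import shlex

# Evaluator IDs copied from the built-in evaluator table.
QUALITY_EVALUATORS = [
    "Builtin.Correctness",
    "Builtin.Completeness",
    "Builtin.Helpfulness",
]


def build_eval_run_command(agent_id, session_id, evaluators):
    """Return the argv list for `agentcli eval run` (hypothetical helper)."""
    if not evaluators:
        raise ValueError("at least one evaluator ID is required")
    return [
        "agentcli", "eval", "run",
        "--agent-id", agent_id,
        "--session-id", session_id,
        # The CLI expects a single comma-separated value with no spaces.
        "--evaluators", ",".join(evaluators),
    ]


argv = build_eval_run_command("my-agent", "sess-a1b2c3", QUALITY_EVALUATORS)
# shlex.join renders the list as a shell-safe command line for logging.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where agentcli is installed.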
Examples
# Evaluate with correctness and goal success
agentcli eval run \
--agent-id my-agent \
--session-id sess-a1b2c3 \
--evaluators Builtin.Correctness,Builtin.GoalSuccessRate
Output:
Running evaluation: agent=my-agent session=sess-a1b2c3 evaluators=2
Evaluation Results
+-------------------------+-----------+-------+------------------------------------------+
| Evaluator | Label | Score | Explanation |
+-------------------------+-----------+-------+------------------------------------------+
| Builtin.Correctness | Correct | 1.00 | The response accurately reflects the ... |
| Builtin.GoalSuccessRate | Achieved | 1.00 | The agent successfully completed the ... |
+-------------------------+-----------+-------+------------------------------------------+
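If the results need to be consumed by a script, the table above can be parsed line by line. This is a rough sketch that assumes the output layout is exactly as shown in this example (pipe-delimited rows between `+---+` borders) and that the Explanation column contains no `|` characters; `parse_results_table` is a hypothetical helper, and the Evaluation SDK is likely a more robust route for programmatic access.

```python
SAMPLE_OUTPUT = """\
Running evaluation: agent=my-agent session=sess-a1b2c3 evaluators=2
Evaluation Results
+-------------------------+-----------+-------+------------------------------------------+
| Evaluator | Label | Score | Explanation |
+-------------------------+-----------+-------+------------------------------------------+
| Builtin.Correctness | Correct | 1.00 | The response accurately reflects the ... |
| Builtin.GoalSuccessRate | Achieved | 1.00 | The agent successfully completed the ... |
+-------------------------+-----------+-------+------------------------------------------+
"""


def parse_results_table(text):
    """Parse the ASCII results table into a list of dicts (hypothetical helper)."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip banner lines and +---+ borders
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            # The first pipe-delimited row is the column header.
            header = [cell.lower() for cell in cells]
            continue
        row = dict(zip(header, cells))
        row["score"] = float(row["score"])
        rows.append(row)
    return rows


results = parse_results_table(SAMPLE_OUTPUT)
for row in results:
    print(row["evaluator"], row["label"], row["score"])
```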
# Evaluate tool usage accuracy
agentcli eval run \
--agent-id my-agent \
--session-id sess-a1b2c3 \
--evaluators Builtin.ToolSelectionAccuracy,Builtin.ToolParameterAccuracy
# Full quality evaluation
agentcli eval run \
--agent-id my-agent \
--session-id sess-a1b2c3 \
--evaluators Builtin.Correctness,Builtin.Completeness,Builtin.Helpfulness,Builtin.GoalSuccessRate,Builtin.ToolSelectionAccuracy
agentcli eval status
Check the status of a continuous online evaluation configuration and retrieve recent aggregate results.
agentcli eval status --agent-id AGENT_ID [--config-name CONFIG_NAME]
Options
| Option | Required | Description |
|---|---|---|
| --agent-id | Yes | AgentCore Runtime agent ID |
| --config-name | No | Online eval config name (default: {agent_id}_online_eval) |
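The default for `--config-name` follows the `{agent_id}_online_eval` pattern from the table above. As an illustration only, the sketch below mirrors that rule in a hypothetical `resolve_config_name` helper; it shows the documented default and makes no claim about how agentcli implements it internally.

```python
def resolve_config_name(agent_id, config_name=None):
    """Return the explicit config name, or the documented default pattern."""
    if config_name:
        return config_name
    return f"{agent_id}_online_eval"


# No --config-name given: the default is derived from the agent ID.
print(resolve_config_name("my-agent"))
# An explicit name is used unchanged.
print(resolve_config_name("my-agent", "my-agent_prod_monitoring"))
```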
Example
agentcli eval status --agent-id my-agent
# Fetching online eval status: agent=my-agent config=my-agent_online_eval
agentcli eval status \
--agent-id my-agent \
--config-name my-agent_prod_monitoring
Score Interpretation
Evaluators return a numeric score and a label:
| Score | Typical Label | Meaning |
|---|---|---|
| 1.0 | Achieved / Correct / Accurate | Full success |
| 0.5 | Partial | Partial success |
| 0.0 | Failed / Incorrect / Inaccurate | Failure |
Scores are on a 0–1 scale. Labels vary by evaluator but follow the same pattern.
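Because labels differ between evaluators, scripts may find it simpler to branch on the numeric score. The sketch below maps a score to the meanings in the table. Treating every value strictly between 0 and 1 as partial success is an assumption, since the table lists only 1.0, 0.5, and 0.0; both helper names and the `threshold` parameter are hypothetical.

```python
def interpret_score(score: float) -> str:
    """Map a 0-1 score to the meaning column of the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score == 1.0:
        return "Full success"
    if score == 0.0:
        return "Failure"
    # Assumption: any intermediate value counts as partial success.
    return "Partial success"


def all_passed(scores: dict, threshold: float = 1.0) -> bool:
    """Return True when every evaluator score meets the threshold."""
    return all(value >= threshold for value in scores.values())


scores = {"Builtin.Correctness": 1.0, "Builtin.GoalSuccessRate": 0.5}
for evaluator, value in scores.items():
    print(evaluator, interpret_score(value))
print(all_passed(scores))
```

A check like `all_passed` could gate a CI step, with the threshold chosen per project.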
See Also
- Observability & Evaluation — how evaluation works, 12 built-in evaluators, LLM-as-judge
- Observability & Evaluation — OTEL traces that evaluation reads, and the agentcore vs langfuse provider choice
- Evaluation SDK Reference — programmatic evaluation API and CustomEvaluatorConfig