# Phoenix Evaluations: LLM Quality Metrics
## Objective

Implement automated Phoenix evaluations that score every agent LLM response on the quality metrics below.
## Metrics to Track
- **Hallucination Detection** - Check responses against source documents
- **Reasoning Quality** - Evaluate logical consistency
- **Decision Confidence Calibration** - Compare predicted confidence to actual outcomes
- **Response Relevance** - Measure alignment with user intent (see the type sketch after this list)
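A minimal sketch of how these metrics and their results could be typed on the TypeScript side. Every name below (`BuiltInMetric`, `EvaluationResult`, `LlmCallRecord`, the score and label fields) is an assumption made for illustration, not part of Phoenix's own API.

```ts
// Hypothetical shared types for the evaluator; the names below are assumptions
// made for this sketch, not part of Phoenix's own API surface.

/** The four built-in quality metrics this task tracks. */
export type BuiltInMetric =
  | "hallucination"
  | "reasoning_quality"
  | "confidence_calibration"
  | "response_relevance";

/** One evaluation outcome for a single LLM response. */
export interface EvaluationResult {
  spanId: string;       // trace span of the evaluated LLM call
  metric: string;       // usually a BuiltInMetric, but custom names are allowed
  score: number;        // normalized score in [0, 1]
  label?: string;       // e.g. "factual" vs. "hallucinated"
  explanation?: string; // judge rationale, surfaced in the Phoenix UI
}

/** The data each evaluator receives about the LLM call under test. */
export interface LlmCallRecord {
  spanId: string;
  prompt: string;
  response: string;
  sourceDocuments?: string[]; // retrieved context, needed by the hallucination check
  statedConfidence?: number;  // agent-reported confidence, needed for calibration
}
```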
## Implementation
- Create a `PhoenixEvaluator` class in `src/telemetry/phoenix-evaluator.ts`
- Use the Phoenix Evaluations API to score responses
- Run evaluations asynchronously after each LLM call
- Store evaluation results in Phoenix (see the class sketch after this list)
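A minimal sketch of what `PhoenixEvaluator` could look like under these constraints, assuming the types from the previous sketch and an injected `EvalClient` whose `logEvaluation` method stands in for whichever Phoenix client call the project adopts; none of these names come from the Phoenix SDK itself.

```ts
// Hypothetical module layout; the types come from the sketch in "Metrics to Track".
import type { EvaluationResult, LlmCallRecord } from "./phoenix-eval-types";

/** Placeholder for the real Phoenix client: anything that can persist one result. */
export interface EvalClient {
  logEvaluation(result: EvaluationResult): Promise<void>;
}

/** One evaluation function: scores a single LLM call on a single metric. */
export type EvaluatorFn = (record: LlmCallRecord) => Promise<EvaluationResult>;

export class PhoenixEvaluator {
  private readonly evaluators = new Map<string, EvaluatorFn>();

  constructor(private readonly client: EvalClient) {}

  /** Register a built-in or custom evaluation function under a metric name. */
  register(metric: string, fn: EvaluatorFn): void {
    this.evaluators.set(metric, fn);
  }

  /**
   * Fire-and-forget entry point, called right after each LLM call returns.
   * All registered evaluators run concurrently and their results are logged
   * to Phoenix; failures are reported but never block the agent.
   */
  evaluateAsync(record: LlmCallRecord): void {
    void Promise.allSettled(
      [...this.evaluators.values()].map(async (evaluate) => {
        const result = await evaluate(record);
        await this.client.logEvaluation(result);
      }),
    ).then((outcomes) => {
      for (const outcome of outcomes) {
        if (outcome.status === "rejected") {
          console.error("Phoenix evaluation failed", outcome.reason);
        }
      }
    });
  }
}
```

`Promise.allSettled` keeps one failing evaluator from suppressing the others, and the fire-and-forget entry point keeps evaluation work off the agent's response path.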
## Acceptance Criteria
- All LLM responses automatically evaluated
- Evaluations visible in Phoenix UI
- Evaluation latency <500ms
- Support custom evaluation functions (see the usage example below)
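A hypothetical usage example showing how the custom-evaluator hook from the sketches above could be wired in; `scoreRelevanceWithJudge`, `phoenixClient`, and the import paths are placeholders, not existing project code.

```ts
// Hypothetical wiring; PhoenixEvaluator, EvalClient, and the import paths are
// placeholders based on the sketches above, not existing project modules.
import { PhoenixEvaluator, type EvalClient } from "./phoenix-evaluator";

// Stand-ins for pieces the project would supply: a Phoenix-backed client and
// an LLM-as-judge scoring call returning a value in [0, 1].
declare const phoenixClient: EvalClient;
declare function scoreRelevanceWithJudge(prompt: string, response: string): Promise<number>;

const evaluator = new PhoenixEvaluator(phoenixClient);

// A custom evaluation function registered alongside the built-in ones.
evaluator.register("response_relevance", async (record) => ({
  spanId: record.spanId,
  metric: "response_relevance",
  score: await scoreRelevanceWithJudge(record.prompt, record.response),
  explanation: "LLM-as-judge relevance score",
}));

// Called from the LLM wrapper immediately after each completion returns.
export function afterLlmCall(
  spanId: string,
  prompt: string,
  response: string,
  sourceDocuments: string[],
): void {
  evaluator.evaluateAsync({ spanId, prompt, response, sourceDocuments });
}
```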