LLM Evaluation — How to Measure What Actually Matters
Visual guide to evaluating LLM applications. Learn the key metrics for faithfulness, relevance, and safety, and build an evaluation pipeline that catches regressions before production.
“It seems to work well” isn’t an evaluation strategy. Every team building LLM applications eventually hits the moment where a model update, prompt change, or retrieval tweak breaks something subtle — and they don’t notice until users complain. Systematic evaluation is how you catch regressions before they reach production.
The challenge: LLM outputs aren’t deterministic, and “correct” is subjective. You can’t unit test a chat response the way you test a function return value. But you can build an evaluation framework that catches the most important failure modes.
What to Measure
Different metrics catch different problems. Faithfulness catches hallucinations. Relevance catches off-topic answers. Context precision catches bad retrieval. Harmfulness catches safety violations. Latency and cost catch problems with production viability.
LLM Evaluation Metrics
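In practice, these metrics only matter if they gate deployments. Below is a minimal sketch of such a gate; the metric names, thresholds, and score ranges are illustrative assumptions, not a standard, so adjust them to your own baselines.

```python
# Hypothetical per-metric thresholds for a deployment gate.
# Metric names and cutoffs are illustrative; tune them to your own baselines.
THRESHOLDS = {
    "faithfulness": 0.90,       # share of answer claims supported by context
    "answer_relevance": 0.80,   # does the answer address the question?
    "context_precision": 0.70,  # did retrieval surface useful chunks?
    "harmfulness": 0.00,        # fraction of flagged outputs; zero tolerance
}

def gate(scores: dict[str, float]) -> list[str]:
    """Return the metrics that fail their threshold (empty list = safe to ship)."""
    failures = []
    for metric, cutoff in THRESHOLDS.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing score")
        elif metric == "harmfulness":
            if value > cutoff:
                failures.append(f"{metric}: {value:.2f} > {cutoff:.2f}")
        elif value < cutoff:
            failures.append(f"{metric}: {value:.2f} < {cutoff:.2f}")
    return failures

print(gate({"faithfulness": 0.93, "answer_relevance": 0.77,
            "context_precision": 0.81, "harmfulness": 0.0}))
# ['answer_relevance: 0.77 < 0.80']
```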
The most impactful metric for RAG applications is faithfulness — does the answer stick to the provided context, or does the model make things up? This is measurable: extract every claim in the answer, check each claim against the source documents, and calculate the percentage that are supported. Tools like RAGAS and DeepEval automate this with LLM-as-judge patterns.
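Here is a minimal sketch of that loop, assuming an OpenAI-compatible client and "gpt-4o" as the judge model; both choices are placeholders, and the prompts are simplified. RAGAS and DeepEval implement a more robust version of the same idea, with better claim parsing and error handling.

```python
import json
from openai import OpenAI  # assumes the openai package; any judge model works

client = OpenAI()
JUDGE_MODEL = "gpt-4o"  # illustrative judge model

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def faithfulness(answer: str, context: str) -> float:
    # 1. Extract the atomic claims made in the answer.
    claims = json.loads(ask(
        "List every factual claim in the following answer as a JSON array "
        f"of strings, with no other text. Answer:\n{answer}"
    ))
    if not claims:
        return 1.0
    # 2. Check each claim against the retrieved context.
    supported = 0
    for claim in claims:
        verdict = ask(
            "Does the context below support this claim? Reply yes or no.\n"
            f"Claim: {claim}\nContext:\n{context}"
        )
        supported += verdict.strip().lower().startswith("yes")
    # 3. Faithfulness = supported claims / total claims.
    return supported / len(claims)
```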
Build your eval dataset before you build your application. Start with 50-100 question-answer pairs that represent real user queries. Include edge cases: ambiguous questions, questions with no good answer in the corpus, multi-hop questions that require combining information from multiple documents. Run every change through this eval set before deploying.
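One way to structure that set is a JSONL file where each case is tagged by type, so regressions on edge cases don't get averaged away by easy questions. The schema and the sample records below are illustrative, not a required format.

```python
import json
from pathlib import Path

# Illustrative schema: one record per eval case, tagged by case type.
cases = [
    {"id": "q001", "type": "standard",
     "question": "What is the refund window for annual plans?",
     "reference": "30 days from purchase."},
    {"id": "q002", "type": "no_answer",
     "question": "What is the CEO's home address?",
     "reference": "Not in the corpus; the model should decline to answer."},
    {"id": "q003", "type": "multi_hop",
     "question": "Which plan includes SSO and what does it cost per seat?",
     "reference": "The Enterprise plan, at $25 per seat per month."},
]

Path("eval_set.jsonl").write_text(
    "\n".join(json.dumps(c) for c in cases), encoding="utf-8"
)
```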
The LLM-as-judge pattern is the closest thing to scalable evaluation. You use a strong model (like GPT-4 or Claude) to evaluate the outputs of your application model. It’s not perfect — judges have biases and blind spots — but it’s dramatically better than manual review at scale, and it runs in CI alongside your tests.
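Wired into CI, the pattern can look like the pytest sketch below: score every eval case with a rubric prompt and fail the build when a case falls under a cutoff. Here `answer_question` stands in for your application's entry point, and the judge model, rubric, and 0.85 cutoff are all assumptions to tune.

```python
import json
from pathlib import Path

import pytest
from openai import OpenAI

from my_app import answer_question  # hypothetical: your application's entry point

client = OpenAI()

def judge_relevance(question: str, answer: str) -> float:
    """Ask a strong judge model to grade relevance 1-5, scaled to 0-1."""
    prompt = (
        "Rate from 1 (off-topic) to 5 (fully answers the question) how well "
        "the answer addresses the question. Reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return (int(resp.choices[0].message.content.strip()) - 1) / 4

# Load the eval set built earlier; one test per case keeps failures legible.
CASES = [json.loads(line)
         for line in Path("eval_set.jsonl").read_text().splitlines()]

@pytest.mark.parametrize("case", CASES, ids=[c["id"] for c in CASES])
def test_answer_relevance(case):
    answer = answer_question(case["question"])
    score = judge_relevance(case["question"], answer)
    assert score >= 0.85, f"{case['id']} scored {score:.2f}"
```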