Notes on LLM Evaluation

When it comes to building LLM-based applications, one could argue that most of the work resembles traditional software development more than ML or Data Science, considering we often use off-the-shelf foundation models instead of training them ourselves. Even so, I still believe that one of the most critical parts of building an LLM-based application centers on data, specifically the evaluation pipeline. You can’t improve what you can’t measure, and you can’t measure what you don’t understand. To build an evaluation pipeline, you still need to invest a substantial amount of effort in examining, understanding, and analyzing your data.

In this blog post, I want to document some notes on the process of building an evaluation pipeline for an LLM-based application I’m currently developing. It’s also an exercise in applying theoretical concepts I’ve read about online to a concrete example, mainly from Hamel Husain’s blog.

Table of Contents

  1. The Application – Explaining our scenario and use case
  2. The Eval Pipeline – Overview of the evaluation pipeline and its main components. For each step, we will divide it into:
    1. Overview – A brief, conceptual explanation of the step.
    2. In Practice – A concrete example of applying the concepts based on our use case.
  3. What Lies Ahead – This is just the beginning. How will our evaluation pipeline evolve?
  4. Conclusion – Recapping the key steps and final thoughts.

1. The Application

To ground our discussion, let’s use a concrete example: an AI-powered IT Helpdesk Assistant*.

The AI serves as the first line of support. An employee submits a ticket describing a technical issue—their laptop is slow, they can’t connect to the VPN, or an application is crashing. The AI’s task is to analyze the ticket, provide initial troubleshooting steps, and either resolve the issue or escalate it to the appropriate human specialist.

Evaluating the performance of this application is a subjective task. The AI’s output is free-form text, meaning there is no single “correct” answer. A helpful response can be phrased in many ways, so we cannot simply check if the output is “Option A” or “Option B.” It is also not a regression task, where we can measure numerical error using metrics like Mean Squared Error (MSE).

A “good” response is defined by a combination of factors: Did the AI correctly diagnose the problem? Did it suggest relevant and safe troubleshooting steps? Did it know when to escalate a critical issue to a human expert? A response can be factually correct but unhelpful, or it can fail by not escalating a serious problem.

* For context: I am using the IT Helpdesk scenario as a substitute for my actual use case to discuss the methodology openly. The analogy isn’t perfect, so some examples might feel a bit stretched to make a specific point.

2. The Eval Pipeline

Now that we understand our use case, let’s proceed with an overview of the proposed evaluation pipeline. In the following sections, we will detail each step and contextualize it by providing examples relevant to our use case.

Overview of the proposed evaluation pipeline, showing the flow from data collection to a repeatable, iterative improvement cycle. Image by author.

The Data

It all starts with data – ideally, real data from your production environment. If you don’t have it yet, you can try using your application yourself or ask friends to use it to get a sense of how it can fail. In some cases, it’s possible to generate synthetic data to get things started, or to complement existing data, if your volume is low.

When using synthetic data, ensure it is of high quality and closely matches the expectations of real-world data.

While LLMs are relatively recent, humans have been studying, training, and certifying themselves for quite some time. If possible, try to leverage existing material designed for humans to help you with generating data for your application.

In Practice

My initial dataset was small, containing a handful of real user tickets from production and some demonstration examples created by a domain expert to cover common scenarios.

Since I didn’t have many examples, I used existing certification exams for IT support professionals, which consisted of multiple-choice questions with an answer guide and scoring keys. This way, I not only had the correct answer but also a detailed explanation of why each choice was wrong or right.

I used an LLM to transform these exam questions into a more useful format. Each question became a simulated user ticket, and the answer keys and explanations were repurposed to generate examples of both effective and ineffective AI responses, complete with a clear rationale for each.
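As an illustration, here is a minimal sketch of that transformation step. It assumes a hypothetical call_llm helper that wraps whatever model provider you use, and the prompt wording is only indicative of the idea, not the exact prompt I used.

import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping your model provider's API."""
    raise NotImplementedError

TRANSFORM_PROMPT = """You are helping build an evaluation dataset for an IT helpdesk assistant.
Rewrite the following certification exam question as a realistic user ticket.
Use the answer key and explanations to produce one effective and one ineffective
assistant response, each with a short rationale.

Question: {question}
Answer key: {answer_key}
Explanations: {explanations}

Return JSON with keys: ticket, good_response, bad_response, rationale."""

def exam_item_to_example(item: dict) -> dict:
    # Fill the template with one exam item and ask the LLM to reshape it.
    prompt = TRANSFORM_PROMPT.format(
        question=item["question"],
        answer_key=item["answer_key"],
        explanations=item["explanations"],
    )
    # The quality gate is still you: review every generated example before keeping it.
    return json.loads(call_llm(prompt))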

When using external sources, it’s important to be mindful of data contamination. If the certification material is publicly available, it may have already been included in the training data for the foundation model. This could cause you to assess the model’s memory instead of its ability to reason on new, unseen problems, which may yield overly optimistic or misleading results. If the model’s performance on this data seems surprisingly perfect, or if its outputs closely match the source text, chances are contamination is involved.
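One rough way to spot potential contamination is to check how much of the model’s output overlaps verbatim with the source material. The sketch below computes a simple n-gram overlap ratio; the n-gram size and any threshold you apply are arbitrary assumptions, not calibrated values.

def ngrams(text: str, n: int = 5) -> set:
    # Split on whitespace and collect all n-word windows.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(model_output: str, source_text: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also appear in the source text."""
    out = ngrams(model_output, n)
    if not out:
        return 0.0
    return len(out & ngrams(source_text, n)) / len(out)

# If the ratio is suspiciously high (say, above 0.5), inspect the example manually
# before trusting evaluation results based on it.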

Data Annotation

Now that you have gathered some data, the next crucial step is analyzing it. This process should be active, so make sure to note your insights as you go. There are numerous ways to categorize or divide the different tasks involved in data annotation. I typically consider this in two main parts:

  • Error Analysis: Reviewing existing (often imperfect) outputs to identify failures. For example, you could add free-text notes explaining the failures or tag inadequate responses with different error categories. You can find a much more detailed explanation of error analysis on Hamel Husain’s blog.
  • Success Definition: Creating ideal artifacts to define what success looks like. For example, for each output, you could write ground-truth reference answers or develop a rubric with guidelines that specify what an ideal answer should include.

The main goal is to gain a clearer understanding of your data and application. Error analysis helps identify the primary failure modes your application faces, enabling you to address the underlying issues. Meanwhile, defining success enables you to establish the appropriate criteria and metrics for accurately assessing your model’s performance.

Don’t worry if you’re unsure about recording information precisely. It’s better to start with open-ended notes and unstructured annotations rather than stressing over the perfect format. Over time, you’ll notice the key aspects to assess and common failure patterns naturally emerge.
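To make this concrete, here is one possible loose structure for an annotation record. The field names are my own assumption and can stay informal at this stage; the point is simply to keep notes, error tags, and reference answers attached to the example they describe.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Annotation:
    """One annotated example: the input, the observed output, and my notes."""
    ticket: str                                            # the user's request
    ai_response: str                                       # the output being reviewed
    notes: str = ""                                        # free-text observations
    error_tags: list[str] = field(default_factory=list)    # e.g. ["over-referral"]
    reference_answer: Optional[str] = None                 # ideal response, if written
    rubric: Optional[dict] = None                          # per-example success criteria

Aggregating error_tags across records later (for example with collections.Counter) makes it easy to see which failure modes dominate.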

In Practice

I decided to approach this by first creating a custom tool designed explicitly for data annotation, which lets me scan through production data, add notes, and generate reference answers, as previously discussed. This was a relatively fast process, because such a tool can operate somewhat independently of the main application. Considering it’s a tool for personal use and of limited scope, I was able to “vibe-code” it with less concern than usual. Of course, I’d still review the code, but I wasn’t too worried if things broke once in a while.

To me, the most important outcome of this process is that I gradually learned what makes a bad response bad and what makes a good response good. With that, you can define your evaluation metrics to effectively measure what matters to your use case. For example, I realized my solution exhibited a behavior of “over-referral,” which means escalating simple requests to human specialists. Other issues, to a lesser extent, included inaccurate troubleshooting steps and incorrect root-cause diagnosis.

Writing Rubrics

In the success definition step, I found that writing rubrics was very helpful. My guideline for creating the rubrics was to ask myself: what makes an ideal response a good response? This helps reduce the subjectivity of the evaluation process: no matter how the response is phrased, it should tick all the boxes in the rubric.

Considering this is the initial stage of the evaluation process, you won’t know all the criteria beforehand, so I defined the requirements on a per-example basis rather than trying to establish a single guideline for all examples. I also didn’t worry too much about setting a rigorous schema: each criterion in my rubric needs a key and a value, and the value can be a boolean, a string, or a list of strings. The rubrics can be flexible because they are intended to be used by either a human or an LLM judge, and both can handle this subjectivity. Also, as mentioned before, as you continue with this process, the ideal rubric guidelines will naturally stabilize.

Here’s an example:

{
  "fields": {
    "clarifying_questions": {
      "type": "array",
      "value": [
        "Asks for the specific error message",
        "Asks if the user recently changed their password"
      ]
    },
    "root_cause_diagnosis": {
      "type": "string",
      "value": "Expired user credentials or MFA token sync issue"
    },
    "escalation_required": {
      "type": "boolean",
      "value": false
    },
    "recommended_solution_steps": {
      "type": "array",
      "value": [
        "Guide user to reset their company password",
        "Instruct user to re-sync their MFA device"
      ]
    }
  }
}

Although each example’s rubric may differ from the others, we can group them into well-defined evaluation criteria for the next step.
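A quick way to see which criteria recur across per-example rubrics is to count the field keys, as in the small sketch below (assuming the rubrics are stored in the JSON shape shown above).

from collections import Counter

def recurring_criteria(rubrics: list[dict]) -> Counter:
    """Count how often each rubric field appears across annotated examples."""
    counts = Counter()
    for rubric in rubrics:
        counts.update(rubric.get("fields", {}).keys())
    return counts

# Keys that show up in most rubrics are good candidates for shared
# evaluation dimensions in the next step.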

Running the Evaluations

With annotated data in hand, you can build a repeatable evaluation process. The first step is to curate a subset of your annotated examples to create a versioned evaluation dataset. This dataset should contain representative examples that cover your application’s common use cases and all the failure modes you have identified. Versioning is critical; when comparing different experiments, you must ensure they are benchmarked against the same data.
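A lightweight way to version the evaluation set is to write it as a JSONL file tagged with a content hash, as in this sketch; the file layout and naming are assumptions, not a prescribed format.

import hashlib
import json
from pathlib import Path

def save_eval_set(examples: list[dict], out_dir: str = "eval_sets") -> Path:
    """Write the examples to a JSONL file named after a hash of its contents."""
    payload = "\n".join(json.dumps(e, sort_keys=True) for e in examples)
    version = hashlib.sha256(payload.encode()).hexdigest()[:8]
    path = Path(out_dir) / f"eval_set_{version}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload + "\n")
    return path  # record this path/version alongside every experiment run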

For subjective tasks like ours, where outputs are free-form text, an “LLM-as-a-judge” can automate the grading process. The evaluation pipeline feeds the LLM judge an input from your dataset, the AI application’s corresponding output, and the annotations you created (such as the reference answer and rubric). The judge’s role is to score the output against the provided criteria, turning a subjective assessment into quantifiable metrics.

These metrics allow you to systematically measure the impact of any changes, whether it’s a new prompt, a different model, or a change in your RAG strategy. To ensure that these metrics are meaningful, it is essential to periodically verify that the LLM judge’s evaluations align with those of a human domain expert within an accepted range.

In Practice

After completing the data annotation process, we should gain a clearer understanding of what makes a response good or bad and, with that knowledge, establish a core set of evaluation dimensions. In my case, I identified the following areas:

  • Escalation Behavior: Measures if the AI escalates tickets appropriately. A response is rated as ADEQUATE, OVER-ESCALATION (escalating simple issues), or UNDER-ESCALATION (failing to escalate critical problems).
  • Root Cause Accuracy: Assesses whether the AI correctly identifies the user’s problem. This is a binary CORRECT or INCORRECT evaluation.
  • Solution Quality: Evaluates the relevance and safety of the proposed troubleshooting steps. It also considers whether the AI asks for necessary clarifying information before offering a solution. It’s rated ADEQUATE or INADEQUATE.

With these dimensions defined, I could run evaluations. For each item in my versioned evaluation set, the system generates a response. This response, along with the original ticket and its annotated rubric, is then passed to an LLM judge. The judge receives a prompt that instructs it on how to use the rubric to score the response across the three dimensions.

This is the prompt I used for the LLM judge:

You are an expert IT Support AI evaluator. Your task is to judge the quality of an AI-generated response to an IT helpdesk ticket. To do so, you will be given the ticket details, a reference answer from a senior IT specialist, and a rubric with evaluation criteria.

**TICKET DETAILS:**
#{ticket_details}

**REFERENCE ANSWER (from IT Specialist):**
#{reference_answer}

**NEW AI RESPONSE (to be evaluated):**
#{new_ai_response}

**RUBRIC CRITERIA:**
#{rubric_criteria}

**EVALUATION INSTRUCTIONS:**

[Evaluation instructions here...]

**Evaluation Dimensions**
Evaluate the AI response on the following dimensions:
- Overall Judgment: GOOD/BAD
- Escalation Behavior: If the rubric's `escalation_required` is `false` but the AI escalates, label it as `OVER-ESCALATION`. If `escalation_required` is `true` but the AI does not escalate, label it `UNDER-ESCALATION`. Otherwise, label it `ADEQUATE`.
- Root Cause Accuracy: Compare the AI's diagnosis with the `root_cause_diagnosis` field in the rubric. Label it `CORRECT` or `INCORRECT`.
- Solution Quality: If the AI's response fails to include necessary `recommended_solution_steps` or `clarifying_questions` from the rubric, or suggests something unsafe, label it as `INADEQUATE`. Otherwise, label it as `ADEQUATE`.

If the rubric does not provide enough information to evaluate a dimension, use the reference answer and your expert judgment.

**Please provide:**
1. An overall judgment (GOOD/BAD)
2. A detailed explanation of your reasoning
3. The escalation behavior (`OVER-ESCALATION`, `ADEQUATE`, `UNDER-ESCALATION`)
4. The root cause accuracy (`CORRECT`, `INCORRECT`)
5. The solution quality (`ADEQUATE`, `INADEQUATE`)

**Response Format**
Provide your response in the following JSON format:

{
  "JUDGMENT": "GOOD/BAD",
  "REASONING": "Detailed explanation",
  "ESCALATION_BEHAVIOR": "OVER-ESCALATION/ADEQUATE/UNDER-ESCALATION",
  "ROOT_CAUSE_ACCURACY": "CORRECT/INCORRECT",
  "SOLUTION_QUALITY": "ADEQUATE/INADEQUATE"
}
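Putting the pieces together, a minimal evaluation loop might look like the sketch below. It assumes the same hypothetical call_llm helper as before, a generate_response function for the application under test, and that the judge actually returns the JSON format requested in the prompt.

import json
from collections import Counter

JUDGE_PROMPT = "..."  # the judge prompt shown above, with its #{...} placeholders

def judge_example(example: dict, call_llm, generate_response) -> dict:
    # Generate a fresh response from the system under test, then fill the judge prompt.
    ai_response = generate_response(example["ticket"])
    prompt = (JUDGE_PROMPT
              .replace("#{ticket_details}", example["ticket"])
              .replace("#{reference_answer}", example["reference_answer"])
              .replace("#{new_ai_response}", ai_response)
              .replace("#{rubric_criteria}", json.dumps(example["rubric"])))
    return json.loads(call_llm(prompt))

def run_evaluation(eval_set: list[dict], call_llm, generate_response) -> dict:
    verdicts = [judge_example(e, call_llm, generate_response) for e in eval_set]
    # Aggregate each dimension into simple counts for comparison across experiments.
    return {
        "judgment": Counter(v["JUDGMENT"] for v in verdicts),
        "escalation": Counter(v["ESCALATION_BEHAVIOR"] for v in verdicts),
        "root_cause": Counter(v["ROOT_CAUSE_ACCURACY"] for v in verdicts),
        "solution": Counter(v["SOLUTION_QUALITY"] for v in verdicts),
    }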

3. What Lies Ahead

Our application is starting out simple, and so is our evaluation pipeline. As the system expands, we’ll need to adjust our methods for measuring its performance. This means we’ll have to consider several aspects down the line. Some key ones include:

How many examples are enough?

I started with about 50 examples, but I haven’t analyzed how close this is to an ideal number. Ideally, we want enough examples to produce reliable results while keeping the cost of running them affordable. In Chip Huyen’s AI Engineering book, there’s a mention of an interesting approach that involves creating bootstraps of your evaluation set. For instance, from my original 50-sample set, I could create multiple bootstraps by drawing 50 samples with replacement, then evaluate and compare performance across these bootstraps. If you observe very different results, it probably means you need more examples in your evaluation set.
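Here is a rough sketch of that bootstrap idea, applied to a list of per-example scores (say, 1 for a GOOD judgment and 0 for BAD); the number of bootstrap rounds is an arbitrary choice.

import random
import statistics

def bootstrap_scores(scores: list[int], rounds: int = 1000) -> tuple[float, float]:
    """Resample the eval set with replacement and look at the spread of the mean."""
    means = []
    for _ in range(rounds):
        sample = random.choices(scores, k=len(scores))  # draw with replacement
        means.append(sum(sample) / len(sample))
    return statistics.mean(means), statistics.stdev(means)

# A wide standard deviation suggests the evaluation set is too small
# for the differences you are trying to detect.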

When it comes to error analysis, we can also apply a helpful rule of thumb from Husain’s blog:

Keep iterating on more traces until you reach theoretical saturation, meaning new traces do not seem to reveal new failure modes or information to you. As a rule of thumb, you should aim to review at least 100 traces.

Aligning LLM Judges with Human Experts

We want our LLM judges to remain as consistent as possible, but this is challenging: the judgment prompts will be revised, the underlying model can change or be updated by the provider, and so on. Additionally, your evaluation criteria will improve over time as you grade outputs, so it’s crucial to ensure your LLM judges stay aligned with your judgment or that of your domain experts. You can schedule regular sessions with a domain expert to review a sample of LLM judgments, calculate a simple agreement percentage between automated and human evaluations, and adjust your pipeline when necessary.
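The agreement check itself can be very simple, as in this sketch comparing judge labels with the expert’s labels for the same sample of examples.

def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the LLM judge and the human expert agree."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Track this per dimension (escalation, root cause, solution quality) over time;
# a drop after a prompt or model change is a signal to revisit the judge.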

Overfitting

Overfitting is still a thing in the LLM world. Even if we’re not training a model directly, we’re still training our system by tweaking instruction prompts, refining retrieval systems, setting parameters, and enhancing context engineering. If our changes are based on evaluation results, there’s a risk of over-optimizing for our current set, so we still need to follow standard advice to prevent overfitting, such as using held-out sets.
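One standard precaution is to keep a held-out slice of the evaluation set that is never used while iterating on prompts or retrieval. A simple split like the sketch below is enough to start; the 80/20 ratio and fixed seed are assumptions.

import random

def split_eval_set(examples: list[dict], holdout_fraction: float = 0.2, seed: int = 42):
    """Shuffle once with a fixed seed so the held-out slice stays stable across runs."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]  # (development set, held-out set)

# Iterate against the development set; consult the held-out set only occasionally
# to check that improvements generalize.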

Increased Complexity

For now, I’m keeping this application simple, so we have fewer components to evaluate. As our solution becomes more complex, our evaluation pipeline will also grow more complex. If our application involves multi-turn conversations with memory, or different tool usage or context retrieval systems, we should break down the system into multiple tasks and evaluate each component separately. So far, I’ve been using simple input/output pairs for evaluation, so retrieving data directly from my database is sufficient. However, as our system evolves, we’ll likely need to track the entire chain of events for a single request. This involves adopting solutions for logging LLM traces, such as platforms like Arize, HoneyHive, or LangFuse.
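Even before adopting one of those platforms, a minimal home-grown trace record can capture the chain of events for a single request. The structure below is only an illustration of the idea, not any platform’s schema.

import time
import uuid
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    name: str            # e.g. "retrieval", "llm_call", "tool_use"
    started_at: float    # wall-clock start time
    duration_s: float
    metadata: dict = field(default_factory=dict)   # prompt, retrieved docs, etc.

@dataclass
class RequestTrace:
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, name: str, started_at: float, **metadata):
        # Append one step of the chain, measuring its duration from started_at.
        self.events.append(
            TraceEvent(name, started_at, time.time() - started_at, metadata)
        )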

Continuous Iteration and Data Drift

Production environments are constantly changing. User expectations evolve, usage patterns shift, and new failure modes arise. An evaluation set created today may no longer be representative in six months. This shift requires ongoing data annotation to ensure the evaluation set always reflects the current state of how the application is used and where it falls short.

4. Conclusion

In this post, we covered some key concepts for building a foundation to evaluate our data, along with practical details for our use case. We started with a small, mixed-source dataset and gradually developed a repeatable measurement system. The main steps involved actively annotating data, analyzing errors, and defining success using rubrics, which helped us turn a subjective problem into measurable dimensions. After annotating our data and gaining a better understanding of it, we used an LLM as a judge to automate scoring and create a feedback loop for continuous improvement.

Although the pipeline outlined here is a starting point, the next steps involve addressing challenges such as data drift, judge alignment, and increasing system complexity. By putting in the effort to understand and organize your evaluation data, you’ll gain the clarity needed to iterate effectively and develop a more reliable application.

“Notes on LLM Evaluation” by Felipe Adachi was originally published in the author’s personal newsletter.
