Make sure also to check out the previous parts:
👉Part 1: Precision@k, Recall@k, and F1@k
👉Part 2: Mean Reciprocal Rank (MRR) and Average Precision (AP)
In the previous parts of my post series on retrieval evaluation measures for RAG pipelines, we took a detailed look at binary retrieval evaluation metrics. More specifically, in Part 1, we went over binary, order-unaware retrieval evaluation metrics, like HitRate@K, Recall@K, Precision@K, and F1@K. Binary, order-unaware metrics are the most basic type of measure we can use to score the performance of our retrieval mechanism; they simply classify a result as either relevant or irrelevant, and evaluate whether relevant results make it into the retrieved set.
Then, in Part 2, we reviewed binary, order-aware evaluation metrics like Mean Reciprocal Rank (MRR) and Average Precision (AP). Binary, order-aware measures also categorise results as either relevant or irrelevant and check whether they appear in the retrieved set, but on top of this, they quantify how well those results are ranked. In other words, they take into account the rank at which each result is retrieved, not just whether it is retrieved in the first place.
In this final part of the retrieval evaluation metrics post series, I’m going to elaborate on the other large category of metrics beyond binary ones: graded metrics. Unlike binary metrics, where a result is either relevant or irrelevant, for graded metrics relevance is a spectrum. In this way, a retrieved chunk can be more or less relevant to the user’s query.
Two commonly used graded relevance metrics that we are going to take a look at in today’s post are Discounted Cumulative Gain (DCG@k) and Normalized Discounted Cumulative Gain (NDCG@k).
I write 🍨DataCream, where I’m learning and experimenting with AI and data. Subscribe here to learn and explore with me.
Some graded measures
For graded retrieval measures, it is first of all important to understand the concept of graded relevance. That is, for graded measures, a retrieved item can be more or less relevant, as quantified by a relevance score rel_i assigned to the result at rank i.

🎯 Discounted Cumulative Gain (DCG@k)
Discounted Cumulative Gain (DCG@k) is a graded, order-aware retrieval evaluation metric, allowing us to quantify how useful a retrieved result is, taking into account the rank at which it is retrieved. We can calculate it as follows:

DCG@k = Σ_{i=1}^{k} rel_i / log₂(i + 1)
Here, the numerator rel_i is the graded relevance of the result retrieved at rank i; essentially, it is a quantification of how relevant the retrieved text chunk is. The denominator, log₂(i + 1), grows with the rank i of the result, which allows us to penalize items that appear lower down in the retrieved set, emphasizing the idea that results appearing at the top are more important. Thus, the more relevant a result is, the more it adds to the score, but the further down the list it appears, the more its contribution is discounted.
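To get a feel for the discount term, here is a quick sketch (my own toy illustration, not part of the post’s example) that prints the weight 1 / log₂(i + 1) applied to the relevance of the result at each rank:

import math

# Discount factor applied to a result's relevance at ranks 1 through 5
for i in range(1, 6):
    print(f"rank {i}: discount = {1 / math.log2(i + 1):.3f}")

A result at rank 1 keeps its full relevance (discount 1.0), while a result at rank 5 contributes only about 0.39 of its relevance score.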
Let’s further explore this with a simple example:

In any case, a major issue with DCG@k is that, as you can see, it is essentially a sum over the retrieved items. Thus, a retrieved set with more items (a larger k) and/or more relevant items is inevitably going to result in a larger DCG@k. For instance, in our example, if we consider only k = 4, we end up with DCG@4 = 28.19; DCG@6 would be even higher, and so on. As k increases, DCG@k typically increases, since we include more results, unless the additional items have zero relevance. Nonetheless, a higher DCG@k doesn’t necessarily mean that the retrieval performance is superior. On the contrary, this causes a problem, because it doesn’t allow us to compare retrieved sets with different k values based on DCG@k alone.
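To make this growth concrete with a toy relevance list of my own (on a 0–3 scale, separate from the example above), say [3, 2, 3, 0, 1]: DCG@2 ≈ 3 + 1.26 = 4.26, DCG@3 ≈ 5.76, DCG@4 stays at ≈ 5.76 because the fourth item has zero relevance, and DCG@5 ≈ 6.15. The score can only grow or stay flat as k increases; it never decreases.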
This issue is effectively solved by the next graded measure we are going to discuss – NDCG@k. But before that, we need to introduce IDCG@k, which is required for calculating NDCG@k.
🎯 Ideal Discounted Cumulative Gain (IDCG@k)
Ideal Discounted Cumulative Gain (IDCG@k), as its name suggests, is the DCG we would get in the ideal situation where our retrieved set is perfectly ranked based on the retrieved results’ relevance. Let’s see what the IDCG for our example would be:

Naturally, for a fixed k, IDCG@k is always going to be greater than or equal to the corresponding DCG@k, since it represents the score for a perfect retrieval and ranking of results for that k.
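As a small hypothetical illustration: if the retrieved relevances are [1, 3, 2] in that order, then DCG@3 ≈ 1 + 1.89 + 1 = 3.89, whereas the ideal ordering [3, 2, 1] gives IDCG@3 ≈ 3 + 1.26 + 0.5 = 4.76 – the same items, just perfectly ranked, always score at least as high.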
Finally, we can now calculate Normalized Discounted Cumulative Gain (NDCG@k), using DCG@k and IDCG@k.
🎯 Normalized Discounted Cumulative Gain (NDCG@k)
Normalized Discounted Cumulative Gain (NDCG@k) is essentially a normalised expression of DCG@k, solving our initial problem and rendering it comparable for different retrieved set sizes k. We can calculate NDCG@k with this straightforward formula:

NDCG@k = DCG@k / IDCG@k
Basically, NDCG@k allows us to quantify how close our current retrieval and ranking is to the ideal one, for a given k. This conveniently provides us with a number that is comparable across different values of k. In our example, NDCG@5 would be:

In general, NDCG@k ranges from 0 to 1, with 1 representing a perfect retrieval and ranking of the results, and 0 indicating a complete mess.
So, how do we actually calculate DCG and NDCG in Python?
If you’ve read my other RAG tutorials, you know this is where the War and Peace example would usually come in. Nonetheless, this code example is getting too massive to include in every post, so instead I’m going to show you how to calculate DCG and NDCG in Python, doing my best to keep this post at a reasonable length.
To calculate these retrieval metrics, we first need to define a ground truth set, exactly as we did in Part 1 when calculating Precision@K and Recall@K. The difference here is that, instead of characterising each retrieved chunk as relevant or not using binary relevances (0 or 1), we now assign it a graded relevance score; for example, from completely irrelevant (0) up to highly relevant (3). Thus, our ground truth set would include the text chunks with the highest graded relevance scores for each query.
For instance, for a query like “Who is Anna Pávlovna?”, a retrieved chunk that perfectly matches the answer might receive a score of 3, one that partially mentions the needed information could get a 2, and a completely unrelated chunk would get a relevance score equal to 0.
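As a rough sketch of what this can look like in code – using hypothetical chunk IDs and scores of my own, not actual chunks from the book – we can keep the graded judgements for a query in a dictionary and then map the retrieved chunks, in their retrieved order, to a list of relevance scores:

# Hypothetical graded ground truth for one query (chunk_id -> relevance score)
graded_ground_truth = {"chunk_12": 3, "chunk_07": 2, "chunk_33": 1}

# Chunks actually returned by the retriever, in ranked order (hypothetical IDs)
retrieved_chunks = ["chunk_12", "chunk_07", "chunk_99", "chunk_33", "chunk_41"]

# Chunks not in the ground truth are treated as irrelevant (score 0)
relevance = [graded_ground_truth.get(chunk_id, 0) for chunk_id in retrieved_chunks]
print(relevance)  # [3, 2, 0, 1, 0]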
Using these graded relevance lists for a retrieved result set, we can then calculate DCG@k, IDCG@k, and NDCG@k. We’ll use Python’s math library to handle the logarithmic terms:
import math
First of all, we can define a function for calculating DCG@k as follows:
# DCG@k
def dcg_at_k(relevance, k):
    k = min(k, len(relevance))
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))
We can also calculate IDCG@k applying a similar logic. Essentially, IDCG@k is DCG@k for a perfect retrieval and ranking; thus, we can easily calculate it by calculating DCG@k after sorting the results by descending relevance.
# IDCG@k
def idcg_at_k(relevance, k):
    ideal_relevance = sorted(relevance, reverse=True)
    return dcg_at_k(ideal_relevance, k)
Finally, after we have calculated DCG@k and IDCG@k, we can also easily calculate NDCG@k as their function. More specifically:
# NDCG@k
def ndcg_at_k(relevance, k):
    dcg = dcg_at_k(relevance, k)
    idcg = idcg_at_k(relevance, k)
    return dcg / idcg if idcg > 0 else 0.0
As explained, each of these functions takes as input a list of graded relevance scores for the retrieved chunks. For instance, let’s suppose that for a specific query, ground truth set, and retrieved result set, we end up with the following list:
relevance = [3, 2, 3, 0, 1]
Then, we can calculate the graded retrieval metrics using our functions:
print(f"DCG@5: {dcg_at_k(relevance, 5):.4f}")
print(f"IDCG@5: {idcg_at_k(relevance, 5):.4f}")
print(f"NDCG@5: {ndcg_at_k(relevance, 5):.4f}")
And that was that! This is how we get our graded retrieval performance measures for our RAG pipeline in Python.
Finally, similarly to all other retrieval performance metrics, we can also average the scores of a metric across different queries to get a more representative overall score.
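As a minimal sketch of this averaging step – assuming hypothetical relevance lists for three different queries – it can look like this:

# Hypothetical graded relevance lists, one per query
relevance_per_query = [
    [3, 2, 3, 0, 1],
    [0, 1, 0, 2, 3],
    [2, 2, 1, 0, 0],
]

# Mean NDCG@5 across all queries
mean_ndcg = sum(ndcg_at_k(rel, 5) for rel in relevance_per_query) / len(relevance_per_query)
print(f"Mean NDCG@5: {mean_ndcg:.4f}")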
On my mind
Today’s post about graded relevance measures concludes my post series on the most commonly used metrics for evaluating the retrieval performance of RAG pipelines. Throughout this series, we explored binary measures, both order-unaware and order-aware, as well as graded measures, gaining a holistic view of how retrieval evaluation is typically approached. Of course, there are plenty of other things we can look at in order to evaluate the retrieval mechanism of a RAG pipeline, such as latency per query or context tokens sent. Nonetheless, the measures I went over in these posts cover the fundamentals of evaluating retrieval performance.
This allows us to quantify, evaluate, and ultimately improve the performance of the retrieval mechanism, ultimately paving the way for building an effective RAG pipeline that produces meaningful answers, grounded in the documents of our choice.
Loved this post? Let’s be friends! Join me on:
📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!
What about pialgorithms?
Looking to bring the power of RAG into your organization?
pialgorithms can do it for you 👉 book a demo today



