If you missed Part 1: How to Evaluate Retrieval Quality in RAG Pipelines, check it out here
In my previous post, I took a look at how to evaluate the retrieval quality of a RAG pipeline, as well as some basic metrics for doing so. More specifically, that first part mainly focused on binary, order-unaware measures, which essentially evaluate whether relevant results exist in the retrieved set or not. In this second part, we are going to further explore binary, order-aware measures. That is, measures that take into account the ranking with which each relevant result is retrieved, on top of whether it is retrieved at all. So, in this post, we are going to take a closer look at two commonly used binary, order-aware metrics: Mean Reciprocal Rank (MRR) and Average Precision (AP).
Why ranking matters in retrieval evaluation
Effective retrieval is really important in a RAG pipeline, given that a good retrieval mechanism is the very first step towards generating valid answers grounded in our documents. If the documents containing the needed information cannot be identified in the first place, no AI magic downstream can fix this and produce valid answers.
We can distinguish between two large categories of retrieval quality evaluation measures: binary and graded measures. More specifically, binary measures categorize a retrieved chunk either as relevant or irrelevant, with no in-between situations. On the flip side, when using graded measures, we consider that the relevance of a chunk to the user’s query is rather a spectrum, and in this way, a retrieved chunk can be more or less relevant.
Binary measures can be further divided into order-unaware and order-aware measures. Order-unaware measures evaluate whether a chunk exists in the retrieved set or not, regardless of the ranking with which it was retrieved. In my latest post, we took a detailed look at the most common binary, order-unaware measures, and ran an in-depth code example in Python. Namely, we went over HitRate@K, Precision@K, Recall@K, and F1@K. In contrast, binary, order-aware measures, apart from considering if chunks exist or not in the retrieved set, also take into account the ranking with which they are retrieved.
So, in today’s post, we are going to take a more detailed look at the most commonly used binary, order-aware retrieval metrics, MRR and AP, and also check out how they can be calculated in Python.
I write 🍨DataCream, where I’m learning and experimenting with AI and data. Subscribe here to learn and explore with me.
Some order-aware, binary measures
So, binary, order-unaware measures like Precision@K or Recall@K tell us whether the correct documents are somewhere in the top k chunks or not, but don’t indicate whether a document sits at the top or at the very bottom of those k chunks. This exact information is what order-aware measures provide us with. Two very useful and commonly used order-aware measures are Mean Reciprocal Rank (MRR) and Average Precision (AP). But let’s see all these in some more detail.
🎯 Mean Reciprocal Rank (MRR)
A commonly used order-aware measure for evaluating retrieval is Mean Reciprocal Rank (MRR). Taking one step back, the Reciprocal Rank (RR) expresses at what rank the first truly relevant result is found among the top k retrieved results. More precisely, it measures how high the first relevant result appears in the ranking. RR can be calculated as follows, with rank_i being the rank at which the first relevant result is found (if no relevant result is retrieved at all, RR is 0):

RR = 1 / rank_i
We can also walk through this calculation with a quick example: if the first truly relevant chunk appears at rank 3 of the retrieved set, RR = 1/3 ≈ 0.33, whereas if it appears at the very top, RR = 1.
We can now put together the Mean Reciprocal Rank (MRR). MRR is simply the average of the Reciprocal Ranks across the different queries (result sets), summarizing how high the first relevant item tends to appear:

MRR = (RR_1 + RR_2 + … + RR_|Q|) / |Q|, where |Q| is the number of queries
In this way, MRR can range from 0 to 1. That is, the higher the MRR, the higher in the ranking the first relevant document appears.
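For example, suppose we evaluate three queries and the first relevant chunk appears at rank 1, rank 2, and rank 4 respectively. Then MRR = (1/1 + 1/2 + 1/4) / 3 ≈ 0.58.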
A real-life example where a metric like MRR can be useful for evaluating the retrieval step of a RAG pipeline would be any fast-paced environment, where quick decision-making is needed, and we need to make sure that a truly relevant result emerges at the top of the search. It works well for assessing systems where just one relevant result is enough, and significant information is not scattered across multiple text chunks.
A good metaphor to further understand MRR as a retrieval evaluation metric is Google Search. We think of Google as a good search engine because you can find what you are looking for in the top results. If you had to scroll down to result 150 to actually find what you are looking for, you wouldn’t think of it as a good search engine. Similarly, a good vector search mechanism in a RAG pipeline should surface the relevant chunks in reasonably high rankings and thus score a reasonably high MRR.
🎯 Average Precision (AP)
In my previous post on binary, order-unaware retrieval measures, we took a look specifically at Precision@k. In particular, Precision@k indicates how many of the top k retrieved documents are indeed relevant. Precision@k can be calculated as follows:

Precision@k = (number of relevant chunks in the top k results) / k
Average Precision (AP) builds further on that idea. More specifically, to calculate AP, we first calculate Precision@k at every position k where a new relevant item appears in the retrieved set. Then we obtain AP by simply taking the average of those Precision@k scores.
But let’s see an illustrative example of this calculation. Suppose the retrieved set contains four chunks, ordered as [relevant, irrelevant, irrelevant, relevant]; new relevant chunks are introduced at k = 1 and k = 4.
Thus, we calculate Precision@1 and Precision@4, and then take their average. That will be (1/1 + 2/4) / 2 = (1 + 0.5) / 2 = 0.75.
We can then generalize the calculation of AP as follows:

AP = (sum over ranks k of Precision@k × rel(k)) / (number of relevant chunks retrieved), where rel(k) is 1 if the chunk at rank k is relevant and 0 otherwise
Again, AP can range from 0 to 1. More specifically, the higher the AP score, the more consistently our retrieval system ranks relevant documents towards the top. In other words, a high AP means that relevant documents are retrieved and tend to appear before the irrelevant ones.
Unlike MRR, which focuses only on the first relevant result, AP takes into account the ranking of all the retrieved relevant chunks. It essentially quantifies how much irrelevant content we pick up along the way while retrieving the truly relevant items, across the various top k cutoffs.
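For instance, in the example set above ([relevant, irrelevant, irrelevant, relevant]), RR is a perfect 1.0 because the very first result is relevant, while AP is 0.75 because the second relevant chunk only shows up at rank 4.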
To get a better grip on AP and MRR, we can also imagine them in the context of a Spotify playlist. Similarly to the Google Search example, a high MRR would mean that the first song of the playlist is our favorite song. On the flip side, a high AP would mean that the entire playlist is good, and many of our favorite songs appear frequently and towards the top of the playlist.
So, is our vector search any good?
Normally, I would continue this section with the War and Peace example, as I’ve done in my other RAG tutorials. However, the full retrieval code is getting quite large to include in every post. Instead, in this post, I’ll focus on showing how to calculate these metrics in Python, doing my best to keep the examples concise.
Anyways! Let’s see how MRR and AP can be calculated in practice for a RAG pipeline in Python. We can define functions for calculating the RR and MRR as follows:
from typing import Iterable, Sequence

# Reciprocal Rank (RR): 1 / rank of the first relevant result, 0 if nothing relevant was retrieved
def reciprocal_rank(relevance: Sequence[int]) -> float:
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

# Mean Reciprocal Rank (MRR): average RR across all queries
def mean_reciprocal_rank(all_relevance: Iterable[Sequence[int]]) -> float:
    vals = [reciprocal_rank(r) for r in all_relevance]
    return sum(vals) / len(vals) if vals else 0.0
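To see these in action, here is a quick sanity check with some made-up relevance lists (the labels below are purely illustrative):

# Hypothetical relevance labels for three queries
queries_relevance = [
    [0, 1, 0, 0],  # first relevant chunk at rank 2 -> RR = 0.5
    [1, 0, 0, 0],  # first relevant chunk at rank 1 -> RR = 1.0
    [0, 0, 0, 1],  # first relevant chunk at rank 4 -> RR = 0.25
]
print(mean_reciprocal_rank(queries_relevance))  # (0.5 + 1.0 + 0.25) / 3 ≈ 0.58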
We have already calculated Precision@k in the previous post as follows:
# Precision@k: fraction of the top k retrieved chunks that are relevant
def precision_at_k(relevance: Sequence[int], k: int) -> float:
    k = min(k, len(relevance))
    if k == 0:
        return 0.0
    return sum(relevance[:k]) / k
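As a quick sanity check against the earlier example, precision_at_k([1, 0, 0, 1], 4) returns 2/4 = 0.5, since half of the top 4 chunks are relevant.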
Building on that, we can define Average Precision (AP) as follows:
# Average Precision (AP): average of Precision@i over every rank i where a relevant chunk appears
def average_precision(relevance: Sequence[int]) -> float:
    if not relevance:
        return 0.0
    precisions = []
    hit_count = 0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hit_count += 1
            precisions.append(hit_count / i)  # Precision@i at this relevant position
    return sum(precisions) / hit_count if hit_count else 0.0
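And indeed, this matches the worked example from earlier: average_precision([1, 0, 0, 1]) returns (1/1 + 2/4) / 2 = 0.75.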
Each of these functions takes as input a list of binary relevance labels, where 1 means a retrieved chunk is relevant to the query, and 0 means it’s not. In practice, these labels are generated by comparing the retrieved results with the ground truth set, exactly as we did in Part 1 when calculating Precision@K and Recall@K. In this way, for each query (for instance, “Who is Anna Pávlovna?”), we generate a binary relevance list based on whether each retrieved chunk contains the answer text. From there, we can calculate all the metrics using the functions as shown above.
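As a rough sketch of that labeling step (the helper below is purely illustrative, assuming each retrieved chunk is a plain string and we have one ground-truth answer string per query):

# Hypothetical helper: build a binary relevance list via simple substring matching
def relevance_labels(retrieved_chunks: Sequence[str], answer_text: str) -> list[int]:
    return [1 if answer_text.lower() in chunk.lower() else 0 for chunk in retrieved_chunks]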
Another useful order-aware metric we can calculate is Mean Average Precision (MAP). As you can imagine, MAP is the mean of the APs for different retrieved sets. For example, if we calculate AP for three different test questions in our RAG pipeline, the MAP score tells us the overall ranking quality across all of them.
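Building on the functions above, a minimal sketch of MAP could look like this:

# Mean Average Precision (MAP): average AP across all queries
def mean_average_precision(all_relevance: Iterable[Sequence[int]]) -> float:
    vals = [average_precision(r) for r in all_relevance]
    return sum(vals) / len(vals) if vals else 0.0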
On my mind
Binary, order-unaware measures that we saw in the first part of this series, such as HitRate@k, Precision@k, Recall@k, and F1@k, can provide us with valuable information for evaluating the retrieval performance of a RAG pipeline. Nonetheless, such measures only tell us whether a relevant document is present in the retrieved set or not.
Binary, order-aware measures reviewed in this post, like Mean Reciprocal Rank (MRR) and Average Precision (AP), can provide us with further insight, as they not only tell us whether the relevant documents exist in the retrieved results, but also how well they are ranked. In this way, we can get a better overview of how well the retrieval mechanism of our RAG pipeline performs, depending on the task and type of documents we are using.
Stay tuned for the next and final part of this retrieval evaluation series, where I’ll be discussing graded retrieval evaluation measures for RAG pipelines.
Loved this post? Let’s be friends! Join me on:
📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!
What about pialgorithms?
Looking to bring the power of RAG into your organization?
pialgorithms can do it for you 👉 book a demo today



