This article is a follow-up to Building an Overengineered Retrieval System. That one was about building the entire system; this one is about doing the evals for it.
In the previous article, I went through different parts of a RAG pipeline: chunking the data properly, query optimization, retrieval (semantic, BM25, or hybrid search), re-ranking, expanding chunks to neighbors, building the context, and then generation with an LLM.
One of the questions I got was: does expanding chunks to neighbors actually improve answers, or does it just add noise and make it harder for the model to stay grounded?
So that’s what we’ll test here. We’ll run some basic evaluations and look at metrics like faithfulness, answer relevancy, context relevance, and hallucination rate, and compare results across different models and datasets.
I’ve collected most of the results here and here already, but we’ll go through them too.
As a note, I’m planning to compare this kind of “advanced” pipeline to a more naive baseline later. But this article is mainly about evaluating the pipeline as it is.

I always go through some intro sections before I dig in, but if you’re new-new, I’d first read up on how to build a basic RAG system, how embeddings work, and an actual intro to evals/metrics. Then you can also read how to build the over-engineered pipeline I introduced above, or at least skim it.
If none of this is new, then skip to the results part.
Why we perform evals
Evals are about pressure-testing the system on a bigger (and more targeted) set of questions than your favorite 10, and making sure that whatever changes you push don’t quietly degrade the quality of the system.
Changes in data, prompts, or models can very much affect performance without you seeing it.
You may also need to show your team the general performance of the system you’ve built before being allowed to test it on real users.
But before you do this, you need to decide what to test.
What does a successful system look like to you? If you care about multi-hop, you need questions that actually require multi-hop. If you care about Q&A and proper citations, you test for that. Otherwise, you end up evaluating the wrong thing.
This is a bit like doing investigative work: you test something, you try to understand the results, and then you build better tests.
To do this well, you should try to build a golden set (often from user logs) to test with.
This isn’t always possible, so in situations like this we build synthetic datasets. This may not be the best way to do it, as it will clearly be biased and won’t reflect what your users will actually ask. Nevertheless, you may need to start somewhere.
For this article, I’ve created three different datasets so we can compare them: one created from the ingested corpus, one with messy user questions derived from the corpus, and one with random questions about RAG that weren’t generated from the corpus at all.
You’ll see how these datasets give us different results on the metrics, and how those results mean different things.
What to think about
I’m not going to go through everything there is to think about here, because doing evals well is pretty difficult (although also fun if you like statistics and data).
But there are a few things you need to keep in mind: LLM judges are biased, cherry-picking questions is a problem, gold answers are best if you have them, and using a larger dataset with tags helps you break down where and how the system is failing.

If you’ve read the eval metrics article, you’ve already seen the idea of LLM-as-a-judge. It can be useful, but it’s not inherently reliable because it has baked-in preferences and blind spots.
There are things that will make you go mad, like a judge punishing an answer that’s based on the corpus but not explicitly stated in the retrieved chunks (summaries / small inferences), or judging the same answer differently depending on how the question is phrased.
You’ll notice this later when you dig into the failing questions to understand why.
Another thing to keep in mind is to make sure not to “cherry-pick” questions, even if you feel the urge to.
You obviously have to start somewhere, but the goal is to get close to what your users are actually asking, find the issues, and keep updating the dataset based on where the system seems to fail. It’s easy to get nice numbers if you mostly test “easy” questions, but then the eval becomes less useful.
The best thing is to have not just real user questions but also gold answers.
So even if you can “bypass” having references by using an LLM judge, having the correct answers for those questions is best. That’s when you can use the LLM to judge whether the output matches the gold answer, instead of asking it to judge the answer on its own.
Sample size matters too. Too small and it may not be reliable. Too big and it’s easy to miss smaller problems.
If you have enough data, you can tag questions into topics, different wordings (pessimistic / typical phrasing), and different types (short / long / messy) so you can see what breaks where.
I’ve heard recommendations that start with something like 200–1,000 real queries with gold answers if you want this to be a real evaluation setup.
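If you do get to that point, it helps to store each query as a small tagged record so you can slice the results later. Something like this (the fields here are just an example of how I’d structure it, not a requirement of any tool):

```python
golden_item = {
    "id": "q_0042",
    "question": "How does hybrid search compare to BM25-only retrieval?",
    "gold_answer": "Hybrid search combines BM25 with semantic search, which ...",
    "topic": "retrieval",        # tag: what the question is about
    "phrasing": "typical",       # tag: typical / pessimistic / messy wording
    "length": "short",           # tag: short / long question
    "source": "user_logs",       # or "synthetic" if it was generated
}
```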
Since this entire exercise is hypothetical, and the system has only ingested documents to demo the idea of expanding to neighbors, the evals use synthetically generated datasets. That makes them less reliable, but there are still learnings we can take from them.
Deciding on metrics & datasets
This section is about two things: which metrics I’m using to evaluate the pipeline, and how I’m using them across datasets to see if neighbor expansion seems to help.
First, if you haven’t read about evals for LLM systems at all, go read this article. It gives you a taxonomy of the different metrics out there (RAG included).
Since I’m being lazy here, I needed reference-free metrics, which also limits what we can actually test: the judge can only look at the question, the context, and the generated answer.
A few metrics that can help here are faithfulness (is the answer grounded in the provided context), answer relevancy (does it actually answer the question), context relevancy (how much of the context is just noise), and hallucination (how many claims are actually backed up by the provided context).
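Roughly speaking (this is my paraphrase of what these judges compute, not the exact formulas the libraries use), most of them boil down to simple ratios over claims or chunks, which is why a 1.0 on the hallucination score later means nothing was made up. Answer relevancy is the odd one out: RAGAS estimates it by regenerating questions from the answer and comparing them to the original question.

```python
# Paraphrased shape of the reference-free metrics (not the exact library code).

def faithfulness(supported_claims: int, total_claims: int) -> float:
    # Share of claims in the answer the judge can trace back to the context.
    return supported_claims / total_claims if total_claims else 0.0

def context_relevance(useful_chunks: int, total_chunks: int) -> float:
    # Share of the provided context that actually helps answer the question.
    return useful_chunks / total_chunks if total_chunks else 0.0

def hallucination_score(backed_claims: int, total_claims: int) -> float:
    # 1.0 means every claim is backed by the context, i.e. nothing made up.
    return backed_claims / total_claims if total_claims else 1.0
```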

Since we want to figure out whether expanding seed chunks to neighbors is useful, without building two different pipelines, we can do one simple comparison: for the faithfulness metric, have the judge score the answer against the seed chunks and against the final expanded context, and see how much of the answer is grounded in each.
If grounding improves when the judge sees the expanded context, that’s at least evidence that the model is using the expanded chunks and it’s not just noise. We would need more testing, though, to say for sure which is the winner.
Finally, the datasets matter as much as the metrics.
If you’ve read the first article, you know that all the docs that have been ingested are scientific articles that mention RAG. So all the questions that we create here need to be about RAG.
I have generated three different datasets, each with a different RAG flavor.
The first is based on the ingested corpus: going through each scientific article and writing two questions that it can answer.
The second is doing the same but providing messy questions like, “how does k2 btw rag improve answer fetching compared to naive rag, like what’s the similarity scores in terms of q3?”

This messy-user-questions dataset would also be good for testing the query optimizer from the first article (though I don’t have those results for you here). Here it will tell us whether phrasing things differently skews the results.
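As a rough sketch of how these first two datasets can be generated (not my exact script: the prompts, the gpt-4o-mini choice, and the helper names are just placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()

CLEAN_PROMPT = (
    "Write two questions that can be answered using only the article below.\n"
    "Return a JSON list of two strings.\n\nARTICLE:\n{article}"
)
MESSY_PROMPT = (
    "Rewrite this question the way a hurried user would type it into a chat box: "
    "lowercase, abbreviations, vague references, a typo or two. "
    "Return only the rewritten question.\n\nQUESTION: {question}"
)

def questions_for_article(article_text: str) -> list[str]:
    """Dataset 1: two clean questions per ingested article."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLEAN_PROMPT.format(article=article_text)}],
    )
    # Assumes the model returns a clean JSON list; in practice you'd validate this.
    return json.loads(resp.choices[0].message.content)

def messy_variant(question: str) -> str:
    """Dataset 2: a messy rewrite of the same question."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": MESSY_PROMPT.format(question=question)}],
    )
    return resp.choices[0].message.content.strip()
```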
The third dataset is based on 66 random RAG questions found online. This means that these questions may not have answers in the corpus (the ingested RAG articles are just from September to October, so we don’t know exactly what they contain).
So the first two evaluate how well the pipeline behaves when it should be able to answer from the documents it has, while the third tells us what is missing and how the pipeline behaves on questions it might not be able to answer.
Though this is a bit simplified: the corpus-generated questions tend to follow document sections (which favors the expanded context), while the random ones may be answered just as well by the seed chunks.
Running the evals
To run the evals, you first need to run the pipeline on every question, for every model, and store the results.
If you don’t store everything you need, you can’t debug later. You want to be able to go from a low score back to the exact answer, the exact retrieved context, and the exact model settings.
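In practice this just means appending one record per question and model to a JSONL file, with everything the eval layer (and later debugging) will need. Roughly like this (the field names are just how I like to structure it):

```python
import json

def store_run(path: str, question: str, answer: str, seed_chunks: list[str],
              full_context: list[str], model: str, settings: dict) -> None:
    """Append one pipeline run so any score can be traced back to its inputs."""
    record = {
        "question": question,
        "answer": answer,
        "seed_texts": seed_chunks,        # chunks before neighbor expansion
        "full_contexts": full_context,    # expanded context actually sent to the LLM
        "model": model,                   # which model generated the answer
        "settings": settings,             # retrieval / rerank / expansion parameters
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```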
I also wanted to compare models, because people assume “bigger model = better answers,” and that’s not always true, especially for easier tasks. So I’m running the same pipeline across GPT-5-mini, GPT-5.1, and GPT-5.2, for several datasets.
Once that’s done, I build the eval layer on top of those stored outputs.
I used RAGAS for the standard metrics and DeepEval for the custom ones. You can obviously build it manually, but it’s much easier this way. I love how seamless DeepEval is, though it’s harder to debug if you find issues with the judge later.
A few specifics: the pipeline runs with no context cap, the judge model is gpt-4o-mini, and we use n=3 for RAGAS and n=1 for the custom judges.
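As a sketch of what that layer looks like on top of the stored runs (the exact imports and class names depend on your RAGAS and DeepEval versions, so treat this as the general shape rather than a drop-in script; `runs` is the list of stored records from above):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# RAGAS side: the standard metrics, computed over the stored runs.
# (Judge model configuration, e.g. gpt-4o-mini, is omitted here for brevity.)
ragas_data = Dataset.from_dict({
    "question": [r["question"] for r in runs],
    "answer":   [r["answer"] for r in runs],
    "contexts": [r["full_contexts"] for r in runs],
})
ragas_scores = evaluate(ragas_data, metrics=[faithfulness, answer_relevancy])

# DeepEval side: a custom judge, e.g. for the structure/citations check.
citation_metric = GEval(
    name="structure_citations",
    criteria="Does the answer cite the provided context correctly for each claim it makes?",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)
for r in runs:
    case = LLMTestCase(
        input=r["question"],
        actual_output=r["answer"],
        retrieval_context=r["full_contexts"],
    )
    citation_metric.measure(case)
    r["structure_citations"] = citation_metric.score
```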
Since neighbor expansion is the whole point of this pipeline, remember we also run this check: for faithfulness, we score grounding against the seed chunks and against the full expanded context, to see if there’s a difference.
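The check itself is nothing fancy: score the same answer twice, once against only the seed chunks and once against the full expanded context. Here is a sketch of one way to do that with DeepEval’s FaithfulnessMetric (I ran the standard metric through RAGAS, but the idea is identical):

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def faithfulness_pair(run: dict, judge_model: str = "gpt-4o-mini") -> tuple[float, float]:
    """Score grounding against the seed chunks vs. the full expanded context."""
    scores = []
    for context in (run["seed_texts"], run["full_contexts"]):
        metric = FaithfulnessMetric(model=judge_model)
        case = LLMTestCase(
            input=run["question"],
            actual_output=run["answer"],
            retrieval_context=context,
        )
        metric.measure(case)
        scores.append(metric.score)
    return scores[0], scores[1]   # (seed_faithfulness, full_faithfulness)
```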
Eval results of datasets & models
Let’s run the evals for the different datasets, metrics, and models to see how the pipeline is doing and how we can interpret the results. Remember you can find the full results here and here (especially if you dislike my childish sketches).
We can start with the results from the dataset generated by the corpus.

The table above shows the first RAGAS metrics. Faithfulness (does it stay grounded in the context provided) and answer relevancy (does it answer the question) are very high.
This is to be expected, as we’re basically giving it questions that it should be able to answer with the documents. If these showed low numbers, there would be something severely off in the pipeline.
It also gives us back seed faithfulness, where the judge estimates how grounded the answer is in the seed chunks alone. This one is a lot lower overall than full-context faithfulness, by 12–18 points across the different models.
In fewer words: we can say that the LLM is using some of the full context, not just the seed chunks, when generating its answer.
What we can’t judge though is if the seed-only answer would have been just as good. This will require us to run two pipelines and compare the same metrics and datasets for each.
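The comparison we do have is just an aggregation over the stored scores. Something like this, assuming a results.jsonl with the same fields as the example records further down:

```python
import json
import pandas as pd

rows = [json.loads(line) for line in open("results.jsonl", encoding="utf-8")]
df = pd.DataFrame(rows)

# How much more grounded is the answer in the full context than in the seeds alone?
df["faithfulness_delta"] = df["full_faithfulness"] - df["seed_faithfulness"]
print(df.groupby("model")["faithfulness_delta"].describe())
```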
Now let’s look at these next metrics (for the same dataset).

I would have expected context relevance to decrease here, since the judge is looking at the full context, which pulls in up to 10 neighboring chunks per section, yet it holds up well.
A reason for this may be that the generated questions are based on sections, which means the added context actually helps answer them.
Structure citations (i.e. does it cite its claims correctly) looks alright, and the hallucination score is high, which is good here (1 means no made-up claims in the answer).
You’ll also notice that the different models show very little difference in performance.
Yes, this is quite an easy Q&A task. But it does demonstrate that the additional size of the model may not be needed for everything, and the added context expansion may be able to act as a buffer for the smaller models.
Now let’s look at the results if we change the dataset to those messy user questions instead.

We see the scores drop a few points but stay high, though without isolating the outliers we can’t say exactly why. Faithfulness judged against only the seed chunks does look lower for the messy user questions, which is interesting.
Let’s now turn to the third dataset, which will be able to tell us a lot more.

We see worse numbers across the board, which is of course expected: the ingested corpus probably can’t answer many of these questions well. This helps point us to where information is missing.
Faithfulness still stays high for the full-context runs, though. Here the gap to the seed-only runs is a lot larger, which suggests the added expansion is being used more in the answers.
Something that was strange here was how GPT-5.2 consistently did worse for answer relevance across two different runs. This can be a metric thing, or it can be a model thing where it answers more cautiously than before, thus getting a lower score.
This also tells you why it’s so important to test these new models on your own pipelines before adding them in.
Let’s continue with the other metrics for the random dataset.

Context relevance is very low, so the judge thinks there is a lot of noise in the context that does not directly answer the question for this dataset.
The context relevance metric rewards retrieving a high proportion of directly relevant chunks, but seed+expand intentionally adds neighbor chunks (definitions, adjacent sections, captions) that should improve grounding. Unfortunately, there may be some noise that comes with that.
Both structure citations and hallucination score worse here, probably because it’s harder to stay grounded if you’re not provided any relevant information to use.
Now this was a first build for this pipeline. Based on these results, you can improve it until you drive those numbers up. Maybe change the metrics to be more custom to your pipeline.
The datasets show different things: the corpus-based ones show that the pipeline works, while the random one shows where information is missing and that there is a lot more noise to navigate around.
What to examine
So what do you do now that you have all these scores? You can look at regressions, worst offenders, and do some random sampling.
Like I said before, this whole exercise is a bit like investigative work, which you’ll love if you love working with data. If you do have enough data and you’ve tagged it, then by all means break it down and look at the individual numbers and see what you find. But there are three checks you should always do.
First is regressions: check the stuff that used to pass and now fails following changes.
Second is checking the worst offenders (the bottom ~5–10 data points) to see what is happening. It could be the metric, the judge, the retrieval, or the model.
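Both checks fall straight out of the stored results. A rough sketch, assuming the results.jsonl from earlier plus a baseline file from a previous run (the 0.8/0.6 cutoffs are arbitrary):

```python
import json
import pandas as pd

def load(path: str) -> pd.DataFrame:
    return pd.DataFrame(json.loads(line) for line in open(path, encoding="utf-8"))

current, baseline = load("results.jsonl"), load("results_previous.jsonl")

# Worst offenders: bottom 10 on full faithfulness, to be read one by one.
worst = current.sort_values("full_faithfulness").head(10)
print(worst[["question", "full_faithfulness", "answer_relevancy"]])

# Regressions: questions that scored fine before and dropped after a change.
merged = current.merge(baseline, on="question", suffixes=("_now", "_before"))
regressed = merged[(merged["full_faithfulness_before"] >= 0.8) &
                   (merged["full_faithfulness_now"] < 0.6)]
print(regressed[["question", "full_faithfulness_before", "full_faithfulness_now"]])
```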
Below is one of the worst ones for full faithfulness and answer relevancy.
"question": "how often does entity linking mess up acronyms and lookalikes, and how do i test that",
"answer": "The documents do not report a direct number for "how often" entity linking fails specifically on acronyms or look‑alike names.\n\nWhat we do see is that:\n- Even when structure is correct, LLMs often link to the wrong property or entity (semantic inaccuracies) in SPARQL generation over a scholarly KG [2]. This is the same failure mode you'd see when an acronym (e.g., NMF, ACE) is mapped to the wrong node.\n- In medical KGs, entity linking relies on exact keywords and then semantic similarity of embeddings when no exact match exists [3]. This last step explicitly allows linking to a merely "semantically related" concept rather than the exact one, which is where lookalikes and acronym collisions can occur [3].\n- In conversational entity‑centric systems, the largest single error category is coreference/entity confusion (27% of failures) [4], showing that confusing similar mentions is a major practical problem even outside strict EL.\n\nHow to test this in your setting, using patterns from the papers:\n1) Build a targeted test set\n - Create a small benchmark that stresses acronyms and visually/lexically similar entities, analogous to how MATE is used to stress cross‑modal entity linking [1].\n - Include positive pairs (correct acronym→entity) and hard negatives (different entities sharing acronym or near‑identical names).\n\n2) Run your full EL pipeline\n - Use the same identifier / keyword / embedding‑based linking strategy described in MedKGent [3].\n\n3) Manually and/or LLM‑assisted evaluation\n - Have humans or an LLM score whether each link is correct vs. incorrect, similar to the triple‑validity rubric used in MedKGent [3] and the SPARQL error audit in FIRESPARQL [2].\n\n4) Analyze error types\n - Break down errors like the semantic vs. structural analysis in [2] and the error table in [4], but with categories "acronym collision," "lookalike surface form," etc.\n\nThis will give you an empirical error rate for your own stack, focused specifically on acronyms and lookalikes.",
"full_faithfulness": 0.24,
"seed_faithfulness": 0.8260869565000001,
"answer_relevancy": 0.0,
"context_relevance": 0.208549739206933,
"context_relevance_reason": "The context provided does not directly address the user's question about how often entity linking messes up acronyms and lookalikes, nor does it offer methods for testing that. While it discusses entity linking and its evolution, it lacks specific information on the issues related to acronyms and lookalikes, which is the core of the user's inquiry.",
"hallucination_score": 0.6572611409640697,
"hallucination_reason": "The response accurately identifies that the documents do not provide a specific frequency for how often entity linking fails with acronyms or lookalikes, which aligns with the input query. It also discusses relevant issues such as semantic inaccuracies and coreference confusion, which are pertinent to the topic. However, it lacks direct references to specific claims made in the context, such as the limitations of traditional EL methods or the role of exact keywords in medical KGs, which could have strengthened the response further.",
"full_contexts": ["Entity Linking\n\nEntity Linking (EL) has evolved from text-only methods to Multimodal Entity Linking (MEL), and more recently to Cross-Modal Entity Linking (CMEL), which supports crossmodal reasoning. Traditional EL methods associate textual entities with their corresponding entries in a knowledge base, but overlook non-textual information (Shen, Wang, and Han 2015; Shen et al. 2023). MEL extends EL by incorporating visual information as auxiliary attributes to enhance alignment between entities and knowledge base entries (Gan et al. 2021; Liu et al. 2024b; Song et al. 2024).", "However, MEL does not establish cross-modal relations beyond these auxiliary associations, thereby limiting genuine cross-modal interaction.", "CMEL goes further by treating visual content as entities-aligning visual entities with their textual counterparts-to construct MMKGs and facilitate explicit crossmodal inference (Yao et al. 2023). Research on CMEL remains in its early stages, lacking a unified theoretical framework and robust evaluation protocols. The MATE benchmark is introduced to assess CMEL performance, but its synthetic 3D scenes fall short in capturing the complexity and diversity of real-world images (Alonso et al. 2025). To bridge this gap, we construct a CMEL dataset featuring greater real-world complexity and propose a spectral clustering-based method for candidate entity generation to drive further advances in CMEL research.", "3 Error type analysis on generated SPARQL queries\n\nDespite the improvements of LLMs on QA over SKGs, LLMs face limitations when handling KG-specific parsing. The experimental results conducted by Sören Auer et al.[2] showed that only 63 out of 100 handcrafted questions could be answered by ChatGPT, of which only 14 answers were correct. To better understand why LLMs fail to generate the correct SPARQL query to a NLQ, we conduct a pilot experiment on using ChatGPT(GPT-4) with a random one-shot example to generate SPARQL queries for 30 handcrafted questions in the SciQA benchmark datasets.", "Insights from this pilot experiment revealed two major categories of errors LLMs tend to make in this task: semantic inaccuracies and structural inconsistencies. Semantic inaccuracies occur when LLMs fail to link the correct properties and entities in ORKG, despite generating SPARQL queries with correct structure. Our observations reveal that LLMs tend to rely on the example provided in the one-shot learning process to generate the correct structure for a certain type", "of questions, but often struggle with linking the correct properties and entities because LLMs do not learn the content of the underlying KG. Structural inconsistencies arise due to LLMs' lack of ontological schema of the underlying KG, leading to errors in query structure, such as missing or abundant links (triples), despite correctly linking to the mentioned entities or properties.", "Figure 1 shows the example of semantic inaccuracies and structural inconsistencies problem with the generated SPARQL queries in our pilot study. In the example of the semantic inaccuracies problem, ChatGPT failed to link the correct property orkgp:P15687; instead, it linked to a wrong property orkgp:P7101. In the example of the structural inconsistencies problem, the SPARQL query generated by ChatGPT directly links Contribution to Metrics, fails to detect the correct schema of the ORKG where Contribution and Metric are connected via Evaluation.", "Fig. 
1: Examples of semantic inaccuracies and structural inconsistencies problem with the generted SPARQL queries\n\nSemantic inaccuracies Problem\n\nFail to link the correct properties and entities in ORKG\n\nWhat is the maximum sample size?\n\nContribution Evaluation Metric P34 P2006 P7046\n\nStructural inconsistencies Problem\n\nMake errors in query structure, such as missing or abundant links (triples)\n\nWhat are the metrics used by paper \"Using NMF-based text summarization\n\nto improve supervised and unsupervised classification?\n\norkgp:P15687 rdfs:label Sample size (n)\n\norkgp:P7101 rdfs:label has elements", "2 Results\n2.1 Method overview\n\nas its confidence score. For instance, if the triple (NPPA, Negative Correlate, Water) appears in 90% of the outputs, its confidence score is 0.9. Low-confidence triples (score < 0.6) are filtered out, and only high-confidence triples are retained for downstream graph construction. Each triple is also annotated with the PubMed ID of the source abstract and a timestamp, ensuring traceability and source attribution. For example, (NPPA, Negative Correlate, Water) would have a PubMed ID of 10494624 and a timestamp of 2000-01-01.", "As shown in Figure 1 c , for each retained triple, such as (NPPA, Negative Correlate, Water), the Constructor Agent checks its presence in the current KG. If absent ( i.e. , either the head or tail entities are missing), it is inserted; if present, its confidence score is updated according to Equation (1). The associated PubMed ID is appended, and the timestamp is updated to reflect the latest publication. For example, if an existing triple (NPPA, Negative Correlate, Water) has a confidence score of 0.7, PubMed ID 10691132, and timestamp 1999-12-31, and a new occurrence with a confidence score of 0.9, PubMed ID 10494624, and timestamp 2000-01-01 is encountered, the updated triple will have a confidence score of 0.97, PubMed IDs [10691132, 10494624], and a timestamp of 2000-01-01. If the head and tail entities are present but the relation differs, such as existing (NPPA, Associate, Water) vs. incoming (NPPA, Negative Correlate, Water), only the most appropriate relation is maintained. The Constructor Agent invokes the LLM to resolve the conflict by selecting the more suitable relation, considering both the existing and incoming triple's confidence scores and timestamps. If the LLM selects the new triple, the existing one is replaced; otherwise, no changes are made. The prompt design for relation conflict resolution is shown in Extended Data Figure 2 c . Together, the two agents extract structured medical facts and integrate them into a dynamic, time-aware KG. See more details in the Section 4.", "2.2 Structural Characterization of the Knowledge Graph\n\nIn this section, we detail the structural characteristics of the medical KG we constructed, with an emphasis on the distribution of node types, relationship types, and the confidence scores of relationship triples. We also present a visualization of a subgraph centered on COVID-19 to illustrate the graph's structure.", "Using the MedKGent framework, we extracted knowledge triples from the abstracts of 10,014,314 medical papers, with 3,472,524 abstracts (34.68%) yielding extractable triples. 
The relatively low extraction rate can be attributed to several factors: first, some abstracts lacked sufficient structured information for triple extraction; second, only triples with a confidence score exceeding 0.6 were retained, excluding those with lower confidence; and third, some triples extracted by LLMs contained formatting issues, such as extraneous or irrelevant characters, which were discarded. In total, our Extractor Agent identified 8,922,152 valid triples from the abstracts. However, the extracted triples contained a significant number of duplicates and conflicts. To resolve this, our Constructor Agent integrates the triples in chronological order. During this process, duplicates are merged, with the confidence score for each triple increasing in proportion to its frequency, reflecting greater certainty. For conflicting triples, where the same entity pair is associated with multiple relations, the Constructor Agent retains the most appropriate relationship. Following this consolidation, the final KG comprises 2,971,384 distinct triples.", "We conducted a comprehensive statistical analysis of the final constructed KG, which comprises 156,275 nodes. As shown in Figure 2 a , the node distribution is predominantly dominated by Gene and Chemical nodes, with smaller proportions of other entities such as Disease, Variant, Species, and CellLine. The KG includes 2,971,384 relationship triples (edges), representing a range of interactions between entities, as illustrated in Figure 2 b . The most common relationship type is 'Associate', followed by 'Negative Correlate' and 'Positive Correlate', indicating strong associations between medical entities. Less frequent relationships, such as 'Interact', 'Prevent', and 'Drug Interact', provide additional insights into the complexities of medical interactions. The distribution of confidence scores for these relationship triples, shown in Figure 2 c , with confidence values discretized to the nearest smaller 0.05 increment (rounding down to the closest multiple of 0.05), reveals a clear dominance of high-confidence triples. A significant proportion of triples exhibit confidence scores of 0.95, reflecting the cumulative increase in confidence resulting from the repetition of triples during the graph construction process. This high-confidence distribution reinforces the reliability and robustness of the KG.", "We visualized a local subgraph of the constructed KG with COVID-19 as the central node, highlighting five surrounding relationship triples, as shown in Figure 2 d . Each node is characterized by six key attributes: the Identifier, which uniquely references the node and normalizes multiple synonymous mentions to a standardized terminology entry; the Entity Type, which classifies the entity; the Terminology, which maps the entity type to its corresponding standard terminology; the Page Link, providing a reference to the entity in the Terminology; the Exact Keywords, which lists common names and aliases of the entity in lowercase; and the Semantic Embedding, a vector representation of the entity. In practice, these attributes facilitate entity linking within a query by matching entities to their corresponding nodes in the KG. When the Identifier of an entity in the query is available, entity linking can be efficiently performed using this unique reference. 
In the absence of an Identifier, precise matching", "Figure 2: A comprehensive statistical analysis and visualization of the constructed KG, consisting of 156,275 nodes and 2,971,384 relationship edges. a . Node distribution within the KG, with Gene and Chemical nodes predominating, and smaller proportions of Disease, Variant, Species, and CellLine. b . Relationship type distribution within the KG, highlighting the prevalence of 'Associate' relationships, followed by 'Negative Correlate' and 'Positive Correlate', with less common interactions such as 'Interact', 'Prevent', and 'Drug Interact'. c . The distribution of confidence scores for relationship triples, discretized to the nearest smaller 0.05 increment, ensures values are rounded down to the closest multiple of 0.05. This distribution reveals a clear dominance of high-confidence triples, particularly those with scores of 0.95, underscoring the robustness of the KG. d . Local subgraph visualization centered on COVID-19, displaying five surrounding relationship triples. Each node is characterized by key attributes, including Identifier, Entity Type, Terminology, Page Link, Exact Keywords, and Semantic Embedding, facilitating efficient entity linking through exact or similarity matching. The relationships in the KG are further enriched by attributes such as Confidence, PubMed IDs, and Timestamp, enhancing traceability, accuracy, and temporal relevance.\n\nCOVID -19 ACE2 Pneu- monia Lung Disea -ses MAD00 04J08 tociliz- umab Treat Identifier : MESH:C000718219 Entity Type : Chemical Terminology : NCBI MeSH Page Link", ": meshb.nlm.nih.gov/record/ui?ui=C000718219\n\nExact Keywords : [mad0004j08] Semantic Embedding : [- 0.12, …, 0.10 ] : MESH:D000086382\n\nEntity Type:\n\nDisease\n\n: meshb.nlm.nih.gov/record/ui?ui=D000086382\n\n: [ncp, covid-19]\n0.25, …, 0.09\n\nIdentifier:\n\nMESH:C502936\nChemical\n\nTerminology:\n\nNCBI MeSH\n\n: meshb.nlm.nih.gov/record/ui?ui=C502936\n\n: [mra, tocilizumab] 0.12, …, 0.13 Associate 59272 Gene\n\n:\n\nNCBI Gene\n\n: www.ncbi.nlm.nih.gov/gene/59272\n\n: [ace2, ace2p]\n0.22, …, 0.09]\nMESH:D011014\n\n: meshb.nlm.nih.gov/record/ui?ui=D011014\n\n: [pneumonia]\n0.18, …, 0.01\nMESH:D008171\n\n: meshb.nlm.nih.gov/record/ui?ui=D008171\n\n: [lung diseases,lung damage]\n\n: [ 0.06, …, 0.11 d a b Drug_Interact (0.1%) 0.70 0.65 'Prevent (0.79 0.75 7.89) (7.5%) 0.60 (8.1%) (5.4% (47.7%) 0.80 CellLine Positive (8.9%) (0.5%) Correlate 0.85 (19.9%) (10.3%) Variant (1.49) (5.9%) Cause (1.4% 0.90 (33.6%) Inhibit (1.2% Negative_Correlate Stimulate (0.5%) (13.7%) Species Compare (26.1%) Cotreat (1.0%)", "Figure 3: Comprehensive evaluation of extraction quality for relationship triples generated by the Extractor Agent. Systematic assessment of extraction accuracy using both automated evaluations by LLMs and independent manual expert review. a . Proportion of valid relationship triples (score ≥ 2.0) across relation types, as assessed by GPT4.1 on a randomly selected subset of 34,725 abstracts (83,438 triples). b . Proportion of valid relationship triples across relation types, as assessed by DeepSeek-v3 on the same subset. c . Validity rates from independent manual evaluation by three domain experts on a subset of 400 abstracts (1,060 triples), demonstrating high inter-expert consistency. d-f . Performance of GPT-4.1 and DeepSeek-v3 compared to three expert evaluations on the shared evaluation subset, reporting precision, recall, and F1 score. g . 
Pairwise inter-rater agreement between experts and LLMs quantified by Cohen's kappa coefficients, demonstrating substantial consistency across all evaluators.\n\nGPT-4.\n\nAutomated Evaluation\n\nDeepSeek-v3 Automated Evaluation\n\nManual Evaluation 0936| 0.0307 0,8875 0,8880 0 8700 0.7160 0.4\n\nExpert1's Evaluation as Reference\n\nExpert2's Evaluation as Reference\n\nExpert3's Evaluation as Reference\n\nPairvise Cohen's 0 9761 09761 0 0602 00760 0.9502 00537 0,9503 0 9440 0.5663 08143 0,8818 0 5446 0.6762 0,8853 0.5446 0.6906 06818 0.6008 0 6560 GPT-4,1 DeepSeek-v3 GPT-4.1 Correlale Corelate Cause Inhon Irhon Cotcat Inlatact Colrcat Kappa ison", "is achieved by checking whether the entity appears in the Exact Keywords list of a specific node. Alternatively, semantic vectors of the query entities can be compared with those in the KG to identify the most similar entities, enabling semantic similarity matching. This approach is particularly beneficial for entities with multiple names, ensuring accurate linking even when not all aliases are captured in the Exact Keywords list.", "The relationships between entities are characterized by three key attributes. Confidence reflects the reliability of the relationship, with higher values indicating greater certainty based on its frequency across multiple sources. The PubMed IDs attribute lists the PubMed identifiers of the papers from which the relationship is derived, enabling easy access to the original publications via the PubMed website 2 . If the relationship appears in multiple papers, all relevant PubMed IDs are included, further increasing the confidence score. Finally, Timestamp denotes the most recent occurrence of the relationship, specifically the publication date of the latest paper. Notably, while Timestamp captures only the latest appearance, the full temporal span of the relationship-including its earliest mention-can be readily retrieved through the associated PubMed IDs via the PubMed website. These attributes collectively enhance the traceability, accuracy, and temporal relevance of the relationships within the KG.", "4 Methods\n4.2.2 Constructor Agent\n\nA chemical/drug treats a disease. The Treat relationship typically occurs between Chemical and Disease.\n\nMeSH (Medical Subject Headings)\n\ndbSNP, otherwise HGNV format\n\nNCBI Taxonomy\nCell Line\nCellosaurus\n\nYour task is to select the most appropriate relation\n\nbetween two medical entities to form more\n\nreasonable knowledge triple.\n\nThere is an and Now, a new between e1 and\n\ne2 is proposed.\n\nPlease decide which relation should be retained\n\nbetween e1 and e2.\n\nIf r1 should be kept, respond with \"Y\".\n\nIf r2 should replace it, respond with \"N\".\n\nYou may consider the following two factors to assist\n\nyour decision:\n\n(1) The\n, and\nthat of\n\n;\n\n(2) The\nfor\n\nIn general, relations with\n\nhigher confidence scores or more recent timestamps are likely\n\nretained.\n\nYour output should contain only \"Y\" or \"N\". Do not\n\nprovide any explanations.\n\nOutput:\n\nc", "Extended Data Figure 2: a . Prompt template for relation extraction. Given a biomedical abstract and its extracted entities, the Extractor Agent prompts the LLM to infer semantic relations between entity pairs using a predefined relation set and textual descriptions. b . Reference terminologies for entity normalization. 
Each biomedical entity type is mapped to a standard terminology: Gene (NCBI Gene), Disease and Chemical (MeSH), Variant (dbSNP or HGNV), Species (NCBI Taxonomy), and Cell Line (Cellosaurus). c . Prompt design for relation conflict resolution. When conflicting relations exist between the same entity pair, the Constructor Agent prompts the LLM to select the most appropriate one based on confidence scores and timestamps. d . Schema for predefined relation types. The 12 core relation types-seven bidirectional and five unidirectional-are listed alongside their directionality, descriptions, and allowed entity-type combinations.", "4.3 Quality Assessment\n\nWe assessed the quality of relational triples extracted by the Extractor Agent through both automated and manual evaluations, leveraging two state-of-the-art LLMs-GPT-4.1 [74] and DeepSeek-v3 [75]-as well as three PhD students with interdisciplinary expertise in medicine and computer science. For each medical abstract and its corresponding set of extracted triples, individual triples were evaluated using a standardized four-level scoring rubric: 3.0 (Correct), 2.0 (Likely Correct), 1.0 (Likely Incorrect), and 0.0 (Incorrect). The evaluation prompt provided to both LLMs and human annotators is illustrated in Extended Data Figure 3 a .", "A relational triple was defined as valid if it received a score of ≥ 2 . 0 . The validity rate was calculated as:\n\nTo assess the reliability of automatic evaluation, we compared LLM-based assessments with human annotations on a shared evaluation subset, treating human judgments as ground truth. The precision, recall, and F 1 -score of the automatic evaluations were computed as:\n\nwhere TP, FP, and FN represent true positives, false positives, and false negatives, respectively. To further quantify inter-rater agreement, we calculated Cohen's Kappa coefficient [82] for each pair of evaluators, including both LLMs and human annotators, resulting in 10 pairwise comparisons across the five raters. The Kappa coefficient was computed as:\n\nwhere p 0 represents the observed agreement and p e denotes the expected agreement by chance. This analysis provides a quantitative measure of rating consistency across evaluators.", "4.4 Retrieval-Augmented Generation\n\nThe constructed KG serves as a reliable external source for information retrieval and can be integrated into LLMs via a RAG framework. By providing structured biomedical context, the KG enhances LLM performance across a range of medical QA benchmarks.", "Given a user query q , we first extract the set of medical entities present in the question, denoted as E q = { e q 1 , e q 2 , · · · } . When using PubTator3 [80]-the same entity recognition tool employed during KG constructioneach extracted entity is assigned a unique identifier. This allows for efficient entity linking by matching these identifiers to the corresponding nodes N q = { n q 1 , n q 2 , · · · } within the graph. Alternatively, if medical entities are extracted using other methods-such as prompting a LLM-they may lack standardized identifiers. In such cases, the extracted entity mentions are first converted to lowercase and matched against the Exact Keywords attribute of each node in the KG. A successful match enables linkage of the entity to the corresponding graph node. In both approaches, if an entity cannot be linked via its identifier or if its surface form does not appear in any node's Exact Keywords list, we apply a semantic similarity strategy to complete the entity linking process. 
Specifically, the embedding of the query entity is computed using the same model employed for generating node-level semantic representations ( i.e. , BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext [81]) and is compared against the Semantic Embedding of all nodes in the KG. The entity is then linked to the node with the highest semantic similarity score, which may correspond to either the exact concept or a semantically related medical entity. This entity linking framework-combining identifier-based matching, lexical normalization, and semantic embedding-ensures robust and flexible integration of KG-derived knowledge into downstream QA tasks.", "Following entity linking, we construct evidence subgraphs using a neighbor-based exploration strategy [86] to enhance the reasoning capabilities of LLMs. For each entity-linked node in the query-specific set N q , we retrieve its one-hop neighbors within the KG. Specifically, for each node n q i ∈ N q , all adjacent nodes n q ′ i are identified, and the corresponding triples ( n q i , r, n q ′ i ) are appended to form a localized subgraph G q i . This expansion captures the immediate relational context surrounding the query entities, which is essential for enabling fine-grained medical reasoning. The complete evidence set for a given query is then defined as the union of these localized subgraphs: G q = { G q 1 , G q 2 , · · · } . The resulting subgraph G q may contain a large number of relational triples, including redundant or irrelevant information, which can adversely impact LLM reasoning [87]. To address this, we leverage the LLM's inherent ranking capability to selectively filter high-value knowledge [88]. Given the question q and", "You are tasked with evaluating the validity of the knowledge triples extracted from the abstract of a medical paper.\n\nGiven the abstract (\n\n) of a medical paper and the extracted triples\n\n) from this abstract.\n\nEach triple is represented in the format:\n\n\"Head Entity Name (Alias1, Alias2) | Relationship Name | Tail Entity Name (Alias1, Alias2)\"\n\n,\n\nwith triples separated by ' $ '.", "Some entities may have no aliases or multiple aliases, which are separated by ', ' within the '()'.\n\nYour task is to evaluate the validity of each triple, with a particular focus on the\n\nrelationship\n\nit describes, based on the information provided in the abstract. Consider whether the stated relationship accurately\n\nreflects the connection between the head and tail entities as presented or implied in the text.", "For each triple, evaluate its validity using the following scoring scale and assign a confidence score:\n\n•\n\nCorrect (3.0):\n\nThe relationship logically and accurately describes the relation between the head and tail entities as\n\nexplicitly mentioned or directly and strongly supported\n\nby the abstract. The\n\nrelationship type is\nprecise\n\nand the connection is\n\nundeniable\n\nbased on the text, requiring minimal inference.\n\nLikely Correct (2.0):\n\nThe relationship is\n\ngenerally acceptable and directionally correct\n\n. 
The core connection between the entities is\n\nvalid and supported by the text (explicitly, implicitly, or via\n\nreasonable inference)\n\n, even if the relationship type has\n\nminor inaccuracies or lacks ideal precision\n\n.\n\nLikely Incorrect (1.0):\n\nsubstantially inaccurate or misleading\n\nsignificantly misrepresenting\n\nthe connection described in the abstract, even if the entities are mentioned together.\n\nIncorrect (0.0):\n\nnot supported by the abstract whatsoever\n\n, is\n\nclearly and undeniably contradicted\n\nby the text, or involves a\n\nfundamental misunderstanding\n\nof the entities or their\n\nconnection as presented.\n\nOutput the evaluation in a fixed format:\n\nFirst line: 'Analysis: ' followed by the analysis of all triples, separated by '; '. Each triple's analysis should explain\n\nwhy\n\nthe specific confidence score (3.0, 2.0, 1.0, or 0.0) was assigned based on the criteria\n\nabove and the abstract's content.", "Second line: Only the numerical confidence scores for all triples, separated by ' $ ', in the same order as the input triples (e.g., 3.0 $ 2.0 $ 1.0 $ 0.0). This line must contain only numbers (formatted to one\n\ndecimal places like 3.0, 2.0, 1.0, 0.0), decimal points, and ' $ ' as separator, with no additional text or English letters.", "5 Results\n5.1 Main Results\n\n| | Model | FR (%) | DC (%) | UCS (/5) |\n|---:|:-------------------|:-----------|:-----------|:-----------|\n| 0 | Stateless LLM | 54.1 (0.4) | 48.3 (0.5) | 2.1 (0.1) |\n| 1 | Vector RAG | 71.6 (0.6) | 66.4 (0.7) | 3.4 (0.1) |\n| 2 | Entity-RAG | 75.9 (0.5) | 72.2 (0.6) | 3.7 (0.1) |\n| 3 | Semantic Anchoring | 83.5 (0.3) | 80.8 (0.4) | 4.3 (0.1) |\n\nTable 1: Overall performance on MultiWOZ-Long. Semantic Anchoring outperforms all baselines across metrics. Improvements in FR and DC are statistically significant at p < 0 . 01 ; UCS gains are significant at p < 0 . 05 . Values are mean ± stdev over three runs.", "Figure 2 analyzes how performance varies with session depth. While all models degrade as dialogue span increases, Semantic Anchoring sustains over 75% recall at 10 sessions, indicating stronger long-range tracking.", "5.2 Per-Dataset Breakdown\n\nTo test generality, we evaluate on DialogRE-L , which emphasizes relation extraction across sessions. Results in Table 2 show consistent improvements, though broader domains are needed to claim robustness.", "Figure 2: Factual Recall by session depth on MultiWOZ-Long. Semantic Anchoring exhibits the slowest degradation, maintaining > 75% recall at 10-session distance. Error bars denote standard deviation across three runs.\n\nFactual Recall vs. Session Depth (MultiWOZ-Long)\n\nStateless LLM Vector RAG Entity-RAG Semantic Anchoring Session Depth\n\n|---:|:-------------------|---------:|---------:|-----------:|\n| 0 | Stateless LLM | 49.8 | 44.1 | 2 |\n| 1 | Vector RAG | 68.7 | 62.5 | 3.2 |\n| 2 | Entity-RAG | 72.1 | 68.3 | 3.6 |\n| 3 | Semantic Anchoring | 81.4 | 77.9 | 4.2 |\n\nTable 2: Performance on DialogRE-L. Semantic Anchoring achieves consistent gains across metrics, suggesting effectiveness in relation extraction tasks that require long-range entity tracking.", "5.3 Ablation Studies\n\nTable 3 examines the role of linguistic components. Removing discourse tagging reduces FR by 4.7 points, while excluding coreference resolution reduces DC by 6.2 points. Eliminating all symbolic features collapses performance to Vector RAG levels. 
These results align with observed error patterns (§5.6), underscoring the value of symbolic features.", "5.4 Qualitative Examples\n\nIn MultiWOZ-Long, when the user later asks 'Did he confirm the time for the taxi?' , Semantic Anchoring retrieves:\n\n[Entity: John Smith][CorefID: E17] confirmed the taxi is booked for 9 AM.", "By contrast, Vector RAG surfaces unrelated mentions of 'taxi.' Additional examples, including cases where Semantic Anchoring fails, are shown in Appendix C.", "| | Variant | FR (%) | DC (%) | UCS (/5) |\n|---:|:-------------------------|---------:|---------:|-----------:|\n| 0 | Full Model | 83.5 | 80.8 | 4.3 |\n| 1 | - Discourse Tagging | 78.8 | 75.6 | 4 |\n| 2 | - Coreference Resolution | 80.1 | 74.6 | 4.1 |\n| 3 | - Dependency Parsing | 81.2 | 78.5 | 4.1 |\n| 4 | Dense-only (Vector RAG) | 71.6 | 66.4 | 3.4 |\n\nTable 3: Ablation results on MultiWOZ-Long. Removing discourse or coreference modules significantly reduces FR and DC, respectively. Without all symbolic features, performance falls to the dense-only baseline.", "5.5 Human Evaluation\n\nFive trained annotators rated 50 randomly sampled conversations for User Continuity Satisfaction (UCS). Agreement was high ( α = 0 . 81 ). As Table 1 shows, Semantic Anchoring achieves the highest UCS (4.3), with annotators noting better consistency in entity references. Full protocol details are in Appendix B.", "5.6 Error Analysis\n\nTable 4 categorizes common failures. Coreference mistakes (27%) and parsing errors (19%) are the most frequent, consistent with ablation findings. Discourse mislabeling (15%) often arises in sarcasm or overlapping speech. While overall error frequency is lower than dense retrieval, these remain open challenges.", "| | Error Type | Proportion of Failures |\n|---:|:----------------------|:-------------------------|\n| 0 | Parsing errors | 19% |\n| 1 | Coreference mistakes | 27% |\n| 2 | Discourse mislabeling | 15% |\n| 3 | Other / miscellaneous | 39% |\n\nTable 4: Error analysis on MultiWOZ-Long. Coreference mistakes are the most frequent error type, followed by parsing and discourse issues. These patterns align with ablation results."],
"seed_texts": ["Entity Linking\n\nEntity Linking (EL) has evolved from text-only methods to Multimodal Entity Linking (MEL), and more recently to Cross-Modal Entity Linking (CMEL), which supports crossmodal reasoning. Traditional EL methods associate textual entities with their corresponding entries in a knowledge base, but overlook non-textual information (Shen, Wang, and Han 2015; Shen et al. 2023). MEL extends EL by incorporating visual information as auxiliary attributes to enhance alignment between entities and knowledge base entries (Gan et al. 2021; Liu et al. 2024b; Song et al. 2024).", "Insights from this pilot experiment revealed two major categories of errors LLMs tend to make in this task: semantic inaccuracies and structural inconsistencies. Semantic inaccuracies occur when LLMs fail to link the correct properties and entities in ORKG, despite generating SPARQL queries with correct structure. Our observations reveal that LLMs tend to rely on the example provided in the one-shot learning process to generate the correct structure for a certain type", "We visualized a local subgraph of the constructed KG with COVID-19 as the central node, highlighting five surrounding relationship triples, as shown in Figure 2 d . Each node is characterized by six key attributes: the Identifier, which uniquely references the node and normalizes multiple synonymous mentions to a standardized terminology entry; the Entity Type, which classifies the entity; the Terminology, which maps the entity type to its corresponding standard terminology; the Page Link, providing a reference to the entity in the Terminology; the Exact Keywords, which lists common names and aliases of the entity in lowercase; and the Semantic Embedding, a vector representation of the entity. In practice, these attributes facilitate entity linking within a query by matching entities to their corresponding nodes in the KG. When the Identifier of an entity in the query is available, entity linking can be efficiently performed using this unique reference. In the absence of an Identifier, precise matching", "Given a user query q , we first extract the set of medical entities present in the question, denoted as E q = { e q 1 , e q 2 , · · · } . When using PubTator3 [80]-the same entity recognition tool employed during KG constructioneach extracted entity is assigned a unique identifier. This allows for efficient entity linking by matching these identifiers to the corresponding nodes N q = { n q 1 , n q 2 , · · · } within the graph. Alternatively, if medical entities are extracted using other methods-such as prompting a LLM-they may lack standardized identifiers. In such cases, the extracted entity mentions are first converted to lowercase and matched against the Exact Keywords attribute of each node in the KG. A successful match enables linkage of the entity to the corresponding graph node. In both approaches, if an entity cannot be linked via its identifier or if its surface form does not appear in any node's Exact Keywords list, we apply a semantic similarity strategy to complete the entity linking process. Specifically, the embedding of the query entity is computed using the same model employed for generating node-level semantic representations ( i.e. , BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext [81]) and is compared against the Semantic Embedding of all nodes in the KG. 
The entity is then linked to the node with the highest semantic similarity score, which may correspond to either the exact concept or a semantically related medical entity. This entity linking framework-combining identifier-based matching, lexical normalization, and semantic embedding-ensures robust and flexible integration of KG-derived knowledge into downstream QA tasks.", "5.6 Error Analysis\n\nTable 4 categorizes common failures. Coreference mistakes (27%) and parsing errors (19%) are the most frequent, consistent with ablation findings. Discourse mislabeling (15%) often arises in sarcasm or overlapping speech. While overall error frequency is lower than dense retrieval, these remain open challenges."]
Here we can see the corpus didn’t actually contain direct evidence for what was asked, but the seed chunks were more self-contained, so faithfulness scored higher when the judge was only given the seed chunks.
Answer relevancy was punished hard (down to 0.0) because the answer couldn’t address the question directly.
This one was an outlier, though; in general, full faithfulness is higher than seed faithfulness.
You can also check a good one to understand how the judge scores the ones that pass.
"question": "What retrieval settings help reduce needle-in-a-haystack issues for medical QA?",
"answer": "To reduce "needle‑in‑a‑haystack" problems in medical QA, the retrieval setup should focus on bringing back a small, very focused set of highly relevant items instead of large, noisy context.\n\n1. Control context size and granularity\n- Use abstracts or snippets rather than full articles; long, unfocused context diluted model performance in biomedical QA, so reranking down to a smaller, high‑quality subset was necessary for good answers [1].\n- Limit the number of documents passed to the model (e.g., rerank from thousands down to a few hundred or less) to avoid disorienting the model with too much text [1].\n\n2. Use multi‑stage retrieval\n- Start with high‑recall lexical retrieval (e.g., BM25 over PubMed) to get a broad set of candidates (up to thousands) [1].\n- Apply a strong semantic reranker to pick only the top slice for RAG (e.g., top 300) [1].\n- If evidence is missing, fall back to broader web‑or multi‑source search over trusted medical sites to fill gaps, rather than over‑stuffing the first pass with everything [1].\n\n3. Improve query quality\n- Use an LLM to rewrite or expand queries (synonyms, related terms, less rare jargon) when initial recall is poor, boosting the chance that the key paper is retrieved without over‑expanding every query by default [1].\n\n4. Use structured, corpus‑aware retrieval\n- Retrieve from structured medical knowledge sources such as knowledge graphs and prepend only the most relevant subgraph, which consistently improved accuracy across several medical QA benchmarks [3].\n- In multimodal/heterogeneous setups, use corpus‑specific retrieval for different knowledge sources so each corpus is searched with methods tuned to its content, instead of one generic retriever over everything [2].",
"full_faithfulness": 1.0,
"seed_faithfulness": 0.8636363636000001,
"answer_relevancy": 0.9135841092,
"context_relevance": 0.8976322813389481,
"context_relevance_reason": "The context passages provide comprehensive insights into retrieval settings that can mitigate needle-in-a-haystack issues in medical QA. Specifically, the discussion on the integration of LLMs for information retrieval, the use of semantic reranking, and the multi-stage retrieval approach directly addresses the user's question. The emphasis on maintaining relevance while expanding query coverage and the mention of ensemble methods highlight effective strategies for improving retrieval accuracy in complex biomedical queries. However, while the information is highly relevant, a more explicit connection to specific 'needle-in-a-haystack' challenges could enhance clarity.",
"hallucination_score": 0.8893376167284271,
"full_contexts": ["Abstract\n\nBiomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.", "3. Our methodology\n\n3.1. Information Retrieval Pipeline\n\nTo support high-quality RAG for Phase A+, we developed an IR pipeline that integrates traditional lexical search with LLM-based query generation and semantic reranking (Fig. 1).", "If the initial query returns fewer than five documents, we invoke Gemini 2.5 Pro Preview (05-06) to automatically revise the query. The model is prompted to enhance retrieval recall by enabling approximate matching and omitting overly rare or domain-specific terms. This refinement step is done to improve the query coverage while maintaining relevance. Our experiments have shown that this process is required in less than 5% of the queries in the BioASQ 13 test set.", "We index all PubMed article titles and abstracts in an Elasticsearch instance, using BM25 retrieval as the ranking function. For each input question, we use Gemini 2.0 Flash to generate a structured Elasticsearch query that captures the semantic intent of the question using synonyms, related terms, and full boolean query string syntax rules supported by Elasticsearch. This query is validated using regular expressions and then is used to retrieve up to 10,000 documents.", "Following document retrieval, we apply a semantic reranking model (Google semantic-ranker-default004) to reduce the number of candidate documents [11]. This model re-scores the initially retrieved documents based on semantic similarity to the original question, allowing us to select the top 300 most relevant documents. This reranked subset is used for downstream RAG-based QA, since despite really long context supported by modern Transformer architectures [12, 13], we could not get adequate QA results on full article abstracts without this step.", "Finally, we have added additional IR searches to handle the cases where a QA step does not return a response based on the evidence retrieved from Elasticsearch. 
We have observed that Elasticsearch context might not provide sufficient evidence for QA in 3-7% of test cases for Phase A+, depending on the batch. An automated process is used to expand IR sources to address these cases. First, we are using a Google search restricted to PubMed sources to attempt to find new matches. If that fails, we extend our sources to include Home of the Office of Health Promotion and Disease Prevention, WebMD,\n\nThis multi-stage retrieval approach, combining LLM-generated queries, a traditional BM25 search, and semantic reranking, enables flexible, high-recall, and high-precision document selection tailored to complex biomedical queries.", "Figure 1: IR process\n\nPubMed corpus in Elasticsearch Query Generation (Gemini 2.0 Flash) Query Valida- tion and IR (BM25, ≤ 10,000 docs) Results < Refinement 2.5 Pro) Reranking (semantic- reranker-4) Top 300 Articles for RAG No Yes Refine\n\nHealthline, and Wikipedia. This ensures that we have an answer candidate for all questions in Phase A+ test sets.", "3.2. Question Answering Pipeline\n\nWe adopt a unified, zero-shot QA framework for both Phase A+ and Phase B of the challenge. While the core QA procedure remains consistent across phases, Phase A+ incorporates an additional IR step to verify the presence of candidate answers within relevant documents (described at the end of Section 3.1). This ensures that selected documents contain sufficient information to support answer generation.", "To generate candidate answers, we leverage several large language models (LLMs): Gemini 2.0 Flash, Gemini 2.5 Flash Preview (2025-04-17), and Claude 3.7 Sonnet (2025-02-19). Prompts are adjusted using examples derived from the BioASQ 11 test set, improving the response structure and quality.", "The system uses zero-shot prompting, tailored to the question type: Yes/No, Factoid, or List. We experiment with multiple types of input context: (1) IR-derived results from Phase A+, (2) curated snippets provided in Phase B, and (3) full abstracts of articles selected during Phase B. This allows us to examine the influence of context granularity on answer accuracy and completeness.", "To consolidate candidate answers, we perform a secondary synthesis step using Gemini 2.0 Flash. This model is prompted to resolve any contradictions, select the most precise and specific answer components, and integrate complementary information into a single, unified response. As part of this step, the model also returns a confidence score estimating the reliability of the synthesized answer. If the score is below a predefined threshold (0.5, determined empirically), the synthesis is re-run with reduced sampling temperature (from 0.1 to 0.0) to improve determinism. 
This synthesis process is evaluated using the BioASQ 12 dataset to ensure consistency with benchmark standards.", "Table 1\n\nResults of our runs on BioASQ 13 Phase A+, Yes/No questions.", "| | Batch | System | Accuracy | Ranking |\n|---:|:--------|:------------------|-----------:|----------:|\n| 0 | 3 | Extractive | 0.73 | 41 |\n| 1 | | (last) | 0.23 | 58 |\n| 2 | 4 | Extractive | 0.92 | 1 |\n| 3 | | Simple truncation | 0.88 | 11 |\n| 4 | | Kmeans | 0.65 | 67 |\n| 5 | | (last) | 0.65 | 67 |\n\nTable 2\n\nResults of our runs on BioASQ 13 Phase A+, Factoid questions.", "| | Batch | System | MRR | Ranking |\n|---:|:--------|:------------------|------:|----------:|\n| 0 | 3 | Extractive | 0.14 | 41 |\n| 1 | | (last) | 0.05 | 47 |\n| 2 | 4 | Extractive | 0.43 | 17 |\n| 3 | | Simple truncation | 0.29 | 51 |\n| 4 | | Kmeans | 0.05 | 62 |\n| 5 | | (last) | 0.05 | 62 |", "2 Related Work\n\nMedical Report Retrieval for Generation. Existing Medical MMRAG approaches primarily utilize the medical images to retrieve relevant reports (He et al. 2024; Sun et al. 2025; Xia et al. 2024, 2025). For instance, FactMM-RAG (Sun et al. 2025) enhances report generation by incorporating high-quality reference reports. Similarly, RULE (Xia et al. 2024) and MMed-RAG (Xia et al. 2025) integrate reference reports and employ preference fine-tuning to improve model utilization of retrieved reports. Although these approaches improve the factual accuracy of responses, they neglect the retrieval of medical documents, which are crucial for Med-LVLM's reliable inference.", "Medical Document Retrieval for Generation. Acknowledging the limitations of report-only retrieval, recent studies have increasingly emphasized medical documents as knowledge sources (Choi et al. 2025; Shaaban et al. 2025; Wu et al. 2025; Hamza et al. 2025). Among them, MKGF (Wu et al. 2025) and K-LLaVA (Hamza et al. 2025) both employ multimodal retrievers to fetch documents from the database, aiming to mitigate hallucination issues in language models. ChatCAD+ (Zhao et al. 2024b) and MIRA (Wang et al. 2025) utilize a zero-shot query rewriting module for retrieval. Nevertheless, these retrieval methods overlook the substantial content differences among various corpora, lacking corpus-specific retrieval mechanisms.", "6 Conclusion\n\nThis work addresses the critical challenges of effective retrieval and multi-aspect alignment for heterogeneous knowledge in the Medical MMRAG field. MedAtlas provides a rich, multi-source knowledge base for medical multimodal tasks. The HeteroRAG framework enables precise report retrieval and multi-corpus retrieval, followed by aligning heterogeneous retrieval results through Heterogeneous Knowledge Preference Tuning. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple medical VQA and report generation benchmarks. Our work paves the way for effectively integrating multi-source medical knowledge, advancing the reliability and applicability of Med-LVLMs in clinical scenarios.", "2 Results\n\n2.3 Quality Assessment of Extracted Relationship Triples\n\nFor automated evaluation, two state-of-the-art LLMs, GPT-4.1 [74] and DeepSeek-v3 [75], were employed. A random subset comprising 1% of the abstracts (n = 34,725), resulting in 83,438 extracted triples, was selected for evaluation. 
Each abstract and its corresponding triples were formatted into structured prompts and independently assessed by both models according to a standardized four-tier rubric: Correct (3.0), Likely Correct (2.0), Likely Incorrect (1.0), and Incorrect (0.0) (the specific evaluation prompt is illustrated in Extended Data Figure 3 a ). Triples receiving scores of ≥ 2 . 0 were deemed valid. The evaluation outcomes are presented in Figure 3 a and b , illustrating the proportion of valid triples across relation types for GPT-4.1 and DeepSeek-v3, respectively. Both models demonstrated high overall accuracy, with 85.44% and 88.10% of triples rated as valid by\n\n2 https://pubmed.ncbi.nlm.nih.gov/", "GPT-4.1 and DeepSeek-v3, respectively. For most relation types, validity was approximately 90%, except for Negative Correlate, which exhibited slightly lower agreement. These findings underscore the high precision of the Extractor Agent across diverse biomedical relation types and support its utility for downstream analyses.", "In parallel, a manual evaluation was conducted to further validate extraction accuracy. Three domain experts with doctoral-level training in artificial intelligence and medicine independently reviewed a randomly selected subset of 400 abstracts, comprising 1,060 extracted triples. Each abstract and its associated triples were evaluated using the same standardized scoring rubric. Triples receiving scores of ≥ 2.0 were considered valid. As shown in Figure 3 c , all three reviewers demonstrated high consistency, with overall validity rates exceeding 86% across assessors. The close concordance between manual and automated evaluations further substantiates the robustness of the Extractor Agent in accurately capturing biomedical relationships, providing strong support for the application of the extracted knowledge in large-scale medical analyses.", "To further validate the reliability of the LLM-based assessments, we used three expert annotations as reference standards to evaluate GPT-4.1 and DeepSeek-v3 on the same subset of 400 abstracts, respectively. As shown in Figure 3 d -f , both models exhibited strong concordance with expert evaluations, achieving precision, recall, and F1 scores of approximately 95% across metrics. These results further corroborate the accuracy of the automated scoring framework and its alignment with expert judgment.", "Finally, inter-rater agreement was assessed across all evaluators-including three human experts and two LLMs-by computing pairwise Cohen's kappa coefficients on a shared evaluation subset (Figure 3 g ) [82]. Most pairwise comparisons (80%) yielded kappa values exceeding 0.6, indicating substantial agreement-an accepted threshold for reliable concordance in domains involving subjective judgment, including medicine, psychology, and natural language processing [83]. The coefficients between expert 1 and expert 2 (0.5663), and between expert 2 and expert 3 (0.5446), fell slightly below this threshold but still reflected moderate agreement, closely approaching the substantial range. These findings demonstrate strong inter-rater reliability across both human and automated evaluators, underscoring the robustness and reproducibility of the evaluation framework.", "2.4 Evaluating Downstream Utility in Medical Question Answering\n\nWe evaluated the downstream utility of our constructed KG as a RAG information source across seven multiplechoice medical QA datasets. 
These included four widely used benchmarks [76]-MMLU-Med, MedQA-US, PubMedQA*, and BioASQ-Y/N-spanning a broad spectrum of clinical and biomedical reasoning tasks. To further assess diagnostic reasoning under varying complexity, we introduce MedDDx, a newly developed benchmark suite focused on differential diagnosis [77]. Questions are stratified into three levels-MedDDx-Basic, MedDDxIntermediate, and MedDDx-Expert-based on the variance in semantic similarity among answer choices. All MedDDx subsets were designed to reduce training data leakage and more closely reflect authentic clinical reasoning. Detailed dataset statistics are shown in Figure 4 a . We systematically evaluated five state-of-the-art LLMs to measure the impact of KG-based retrieval. Each model was tested in a zero-shot setting under two conditions: (1) direct answering using internal knowledge alone, and (2) RAG, with relevant KG subgraphs prepended as external context. The models-GPT-4-turbo, GPT-3.5-turbo (OpenAI) [78], DeepSeek-v3 (DeepSeek) [75], Qwen-Max, and Qwen-Plus (Qwen) [79]-span diverse architectures and training regimes, representing both proprietary and open-source systems. All models were accessed via publicly available APIs without additional fine-tuning. Version details and access endpoints are summarized in Figure 4 b .", "Figures 4 c -i present model performance across the seven medical QA datasets using radar plots, each depicting the five LLMs under both direct answering (w/o RAG) and RAG conditions (w/ RAG). Notably, the background shading in the radar plots is lighter for the MedDDx suite (Figure 4 g -i ) than for the four widely used benchmarks (Figure 4 c -f ), reflecting the overall lower accuracy of all models on these recently introduced and semantically more challenging datasets. This contrast highlights the greater complexity and reduced risk of training data leakage inherent to the MedDDx design. Across all datasets, RAG with our KG consistently outperformed direct answering. The most substantial improvements were observed in tasks requiring deeper clinical reasoning, such as MedQA-US and the MedDDx suite. For example, on MedQA-US, GPT-3.5-turbo improved from 0.5986 to 0.6834 (+8.5 percentage points), and Qwen-Max from 0.7306 to 0.7636. On MedDDx-Expert, RAG yielded absolute gains of up to +8.6 points for GPT-3.5-turbo and +5.7 points for Qwen-Max. Even in knowledge-intensive but semantically simpler tasks such as MMLU-Med and BioASQ-Y/N, RAG offered modest yet consistent benefits. On MMLU-Med, GPT-4-turbo improved from 0.8724 to 0.9054, while DeepSeek-v3 achieved the highest score overall at 0.9183 with KG support. In BioASQ-Y/N, RAG further enhanced already strong performance, with four models exceeding 0.85 accuracy following augmentation. Notably, several models performed better on MedDDx-Expert than on MedDDx-Basic, despite the former being constructed with higher semantic complexity. This counterintuitive trend may be related to differences in distractor framing, where Expert-level distractors-", "Figure 4: Overview of evaluation datasets, model configurations, and performance across medical QA tasks. a . Dataset statistics for the seven medical QA benchmarks used in this study. The benchmark suite includes four widely adopted datasets [76] (MMLU-Med, MedQA-US, PubMedQA*, and BioASQ-Y/N) and three newly developed differential diagnosis datasets [77] (MedDDx-Basic, MedDDx-Intermediate, and MedDDx-Expert). 
For each dataset, we report the number of multiple-choice questions and the corresponding answer option formats. b . Configuration of the five LLMs evaluated: GPT-4-turbo, GPT-3.5-turbo (OpenAI) [78], DeepSeek-v3 (DeepSeek) [75], Qwen-Max, and Qwen-Plus (Qwen) [79]. All models were accessed through public APIs in their zero-shot settings without fine-tuning. The specific version identifiers and access platforms are indicated. c -i . Model performance across the seven QA datasets, shown as radar plots. Each chart compares zero-shot accuracy for five LLMs under two conditions: direct answering without retrieval (w/o RAG) and RAG with our KG (w/ RAG). Across all datasets, RAG with our KG consistently outperformed direct answering.\n\nDatasets Size Options MMLU-Med 1,089 A/B/C/D MedQA-US 1,273 PubMedQA* Yes/No/Maybe BioASQ-Y/N Yes/No MedDDx-Basic MedDDx-Intermediate 1,041 MedDDx-Expert Provider Model Version Accessed URL OpenAI GPT-4-turbo\n\nhttps://platform.openai.com/docs/models/gpt-4-turbo\n\nGPT-3.5-turbo\n\nhttps://platform.openai.com/docs/models/gpt-3.5-turbo\n\nDeepSeek\nDeepSeek-v3", "https://huggingface.co/deepseek-ai/DeepSeek-V3\n\nQwen\nQwen-Max\n\nhttps://www.alibabacloud.com/help/en/model-\n\nstudio/what-is-qwen-llm Qwen-Plus b BioASQ-YIN w/o RAG RAG 0.9054 0.8130 0.5780 0.8625 0.5660 0,5720 0.5520 0.7401 0.7880 0.4940 0.831 0.5300 0.8953 0.8834 0.9183 0.8036 h wlo RAG 0.5197 0.5437 0,5714 0.5207 0.5347 0.4890 0,4265 506- 0.3685 0.4204 0,.4688 0.5020 0,4720 0.5259 0.4990 0.5043 0.5592 0,5878 0.8935 0.8576 7855| 0.8398 DeepSe -Max Seek-v3 0,5135 ) 5673 0.5469 0.4700", "Figure 5: Case study of tocilizumab for literature-based discovery and drug repurposing within the KG. a . Known association between tocilizumab and rheumatoid arthritis, supported by multiple publications, with the earliest reported date defined by the first extracted supporting paper. b . Two multi-hop reasoning paths linking tocilizumab to COVID-19 via intermediate genes FGB and TNF. The inferred Treat relation (red arrow) was derived solely from earlier literature, while later studies validated this prediction (green arrow). The temporal order of evidence highlights the KG's capacity to anticipate therapeutic connections prior to their recognition in the literature.\n\ntociliz-\numab\n\nIdentifier:\n\nMESH:C502936\n\nEntity Type:\n\nChemical\n\nTerminology:\n\nNCBI MeSH\nPage Link\n\n: meshb.nlm.nih.gov/record/ui?ui=C502936\n\nTreat Arthritis Rheum atoid MESH:D001172 Disease\n\n: meshb.nlm.nih.gov/record/ui?ui=D001172\n\nConfidence\n: 0.999999925\nPubMed IDs\n\n:\n\n26374404,27958380,29146040,30859494,308\n\n88472,32844216,35713462,36688476\n\nEarliest Reported Date\n: 2016-07-01\n\nmeshb.nlm.nih.gov/record/ui?ui=C502936\n\nFGB Gene Terminology NCBI Gene\n\n: www.ncbi.nlm.nih.gov/gene/2244\n\nCOVID -19 Identifier : MESH:D000086382 : NCBI MeSH\n\nmeshb.nlm.nih.gov/record/ui?ui=D000086382\n\nTNF"],
"seed_texts": ["Abstract\n\nBiomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.", "Finally, we have added additional IR searches to handle the cases where a QA step does not return a response based on the evidence retrieved from Elasticsearch. We have observed that Elasticsearch context might not provide sufficient evidence for QA in 3-7% of test cases for Phase A+, depending on the batch. An automated process is used to expand IR sources to address these cases. First, we are using a Google search restricted to PubMed sources to attempt to find new matches. If that fails, we extend our sources to include Home of the Office of Health Promotion and Disease Prevention, WebMD,\n\nThis multi-stage retrieval approach, combining LLM-generated queries, a traditional BM25 search, and semantic reranking, enables flexible, high-recall, and high-precision document selection tailored to complex biomedical queries.", "Medical Document Retrieval for Generation. Acknowledging the limitations of report-only retrieval, recent studies have increasingly emphasized medical documents as knowledge sources (Choi et al. 2025; Shaaban et al. 2025; Wu et al. 2025; Hamza et al. 2025). Among them, MKGF (Wu et al. 2025) and K-LLaVA (Hamza et al. 2025) both employ multimodal retrievers to fetch documents from the database, aiming to mitigate hallucination issues in language models. ChatCAD+ (Zhao et al. 2024b) and MIRA (Wang et al. 2025) utilize a zero-shot query rewriting module for retrieval. Nevertheless, these retrieval methods overlook the substantial content differences among various corpora, lacking corpus-specific retrieval mechanisms.", "6 Conclusion\n\nThis work addresses the critical challenges of effective retrieval and multi-aspect alignment for heterogeneous knowledge in the Medical MMRAG field. MedAtlas provides a rich, multi-source knowledge base for medical multimodal tasks. 
The HeteroRAG framework enables precise report retrieval and multi-corpus retrieval, followed by aligning heterogeneous retrieval results through Heterogeneous Knowledge Preference Tuning. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple medical VQA and report generation benchmarks. Our work paves the way for effectively integrating multi-source medical knowledge, advancing the reliability and applicability of Med-LVLMs in clinical scenarios.", "2.4 Evaluating Downstream Utility in Medical Question Answering\n\nWe evaluated the downstream utility of our constructed KG as a RAG information source across seven multiplechoice medical QA datasets. These included four widely used benchmarks [76]-MMLU-Med, MedQA-US, PubMedQA*, and BioASQ-Y/N-spanning a broad spectrum of clinical and biomedical reasoning tasks. To further assess diagnostic reasoning under varying complexity, we introduce MedDDx, a newly developed benchmark suite focused on differential diagnosis [77]. Questions are stratified into three levels-MedDDx-Basic, MedDDxIntermediate, and MedDDx-Expert-based on the variance in semantic similarity among answer choices. All MedDDx subsets were designed to reduce training data leakage and more closely reflect authentic clinical reasoning. Detailed dataset statistics are shown in Figure 4 a . We systematically evaluated five state-of-the-art LLMs to measure the impact of KG-based retrieval. Each model was tested in a zero-shot setting under two conditions: (1) direct answering using internal knowledge alone, and (2) RAG, with relevant KG subgraphs prepended as external context. The models-GPT-4-turbo, GPT-3.5-turbo (OpenAI) [78], DeepSeek-v3 (DeepSeek) [75], Qwen-Max, and Qwen-Plus (Qwen) [79]-span diverse architectures and training regimes, representing both proprietary and open-source systems. All models were accessed via publicly available APIs without additional fine-tuning. Version details and access endpoints are summarized in Figure 4 b ."]
Here, the expanded (full) context actually contains the missing evidence, which is why it scores higher than the seed chunks alone.
Frankly, I still find the answer a bit lackluster; there is nothing all that interesting in it. But that points to issues on the retrieval side.
Lastly, let’s check a few random questions and see how the judge evaluates those.
"question": "how do i measure if i\u2019m getting diverse sources instead of 10 chunks from the same pdf",
"answer": "You can treat \u201cdiverse sources\u201d as something you measure directly, not just eyeball.\n\n1. Simple source-diversity score (like in the paper)\nIn Document 1, they define a retrieval diversity metric as the fraction of unique review IDs among the top\u201110 chunks: Diversity = (number of distinct IDs) / 10 [1].\nYou can copy this idea for documents:\n- Give every PDF (or document) a unique ID.\n- When you retrieve K chunks (say K=10), look at which document each chunk came from.\n- Compute: Diversity = (number of distinct document IDs) / K.\n \u2022 If all 10 chunks come from the same PDF, Diversity = 1 / 10 = 0.1 (very low).\n \u2022 If all 10 chunks come from different PDFs, Diversity = 10 / 10 = 1.0 (perfect diversity).\n\nYou can log this number for each query to monitor whether you\u2019re repeatedly pulling from the same file or actually spreading across your corpus.\n\n2. Combine with relevance\nIn Document 1, they also track how close the chunks are to the query using an average cosine similarity between the query embedding and each chunk [1]. In practice you want both:\n- High similarity (on\u2011topic).\n- High diversity (many different source IDs).\n\n3. Optional: richer diversity later\nDocument 2 shows that diversity can also be measured at different levels (lexical, semantic, ingredient) when you have multiple outputs for one input [2]. By analogy, once you\u2019re happy with source diversity, you could later add:\n- Semantic diversity between chunks (are they saying different things?).\nBut the easiest starting point is the per-query \u2018distinct document IDs / K\u2019 metric.",
"full_faithfulness": 0.8823529412000001,
"seed_faithfulness": 0.5294117647000001,
"answer_relevancy": 0.8588673985,
"context_relevance": 0.5303662744007874,
"context_relevance_reason": "The context passages provide relevant information about measuring diversity in retrieval systems, particularly in the context of recipe adaptation and LLMs. However, while some passages discuss diversity metrics and retrieval methods, they do not directly address the user's specific question about measuring diverse sources versus multiple chunks from the same PDF. The relevance of the context is somewhat indirect, leading to a moderate score.",
"hallucination_score": 0.7209711030557213,
"hallucination_reason": "The response effectively outlines a method for measuring source diversity by introducing a simple source-diversity score and providing a clear formula. It aligns well with the context, which discusses retrieval diversity metrics. However, while it mentions combining relevance with diversity, it does not explicitly connect this to the context's focus on average cosine similarity, which could enhance the completeness of the answer. Overall, the claims are mostly supported, with minor gaps in direct references to the context."
"full_context": ["D. Question and Answering (QA)\n\nFor retrieval of reviews, we sampled five Spotify-centric queries and retrieved the top K = 10 review chunks for each. We measured two unsupervised metrics:\n\nAverage Cosine Similarity : the mean cosine similarity between each query embedding and its top-10 chunk embeddings.", "Retrieval Diversity : the fraction of unique review IDs among all retrieved chunks (distinct IDs / 10).\n\nOur retriever achieved perfect diversity and cosine scores from 0.618 to 0.754, demonstrating reliable, on-topic retrieval. Table IX summarizes these proxy metrics.", "For generation of answers, we randomly sampled 20 generated answers (each paired with its cited snippets) and annotated them ourselves, confirming that each answer (1) reflected the cited excerpts, (2) covered the main points of those excerpts, and (3) was written in clear, reader-friendly prose. We found the responses to be accurate and comprehensive.", "| | Query | Avg. Cosine Sim. | Diversity |\n|---:|:-------------------------------------------------------------------------------|-------------------:|------------:|\n| 0 | What complaints do users have about | 0.713 | 1 |\n| 1 | What do listeners say about Spotify crashing or freezing on startup? | 0.754 | 1 |\n| 2 | How do listeners describe the app's offline playback experience? | 0.696 | 1 |\n| 3 | How do users report errors or failures when downloading songs for offline use? | 0.618 | 1 |\n| 4 | What do users say about Spotify's crossfade and track-transition experience? | 0.65 | 1 |\n\nTABLE IX RETRIEVAL PROXY METRICS (K=10) FOR SELECTED SPOTIFY QUERIES (HIGHER DIVERSITY IS BETTER)", "2 Related Work\n\nRecipe Cross-Cultural Adaptation Recipe cross-cultural adaptation (Cao et al., 2024) involves modifying recipes to suit the dietary preferences and writing styles of the target culture. This includes not just translation, but also adjusting formats, ingredients, and cooking methods to align with cultural norms. Previous studies (Cao et al., 2024; Pandey et al., 2025; Zhang et al., 2024) often treat recipe adaptation as a cross-cultural translation task, exploring how prompt-based LLMs can be used for Chinese-English recipe adaptation.", "However, LLM-based recipe adaptation still faces challenges. Magomere et al.'s (2024) show that such methods can be misleading and may reinforce regional stereotypes. Hu et al.'s (2024) further identify two main challenges: First, LLMs lack culinary cultural knowledge, leading to insufficient cultural appropriateness. Second, the adapted recipes have quality issues, such as changing ingredients without adjusting the cooking steps accordingly. They propose another way to address these issues, namely through cross-cultural recipe retrieval, which sources recipes from real cooking practices within the target culture, generally offering better quality and cultural alignment. However, compared to directly using LLMs, the retrieved recipes often have low similarity to the original.", "All the above-mentioned studies primarily focus on the quality of generated results, including cultural appropriateness and their preservation of the original . However, they overlook the diversity of the results and do not explore the use of RAG for cross-cultural recipe adaptation. 
Our study emphasizes the trade-off between diversity and quality, with a particular focus on RAG-based approaches.", "Diversity in text generation, IR, and RAG Previous studies (Lanchantin et al., 2025) have shown that post-training LLMs tend to sharpen their output probability distribution, leading to reduced response diversity. This has raised a common concern about the insufficient diversity of LLMs, particularly in creative tasks. Several stochastic sampling-based decoding methods are widely used to control the level of diversity, most notably by adjusting hyperparameters such as temperature (Shi et al., 2024). However, these methods often still fall short in achieving sufficient diversity and may lead to a rapid decline in output quality, which is another important factor to consider when measuring diversity (Lanchantin et al., 2025).", "Figure 2: Overview of CARRIAGE . Diversity components are highlighted. We first enhance the diversity of retrieved results, then we enable more diverse use of contextual information via dynamic context selection, and inject contrastive context to prevent the LLM from generating outputs similar to previously generated recipes.\n\nMulti-Query Retrieval Source Culture Recipe Target Culture Diversity-aware Reranking Query Rewriting Dynamic Context Organization Pool of Previously Generated Recipes LLM Generation Contrastive Context Injection Previously : Diversity component Reference Recipes Selection Relevance Diversity\n\nMay generate multiple times\n\nIn IR, retrieving text with high diversity can cover a wider range of subtopics, thereby accommodating the potentially diverse preferences of different users. Methods such as diverse query rewriting (Mohankumar et al., 2021) and diversity-aware re-ranking (Carbonell and Goldstein, 1998; Krestel and Fankhauser, 2012) can effectively enhance the diversity of retrieval results. Some recent works (Carraro and Bridge, 2024) have explored using LLMs to enhance diversity in re-ranking.", "In RAG, prior works have mainly focused on retrieving diverse results to obtain more comprehensive information, such as mitigating context window limitations (Wang et al., 2025) and addressing multi-hop question answering tasks (Rezaei and Dieng, 2025). These works are primarily framed as question answering, aiming to acquire comprehensive knowledge to produce a single correct answer. Consequently, the evaluation metrics emphasize answer accuracy rather than diversity. In contrast, our task naturally permits multiple valid answers. Therefore, we adopt different strategies to encourage answer diversity and use metrics that explicitly evaluate the diversity of final outputs. While prior works have largely focused on retrieving diverse contexts, our approach goes a step further by investigating how to utilize such diverse contexts to produce diverse outputs.", "5 Metrics\n\nOur evaluation metrics focus on two key aspects: diversity and quality . To assess diversity, we consider factors such as lexical , semantic , and ingredient diversity from a per-input perspective. As a trade-off, we evaluate quality from two dimensions: the preservation of the source recipe, and cultural appropriateness for users in the target culture.", "5.1 Diversity\n\nKirk et al.'s (2023) have proposed two paradigms for measuring diversity: across-input (over pairs of one input and one output) and per-input diversity (one input, several outputs). 
Per-input diversity helps us investigate whether a single recipe can be adapted into multiple variants to meet different dietary preferences, while across-input diversity assesses whether the generated recipes collectively exhibit a diverse range of linguistic patterns. Because our investigation primarily focuses on whether a single recipe can be adapted into diverse variations to meet a broader range of needs, we adopt the per-input diversity setting as our main experimental focus. The across-input diversity setting is discussed further in Section 7.", "For a diversity metric D , under model configuration c , A denotes a set of adapted recipes,", "containing N source recipes, we define A i c = { a i c, 1 , a i c, 2 , . . . , a i c,K } as the set of K adaptations for the i -th source recipe under configuration c . The per-input diversity is defined as follows:\n\nLexical Diversity Lexical diversity is a measure of the variety of vocabulary used within a set of text. High lexical diversity indicates using a broad range of unique words, which may correspond to a wider variety of ingredients, cooking methods, and flavors. We employ Unique-n (Johnson, 1944) to evaluate lexical diversity, calculated as the ratio of unique n -grams to the total number of n -grams, reflecting the proportion of distinct n -grams and indicates vocabulary richness. Following prior work (Guo et al., 2024), we report the average Unique-n across unigrams, bigrams, and trigrams.", "Semantic Diversity Semantic diversity refers to the variety of meanings within a set of texts. High semantic diversity suggests a wide range of culinary ideas. We measure per-input semantic diversity using the average pairwise cosine distance between Sentence-BERT embeddings because embedding-based semantic diversity enables a more fine-grained evaluation of variation beyond surface-level vocabulary (Stasaski and Hearst, 2023). Specifically, for a set of K adapted recipes, we define the sum of their average semantic similarity and semantic diversity to be 1. In this formulation, higher semantic similarity implies lower semantic diversity. We define semantic diversity, scaled to the range [0 , 1] , as follows:\n\nwhere e represents embeddings of the recipe.", "Ingredient Diversity Ingredient diversity measures the variation in sets of ingredients across different recipes. Ingredient choice plays a crucial role in recipe diversity (Borghini, 2015). Compared to general lexical variation, ingredient changes offer a more precise signal for capturing the key factors driving diversity in recipes.", "Recipes often describe the same ingredient in varying ways, such as differences in quantity or units of measurement. To mitigate this, we introduce Standard Ingredients , which retain only the ingredient name by stripping away non-essential details. Since ingredient descriptions typically follow the format < quantity > < unit > < ingredient name >, we extract only the < ingredient name > to compute ingredient diversity. The detailed procedure is provided in Appendix B.", "To avoid the influence of differing ingredient counts across recipes, we define ingredient diversity as the ratio of unique standardized ingredients to the total number of ingredients. For a set of K adapted recipes, let the set of standardized ingredients for each recipe be I 1 , I 2 , . . . , I K . We define ingredient diversity as follows:", "5.2 Quality\n\nWe define automatic quality metrics to serve as a trade-off when evaluating recipe diversity. 
Further details on the training and evaluation of the CultureScore model are provided in Appendix B.", "Source Recipe Preservation Following prior work (Cao et al., 2024; Hu et al., 2024), we employ BERTScore (Zhang* et al., 2020), a common cosine embedding-based method for measuring the similarity between source and output recipes. Previous studies have shown that BERTScore aligns well with human evaluations in terms of source recipe preservation (Hu et al., 2024).", "Cultural Appropriateness We propose a novel metric, the Recipe Cultural Appropriateness Score (CultureScore), to assess how well the output recipes align with the target culture. Specifically, we employ a BERT-based classifier (Devlin et al., 2019; Ca\u00f1ete et al., 2020) to predict the country of origin of a recipe using its title and list of ingredients as input. The CultureScore is defined as the average predicted probability assigned by the model to the target culture across all adapted recipes, with higher scores indicating better cultural alignment. Since Latin American and Spanish recipes share the same language, the model cannot rely on linguistic cues; instead, it must learn to distinguish them based on culturally relevant features such as ingredients, flavors, and writing styles. Given that the classification model achieves an F1-score of over 90% in distinguishing between Latin American and Spanish recipes, we consider CultureScore a reliable proxy for assessing cultural appropriateness.", "| | | Method. | Diversity ( \u2191 ).Lexical | Diversity ( \u2191 ).Ingredient | Diversity ( \u2191 ).Semantic | Quality ( \u2191 ).CultureScore | Quality ( \u2191 ).BERTScore |\n|---:|:------------------|:----------------------------------------------------------------------------|:--------------------------|:-----------------------------|:---------------------------|:-----------------------------|:--------------------------|\n| 0 | Closed- Book LLMs | Llama3.1-8B Qwen2.5-7B Gemma2-9B | 0.557 0.551 0.538 | 0.667 0.531 0.639 | 0.232 0.247 0.196 | 0.451 0.404 0.468 | 0.404 0.439 0.370 |\n| 1 | IR | JINA-ES CARROT CARROT-MMR | 0.742 0.735 0.741 | 0.937 0.925 0.941 | 0.459 0.462 0.527 | 0.511 0.512 0.503 | 0.295 0.301 0.298 |\n| 2 | RAG | Vanilla-LLaMA RAG CARROT-LLaMA RAG CARROT-MMR-LLaMA RAG CARROT-MMR-Qwen RAG | 0.518 0.525 0.520 0.532 | 0.748 0.765 0.748 0.536 | 0.155 0.152 0.164 0.212 | 0.383 0.385 0.393 0.402 | 0.551 0.545 0.545 0.448 |\n| 3 | Ours | CARRIAGE -LLaMA CARRIAGE -Qwen | 0.577 0.628 | 0.739 0.676 | 0.269 0.303 | 0.463 0.590 | 0.442 0.342 |", "Table 1: Evaluation of diversity and quality on the RecetasDeLaAbuel@ dataset shows that our proposed CARRIAGE -LLaMA outperforms all closed-book LLMs in terms of Pareto efficiency across both diversity and quality metrics. In contrast, IR-based methods struggle with preserving the source recipe, while other RAG-based approaches tend to underperform in terms of diversity and cultural appropriateness."
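Since the answer above describes the “distinct document IDs / K” metric only in prose, here is a minimal sketch of what logging it per query could look like. The function and field names are mine, not something from the pipeline.

# Minimal sketch: per-query source diversity over the top-K retrieved chunks.
# Assumes each retrieved chunk is a dict carrying a "doc_id" for its source file.
def source_diversity(chunks):
    if not chunks:
        return 0.0
    unique_docs = {chunk["doc_id"] for chunk in chunks}
    return len(unique_docs) / len(chunks)

# 7 chunks from one PDF plus 3 from other files -> 4 distinct IDs / 10 chunks = 0.4
chunks = [{"doc_id": "report.pdf"}] * 7 + [{"doc_id": f"other_{i}.pdf"} for i in range(3)]
print(source_diversity(chunks))

Logging this next to the eval scores makes it obvious when your “10 chunks” are really just one document.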
Beyond the metric itself, this example is interesting because the evaluator takes a reasonable generalization and treats it as only “kinda supported”, or “meh”.
When I ran this item past another LLM, it thought the context relevance comment was a bit whiny.
But as you can see, low scores don’t have to mean the system is bad. You have to examine why scores are low (and also why they are high) to understand how the judge works and where the pipeline is actually failing.
Context relevance is a good example here: it measures how much of the retrieved context was actually useful. If you’re doing neighbor expansion, you will almost always pull in some irrelevant text, so context relevance (a precision-style metric) will look worse, especially if the corpus can’t answer the question in the first place.
The question is whether the extra context actually helps grounding (faithfulness / hallucination rate) enough to be worth the noise.
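That question is easy to eyeball once the per-item results are in one place. A minimal sketch, assuming the eval output is a list of dicts with the same full_faithfulness and seed_faithfulness fields as the dumps above; the numbers below are toy values and the 0.05 threshold is arbitrary.

# Minimal sketch: did neighbor expansion improve grounding on average?
def expansion_delta(results):
    deltas = [r["full_faithfulness"] - r["seed_faithfulness"] for r in results]
    avg_delta = sum(deltas) / len(deltas)
    clearly_helped = sum(1 for d in deltas if d > 0.05)  # arbitrary "clearly better" threshold
    return avg_delta, clearly_helped / len(deltas)

# Toy values for illustration; a real run would load the full eval output.
results = [
    {"full_faithfulness": 0.88, "seed_faithfulness": 0.53},
    {"full_faithfulness": 0.71, "seed_faithfulness": 0.74},
]
avg, frac = expansion_delta(results)
print(f"avg faithfulness delta: {avg:+.2f}, expansion clearly helped on {frac:.0%} of items")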
Some cautionary notes
Okay, some notes before I round this off.
Scoring the seed chunks inside this pipeline is clearly biased, and it doesn’t tell us whether they would have been enough on their own. We’d have to build two separate pipelines and compare them side by side to say that properly.
I’ll try to do this in the future, with this exact use case.
I should also note that the system has very few docs in the pipeline: only about 150 PDF files along with some Excel files, which comes to a few thousand pages. But I have to be able to demo this in public, and this was the only way to do that.
Remember that we only used generation-side metrics here, judging the answer against whatever context was retrieved. If the retrieved context itself is wrong or contradictory, these metrics may not show it; you have to measure retrieval quality separately, earlier in the pipeline.
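If you do have gold labels for which document (or passage) should answer each question, a basic retrieval-side check is only a few lines. A sketch, with hypothetical field names:

# Minimal sketch: hit rate / recall@k on the retrieval side.
# Assumes each eval item records the gold doc IDs and the retrieved doc IDs in rank order.
def hit_rate_at_k(items, k=10):
    hits = 0
    for item in items:
        retrieved = item["retrieved_doc_ids"][:k]
        if any(doc_id in retrieved for doc_id in item["gold_doc_ids"]):
            hits += 1
    return hits / len(items)

items = [
    {"gold_doc_ids": ["paper_a"], "retrieved_doc_ids": ["paper_b", "paper_a", "paper_c"]},
    {"gold_doc_ids": ["paper_d"], "retrieved_doc_ids": ["paper_e", "paper_f"]},
]
print(hit_rate_at_k(items, k=3))  # 0.5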
Furthermore, many teams build their own custom metrics that are unique to their pipeline and to what they want to test. Even if you start like this, with general metrics, you’ll spot along the way what you need in order to build better-targeted ones.
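As a trivial example of a targeted metric: if your pipeline is supposed to cite its sources as [1], [2] the way the answers above do, you can start by just checking that every answer contains at least one citation marker (a sketch; the regex and names are mine):

import re

# Minimal sketch of a custom, pipeline-specific metric: citation presence.
def citation_presence(answers):
    pattern = re.compile(r"\[\d+\]")
    cited = sum(1 for answer in answers if pattern.search(answer))
    return cited / len(answers)

answers = [
    "Diversity is the number of distinct document IDs over K, as defined in Document 1 [1].",
    "You can just eyeball the retrieved chunks.",
]
print(citation_presence(answers))  # 0.5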
The last thing to note is LLM judge bias. I’m using OpenAI models both for the RAG pipeline and for the evaluator. Relying on one provider everywhere isn’t ideal, but as long as the judge is a different model from the generator, it’s generally considered acceptable.
Hopefully it was a fun read (if you’re a dork about data like me).
Stay tuned for the last article where I try to test a more naive pipeline against this one (hopefully I have time to finish it).
If you want to stay updated or just connect, you’ll find me on LinkedIn, my website, or Medium (and here too).
❤


