Structured Knowledge for Multi-hop QA: A Comparative Study of GraphRAG and RAG

Research Article, Research Square
Nimet Aksoy, Murat Osman Ünalır
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8283065/v1
This work is licensed under a CC BY 4.0 License
Status: Posted. Version 1 posted.

Abstract
This study comparatively examines the classical RAG approach and the knowledge graph-based GraphRAG architecture in multi-hop question answering. While RAG uses external knowledge in an unstructured manner, GraphRAG aims to provide more controlled and meaningful answers by relying on structured knowledge triples, thereby reducing hallucinations. In the experiments, both architectures were tested on 500 questions selected from the HotpotQA dataset, and their performances were compared. In particular, the questions that the RAG system answered with “I don’t know” were re-evaluated using GraphRAG.
In the GraphRAG pipeline, knowledge triples were first extracted from the context, and then the same language model performed question analysis: identifying the question type and the expected reasoning pattern, and selecting the most relevant triples. Answers were generated using only filtered and contextually appropriate structured information. The results show that incorporating structured knowledge provides a clear improvement in semantic answer quality. On average, both cosine similarity and BERT F1 scores increased by 20–30% across the tested subsets. Moreover, GraphRAG successfully answered approximately 80% of the questions that the classical RAG system could not answer. These findings demonstrate that structured knowledge enables more reliable reasoning in multi-step QA and highlight the potential of the GraphRAG approach as a stronger alternative for complex question answering tasks.

1. Introduction
The emergence of large language models (LLMs) has led to substantial progress in various natural language processing (NLP) tasks such as summarization, question answering (QA), and information retrieval. These models exhibit strong language understanding capabilities due to their advanced architecture and training on large-scale open-domain datasets. However, their performance often declines in complex reasoning settings, particularly in tasks that require integrating information from multiple sources. To mitigate these limitations, the Retrieval-Augmented Generation (RAG) framework was proposed. RAG enhances LLMs by incorporating external knowledge sources, typically in the form of vectorized domain-relevant documents stored in a vector database (Izacard & Grave, 2021; Lewis et al., 2020). During inference, relevant passages are retrieved based on vector similarity and provided to the LLM as additional context, improving its ability to perform in specialized domains.
Nevertheless, vector similarity-based retrieval often struggles to surface all the evidence required for multi-hop QA, where the model must combine multiple pieces of information across documents, and this leads to incomplete or noisy context. To address this, recent studies have introduced GraphRAG, which leverages structured representations of knowledge (specifically, knowledge graphs constructed from external sources) rather than unstructured vector embeddings. In this framework, input documents are transformed into subject–predicate–object triples, allowing the model to reason more systematically over structured data and handle tasks requiring deeper inference (Han et al., 2025; Kau et al., 2024; Pan et al., 2024; zilliz, 2024). Rather than relying on pre-built or externally curated knowledge graphs, our study adopts a lightweight, task-specific graph construction approach, where triples are extracted directly from the dataset context using an LLM. Structurally informed models tend to generate more consistent responses and significantly reduce hallucinations by constraining the model to reason over explicit relational information. In addition to improving reliability, this approach enhances performance in tasks requiring multi-step inference. In this study, we conduct a detailed comparison of RAG and our GraphRAG implementation using the HotpotQA dataset, a widely adopted benchmark for multi-hop reasoning. Experiments are conducted across different question types (i.e., bridge and comparison). The results show that GraphRAG yields a performance improvement of nearly 20% across the full dataset. Moreover, in cases where RAG failed to generate an answer, GraphRAG successfully answered 80–90% of those questions.
Because GraphRAG produces synthesized answers derived from relational structures rather than extractive spans, we evaluate both systems using semantic similarity metrics, which more accurately capture correctness in structured reasoning settings. These findings highlight the potential of knowledge-graph-based reasoning in enhancing multi-hop QA systems.

2. Related Work
To enhance the performance of large language models (LLMs) in domain-specific question answering tasks and reduce their tendency to generate hallucinations, the Retrieval-Augmented Generation (RAG) framework was introduced. RAG combines the generative capabilities of LLMs with external document retrieval, enabling the model to generate more grounded and informative responses without the need for retraining. In a typical RAG pipeline, relevant documents are retrieved based on semantic similarity to the query and appended to the model’s context window during inference (Lewis et al., 2020). This architecture has been widely adopted in various QA tasks and shown to improve factual accuracy and reduce hallucination in both open-domain and knowledge-intensive settings (Arefeen et al., 2024; Besbes, 2024; Izacard & Grave, 2021).

Despite its success, RAG faces limitations when applied to multi-hop question answering (QA) tasks, where reasoning across multiple documents is required. Datasets like HotpotQA are specifically designed to test multi-hop inference by including questions that necessitate integrating information from distinct sources (Yang et al., 2018). However, standard RAG pipelines often struggle with these scenarios due to the increased context length, semantic dispersion, and the lack of structural organization in retrieved content.
Chunk-based retrieval methods, commonly used in RAG, may overlook long-range dependencies or split coherent information across different chunks, leading to incomplete or incorrect answers, especially in bridge-type or comparison questions (Liu et al., 2023; Mavi et al., 2024).

To address these limitations, several extensions to the RAG framework have been proposed. EfficientRAG, introduced by Zhuang et al. (2024), trains lightweight labeling models on synthetic datasets derived from HotpotQA to perform efficient multi-hop retrieval without requiring LLM calls during inference. While this approach significantly reduces inference costs, its reliance on word-level labels may fail to capture semantic variation, and preprocessing large datasets like QASPER still incurs substantial overhead (Dasigi et al., 2021). Similarly, Lee et al. (2025) proposed a Multi-Hop Tree Structure (MHTS) framework that generates synthetic QA datasets with controlled difficulty levels. However, the approach has only been tested on a narrow domain (e.g., the novel David Copperfield) and relies on proprietary models like GPT-4 Turbo, limiting its generalizability and scalability.

Another prominent direction is multi-hop dense retrieval. Xiong et al. (2020) introduced a system that reformulates the query after each retrieval step by incorporating previously retrieved passages, enabling sequential information integration (Rosset et al., 2020). While this method showed improvements over sparse retrieval, it is constrained by a fixed number of hops and incurs high computational cost due to repeated embedding and search operations. More recently, the Chain-of-Retrieval (CoRAG) framework proposed by Wang et al. (2025) introduced a dynamic query reformulation mechanism to enable chained multi-hop retrieval during inference.
CoRAG demonstrated promising results on HotpotQA and MuSiQue by using adaptive decoding strategies such as greedy decoding, best-of-N, and tree search. However, its performance gains are tightly linked to increased token consumption and computational overhead due to complex rejection sampling and chaining techniques. Likewise, ReSP (Retrieve, Summarize, Plan), proposed by Jiang et al. (2024), introduces a planning module based on summarization to reduce context redundancy. Although it improves retrieval quality, repeated invocation of a summarization model increases latency, making it impractical for real-time systems.

Although numerous architectural improvements have been proposed to adapt the RAG framework to multi-hop question answering tasks, the underlying source of information in these systems remains unstructured natural language text. This lack of structure weakens contextual coherence and reduces the effectiveness of vector-based retrieval methods, particularly in long passages where relevant information is dispersed. In datasets like HotpotQA, which require multi-step reasoning, such limitations often lead to lower answer quality and increased hallucination risks.

In contrast, knowledge graphs (KGs) offer a structured data representation that can help address these issues. By representing information as (subject, predicate, object) triples, KGs encode explicit semantic relationships between entities and enable language models to perform more interpretable and coherent reasoning. While the creation and maintenance of high-quality KGs is time-consuming and resource-intensive, their integration with language models has been shown to improve accuracy and reduce hallucinations. Recent research has explored different strategies for integrating KGs with large language models (LLMs).
A comprehensive analysis of this landscape is presented in Unifying Large Language Models and Knowledge Graphs: A Roadmap (Pan et al., 2024), which categorizes KG-LLM integration approaches into three main paradigms: KG-enhanced LLMs, LLM-enhanced KGs, and hybrid architectures. Although each paradigm presents its own challenges and trade-offs, they all highlight the benefit of leveraging structured knowledge to enhance language model reasoning. Prior studies have proposed various methods: for example, KG-BERT aims to complete missing links in a KG using BERT by representing triples as textual input, whereas ERNIE integrates KG entities during pretraining to improve the performance of transformer-based models (Yao et al., 2019; Zhang et al., 2019). However, both approaches involve costly processes such as KG construction and model retraining.

In response to these limitations, lighter-weight techniques such as KG-prompting have gained popularity in recent years. These approaches use knowledge graphs during inference without changing the language model itself, making them easier and cheaper to apply in practice. In this setup, triples are automatically extracted from plain text through prompt-based methods, allowing the model to reason over structured information without additional training. One of the most prominent examples of this direction is the GraphRAG framework, which integrates knowledge graphs into the RAG pipeline to support structured reasoning during multi-hop question answering. GraphRAG allows the model to reason over relational triples extracted from relevant documents or linked from external ontologies, thus combining the strengths of retrieval-based and structure-based QA. This approach has attracted increasing attention in recent research, and many studies have been conducted using similar methods to improve multi-hop reasoning and reduce hallucination (EQT Ventures, 2024; Han et al., 2025; Kau et al., 2024).
In our work, we present a systematic comparison between the GraphRAG and standard RAG pipelines using the widely adopted HotpotQA dataset. Two distinct subsets of the dataset were selected (one consisting of bridge-type questions and the other of comparison-type questions), and the complete GraphRAG pipeline was applied end-to-end. To ensure that the extracted triples would support meaningful inference, careful prompt engineering was applied during the graph construction phase, aligning the triples semantically with the question intent. Experimental results show that GraphRAG outperformed the baseline RAG model by approximately 20% in answer accuracy. More importantly, a separate evaluation focused on questions left unanswered by the RAG system (i.e., "I don't know" responses) revealed that GraphRAG was able to answer 80–90% of these previously unanswered questions correctly, depending on the question type. These findings demonstrate that incorporating structured knowledge into QA pipelines can substantially improve not only performance but also reliability, especially in complex multi-hop scenarios.

3. Methodology
This section describes the datasets used in the study, as well as the design and implementation of both the RAG and GraphRAG frameworks. In particular, we first summarize the properties of the HotpotQA dataset, then detail the baseline RAG pipeline and the proposed GraphRAG pipeline built on top of the same data representation.

3.1 Dataset Description
HotpotQA is a large-scale, Wikipedia-based question answering dataset specifically built to evaluate the ability of question answering systems and language models to perform multi-hop reasoning. The dataset, introduced by Yang et al. (2018), contains 112,779 examples with annotated supporting facts and is a valuable benchmark for multi-hop QA and RAG-based architectures.
Each question in the dataset is mapped to two or more Wikipedia paragraphs, and the models are expected to find the answer through chains of reasoning. There are two different question types in the dataset, “bridge” and “comparison”:
• Bridge questions, which require reasoning over a bridge entity that connects two distinct facts or paragraphs.
• Comparison questions, which involve evaluating and comparing properties of two entities (e.g., "Who is older, A or B?").

In addition to the question type, each example is annotated with a difficulty level (easy, medium, or hard), which reflects the number of reasoning steps and the complexity of the required evidence. The HotpotQA dataset is divided into subsets based on difficulty level. Easy questions usually require only a single fact to answer. Medium questions involve multi-hop reasoning and can be answered by baseline models. Hard questions also require multi-hop reasoning but are more complex and often cannot be answered by baseline systems.

Each example in the HotpotQA dataset is a JSON object that includes a natural language question, its correct answer, the reasoning type (bridge or comparison), the difficulty level (easy, medium, or hard), a list of supporting facts, and the context. The context is made of Wikipedia paragraphs given as pairs of a title and a list of related sentences. The supporting facts specify which titles and sentence indices are required to answer the question, and thus explicitly encode the multi-hop nature of the dataset. Since the goal of this study is to understand how structured data affects QA performance in multi-hop settings, we selected 500 examples from the HotpotQA dataset, focusing on two question types: bridge and comparison.
These 500 examples (250 bridge-type and 250 comparison-type questions) were taken from our previous work, Understanding the Impact of Dataset Characteristics on RAG-based Multi-hop QA Performance, where they were already curated to represent typical multi-hop reasoning patterns (Aksoy et al., 2025). In both studies, we preprocessed the context field into a RAG-compatible format by restructuring it as “title: sentence1, sentence2, …”, allowing for more consistent passage retrieval. This representation flattens each Wikipedia paragraph into a single textual unit while preserving the association between the title and its sentences, which is useful both for dense retrieval and for triple extraction. In the current study, we also extracted subject–predicate–object triples directly from this reformatted context to build knowledge graphs used in the GraphRAG pipeline. By using the same preprocessed context for both RAG and GraphRAG, we ensure that any performance differences can be attributed to the reasoning mechanism (unstructured retrieval vs. structured graph reasoning) rather than to changes in data representation.

3.2 RAG Pipeline
Retrieval-Augmented Generation (RAG) is a method that combines document retrieval and answer generation in one system. First, the question is used to find related documents from an external source, such as a vector database. Then, the question and the retrieved documents are given to a language model, which uses both to create an answer. This approach helps the model give more accurate and informed responses without extra training, especially for tasks that need current or specific knowledge (Lewis et al., 2020). RAG therefore provides a controlled way of grounding LLM outputs in external evidence, which is particularly important in multi-hop scenarios where multiple pieces of information must be integrated. In this study, we adopt a standard Retrieval-Augmented Generation (RAG) framework implemented using the LangChain library.
The pipeline consists of three main stages: (1) preprocessing and chunking dataset-specific documents, (2) semantic vector-based retrieval, and (3) answer generation using a large language model (LLM). Figure 1 illustrates the RAG pipeline as used in this study, including the document preprocessing steps, the chunk-level embedding process, and the retrieval–generation workflow. For this study, we reused the RAG pipeline previously developed and evaluated in our earlier multi-dataset experiments. Specifically, a subset of 500 questions from the HotpotQA dataset was selected, consisting of 250 bridge-type and 250 comparison-type questions. The context field for each question was formatted in the same way as in the earlier setup, where each paragraph was transformed into a unified structure of title: sentence1, sentence2, ... to preserve semantic coherence during retrieval. Using the same preprocessing strategy ensures comparability with our previous work (Aksoy et al., 2025 ) and isolates the effect of structured reasoning introduced in GraphRAG. The RAG pipeline was implemented using LangChain’s RetrievalQA chain. Contexts were split into overlapping chunks using RecursiveCharacterTextSplitter (chunk size: 1000 tokens, overlap: 100) to maintain information continuity. This parameter choice reflects a balance between capturing sufficient local context and avoiding excessive chunk fragmentation, which is known to negatively affect multi-hop retrieval performance. These chunks were embedded using the all-MiniLM-L6-v2 model from Hugging Face’s sentence-transformers library and stored in a ChromaDB vector database. We used the default embedding dimension of 384, which is well suited for dense retrieval tasks where semantic rather than lexical similarity is required. During retrieval, Maximal Marginal Relevance (MMR) similarity search was employed to fetch relevant chunks. 
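The MMR selection rule can be illustrated with a short Python sketch. This is written from the standard MMR formulation and is not the LangChain or ChromaDB code used in the study; the function name `mmr_select`, the trade-off value `lambda_mult=0.5`, and the two-dimensional toy vectors are assumptions chosen only to make the behaviour visible.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, chunk_vecs, k=2, lambda_mult=0.5):
    """Return indices of k chunks chosen by Maximal Marginal Relevance.

    Each step picks the chunk maximising
        lambda_mult * sim(chunk, query)
        - (1 - lambda_mult) * max sim(chunk, already selected chunk)
    """
    selected = []
    candidates = list(range(len(chunk_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, chunk_vecs[i])
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data (hypothetical): chunk 1 is a near-duplicate of chunk 0,
# chunk 2 is less similar to the query but covers a different direction.
query = [1.0, 0.0]
chunks = [[0.9, 0.436], [0.88, 0.47], [0.8, -0.6]]

print(mmr_select(query, chunks, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, chunks, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the rule reduces to plain top-k similarity and returns the two near-duplicate chunks; with `lambda_mult=0.5` the second slot goes to the chunk that adds different evidence, which is the property that matters when two distinct supporting paragraphs are needed.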
MMR was selected because it reduces redundancy in retrieved passages and improves coverage of distinct reasoning paths, an important requirement for multi-hop QA.

For the answer generation step, we used the LLaMA3-70B-8192 model via the Groq API, selected for its support for long input sequences, ease of access, and increasing adoption in research. A question-specific prompt template was used to help the model produce short, accurate responses and reduce hallucinations. Temperature was set to 0.1, and a maximum output length of 64 tokens was used to encourage concise answers consistent with HotpotQA’s answer format. No model-specific fine-tuning was performed.

Rather than directly reusing the numerical hyperparameter values from our earlier SQuAD-based study (Aksoy et al., n.d.), we applied the design principles established there: namely, that multi-hop tasks benefit from (i) chunk sizes aligned with typical context lengths, (ii) MMR-based retrieval to maximize evidence diversity, and (iii) low-temperature decoding to reduce hallucination. In this study, the actual hyperparameter values were adjusted to the characteristics of the HotpotQA dataset, while the underlying methodological rationale remained consistent. This controlled yet dataset-aware setup ensures that the RAG baseline is both robust and comparable, allowing performance differences to be attributed to the structured reasoning introduced in GraphRAG rather than to parameter tuning.

3.3 GraphRAG
In the RAG architecture, external information is stored in a vector database after being converted into dense vector representations. Although relevant documents are retrieved based on vector similarity, the underlying data remains unstructured. In contrast, GraphRAG replaces this component with knowledge graphs, where external information is represented as structured triples (subject, predicate, object).
This structured format allows the model to capture deeper semantic relations, going beyond surface-level similarity (Han et al., 2024). We adopt a lightweight, document-level variant of GraphRAG, where small task-specific graphs are constructed directly from the same HotpotQA contexts used in the RAG pipeline.

A total of 500 examples from the HotpotQA dataset (250 bridge-type and 250 comparison-type questions) were processed using this GraphRAG framework. Each example consists of a question and its associated context, which was preprocessed in the same way as in the RAG pipeline to ensure a fair comparison between the two architectures. Triples were extracted from the context fields using a prompt-based method with the LLaMA3-70B-8192 model via the Groq API. The same model was also used for the question answering (QA) stage, reducing representational mismatch between the structured triples and the unstructured text, as both are produced and consumed by a single LLM. Although triples could in principle be derived using alternative NLP techniques (e.g., dependency parsing, Open Information Extraction), we opted for an LLM-based approach due to its ability to capture implicit and semantically rich relations. The full extraction prompt is provided in Appendix A. Consistent with prior work, we note that LLM-generated triples may not be perfectly accurate; in this study, we evaluate performance at the end-to-end QA level and leave triple-level validation to future work.

The extracted triples were stored as a list and subsequently used to construct a local directed graph for each question. Each subject and object was mapped to a graph node, while each predicate defined a labelled directed edge connecting the corresponding nodes. During this process, duplicate or clearly malformed triples were discarded, resulting in a compact yet informative set of relational facts associated with each example.
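The graph construction step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the text does not name a graph library, so a plain adjacency dictionary is used here, and the function name `build_graph`, the working definition of "malformed" (anything that is not exactly three non-empty strings), the case-insensitive duplicate check, and the placeholder triples are all assumptions.

```python
from collections import defaultdict


def build_graph(triples):
    """Build a directed, edge-labelled graph from (subject, predicate, object) triples.

    Returns a dict mapping each node to a list of (predicate, object) edges.
    Malformed triples and duplicates are skipped.
    """
    graph = defaultdict(list)
    seen = set()
    for triple in triples:
        # Assumed notion of "malformed": not exactly three non-empty strings.
        if not isinstance(triple, (list, tuple)) or len(triple) != 3:
            continue
        if not all(isinstance(part, str) and part.strip() for part in triple):
            continue
        subj, pred, obj = (part.strip() for part in triple)
        # Assumed duplicate rule: same triple ignoring letter case.
        key = (subj.lower(), pred.lower(), obj.lower())
        if key in seen:
            continue
        seen.add(key)
        graph[subj].append((pred, obj))  # labelled edge subj -> obj
        graph.setdefault(obj, [])        # make sure the object is a node too
    return dict(graph)


# Placeholder triples following a bridge pattern (Film X -> Person Y -> City Z).
raw_triples = [
    ("Film X", "directed by", "Person Y"),
    ("Person Y", "born in", "City Z"),
    ("film x", "Directed By", "person y"),  # duplicate up to letter case
    ("Person Y", "", "City Z"),             # malformed: empty predicate
    ("broken",),                            # malformed: wrong length
]

graph = build_graph(raw_triples)
for node, edges in graph.items():
    print(node, "->", edges)
```

In this toy graph the bridge entity "Person Y" appears both as an object and as a subject, which is the structure that lets a two-step question be answered by following one edge after another.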
After the graph construction phase, the second stage of our GraphRAG architecture performs question answering using only these triples. The LLM is prompted to analyse the question, infer the expected answer type (e.g., person, date, place), identify the key entities, and determine the required reasoning style (e.g., multi-hop, comparison, temporal). Based on this internal analysis, the model selects the most relevant triples and attempts to infer an answer grounded solely in this structured evidence. If no answer can be supported by the available triples, the model is instructed to output “I don’t know” rather than guessing. Given that the gold answers in HotpotQA are short and factual, the prompt also enforces concise responses. This explicit restriction to triple-based reasoning suppresses unsupported hallucinations and isolates the contribution of structured knowledge compared to the unstructured RAG baseline. Figure 2 illustrates the complete GraphRAG pipeline used in this study.

4. Experimental Results and Discussion
In this section, we describe the evaluation framework used in the experiments, present the results obtained with the baseline RAG and the GraphRAG pipelines on the HotpotQA dataset, and compare their behaviour on bridge and comparison questions. The main goal is to analyse how graph-augmented retrieval and reasoning influence answer quality and reliability in multi-hop question answering by reporting both the RAG and GraphRAG results on the same subsets, and by additionally examining how GraphRAG performs on the questions that the RAG system fails to answer.

4.1. Evaluation Framework
In this study, we adopted a hybrid evaluation approach that combines semantic similarity metrics with a lightweight threshold-based labeling scheme to assess the quality and factual reliability of the generated answers.
Our primary goal was to design an evaluation process that is efficient, reproducible, and scalable, which is especially important when working with multiple QA datasets and large numbers of examples in a cost-conscious academic context. Unlike our earlier work, where we used LLM-based evaluation tools such as RAGAS, this study intentionally avoids such dependencies. Although tools like RAGAS offer fine-grained scoring by leveraging powerful commercial language models (typically GPT-4), they also introduce two major limitations: high token costs and slow processing times. These drawbacks make them impractical for large-scale experiments. Instead, we developed a lightweight and robust evaluation pipeline using only open-source tools and metrics that can be run locally and freely.

The metrics used in this evaluation framework, cosine similarity and BERTScore, are detailed below; together they enabled a robust assessment of semantic fidelity in model responses. While both metrics aim to assess semantic similarity between the generated and reference answers, they do so in fundamentally different ways. Cosine similarity operates on sentence-level embeddings, measuring overall semantic alignment using vector space distance. In contrast, BERTScore uses deep transformer-based contextual representations to evaluate alignment at the token level, comparing each word in the model's output with those in the reference. This makes BERTScore more sensitive to paraphrasing, rewording, and nuanced phrasing.

Cosine Similarity: To evaluate the semantic closeness between generated answers and ground-truth references, we used cosine similarity. Sentence embeddings were created using the all-MiniLM-L6-v2 model from the Sentence-Transformers library, which is known for its ability to capture sentence-level meaning beyond surface token overlap.
A similarity threshold of 0.5 was set, and this value was also used to convert continuous similarity scores into binary classification labels (i.e., semantically correct or not). BERTScore: We also included BERTScore, computed using the roberta-large model, to provide a second layer of semantic evaluation. Unlike cosine similarity, BERTScore operates at the token level and accounts for contextual meaning, making it particularly useful when evaluating paraphrased or abstracted answers. To handle cases where the model returns defensive or evasive answers (e.g., “I don’t know”), we assigned a BERTScore of zero to those outputs—treating them as non-informative in the context of factual QA. All evaluations were carried out using open-access Python libraries such as scikit-learn, sentence-transformers, and bert_score. We ran all experiments in a Google Colab environment, which allowed us to avoid the use of commercial LLM APIs entirely. This setup ensured that our evaluation framework was fast, free to use, and fully reproducible, a critical factor for large-scale academic studies involving multiple datasets and thousands of test cases. 
The pseudo code describing the evaluation phase is as follows:

    Initialize sentence_embedding_model with "all-MiniLM-L6-v2"
    Initialize bert_model with "roberta-large"
    Set cosine_similarity_threshold = 0.5

    For each row in the dataset:
        Get model_answer and true_answer as strings
        If either answer is empty or blank:
            Assign cosine_similarity = 0.0
            Assign BERT precision, recall, and F1 = 0.0
        Else:
            Encode both answers into vectors using sentence_embedding_model
            Compute cosine_similarity between vectors
            If model_answer starts with "I don't know":
                Assign BERT precision, recall, and F1 = 0.0
            Else:
                Compute BERTScore precision, recall, and F1 between the two answers
        Append cosine_similarity and BERT scores to respective lists
        If cosine_similarity ≥ threshold:
            Assign predicted_label = 1
        Else:
            Assign predicted_label = 0

    After all rows are processed:
        Add cosine_similarity, BERT scores, and predicted_labels to the dataset
        Set true_labels = 1 for all rows in answerable-question subsets
        Print: Mean Cosine Similarity, Mean BERTScore F1

4.2. Overall Performance (RAG vs. GraphRAG)
In this section, we present the results obtained from the RAG and GraphRAG pipelines on the HotpotQA bridge and comparison subsets. The analysis focuses on how each system handles different multi-hop reasoning patterns and how graph-augmented retrieval influences answer quality. We first report the performance of the baseline RAG system, followed by the corresponding GraphRAG results on the same subsets. We also examine how GraphRAG performs on questions for which the RAG system produced an “I don’t know” answer, allowing us to quantify the contribution of structured knowledge graphs in cases where standard retrieval fails. The results obtained from the four controlled subsets of the HotpotQA dataset are summarized in Table 1 below.
Table 1. Subset-level performance on the HotpotQA dataset with the RAG pipeline.

    Question Type    Cosine Similarity    BERT F1
    Bridge           0.63                 0.74
    Comparison       0.64                 0.76

Table 1 shows that the two subsets give very close results. The model is only slightly better on comparison questions, with cosine similarity at 0.64 and BERT F1 at 0.76. The bridge questions are a bit lower on both metrics, which makes sense because these questions usually require several steps and pulling information together from different parts of the context. In both cases, BERT F1 is higher than cosine similarity. This suggests that the model often produces answers that are reasonable in meaning even when the wording does not line up very closely. Looking at both metrics together helps make the overall behaviour clearer.

Table 2 presents the detailed analysis of the bridge-type subset for WH-questions. Questions were categorized based on leading question words such as what, who, how, etc., and all others were grouped under the category “other.” This categorization provides insight into how different question formulations impact the quality of retrieval and answer generation in our RAG pipeline.

Table 2. Question type-based performance on the Bridge dataset with the RAG pipeline.

    Question Type    Count    Mean BERT F1    Avg. Chunk Len. (chars)    Avg. Chunk Len. (tokens)
    what             64       0.7508          5672                       922
    when             11       0.7471          6033                       997
    which            26       0.7182          5305                       864
    where            8        0.6360          5739                       897
    how              5        0.3799          7083                       1141
    who              20       0.6168          5223                       855
    other            116      0.7816          5963.3                     966.9

The results in Table 2 show that longer contexts often lead to lower performance. For example, how-type questions had the longest context lengths and the lowest BERT F1 scores. This is expected, as these questions usually require explanation or reasoning, which is harder for the model. On the other hand, what-type questions performed better and were the most common WH type. These questions often ask for specific facts, which can be found more easily in the context.
Similarly, when and which questions also gave good results. The other category showed the highest performance overall. This may be because many questions in this group were shorter and more direct, making them easier to match with retrieved information. This findings suggest that the model performs better with fact-based and worded questions, while it struggles more with questions that need reasoning or multi-step answers. Unlike bridge-type questions, comparison-type questions are inherently structured to contrast two or more entities. As a result, categorizing them by WH-type (e.g., what, who, how) is not meaningful, since most comparison questions follow a functional pattern that focuses on equivalence, order, quantity, or attributes, rather than seeking specific factual responses. Therefore, a more informative analysis groups these questions by comparison subtype, explicit, ordinal, yes/no, other. Explicit comparisons : These questions include clear comparative expressions such as "which one", "who has more", or temporal markers like "earlier" and "later". They typically require retrieving two entities and selecting one based on a specified attribute. • Example • “Which satellite was launched earlier, Hakucho or Chandra?” Ordinal comparisons : These questions express sequence or order using terms like "first", "older", or "developed earlier". They require reasoning over timeline or development precedence. • Example • “Who was born first, Greg Lake or someone else?” Yes/No comparisons : These are binary questions starting with auxiliary verbs such as "Are", "Do", "Is", or "Did", and expect a confirmation or negation of a comparative claim. • Example • “Are Hayley Williams and Paul Simon both singers?” Other comparisons : This group includes questions that involve implicit or less structured comparisons which do not clearly fall into the above categories. These are often more open-ended or abstract. 
• Example • “What do Aram Avakian and Karo Parisyan have in common?” Results of comparison type question is shown Table 3 . Table 3 Question type-based performance on Comparison Dataset of RAG pipeline. Comparison Type Question Count BERT F1 Avg. Chunk (chars) Avg. Chunk (tokens) explicit 59 0.894 5215 838 ordinal 41 0.789 4946 810 yes/no 99 0.829 4802 777 other 51 0.810 5690 920 The performance analysis on the comparison set reveals consistent trends in how different comparison question types affect model efficiency. Explicit comparison questions with clear comparative structures achieved the highest BERT F1 score (0.894), indicating that the model handled direct and well-formulated comparison prompts more effectively. In contrast, sequential questions that required reasoning in order or sequence (e.g., “Who was born first?”) consistently performed the lowest. This is likely because these questions depend on locating and correctly interpreting temporal or ordered information, which can be harder to retrieve when such details are spread across the context. Yes/No and other comparison types showed moderate performance. Overall, these results suggest that the phrasing and structure of the comparison prompt itself play a noticeable role in how well the model can resolve multi-hop comparisons. The same analyses were performed using the GraphRAG pipeline for both bridge type and comparison type questions. Below we present the results and how they differ from RAG. Subset level performance -bridge, comparison type question- is shown in Table 4 . Table 4 Subset level performance with GraphRAG pipeline. Question Type Cosine Similarity BERT F1 Bridge 0,69 0,88 Comparison 0,78 0,92 The results in Table 4 show that the GraphRAG system performs better on both bridge and comparison questions. For bridge type questions, a cosine similarity of 0.69 and a BERT F1 score of 0.88 were obtained. For comparison type questions, a cosine similarity of 0.78, and a BERT F1 score of 0.92. 
BERT F1 score is higher in both subsets. This shows that the model can reach semantically correct results despite the differences in expression. The results show that overall, GraphRAG helps the model to provide more accurate and meaningful answers than the basic RAG system. A more detailed breakdown across WH-question types is presented in Table 5 , which illustrates how the graph-based structure influences the model’s performance in each question category. Table 5 Question type-based performance on Bridge Dataset with GraphRAG Question Type Count Mean BERT F1 Avg. Chunk Len. (chars) Avg. Chunk Len. (tokens) what 64 0.8914 5672 922 when 11 0.7057 6033 997 which 26 0.9748 5305 864 where 8 0.6854 5739 897 how 5 0.7293 7083 1141 who 20 0.9561 5223 855 other 116 0.878 5963.3 966.9 In the GraphRAG pipeline, the performance of bridge-type questions varies according to the WH type, with the highest BERT F1 scores observed for “which” (0.9748) and “who” (0.9561) questions. These types usually require defining specific entities or making choices, and are easily found with the triples provided by knowledge graphs. “What” questions also performed well (0.8914) because they contain concrete facts that can be easily retrieved. In contrast, lower scores were recorded for “how” (0.7293), “when” (0.7057), and “where” (0.6854) questions, which usually require temporal and spatial inference; these types of questions may not be directly extracted from the triples and remain difficult for knowledge graphs. Despite the similar context lengths across question types, GraphRAG provided a significant advantage thanks to structured data. The “other” category, which included non-standard WH forms, also yielded a strong result (0.878), indicating that clear and well-matched expressions were used effectively. 
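The WH-type grouping behind Tables 2 and 5 can be illustrated with a short sketch. It assumes, following the description of Table 2, that the category is simply the leading word of the question when that word is one of the six listed WH words, and "other" otherwise. The function names `wh_type` and `mean_f1_by_type` are hypothetical, and the exact matching rules used in the study (for instance, how "whose" or "in which" are handled) are not specified.

```python
import string
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# WH categories that appear in Tables 2 and 5.
WH_WORDS = ("what", "when", "which", "where", "how", "who")


def wh_type(question: str) -> str:
    """Return the leading WH word of a question, or "other"."""
    words = question.strip().lower().split()
    if not words:
        return "other"
    # Drop punctuation attached to the first word, e.g. "Who," -> "who".
    first = words[0].strip(string.punctuation)
    return first if first in WH_WORDS else "other"


def mean_f1_by_type(
    records: Iterable[Tuple[str, float]],
) -> Dict[str, Tuple[int, float]]:
    """Group (question, bert_f1) pairs by WH type.

    Returns a mapping from type to (count, mean BERT F1).
    """
    groups: Dict[str, List[float]] = defaultdict(list)
    for question, f1 in records:
        groups[wh_type(question)].append(f1)
    return {
        qtype: (len(scores), sum(scores) / len(scores))
        for qtype, scores in groups.items()
    }
```

Under this rule a question such as "In which city was the band formed?" falls into "other" because its first word is not a WH word, which is one plausible reason the "other" group is the largest in the bridge subset.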
In addition, questions that could not be answered by the RAG pipeline were stored in separate CSV files and re-evaluated with the GraphRAG pipeline, and the performance difference between RAG and GraphRAG was evaluated in detail. The number of questions that could not be answered with the RAG pipeline (those the model answered with “I don’t know”) and their ratio to the dataset are shown in Table 6.

Table 6. Distribution of questions that could not be answered by the RAG pipeline.

Subset             I Don't Know Count    Total Questions    Ratio (%)
Bridge-Type        51                    250                20.4%
Comparison-Type    43                    250                17.2%

Table 7 shows how well the GraphRAG pipeline answered the questions that the RAG system failed to answer. Out of the 51 bridge-type questions left unanswered by RAG (which the model answered as “I don’t know”), GraphRAG provided semantically meaningful answers to 44 (86.27%), based on a BERT F1 score above 0.5. Only 31 answers (60.78%) were considered similar when measured by cosine similarity (threshold value 0.5). In other words, 31 answers were close to the reference answer at the sentence-embedding level, while 44 were semantically correct according to BERTScore even though the wording differed. The average BERT F1 score was 0.8012, and the average cosine similarity was 0.6596. These results show that GraphRAG can answer many questions that RAG cannot, especially by using structured data for questions that require deeper reasoning.

Table 7. GraphRAG performance on questions unanswered by RAG.

Subset             Cosine Similarity    BERT F1
Bridge-Type        0.6596               0.8012
Comparison-Type    0.7592               0.9086

Following this overall evaluation, we now provide a more detailed analysis of the bridge-type and comparison-type questions. Each category is examined based on question formulations (e.g., WH-type or comparison subtype), semantic accuracy, and contextual characteristics to understand better where GraphRAG performs well and where challenges remain. Table 8 and Table 9 provide detailed analyses of the bridge-type and comparison-type questions.

Table 8. GraphRAG WH-type analysis on RAG-unanswered bridge questions.

Question Type    Count    Mean BERT F1    Avg. Chunk Len. (chars)    Avg. Chunk Len. (tokens)
how              3        0.3333          7076.0                     1158.3
what             13       0.8725          6017.7                     977.8
when             2        0.9998          6368.0                     1045.0
where            2        0.4249          6757.5                     1035.5
which            5        1.0             5548.8                     905.0
who              7        0.9028          5629.1                     911.6
other            19       0.7552          5850.2                     941.0

Table 9. GraphRAG comparison-type analysis on RAG-unanswered comparison questions.

Comparison Type    Count    Mean BERT F1    Avg. Chunk Len. (chars)    Avg. Chunk Len. (tokens)
explicit           8        0.6775          4527.2                     727.6
ordinal            5        0.9247          5568.6                     926.2
other              1        1.0             5368.0                     858.0
yes/no             29       0.9664          5334.0                     859.3

Table 8 shows the performance of GraphRAG according to the WH-question types of bridge-type questions that RAG could not answer. The “which” and “when” questions showed the highest success (BERT F1: 1.0 and 0.9998). These types of questions usually ask for clear information, and GraphRAG’s structured data gives good results for them. The “what” and “who” questions are also quite successful. In contrast, the “how” and “where” questions gave lower results. These questions usually require explanation or location information and call for more complex reasoning. In addition, the supporting texts of “how” questions are longer on average, which makes them harder for the model to process.

Table 9 breaks down, by comparison subtype, GraphRAG's performance on the comparison-type questions that RAG could not answer. The highest success was seen in the “yes/no” (BERT F1: 0.9664) and “ordinal” (0.9247) types. These types of questions can usually be solved easily with clear and structured information. The score is noticeably lower for “explicit” comparison questions (0.6775), possibly because the features compared in these questions are not always clearly defined. The single question in the “other” category scored 1.0, but no generalization can be made from one example.

These detailed results provide the background needed to interpret the overall differences between the two systems. When the results are considered together, a consistent pattern becomes clear. GraphRAG performs better than the standard RAG system across both bridge and comparison questions, and this improvement is visible in all semantic metrics. Bridge questions remain more difficult for both systems because they require chaining several pieces of information, yet GraphRAG still produces higher cosine similarity and BERT F1 scores in this subset. The gains are even more noticeable in comparison questions, where the graph structure seems to support the reasoning steps needed to connect two entities more reliably.

The question-type analysis also helps explain where this advantage comes from. GraphRAG handles “which”, “who”, and “yes/no” questions particularly well, likely because these forms correspond more directly to the triple patterns used in the graph. In contrast, “how” and “where” questions continue to be challenging, as they often require descriptive or spatial reasoning that is not explicitly encoded in triple form. Similar patterns were observed in both the bridge and comparison subsets.

A notable outcome of the study is the behavior on questions that RAG could not answer. GraphRAG was able to produce meaningful responses for most of these cases. For bridge questions, it answered 86% of the “I don’t know” outputs with semantically valid responses, and it showed similarly strong performance on comparison questions. This indicates that the graph-based context enables the model to recover information that the standard RAG pipeline fails to retrieve or interpret.
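The recovery percentages quoted for the RAG-unanswered bridge questions follow from simple counting over a 0.5 threshold. The sketch below shows that arithmetic. The per-question scores are synthetic placeholders chosen only to reproduce the reported counts (44 and 31 out of 51), not data from the study, and treating a score exactly equal to the threshold as a success is an assumption.

```python
from typing import Sequence, Tuple


def recovery_rate(
    scores: Sequence[float], threshold: float = 0.5
) -> Tuple[int, int, float]:
    """Return (recovered, total, percent) for scores at or above threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    recovered = sum(1 for s in scores if s >= threshold)
    percent = round(100.0 * recovered / len(scores), 2)
    return recovered, len(scores), percent


# Synthetic placeholder scores (NOT the study's data): 51 bridge questions
# that RAG answered with "I don't know", re-scored after GraphRAG.
bert_f1_scores = [0.9] * 44 + [0.2] * 7
cosine_scores = [0.9] * 31 + [0.2] * 20

print(recovery_rate(bert_f1_scores))  # (44, 51, 86.27)
print(recovery_rate(cosine_scores))   # (31, 51, 60.78)
```

The same function applied to the refusal counts in Table 6 (51 of 250 and 43 of 250) gives the 20.4% and 17.2% ratios reported there.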
Overall, the results suggest that incorporating structured relational information provides clear benefits in multi-hop settings, especially in datasets with long contexts and distributed evidence. GraphRAG offers a more reliable retrieval signal and supports reasoning steps that RAG cannot perform effectively. This makes graph-augmented methods a promising direction for improving answer quality and reducing failure cases in complex QA tasks.

5. Conclusion and Future Work

This study comparatively evaluated the classical Retrieval-Augmented Generation (RAG) approach and the knowledge graph-supported GraphRAG architecture in multi-hop question answering tasks. Using the HotpotQA dataset, detailed analyses were carried out on both bridge and comparison type questions, and the questions that the RAG system could not answer were reprocessed with the GraphRAG architecture. The results showed that GraphRAG, which incorporates structured information, was more successful in producing meaningful and correct answers, especially for multi-step questions that require deeper and more complex reasoning.

The GraphRAG architecture used in the study goes beyond a simple triple extraction pipeline and includes additional reasoning steps based on question analysis. The model first analyzes the question, identifies the expected answer type, the entities involved, and the required reasoning pattern, and then generates an answer using only the relevant triples. This design supports both semantic accuracy and more precise information matching.

In future work, several directions will be explored to further improve the system. One priority is enhancing the triple extraction stage, as the quality of extracted triples directly affects downstream reasoning performance. The system will also be tested on datasets such as QASPER, which contain longer and more complex contexts, in order to assess its robustness and adaptability in settings where textual heterogeneity and noisy evidence make triple extraction more challenging.

Beyond these extensions, future studies will investigate whether integrating knowledge graphs directly into model training provides additional improvements over retrieval-based reasoning. Comparing a dynamic, fine-tuned KG-aware model with the static GraphRAG pipeline may help reveal how much of the performance gain comes from explicit structured retrieval versus internalized relational knowledge. In addition, evaluating the pipeline across a broader set of language models with different sizes and architectural properties will allow a more comprehensive understanding of how model characteristics interact with graph-based retrieval and whether certain model families benefit more from structured knowledge.

Overall, the findings of this study show that using structured retrieval and graph-supported reasoning can significantly improve performance in multi-hop question answering, especially in cases where standard RAG systems fail. These results highlight a promising direction for future QA systems aiming to achieve higher accuracy, more reliable reasoning, and reduced hallucination in complex information-seeking tasks.

Declarations

Author Contribution
N.A. conceived and designed the study, implemented the RAG and GraphRAG pipelines, performed all experiments and analyses, and wrote the entire manuscript. M.O.Ü. supervised the research, provided conceptual guidance, and reviewed the manuscript. Both authors approved the final version.

Data Availability
The datasets used in this study are publicly available (HotpotQA). All evaluation outputs and triple-extraction files generated during the study are available from the corresponding author upon reasonable request.

References

Aksoy, N., Güven, Z. A., & Ünalir, M. O. (2025). Understanding the Impact of Dataset Characteristics on RAG based Multi-hop QA Performance. https://doi.org/10.21203/rs.3.rs-6968562/v1
Aksoy, N., Ünalir, M. O., & Güven, Z. A. (n.d.). Architecting and Evaluating a RAG based Question Answering System for SQuAD Dataset. In Recent Developments in Engineering with Applied Mathematics and AI. World Scientific.
Arefeen, M. A., Debnath, B., & Chakradhar, S. (2024). LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs. Natural Language Processing Journal, 7. https://doi.org/10.1016/j.nlp.2024.100065
Besbes, A. (2024, January 15). 3 Advanced Document Retrieval Techniques To Improve RAG Systems. https://towardsdatascience.com/3-advanced-document-retrieval-techniques-to-improve-rag-systems-0703a2375e1c
Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., & Gardner, M. (2021). A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers.
EQT Ventures. (2024, December 19). Knowledge Graph(s) and LLM-based ontologies have a very good shot at unlocking GenAI in production. https://medium.com/eqtventures/knowledge-graph-s-and-llm-based-ontologies-have-a-very-good-shot-at-unlocking-genai-in-production-1b167533ef63
Han, H., Shomer, H., Wang, Y., Lei, Y., Guo, K., Hua, Z., Long, B., Liu, H., & Tang, J. (2025). RAG vs. GraphRAG: A Systematic Evaluation and Key Insights.
Han, H., Wang, Y., Shomer, H., Guo, K., Ding, J., Lei, Y., Halappanavar, M., Rossi, R. A., Mukherjee, S., Tang, X., He, Q., Hua, Z., Long, B., Zhao, T., Shah, N., Javari, A., Xia, Y., & Tang, J. (2024). Retrieval-Augmented Generation with Graphs (GraphRAG).
Izacard, G., & Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. https://doi.org/10.18653/v1/2021.eacl-main.74
Jiang, Z., Ma, X., & Chen, W. (2024). LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs.
Kau, A., He, X., Nambissan, A., Astudillo, A., Yin, H., & Aryani, A. (2024). Combining Knowledge Graphs and Large Language Models.
Lee, J., Kwon, D., Jin, K., Jeong, J., Sim, M., & Kim, M. (2025). MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W. T., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (Vols. 2020-December, pp. 9459–9474). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. http://arxiv.org/abs/2307.03172
Mavi, V., Jangra, A., & Jatowt, A. (2024). Multi-hop Question Answering. Foundations and Trends® in Information Retrieval, 17(5), 457–586. https://doi.org/10.1561/1500000102
Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., & Wu, X. (2024). Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/pdf/2306.08302
Rosset, C., Xiong, C., Phan, M., Song, X., Bennett, P., & Tiwary, S. (2020). Knowledge-Aware Language Model Pretraining.
Wang, L., Chen, H., Yang, N., Huang, X., Dou, Z., & Wei, F. (2025). Chain-of-Retrieval Augmented Generation.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. https://doi.org/10.18653/v1/d18-1259
Yao, L., Mao, C., & Luo, Y. (2019). KG-BERT: BERT for Knowledge Graph Completion.
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced Language Representation with Informative Entities.
Zhuang, Z., Zhang, Z., Cheng, S., Yang, F., Liu, J., Huang, S., Lin, Q., Rajmohan, S., Zhang, D., & Zhang, Q. (2024). EfficientRAG: Efficient Retriever for Multi-Hop Question Answering.
zilliz. (2024). GraphRAG Explained: Enhancing RAG with Knowledge Graphs. https://medium.com/@zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1

Additional Declarations
No competing interests reported.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8283065","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":555519747,"identity":"2cf99c7c-a013-4da9-b71a-2271402516e6","order_by":0,"name":"Nimet Aksoy","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABCklEQVRIiWNgGAWjYDACZghlAMSMDxIqbEB04wGCWg5AtDAbPDiTBtLSgF8LA0ILm+TDtsMwAdzAnJ078fEHBjtj+fYeM4mEM+ft1rYfBtpSYxONS4tlM+9mgwMMyWYGZ84YWyRU3E7ediYRqOVYWm4DDi0Gh3m3SRxgYLYxkMgxvJFw5nay2QGgFsaGw/i0bP9xgKHeRn5GjoFEYtu5ZLPzDwlq2Qb07WEzhhs5RkAtB+zMbhC2ZbPEGYPjxgZnjhUbJJxJTjC7AbQlAZ9fzp/d+KGiotpwfnvzxoc/Kuzszc6nP3zwocYGpxaoRhDBASYTwSoT8CqHA/YHINKeOMWjYBSMglEwkgAAsa5oWX518DcAAAAASUVORK5CYII=","orcid":"","institution":"Ege University","correspondingAuthor":true,"prefix":"","firstName":"Nimet","middleName":"","lastName":"Aksoy","suffix":""},{"id":555519748,"identity":"1c6f1a37-6b2e-47c9-830b-6ce112cab71a","order_by":1,"name":"Murat Osman Ünalır","email":"","orcid":"","institution":"Ege University","correspondingAuthor":false,"prefix":"","firstName":"Murat","middleName":"Osman","lastName":"Ünalır","suffix":""}],"badges":[],"createdAt":"2025-12-05 00:53:16","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8283065/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8283065/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":97766585,"identity":"34a49623-ebe7-40b2-a1c8-27355c39d341","added_by":"auto","created_at":"2025-12-09 
07:18:15","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":757143,"visible":true,"origin":"","legend":"","description":"","filename":"AComparativeStudyofGraphRAGandRAGspringerV1.docx","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/4aab69c77e0696cbb86ecbf3.docx"},{"id":97897352,"identity":"26070b7e-6872-42e9-8944-1b7684284cb4","added_by":"auto","created_at":"2025-12-10 15:37:46","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":4450,"visible":true,"origin":"","legend":"","description":"","filename":"674ed76ea37d4f9da159e9449ef5e0f1.json","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/0ad476aca401e70eb88526ec.json"},{"id":97896486,"identity":"54f0a5a5-58df-45e2-bd6a-93b0ed9db40b","added_by":"auto","created_at":"2025-12-10 15:36:38","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":108955,"visible":true,"origin":"","legend":"","description":"","filename":"674ed76ea37d4f9da159e9449ef5e0f11enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/553da15ffac79df4c3e724d1.xml"},{"id":97896504,"identity":"1aa13afc-3902-412d-89d9-6f34a17abfb4","added_by":"auto","created_at":"2025-12-10 15:36:39","extension":"png","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":609767,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/bcc4beea92ee1e70649b205d.png"},{"id":97896552,"identity":"2f737b37-95e8-4729-80de-8be5e5b32c2e","added_by":"auto","created_at":"2025-12-10 
15:36:45","extension":"jpeg","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":345496,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/d705ff0ba5e9bb67d2fc1d2c.jpeg"},{"id":97766583,"identity":"b2c72842-4d7a-494d-a8ae-5a6da98c6729","added_by":"auto","created_at":"2025-12-09 07:18:15","extension":"jpeg","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1374,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/7e7ff9794565591ddca88998.jpeg"},{"id":97766582,"identity":"3f8ac01a-70af-43a1-a74a-edf14891d354","added_by":"auto","created_at":"2025-12-09 07:18:15","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":34129,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/3f903cd5d50e4ff234a12f29.png"},{"id":97766590,"identity":"b8dc5c7d-68a3-4875-8bdd-c8b8e581000c","added_by":"auto","created_at":"2025-12-09 07:18:15","extension":"png","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":53558,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/3bccc8eb9394fa97b29b8703.png"},{"id":97766589,"identity":"5751995d-daf6-4b89-a743-883130d4cec3","added_by":"auto","created_at":"2025-12-09 
07:18:15","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":943,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/57bb0a759153c8fc6c86aff5.png"},{"id":97766593,"identity":"36a1454c-c1c8-4d3d-aca9-128bee3da9f9","added_by":"auto","created_at":"2025-12-09 07:18:15","extension":"xml","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":107204,"visible":true,"origin":"","legend":"","description":"","filename":"674ed76ea37d4f9da159e9449ef5e0f11structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/1ca0d6a5c22ee81295bc92b9.xml"},{"id":97766586,"identity":"31026778-772d-4866-ac99-198819ffa325","added_by":"auto","created_at":"2025-12-09 07:18:15","extension":"html","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":114206,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/ff04464b166cb36b402b37a1.html"},{"id":97766581,"identity":"a8ed53c2-c4fb-49f6-81bd-2506f3bfe1dc","added_by":"auto","created_at":"2025-12-09 07:18:15","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":256300,"visible":true,"origin":"","legend":"\u003cp\u003eRAG Pipeline Structure Used in the Study\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/8d272931d0afd802f55ce2a8.png"},{"id":97766580,"identity":"e473cf46-1399-495a-b548-176bb76a6149","added_by":"auto","created_at":"2025-12-09 07:18:14","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":36731,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of the GraphRAG pipeline used in this 
study.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/9c93f137d1d9e66a2626071a.png"},{"id":98430471,"identity":"ca8ee43a-3b8a-49c4-9fcb-876c7529d66c","added_by":"auto","created_at":"2025-12-17 16:45:32","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1071796,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8283065/v1/75b24d1c-f853-489f-a444-cb3004bafcde.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Structured Knowledge for Multi-hop QA: A Comparative Study of GraphRAG and RAG","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThe emergence of large language models (LLMs) has led to substantial progress in various natural language processing (NLP) tasks such as summarization, question answering (QA), and information retrieval. These models exhibit strong language understanding capabilities due to their advanced architecture and training on large-scale open-domain datasets. However, their performance often declines in complex reasoning settings, particularly in tasks that require integrating information from multiple sources. To mitigate these limitations, the Retrieval-Augmented Generation (RAG) framework was proposed. RAG enhances LLMs by incorporating external knowledge sources, typically in the form of vectorized domain-relevant documents stored in a vector database (Izacard \u0026amp; Grave, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Lewis et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). During inference, relevant passages are retrieved based on vector similarity and provided to the LLM as additional context, improving its ability to perform in specialized domains. 
Nevertheless, vector similarity\u0026ndash;based retrieval often struggles to surface all the necessary evidence required for multi-hop reasoning, leading to incomplete or noisy context.\u003c/p\u003e\u003cp\u003eDespite its effectiveness in many applications, vector similarity-based retrieval may fall short in multi-hop QA tasks, where the model must combine multiple pieces of information across documents. To address this, recent studies have introduced GraphRAG, which leverages structured representations of knowledge\u0026mdash;specifically knowledge graphs constructed from external sources\u0026mdash;rather than unstructured vector embeddings. In this framework, input documents are transformed into subject\u0026ndash;predicate\u0026ndash;object triples, allowing the model to reason more systematically over structured data and handle tasks requiring deeper inference (Han et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2025\u003c/span\u003e; Kau et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Pan et al., \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; zilliz, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).Rather than relying on pre-built or externally curated knowledge graphs, our study adopts a lightweight, task-specific graph construction approach, where triples are extracted directly from the dataset context using an LLM.\u003c/p\u003e\u003cp\u003eStructurally informed models tend to generate more consistent responses and significantly reduce hallucinations by constraining the model to reason over explicit relational information. In addition to improving reliability, this approach enhances performance in tasks requiring multi-step inference. In this study, we conduct a detailed comparison of RAG and our GraphRAG implementation using the HotpotQA dataset, a widely adopted benchmark for multi-hop reasoning. 
Experiments are conducted across different question types (i.e., bridge and comparison). The results show that GraphRAG yields a performance improvement of nearly 20% across the full dataset. Moreover, in cases where RAG failed to generate an answer, GraphRAG successfully answered 80\u0026ndash;90% of those questions. Because GraphRAG produces synthesized answers derived from relational structures rather than extractive spans, we evaluate both systems using semantic similarity metrics, which more accurately capture correctness in structured reasoning settings. These findings highlight the potential of knowledge-graph-based reasoning in enhancing multi-hop QA systems.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e"},{"header":"2. Related Work","content":"\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eTo enhance the performance of large language models (LLMs) in domain-specific question answering tasks and reduce their tendency to generate hallucinations, the Retrieval-Augmented Generation (RAG) framework was introduced. RAG combines the generative capabilities of LLMs with external document retrieval, enabling the model to generate more grounded and informative responses without the need for retraining. In a typical RAG pipeline, relevant documents are retrieved based on semantic similarity to the query and appended to the model\u0026rsquo;s context window during inference (Lewis et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). 
This architecture has been widely adopted in various QA tasks and has been shown to improve factual accuracy and reduce hallucination in both open-domain and knowledge-intensive settings (Arefeen et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Besbes, 2024; Izacard \u0026amp; Grave, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2021\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eDespite its success, RAG faces limitations when applied to multi-hop question answering (QA) tasks, where reasoning across multiple documents is required. Datasets like HotpotQA are specifically designed to test multi-hop inference by including questions that necessitate integrating information from distinct sources (Yang et al., \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). However, standard RAG pipelines often struggle with these scenarios due to the increased context length, semantic dispersion, and the lack of structural organization in retrieved content. Chunk-based retrieval methods, commonly used in RAG, may overlook long-range dependencies or split coherent information across different chunks, leading to incomplete or incorrect answers, especially in bridge-type or comparison questions (Liu et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Mavi et al., \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eTo address these limitations, several extensions to the RAG framework have been proposed. EfficientRAG, introduced by Zhuang et al. (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), trains lightweight labeling models on synthetic datasets derived from HotpotQA to perform efficient multi-hop retrieval without requiring LLM calls during inference. 
While this approach significantly reduces inference costs, its reliance on word-level labels may fail to capture semantic variation, and preprocessing large datasets like QASPER still incurs substantial overhead (Dasigi et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2021\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eSimilarly, Lee et al. (\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) proposed a Multi-Hop Tree Structure (MHTS) Framework that generates synthetic QA datasets with controlled difficulty levels. However, the approach has only been tested on a narrow domain (e.g., the novel David Copperfield) and relies on proprietary models like GPT-4 Turbo, limiting its generalizability and scalability.\u003c/p\u003e\u003cp\u003eAnother prominent direction is multi-hop dense retrieval. Xiong et al. (2020) introduced a system that reformulates the query after each retrieval step by incorporating previously retrieved passages, enabling sequential information integration (Rosset et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). While this method showed improvements over sparse retrieval, it is constrained by a fixed number of hops and incurs high computational cost due to repeated embedding and search operations.\u003c/p\u003e\u003cp\u003eMore recently, the Chain-of-Retrieval (CoRAG) framework proposed by Wang et al. (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) introduced a dynamic query reformulation mechanism to enable chained multi-hop retrieval during inference. CoRAG demonstrated promising results on HotpotQA and MuSiQue by using adaptive decoding strategies such as greedy decoding, best-of-N, and tree search. 
However, its performance gains are tightly linked to increased token consumption and computational overhead due to complex rejection sampling and chaining techniques.\u003c/p\u003e\u003cp\u003eLikewise, ReSP (Retrieve, Summarize, Plan), proposed by Jiang et al. (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), introduces a planning module based on summarization to reduce context redundancy. Although it improves retrieval quality, repeated invocation of a summarization model increases latency, making it impractical for real-time systems.\u003c/p\u003e\u003cp\u003eAlthough numerous architectural improvements have been proposed to adapt the RAG framework to multi-hop question answering tasks, the underlying source of information in these systems remains unstructured natural language text. This lack of structure weakens contextual coherence and reduces the effectiveness of vector-based retrieval methods, particularly in long passages where relevant information is dispersed. In datasets like HotpotQA, which require multi-step reasoning, such limitations often lead to lower answer quality and increased hallucination risks. In contrast, knowledge graphs (KGs) offer a structured data representation that can help address these issues. By representing information as triples\u0026mdash;(subject, predicate, object)\u0026mdash;KGs encode explicit semantic relationships between entities and enable language models to perform more interpretable and coherent reasoning. While the creation and maintenance of high-quality KGs is time-consuming and resource-intensive, their integration with language models has been shown to improve accuracy and reduce hallucinations.\u003c/p\u003e\u003cp\u003eRecent research has explored different strategies for integrating KGs with large language models (LLMs). 
A comprehensive analysis of this landscape is presented in Unifying Large Language Models and Knowledge Graphs: A Roadmap (Pan et al., \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), which categorizes KG-LLM integration approaches into three main paradigms: KG-enhanced LLMs, LLM-enhanced KGs, and hybrid architectures. Although each paradigm presents its own challenges and trade-offs, they all highlight the benefit of leveraging structured knowledge to enhance language model reasoning. Prior studies have proposed various methods: for example, KG-BERT aims to complete missing links in a KG using BERT by representing triples as textual input, whereas ERNIE integrates KG entities during pretraining to improve the performance of transformer-based models (Yao et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Zhang et al., \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). However, both approaches involve costly processes such as KG construction and model retraining.\u003c/p\u003e\u003cp\u003eIn response to these limitations, lighter-weight techniques such as KG-prompting have gained popularity in recent years. These approaches use knowledge graphs during inference without changing the language model itself, making them easier and cheaper to apply in practice. In this setup, triples are automatically extracted from plain text through prompt-based methods, allowing the model to reason over structured information without additional training.\u003c/p\u003e\u003cp\u003eOne of the most prominent examples of this direction is the GraphRAG framework, which integrates knowledge graphs into the RAG pipeline to support structured reasoning during multi-hop question answering. GraphRAG allows the model to reason over relational triples extracted from relevant documents or linked from external ontologies, thus combining the strengths of retrieval-based and structure-based QA. 
This approach has attracted increasing attention in recent research, and many studies have been conducted using similar methods to improve multi-hop reasoning and reduce hallucination (EQT Ventures, 2024; Han et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2025\u003c/span\u003e; Kau et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eIn our work, we present a systematic comparison between the GraphRAG and standard RAG pipelines using the widely adopted HotpotQA dataset. Two distinct subsets of the dataset were selected\u0026mdash;one consisting of bridge-type questions and the other of comparison-type questions\u0026mdash;and the complete GraphRAG pipeline was applied end-to-end. To ensure that the extracted triples would support meaningful inference, careful prompt engineering was applied during the graph construction phase, aligning the triples semantically with the question intent. Experimental results show that GraphRAG outperformed the baseline RAG model by approximately 20% in answer accuracy. More importantly, a separate evaluation focused on questions left unanswered by the RAG system (i.e., \"I don't know\" responses) revealed that GraphRAG was able to answer 80\u0026ndash;90% of these previously unanswerable questions correctly, depending on the question type. These findings demonstrate that incorporating structured knowledge into QA pipelines can substantially improve not only performance but also reliability, especially in complex multi-hop scenarios.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e"},{"header":"3. Methodology","content":"\u003cp\u003eThis section describes the datasets used in the study, as well as the design and implementation of both the RAG and GraphRAG frameworks. 
In particular, we first summarize the properties of the HotpotQA dataset, then detail the baseline RAG pipeline and the proposed GraphRAG pipeline built on top of the same data representation.\u003c/p\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e3.1 Dataset Description\u003c/h2\u003e\u003cp\u003eHotpotQA is a large-scale, Wikipedia-based question answering dataset specifically built to evaluate the ability of question answering systems and language models to perform multi-hop reasoning. The dataset, introduced by Yang et al. (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2018\u003c/span\u003e), contains 112,779 examples with annotated supporting facts and is a valuable benchmark for multi-hop QA and RAG-based architectures. Each question in the dataset is mapped to two or more Wikipedia paragraphs, and the models are expected to find the answer through chains of reasoning. There are two different question types in the dataset, \u0026ldquo;bridge\u0026rdquo; and \u0026ldquo;comparison\u0026rdquo;:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eBridge questions, which require reasoning over a bridge entity that connects two distinct facts or paragraphs.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eComparison questions, which involve evaluating and comparing properties of two entities (e.g., \"Who is older, A or B?\").\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIn addition to the question type, each example is annotated with a difficulty level (easy, medium, or hard), which reflects the number of reasoning steps and the complexity of the required evidence.\u003c/p\u003e\u003cp\u003eThe HotpotQA dataset is divided into subsets based on difficulty level. Easy questions usually require only a single fact to answer. 
Medium questions involve multi-hop reasoning and can be answered by baseline models. Hard questions also require multi-hop reasoning but are more complex and often cannot be answered by baseline systems. Each example in the HotpotQA dataset is a JSON object that includes a natural language question, its correct answer, the reasoning type (bridge or comparison), the difficulty level (easy, medium, or hard), a list of supporting facts, and the context. The context consists of Wikipedia paragraphs given as pairs of a title and a list of related sentences. The supporting facts specify which titles and sentence indices are required to answer the question, and thus explicitly encode the multi-hop nature of the dataset.\u003c/p\u003e\u003cp\u003eSince the goal of this study is to understand how structured data affects QA performance in multi-hop settings, we selected 500 examples from the HotpotQA dataset, focusing on two question types: bridge and comparison. These 500 examples (250 bridge-type and 250 comparison-type questions) were taken from our previous work, Understanding the Impact of Dataset Characteristics on RAG-based Multi-hop QA Performance, where they were already curated to represent typical multi-hop reasoning patterns (Aksoy et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2025\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eIn both studies, we preprocessed the context field into a RAG-compatible format by restructuring it as \u0026ldquo;title: sentence1, sentence2, \u0026hellip;\u0026rdquo;, allowing for more consistent passage retrieval. 
This representation flattens each Wikipedia paragraph into a single textual unit while preserving the association between the title and its sentences, which is useful both for dense retrieval and for triple extraction.\u003c/p\u003e\u003cp\u003eIn the current study, we also extracted subject\u0026ndash;predicate\u0026ndash;object triples directly from this reformatted context to build knowledge graphs used in the GraphRAG pipeline. By using the same preprocessed context for both RAG and GraphRAG, we ensure that any performance differences can be attributed to the reasoning mechanism (unstructured retrieval vs. structured graph reasoning) rather than to changes in data representation.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e3.2 RAG Pipeline\u003c/h2\u003e\u003cp\u003eRetrieval-Augmented Generation (RAG) is a method that combines document retrieval and answer generation in one system. First, the question is used to find related documents from an external source, such as a vector database. Then, the question and the retrieved documents are given to a language model, which uses both to create an answer. This approach helps the model give more accurate and informed responses without extra training, especially for tasks that need current or specific knowledge (Lewis et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). RAG therefore provides a controlled way of grounding LLM outputs in external evidence, which is particularly important in multi-hop scenarios where multiple pieces of information must be integrated.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eIn this study, we adopt a standard Retrieval-Augmented Generation (RAG) framework implemented using the LangChain library. 
The pipeline consists of three main stages:\u003c/p\u003e\u003cp\u003e(1) preprocessing and chunking dataset-specific documents,\u003c/p\u003e\u003cp\u003e(2) semantic vector-based retrieval, and\u003c/p\u003e\u003cp\u003e(3) answer generation using a large language model (LLM).\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e illustrates the RAG pipeline as used in this study, including the document preprocessing steps, the chunk-level embedding process, and the retrieval\u0026ndash;generation workflow.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFor this study, we reused the RAG pipeline previously developed and evaluated in our earlier multi-dataset experiments. Specifically, a subset of 500 questions from the HotpotQA dataset was selected, consisting of 250 bridge-type and 250 comparison-type questions. The context field for each question was formatted in the same way as in the earlier setup, where each paragraph was transformed into a unified structure of title: sentence1, sentence2, ... to preserve semantic coherence during retrieval. Using the same preprocessing strategy ensures comparability with our previous work (Aksoy et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) and isolates the effect of structured reasoning introduced in GraphRAG.\u003c/p\u003e\u003cp\u003eThe RAG pipeline was implemented using LangChain\u0026rsquo;s RetrievalQA chain. Contexts were split into overlapping chunks using RecursiveCharacterTextSplitter (chunk size: 1000 tokens, overlap: 100) to maintain information continuity. 
This parameter choice reflects a balance between capturing sufficient local context and avoiding excessive chunk fragmentation, which is known to negatively affect multi-hop retrieval performance.\u003c/p\u003e\u003cp\u003eThese chunks were embedded using the all-MiniLM-L6-v2 model from Hugging Face\u0026rsquo;s sentence-transformers library and stored in a ChromaDB vector database. We used the default embedding dimension of 384, which is well suited for dense retrieval tasks where semantic rather than lexical similarity is required.\u003c/p\u003e\u003cp\u003eDuring retrieval, Maximal Marginal Relevance (MMR) similarity search was employed to fetch relevant chunks. MMR was selected because it reduces redundancy in retrieved passages and improves coverage of distinct reasoning paths\u0026mdash;an important requirement for multi-hop QA.\u003c/p\u003e\u003cp\u003eFor the answer generation step, we used the LLaMA3-70B-8192 model via the Groq API, selected for its support for long input sequences, ease of access, and increasing adoption in research. A question-specific prompt template was used to help the model produce short, accurate responses and reduce hallucinations. Temperature was set to 0.1, and a maximum output length of 64 tokens was used to encourage concise answers consistent with HotpotQA\u0026rsquo;s answer format.\u003c/p\u003e\u003cp\u003eNo model-specific fine-tuning was performed. Rather than directly reusing the numerical hyperparameter values from our earlier SQuAD-based study (Aksoy et al., n.d.), we applied the design principles established there: namely, that multi-hop tasks benefit from (i) chunk sizes aligned with typical context lengths, (ii) MMR-based retrieval to maximize evidence diversity, and (iii) low-temperature decoding to reduce hallucination. In this study, the actual hyperparameter values were adjusted to the characteristics of the HotpotQA dataset, while the underlying methodological rationale remained consistent. 
This controlled yet dataset-aware setup ensures that the RAG baseline is both robust and comparable, allowing performance differences to be attributed to the structured reasoning introduced in GraphRAG rather than to parameter tuning.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e3.3 GraphRAG\u003c/h2\u003e\u003cp\u003eIn the RAG architecture, external information is stored in a vector database after being converted into dense vector representations. Although relevant documents are retrieved based on vector similarity, the underlying data remains unstructured. In contrast, GraphRAG replaces this component with knowledge graphs, where external information is represented as structured triples (subject, predicate, object). This structured format allows the model to capture deeper semantic relations, going beyond surface-level similarity (Han et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). In this study, we adopt a lightweight, document-level variant of GraphRAG, where small task-specific graphs are constructed directly from the same HotpotQA contexts used in the RAG pipeline.\u003c/p\u003e\u003cp\u003eIn this study, a total of 500 examples from the HotpotQA dataset\u0026mdash;250 bridge-type and 250 comparison-type questions\u0026mdash;were processed using this GraphRAG framework. Each example consists of a question and its associated context, which was preprocessed in the same way as in the RAG pipeline to ensure a fair comparison between the two architectures.\u003c/p\u003e\u003cp\u003eTriples were extracted from the context fields using a prompt-based method with the LLaMA3-70B-8192 model via the Groq API. The same model was also used for the question answering (QA) stage, reducing representational mismatch between the structured triples and the unstructured text, as both are produced and consumed by a single LLM. 
Although triples could in principle be derived using alternative NLP techniques (e.g., dependency parsing, Open Information Extraction), we opted for an LLM-based approach due to its ability to capture implicit and semantically rich relations. The full extraction prompt is provided in \u003cb\u003eAppendix A\u003c/b\u003e. Consistent with prior work, we note that LLM-generated triples may not be perfectly accurate; in this study, we evaluate performance at the end-to-end QA level and leave triple-level validation to future work.\u003c/p\u003e\u003cp\u003eThe extracted triples were stored as a list and subsequently used to construct a local directed graph for each question. Each subject and object was mapped to a graph node, while each predicate defined a labelled directed edge connecting the corresponding nodes. During this process, duplicate or clearly malformed triples were discarded, resulting in a compact yet informative set of relational facts associated with each example.\u003c/p\u003e\u003cp\u003eAfter the graph construction phase, the second stage of our GraphRAG architecture performs question answering using only these triples. The LLM is prompted to analyse the question, infer the expected answer type (e.g., person, date, place), identify the key entities, and determine the required reasoning style (e.g., multi-hop, comparison, temporal). Based on this internal analysis, the model selects the most relevant triples and attempts to infer an answer grounded solely in this structured evidence. If no answer can be supported by the available triples, the model is instructed to output \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo; rather than guessing. Given that the gold answers in HotpotQA are short and factual, the prompt also enforces concise responses. This explicit restriction to triple-based reasoning suppresses unsupported hallucinations and isolates the contribution of structured knowledge compared to the unstructured RAG baseline. 
Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e illustrates the complete GraphRAG pipeline used in this study.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Experimental Results and Discussion","content":"\u003cp\u003eIn this section, we describe the evaluation framework used in the experiments, present the results obtained with the baseline RAG and the GraphRAG pipelines on the HotpotQA dataset, and compare their behaviour on bridge and comparison questions. The main goal is to analyse how graph-augmented retrieval and reasoning influence answer quality and reliability in multi-hop question answering by reporting both the RAG and GraphRAG results on the same subsets, and by additionally examining how GraphRAG performs on the questions that the RAG system fails to answer.\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e4.1. Evaluation Framework\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eIn this study, we adopted a hybrid evaluation approach that combines semantic similarity metrics with a lightweight threshold-based labeling scheme to assess the quality and factual reliability of the generated answers. Our primary goal was to design an evaluation process that is efficient, reproducible, and scalable, which is especially important when working with multiple QA datasets and large numbers of examples in a cost-conscious academic context. Unlike our earlier work, where we used LLM-based evaluation tools such as RAGAS, this study intentionally avoids such dependencies. Although tools like RAGAS offer fine-grained scoring by leveraging powerful commercial language models (typically GPT-4), they also introduce two major limitations: high token costs and slow processing times. These drawbacks make them impractical for large-scale experiments. 
Instead, we developed a lightweight and robust evaluation pipeline using only open-source tools and metrics that can be run locally and freely.\u003c/p\u003e\u003cp\u003eThe metrics used in this evaluation framework, cosine similarity and BERTScore, are detailed below; together they enable a robust assessment of semantic fidelity in model responses. While both metrics aim to assess semantic similarity between the generated and reference answers, they do so in fundamentally different ways. Cosine similarity operates on sentence-level embeddings, measuring overall semantic alignment as the cosine of the angle between the two embedding vectors. In contrast, BERTScore uses deep transformer-based contextual representations to evaluate alignment at the token level, comparing each token in the model's output with those in the reference. This makes BERTScore more sensitive to paraphrasing, rewording, and nuanced phrasing.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eCosine Similarity: To evaluate the semantic closeness between generated answers and ground-truth references, we used cosine similarity. Sentence embeddings were created using the all-MiniLM-L6-v2 model from the Sentence-Transformers library, which is known for its ability to capture sentence-level meaning beyond surface token overlap. A similarity threshold of 0.5 was used to convert continuous similarity scores into binary classification labels (i.e., semantically correct or not).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eBERTScore: We also included BERTScore, computed using the roberta-large model, to provide a second layer of semantic evaluation. Unlike cosine similarity, BERTScore operates at the token level and accounts for contextual meaning, making it particularly useful when evaluating paraphrased or abstracted answers. 
To handle cases where the model returns defensive or evasive answers (e.g., \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo;), we assigned a BERTScore of zero to those outputs\u0026mdash;treating them as non-informative in the context of factual QA.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eAll evaluations were carried out using open-access Python libraries such as scikit-learn, sentence-transformers, and bert_score. We ran all experiments in a Google Colab environment, which allowed us to avoid the use of commercial LLM APIs entirely. This setup ensured that our evaluation framework was fast, free to use, and fully reproducible, a critical factor for large-scale academic studies involving multiple datasets and thousands of test cases.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe pseudo code describing the evaluation phase is as follows:\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eInitialize sentence_embedding_model with \"all-MiniLM-L6-v2\"\u003c/p\u003e\u003cp\u003eInitialize bert_model with \"roberta-large\"\u003c/p\u003e\u003cp\u003eSet cosine_similarity_threshold\u0026thinsp;=\u0026thinsp;0.5\u003c/p\u003e\u003cp\u003eFor each row in the dataset:\u003c/p\u003e\u003cp\u003eGet model_answer and true_answer as strings\u003c/p\u003e\u003cp\u003eIf either answer is empty or blank:\u003c/p\u003e\u003cp\u003eAssign cosine_similarity\u0026thinsp;=\u0026thinsp;0.0\u003c/p\u003e\u003cp\u003eAssign BERT precision, recall, and F1\u0026thinsp;=\u0026thinsp;0.0\u003c/p\u003e\u003cp\u003eElse:\u003c/p\u003e\u003cp\u003eEncode both answers into vectors using sentence_embedding_model\u003c/p\u003e\u003cp\u003eCompute cosine_similarity between vectors\u003c/p\u003e\u003cp\u003eIf model_answer starts with \"I don't know\":\u003c/p\u003e\u003cp\u003eAssign BERT precision, recall, and F1\u0026thinsp;=\u0026thinsp;0.0\u003c/p\u003e\u003cp\u003eElse:\u003c/p\u003e\u003cp\u003eCompute BERTScore precision, recall, and 
F1 between the two answers\u003c/p\u003e\u003cp\u003eAppend cosine_similarity and BERT scores to respective lists\u003c/p\u003e\u003cp\u003eIf cosine_similarity\u0026thinsp;\u0026ge;\u0026thinsp;threshold:\u003c/p\u003e\u003cp\u003eAssign predicted_label\u0026thinsp;=\u0026thinsp;1\u003c/p\u003e\u003cp\u003eElse:\u003c/p\u003e\u003cp\u003eAssign predicted_label\u0026thinsp;=\u0026thinsp;0\u003c/p\u003e\u003cp\u003eAfter all rows are processed:\u003c/p\u003e\u003cp\u003eAdd cosine_similarity, BERT scores, and predicted_labels to the dataset\u003c/p\u003e\u003cp\u003eSet true_labels\u0026thinsp;=\u0026thinsp;1 for all rows in answerable-question subsets.\u003c/p\u003e\u003cp\u003ePrint:\u003c/p\u003e\u003cp\u003eMean Cosine Similarity\u003c/p\u003e\u003cp\u003eMean BERTScore F1\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e4.2. Overall Performance (RAG vs. GraphRAG)\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eIn this section, we present the results obtained from the RAG and GraphRAG pipelines on the HotpotQA bridge and comparison subsets. The analysis focuses on how each system handles different multi-hop reasoning patterns and how graph-augmented retrieval influences answer quality. We first report the performance of the baseline RAG system, followed by the corresponding GraphRAG results on the same subsets. 
We also examine how GraphRAG performs on questions for which the RAG system produced an \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo; answer, allowing us to quantify the contribution of structured knowledge graphs in cases where standard retrieval fails.\u003c/p\u003e\u003cp\u003eThe results obtained from the four controlled subsets of the HotpotQA dataset are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e below.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eSubset Level performance on the HotpotQA dataset with RAG Pipeline\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQuestion Type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCosine Similarity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eBERT F1\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBridge\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.63\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.74\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eComparison\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.76\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows that the two subsets give very close results. The model is only slightly better on comparison questions, with cosine similarity at 0.64 and BERT F1 at 0.76. The bridge questions are slightly lower on both metrics (0.63 and 0.74), which is plausible because these questions usually require several reasoning steps and combining information from different parts of the context. In both cases, BERT F1 is higher than cosine similarity. This suggests that the model often produces answers that are reasonable in meaning even when the wording does not closely match the reference. Considering both metrics together gives a clearer picture of the overall behaviour.\u003c/p\u003e\u003cp\u003eTable 2 presents the detailed analysis of the bridge-type subset by WH-question word. Questions were categorized based on leading question words such as what, who, how, etc., and all others were grouped under the category \u0026ldquo;other.\u0026rdquo; This categorization provides insight into how different question formulations impact the quality of retrieval and answer generation in our RAG pipeline.\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. 
Question type-based performance on Bridge Dataset with RAG pipeline.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQuestion Type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCount\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMean BERT F1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAvg. Chunk Len. (chars)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eAvg. Chunk Len. 
(tokens)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhat\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.7508\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5672\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e922\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhen\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e11\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.7471\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6033\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e997\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhich\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e26\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.7182\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5305\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e864\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhere\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.6360\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5739\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e897\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ehow\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.3799\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e7083\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1141\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewho\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e20\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.6168\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5223\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e855\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eother\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e116\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.7816\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5963.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e966.9\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" 
class=\"InternalRef\"\u003e2\u003c/span\u003e results show that longer contexts tend to coincide with lower performance. For example, how-type questions had the longest average context (1141 tokens) and the lowest BERT F1 score (0.3799), although this group contains only five questions. This is expected, as these questions usually require explanation or reasoning, which is harder for the model. On the other hand, what-type questions performed better and were the most common WH type. These questions often ask for specific facts, which can be found more easily in the context. Similarly, when and which questions also gave good results. The other category showed the highest performance overall (0.7816). This may be because many questions in this group were shorter and more direct, making them easier to match with retrieved information. These findings suggest that the model performs better with fact-based and directly worded questions, while it struggles more with questions that need reasoning or multi-step answers.\u003c/p\u003e\u003cp\u003eUnlike bridge-type questions, comparison-type questions are inherently structured to contrast two or more entities. As a result, categorizing them by WH-type (e.g., what, who, how) is not meaningful, since most comparison questions follow a functional pattern that focuses on equivalence, order, quantity, or attributes, rather than seeking specific factual responses. Therefore, a more informative analysis groups these questions by comparison subtype: explicit, ordinal, yes/no, and other.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eExplicit comparisons\u003c/b\u003e: These questions include clear comparative expressions such as \"which one\", \"who has more\", or temporal markers like \"earlier\" and \"later\". 
They typically require retrieving two entities and selecting one based on a specified attribute.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003e\u0026bull; Example\u003c/strong\u003e\u003cp\u003e\u0026bull; \u0026ldquo;Which satellite was launched earlier, Hakucho or Chandra?\u0026rdquo;\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eOrdinal comparisons\u003c/b\u003e: These questions express sequence or order using terms like \"first\", \"older\", or \"developed earlier\". They require reasoning over timeline or development precedence.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003e\u0026bull; Example\u003c/strong\u003e\u003cp\u003e\u0026bull; \u0026ldquo;Who was born first, Greg Lake or someone else?\u0026rdquo;\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eYes/No comparisons\u003c/b\u003e: These are binary questions starting with auxiliary verbs such as \"Are\", \"Do\", \"Is\", or \"Did\", and expect a confirmation or negation of a comparative claim.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003e\u0026bull; Example\u003c/strong\u003e\u003cp\u003e\u0026bull; \u0026ldquo;Are Hayley Williams and Paul Simon both singers?\u0026rdquo;\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eOther comparisons\u003c/b\u003e: This group includes questions that involve implicit or less structured comparisons which do not clearly fall into the above categories. 
These are often more open-ended or abstract.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003e\u0026bull; Example\u003c/strong\u003e\u003cp\u003e\u0026bull; \u0026ldquo;What do Aram Avakian and Karo Parisyan have in common?\u0026rdquo;\u003c/p\u003e\u003c/p\u003e\u003cp\u003eThe results for comparison-type questions are shown in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eQuestion type-based performance on Comparison Dataset with RAG pipeline.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eComparison Type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eQuestion Count\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eBERT F1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAvg. Chunk (chars)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eAvg. 
Chunk (tokens)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eexplicit\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.894\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e5215\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e838\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eordinal\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e41\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.789\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e4946\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e810\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eyes/no\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.829\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e4802\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e777\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eother\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e51\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.810\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e5690\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e920\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThe performance analysis on the comparison set reveals consistent trends in how different comparison question types affect model performance. Explicit comparison questions with clear comparative structures achieved the highest BERT F1 score (0.894), indicating that the model handled direct and well-formulated comparison prompts more effectively. In contrast, ordinal questions that require reasoning about order or sequence (e.g., \u0026ldquo;Who was born first?\u0026rdquo;) scored the lowest (0.789). This is likely because these questions depend on locating and correctly interpreting temporal or ordered information, which can be harder to retrieve when such details are spread across the context. Yes/No (0.829) and other (0.810) comparison types showed moderate performance. Overall, these results suggest that the phrasing and structure of the comparison prompt itself play a noticeable role in how well the model can resolve multi-hop comparisons.\u003c/p\u003e\u003cp\u003eThe same analyses were performed using the GraphRAG pipeline for both bridge-type and comparison-type questions. Below we present the results and how they differ from RAG. 
Subset-level performance for bridge-type and comparison-type questions is shown in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eSubset level performance with GraphRAG pipeline.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQuestion Type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCosine Similarity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eBERT F1\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBridge\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.69\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.88\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eComparison\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThe results in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e show that the GraphRAG system performs better than RAG on both bridge and comparison questions. For bridge-type questions, cosine similarity reached 0.69 and BERT F1 reached 0.88 (0.63 and 0.74 with RAG). For comparison-type questions, cosine similarity reached 0.78 and BERT F1 reached 0.92 (0.64 and 0.76 with RAG). BERT F1 is higher than cosine similarity in both subsets, which shows that the model can reach semantically correct answers even when the wording differs from the reference. Overall, GraphRAG helps the model provide more accurate and meaningful answers than the basic RAG system.\u003c/p\u003e\u003cp\u003eA more detailed breakdown across WH-question types is presented in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, which illustrates how the graph-based structure influences the model\u0026rsquo;s performance in each question category.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eQuestion type-based performance on Bridge Dataset with GraphRAG\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" 
colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQuestion Type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCount\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMean BERT F1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAvg. Chunk Len. (chars)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eAvg. Chunk Len. (tokens)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhat\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.8914\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5672\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e922\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhen\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e11\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.7057\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6033\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e997\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhich\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e\u003cp\u003e26\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.9748\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5305\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e864\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewhere\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.6854\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5739\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e897\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ehow\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.7293\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e7083\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1141\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ewho\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e20\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.9561\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5223\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e855\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eother\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e116\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.878\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5963.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e966.9\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eIn the GraphRAG pipeline, performance on bridge-type questions varies by WH type, with the highest BERT F1 scores observed for \u0026ldquo;which\u0026rdquo; (0.9748) and \u0026ldquo;who\u0026rdquo; (0.9561) questions. These types usually require identifying specific entities or choosing among candidates, and the answers are readily found in the triples provided by the knowledge graph. \u0026ldquo;What\u0026rdquo; questions also performed well (0.8914) because they ask for concrete facts that can be easily retrieved. In contrast, lower scores were recorded for \u0026ldquo;how\u0026rdquo; (0.7293), \u0026ldquo;when\u0026rdquo; (0.7057), and \u0026ldquo;where\u0026rdquo; (0.6854) questions, which usually require temporal or spatial inference; the answers to these questions often cannot be extracted directly from the triples and remain difficult for knowledge graphs. Although context lengths are similar across question types, GraphRAG provided a clear advantage thanks to structured data. 
The \u0026ldquo;other\u0026rdquo; category, which included non-standard WH forms, also yielded a strong result (0.878), indicating that clearly phrased questions were matched effectively with the structured information.\u003c/p\u003e\u003cp\u003eIn addition, questions that could not be answered by the RAG pipeline were stored in separate CSV files and re-evaluated with the GraphRAG pipeline, and the performance difference between RAG and GraphRAG was evaluated in detail. The number of questions that could not be answered with the RAG pipeline (those the model answered with \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo;) and their ratio to the dataset are shown in Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eDistribution of Questions That Could Not Be Answered by the RAG Pipeline, by Subset\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSubset\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eI Don't Know Count\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eTotal Questions\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eRatio 
(%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBridge-Type\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e51\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e250\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e20.4%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eComparison-Type\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e43\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e250\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e17.2%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/br\u003e\u003cp\u003eTable 7 summarizes how the GraphRAG pipeline performed on the questions that the RAG system failed to answer. Of the 51 bridge-type questions left unanswered by RAG (those the model answered with \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo;), GraphRAG provided semantically meaningful answers to 44 (86.27%), using a BERT F1 score above 0.5 as the criterion. Only 31 answers (60.78%) were considered similar when measured by cosine similarity (threshold value 0.5). In other words, 31 answers closely matched the wording of the reference answers, while 44 were semantically correct even though the wording differed. The average BERT F1 score was 0.8012, and the average cosine similarity was 0.6596. 
These results show that GraphRAG can answer many questions that RAG cannot, especially by using structured data for questions that require deeper reasoning.\u003c/p\u003e\n\u003cp\u003eTable 7. GraphRAG Performance on Questions Unanswered by RAG\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"3\" cellpadding=\"0\" width=\"338\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eSubset\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCosine Similarity\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eBERT F1\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eBridge-Type\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.6596\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.8012\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eComparison-Type\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.7592\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.9086\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003cp\u003eFollowing this overall evaluation, we now provide a more detailed analysis of the bridge-type and comparison-type questions. Each category is examined based on question formulations (e.g., WH-type or comparison subtype), semantic accuracy, and contextual characteristics to better understand where GraphRAG performs well and where challenges remain. Tables 8 and 9 report these analyses for the bridge-type and comparison-type questions, respectively.\u003c/p\u003e\n\u003cp\u003eTable 8. 
GraphRAG WH-Type Analysis on RAG-Unanswered Bridge Questions\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"576\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eQuestion Type\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCount\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMean BERT F1\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAvg. Chunk Len. (chars)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAvg. Chunk Len. (tokens)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003ehow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.3333\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e7076.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e1158.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003ewhat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.8725\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n 
\u003cp\u003e6017.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e977.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003ewhen\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.9998\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e6368.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e1045.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003ewhere\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.4249\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e6757.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e1035.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003ewhich\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e1.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5548.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e905.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n 
\u003cp\u003ewho\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.9028\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5629.1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e911.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003eother\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.7552\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5850.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e941.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003cp\u003eTable 9. GraphRAG Comparison-Type Analysis on RAG-Unanswered Comparison Type Questions\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eComparison Type\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCount\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMean BERT F1\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAvg. 
Chunk Len (chars)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAvg. Chunk Len (tokens)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003eexplicit\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.6775\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e4527.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e727.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003eordinal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.9247\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5568.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e926.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003eother\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e1.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5368.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e858.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003eyes/no\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e0.9664\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e5334.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 115px;\"\u003e\n \u003cp\u003e859.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003cp\u003eTable 8 shows the performance of GraphRAG by WH-question type on the bridge questions that RAG could not answer. The \u0026ldquo;which\u0026rdquo; and \u0026ldquo;when\u0026rdquo; questions scored highest (mean BERT F1: 1.0 and 0.9998). These question types usually ask for a specific fact, which GraphRAG\u0026rsquo;s structured triples represent well. The \u0026ldquo;who\u0026rdquo; and \u0026ldquo;what\u0026rdquo; questions also scored well (0.9028 and 0.8725). In contrast, the \u0026ldquo;how\u0026rdquo; and \u0026ldquo;where\u0026rdquo; questions scored lower (0.3333 and 0.4249). These questions usually require an explanation or location information and involve more complex reasoning. In addition, the supporting texts of \u0026ldquo;how\u0026rdquo; questions are the longest on average (7076.0 characters), which makes them harder for the model to process.\u003c/p\u003e\n\u003cp\u003eTable 9 shows the GraphRAG results by comparison type on the comparison questions that RAG could not answer. The highest scores were seen in the \u0026ldquo;yes/no\u0026rdquo; (mean BERT F1: 0.9664) and \u0026ldquo;ordinal\u0026rdquo; (0.9247) types. These question types can usually be resolved directly from clear, structured information. 
The score is noticeably lower for \u0026ldquo;explicit\u0026rdquo; comparison questions (0.6775), because the features compared in these questions may not be clearly defined. The single question in the \u0026ldquo;other\u0026rdquo; category scored 1.0, but no generalization can be drawn from a single example.\u003c/p\u003e\n\u003cp\u003eThese detailed results provide the background needed to interpret the overall differences between the two systems. When the results are considered together, a consistent pattern becomes clear. GraphRAG performs better than the standard RAG system across both bridge and comparison questions, and this improvement is visible in all semantic metrics. Bridge questions remain more difficult for both systems because they require chaining several pieces of information, yet GraphRAG still produces higher cosine similarity and BERT F1 scores in this subset. The gains are even more noticeable in comparison questions, where the graph structure seems to support the reasoning steps needed to connect two entities more reliably.\u003c/p\u003e\n\u003cp\u003eThe question-type analysis also helps explain where this advantage comes from. GraphRAG handles \u0026ldquo;which\u0026rdquo;, \u0026ldquo;who\u0026rdquo;, and \u0026ldquo;yes/no\u0026rdquo; questions particularly well, likely because these forms correspond more directly to the triple patterns used in the graph. In contrast, \u0026ldquo;how\u0026rdquo; and \u0026ldquo;where\u0026rdquo; questions continue to be challenging, as they often require descriptive or spatial reasoning that is not explicitly encoded in triple form. Similar patterns were observed in both the bridge and comparison subsets.\u003c/p\u003e\n\u003cp\u003eA notable outcome of the study is the behavior on questions that RAG could not answer. GraphRAG was able to produce meaningful responses for most of these cases. 
For bridge questions, it produced semantically valid responses for 86% of the questions that RAG had answered with \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo;, and it showed similarly strong performance on comparison questions. This indicates that the graph-based context enables the model to recover information that the standard RAG pipeline fails to retrieve or interpret.\u003c/p\u003e\n\u003cp\u003eOverall, the results suggest that incorporating structured relational information provides clear benefits in multi-hop settings, especially in datasets with long contexts and distributed evidence. GraphRAG offers a more reliable retrieval signal and supports reasoning steps that RAG cannot perform effectively. This makes graph-augmented methods a promising direction for improving answer quality and reducing failure cases in complex QA tasks.\u003c/p\u003e"},{"header":"5. Conclusion and Future Work","content":"\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThis study comparatively evaluated the classical Retrieval-Augmented Generation (RAG) approach and the knowledge graph-supported GraphRAG architecture in multi-hop question answering tasks. Using the HotpotQA dataset, detailed analyses were carried out on both bridge and comparison type questions, and the questions that the RAG system could not answer were reprocessed with the GraphRAG architecture. The results showed that GraphRAG, which incorporates structured information, was more successful in producing meaningful and correct answers, especially for multi-step questions that require deeper and more complex reasoning.\u003c/p\u003e\u003cp\u003eThe GraphRAG architecture used in the study goes beyond a simple triple extraction pipeline and includes additional reasoning steps based on question analysis. The model first analyzes the question, identifies the expected answer type, the entities involved, and the required reasoning pattern, and then generates an answer using only the relevant triples. 
This design supports both semantic accuracy and more precise information matching.\u003c/p\u003e\u003cp\u003eIn future work, several directions will be explored to further improve the system. One priority is enhancing the triple extraction stage, as the quality of extracted triples directly affects downstream reasoning performance. The system will also be tested on datasets such as QASPER, which contain longer and more complex contexts, in order to assess its robustness and adaptability in settings where textual heterogeneity and noisy evidence make triple extraction more challenging.\u003c/p\u003e\u003cp\u003eBeyond these extensions, future studies will investigate whether integrating knowledge graphs directly into model training provides additional improvements over retrieval-based reasoning. Comparing a dynamic, fine-tuned KG-aware model with the static GraphRAG pipeline may help reveal how much of the performance gain comes from explicit structured retrieval versus internalized relational knowledge. In addition, evaluating the pipeline across a broader set of language models with different sizes and architectural properties will allow a more comprehensive understanding of how model characteristics interact with graph-based retrieval and whether certain model families benefit more from structured knowledge.\u003c/p\u003e\u003cp\u003eOverall, the findings of this study show that using structured retrieval and graph-supported reasoning can significantly improve performance in multi-hop question answering, especially in cases where standard RAG systems fail. These results highlight a promising direction for future QA systems aiming to achieve higher accuracy, more reliable reasoning, and reduced hallucination in complex information-seeking tasks.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eN.A. 
conceived and designed the study, implemented the RAG and GraphRAG pipelines, performed all experiments and analyses, and wrote the entire manuscript. M.O.\u0026Uuml;. supervised the research, provided conceptual guidance, and reviewed the manuscript. Both authors approved the final version.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe datasets used in this study are publicly available (HotpotQA). All evaluation outputs and triple-extraction files generated during the study are available from the corresponding author upon reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAksoy, N., G\u0026uuml;ven, Z. A., \u0026amp; \u0026Uuml;nalir, M. O. (2025). Understanding the Impact of Dataset Characteristics on RAG based Multi-hop QA Performance. https://doi.org/10.21203/rs.3.rs-6968562/v1\u003c/li\u003e\n\u003cli\u003eAksoy, N., \u0026Uuml;nalir, M. O., \u0026amp; G\u0026uuml;ven, Z. A. (n.d.). Architecting and Evaluating a RAG based Question Answering System for SQuAD Dataset. In Recent Developments in Engineering with Applied Mathematics and AI. World Scientific.\u003c/li\u003e\n\u003cli\u003eArefeen, M. A., Debnath, B., \u0026amp; Chakradhar, S. (2024). LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs. Natural Language Processing Journal, 7. https://doi.org/10.1016/j.nlp.2024.100065\u003c/li\u003e\n\u003cli\u003eBesbes, A. (2024, January 15). 3 Advanced Document Retrieval Techniques To Improve RAG Systems. https://towardsdatascience.com/3-advanced-document-retrieval-techniques-to-improve-rag-systems-0703a2375e1c\u003c/li\u003e\n\u003cli\u003eDasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., \u0026amp; Gardner, M. (2021). A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers.\u003c/li\u003e\n\u003cli\u003eEQT Ventures. (2024, December 19). 
Knowledge Graph(s) and LLM-based ontologies have a very good shot at unlocking GenAI in production. https://medium.com/eqtventures/knowledge-graph-s-and-llm-based-ontologies-have-a-very-good-shot-at-unlocking-genai-in-production-1b167533ef63\u003c/li\u003e\n\u003cli\u003eHan, H., Shomer, H., Wang, Y., Lei, Y., Guo, K., Hua, Z., Long, B., Liu, H., \u0026amp; Tang, J. (2025). RAG vs. GraphRAG: A Systematic Evaluation and Key Insights.\u003c/li\u003e\n\u003cli\u003eHan, H., Wang, Y., Shomer, H., Guo, K., Ding, J., Lei, Y., Halappanavar, M., Rossi, R. A., Mukherjee, S., Tang, X., He, Q., Hua, Z., Long, B., Zhao, T., Shah, N., Javari, A., Xia, Y., \u0026amp; Tang, J. (2024). Retrieval-Augmented Generation with Graphs (GraphRAG).\u003c/li\u003e\n\u003cli\u003eIzacard, G., \u0026amp; Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. https://doi.org/10.18653/v1/2021.eacl-main.74\u003c/li\u003e\n\u003cli\u003eJiang, Z., Ma, X., \u0026amp; Chen, W. (2024). LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs.\u003c/li\u003e\n\u003cli\u003eKau, A., He, X., Nambissan, A., Astudillo, A., Yin, H., \u0026amp; Aryani, A. (2024). Combining Knowledge Graphs and Large Language Models.\u003c/li\u003e\n\u003cli\u003eLee, J., Kwon, D., Jin, K., Jeong, J., Sim, M., \u0026amp; Kim, M. (2025). MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation.\u003c/li\u003e\n\u003cli\u003eLewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K\u0026uuml;ttler, H., Lewis, M., Yih, W. T., Rockt\u0026auml;schel, T., Riedel, S., \u0026amp; Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, \u0026amp; H. 
Lin (Eds.), Advances in Neural Information Processing Systems (Vols. 2020-December, pp. 9459\u0026ndash;9474). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf\u003c/li\u003e\n\u003cli\u003eLiu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., \u0026amp; Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. http://arxiv.org/abs/2307.03172\u003c/li\u003e\n\u003cli\u003eMavi, V., Jangra, A., \u0026amp; Jatowt, A. (2024). Multi-hop Question Answering. Foundations and Trends\u0026reg; in Information Retrieval, 17(5), 457\u0026ndash;586. https://doi.org/10.1561/1500000102\u003c/li\u003e\n\u003cli\u003ePan, S., Luo, L., Wang, Y., Chen, C., Wang, J., \u0026amp; Wu, X. (2024). Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/pdf/2306.08302\u003c/li\u003e\n\u003cli\u003eRosset, C., Xiong, C., Phan, M., Song, X., Bennett, P., \u0026amp; Tiwary, S. (2020). Knowledge-Aware Language Model Pretraining.\u003c/li\u003e\n\u003cli\u003eWang, L., Chen, H., Yang, N., Huang, X., Dou, Z., \u0026amp; Wei, F. (2025). Chain-of-Retrieval Augmented Generation.\u003c/li\u003e\n\u003cli\u003eYang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., \u0026amp; Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. https://doi.org/10.18653/v1/d18-1259\u003c/li\u003e\n\u003cli\u003eYao, L., Mao, C., \u0026amp; Luo, Y. (2019). KG-BERT: BERT for Knowledge Graph Completion.\u003c/li\u003e\n\u003cli\u003eZhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., \u0026amp; Liu, Q. (2019). ERNIE: Enhanced Language Representation with Informative Entities.\u003c/li\u003e\n\u003cli\u003eZhuang, Z., Zhang, Z., Cheng, S., Yang, F., Liu, J., Huang, S., Lin, Q., Rajmohan, S., Zhang, D., \u0026amp; Zhang, Q. 
(2024). EfficientRAG: Efficient Retriever for Multi-Hop Question Answering.\u003c/li\u003e\n\u003cli\u003ezilliz. (2024). GraphRAG Explained: Enhancing RAG with Knowledge Graphs. https://medium.com/@zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":false,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8283065/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8283065/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis study comparatively examines the classical RAG approach and the knowledge graph-based GraphRAG architecture in multi-hop question answering. While RAG uses external knowledge in an unstructured manner, GraphRAG aims to provide more controlled and meaningful answers by relying on structured knowledge triples, thereby reducing hallucinations. In the experiments, both architectures were tested on 500 questions selected from the HotpotQA dataset, and their performances were compared. In particular, the questions that the RAG system answered with \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo; were re-evaluated using GraphRAG. In the GraphRAG pipeline, knowledge triples were first extracted from the context, and then the same language model performed question analysis\u0026mdash;identifying the question type, the expected reasoning pattern, and selecting the most relevant triples. Answers were generated using only filtered and contextually appropriate structured information.\u003c/p\u003e\u003cp\u003eThe results show that incorporating structured knowledge provides a clear improvement in semantic answer quality. On average, both cosine similarity and BERT F1 scores increased by 20\u0026ndash;30% across the tested subsets. 
Moreover, GraphRAG successfully answered approximately 80% of the questions that the classical RAG system could not answer. These findings demonstrate that structured knowledge enables more reliable reasoning in multi-step QA and highlight the potential of the GraphRAG approach as a stronger alternative for complex question answering tasks.\u003c/p\u003e","manuscriptTitle":"Structured Knowledge for Multi-hop QA: A Comparative Study of GraphRAG and RAG","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-09 07:18:10","doi":"10.21203/rs.3.rs-8283065/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"6e5ecec5-a973-4d89-8761-88d3ba400c5d","owner":[],"postedDate":"December 9th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-12-13T22:53:27+00:00","versionOfRecord":[],"versionCreatedAt":"2025-12-09 07:18:10","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8283065","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8283065","identity":"rs-8283065","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}