A Vector Database Approach for Enhancing Data Warehouse Development Practices
Research Square
Research Article
Sherif R Eldemerdash, Osama E Emam, Manal A Abdelfattah, Wael Mohamed abass
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8235989/v1
This work is licensed under a CC BY 4.0 License
Status: Posted. Version 1 posted.
Abstract
The importance of data warehouses (DWH) in all companies cannot be denied. Online analytical processing (OLAP) is a crucial component of a data warehouse (DWH). Business data serves as the input for creating valuable information essential for a business's sustainability. It comes from a variety of sources and takes many different forms, from traditional structured data to unstructured data. A vector database is a specific type of database that stores data in multidimensional vectors, representing traits or attributes.
The process of transforming high-dimensional data, including unstructured text and images, into a representation with fewer dimensions is known as embedding in vector database usage. Vector embeddings are structured numerical representations generated from unstructured data, such as text and images, using modern techniques that preserve semantic notions of similarity and difference in the vector morphology. DWH systems are always designed with structured data in mind. However, as the volume of unstructured data grows exponentially, organizations need more complex methods to understand, represent, and analyze this material. In this case, vector databases offer a ground-breaking answer. To enable vector database technology in data warehouses and handle unstructured data, it is necessary to employ modern techniques, such as Retrieval Augmented Generation (RAG). The recently proposed approach offers a viable method for building an unstructured data warehouse.
Keywords: vector database, RAG, DWH, OLAP
1. INTRODUCTION
Business data originates from a variety of sources, ranging from unstructured data, such as emails and text documents, to more traditional structured data, including employee information stored in relational databases. These elements serve as inputs to generate the valuable information organizations need to make decisions and gain a competitive edge. Information that does not cleanly fit into traditional databases or have a pre-existing data model is referred to as unstructured data. While structured data is organized into rows and columns in relational databases, unstructured data lacks a predefined format and is typically stored in text files, images, PDFs, videos, emails, audio recordings, and social media posts. Unstructured data, due to its inherent complexity and variety, often poses a challenge for analysis using conventional techniques.
A treasure trove of information comes from a variety of sources. [ 1 , 2 ]. A data warehouse serves as a repository for valuable data and analytical tools generated from various business applications. The planned, designed, organized, and sporadic replication of data from many sources, both inside and outside the project, into a domain optimized for analytical and data processing is known as data warehousing. One aspect of data warehousing is promoting business form change. Organizations learn about significant domains where knowledge can help them make critical decisions about the main components of their business and enhance data-driven operational and strategic decision-making [ 3 ]. OLAP platforms offer visualization capabilities for business analysis. The "explain to me what happened and why" processing you undertake after building your data warehouse aligns with the business analysis style of data access. In contrast to querying and reporting tools, OLAP technology allows you to dig deeper, explore further, and determine the "why" behind what is happening in your company [ 3 , 4 ]. From traditional structured data, such as customer data stored in relational databases, to unstructured data, including emails and text documents, business data originates from a variety of sources and takes many different forms. These components are the inputs that produce the valuable data businesses need to make informed decisions and gain a competitive edge. Thus, converting the traditional data warehouse into an effective unstructured data warehouse is necessary [ 5 ]. A vector database is one type that naturally supports a variety of unstructured data types with effective indexing, retrieval, and storage. High-dimensional vectors can be efficiently stored, indexed, and queried using a vector database. Data is stored as vectors in a multidimensional space in vector databases, rather than in rows and columns in traditional databases. 
A mathematical array of numbers in each vector represents the qualities or attributes of data points [ 6 , 7 ].
Important semi-structured and unstructured data is often stored in complex file formats, such as emails and the notoriously difficult-to-work-with PDF format. Consider how many essential documents are in PDF format: transcripts of earnings calls, investor reports, news articles, research papers, and customer information are just a few examples. We need a method to cleanly and efficiently extract embedded information (such as text, tables, images, graphs, customer information, and other relevant data) from these PDF files. This vital data should then be entered into the data warehouse in an organized manner, enabling it to be retrieved quickly and efficiently.
Retrieval Augmented Generation (RAG) is a machine learning technique that blends the benefits of generative models and retrieval-based approaches. It is specifically used to improve the capabilities of large language models (LLMs) in Natural Language Processing (NLP). To produce more precise, contextually appropriate outputs, RAG retrieves pertinent documents or data snippets in response to queries. The hybrid technique first searches external data sources using a retriever model and then processes the retrieved data with a generator model to create responses. This approach helps close the gap between large data reserves and the need for accurate, pertinent language generation, which makes RAG especially well-suited to tasks requiring both creative language handling and well-informed responses. The initial step in RAG's operation is for a retrieval system to find relevant information from a data source (such as the internet or internal company records). Second, an LLM uses the gathered data and the initial prompt to produce a grounded response without retraining [ 8 , 9 ].
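The two RAG steps described above (retrieve relevant passages, then hand them to an LLM together with the original prompt) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a learned embedding model, `DOCS` is invented sample text, and the generation step is represented only by the prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): sparse bag-of-words counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 1: rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Step 2 (input side): combine retrieved passages with the original question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Hypothetical sample documents, invented for this sketch.
DOCS = [
    "The quarterly earnings call reported revenue growth of the retail unit.",
    "Passenger movement records list trip number and travel destination.",
    "The investor report discusses revenue and operating costs.",
]

if __name__ == "__main__":
    question = "revenue growth in earnings"
    passages = retrieve(question, DOCS, k=2)
    # The resulting prompt would be passed to an LLM; no model is called here.
    print(build_prompt(question, passages))
```

In a real deployment the toy `embed` function would be replaced by an embedding model and the linear scan in `retrieve` by a vector database index; the control flow stays the same.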
This research aims to demonstrate that traditional data warehouses can accommodate unstructured data by converting it to structured data in an intermediary step before it is fed into the data warehouse. It also provides a method for creating an unstructured data warehouse using RAG and a vector database. This technique allows organizations to continue using their existing data warehouses while adding a feature for handling unstructured data. The next section outlines the work of other researchers and its connection to vector databases and data warehouses. On closer examination, these articles focus primarily on Vector Database Management Systems, the relationship between vector databases and LLMs, and Big Raster and Vector Database Systems. Researchers have not yet made significant contributions to our research topic, which is primarily based on using vector databases to transform data warehouses so that they can handle various types of data.
2. RELATED WORK
A New Approach to Use Big Data Tools to Substitute an Unstructured Data Warehouse. This study proposes a novel approach to using Big Data tools to build an unstructured data warehouse. By demonstrating that both can coexist effectively, it seeks to close the gap between conventional (structured) data warehouses and the growing volume of unstructured text data. The system uses a three-tier architecture that combines IBM Big Insights Text Analytics, Pentaho Data Integration (Spoon), and a PostgreSQL Data Warehouse with Pentaho Mondrian OLAP [ 1 ].
A survey of Vector Database Management Systems examines the evolution of Vector Database Management Systems (VDBMSs) in detail, highlighting how these systems have advanced in response to the growing need to handle unstructured data, facilitate complex query processing with large language models, and support other data-intensive applications.
The paper explores new approaches to query processing, storage, indexing, and query optimization and execution, developed to address these challenges. Improving VDBMS capabilities is also discussed [ 10 ]. A Comprehensive Survey on Vector Database provides an extensive review of vector databases, highlighting their use, storage strategies, search tactics, difficulties, integration with large language models (LLMs), and significance in the big data and artificial intelligence era [ 11 ]. Vector Database Management Systems provides an accessible introduction to the fundamental concepts, use cases, and current issues in vector database management systems, providing an overview for scholars and practitioners looking to promote effective vector data management. This study examined several VDBMSs and their characteristics, as well as well-known vector data use cases, including chatbots and picture similarity searches. The final topics covered in this study are the high dimensionality and sparsity of vector data, as well as the relative uniqueness of VDBMS products and their ramifications [ 12 ]. An experimental study of Big Raster and Vector Database Systems investigates the challenges and performance of state-of-the-art big spatial data systems in concurrently processing raster and vector data. This publication raises awareness among the scientific community of issues with concurrent raster and vector data processing. It demonstrates the calculation and access patterns for raster and vector data in three real-world applications. Since they utilize the relational data model, vector-based systems require raster data to be vectorized before processing. On the other hand, most raster + vector-based systems do not require converting data from one format to another but instead compute an index or an intermediate data structure to facilitate query processing (e.g., vector-based systems such as Adaptive Cell Trie, Sedona, and Beast) [ 13 ]. 
Manu provides a comprehensive overview of its innovative approach to vector database management, emphasizing adaptability, performance, and ease of use for handling large-scale vector data across various applications. The document suggests that Manu represents a significant advancement in the field of vector databases, with potential for further development and application in multiple domains [ 14 ].
The NFT Vector Database presents a discussion on the development of a scalable, hardware-agnostic, cloud-based system for managing Non-Fungible Token (NFT) data utilizing vector representations. By enabling similarity tracking across NFTs, this approach aims to address NFT duplication. NFT Vector Database is crucial because it is possible to produce almost identical NFTs with very slight changes [ 15 ].
A Collaborative MULTI-AGENT Approach to Retrieval-Augmented Generation Across Diverse Data Sources. Specialized agents work together within a modular framework to handle different types of data in the Multi-Agent RAG system proposed in this paper. This architecture overcomes the major drawbacks of conventional single-agent RAG systems by including specialized agents made for different database types, a centralized query execution environment, and a generative agent for supplying replies. It guarantees scalability across many data sources, optimizes token usage, and improves query precision. The Multi-Agent RAG framework significantly enhances LLMs' retrieval and use of external data. It substantially improves the accuracy, scalability, and efficiency of generative AI systems across heterogeneous databases compared to single-agent RAG designs [ 16 ].
HYBGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases. Traditional RAG systems help LLMs by extracting information from text documents (RAG) or knowledge graphs (Graph RAG). However, many real-world questions, sometimes called hybrid questions, call for both textual and relational data.
HYBGRAG introduces a unified framework for effectively managing hybrid reasoning. HYBGRAG is a next-generation hybrid RAG framework that uses self-reflective multi-agent reasoning to merge text content with graph relations. By increasing accuracy, adaptability, and interpretability in intricate, semi-structured knowledge contexts, it establishes a new benchmark for Hybrid Question Answering (HQA) [ 17 ].
Evaluating Retrieval Quality in Retrieval-Augmented Generation. Standard end-to-end testing is sluggish, opaque, and memory-intensive, making it difficult to assess how well retrievers perform inside RAG systems. The study suggests eRAG as an alternative to global metrics or human labels. This innovative assessment method measures the actual contribution of each retrieved document to the downstream task (e.g., QA correctness or fact-checking). eRAG redefines retrieval model evaluation in RAG pipelines. It assesses genuine utility, that is, how much each retrieved document aids the LLM in providing the correct answer, rather than treating relevance as static or human-defined. It outperforms all previous evaluation techniques in terms of accuracy, interpretability, and efficiency [ 18 ].
Searching for Best Practices in Retrieval-Augmented Generation. The study examines how to balance accuracy, speed, and usefulness when designing the best Retrieval-Augmented Generation (RAG) systems. To determine the top-performing configurations, it assesses every element of the RAG pipeline. By showing how each design choice affects speed and quality throughout retrieval, reranking, and summarization, the paper offers a modular baseline for RAG system optimization. It also provides replicable tools on GitHub (Fudan DNN-NLP/RAG) and presents a multimodal extension (text2image, image2text) [ 19 ].
Developing Retrieval-Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report.
The paper serves as a helpful manual and experience report for developing RAG systems that use PDF documents as their primary data source. It shows how LLMs (GPT, LLaMa) and information retrieval can be combined to build systems that provide precise, open, and up-to-date answers for knowledge-intensive fields such as law, healthcare, and customer service. By connecting LLMs to current, reliable sources, RAG systems go beyond static data. By combining OpenAI and Llama methods, this book provides developers with a reproducible path from PDF extraction to complete system implementation. It is an essential part of the development of applied generative AI because of its focus on repeatability, transparency, and factual base. [ 20 ]. Retrieval-Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely. This study presents a systematic methodology for comprehending and enhancing data-augmented Large Language Models (LLMs)—models such as GPT or Llama that use RAG, fine-tuning, or context injection to exploit external knowledge. It addresses the central issue of enabling LLMs to use external data in an efficient, prudent, and open manner. The study emphasizes the construction of modular, adaptive pipelines that cleverly combine various techniques [ 21 ]. Retrieval-Augmented Generation for Large Language Models: A Survey. The paper provides a comprehensive survey of Retrieval-Augmented Generation (RAG). By collecting external data, this method improves Large Language Models (LLMs) to increase accuracy, reduce hallucinations, and dynamically update knowledge. It charts the development, technical paradigms, assessment techniques, and future paths of RAG. By bridging the gap between external data and LLMs' internal knowledge, RAG makes AI systems more precise, interpretable, and flexible. 
RAG is a fundamental technique for next-generation AI applications because of its progression from Naive to Advanced to Modular, which demonstrates a change from simple retrieval to intelligent, adaptable, and multimodal reasoning [ 22 ].
Multi-PDF RAG Chatbot Using LangChain and Streamlit. The article presents a web-based RAG system that uses Streamlit, OpenAI LLMs, and LangChain to enable interactive querying and summarization of numerous PDF documents. For real-time, understandable AI-driven document analysis, it seeks to transform static PDFs into a dynamic, searchable knowledge base. By combining LangChain, Streamlit, and RAG into an approachable, scalable architecture for next-generation intelligent document assistants, the project transforms static PDFs into interactive, explainable knowledge systems, bridging the gap between document handling and generative AI [ 9 ].
3. PROPOSED SYSTEM
This design helps transform data warehouses from their current handling of structured data to the processing of unstructured data (e.g., text and PDFs). A vector database and the RAG technique in this architecture enhance existing traditional data warehouses rather than replace them. Data is extracted from unstructured sources and stored in the data warehouse so that we can work with it. As shown in Fig. 1 , the Architecture Design components include the data source (structured and unstructured data), the ETL phase (RAG pipeline, transform, load), Vector Embedding tools, the Vector Data Warehouse, and the OLAP (user interface). As shown in Fig. 2 (RAG Pipeline), the RAG pipeline initially processes unstructured data by collecting domain-specific textual information from multiple sources, including PDFs, structured files, and plain-text documents. A custom data set is built on this carefully chosen collection, enabling the system to provide more precise data and more focused responses. After collection, the raw data is transformed to make it more transparent and usable.
The process involves standardizing the text (eliminating extraneous components such as special characters or formatting errors) and then dividing it into digestible portions, typically smaller tokens such as words or phrases. The structured output is subsequently integrated with other structured data sources.
Numerical representations of the relationships and meaning of words, sentences, and other data types are called vector embeddings, as shown in Fig. 3 . Vector embedding enables computers to quickly access data by converting an object's key attributes or features into a clear and structured array of numbers. Once the data points are converted to points in a multidimensional space, similar data points are clustered more closely together [ 7 ]. The terms "vectors" and "embeddings" can be used interchangeably when discussing vector embeddings. These numerical data formats represent each data point as a vector in a high-dimensional space. Vector embedding uses vectors to represent data points in a continuous space, whereas a vector is an array of numbers with a defined dimension [ 23 ].
As shown in Fig. 3 , the Architecture Design uses three types of vector embedding (word embedding, sentence embedding, and document embedding). Word embeddings are widely used to capture semantic relationships between words. Sentence embeddings are suited to sentiment analysis, text categorization, and information retrieval because they capture a sentence's meaning and context. Document embeddings capture the broad meaning and content of a document and are commonly employed in tasks such as recommendation systems, clustering, and document similarity.
As shown in Fig. 4 , the data is modeled with a multidimensional star schema to support multidimensional analytics reporting. The Fact Table contains numerical, measurable data, including foreign keys linking to dimension tables. Dimension Tables provide context to the facts (who, what, where, when, how).
They contain descriptive attributes used for filtering, grouping, and reporting. After the data is transformed and stored in the Vector Data Warehouse, we can use intelligent models to extract and report data from it. An online analytical processing (OLAP) server supports reporting and querying the data warehouse. As shown in Fig. 1 , the Architecture Design OLAP first uses vector embedding to convert queries into vectors, which are sent to the Vector Data Warehouse to extract the results.
4. RESULTS AND DISCUSSIONS
Testing will verify each phase's output against the acceptance criteria:
Data extraction: The RAG pipeline first handles unstructured textual input, and its structured output is then merged with other structured data sources.
ETL process: Text and words are extracted from files or records. They are transformed, cleansed, and loaded into the data warehouse.
Vector Embedding: All ETL output is transformed into numerical representations that capture the meaning of words and sentences.
Analytics Reporting: An OLAP report can be created from the data extracted from the data warehouse.
5. IMPLEMENTATION AND RESULT
Using the suggested architecture and database designs, an unstructured data warehouse was successfully developed and tested with approximately 1.58 million datasets. Word, sentence, and document embedding technologies can extract structured data from unstructured review content. In the multidimensional star schema design, the ETL process successfully retrieved, transformed, loaded, and refreshed the review fact and its dimension data from the files, including the extracted review sentiment, into the data warehouse.
Cleansing Rules and system learning
This section presents the issues that cleansing rules must address before vector-embedding tools are used to train the system and load valid data into the vector data warehouse. The issues found in the tables are:
Names contain the NO-BREAK SPACE character in place of a regular space.
Different presentations exist for some Arabic characters ‘ أ,ة , ً ,...’.
Names that contain ‘ أبو', 'عبد ' should not have a space after these tokens.
Some names have tokens with more than three repeated characters.
Issue date: the month and day contain different number formats, and no single business rule is applied.
The issue year data type is a number; no condition is applied to guarantee that the year format is displayed correctly.
The birthdate data type is a number, which does not guarantee that only valid date formats are present.
The Passport Issue Date column is 100% complete when entered as a number; however, there is no guarantee that the date format is displayed consistently across tables.
Some names contain digits (e.g., 1, 2, 3, 4, 5).
Some values contain special characters such as '%*@<'.
Some names do not have a space after some tokens.
Records for Arabic countries contain English personal names.
Records for English countries contain Arabic personal names.
The Data Profiling activity focused on Data Domain Analysis to identify key information about data within the project's scope. The primary focus was to profile as much existing data as possible to provide a comprehensive understanding and complete view of the data sources.
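The two profiling measures reported for each column in this section, uniqueness and null/empty percentage, can be computed along the lines of the sketch below. The exact definitions used in the study are not stated, so the ones here are assumptions: uniqueness is taken as distinct non-empty values divided by all non-empty values, and a value counts as empty when it is `None` or whitespace only. The column names follow the profiling table, but the sample rows are invented.

```python
def profile_column(values: list) -> dict:
    """Return uniqueness % and null/empty % for one column (assumed definitions)."""
    total = len(values)
    if total == 0:
        return {"uniqueness_pct": 0.0, "null_empty_pct": 0.0}
    # A value is treated as missing when it is None or blank after stripping.
    non_empty = [v for v in values if v is not None and str(v).strip() != ""]
    missing = total - len(non_empty)
    uniqueness = 100.0 * len(set(non_empty)) / len(non_empty) if non_empty else 0.0
    return {
        "uniqueness_pct": round(uniqueness, 2),
        "null_empty_pct": round(100.0 * missing / total, 2),
    }


def profile_table(rows: list[dict]) -> dict:
    """Profile every column that appears in a list of row dictionaries."""
    columns = sorted({name for row in rows for name in row})
    return {c: profile_column([row.get(c) for row in rows]) for c in columns}


if __name__ == "__main__":
    # Hypothetical sample rows, invented for this sketch.
    sample = [
        {"Passenger_ID": 1, "Passport_No": "A1", "Full_Name": "Sara Ali"},
        {"Passenger_ID": 2, "Passport_No": "A1", "Full_Name": ""},
        {"Passenger_ID": 3, "Passport_No": None, "Full_Name": "Omar Adel"},
    ]
    for column, stats in profile_table(sample).items():
        print(column, stats)
```

Other definitions of uniqueness (for example, distinct values over all rows including empty ones) would give different percentages, so the chosen definition should be stated alongside any reported figures.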
The table below gives a profiling report covering several profiling dimensions for each data source: constants, uniqueness, nullability, data types, storage, and most frequent values, as well as differences in distinctness, uniqueness, and nullability between defined, inferred, and chosen values.
TABLE 1. Data profiling
Column Name         Uniqueness %    Null & Empty %
Passenger_ID        00%             2%
Nationality         100%            20%
Nationality_ID      100%            00%
Movement_ID         100%            00%
Birth_Data          70%             15%
Passport_No         40%             20%
Full_Name           70%             15%
Last_Update_Date    00%             5%
Movement_Date       100%            5%
Movement_Type_ID    100%            00%
Movement_Type       100%            00%
Trip_No             100%            5%
Registration_Date   100%            5%
Travel_To           100%            5%
Based on Table 1, the researcher covered all the fields needed for the matching and vectorization process; the nationality field was used to differentiate between passengers from different countries. A further in-depth analysis of name structure was conducted in two steps: the first step extracted all available tokens within each name, and the second step obtained more detailed information down to the character level.
Now that we have refined the data, extracted other data from the files, transformed it, and stored it in the Vector Data Warehouse, we can compare the speed of inserting data using the traditional method (ETL -> Data Warehouse) and the new process (RAGTL -> Vector Embedding -> Vector Data Warehouse). As shown in Figure 5, the time required to store data inside the data warehouse using the new method is greater than that of the traditional process; however, we successfully stored unstructured data. An OLAP cube is built for specific analytics. It shows that this unstructured data warehouse can handle newly updated data from the unstructured data source. We can also compare the speed of retrieving analytic data using the traditional method (OLAP -> Data Warehouse) and the new process (OLAP -> Vector Embedding -> Vector Data Warehouse).
As shown in Figure 6, the time required to retrieve data from the data warehouse using the new method is greater than that of the traditional method. However, we successfully addressed unstructured data in the data warehouse.
6. CONCLUSION
The traditional data warehouse solution and the proposed solution are comparable in terms of their functionality. The proposed design of this study is further enhanced by using a Retrieval Augmented Generation (RAG) tool with hundreds of pre-built text annotators or extractors, which makes extracting relevant text considerably easier. After the data warehouse was modified to extract relevant data from datasets, the extractor, using Retrieval Augmented Generation (RAG), produced accurate findings throughout the experiment. Despite its flexibility and convenience, this system still stores data from unstructured sources using a relational database management system (RDBMS). Later, these sources may grow substantially, potentially affecting data warehouse performance. A forthcoming design should use a new ETL process to store unstructured data in a data warehouse and an OLAP to analyse it. This will enable the advanced design to be used in the right place, yielding better outcomes by leveraging more advanced tools. In the future, we can store conversational images in an unstructured data warehouse with a similar design. However, that requires more advanced transformation tools to convert the image data into structured data for storage.
Declarations
ACKNOWLEDGMENT
We acknowledge and are grateful to Helwan University's Scientific Research Centre for its assistance in completing this study. The reviewers' input and essential assistance are greatly appreciated.
Funding Declaration
We did not receive support from any organization for the submitted work. No funding was received to assist with the preparation of this manuscript. No funding was received for conducting this study. No funds, grants, or other support were received.
Data Availability declaration
Our data cannot be shared openly to protect the privacy of study participants. Unfortunately, the data supporting this study's findings are not openly available due to data sensitivity.
Competing Interest declaration
I declare that the authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Author contributions
Sherif R. Eldemerdash was born in Cairo, Egypt, in 1983. He received the B.S. degree in Software Engineering and Control Systems from the Faculty of Engineering, Mansoura University, El dakahlia, Egypt, in 2004, and the M.D. degree in Information Systems from the Faculty of Computer Science and AI, Helwan University, in 2021. He is currently pursuing a Ph.D. degree in Information Systems at Helwan University. He also serves as IT Manager. For further inquiries, he can be contacted at [email protected] , [email protected] .
OSAMA EMAM received a Ph.D. degree in Pure Mathematics (Operations Research) from Helwan University, Egypt. He has been the Dean of the Faculty of Computers & Artificial Intelligence, Helwan University, since 2018. He is currently a Professor at the Faculty of Computers and Artificial Intelligence, Helwan University, Faculty of Computers and Information Technology, Future University of Egypt, and Faculty of Computer Studies, Arab Open University in Cairo. Prof. Osama has supervised numerous master’s and Ph.D. theses. His research interests include big data analytics, data mining, and evaluation methodologies. He is also a reviewer for several information systems journals.
Manal A. Abdel-Fattah received her Ph.D. degree in Information Systems from the Faculty of Computers and Information, Cairo University. She has worked as a Business Development Consultant at the Management National Institute and as a Project Manager at the Ministry of State and Administrative Development.
She is currently a Professor at the Faculty of Computers and Artificial Intelligence, Helwan University. Prof. Manal has supervised numerous master's and Ph.D. theses. Her research interests include big data analytics, data mining, and evaluation methodologies. She is also a reviewer for several information systems journals.

Wael Mohamed is an Assistant Professor at the Faculty of Computers and Artificial Intelligence, Helwan University. He holds a B.S. degree in Software Engineering, an M.Sc. degree in Information Systems, and a Ph.D. in Information Systems. His research interests include big data analytics, data mining, and software engineering.

References
[1] a. C. N. T. Oras Baker, "A New Approach to Use Big Data Tools to Substitute Unstructured Data Warehouse," in IEEE Conference on Big Data and Analytics (ICBDA), 2020.
[2] B. Williams, "insight7," insight7, [Online]. Available: https://insight7.io/unstructured-data-examples-and-tools-for-generating-insights/. [Accessed 1 10 2025].
[3] T. C. H. a. A. R. Simon, Data Warehousing For Dummies, 2nd Edition, Wiley Publishing, Inc., 2009.
[4] R. N. S. L.-M. L. R. a. A. A.-C. Diana Martinez-Mosquera, "Integrating OLAP with NoSQL Databases in Big Data Environments: Systematic Mapping," MDPI, Big Data and Cognitive Computing, Vols. 8, 64, no. 5 June 2024, pp. 1-29, 2024.
[5] a. C. N. T. Oras Baker, "A New Approach to Use Big Data Tools to Substitute Unstructured Data Warehouse," in 2020 IEEE Conference on Big Data and Analytics (ICBDA), 2020.
[6] Y. S. H. Y. H. X. C. L. K. C., M. Z. Zhi Jing, "When Large Language Models Meet Vector Databases: A Survey," researchgate, no. 07 March 2024, 2024.
[7] Pere Martra, "Vector Databases and LLMs," in Large Language Models Projects: Apply and Implement Strategies for Large Language Models, Barcelona, Spain, Apress, 2024, pp. 31-62.
[8] Y. Y., Z. W. Z. H. L. K. Q. a. L. Q. Siyun Zhao, "Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely," arXiv, no. 23 Sep 2024, 2024.
[9] U. V. G. A. J. R. a. S. H. G. Murali, "Multi-PDF RAG Chatbot Using LangChain and Streamlit," GJEIIR Global Journal of Engineering Innovations and Interdisciplinary Research, vol. 5, no. 5, pp. 1-8, 2025.
[10] J. W. a. G. L. James Jie Pan, "Survey of Vector Database Management Systems," arXiv, 2023.
[11] C. L. a. P. W. Yikun Han, "A Comprehensive Survey on Vector Database: Storage and Retrieval Technique Challenge," arXiv, 2023.
[12] T. Taipalus, "Vector Database Management Systems: Fundamental Concepts, Use-Cases, and Current Challenges," arXiv, 2024.
[13] A. E. T. D. A. M. a. E. S. Samriddhi Singla, "Experimental Study of Big Raster and Vector Database Systems," in IEEE 37th International Conference on Data Engineering (ICDE), 2021.
[14] Q. C. X. L. W. X. a. o. Rentong Guo, "Manu: A Cloud Native Vector Database Management System," arXiv, 2022.
[15] N. P. A. S. A. H. a. S. C. Samrat Sahoo, "The Universal NFT Vector Database: A Scalable Vector Database for NFT Similarity Matching," arXiv, 2023.
[16] M. D. S. A. A. M. U., a. S. S. Aniruddha Salve, "A Collaborative MULTI-AGENT Approach to Retrieval-Augmented Generation Across Diverse Data Sources," arXiv, vol. 1, 2024.
[17] Q. Z. C. M. Z. H. S. A. V. N. I. H. R. a. C. F. Meng-Chieh Lee, "HYBGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases," arXiv, vol. 2, 2025.
[18] A. S. a. H. Zamani, "Evaluating Retrieval Quality in Retrieval-Augmented Generation," SIGIR, pp. 2395-2400, 2024.
[19] Z. W. X. G. F. Z. Y. W. Z. X. T. S. Z. W. S. L. Q. Q. R. Y. C. L. a. X. Z. Xiaohua Wang, "Searching for Best Practices in Retrieval-Augmented Generation," arXiv, vol. 1, 2024.
[20] M. T. H. K. K. K. J. R. a. P. A. Ayman Asad Khan, "Developing Retrieval-Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report," arXiv, vol. 1, 2024.
[21] Y. Y., Z. W. Z. H. L. K. Q. a. L. Q. Siyun Zhao, "Retrieval-Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs Use External Data More Wisely," arXiv, Microsoft Research Asia, vol. 1, 2024.
[22] Y. X. X. G. K. J. J. P. Y. B. Y. D. J. S. M. W. a. H. W. Yunfan Gao, "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv, vol. 1, 2024.
[23] K. Yasar, "What are vector embeddings?," TechTarget, 2025. [Online]. Available: https://www.techtarget.com/searchenterpriseai/definition/vector-embeddings. [Accessed 17 5 2025].

Additional Declarations
No competing interests reported.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8235989","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":554813666,"identity":"951ab9c2-4824-4190-be9b-4713836765f7","order_by":0,"name":"Sherif R Eldemerdash","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABLUlEQVRIie2RMUvDQBTHXwgkS7Drk6D5CicBHeKHeUcgWVLqJB2ERoR07Brpl+jkauEgXTI4VuwQKHTqEBEkk3hJdSgc2lHwfnDv4Lgf//fuADSavwjKRZey2LcVEMzhaHdu/qJEsjiCdYp1kAKtgiGT9QDFm94VdUUBnxyvjbqC1amF4aaCYcBTe/ykUtiqCHOimN9PQxMJNr6F0QWDMuapU14pFUx84I3oz14GhZxF8MylczQywVNMSNlYPngDItF/fBZWq4wyN35H40Mq3lapwDIxO2WGZqeQ5SYyJW1T7LmysWXkSyUe5WU7CxNnmbe9RipiP3MS9Yvl4dpoKPB7Y2G8NkPh9ZzFA9Y3wcnEXlRKZ8f3L7CvvR3CAoepb+8p+9g/pWg0Gs3/4RNW2GFyo+XOsgAAAABJRU5ErkJggg==","orcid":"","institution":"Helwan University","correspondingAuthor":true,"prefix":"","firstName":"Sherif","middleName":"R","lastName":"Eldemerdash","suffix":""},{"id":554813667,"identity":"85b24d86-cab7-446d-9b36-8206ae4bfd08","order_by":1,"name":"Osama E Emam","email":"","orcid":"","institution":"Helwan University","correspondingAuthor":false,"prefix":"","firstName":"Osama","middleName":"E","lastName":"Emam","suffix":""},{"id":554813668,"identity":"33720041-ec03-43eb-b31d-25cd318291a7","order_by":2,"name":"Manal A Abdelfattah","email":"","orcid":"","institution":"Helwan University","correspondingAuthor":false,"prefix":"","firstName":"Manal","middleName":"A","lastName":"Abdelfattah","suffix":""},{"id":554813669,"identity":"71caa148-0508-466d-b540-b53e2acbc268","order_by":3,"name":"Wael Mohamed 
abass","email":"","orcid":"","institution":"Helwan University","correspondingAuthor":false,"prefix":"","firstName":"Wael","middleName":"Mohamed","lastName":"abass","suffix":""}],"badges":[],"createdAt":"2025-11-29 09:38:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8235989/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8235989/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":98426554,"identity":"8c57522a-6750-4fc8-aa2d-e62b02ea6cf0","added_by":"auto","created_at":"2025-12-17 16:36:41","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1021893,"visible":true,"origin":"","legend":"","description":"","filename":"AVectorDatabaseApproachforEnhancingDataWarehouseDevelopmentPractices.docx","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/44cb71c9d22d5abf6d80050c.docx"},{"id":98427185,"identity":"c23aa242-4ec0-4108-ab42-412f0433e035","added_by":"auto","created_at":"2025-12-17 16:39:53","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":7875,"visible":true,"origin":"","legend":"","description":"","filename":"6ff3a7fdeb754a6f884fd9f72db355c0.json","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/6eac8462917b9a715498a424.json"},{"id":98427094,"identity":"f72e1ddb-81e1-4fe5-a929-c6f9d42c2278","added_by":"auto","created_at":"2025-12-17 16:39:32","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":68414,"visible":true,"origin":"","legend":"","description":"","filename":"6ff3a7fdeb754a6f884fd9f72db355c01enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/263efae71f047099faf47305.xml"},{"id":98426080,"identity":"0323d740-bedc-4955-a0e5-a7b255ac6b01","added_by":"auto","created_at":"2025-12-17 
16:35:35","extension":"eps","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":404,"visible":true,"origin":"","legend":"","description":"","filename":"drawingimage5.eps","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/1bed53de0405240bed9dc253.eps"},{"id":98426580,"identity":"0f509270-10a0-44a7-a6b4-f996f3f6e4f8","added_by":"auto","created_at":"2025-12-17 16:36:59","extension":"jpeg","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":242441,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/51dbcd6441284dce44ee6df1.jpeg"},{"id":98022687,"identity":"0f5f3835-95a2-43a9-a5b5-40cfaa33d7b3","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":51445,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/702aa5a3d4db813d8a35f38a.png"},{"id":98426896,"identity":"6b5e2720-c237-42fc-8303-2cef872d9a00","added_by":"auto","created_at":"2025-12-17 16:38:58","extension":"jpeg","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":248630,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/c312cdb134de8c653a0d25c5.jpeg"},{"id":98022689,"identity":"e6fe0513-8411-434c-b2cf-8b09a50cff28","added_by":"auto","created_at":"2025-12-12 
01:12:11","extension":"jpeg","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":183885,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/5b5843e102e3432e77c54f0a.jpeg"},{"id":98425875,"identity":"1db76d25-a1ad-4341-9c3f-7d2c1d97182f","added_by":"auto","created_at":"2025-12-17 16:35:19","extension":"jpeg","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":51303,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/800849731e7453a01b9d4c92.jpeg"},{"id":98426218,"identity":"e1d7fa37-eb36-4eb1-ae9f-7694baac8696","added_by":"auto","created_at":"2025-12-17 16:35:52","extension":"png","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":97224,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/86d3a0cb1a73d2f7b9b8954c.png"},{"id":98426104,"identity":"7ae03efd-d4c2-4615-87ac-14e0defe9d23","added_by":"auto","created_at":"2025-12-17 16:35:43","extension":"png","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":92066,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/703103afc846ed45c98a2c3d.png"},{"id":98426644,"identity":"6bccd819-df93-4c30-966c-7fb8734ecfa7","added_by":"auto","created_at":"2025-12-17 
16:38:01","extension":"jpeg","order_by":16,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":71866,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage7.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/bf50f504eeb4341993c9553b.jpeg"},{"id":98424934,"identity":"a60b193e-e8e2-4e88-bf76-b2e406888f5e","added_by":"auto","created_at":"2025-12-17 16:34:04","extension":"jpeg","order_by":17,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":18494,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage8.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/92f4ccc5cb5ccea2d9500c88.jpeg"},{"id":98426539,"identity":"800b01a1-4440-40bd-bf57-4c5c888fe931","added_by":"auto","created_at":"2025-12-17 16:36:37","extension":"png","order_by":18,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":73786,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/70ad20fcafcac769d6e28278.png"},{"id":98426491,"identity":"a2bbeae2-e5e2-46bd-bc23-8fdda3607419","added_by":"auto","created_at":"2025-12-17 16:36:30","extension":"png","order_by":19,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":128492,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/41f70a2df6838646601cbd4f.png"},{"id":98426885,"identity":"0d4d6847-7ba7-484d-8ab4-0df401833c86","added_by":"auto","created_at":"2025-12-17 
16:38:56","extension":"png","order_by":20,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":9067,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/2a49345e908773f5cb94d531.png"},{"id":98022709,"identity":"d4fd00fe-ed8f-4c97-a31c-09108348ea9f","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":21,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":150274,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/b4eb0ec0dcef4d573690a810.png"},{"id":98022706,"identity":"1a495ad5-e908-4f71-aab9-89cbb0778df7","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":22,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":90104,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/e7d967aa42865b65df2f4f30.png"},{"id":98022710,"identity":"bba69f24-e7d3-4fd2-a333-9ae21f912a52","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":23,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":29197,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/e7ba24011ef3d59a931237b1.png"},{"id":98425622,"identity":"43bd202f-5b55-46a3-940e-6c8f9e49f48f","added_by":"auto","created_at":"2025-12-17 
16:34:59","extension":"png","order_by":24,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":38109,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/d0779f2b6b504dd2b705cb6d.png"},{"id":98022700,"identity":"4a23871f-8209-4e3f-8572-4b8d4539f9da","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":25,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":33669,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/e110560ac3c94fef31223dba.png"},{"id":98426672,"identity":"05c12ba2-e535-4ef1-a128-d5f0fcd4623e","added_by":"auto","created_at":"2025-12-17 16:38:10","extension":"png","order_by":26,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":109820,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/6ba7ad1362150c6c28fd90e4.png"},{"id":98427227,"identity":"437a5581-0c33-49d9-a577-984a5333c55e","added_by":"auto","created_at":"2025-12-17 16:40:00","extension":"png","order_by":27,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":18448,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/d941407e9dcb39793eb6f39a.png"},{"id":98022707,"identity":"1b1e90e9-542c-479e-820a-6748177b5e68","added_by":"auto","created_at":"2025-12-12 
01:12:11","extension":"png","order_by":28,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":9742,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/e4af3d76c809952f39b24b61.png"},{"id":98022708,"identity":"a709a01e-700d-4b59-a724-20c0e47be613","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"xml","order_by":29,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":65402,"visible":true,"origin":"","legend":"","description":"","filename":"6ff3a7fdeb754a6f884fd9f72db355c01structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/5336e1ebc492bcd4ff54ae4f.xml"},{"id":98022712,"identity":"b3f00358-9bee-477f-9706-f3fc1aa0b86a","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"html","order_by":30,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":75729,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/5f872f1d017bfa237db1d5ca.html"},{"id":98022681,"identity":"c789214b-5c05-493e-8438-24e5b34c0631","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":314672,"visible":true,"origin":"","legend":"\u003cp\u003eThis is a figure. Architecture Proposed System.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/4323593017317479546bf693.png"},{"id":98427317,"identity":"ceaff47d-ae60-4938-87ba-4b0ca5fecff4","added_by":"auto","created_at":"2025-12-17 16:40:05","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":461006,"visible":true,"origin":"","legend":"\u003cp\u003eThis is a figure. 
RAG Pipeline.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/22d646a8513b25c7022b7d51.png"},{"id":98425911,"identity":"9d8625e5-e189-4946-a085-3d97e8758b3b","added_by":"auto","created_at":"2025-12-17 16:35:21","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":303957,"visible":true,"origin":"","legend":"\u003cp\u003eThis is a figure. Vector Embedding\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/1d0295dc7cec3529a462df8d.png"},{"id":98022682,"identity":"ef152021-6a82-435c-b81a-a5316e93394b","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":119687,"visible":true,"origin":"","legend":"\u003cp\u003eThis is a figure. Star Schema\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/36d93a041709255f46d0c503.png"},{"id":98426090,"identity":"b452167c-a43d-4e1f-b4b8-a45e8cd82767","added_by":"auto","created_at":"2025-12-17 16:35:37","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":26324,"visible":true,"origin":"","legend":"\u003cp\u003eThis is a figure. Compare Tim's insert between the Old ETL and the New Approach.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/653ff4db1583860c67378fa2.png"},{"id":98022685,"identity":"6443c681-4447-4bc6-a2db-b205c5538b7d","added_by":"auto","created_at":"2025-12-12 01:12:11","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":24184,"visible":true,"origin":"","legend":"\u003cp\u003eThis is a figure. 
Compare Tim's retrieval between Old OLAP and the New Approach.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/8595ecd9a5eea76d7025a5a2.png"},{"id":105687162,"identity":"9e6cfe81-5158-4a1b-996e-e351ed31e0da","added_by":"auto","created_at":"2026-03-29 23:39:04","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1767762,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8235989/v1/f2c2608f-fedd-4917-ae5c-b6b69c8d5503.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A Vector Database Approach for Enhancing Data Warehouse Development Practices","fulltext":[{"header":"1. INTRODUCTION","content":"\u003cp\u003eBusiness data originates from a variety of sources, ranging from unstructured data, such as emails and text documents, to more traditional structured data, including employee information stored in relational databases. These elements serve as inputs to generate the valuable information organizations need to make decisions and gain a competitive edge. Information that does not cleanly fit into traditional databases or have a pre-existing data model is referred to as unstructured data. While structured data is organizing into rows and columns in relational databases, unstructured data lacks a predefined format and is typically storing in text files, images, PDFs, videos, emails, audio recordings, and social media posts. Unstructured data, due to its inherent complexity and variety, often poses a challenge for analysis using conventional techniques. A treasure trove of information comes from a variety of sources. 
[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eA data warehouse serves as a repository for valuable data and analytical tools generated from various business applications. The planned, designed, organized, and sporadic replication of data from many sources, both inside and outside the project, into a domain optimized for analytical and data processing is known as data warehousing. One aspect of data warehousing is promoting business form change. Organizations learn about significant domains where knowledge can help them make critical decisions about the main components of their business and enhance data-driven operational and strategic decision-making [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eOLAP platforms offer visualization capabilities for business analysis. The \"explain to me what happened and why\" processing you undertake after building your data warehouse aligns with the business analysis style of data access. In contrast to querying and reporting tools, OLAP technology allows you to dig deeper, explore further, and determine the \"why\" behind what is happening in your company [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eFrom traditional structured data, such as customer data stored in relational databases, to unstructured data, including emails and text documents, business data originates from a variety of sources and takes many different forms. These components are the inputs that produce the valuable data businesses need to make informed decisions and gain a competitive edge. 
Thus, converting the traditional data warehouse into an effective unstructured data warehouse is necessary [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eA vector database is one type that naturally supports a variety of unstructured data types with effective indexing, retrieval, and storage. High-dimensional vectors can be efficiently stored, indexed, and queried using a vector database. Data is stored as vectors in a multidimensional space in vector databases, rather than in rows and columns in traditional databases. A mathematical array of numbers in each vector represents the qualities or attributes of data points. [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eImportant semi-structured and unstructured data can often be stored in complex file formats, such as emails and the notoriously difficult-to-work-with PDF format. Consider how essential documents are usually in PDF format. Transcripts of earnings calls, investor reports, news articles, research papers, and customer information are just a few examples. We need a method to cleanly and efficiently extract embedded information \u0026mdash; such as text, tables, images, graphs, customer information, and other relevant data \u0026mdash; from these PDF files. Then, this vital data should be entered into the data warehouse in an organized manner, enabling it to be retrieved quickly and efficiently.\u003c/p\u003e\u003cp\u003eRetrieval Augmented Generation (RAG) is a machine learning technique that blends the benefits of generative models and retrieval-based approaches. It is specifically utilizing to improve the capabilities of large language models (LLMs) in Natural Language Processing (NLP). 
To produce more precise, contextually appropriate outputs, RAG retrieves pertinent documents or data snippets in response to queries. The hybrid technique processes retrieved data to create responses using a generator model after sorting through external data sources using a retriever model. RAG is especially well-suited for tasks requiring both creative language handling and well-informed reactions because of this approach, which helps close the gap between large data reserves and the requirement for accurate, pertinent linguistic creation. The initial step in RAG's operation is for a retrieval system to find relevant information from a data source (such as the internet or internal company records). Second, an LLM uses the gathered data and the initial prompt to produce a grounded response without retraining [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThis research aims to demonstrate that traditional data warehouses can accommodate unstructured data by converting it to structured data in an intermediary step before being fed into the data warehouse. It also provides a method for creating an unstructured data warehouse using RAG and a vector database. This technique allows organizations to continue using their existing data warehouses while adding a feature for handling unstructured data.\u003c/p\u003e\u003cp\u003eIn the next section, we will outline the work of other researchers and their connection to vector databases and data warehouses. 
However, upon further research, it becomes clear that all these articles have focused primarily on the Vector Database Management System, the relationship between the Vector Database and LLMs, and Big Raster and Vector Database Systems.\u003c/p\u003e\u003cp\u003eWhen we read the next section, we find that researchers have not made significant contributions to our research topic, which is primarily basing on vector databases transforming data warehouses to handle various types of data.\u003c/p\u003e"},{"header":"2. RELATED WORK","content":"\u003cp\u003eA New Approach to Use Big Data Tools to Substitute an Unstructured Data Warehouse. This study proposes a novel approach to using Big Data tools to build an unstructured data warehouse. By demonstrating that both can coexist effectively, it seeks to close the gap between conventional (structured) data warehouses and the growing volume of unstructured text data. The system uses a three-tier architecture that combines IBM Big Insights Text Analytics, Pentaho Data Integration (Spoon), and a PostgreSQL Data Warehouse with Pentaho Mondrian OLAP [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eA survey of Vector Database Management Systems examines the evolution of Vector Database Management Systems (VDBMSs) in detail, highlighting how these systems have advanced in response to the growing need to handle unstructured data, facilitate complex query processing with large language models, and support other data-intensive applications. The paper explores new approaches to query processing, storage, indexing, and query optimization and execution, developed to address these challenges. 
Improving VDBMS capabilities is also discussed [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eA Comprehensive Survey on Vector Database provides an extensive review of vector databases, highlighting their use, storage strategies, search tactics, difficulties, integration with large language models (LLMs), and significance in the big data and artificial intelligence era [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eVector Database Management Systems provides an accessible introduction to the fundamental concepts, use cases, and current issues in vector database management systems, providing an overview for scholars and practitioners looking to promote effective vector data management. This study examined several VDBMSs and their characteristics, as well as well-known vector data use cases, including chatbots and picture similarity searches. The final topics covered in this study are the high dimensionality and sparsity of vector data, as well as the relative uniqueness of VDBMS products and their ramifications [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eAn experimental study of Big Raster and Vector Database Systems investigates the challenges and performance of state-of-the-art big spatial data systems in concurrently processing raster and vector data. This publication raises awareness among the scientific community of issues with concurrent raster and vector data processing. It demonstrates the calculation and access patterns for raster and vector data in three real-world applications. Since they utilize the relational data model, vector-based systems require raster data to be vectorized before processing. 
On the other hand, most raster\u0026thinsp;+\u0026thinsp;vector-based systems do not require converting data from one format to another but instead compute an index or an intermediate data structure to facilitate query processing (e.g., vector-based systems such as Adaptive Cell Trie, Sedona, and Beast) [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eManu provides a comprehensive overview of its innovative approach to vector database management, emphasizing adaptability, performance, and ease of use for handling large-scale vector data across various applications. The document suggests that Manu represents a significant advancement in the field of vector databases, with potential for further development and application in multiple domains [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe NFT Vector Database presents a discussion on the development of a scalable, hardware-agnostic, cloud-based system for managing Non-Fungible Token (NFT) data utilizing vector representations. By enabling similarity tracking across NFTs, this approach aims to address NFT duplication. NFT Vector Database is crucial because it is possible to produce almost identical NFTs with very slight changes [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eA Collaborative MULTI-AGENT Approach to Retrieval-Augmented Generation Across Diverse Data Sources. Specialized agents work together within a modular framework to handle different types of data in the Multi-Agent RAG system proposed in this paper. This architecture overcomes the major drawbacks of conventional single-agent RAG systems by including specialized agents made for different database types, a centralized query execution environment, and a generative agent for supplying replies. 
It guarantees scalability across many data sources, optimizes token usage, and improves query precision.. The Multi-Agent RAG framework significantly enhances LLMs' retrieval and use of external data. It substantially improves the accuracy, scalability, and efficiency of generative AI systems across heterogeneous databases compared to single-agent RAG designs [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eHYBGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases. Traditional RAG systems help LLMs by extracting information from text documents (RAG) or knowledge graphs (Graph RAG). However, many real-world questions, sometimes called hybrid questions, call for both textual and relational data. HYBGRAG introduces a unified framework for effectively managing hybrid reasoning. HYBGRAG is a next-generation hybrid RAG framework that uses self-reflective multi-agent reasoning to merge text content with graph relations. By increasing accuracy, adaptability, and interpretability in intricate, semi-structured knowledge contexts, it establishes a new benchmark for Hybrid Question Answering (HQA) [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eEvaluating Retrieval Quality in Retrieval-Augmented Generation. Standard end-to-end testing is sluggish, opaque, and memory-intensive, making it difficult to assess how well retrievers perform inside RAG systems. The study suggests eRAG as an alternative to global metrics or human labels. This innovative assessment method measures the actual contribution of each retrieved document to the downstream task. (e.g., QA correctness or fact-checking). eRAG redefines retrieval model evaluation in RAG pipelines. It assesses genuine utility \u0026mdash; how much each article aids the LLM in providing the correct answer \u0026mdash; rather than treating relevance as static or human-defined. 
It is reported to outperform previous evaluation techniques in terms of accuracy, interpretability, and efficiency [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eSearching for Best Practices in Retrieval-Augmented Generation. The study examines how to balance accuracy, speed, and usefulness when designing Retrieval-Augmented Generation (RAG) systems. To determine the top-performing configurations, it assesses every element of the RAG pipeline. By showing how each design choice affects speed and quality throughout retrieval, reranking, and summarization, the paper offers a modular baseline for RAG system optimization. It also provides replicable tools on GitHub (Fudan DNN-NLP/RAG) and presents a multimodal extension (text2image, image2text) [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eDeveloping Retrieval-Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report. The paper serves as a practical guide and experience report for developing RAG systems that use PDF documents as their primary data source. It shows how LLMs (GPT, LLaMa) and information retrieval can be combined to build systems that provide precise, transparent, and up-to-date answers for knowledge-intensive fields such as law, healthcare, and customer service. By connecting LLMs to current, reliable sources, RAG systems go beyond static data. By combining OpenAI and Llama methods, this report provides developers with a reproducible path from PDF extraction to complete system implementation. Its focus on repeatability, transparency, and factual grounding makes it a useful contribution to applied generative AI [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eRetrieval-Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely. 
This study presents a systematic methodology for comprehending and enhancing data-augmented Large Language Models (LLMs)\u0026mdash;models such as GPT or Llama that use RAG, fine-tuning, or context injection to exploit external knowledge. It addresses the central issue of enabling LLMs to use external data in an efficient, prudent, and open manner. The study emphasizes the construction of modular, adaptive pipelines that cleverly combine various techniques [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eRetrieval-Augmented Generation for Large Language Models: A Survey. The paper provides a comprehensive survey of Retrieval-Augmented Generation (RAG). By collecting external data, this method improves Large Language Models (LLMs) to increase accuracy, reduce hallucinations, and dynamically update knowledge. It charts the development, technical paradigms, assessment techniques, and future paths of RAG. By bridging the gap between external data and LLMs' internal knowledge, RAG makes AI systems more precise, interpretable, and flexible. RAG is a fundamental technique for next-generation AI applications because of its progression from Naive to Advanced to Modular, which demonstrates a change from simple retrieval to intelligent, adaptable, and multimodal reasoning [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eMulti-PDF RAG Chatbot Using LangChain and Streamlit. The article presents a web-based RAG system that uses Streamlit, OpenAI LLMs, and LangChain to enable interactive querying and summarization of numerous PDF documents. For real-time, understandable AI-driven document analysis, it seeks to transform static PDFs into a dynamic, searchable knowledge base. 
By combining LangChain, Streamlit, and RAG in an approachable, scalable architecture for intelligent document assistants, the project bridges the gap between document handling and generative AI [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e].\u003c/p\u003e"},{"header":"3. PROPOSED SYSTEM","content":"\u003cp\u003eThis design helps transform data warehouses from handling only structured data to also processing unstructured data (e.g., text and PDFs). In this architecture, a vector database and the RAG technique enhance existing traditional data warehouses rather than replace them. Information is extracted from unstructured sources and stored in the data warehouse so that it can be queried and analyzed.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the Architecture Design components include the data sources (structured and unstructured data), the ETL phase (RAG pipeline, transformation, loading), the vector embedding tools, the Vector Data Warehouse, and the OLAP user interface.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e (RAG Pipeline), the RAG pipeline first processes unstructured data by collecting domain-specific textual information from multiple sources, including PDFs, structured files, and plain-text documents. A custom data set is built from this carefully chosen collection, enabling the system to provide more precise data and more focused responses.\u003c/p\u003e\u003cp\u003eAfter collection, the raw data is transformed to make it clearer and more usable. 
The process involves standardizing the text (eliminating extraneous components such as special characters or formatting errors) and then dividing it into digestible portions, typically smaller tokens such as words or phrases. The structured output is subsequently integrated with other structured data sources.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eVector embeddings are numerical representations of the relationships and meaning of words, sentences, and other data types, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Vector embedding enables computers to access data quickly by converting an object's key attributes or features into a structured array of numbers. Once the data points are converted to points in a multidimensional space, similar data points are clustered more closely together [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe terms \"vectors\" and \"embeddings\" can be used interchangeably when discussing vector embeddings. These numerical data formats represent each data point as a vector in a high-dimensional space. 
Vector embedding uses vectors to represent data points in a continuous space, whereas a vector is an array of numbers with a defined dimension [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, the Architecture Design uses three types of vector embedding (word embedding, sentence embedding, and document embedding):\u003c/p\u003e\u003cp\u003eWord embeddings are widely used to capture semantic relationships between words.\u003c/p\u003e\u003cp\u003eSentence embeddings are suitable for sentiment analysis, text categorization, and information retrieval because they capture a sentence's meaning and context.\u003c/p\u003e\u003cp\u003eDocument embeddings capture the broad meaning and content of a document and are commonly employed in tasks such as recommendation systems, clustering, and document similarity.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, the data is designed with a multidimensional star schema to support multidimensional analytics reporting. The fact table contains numerical, measurable data and foreign keys linking to dimension tables. Dimension tables provide context to the facts (who, what, where, when, how) and contain descriptive attributes used for filtering, grouping, and reporting. After the data is transformed and stored in the Vector Data Warehouse, we can use intelligent models to extract and report data from it.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAn online analytical processing (OLAP) server supports reporting and querying the data warehouse. 
As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the Architecture Design OLAP first uses vector embedding to convert queries into vectors, which are sent to the Vector Data Warehouse to retrieve the results.\u003c/p\u003e"},{"header":"4. RESULTS AND DISCUSSIONS","content":"\u003cp\u003eTesting verifies each phase\u0026apos;s output against the acceptance criteria:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData extraction:\u003c/strong\u003e The RAG pipeline first handles unstructured textual input, and its structured output is then merged with other structured data sources.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTL process:\u003c/strong\u003e Text and words are extracted from files or records. They are then transformed, cleansed, and loaded into the data warehouse.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eVector Embedding:\u003c/strong\u003e All ETL output is transformed into numerical representations that capture the meaning of words and sentences.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAnalytics Reporting:\u003c/strong\u003e An OLAP report can be created from the data extracted from the data warehouse.\u003c/p\u003e"},{"header":"5. IMPLEMENTATION AND RESULT","content":"\u003cp\u003eUsing the suggested architecture and database designs, an unstructured data warehouse was successfully developed and tested on approximately 1.58 million datasets. Word, sentence, and document embedding technologies can extract structured data from unstructured review content. 
In the multidimensional star schema design, the ETL process successfully retrieved, transformed, loaded, and refreshed the review fact and its dimension data from the files, including the extracted review sentiment, into the data warehouse.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCleansing Rules and system learning\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis section presents the rules to check before using vector-embedding tools to train the system and load valid data into the vector data warehouse. Each rule targets a data issue found in the tables:\u003c/p\u003e\n\u003col\u003e\n \u003cli\u003eNames contain the NO-BREAK SPACE character instead of a regular space in the tables.\u003c/li\u003e\n \u003cli\u003eSome Arabic characters \u0026lsquo;\u003cspan dir=\"RTL\"\u003eأ,ة\u003c/span\u003e, \u003cspan dir=\"RTL\"\u003eً\u003c/span\u003e,...\u0026rsquo; have different presentations in the tables.\u003c/li\u003e\n \u003cli\u003eNames that contain \u0026lsquo;\u003cspan dir=\"RTL\"\u003eأبو\u0026apos;, \u0026apos;عبد\u003c/span\u003e \u0026apos; should not have a space after these tokens in the tables.\u003c/li\u003e\n \u003cli\u003eSome names contain tokens with more than three repeated characters in the tables.\u003c/li\u003e\n \u003cli\u003eThe issue month and day use different number formats, and no single business rule is applied in the tables.\u003c/li\u003e\n \u003cli\u003eThe issue year is stored as a number; no condition is applied to guarantee a consistent year format in the tables.\u003c/li\u003e\n \u003cli\u003eThe birthdate is stored as a number, which does not guarantee that only valid date formats are present in the tables.\u003c/li\u003e\n \u003cli\u003eThe Passport Issue Date column is 100% complete when entered as a number; however, there is no guarantee that the date format is consistent across tables.\u003c/li\u003e\n \u003cli\u003eSome names contain digits (e.g., \u0026apos;1,2,3,4,5, etc.\u0026apos;) in the tables.\u003c/li\u003e\n \u003cli\u003eA set of special characters 
such as \u0026apos;%*@\u0026lt;\u0026apos; appears in the tables.\u003c/li\u003e\n \u003cli\u003eSome names lack a space after certain tokens in the tables.\u003c/li\u003e\n \u003cli\u003eArabic countries appear with English personal names in the tables.\u003c/li\u003e\n \u003cli\u003eEnglish countries appear with Arabic personal names in the tables.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThe Data Profiling activity focused on Data Domain Analysis to identify key information about data within the project\u0026apos;s scope. The primary focus was to profile as much existing data as possible to provide a comprehensive understanding and complete view of the data sources.\u003c/p\u003e\n\u003cp\u003eThe profiling report covers several dimensions: constants, uniqueness, nullability, data types, storage, and most frequent values for each data source, as well as differences in distinctness, uniqueness, and nullability between defined, inferred, and chosen values. Table 1 lists the uniqueness and null/empty percentages for each column:\u003c/p\u003e\n\u003cp\u003eTable 1. Data profiling\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eColumn Name\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eUniqueness %\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eNull \u0026amp; Empty %\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003ePassenger_ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e00%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n 
\u003cp\u003e2%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eNationality\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e20%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eNationality_ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e00%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eMovement _ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eBirth_Data\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e70%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e15%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003ePassport_No\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e40%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e20%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n 
\u003cp\u003eFull_Name\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e70%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e15%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eLast_Update_Date\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e00%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e5%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eMovement _Date\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e5%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eMovement _Type_ID\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e00%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eMovement _Type\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e00%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eTrip_No\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e5%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eRegistration_Date\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e5%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003eTravel_To\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e100%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 224px;\"\u003e\n \u003cp\u003e5%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eAs Table 1 shows, the researcher covered all the fields needed for the matching and vectorization process; the nationality field was used to differentiate between passengers from different countries.\u003c/p\u003e\n\u003cp\u003eA deeper analysis of name structure was conducted in two steps: the first step extracted all available tokens within each name, and the second step obtained more detailed information down to the character level.\u003c/p\u003e\n\u003cp\u003eNow that we have refined the data, extracted other data from the files, transformed it, and stored it in the Vector Data Warehouse, we can compare the speed of inserting data using the traditional method (ETL -\u0026gt; Data Warehouse) and the new process (RAGTL -\u0026gt; Vector Embedding -\u0026gt; Vector Data Warehouse). 
As shown in Figure 5, storing data in the data warehouse takes longer with the new method than with the traditional process; however, the new method successfully stores unstructured data.\u003c/p\u003e\n\u003cp\u003eAn OLAP cube was built for specific analytics. It shows that this unstructured data warehouse can handle newly updated data from the unstructured data source. We can compare the speed of retrieving analytic data using the traditional method (OLAP -\u0026gt; Data Warehouse) and the new process (OLAP -\u0026gt; Vector Embedding -\u0026gt; Vector Data Warehouse). As shown in Figure 6, retrieving data from the data warehouse also takes longer with the new method than with the traditional method. However, the new method successfully handles unstructured data in the data warehouse.\u003c/p\u003e"},{"header":"6. CONCLUSION","content":"\u003cp\u003eThe traditional data warehouse solution and the proposed solution are comparable in terms of functionality. The proposed design is further enhanced by a Retrieval Augmented Generation (RAG) tool with hundreds of pre-built text annotators or extractors, which makes extracting relevant text considerably easier.\u003c/p\u003e\u003cp\u003eAfter the data warehouse was modified to extract relevant data from datasets, the RAG-based extractor produced accurate findings throughout the experiment. Despite its flexibility and convenience, this system still stores data from unstructured sources using a relational database management system (RDBMS). These sources may later grow substantially, potentially affecting data warehouse performance.\u003c/p\u003e\u003cp\u003eA forthcoming design should use a new ETL process to store unstructured data in a data warehouse and an OLAP to analyse it. 
This will enable the advanced design to be used in the right place, yielding better outcomes by leveraging more advanced tools.\u003c/p\u003e\u003cp\u003eIn the future, we can store conversational images in an unstructured data warehouse with a similar design. However, that requires more advanced transformation tools to convert the image data into structured data for storage.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eACKNOWLEDGMENT\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe acknowledge and are grateful to Helwan University\u0026apos;s Scientific Research Centre for its assistance in completing this study. The reviewers\u0026apos; input and essential assistance are greatly appreciated.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding Declaration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe did not receive support from any organization for the submitted work. No funding was received to assist with the preparation of this manuscript. No funding was received for conducting this study. No funds, grants, or other support were received.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability declaration \u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOur data cannot be shared openly to protect the privacy of study participants. Unfortunately, the data supporting this study\u0026apos;s findings are not openly available due to data sensitivity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interest declaration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eI declare that the authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSherif R. Eldemerdash was born in Cairo, Egypt, in 1983. He received the B.S. 
degree in Software Engineering and Control Systems from the Faculty of Engineering, Mansoura University, El dakahlia, Egypt, in 2004, and the M.D. degree in Information Systems from the Faculty of Computer Science and AI, Helwan University, in 2021. He is currently pursuing a Ph.D. degree in Information Systems at Helwan University. He also serves as IT Manager. For further inquiries, He can be contacted at
[email protected],
[email protected].\u003c/p\u003e\n\u003cp\u003eOsama Emam received a Ph.D. degree in Pure Mathematics (Operations Research) from Helwan University, Egypt. He has been the Dean of the Faculty of Computers \u0026amp; Artificial Intelligence, Helwan University, since 2018. He is currently a Professor at the Faculty of Computers and Artificial Intelligence, Helwan University, Faculty of Computers and Information Technology, Future University of Egypt, and Faculty of Computer Studies, Arab Open University in Cairo. Prof. Osama has supervised numerous master\u0026rsquo;s and Ph.D. theses. His research interests include big data analytics, data mining, and evaluation methodologies. He is also a reviewer for several information systems journals.\u003c/p\u003e\n\u003cp\u003eManal A. Abdel-Fattah received her Ph.D. degree in Information Systems from the Faculty of Computers and Information, Cairo University. She has worked as a Business Development Consultant at the Management National Institute and as a Project Manager at the Ministry of State and Administrative Development. She is currently a Professor at the Faculty of Computers and Artificial Intelligence, Helwan University. Prof. Manal has supervised numerous master\u0026rsquo;s and Ph.D. theses. Her research interests include big data analytics, data mining, and evaluation methodologies. She is also a reviewer for several information systems journals.\u003c/p\u003e\n\u003cp\u003eWael Mohamed is an Assistant Professor at the Faculty of Computers and Artificial Intelligence, Helwan University. He holds a B.S. degree in Software Engineering, an M.Sc. degree in Information Systems, and a Ph.D. in Information Systems. His research interests include big data analytics, data mining, and software engineering.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003ea. C. N. T. 
Oras Baker, \u0026quot;A New Approach to Use Big Data Tools to Substitute Unstructured Data Warehouse,\u0026quot; in IEEE Conference on Big Data and Analytics (ICBDA), 2020. \u003c/li\u003e\n\u003cli\u003eB. Williams, \u0026quot;insight7,\u0026quot; insight7, [Online]. Available: https://insight7.io/unstructured-data-examples-and-tools-for-generating-insights/. [Accessed 1 10 2025].\u003c/li\u003e\n\u003cli\u003eT. C. H. a. A. R. Simon, Data Warehousing For Dummies, 2nd Edition, Wiley Publishing, Inc., 2009. \u003c/li\u003e\n\u003cli\u003eR. N. S. L.-M. L. R. a. A. A.-C. Diana Martinez-Mosquera, \u0026quot;Integrating OLAP with NoSQL Databases in Big Data Environments: Systematic Mapping,\u0026quot; MDPI, Big Data and Cognitive Computing, Vols. 8, 64, no. 5 June 2024, pp. 1-29, 2024. \u003c/li\u003e\n\u003cli\u003ea. C. N. T. Oras Baker, \u0026quot;A New Approach to Use Big Data Tools to Substitute Unstructured Data Warehouse,\u0026quot; in 2020 IEEE Conference on Big Data and Analytics (ICBDA), 2020. \u003c/li\u003e\n\u003cli\u003eY. S. H. Y. H. X. C. L. K. C. ,. M. Z. Zhi Jing, \u0026quot;When Large Language Models Meet Vector Databases: A Survey,\u0026quot; researchgate, no. 07 March 2024, 2024. \u003c/li\u003e\n\u003cli\u003ePere Martra, \u0026quot;Vector Databases and LLMs,\u0026quot; in Large Language Models Projects: Apply and Implement Strategies for Large Language Models, Barcelona, Spain, Apress, 2024, pp. 31-62.\u003c/li\u003e\n\u003cli\u003eY. Y. ,. Z. W. Z. H. L. K. Q. a. L. Q. Siyun Zhao, \u0026quot;Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely,\u0026quot; arXiv, no. 23 Sep 2024, 2024. \u003c/li\u003e\n\u003cli\u003eU. V. G. A. J. R. a. S. H. G Murali1, \u0026quot;Multi-PDF RAG Chatbot Using LangChain and Streamlit,\u0026quot; GJEIIR Global Journal of Engineering Innovations and Interdisciplinary Research, vol. 5, no. 5, pp. 1-8, 2025. \u003c/li\u003e\n\u003cli\u003eJ. W. 
a. G. L. James Jie Pan, \u0026quot;Survey of Vector Database Management Systems,\u0026quot; arXiv, 2023. \u003c/li\u003e\n\u003cli\u003eC. L. a. P. W. Yikun Han, \u0026quot;A Comprehensive Survey on Vector Database: Storage and Retrieval Technique Challenge,\u0026quot; arXiv, 2023. \u003c/li\u003e\n\u003cli\u003eT. Taipalus, \u0026quot;Vector Database Management Systems: Fundamental Concepts, Use-Cases, and Current Challenges,\u0026quot; arXiv, 2024. \u003c/li\u003e\n\u003cli\u003eA. E. T. D. A. M. a. E. S. Samriddhi Singla, \u0026quot;Experimental Study of Big Raster and Vector Database Systems,\u0026quot; in IEEE 37th International Conference on Data Engineering (ICDE), 2021. \u003c/li\u003e\n\u003cli\u003eQ. C. X. L. W. X. a. o. Rentong Guo, \u0026quot;Manu: A Cloud Native Vector Database Management System,\u0026quot; arXiv, 2022. \u003c/li\u003e\n\u003cli\u003eN. P. A. S. A. H. a. S. C. Samrat Sahoo, \u0026quot;The Universal NFT Vector Database: A Scalable Vector Database for NFT Similarity Matching,\u0026quot; arXiv, 2023. \u003c/li\u003e\n\u003cli\u003eM. D. S. A. A. M. U. ,. a. S. S. Aniruddha Salve, \u0026quot;A Collaborative MULTI-AGENT Approach to Retrieval-Augmented Generation Across Diverse Data Sources,\u0026quot; arXiv, vol. 1, 2024. \u003c/li\u003e\n\u003cli\u003eQ. Z. C. M. Z. H. S. A. V. N. I. H. R. a. C. F. Meng-Chieh Lee, \u0026quot;HYBGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases,\u0026quot; arXiv, vol. 2, 2025. \u003c/li\u003e\n\u003cli\u003eA. S. a. H. Zamani, \u0026quot;Evaluating Retrieval Quality in Retrieval-Augmented Generation,\u0026quot; SIGIR , pp. 2395-2400, 2024. \u003c/li\u003e\n\u003cli\u003eZ. W. X. G. F. Z. Y. W. Z. X. T. S. Z. W. S. L. Q. Q. R. Y. C. L. a. X. Z. Xiaohua Wang, \u0026quot;Searching for Best Practices in Retrieval-Augmented Generation,\u0026quot; arXiv, vol. 1, 2024. \u003c/li\u003e\n\u003cli\u003eM. T. H. K. K. K. J. R. a. P. A. 
Ayman Asad Khan, "Developing Retrieval-Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report," arXiv, vol. 1, 2024.
S. Zhao, Y. Yang, Z. Wang, Z. He, L. K. Qiu, and L. Qiu, "Retrieval-Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs Use External Data More Wisely," arXiv, Microsoft Research Asia, vol. 1, 2024.
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv, vol. 1, 2024.
K. Yasar, "What are vector embeddings?," TechTarget, 2025. [Online]. Available: https://www.techtarget.com/searchenterpriseai/definition/vector-embeddings. [Accessed: 17 May 2025].

Version 1 posted December 12th, 2025.