A Support Vector Machine-Based Cache Replacement Method for Efficient MapReduce Execution | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article A Support Vector Machine-Based Cache Replacement Method for Efficient MapReduce Execution Rana Ghazali, Ali Rezaee, Sahar Adabi, Douglas G. Down, Ali Movaghar This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-7801155/v1 This work is licensed under a CC BY 4.0 License Status: Under Revision Version 1 posted 9 You are reading this latest preprint version Abstract Modern applications can generate a large amount of data from different sources with high velocity, a combination that is difficult to store and process via traditional tools. Hadoop is one framework that is used for the parallel processing of a large amount of data in a distributed environment, however, various challenges can lead to poor performance. Two particular issues that can limit performance are the high access time for I/O operations and the recomputation of intermediate data. The combination of these two issues can result in resource wastage. In recent years, there have been attempts to overcome these problems by using caching mechanisms. Due to cache space limitations, it is crucial to use this space efficiently and avoid cache pollution (the cache contains data that is not used in the future). We propose HDFS-based SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that combines the well-known Least-Recently-Used (LRU) mechanism with a machine learning algorithm, Support Vector Machine (SVM), to classify cached data into two groups based on their future usage. Experimental results show a significant decrease in execution time as a result of an increased cache hit ratio, leading to a positive impact on Hadoop performance. Caching mechanism Cache replacement algorithm SVM-LRU Hadoop performance Figures Figure 1 Figure 2 Figure 3 Figure 4 Figure 5 Figure 6 Figure 7 1 Introduction Hadoop [ 1 ] is an open-source framework designed to facilitate the processing of large datasets within a distributed computing environment. It consists of two primary components: (1) the Hadoop Distributed File System (HDFS) [ 2 ], which provides storage for substantial volumes of data, and (2) MapReduce [ 3 ], a parallel programming model that processes large amounts of data across a cluster of machines in a distributed setting. While Hadoop offers several advantages over traditional Relational Database Management Systems (RDBMS), such as enhanced flexibility, high throughput, cost efficiency, and concurrent processing capabilities, it also encounters notable challenges. Specifically, issues like high data access latency and prolonged data transmission can negatively impact performance. The reliance on hard disk drive (HDD) systems for HDFS storage exacerbates this problem, as the high access times associated with HDDs can significantly affect overall execution time [ 4 ]. In recent years, many researchers have proposed the use of caching mechanisms to address these challenges [ 4 ], [ 5 ], [ 6 ], [ 7 ], [ 8 ], [ 9 ]. The caching mechanisms involve storing frequently accessed data (hot data) in temporary storage and closer to the requesting entity (such as a worker node, or user), which reduces both waiting time and the resources required to fetch the data from the HDFS. Frequently accessed or high-priority data that demands fast retrieval is known as hot data; by contrast, cold data is rarely accessed and does not need to be retrieved immediately. In this paper, we categorize data blocks into either hot data or cold data based on their likelihood of being reused in the near future. The concept of cache states is crucial to understanding cache performance. A cold cache refers to an initial state where the cache contains little or no data, leading to a higher incidence of cache misses as requested data must be fetched from HDFS, resulting in slower access times. Conversely, a warm cache has been populated with a substantial amount of hot data over time, increasing the likelihood of cache hits and reducing data access latency. The transition from a cold to a warm cache state can significantly enhance job execution, as demonstrated in our experimental analysis. We conducted experiments on various applications combining I/O-bound and CPU-bound jobs under three scenarios: (1) using the original Hadoop without a caching mechanism, (2) with a cold cache, and (3) with a warm cache. As shown in Fig. 1 , the normalized execution time decreases notably in the warm cache state across all tested applications. In this case, the cache contains more hot data that is frequently demanded and has a positive effect on the cache hit ratio resulting in improving job execution time. Cache warming refers to the process of preloading the cache with hot data to expedite the transition from a cold to a warm state. This process involves identifying hot data based on historical access patterns and making strategic decisions about which data to cache. More importantly, an effective cache replacement strategy is required when the cache reaches capacity, to ensure optimal data retention. In this paper, we propose an HDFS-based SVM-LRU (H-SVM-LRU) approach that integrates an intelligent cache replacement algorithm, SVM-LRU, to optimize cache utilization and minimize cache pollution. The machine learning component of our approach classifies cached data into hot and cold categories, predicting whether the data is likely to be reused in future computations. This classification allows the system to decide which data should be retained in the cache and which should be replaced, thus maximizing the efficiency of the limited cache space and improving the cache-hit ratio. As a result, the proposed method is well-suited for iterative programs, reducing data access time from disk and avoiding unnecessary recomputation by retrieving more intermediate data from the cache. Our contributions in this paper are: We provide an overview of intelligent cache replacement methods used in web proxy caches. Different cache replacement strategies are investigated in the Hadoop environment, and we discuss their advantages and disadvantages. We introduce the intelligent caching mechanism, H-SVM-LRU, in the Hadoop environment, which uses a Support Vector Machine (SVM) for classifying data into two groups: hot data and cold data based on the likelihood of future reuse. We evaluate H-SVM-LRU’s performance (hit ratio) and compare it with LRU. We carry out experiments to investigate the impact of H-SVM-LRU on Hadoop’s execution time performance. The rest of the paper is organized as follows: We discuss existing cache replacement strategies for Hadoop and intelligent caching methods for web proxy caches in Section 2 . Next, Section 3 defines the problem and our solution approach. We then describe the H-SVM-LRU framework and present our H-SVM-LRU algorithm for the Hadoop environment in Section 4 . Next, we explain the details of the H-SVM-LRU implementation in Section 5. We evaluate the performance of H-SVM-LRU via different experiments in Section 6 . Finally, Section 7 contains conclusions and suggestions for future work. 2 Related work In this section, we first provide an overview of cache replacement algorithms that have been proposed for improving Hadoop performance. We then discuss different intelligent caching mechanisms that use machine learning methods. 2.1 Cache replacement policies in Hadoop In this section, we investigate different cache replacement strategies in Hadoop, including their advantages and disadvantages. LIFE and LFU-F are two replacement strategies used in PacMan [ 9 ] for an in-memory coordinated caching mechanism with data-intensive parallel jobs. In PacMan , parallel jobs run numerous tasks concurrently in a wave, with the all-or-nothing property. The LIFE algorithm evicts data blocks of files with the largest wave-width and results in reducing the average job completion time. LFU-F aims to maximize cluster efficiency. For this purpose, it evicts data blocks with less frequent access. Both strategies prioritize incomplete files over completed files for eviction and use a window-based aging mechanism to avoid cache pollution. They first check whether there are data blocks that have not been accessed within a given time window. Among these files, the one with the least number of accesses is chosen for eviction. Enhanced Data-Aware Cache (EDACHE) [ 10 ] was introduced for caching intermediate results to accelerate MapReduce job execution times. In this strategy, WSClock is used as a cache replacement algorithm in which cached items are maintained in a circular list, and a clock hand advances around this ring. This algorithm replaces cached items based on their reference bit and the last time used. It first checks the reference bit. If its value is one, it means this item is used. The item's reference bit is then reset, its last time used is updated, and the clock hand is advanced. Otherwise, an item with an age greater than a threshold value is evicted. The bottleneck of this mechanism is related to the fact that large blocks lead to long search times for requested content. In collaborative caching , a Modified ARC replacement algorithm [ 11 ] was proposed in order to increase the cache hit ratio and improve efficiency. In this strategy, the cache is divided into recent cache, recent history, frequent cache, and frequent history sections such that the cache sections contain data blocks and the history sections include references to evicted items. Initially, on a request for a block, the references in the history caches are checked. If present, the corresponding block is placed in the recent or frequent cache; otherwise, the cache references and serves the request from either of the history caches, which helps in faster caching as well as locating files for initial checks. If references are found in recent history, then the block is placed in the recent cache. If the block is found in the recent cache, then it is moved to the frequent cache; hence, a hit in either of the history caches removes the reference and places the corresponding block in one of the caches (recent or frequent). Caching a block also involves caching metadata. When either of the caches is fully utilized, then a block is evicted from the recent or frequent cache, but its reference is placed into its corresponding history. When either of the history caches is fully utilized, the references simply drop out of the cache. An adaptive cache algorithm [ 12 ] was designed to cache the partition of tables into the HDFS cache for Big SQL. Selective LRU-K (SLRU-K) and Exponential-Decay (EXD) are used as online caching algorithms and for selective cache insertion to reduce the overhead of inserting items into the HDFS cache. SLRU-K takes into account the variable size of the partition and uses a weighted heuristic to selectively place the partitions into the cache. It keeps a list of the K most recent access times for each partition. However, EXD maintains only the time of the most recent access to compute the score for each partition to determine a weight between access frequency versus recently used. In Adaptive SLRU-K and EXD , the adaptor adjusts its behavior with access patterns of various workloads by automatically adjusting the value of its parameters. Maximizing the byte-hit ratio and minimizing the byte-insertion ratio are the primary aims of the adaptor. The block goodness aware cache replacement strategy [ 13 ] uses two metrics for cache management: cache affinity (CA), which depends on resources used by the application, and block goodness (BG), which measures how much a cached data block is worth. For each cached item, this strategy first calculates the BG value based on the data block access count and MapReduce application cache affinity then selects a data block with the lowest BG value for eviction. A data block with the oldest access time will be discarded if there is more than one data block with the same lowest BG value. The Cache Affinity Aware Cache Replacement Algorithm [ 14 ] categorizes MapReduce applications based on their cache affinity. This algorithm prioritizes caching input data of applications with high cache affinity. It takes into account the cache affinity of a MapReduce application and data access frequency to calculate the benefit of caching for input data. As a result, it evicts a data block with the lowest caching benefit. If there are some data blocks with identical lowest benefits, it evicts a block based on the LRU policy. AutoCache [ 15 ] employs a lightweight gradient-boosted tree (XGBoost) to predict file access patterns on the HDFS cache. In this mechanism, the probability of accessing a file is measured by a probability score, which is used as a metric by the cache replacement policy to avoid cache pollution. When the free space of the cache is less than 10%, the eviction operation is started, and it continues until the cache capacity becomes lower than 85%. This cache replacement algorithm has low overhead by limiting computation to a fixed number of files. In [ 16 ], Newberry et al. evaluate four advanced cache replacement policies: Two Queue (2Q), Adaptive Replacement Cache (ARC), Multi-Queue (MQ), and Low Inter-reference Recency Set (LIRS) that are designed to handle larger reuse distances and reduce cache thrashing, particularly in big data environments. 2Q employs numerous queues to distinguish between frequently and infrequently accessed data, effectively filtering single-use blocks, while ARC enhances this by dynamically adjusting queue sizes to better capture temporal locality. MQ extends 2Q by using multiple queues to categorize data based on access frequency, offering greater precision but performing similarly to 2Q when access patterns are uniform. LIRS excels by prioritizing data based on reuse distance, efficiently managing temporal locality without polluting the cache. Their evaluation revealed that LIRS outperforms other strategies for smaller cache sizes, while ARC is superior for larger caches. Combining LIRS and ARC in a layered architecture further improved performance, highlighting their complementary strengths in big data caching. In Table 1 , we compare these cache replacement strategies in terms of their criteria for eviction and mitigating cache pollution and summarize their advantages and disadvantages. Table 1 Hadoop cache replacement comparisons Replacement strategy Metrics for eviction Cache pollution Advantages Disadvantages LIFE Largest wave width and incomplete file Window age strategy Reduces average completion time Effective for short jobs LFU-F Frequency access and incomplete file Window age strategy Maximizes cluster efficiency Effective for short jobs WSClock Last time used No Decreases execution time Long search times for large data blocks Modified ARC Recency and frequency of access No Increases cache hit ratio Needs space for storing history Adaptive cache The score of each partition No Adapts to various workload characteristics Significant overhead Block goodness aware Block goodness value and access time No Effective for multiple concurrent applications Needs to calculate block goodness Cache Affinity aware Cache affinity of application and recency No Effective use of cache space Needs to know the cache affinity of applications AutoCache Probability of accessing the file Calculating probability score Reduces average completion time and improves cluster efficiency File-oriented cache LIRS Reuse distance Reuse distance Performs best for smaller cache sizes Overhead of a large number of network requests 2.2 Intelligent caching mechanisms In a related application domain, several intelligent caching strategies have been presented that use different machine learning techniques to enhance the performance of web proxy caches. Ali et al. proposed SVM–LRU, SVM–GDSF , and C4.5–GDS [ 17 ] that combine a support vector machine (SVM) and a decision tree (C4.5) with Least-Recently-Used (LRU), Greedy-Dual-Size (GDS), and Greedy-Dual-Size Frequency (GDSF) replacement strategies. In these techniques, web objects are classified into two groups: revisited later or not. These methods use a web proxy log file as a training dataset and different features of web objects, such as recency, frequency, size, access latency, and type of object, are considered for classification. Experimental results show that SVM-LRU appears to have the best hit ratio. Employing a Bayesian network, BN-GDS , and BN-LRU [ 18 ] were introduced to improve the performance of cache replacement strategies like Greedy-Dual-Size (GDS) and Least-Recently-Used (LRU). In these strategies, the probability of web objects belonging to the revisited class (and so should be cached) is calculated based on features such as retrieval time, frequency, size, and type. Experimental results suggest that BN-GDS achieves the best hit ratio, while BN-LRU has the best byte-hit ratio. Hybrid ELM-LFU [ 19 ] is a two-level caching scheme for web proxies. In the first level, LFU is used for fast caching replacement (due to its low complexity) and thus is suitable for real-time communication. An Extreme Learning Machine (ELM) is used in the second level, applying a single hidden-layer feed-forward network, where there is no need to adjust the weights. In this mechanism, the chosen web object for eviction in the first level will be placed in the second-level cache. This method features low training times. In [ 20 ], Herotodos et al. designed a framework for moving data automatically through tiered storage in the distributed file system via a set of pluggable policies. For this purpose, they employ incremental learning to find which data should be downgraded or upgraded, allowing for adaptation to workload changes over time. For downgrading, this method uses different caching replacement strategies like LRU, LFU, Least Recently & Frequently Used (LRFU), LIFE, LFU-F, Exponential Decay (EXD), and XGBoost-based Modeling (XGB). Also, On Single Access (OSA), LRFU, EXD, and XGB are used for the upgrade policy. Experimental results show that XGB appears to be preferable because it requires minimal storage, has low training overhead, makes useful predictions, and can learn incrementally over time. PACS-oriented SVM-LRU [ 21 ] was proposed for picture archiving and communication systems. This algorithm calculates the probability of future access to cached items. In this strategy, SVM-LRU has brought some benefits like low training time, low computation, high prediction accuracy, and high hit ratio. In [ 22 ], Al-Qtiemat et al. introduced an intelligent web proxy caching algorithm that leverages K-means clustering to improve cache performance by categorizing objects into prioritized groups for efficient replacement. The algorithm operates in two phases: an offline phase, where proxy server logs are analyzed to group objects hierarchically based on individual features, and an online phase, where these priorities guide the replacement of low-priority objects to optimize cache content. The proposed multi-level clustering (MC) approach simplifies priority assignment by considering one feature at a time, enhancing decision-making efficiency. By factoring in object size and prioritization, the MC algorithm achieves significant improvements in hit rate and byte hit rate. This demonstrates its superior performance for diverse cache sizes and its effectiveness in managing large, low-priority objects. In [ 23 ], a smart caching system for Content Delivery Networks (CDNs) was proposed that uses a Long-Short-Term Memory (LSTM) model to predict the future popularity of content and optimize cache replacement decisions. The methodology consists of four phases: admission control, popularity prediction, filtering probabilities, and applying a caching policy to the top-k list. By caching the most popular content, the system aims to enhance the cache hit ratio, reduce latency, and improve user experience. Experimental results on a synthetic dataset showed that the LSTM-based model outperformed conventional algorithms like FIFO, LRU, and LFU, as well as other machine learning-based approaches, including DeepCache, in terms of total hits across different cache sizes. Even though using a caching mechanism has yielded some benefits in the Hadoop environment, some challenges remain. For instance, cache management imposes a heavy load on the NameNode, both in terms of required storage and computational load, potentially degrading performance. Moreover, existing cache replacement policies in Hadoop do not take into account cache pollution and effective use of cache space, and they do not apply intelligent caching mechanisms. In this paper, we design a cache replacement mechanism as an approach for overcoming these problems, using SVM to classify data, resulting in improved performance. We choose SVM because its generalization ability can be maximized when training data are scarce, and it can control the misclassification error. 3 Problem definition Reducing job execution time is an effective means for improving Hadoop performance. The execution time is composed of two components: I/O operation time and processing time. Since input data are stored in an HDD-based system, HDFS, access time for I/O operations can be high, adversely affecting execution time. One approach to tackle this problem is to use a caching mechanism to store hot data in the cache. A major challenge in employing caching mechanisms is accurately identifying hot data based on historical access patterns. Conventional caching algorithms often struggle to adapt to dynamic changes in data access patterns, resulting in inefficient caching decisions. To optimize caching, an effective approach must differentiate between hot and cold data, ensuring that the cache primarily stores data that is frequently requested, thus accelerating data access. As the cache is populated with hot data, it reaches a warm state, characterized by an increased likelihood of cache hits. However, sustaining this warm state necessitates robust cache replacement strategies that prevent the eviction of hot data in favor of less frequently accessed cold data. The goal is to maintain a high cache hit ratio by minimizing cache pollution, which occurs when cold data occupies cache space, reducing the availability of space for hot data. Cache pollution can be quantified using the ratio of cold data blocks in the cache to the total cache size, reflecting the extent of cache space occupied by infrequently accessed data. Thus, an intelligent cache replacement policy is essential to mitigate cache pollution, prioritize the retention of hot data, and facilitate the transition of the cache from a cold to a warm state. This approach ensures efficient data retrieval, ultimately improving the performance of Hadoop. 4 H-SVM-LRU cache replacement A Support Vector Machine (SVM) [ 24 ] is a supervised machine learning technique that is used for binary classification, dividing data into two classes: positive and negative. In this section, we provide H-SVM-LRU implementation details and explain our replacement algorithm that combines LRU with an SVM to classify data into two classes: hot data or cold data based on the likelihood of future reuse. This algorithm aims to avoid cache pollution and to effectively use cache space by cache warming, leading to decreased execution times. In this section, we present a framework for intelligent cache replacement based on machine learning and explain the H-SVM-LRU algorithm with an example to illustrate its operation. 4.1 The proposed H-SVM-LRU framework In this section, we propose a framework for an intelligent LRU approach using an SVM and in-memory cache [ 25 ], [ 26 ], [ 27 ] for Hadoop which consists of two functional components: the classification component is responsible for training the data classification by using an SVM classifier and the Hadoop in-memory cache component uses this trained classifier to manage the cache space. Figure 2 gives the system structure and its components which we now explain in detail. The classification component consists of Hadoop job history, training dataset preparation, SVM model training, and SVM classifier deployment. The job history server allows the user to retrieve log information on completed applications. This information can be exploited as a source to extract training data. Next, preprocessing is employed to normalize data and eliminate outliers. After preparing training data, an SVM classifier is trained. Finally, the classifier is deployed, and SVM-LRU uses this data classification in its cache replacement decision. The Hadoop in-memory cache component is composed of NameNode, DataNodes, Application Master, and containers. The NameNode is responsible for coordinating all the DataNode caches in the cluster and stores two types of metadata: block metadata includes the location of data blocks on DataNodes, and cache metadata maps the locations of cached data. The NameNode periodically receives a cache report from each DataNode describing all the blocks cached on a given DataNode. The cache report is used to update cache metadata. For better utilization of the large distributed in-memory caches in Hadoop clusters, each Hadoop container always sends a request to cache the accessed block to the NameNode; the NameNode then controls which data blocks are added and evicted to and from in-memory caches. We use centralized cache management; as a result, H-SVM-LRU is located on the NameNode. In our system, a container is launched to run a task (either a Map task or a Reduce task) and always sends a request to find cached data blocks. Application Master manages the user job lifecycle and resource needs of individual applications. Each application has an associated unique, framework-specific Application Master. It coordinates an application’s execution in the cluster and also manages faults. Assume that a MapReduce application requires two data blocks A and B, where data block A is located on DataNode X and data block B is cached on DataNode Y. The Application Master communicates with the NameNode (which contains the cache metadata) to query the locations of the input blocks and their availability in the cache. A cache miss occurs when looking for data block A and a cache hit occurs for data block B. In the cache miss state, the NameNode looks for block metadata to find the DataNode that contains data block A. Although there are multiple replicas of a given data block that can be accessed by the query, we choose the first one to reduce search time. We could cache all replicas of this data block in the DataNodes that contain them. In this case, cache replication is identical to data replication and can increase the cache hit rate ratio. In this case, excessive cache space is occupied, conflicting with the goal of our proposed method. We then use the H-SVM-LRU algorithm and the PutCache(A, X) method to place this data block in the cache. After caching, DataNode X piggybacks the cache report with a heartbeat message and sends it to the NameNode to update the cache metadata. The NameNode informs the Application Master of the location of cached data by using GetCache(A, X) and GetCache(B, Y). When a cache hit occurs, the GetCache method of the H-SVM-LRU algorithm is called to retrieve the cached data. The result is that not only do applications wait for the data block to be cached, but also the probability of accessing cached data has increased via effectively using cache space. It is important to note that applications do not necessarily access all their demand data from the cache; it is not necessary to wait for data to be cached. 4.2 The proposed H-SVM-LRU algorithm The proposed H-SVM-LRU algorithm is designed to mitigate cache pollution by intelligently distinguishing between frequently accessed (hot) and infrequently accessed (cold) data. The cache is logically divided into two lists: the Cold list, which stores less frequently used blocks (cold data), and the Hot list, which stores frequently accessed blocks (hot data). Both lists are maintained as doubly linked structures with head and tail pointers. A boundary pointer separates these two lists and always points to the head of the Cold list, ensuring that the least valuable block (the oldest cold item) is the primary candidate for eviction. The algorithm is composed of three main procedures: GetCache, CachePlacement, and PutCache which collectively regulate data retrieval, placement, and insertion into the cache based on a block’s predicted class. For each requested data block, the system checks whether the block resides in the cache. If the block exists (cache hit), the GetCache procedure retrieves it and repositions it within the appropriate list according to its updated classification. If the block is absent (cache miss), its metadata is used to locate it on the corresponding DataNode, after which the PutCache procedure inserts it into the cache, potentially evicting a candidate block if the cache is full. Algorithm 1 : H-SVM-LRU on Hadoop Input: Sequence of requested data blocks R = {DB 1 , DB 2 , …, DB n } Output: Updated cache contents with reduced pollution Data Structures: ColdListHead → pointer to first (oldest) cold item ColdListTail → pointer to last (newest) cold item HotListHead → pointer to first (oldest) hot item HotListTail → pointer to last (newest) hot item BoundaryPtr → pointer marking the boundary between the Cold list and the Hot list 1- for each data block DB x requested by task T i do 2- DN y ← lookup(DB x ) in cache metadata 3- if DB x ∈ Cache then 4- cache hit 5- call GetCache(DB x , DN y ) 6- else 7- cache miss 8- DN z ← lookup(DB x ) in block metadata 9- call PutCache(DB x , DN z ) 10- end if 11- end for Algorithm 1 outlines the general mechanism of H-SVM-LRU. The input is a sequence of requested data blocks R={DB 1 ,DB 2 ,…,DB n }, and the output is an updated cache state with reduced pollution. The procedure first attempts to find each requested block in the cache metadata. In the case of a cache hit, GetCache is invoked to detach and reclassify the block before re-insertion into the appropriate list based on its predicted class. In the event of a cache miss, PutCache is called to fetch and insert the block, while evicting the oldest cold block whenever necessary. If the Cold list is empty, eviction is performed from the Hot list. By maintaining explicit pointers for list management and eviction boundaries, H-SVM-LRU efficiently removes cold data at an early stage, reduces cache pollution, and improves space utilization. If all cached items belong to the same class, the algorithm reduces to conventional LRU, where eviction decisions rely solely on recency of access. The GetCache procedure (Algorithm 2) is responsible for retrieving and repositioning a cached block. The procedure begins by removing the block from its current location to preserve structural integrity. Pointers of neighboring nodes are updated accordingly, and the head or tail references of the respective lists are adjusted when the block being removed occupies these positions. Once removed, the block is classified using a SVM classifier, which predicts whether it should be placed in the hot or cold list. The CachePlacement procedure is then invoked to reinsert the block based on its updated classification. The PutCache procedure (Algorithm 3) manages the insertion of new blocks into the cache. When the cache is at full capacity, it first attempts to evict the head of the Cold list. If the Cold list is empty, eviction is performed from the head of the Hot list. After sufficient space is created, the incoming block is classified using the SVM model, and the CachePlacement procedure is called to insert the block into the corresponding list. This approach ensures that eviction prioritizes less frequently used items, while hot data is retained for longer periods. Finally, the CachePlacement procedure (Algorithm 4) inserts the block into the cache based on its predicted class. If the block is classified as hot, it is placed at the tail of the Hot list (or as the first element if the list is empty). If the block is classified as cold, it is inserted into the Cold list, with the boundary pointer updated to reflect the new head of the Cold list. This classification-driven strategy dynamically adapts cache content to evolving workload characteristics, thereby reducing pollution and improving cache efficiency. Algorithm 2: Procedure GetCache(DB x , DN y ) // retrieve DB x from current list 1- if DB x .prev ≠ NULL then 2- DB x .prev.next ← DB x .next 3- else // DB x was a head node 4- if DB x is in the Hot list then 5- HeadHot ← DB x .next 6- else 7- HeadCold ← DB x .next 8- BoundaryPtr ← HeadCold 9- end if 10- end if 11- if DB x .next ≠ NULL then 12- DBx.next.prev ← DB x .prev 13- else // DB x was a tail node 14- if DB x is in the Hot list then 15- TailHot ← DB x .prev 16- else 17- TailCold ← DB x .prev 18- end if 19- end if 20- DB x .prev ← NULL 21- DB x .next ← NULL 22- Class-DB x ← Apply-SVM(features) 23- Call CachePlacement(DB x , Class-DB x ) Algorithm 3: Procedure PutCache(DB x , DN z ) 1- if cache_size = = N then // Cache size is full 2- if HeadCold ≠ NULL then // Evict from cold list 3- evict(HeadCold) 4- HeadCold ← HeadCold.next 5- BoundaryPtr ← HeadCold 6- else 7- evict(HeadHot) // Cold list is empty and evict from hot list 8- HeadHot ← HeadHot.next 9- end if 10- end if 11- Class-DB x ← Apply-SVM(features) 12- Call CachePlacement(DB x , Class-DB x ) Algorithm 4: Procedure CachePlacement(DB x , Class-DB x ) ===== Case 1: Hot data ===== 1-if class(DB x ) = 1 then // Insert at Hot list 2- if HeadHot = = NULL then //Hot list is empty 3- HeadHot ← DB x 4- TailHot ← DB x 5- DB x .prev ← NULL 6- DB x .next ← BoundaryPtr 7- else // Insert at the tail of hot list 8- DB x .prev ← TailHot 9- DB x .next ← BoundaryPtr 10- TailHot.next ← DB x 11- TailHot ← DB x 12- end if // ===== Case 2: Cold data ===== 13- else 14- if HeadCold = = NULL then //Cold list is empty 15- HeadCold ← DB x 16- TailCold ← DB x 17- DBx.prev ← TailHot 18- DB x .next ← NULL 19- if TailHot ≠ NULL then 20- TailHot.next ← DB x 21- end if 22- else 23- DB x .prev ← TailCold 24- DB x .next ← NULL 25- TailCold.next ← DB x 26- TailCold ← DB x 27- end if 28- BoundaryPtr ← HeadCold 29- end if 4.2.1 Illustrative Example of Algorithm Behavior To further demonstrate the advantages of the proposed SVM-assisted cache management strategy, we present a comparative trace analysis against the baseline LRU policy. In this example, the cache capacity is fixed at N = 3, and the request sequence is defined as:{A,B,C,A,D,B,C,C,A,E, C}. Initially, the classifier labels blocks A and B as hot, while blocks C, D, and E are identified as cold. To emulate the dynamic nature of real workloads, we introduce a reclassification event at step 8, where the SVM model updates its prediction for block C, changing its label from cold to hot. This reflects scenarios where previously infrequent blocks become popular due to shifting access patterns. Under the LRU policy, eviction decisions rely solely on recency information. As shown in Table 2 , LRU achieves only three cache hits over the eleven requests (steps 4, 8, and 11), yielding a hit ratio of approximately 27%. The policy prematurely evicts hot items when cold blocks occupy the cache, failing to adapt when block C becomes frequently accessed later in the trace. Table 2 The sample of LRU Step Request Cache state Action Hit/Miss 1 A [A] Insert A Miss 2 B [B, A] Insert B Miss 3 C [C, B, A] Insert C Miss 4 A [A, C, B] Move A Hit 5 D [D, A, C] Evict B, insert D Miss 6 B [B, D, A] Evict C, insert B Miss 7 C [C, B, D] Evict A, insert C Miss 8 C [C, B, D] Move C Hit 9 A [A, C, B] Evict D, insert A Miss 10 E [E, A, C] Evict B, insert E Miss 11 C [C, A, B] Move C Hit In contrast, the proposed H-SVM-LRU policy incorporates classification prior to reinsertion. As can be seen in Table 3 , during the early stages (steps 1–7), block C is placed in the Cold list and thus becomes the primary target for eviction, while hot blocks A and B remain protected in the Hot list. At step 8, when the classifier re-labels C as hot, the algorithm immediately migrates the block into the Hot list. From that point onward, cold blocks (like D) are preferentially evicted, preserving the newly promoted hot block C in the cache. At step 10, the cold block E arrives while the cache is full, the algorithm evicts the hot data to insert E into the Cold list. In this case, when the cache is full and a newly classified cold block arrives, H-SVM-LRU first attempts eviction from the Cold list. If the Cold list is empty, the algorithm evicts from the Hot list to admit the new block. While this appears to contradict the principle of pollution reduction, it ensures adaptability by maintaining a dynamic balance between hot and cold regions of the cache. This design prevents cache monopolization by hot data and allows cold blocks the opportunity to be reclassified into the Hot list if their access frequency increases. Thus, the mechanism trades a small risk of pollution for long-term adaptability and robustness. This adaptive behavior results in five cache hits (steps 4, 6, 8, 9, and 11), a hit ratio of approximately 45%. This example highlights two critical properties of the proposed scheme. First, the classification-driven separation of hot and cold data ensures that eviction targets are selected based on predicted utility rather than recency alone, thereby reducing cache pollution. Second, the design inherently supports dynamic reclassification, enabling the system to adapt to evolving workloads without requiring additional mechanisms. These capabilities allow H-SVM-LRU to outperform classical LRU in scenarios where data access distributions are non-stationary. 4.2.2 Algorithm complexity Since the Hot and Cold lists are maintained as doubly linked lists, insertion, deletion, and pointer updates require O(1) operations. Cache lookup is performed via a metadata index, which also requires O(1) time. The space complexity remains O(n), where n is the cache capacity, with a negligible constant overhead for maintaining head, tail, and boundary pointers. Table 3 The sample of H-SVM-LRU Step Request Class Hot Cold Action Hit/Miss 1 A hot [A] [] Insert into Hot list Miss 2 B hot [A, B] [] Insert into Hot list Miss 3 C cold [A, B] [C] Insert into Cold list Miss 4 A hot [B, A] [C] Move A to Hot tail Hit 5 D cold [B, A] [D] Evict C, insert D into Cold list Miss 6 B hot [A, B] [D] Move B to the tail Hit 7 C cold [A, B] [C] Evict D, insert C into Cold list Miss 8 C hot [A, B, C] [] Move C to Hot tail Hit 9 A hot [B, C, A] [] Move A to Hot tail Hit 10 E cold [C, A] [E] Evict B, insert E into Cold list Miss 11 C hot [A, C] [E] Move C to Hot tail Hit 5. H-SVM-LRU implementation In this section, we provide details for the three phases of H-SVM-LRU implementation: training data preparation, model training, and Integration of H-SVM-LRU with Hadoop’s Caching Mechanism. 5.1 Training data preparation phase This phase consists of four steps: data collection, feature selection, providing target labels, and data preprocessing. We now explain each step in more detail: Data collection : We consider two independent scenarios: request awareness and non-request awareness. In the first scenario, the sequence of requested data is determined by the tasks, and we consider the following data features: size, recency, frequency, and type (input data of Map tasks, intermediate data, and output of Reduce tasks). Table 4 describes these features. Table 4 Features for the request-awareness scenario Feature name Description Type The input of the Map task, intermediate data, and the output of the Reduce task Size Size of data blocks in MB Recency Time last used Frequency Number of uses In the second scenario, we use the ALOJA [ 28 ] Hadoop dataset that gathers training data by executing various workloads from the Intel Hi-Bench benchmark suite [ 29 ], [ 30 ], a comprehensive benchmark suite for Hadoop consisting of a set of Hadoop programs, including both synthetic micro-benchmarks and real-world Hadoop applications. We then extract some useful features from the Hadoop job history [ 31 ], consisting of log information of MapReduce jobs. In the request awareness scenario, the demand data are predefined. In other words, training data have a label; therefore, target labels do not need to be generated. This allows us to consider fewer features than in the second scenario, where the requirement to generate target labels may require the consideration of more features. Feature selection : It is important to choose suitable features to ensure model performance while reducing overfitting and computational demand. There are different metrics to select features. Selecting features based on missing values Selecting features based on variance Selecting features based on correlation with other features Selecting features based on model performance Table 5 Features for the non-request-awareness scenario to provide the target label Feature name Feature type Description JobName Job Name of job: WordCount, Sort, Grep, Sort, etc MapsTotal Job The total number of Map tasks MapsCompleted Job The number of completed Map Tasks ReducesTotal Job The total number of Reduce tasks ReducesCompleted Job The number of completed Reduce tasks Job-Status Job Valid values of job state are: New, Initiated, Running, Succeeded, Failed, Killed, and Error. Cache Affinity Job Cache affinity of application: Low, High, Medium Start time Job The time the job started (in ms) Finish time Job The time the job finished (in ms) Task type Task Map or Reduce task Task status Task The states of the task are: New, Scheduled, Running, Succeeded, Failed, and Killed AvgMapTime Task The average time of a Map task (in ms) AvgReduceTime Task The average time of a Reduce task (in ms) Progress Task The progress of the task as a percentage In this step, we select the features given in Table 5 . For simplicity, we ignore the size of the data and recently used data features because input data are split into data blocks of the same size, and the recently used data feature is taken into account by the LRU policy. Providing target labels : Since the training dataset does not have target labels, we should provide them. For this purpose, we use a scenario that is based on job status and the status of its Map and Reduce tasks to provide a label for the requested task data. Table 6 describes different cases for this scenario. Dataset preprocessing : The last step of preparation of the training dataset is data preprocessing, which includes the elimination of irrelevant data, unnecessary fields, and data normalization. Table 6 Guidelines to provide target labels Job status Map task status Reduce task status Input Map task label Input Reduce task label Rationale New New New Cold Cold The job is waiting to be processed. Initiated Scheduling Waiting Hot Cold The outputs of the Map tasks have not been generated yet. Running Running Waiting Hot Cold The outputs of the Map tasks have not been generated yet. Running Succeeded Scheduling Cold Hot If the input of Reduce is the output of the completed Map task. Running Succeeded Running Cold Hot The map task has been completed. Running Failed Waiting Cold Cold The Map task failed and cannot generate intermediate data. Running Succeeded Failed Cold Cold The map task has been completed, and the Reduce task has failed and cannot continue. Running Killed Waiting Hot Cold The killed task may execute on another node (speculative task) Running Succeeded Killed Cold Hot The failed task may execute on another node (speculative task) Succeeded Succeeded Succeeded Cold Cold The job has been completed, and we do not consider the relationship between jobs and repetitive and recurring jobs. Failed Don’t care Don’t care Cold Cold Job status has a higher priority than task status 5.2 Model training phase In this phase, we use the Scikit-Learn library in Python to implement an SVM for classifying data. The training process includes two steps: choosing the best kernel function and evaluating the trained model. Choosing the best kernel function : SVM has several available kernel functions that can be used for training, including polynomial, sigmoid, linear, and RBF. We evaluate the performance of kernel functions by using the confusion matrix to choose an appropriate kernel function for the training dataset. A confusion matrix is a table that is often used to describe the performance of a classification model. The metrics that are used in the confusion matrix method for investigating the correctness of classification are: Recall : The ability of a classification model to identify all relevant instances. Precision : The ability of a classification model to return only relevant instances. F1 score : This metric combines recall and precision using the harmonic mean. The formulas for calculating these metrics are as follows: $$\:\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}=\frac{TP}{TP+FN}\:\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}=\frac{TP}{TP+FP}\:\text{F}1\:\text{S}\text{c}\text{o}\text{r}\text{e}=2*\frac{Precision*Recall}{Precision+Recall}$$ We chose the RBF function as the kernel function for our dataset because it demonstrated the best performance. The experimental results are reported in Table 7 . Table 7 Evaluation of different kernel functions Kernel function Precision Recall F1-score Accuracy Linear 0 0.67 0 1 0 0.8 0.71 1 1 1 0.33 1 0.5 RBF 0 0.8 0 1 0 0.81 0.85 1 0.65 1 0.7 1 0.75 Sigmoid 0 0.57 0 1 0 0.73 0.57 1 0 1 o 1 0 Evaluating the trained model : The dataset is divided randomly into training data (75%) and testing data (25%). In this phase, we use testing data and cross-validation methods to evaluate the training model and its prediction accuracy. The resulting prediction accuracy is 83%. Impact of misclassification in H-SVM-LRU : The effectiveness of the proposed H-SVM-LRU approach is highly dependent on the accuracy of the underlying SVM classifier, which determines whether a cached block is hot (frequently accessed) or cold (a candidate for eviction). When the SVM employs an RBF kernel, the parameters C (regularization) and γ (kernel width) play a critical role in shaping the decision boundary. A low value of C allows the model to tolerate more misclassifications, resulting in premature eviction of hot blocks and an increased cache miss ratio. Conversely, a very high C enforces strict error minimization but risks overfitting transient workload patterns, reducing adaptability in dynamic Hadoop environments. Similarly, a low γ produces a smoother boundary that may fail to capture short-term locality of reference, while an excessively high γ yields a highly complex boundary that memorizes recent access patterns without generalizing to future workloads. Thus, improper tuning of these parameters may increase misclassification rates and reduce the overall benefit of intelligent cache management. Misclassification in SVM-LRU manifests in two forms: false positives (classifying cold blocks as hot data) and false negatives (classifying hot blocks as cold data). False positives cause the cache to retain unnecessary blocks, thereby wasting limited memory resources and lowering cache efficiency. More critically, false negatives lead to premature eviction of hot blocks, forcing Hadoop to re-fetch them from disk or remote nodes, which increases I/O overhead and prolongs job execution times. This effect cascades into multiple components of the Hadoop ecosystem: the HDFS caching layer experiences higher block replacement frequency, MapReduce tasks suffer from increased shuffle and fetch delays, and YARN resource utilization is degraded due to extended waiting times. Therefore, reducing misclassification through balanced parameter tuning and representative training data is essential to maximizing the cache hit ratio and ensuring stable performance improvements in Hadoop clusters. To mitigate the impact of misclassification, hyperparameter tuning, including the regularization parameter C and the RBF kernel coefficient γ, was optimized using grid search combined with five-fold cross-validation to ensure an effective balance between model complexity and generalization. To address class imbalance, class weights were adjusted inversely proportional to class frequencies, and the Synthetic Minority Oversampling Technique (SMOTE) was applied to generate additional samples for underrepresented hot blocks. Experimental evaluation revealed that approximately 17% of data blocks were misclassified, with false negatives (hot blocks predicted as cold) accounting for most of the cost by causing a 5.2% increase in cache misses compared to the baseline LRU policy. In contrast, false positives (cold blocks predicted as hot) introduced only a minor overhead, corresponding to about a 1.3% increase in cache utilization. Importantly, the training process required only approximately 40 seconds and about 220 MB of memory for 500,000 access records, which is negligible relative to Hadoop job runtimes. Thus, the modest training overhead is easily amortized by the reduction in cache misses, making the H-SVM-LRU approach practical for real-world cluster environments. 5.3 Integration of H-SVM-LRU with Hadoop’s Caching Mechanism The proposed H-SVM-LRU framework was integrated into Hadoop by extending existing caching components to enhance block admission and eviction decisions, while maintaining compatibility with the default LRU policy. In Hadoop Distributed File System (HDFS), caching is coordinated by the CacheManager, which manages block placement and eviction across nodes. Although HDFS provides a centralized cache system that allows users to manually add files into memory for faster access, this mechanism is not intelligent; it must be manually adjusted, and thus cannot adapt to dynamic workload patterns. To address this limitation, H-SVM-LRU introduces a machine learning–driven advisory layer on top of Hadoop’s default caching mechanism. Specifically, an SVM-based decision module was integrated at the boundary pointer where blocks transition between cold and hot lists. Instead of replacing the entire eviction policy, the SVM serves as a predictive filter: blocks with high predicted utility are promoted to the hot list, while those with low utility are marked as eviction candidates. If no reliable prediction is available, the system naturally falls back to the default LRU strategy, thereby ensuring robustness and backward compatibility. A feature extraction layer was developed to collect runtime metadata, including block access frequency, recency, size, and node locality. This layer operates within the Hadoop I/O pipeline (like DFSClient and data access logs) and is designed to gather features with minimal overhead. The extracted features are periodically fed into the SVM decision engine, which determines whether a block should be admitted into or evicted from the cache. To reduce inference latency, recent predictions are cached in memory. To support scalability and adaptability, a model update interface was incorporated, enabling the SVM model to be retrained either offline or online using YARN-based training jobs. Updated models are synchronized across caching nodes through Hadoop’s configuration service. Several engineering challenges were addressed during integration: Feature collection overhead was minimized by using sampling and batch updates. Model inference latency was mitigated through prediction caching. System compatibility was ensured by embedding the SVM module as a pluggable advisory component rather than a core replacement of LRU. In addition to cache decision logic, a daemon server architecture was designed to manage job submissions and automate cache operations. The entry process collects user-specified Hadoop jobs from multiple sources (e.g., shell scripts, APIs, web interfaces) and forwards them to the daemon server via TCP/IP. The daemon server continuously runs in the background, receiving job commands, parsing the required files, and applying the cache replacement policy before invoking the Hadoop cluster to execute the job. This layer ensures that file access patterns are tracked and optimized in real time, providing an unattended mechanism to admit working files into the cache and evict unused ones. The overall workflow of HDFS caching with H-SVM-LRU can be summarized as follows: Job submission : The entry process forwards user commands to the daemon server. File parsing and caching : The daemon server identifies input files, applies the H-SVM-LRU policy, and issues cache commands. Cache execution: Namenode translates the files into blocks and dispatches cache commands to Datanodes. Block admission: Datanodes pin the blocks into the memory cache pool and return acknowledgment to the namenode. Job execution : Once caching is finalized, the Hadoop cluster executes the job with improved memory-based I/O performance. Through this modular integration, the SVM-enhanced cache manager operates transparently within Hadoop’s existing framework, significantly reducing cache pollution and improving job execution time. By combining Hadoop’s centralized caching functionality with machine learning–driven predictive intelligence, H-SVM-LRU provides a scalable, unattended, and robust solution that improves performance for data-intensive workloads, particularly those with repetitive file access patterns such as MapReduce-based WordCount, Sort, and Grep applications. 6. H-SVM-LRU evaluation In this section, we explain the experimental environment, including software and hardware configurations, and set some Hadoop configuration parameters. Our evaluation is divided into two sections: the H-SVM-LRU performance evaluation and the investigation of the impact of the proposed algorithm on Hadoop performance. We first evaluate the efficiency of the proposed algorithm by using the cache hit ratio as the performance metric. Finally, we perform experiments to present the impact of the H-SVM-LRU cache replacement policy on overall Hadoop performance. 6.1 Experimental setup For our experiments, we use a cluster consisting of a single NameNode and nine DataNodes located in the same rack. Hardware configuration : Nodes are connected via a 10 Gigabit Ethernet switch. Each node is configured with an Intel Core i7 with an 8-core processor, 64 GB memory, and a 1 TB hard disk. Software configuration : We use Ubuntu 20.04 as the operating system and JDK 11, Hadoop version 3.4.1, and Intel HiBench version 7.1.1. Hadoop configuration parameters : The block size of files in HDFS is chosen to be one of two values, 64 MB or 128 MB; the number of cache replicas is set to one, and data replication is set to 3. The memory sizes for the Map task, Reduce task, and node manager are 1GB, 2GB, and 8 GB, respectively. The maximum size of the cache is set to 1.5 GB, and we assume that each DataNode in the cluster has the same size cache. Table 8 presents the Hadoop configuration parameters with their values. The remaining Hadoop configuration parameters are set to the default values. MapReduce applications : As mentioned earlier, we use Intel HiBench as a Hadoop benchmark suite that contains the following applications: 1) WordCount is a CPU-intensive application that counts the number of occurrences of each word in a text file. 2) Sort is a typical I/O-bound application that sorts input data. 3) Grep is a mix of CPU-bound and I/O-bound operations that searches for a substring in a text file. These three applications are supported by Hadoop. 4) Join is a multiple-stage application where the results of the previous step are used as input for the next step. 5) Aggregation (supported by Hive) is used for the aggregation operation in a query. Dataset : For carrying out experiments, we use the Gutenberg dataset [ 32 ] as input data for the WordCount application to evaluate its execution time based on different input data. As we mentioned earlier in the implementation section, the ALOJA dataset is used as a training dataset for the SVM model. The applications use input files generated by a random text generator. Table 8 Hadoop parameters Hadoop property name Hadoop property value Dfs.replication 3 Dfs.blocksize 64M or 128M Mapreduce.map.memory.mb 4096 Mapreduce.reduce.memory.mb 8192 MapReduce.jobhistory.webapp.address Master:19888 MapReduce.reduce.speculative False MapReduce.map.speculative False Mapred.map.tasks.speculative.execution False Mapred.reduce.tasks.speculative.execution False 6.2 Metrics In these experiments, we consider three key metrics to evaluate our proposed algorithm. The first, the cache hit ratio is used to evaluate the performance of the proposed H-SVM-LRU cache replacement policy. The other two metrics, job execution time and normalized run time, are used for determining the impact on Hadoop performance. In the following, we explain these three metrics: Hit ratio and byte-hit ratio : These are two major factors to evaluate the performance of the cache replacement strategy. Hit ratio relates the number of cache hits to the total number of requests, and byte-hit ratio relates the number of bytes obtained from the cache to the total number of bytes requested. It is very difficult for a cache replacement strategy to simultaneously optimize these two metrics because improving the hit ratio usually favors small-sized items over large-sized items, leading to a reduced byte-hit ratio. In contrast, strategies that tend to increase the byte-hit ratio and prefer large-sized items typically decrease the hit ratio. In the experiments, we only consider the hit ratio because data blocks have the same size. Job execution time : This plays a vital role in Hadoop performance improvement, as it is related to data access time. The data access time decreases significantly if we can access data from the cache instead of the disk, reducing the job execution time. To calculate the average job execution time, we run each application five times. Normalized run time : For each application in a workload, its run time is normalized based on the Hadoop original (Hadoop no-cache). The average normalized time for applications is then calculated to evaluate overall Hadoop performance. 6.3 H-SVM-LRU performance In this section, we compare H-SVM-LRU with LRU by considering two metrics: cache hit ratio based on different cache sizes and job execution time in both cache cold and cache warm states. 6.3.1 Cache hit ratio For carrying out experiments to calculate the cache hit rates, we consider two data block sizes: 64MB and 128 MB. The input data size is 2GB with the same sequence of requested data for each mechanism, and the cache size is the same in all DataNodes (1.5 GB). We then calculate the cache size based on the maximum number of data blocks that can be cached, which was varied between 6–12 for a 128 MB block size and 6–24 for a 64 MB block size. Figure 3 presents cache hit ratios for block sizes of 64 MB and 128 MB. In Fig. 4 , we can observe that by increasing the cache size, the hit ratio is increased for all strategies as more requested data can be cached. Also, by increasing the data block size, the cache hit ratio increases, for instance, when the cache size is 6 and the data block size has increased from 64MB to 128MB, the cache hit ratio has approximately doubled because we can cache more data. Both diagrams demonstrate that the hit ratio of H-SVM-LRU is higher than for LRU, in particular when the cache size is small. H-SVM-LRU uses the SVM model to identify hot data more accurately by learning access patterns over time, prioritizing hot data over cold data in a cache placement policy, and avoiding unnecessary evictions of hot data. It is noteworthy that H-SVM-LRU and BN-LRU achieve comparable cache hit ratios when the cache size is sufficiently large, as both algorithms effectively address cache pollution by predicting the likelihood of data block reuse. Nevertheless, BN-LRU underperforms in scenarios with small cache sizes, exhibiting a lower cache hit ratio compared to H-SVM-LRU. This performance gap can be attributed to the Bayesian model’s reliance on historical data; when the available dataset is limited, the model lacks adequate information to generate accurate predictions, thereby reducing its effectiveness relative to H-SVM-LRU. Table 9 Improvement ratio based on hit ratio Cache size IR for Data block size (64 MB) IR for Data block size (128 MB) LRU BN-LRU LRU BN-LRU 6 63.63% 21.43% 20.83% 12.05% 8 64.70% 33.33% 15.15% 8.5% 10 33.33% 20.75% 10.25% 2% 12 33.33% 9.09% 6.81% 1% 14 22.58% 5.20% N/A N/A 16 14.28% 2.5% N/A N/A 18 7.89% 0% N/A N/A In order to investigate the performance improvement of H-SVM-LRU over LRU and BN-LRU, we calculate the improvement ratio of the performance (IR) based on the hit ratio for each cache size. Table 9 presents the relative improvement of H-SVM-LRU over LRU and NB-LRU for different cache sizes for both 64 MB and 128 MB block sizes. We observe that H-SVM-LRU has the greatest improvement ratio for small cache size and small data blocks, suggesting that H-SVM-LRU is suitable for small cache size because it contains more hot data. Although both H-SVM-LRU and BN-LRU mitigate cache pollution by predicting frequently accessed data blocks, BN-LRU requires sufficient historical information to estimate the probability of data reuse, which is often limited for smaller block sizes. Consequently, BN-LRU achieves comparable performance to SVM-LRU only when the block size is large. 6.3.2 Job execution time in both cache states In this experiment, we evaluate two cold cache and warm cache scenarios to compare the performance of LRU and H-SVM-LRU algorithms with respect to job execution time. To achieve this, we execute a variety of applications with low, medium, and high cache affinity, as well as different job types, including I/O-bound and CPU-bound jobs. In the cold cache scenario, LRU initially exhibits a low cache hit ratio, which gradually improves as frequently accessed data populates the cache over time. In contrast, H-SVM-LRU leverages predictive capabilities to identify potential hot data even during a cold start, resulting in a slightly higher cache hit ratio compared to LRU at early stages, though it remains low overall. Consequently, the job execution time for cold cache scenarios is lower with H-SVM-LRU than with LRU. In the warm cache scenario, LRU demonstrates susceptibility to evicting moderately accessed hot data, particularly during spikes in cold or hot data. This behavior can result in suboptimal cache hit ratios. On the other hand, H-SVM-LRU’s SVM classifier is able to predict the reuse potential of hot data, thereby minimizing unnecessary evictions. By adapting to workload changes using historical patterns, H-SVM-LRU enhances cache efficiency and reduces job execution time. Figure 5 : Job execution time in cold cache and warm cache scenarios As illustrated in Fig. 5 , H-SVM-LRU consistently achieves shorter execution times compared to LRU by effectively predicting reusable data blocks. The degree of improvement, however, varies across applications. In the cold cache scenario, the most significant performance gains are observed in applications such as join, sort, and aggregation, which are particularly sensitive to disk I/O. In the warm cache scenario, both algorithms exhibit improved performance, but H-SVM-LRU consistently outperforms LRU due to its more effective eviction policies. While the performance gap is narrower for simpler applications, such as Grep and Wordcount, it becomes more pronounced in complex operations, such as join. 6.4 Impact of H-SVM-LRU on Hadoop performance In this section, we carry out two separate experiments to investigate the impact of H-SVM-LRU on Hadoop performance: 1) Job execution time based on different input data sizes. 2) Normalized run time of multiple applications in a workload. For this purpose, we compare Hadoop performance in the following scenarios to extract the impact of the proposed replacement policy on Hadoop performance: H-NoCache: The Hadoop original does not utilize HDFS in-memory caching; it is used as a baseline. H-LRU: Hadoop uses traditional LRU as a cache replacement policy. H-NB-LRU: Integrates a probabilistic inference mechanism, based on a Bayesian network, with the LRU strategy to predict the likelihood of reusing data block in the future. H-SVM-LRU: H-SVM-LRU is used as a cache replacement policy. 6.4.1 Job execution time based on different input data sizes In this experiment, we consider the job execution time of the WordCount MapReduce application based on different input data sizes for two different data block sizes (64 MB and 128MB). Figure 4 presents job execution time based on input data size for our three scenarios. As illustrated in Fig. 5 , there is a significant reduction in job execution time when Hadoop is equipped with a caching mechanism compared to the original Hadoop. This improvement becomes more pronounced as the input data size increases, since the likelihood of accessing requested data directly from the cache grows accordingly. When comparing H-LRU and H-SVM-LRU, the results indicate that H-SVM-LRU consistently achieves lower execution times. This outcome is attributed to the higher cache hit ratio in H-SVM-LRU, which enables the cache to warm up with frequently accessed (“hot”) data that is more likely to be reused in subsequent operations. H-NB-LRU, while introducing moderate computational overhead for probabilistic prediction, can also reduce execution time significantly when prediction accuracy is high. However, for small cache sizes, H-NB-LRU sometimes underperforms relative to H-SVM-LRU due to the lack of sufficient historical data for accurate Bayesian estimation. In contrast, for larger input sizes and cache capacities, H-NB-LRU outperforms H-LRU by more effectively retaining frequently reused blocks and minimizing disk accesses. In the second experiment, with a data block size of 128 MB, the performance gap between the original Hadoop and Hadoop with caching widened substantially. Increasing the block size allows for more data to be cached, thereby improving the byte-hit ratio. Under this configuration, both H-SVM-LRU and H-NB-LRU achieved shorter execution times compared to H-LRU, as they are more effective at preventing cache pollution and increasing the likelihood of cache reuse. 6.4.2 Normalized run time of multiple applications in a workload In this experiment, we take into account various workloads, consisting of four concurrent MapReduce applications. We assume that all applications in one workload require an equal share of cluster resources. In addition, some applications use the same input data, and data is shared between them, for instance, Grep, WordCount, and Sort use the same input data that is generated by a random text generator, and the data is shared between aggregation and join. The cache affinity feature [ 14 ] determines how to utilize the benefit of cached data in each application such that it can be classified into three categories based on this feature: low cache affinity (Sort), medium cache affinity (WordCount, Join), and high cache affinity (Grep, Aggregation). Therefore, we provide various workloads composed of I/O-bound and CPU-bound applications by considering their cache affinity feature. Table 10 presents the list of workloads with their applications that are used in this experiment. Table 10 The list of workloads with their applications Workload App1 App2 App3 App4 Input data size (GB) W1 Aggregation Grep Join WordCount 257.3 W2 Aggregation Grep Sort WordCount 262.9 W3 Aggregation WordCount Grep Grep 376.2 W4 Aggregation Sort Grep Grep 446.7 W5 Grep Grep Sort WordCount 254.3 W6 Aggregation Grep Join Sort 377.1 In order to compare the performance for each workload, we calculate the normalized run time based on the Hadoop original (H-No-Cache). Figure 6 illustrates the experimental results. If we compare H-LRU with Hadoop, we observe that Hadoop-LRU improves performance by 11.33% and the average improvement of H-SVM-LRU is 16.16%, 4.83%, and 2.45% against Hadoop-original, H-LRU, and H-NB-LRU, respectively. The number of cache hits in H-SVM-LRU is higher than in Hadoop-LRU as a result of using cache space efficiently, resulting in a lower average execution time. Also, by leveraging machine learning to predict reusable data, H-SVM-LRU achieves lower execution times during both cold and warm cache states. It prioritizes frequently accessed blocks and avoids unnecessary evictions, making it effective in sort operations that rely heavily on sequential I/O. H-NB-LRU predicts reused data blocks based on a probabilistic model that needs historical data. While it performs well in warm cache states, its reliance on historical data makes it less adaptable to rapidly changing workloads, as probability estimates may lag behind dynamic access patterns, resulting in higher execution times compared to H-SVM-LRU. All of the studied algorithms have the best improvement in workload W3 and W5 due to the fact that workload W3 is composed of high cache affinity applications, and workload W5 has the most shared data between applications. Figure 7 provides the normalized run times of applications for each workload in the H-SVM-LRU scenario, in order to investigate the impact of H-SVM-LRU on the performance of each application in the workloads. To clarify, we use different colors to separate each application. We observe that some I/O-intensive applications like Grep and Sort show significant improvements in their performance because Sort can benefit from reusing cached data (the data also used by Grep and WordCount). I/O-bound jobs also spend most of their time on reading blocks, meaning they can have increased benefit from cached data. Therefore, the performance of I/O-intensive applications like Sort can be improved when they are combined with other applications with different resource usage patterns. Moreover, multiple-stage applications like Join have difficulty reusing the input files because the output of the previous stage is used as input for the next stage, and this usage is not well-suited for this caching mechanism. We conclude that H-SVM-LRU is appropriate for workloads with jobs that reuse a large amount of data. 7. Limitations and Future Work Although the proposed H-SVM-LRU approach demonstrates notable improvements in cache hit rates and overall Hadoop performance, several limitations must be acknowledged. First, despite achieving high prediction accuracy (83%), misclassifications can occur, particularly under workloads with highly dynamic or unpredictable access patterns. Such mispredictions may lead to premature eviction of hot blocks or retention of cold blocks, slightly degrading cache performance. Second, the effectiveness of the SVM model is heavily dependent on the representativeness of the training dataset; significant deviations in workload characteristics may reduce predictive accuracy. Third, the approach assumes that runtime features such as task type, job status, and progress are available and accurately captured, and incomplete or noisy data could compromise the model’s performance. Fourth, while the training phase of the SVM model is executed independently of runtime cache operations and thus does not directly affect Hadoop job execution, this work does not provide a detailed profiling of the computational resources required during training. Given that the training input consists only of cache metadata, the overhead is expected to be small relative to the overall workload. Nevertheless, a comprehensive analysis of CPU, memory, and energy consumption during training will be considered in future work to provide a more complete evaluation of system performance. Finally, integrating the SVM with Hadoop’s caching mechanism, while effective, adds engineering complexity for low-latency predictions in large-scale, production environments. To address these limitations, several directions for future work can be considered. Adaptive online learning methods could be employed to incrementally update the SVM model in response to evolving workload patterns without retraining from scratch, thereby reducing training overhead. Ensemble approaches, combining the SVM with lightweight predictors or heuristics, may reduce the impact of misclassifications and improve robustness. Automated feature selection techniques could dynamically identify or engineer features at runtime based on workload characteristics, enhancing predictive capability. Scalability may be further improved through distributed or parallelized training strategies, as well as GPU acceleration for large datasets. Finally, an extended evaluation for diverse workloads, including streaming data and highly volatile access patterns, would provide a more comprehensive understanding of H-SVM-LRU’s performance boundaries. 8. Conclusion In this paper, we demonstrated that H-SVM-LRU can efficiently remove cold items from the cache at an early stage to make space for new data blocks. By using this mechanism, cache pollution can be reduced, and the available cache space can be utilized more effectively. If all data blocks in the cache have the same class, the proposed algorithm is identical to LRU and only considers the recently used metric for data eviction. We propose H-SVM-LRU as an intelligent cache replacement strategy to warm up the cache and improve Hadoop performance. H-SVM-LRU combines SVM with LRU to use the limited cache capacity efficiently and avoids cache pollution by cold data. We classify cached data as either cold or warm by using an SVM classifier that predicts the likelihood of future reuse, and the evicted items are determined based on their class. Experimental results show that the cache hit ratio is improved for H-SVM-LRU by decreasing the frequency of eviction of hot data that is reused in the future, and we observe in our experiments that the average improvement of H-SVM-LRU is 16.16%, 4.83%, and 2.45% against Hadoop-original, H-LRU, and H-NB-LRU, respectively. Across all workloads, H-SVM-LRU delivers consistently lower execution times than NB-LRU, especially in I/O-intensive jobs and scenarios with limited cache sizes. NB-LRU, while lightweight, relies heavily on historical data and therefore shows performance improvements mainly in larger cache configurations. Overall, H-SVM-LRU provides more stable performance across diverse workloads, whereas NB-LRU is more sensitive to workload characteristics and cache availability. H-SVM-LRU is appropriate for small cache sizes to use its limited space efficiently, as well as being suitable for workloads composed of high cache affinity applications with varied resource usage and a large amount of shared data. The advantage of this policy includes increasing the number of cache hits which decreases data access time. This is turn reduces the job execution time, resulting in a positive impact on overall Hadoop performance. While the training time is a potential limitation of this approach, this is somewhat mitigated by the training time being independent of the execution time. Declarations Author Contribution study conception and design: R.Ghazali, A.Movaghar, S.Adabi, A.Rezaee, and D.Down; propose the subject and Supervisor: A.Movaghar, and D.Down;data collection: R.Ghazali; analysis and interpretation of results: R.Ghazali, D.Down;draft manuscript preparation: R.Ghazali, D.Down;All authors reviewed the results and approved the final version of the manuscript References Hadoop, A.: http://Hadoop.Apache.org/, last accessed 2021/02/15 Merceedi, K.J., Sabry, N.A.: A Comprehensive Survey for Hadoop Distributed File System. Asian J. Res. Comput. Sci. 46–57 (2021) Khezr, S.N., Navimipour, N.J.: MapReduce and Its Applications, Challenges, and Architecture: a Comprehensive Review and Directions for Future Research. J. Grid Comput. 15 , 295–321 (2017) Li, S., Maddah-Ali, M.A., Avestimehr, A.S.: Coded MapReduce 1–16 (2015) Zhao, Y., Wu, J., Dache: A Data Aware Caching for Big-Data Applications Using MapReduce Framew. 35–39 (2013) Bu, Y., Howe, B., Ernst, M.D.: HaLoop: Efficient Iterative Data Processing on Large Clusters. 3, (2010) Huang, T.C., Chu, K.C., Zeng, X.Y., Chen, J.R., Shieh, C.K.: CURT MapReduce: Caching and utilizing results of tasks for MapReduce on cloud computing. Proceedings – 2016 IEEE 2nd International Conference on Multimedia Big Data, BigMM 149–154 (2016). (2016) Luo, Y., Shi, J., Zhou, S., JeCache: Just-Enough Data Caching with Just-in-Time Prefetching for Big Data Applications. Proceedings - International Conference on Distributed Computing Systems 2405–2410 (2017) Ananthanarayanan, G., et al.: Pacman: Coordinated Memory Caching for Parallel Jobs Sangavi, S., et al.: An Enhanced DACHE model for the MapReduce Environment. Procedia - Procedia Comput. Sci. 50 , 579–584 (2015) Shrivastava, M., Bischof, H.: Hadoop-Collaborative Caching in Real-Time HDFS Floratou, A., Megiddo, N., Potti, N., Ozcan, F., Kale, U.: and J. Schmitz-Hermes. Adaptive Caching in Big SQL using the HDFS Cache. In Proc. of the 7th ACM Symp. on Cloud Computing (SoCC). ACM, (2016) Code, L.: Master’s Thesis Cache Affinity-aware In-memory Caching Management for Hadoop Jaewon Kwak Kwak, J., Hwang, E., Yoo, T., Nam, B., Choi, Y.: In-memory Caching Orchestration for Hadoop. (2016). 10.1109/CCGrid.2016.73 Herodotou, H., AutoCache: Employing machine learning to automate caching in distributed file systems. Proceedings – 2019 IEEE 35th International Conference on Data Engineering Workshops, ICDEW 133–139 (2019). (2019) Fabbro, R., Zhong, C., Jiang, S.M.L.-L.I.R.S.: Leveraging Machine Learning to Improve the LIRS Replacement Algorithm. 2021 International Conference on High-Performance Big Data and Intelligent Systems, HPBD and IS 74–78 (2021). (2021) Ali, W., Shamsuddin, S.M., Ismail, A.S.: Intelligent Web proxy caching approaches based on machine learning techniques. (2012) Ali, W.: Intelligent Bayesian Network-Based Approaches for Web Proxy Caching Imtongkhum, P., So-In, C., Sanguanpong, S., Phoemphon, S.: A two-level intelligent web caching scheme with a hybrid extreme learning machine and least frequently used. J. Internet Technol. 19 , 725–740 (2018) Wang, Y., Yang, Y., Wang, Q.: An efficient Intelligent Cache Replacement Policy Suitable for PACS. Int. J. Mach. Learn. Comput. 11 , 250–255 (2021) Herodotou, H.: Automating Distributed Tiered Storage Management in Cluster Computing. PVLDB. 13 (1), 43–56 (2019) Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification, (2009) Al-Qtiemat, E., Jafar, I.: Intelligent Cache Replacement Algorithm for Web Proxy Caching based on Multi-level K-means Clustering. 2021 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology, JEEIT 2021 - Proceedings 278–282 (2021) Abdo, L., Ahmad, I., Abed, S.: A smart admission control and cache replacement approach in content delivery networks. Cluster Comput. 27 , 2427–2445 (2024) Zhang, J., Wu, G., Hu, X., Wu, X.: A distributed cache for Hadoop Distributed File System in real-time cloud Services. Proceedings - IEEE/ACM International Workshop on Grid Computing 12–21 (2012) Bok, K., Lim, J., Oh, H., Yoo, J.: An efficient cache management scheme for accessing small files in Distributed File Systems. 2017 IEEE International Conference on Big Data and Smart Computing, BigComp 2017 151–155 (2017) Apache Hadoop centralized cache management in HDFS: http://hadoop.apache.org/docs/r2.4.1/hadoop-project- dist/hadoophdfs/CentralizedCacheManagement.html ALOJA: Big Data Benchmark Repository and Performance Analysis, hhtps://aloja.bsc.es Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. (2014). 10.1109/ICDEW.2010.5452747 Hibench: http://GitHub.com/ Intel- bigdata/ HiBench MapReduce, H.: server. https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html Gutenberg: dataset. http://www.gutenberg.org/ Additional Declarations No competing interests reported. Cite Share Download PDF Status: Under Revision Version 1 posted Editorial decision: Revision requested 30 Apr, 2026 Reviews received at journal 17 Feb, 2026 Reviews received at journal 12 Feb, 2026 Reviewers agreed at journal 02 Feb, 2026 Reviewers agreed at journal 29 Jan, 2026 Reviewers invited by journal 03 Nov, 2025 Editor assigned by journal 03 Nov, 2025 Submission checks completed at journal 09 Oct, 2025 First submitted to journal 07 Oct, 2025 You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7801155","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":539462470,"identity":"73da99ce-d2da-433d-981d-70743d280eac","order_by":0,"name":"Rana Ghazali","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA6ElEQVRIie3PsarCMBSA4UAgLpWuR6z6CqcEqvdterlwXQIKLk5aKDjpXvAlBOHOiuAjSEonEZ0c6tYhg6l3VKJuDvmHA4HzcQghNtsH5ibl7ANxXyYgy4lAatHtjc+JXv5fxNXLpB4f8gI7Pb7bHBeFUs12RPe5kXhb9KcIX3/Zb5DOJsi9FeNgIr4MMXQQMMgEk9UIvxPiECNB2c3XShM+FyxVCsea0MJMhB+XV7AuWOYwDIE4zHgFpBhQTxPQf8kaE+4nlAUdE3GT7vJyHo7Qnf+c0rNqtqASH6SJPIi+uW+z2Wy2+65lxkQzGcEldwAAAABJRU5ErkJggg==","orcid":"","institution":"Islamic Azad University","correspondingAuthor":true,"prefix":"","firstName":"Rana","middleName":"","lastName":"Ghazali","suffix":""},{"id":539462471,"identity":"bda73682-375e-41dd-927a-8fbe35e6a6b5","order_by":1,"name":"Ali Rezaee","email":"","orcid":"","institution":"Islamic Azad University","correspondingAuthor":false,"prefix":"","firstName":"Ali","middleName":"","lastName":"Rezaee","suffix":""},{"id":539462473,"identity":"4bfc2119-1f7e-496f-8440-4c8d1608970c","order_by":2,"name":"Sahar Adabi","email":"","orcid":"","institution":"Islamic Azad University","correspondingAuthor":false,"prefix":"","firstName":"Sahar","middleName":"","lastName":"Adabi","suffix":""},{"id":539462475,"identity":"0ab4fc8d-f8f1-46d2-9377-f2f9d7bcaedd","order_by":3,"name":"Douglas G. Down","email":"","orcid":"","institution":"McMaster University","correspondingAuthor":false,"prefix":"","firstName":"Douglas","middleName":"G.","lastName":"Down","suffix":""},{"id":539462476,"identity":"17c27a97-08ce-4cb7-81f4-7dd646765c7a","order_by":4,"name":"Ali Movaghar","email":"","orcid":"","institution":"Sharif University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Ali","middleName":"","lastName":"Movaghar","suffix":""}],"badges":[],"createdAt":"2025-10-07 15:53:20","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7801155/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7801155/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":96239853,"identity":"b9403c01-149c-42da-8d6a-4adec0eb14c5","added_by":"auto","created_at":"2025-11-19 07:07:49","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1359495,"visible":true,"origin":"","legend":"","description":"","filename":"SVMLRU8.docx","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/6d60356bf58adc2d7ff7d266.docx"},{"id":95830497,"identity":"ab983314-c96e-46ee-a359-a73d9a6890bf","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":6431,"visible":true,"origin":"","legend":"","description":"","filename":"4d1823d691f0421cb2058bd0c7257faa.json","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/6b824a4673289c526221af97.json"},{"id":96239626,"identity":"148ac06e-97b2-4df7-9df6-47763b72123c","added_by":"auto","created_at":"2025-11-19 07:07:11","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":169957,"visible":true,"origin":"","legend":"","description":"","filename":"4d1823d691f0421cb2058bd0c7257faa1enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/bd3114dcd40b78d59dd1134a.xml"},{"id":96239546,"identity":"fbd669ca-aadc-4646-b246-5fbb13dea50c","added_by":"auto","created_at":"2025-11-19 07:06:56","extension":"eps","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":86889,"visible":true,"origin":"","legend":"","description":"","filename":"drawingimage1.eps","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/2fc886f5911c73199f726415.eps"},{"id":96239411,"identity":"6523a73d-2a4c-44d9-8e5c-0b8e9c8ef673","added_by":"auto","created_at":"2025-11-19 07:06:34","extension":"jpeg","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":192708,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/8032482076f6fb6fe4576a9f.jpeg"},{"id":96239536,"identity":"c93d96a8-f7d1-4c64-9f03-b540e9f74b7d","added_by":"auto","created_at":"2025-11-19 07:06:53","extension":"jpeg","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1074,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage10.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/a2de1c5dff7190fc4244a0b6.jpeg"},{"id":95830505,"identity":"f543e841-f441-4e21-b350-00eea055d4c6","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"jpeg","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":332679,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage11.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/df8087fddd4eafab1151e898.jpeg"},{"id":95830506,"identity":"9c815517-4b43-4e3f-9f1a-3b4f032872f9","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"jpeg","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":256822,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage12.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/de282ebb62b60d3e6bf175f6.jpeg"},{"id":95830525,"identity":"c117af1e-e305-4f64-ac91-f5ed707d23dd","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"jpeg","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":397676,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage13.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/f7cdba96735c40aca9515bb4.jpeg"},{"id":96239941,"identity":"b8d17e67-7603-414f-9023-b106568a11d2","added_by":"auto","created_at":"2025-11-19 07:08:02","extension":"jpeg","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":245818,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage14.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/81e8de7b54ebbc4d8cc98cce.jpeg"},{"id":96239544,"identity":"95d0755e-a1c8-45c5-b603-a801b1b86be0","added_by":"auto","created_at":"2025-11-19 07:06:55","extension":"jpeg","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":263759,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage15.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/21f4bd9dc7121108b1fa2c94.jpeg"},{"id":96239344,"identity":"3e14a1de-e5a5-47af-b1be-a80a5be7503f","added_by":"auto","created_at":"2025-11-19 07:06:15","extension":"jpeg","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":4100,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/df09dc15f8bb5837524ece31.jpeg"},{"id":95830515,"identity":"32b32569-8291-41bb-830d-38d58f617a7c","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"jpeg","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":20486,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/586ceb5a0d436ad71c888e6d.jpeg"},{"id":96239525,"identity":"d51c65cb-f888-4d1f-9814-9642ca73cb0e","added_by":"auto","created_at":"2025-11-19 07:06:53","extension":"jpeg","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":228838,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/d512fc3a552f82bfc2cfcb02.jpeg"},{"id":96239560,"identity":"8a7e74a2-306f-4b2f-a898-2dcf53b84b9f","added_by":"auto","created_at":"2025-11-19 07:06:58","extension":"jpeg","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":22408,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/271418a9322f34da3cee677f.jpeg"},{"id":96239512,"identity":"f1bdf102-7349-45ef-9d93-66b07c26f0ab","added_by":"auto","created_at":"2025-11-19 07:06:51","extension":"jpeg","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":23638,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage6.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/cc6e35bfd87d52cf2361659c.jpeg"},{"id":96239201,"identity":"4e746e53-ca6b-4cba-85d6-8cefe82de1cc","added_by":"auto","created_at":"2025-11-19 07:05:23","extension":"jpeg","order_by":16,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":18352,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage7.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/f3cdbf41ad3d61ff0011c4aa.jpeg"},{"id":95830513,"identity":"d963118d-03d7-4e57-bc9a-69b97b74d085","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"jpeg","order_by":17,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":20151,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage8.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/6f80adf69f8f83e1ab4c3b8d.jpeg"},{"id":95830503,"identity":"64de211c-d215-4103-a487-1c68bd91a87b","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"jpeg","order_by":18,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":33061,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage9.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/1d22d0f0a538f57908a47e8e.jpeg"},{"id":95830509,"identity":"53fc1559-6067-4572-8dee-9062cf81b356","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":19,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":13212,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/9838d6604daecf9cba1a86a9.png"},{"id":96239249,"identity":"d036344a-4fcb-4412-becf-76ca15d7a430","added_by":"auto","created_at":"2025-11-19 07:05:54","extension":"png","order_by":20,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":935,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/8a4ad72a87e10d3051480701.png"},{"id":95830521,"identity":"a338df09-e1b1-40bd-9837-c357dd40cabc","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":21,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":38688,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage11.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/e3a1259e9b8f12af78a9f339.png"},{"id":96239416,"identity":"9138a18b-684f-4742-b583-f7aa912d3493","added_by":"auto","created_at":"2025-11-19 07:06:35","extension":"png","order_by":22,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":31966,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage12.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/21e5ebd0fc3bf70354aef6fe.png"},{"id":95830516,"identity":"a21d7e44-9455-493d-8cf9-8fcae826c576","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":23,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":65119,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage13.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/be9d39db8964c92fc8f0e9f9.png"},{"id":96239405,"identity":"dbcd3eea-c551-4e63-8572-15ee59e82930","added_by":"auto","created_at":"2025-11-19 07:06:33","extension":"png","order_by":24,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":29176,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage14.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/73f4e9edc27b3bc49f2d0be3.png"},{"id":96239234,"identity":"01a8418a-22b6-4ef7-a063-468d662bea2b","added_by":"auto","created_at":"2025-11-19 07:05:45","extension":"png","order_by":25,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":35931,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage15.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/c66e3cf7ca03bec802740851.png"},{"id":96239301,"identity":"8a3ca29c-bf49-480f-b23e-0c8449394dfc","added_by":"auto","created_at":"2025-11-19 07:06:03","extension":"png","order_by":26,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1495,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/d6af5c5c6610c3add47ae7dc.png"},{"id":96239486,"identity":"c582e32b-531a-45af-bfe7-867513dca6c2","added_by":"auto","created_at":"2025-11-19 07:06:46","extension":"png","order_by":27,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":4290,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/78c62ea86d94fe6f76ebb1ac.png"},{"id":95830523,"identity":"ab56c687-3478-4822-a20d-bcb85c5b499e","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":28,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":42348,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/e4808130845b684ca76c5352.png"},{"id":95830528,"identity":"7a4c3257-c149-4c6c-9dca-867b3497b63b","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":29,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":6542,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/0af7c188ff823192071f9e4d.png"},{"id":96239349,"identity":"0b9d581d-79a4-497e-97e8-0ba636ca5ea5","added_by":"auto","created_at":"2025-11-19 07:06:17","extension":"png","order_by":30,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":5069,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/3b17aae28627734b2cf3a526.png"},{"id":95830531,"identity":"088d930b-8205-48f8-82a3-c40b240558be","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":31,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":3938,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/0ed7a9a3fb62fd6ee8f340e0.png"},{"id":96239330,"identity":"96cda198-6981-4def-83ec-a2a59ecc0b06","added_by":"auto","created_at":"2025-11-19 07:06:08","extension":"png","order_by":32,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":4286,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/2f91c314d5a426b29c3c2341.png"},{"id":96239268,"identity":"876ca70d-24ba-491e-89ed-342416aae583","added_by":"auto","created_at":"2025-11-19 07:05:57","extension":"png","order_by":33,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":6213,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/bbb91a3557fd1057a238fed5.png"},{"id":95830533,"identity":"03c633a8-a3ad-41d1-b9de-88b9f6aa1240","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"xml","order_by":34,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":168345,"visible":true,"origin":"","legend":"","description":"","filename":"4d1823d691f0421cb2058bd0c7257faa1structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/a65b9dd4d140617e4bb33bf2.xml"},{"id":95830534,"identity":"c30d0824-3b08-4791-abf1-579dd821b455","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"html","order_by":35,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":180640,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/a0d2454ed90812417c99a1e1.html"},{"id":95830493,"identity":"15d66837-67de-4e4c-abad-613cad469648","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":35382,"visible":true,"origin":"","legend":"\u003cp\u003eThe impact of cold cache and warm cache on job execution time\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/7ecb56418131d9711b98659a.png"},{"id":95830492,"identity":"33b2c3ea-6d86-41f9-bcec-9bc3aaee8291","added_by":"auto","created_at":"2025-11-13 12:17:49","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":50674,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe proposed H-SVM-LRU intelligent cache replacement strategy framework\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/cee557c9eb720461212fbdc0.png"},{"id":96239566,"identity":"e0509f8d-a9e0-4fee-bf5a-94f11373e940","added_by":"auto","created_at":"2025-11-19 07:06:59","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":54718,"visible":true,"origin":"","legend":"\u003cp\u003eFig. 4: Cache hit ratio for different cache sizes\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/ee1f9f64784b5ec067b3cfaf.png"},{"id":95830494,"identity":"bc775e2d-abee-4f5d-8c64-f3d444edeb39","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":49007,"visible":true,"origin":"","legend":"\u003cp\u003eFig. 5: Job execution time in cold cache and warm cache scenarios\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/156fed02078dc7fb3b5ca47d.png"},{"id":96240459,"identity":"b7d76607-897b-47d7-af39-4a677eb94a47","added_by":"auto","created_at":"2025-11-19 07:08:56","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":103351,"visible":true,"origin":"","legend":"\u003cp\u003eFig. 5: Job execution time for different input data sizes\u003c/p\u003e","description":"","filename":"05.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/dbd7df430e3c3e56f4dfba60.png"},{"id":95830498,"identity":"e1df5774-1837-4ba4-a970-4b33b62537f5","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":41372,"visible":true,"origin":"","legend":"\u003cp\u003eFig. 6: Normalized run time of different workloads\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/ec67622ca8dd6fb6f8fa9354.png"},{"id":95830501,"identity":"00f8bfbf-e07c-4cfd-aec0-59c416497ea2","added_by":"auto","created_at":"2025-11-13 12:17:50","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":39681,"visible":true,"origin":"","legend":"\u003cp\u003eFig. 7: Normalized run time of applications in each workload\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/57802bfd33b40f68a313c5e5.png"},{"id":96452887,"identity":"7bf67f6f-b33c-4202-b2e4-3bcb9c62d42e","added_by":"auto","created_at":"2025-11-21 09:52:35","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1875434,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7801155/v1/0f0c26b5-89f7-4443-99b6-7860567599ff.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A Support Vector Machine-Based Cache Replacement Method for Efficient MapReduce Execution","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eHadoop [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e] is an open-source framework designed to facilitate the processing of large datasets within a distributed computing environment. It consists of two primary components: (1) the Hadoop Distributed File System (HDFS) [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], which provides storage for substantial volumes of data, and (2) MapReduce [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], a parallel programming model that processes large amounts of data across a cluster of machines in a distributed setting. While Hadoop offers several advantages over traditional Relational Database Management Systems (RDBMS), such as enhanced flexibility, high throughput, cost efficiency, and concurrent processing capabilities, it also encounters notable challenges. Specifically, issues like high data access latency and prolonged data transmission can negatively impact performance. The reliance on hard disk drive (HDD) systems for HDFS storage exacerbates this problem, as the high access times associated with HDDs can significantly affect overall execution time [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eIn recent years, many researchers have proposed the use of caching mechanisms to address these challenges [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. The caching mechanisms involve storing frequently accessed data (hot data) in temporary storage and closer to the requesting entity (such as a worker node, or user), which reduces both waiting time and the resources required to fetch the data from the HDFS. Frequently accessed or high-priority data that demands fast retrieval is known as hot data; by contrast, cold data is rarely accessed and does not need to be retrieved immediately. In this paper, we categorize data blocks into either hot data or cold data based on their likelihood of being reused in the near future.\u003c/p\u003e\u003cp\u003eThe concept of cache states is crucial to understanding cache performance. A cold cache refers to an initial state where the cache contains little or no data, leading to a higher incidence of cache misses as requested data must be fetched from HDFS, resulting in slower access times. Conversely, a warm cache has been populated with a substantial amount of hot data over time, increasing the likelihood of cache hits and reducing data access latency. The transition from a cold to a warm cache state can significantly enhance job execution, as demonstrated in our experimental analysis. We conducted experiments on various applications combining I/O-bound and CPU-bound jobs under three scenarios: (1) using the original Hadoop without a caching mechanism, (2) with a cold cache, and (3) with a warm cache. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the normalized execution time decreases notably in the warm cache state across all tested applications. In this case, the cache contains more hot data that is frequently demanded and has a positive effect on the cache hit ratio resulting in improving job execution time.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eCache warming refers to the process of preloading the cache with hot data to expedite the transition from a cold to a warm state. This process involves identifying hot data based on historical access patterns and making strategic decisions about which data to cache. More importantly, an effective cache replacement strategy is required when the cache reaches capacity, to ensure optimal data retention. In this paper, we propose an HDFS-based SVM-LRU (H-SVM-LRU) approach that integrates an intelligent cache replacement algorithm, SVM-LRU, to optimize cache utilization and minimize cache pollution. The machine learning component of our approach classifies cached data into hot and cold categories, predicting whether the data is likely to be reused in future computations. This classification allows the system to decide which data should be retained in the cache and which should be replaced, thus maximizing the efficiency of the limited cache space and improving the cache-hit ratio. As a result, the proposed method is well-suited for iterative programs, reducing data access time from disk and avoiding unnecessary recomputation by retrieving more intermediate data from the cache.\u003c/p\u003e\u003cp\u003eOur contributions in this paper are:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eWe provide an overview of intelligent cache replacement methods used in web proxy caches.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eDifferent cache replacement strategies are investigated in the Hadoop environment, and we discuss their advantages and disadvantages.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eWe introduce the intelligent caching mechanism, H-SVM-LRU, in the Hadoop environment, which uses a Support Vector Machine (SVM) for classifying data into two groups: hot data and cold data based on the likelihood of future reuse.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eWe evaluate H-SVM-LRU\u0026rsquo;s performance (hit ratio) and compare it with LRU.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eWe carry out experiments to investigate the impact of H-SVM-LRU on Hadoop\u0026rsquo;s execution time performance.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe rest of the paper is organized as follows: We discuss existing cache replacement strategies for Hadoop and intelligent caching methods for web proxy caches in Section \u003cspan refid=\"Sec2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. Next, Section \u003cspan refid=\"Sec5\" class=\"InternalRef\"\u003e3\u003c/span\u003e defines the problem and our solution approach. We then describe the H-SVM-LRU framework and present our H-SVM-LRU algorithm for the Hadoop environment in Section \u003cspan refid=\"Sec6\" class=\"InternalRef\"\u003e4\u003c/span\u003e. Next, we explain the details of the H-SVM-LRU implementation in Section 5. We evaluate the performance of H-SVM-LRU via different experiments in Section \u003cspan refid=\"Sec15\" class=\"InternalRef\"\u003e6\u003c/span\u003e. Finally, Section \u003cspan refid=\"Sec24\" class=\"InternalRef\"\u003e7\u003c/span\u003e contains conclusions and suggestions for future work.\u003c/p\u003e"},{"header":"2 Related work","content":"\u003cp\u003eIn this section, we first provide an overview of cache replacement algorithms that have been proposed for improving Hadoop performance. We then discuss different intelligent caching mechanisms that use machine learning methods.\u003c/p\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1 Cache replacement policies in Hadoop\u003c/h2\u003e\u003cp\u003eIn this section, we investigate different cache replacement strategies in Hadoop, including their advantages and disadvantages.\u003c/p\u003e\u003cp\u003e\u003cem\u003eLIFE\u003c/em\u003e and \u003cem\u003eLFU-F\u003c/em\u003e are two replacement strategies used in \u003cem\u003ePacMan\u003c/em\u003e [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] for an in-memory coordinated caching mechanism with data-intensive parallel jobs. In \u003cem\u003ePacMan\u003c/em\u003e, parallel jobs run numerous tasks concurrently in a wave, with the all-or-nothing property. The \u003cem\u003eLIFE\u003c/em\u003e algorithm evicts data blocks of files with the largest wave-width and results in reducing the average job completion time. \u003cem\u003eLFU-F\u003c/em\u003e aims to maximize cluster efficiency. For this purpose, it evicts data blocks with less frequent access. Both strategies prioritize incomplete files over completed files for eviction and use a window-based aging mechanism to avoid cache pollution. They first check whether there are data blocks that have not been accessed within a given time window. Among these files, the one with the least number of accesses is chosen for eviction.\u003c/p\u003e\u003cp\u003e\u003cem\u003eEnhanced Data-Aware Cache (EDACHE)\u003c/em\u003e [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] was introduced for caching intermediate results to accelerate MapReduce job execution times. In this strategy, \u003cem\u003eWSClock\u003c/em\u003e is used as a cache replacement algorithm in which cached items are maintained in a circular list, and a clock hand advances around this ring. This algorithm replaces cached items based on their reference bit and the last time used. It first checks the reference bit. If its value is one, it means this item is used. The item's reference bit is then reset, its last time used is updated, and the clock hand is advanced. Otherwise, an item with an age greater than a threshold value is evicted. The bottleneck of this mechanism is related to the fact that large blocks lead to long search times for requested content.\u003c/p\u003e\u003cp\u003eIn \u003cem\u003ecollaborative caching\u003c/em\u003e, a \u003cem\u003eModified ARC replacement algorithm\u003c/em\u003e [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] was proposed in order to increase the cache hit ratio and improve efficiency. In this strategy, the cache is divided into recent cache, recent history, frequent cache, and frequent history sections such that the cache sections contain data blocks and the history sections include references to evicted items. Initially, on a request for a block, the references in the history caches are checked. If present, the corresponding block is placed in the recent or frequent cache; otherwise, the cache references and serves the request from either of the history caches, which helps in faster caching as well as locating files for initial checks. If references are found in recent history, then the block is placed in the recent cache. If the block is found in the recent cache, then it is moved to the frequent cache; hence, a hit in either of the history caches removes the reference and places the corresponding block in one of the caches (recent or frequent). Caching a block also involves caching metadata. When either of the caches is fully utilized, then a block is evicted from the recent or frequent cache, but its reference is placed into its corresponding history. When either of the history caches is fully utilized, the references simply drop out of the cache.\u003c/p\u003e\u003cp\u003eAn adaptive cache algorithm [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] was designed to cache the partition of tables into the HDFS cache for Big SQL. Selective \u003cem\u003eLRU-K (SLRU-K)\u003c/em\u003e and \u003cem\u003eExponential-Decay (EXD)\u003c/em\u003e are used as online caching algorithms and for selective cache insertion to reduce the overhead of inserting items into the HDFS cache. \u003cem\u003eSLRU-K\u003c/em\u003e takes into account the variable size of the partition and uses a weighted heuristic to selectively place the partitions into the cache. It keeps a list of the K most recent access times for each partition. However, \u003cem\u003eEXD\u003c/em\u003e maintains only the time of the most recent access to compute the score for each partition to determine a weight between access frequency versus recently used. In \u003cem\u003eAdaptive SLRU-K and EXD\u003c/em\u003e, the adaptor adjusts its behavior with access patterns of various workloads by automatically adjusting the value of its parameters. Maximizing the byte-hit ratio and minimizing the byte-insertion ratio are the primary aims of the adaptor.\u003c/p\u003e\u003cp\u003eThe \u003cem\u003eblock goodness aware cache replacement strategy\u003c/em\u003e [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] uses two metrics for cache management: cache affinity (CA), which depends on resources used by the application, and block goodness (BG), which measures how much a cached data block is worth. For each cached item, this strategy first calculates the BG value based on the data block access count and MapReduce application cache affinity then selects a data block with the lowest BG value for eviction. A data block with the oldest access time will be discarded if there is more than one data block with the same lowest BG value.\u003c/p\u003e\u003cp\u003eThe \u003cem\u003eCache Affinity Aware Cache Replacement Algorithm\u003c/em\u003e [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] categorizes MapReduce applications based on their cache affinity. This algorithm prioritizes caching input data of applications with high cache affinity. It takes into account the cache affinity of a MapReduce application and data access frequency to calculate the benefit of caching for input data. As a result, it evicts a data block with the lowest caching benefit. If there are some data blocks with identical lowest benefits, it evicts a block based on the LRU policy.\u003c/p\u003e\u003cp\u003e\u003cem\u003eAutoCache\u003c/em\u003e [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] employs a lightweight gradient-boosted tree (XGBoost) to predict file access patterns on the HDFS cache. In this mechanism, the probability of accessing a file is measured by a probability score, which is used as a metric by the cache replacement policy to avoid cache pollution. When the free space of the cache is less than 10%, the eviction operation is started, and it continues until the cache capacity becomes lower than 85%. This cache replacement algorithm has low overhead by limiting computation to a fixed number of files.\u003c/p\u003e\u003cp\u003eIn [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e], Newberry et al. evaluate four advanced cache replacement policies: Two Queue (2Q), Adaptive Replacement Cache (ARC), Multi-Queue (MQ), and Low Inter-reference Recency Set (LIRS) that are designed to handle larger reuse distances and reduce cache thrashing, particularly in big data environments. 2Q employs numerous queues to distinguish between frequently and infrequently accessed data, effectively filtering single-use blocks, while ARC enhances this by dynamically adjusting queue sizes to better capture temporal locality. MQ extends 2Q by using multiple queues to categorize data based on access frequency, offering greater precision but performing similarly to 2Q when access patterns are uniform. LIRS excels by prioritizing data based on reuse distance, efficiently managing temporal locality without polluting the cache. Their evaluation revealed that LIRS outperforms other strategies for smaller cache sizes, while ARC is superior for larger caches. Combining LIRS and ARC in a layered architecture further improved performance, highlighting their complementary strengths in big data caching.\u003c/p\u003e\u003cp\u003eIn Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, we compare these cache replacement strategies in terms of their criteria for eviction and mitigating cache pollution and summarize their advantages and disadvantages.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eHadoop cache replacement comparisons\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eReplacement strategy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eMetrics for eviction\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCache pollution\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAdvantages\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eDisadvantages\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLIFE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLargest wave width and incomplete file\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWindow age strategy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eReduces average completion time\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eEffective for short jobs\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLFU-F\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFrequency access and\u003c/p\u003e\u003cp\u003eincomplete file\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWindow age strategy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eMaximizes cluster efficiency\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eEffective for short jobs\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eWSClock\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLast time used\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNo\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eDecreases execution time\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eLong search times for large data blocks\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModified ARC\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRecency and frequency of access\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNo\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eIncreases cache hit ratio\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNeeds space for storing history\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAdaptive cache\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe score of each partition\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNo\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAdapts to various workload characteristics\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSignificant overhead\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBlock goodness aware\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBlock goodness value and access time\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNo\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eEffective for multiple concurrent applications\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNeeds to calculate block goodness\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCache Affinity aware\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCache affinity of application and recency\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNo\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eEffective use of cache space\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNeeds to know the cache affinity of applications\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoCache\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eProbability of accessing the file\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCalculating probability score\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eReduces average completion time and improves cluster efficiency\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eFile-oriented cache\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLIRS\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eReuse distance\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eReuse distance\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003ePerforms best for smaller cache sizes\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eOverhead of a large number of network requests\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2 Intelligent caching mechanisms\u003c/h2\u003e\u003cp\u003eIn a related application domain, several intelligent caching strategies have been presented that use different machine learning techniques to enhance the performance of web proxy caches. Ali et al. proposed \u003cem\u003eSVM\u0026ndash;LRU, SVM\u0026ndash;GDSF\u003c/em\u003e, and \u003cem\u003eC4.5\u0026ndash;GDS\u003c/em\u003e [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] that combine a support vector machine (SVM) and a decision tree (C4.5) with Least-Recently-Used (LRU), Greedy-Dual-Size (GDS), and Greedy-Dual-Size Frequency (GDSF) replacement strategies. In these techniques, web objects are classified into two groups: revisited later or not. These methods use a web proxy log file as a training dataset and different features of web objects, such as recency, frequency, size, access latency, and type of object, are considered for classification. Experimental results show that SVM-LRU appears to have the best hit ratio.\u003c/p\u003e\u003cp\u003eEmploying a Bayesian network, \u003cem\u003eBN-GDS\u003c/em\u003e, and \u003cem\u003eBN-LRU\u003c/em\u003e [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] were introduced to improve the performance of cache replacement strategies like Greedy-Dual-Size (GDS) and Least-Recently-Used (LRU). In these strategies, the probability of web objects belonging to the revisited class (and so should be cached) is calculated based on features such as retrieval time, frequency, size, and type. Experimental results suggest that BN-GDS achieves the best hit ratio, while BN-LRU has the best byte-hit ratio.\u003c/p\u003e\u003cp\u003eHybrid ELM-LFU [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] is a two-level caching scheme for web proxies. In the first level, LFU is used for fast caching replacement (due to its low complexity) and thus is suitable for real-time communication. An Extreme Learning Machine (ELM) is used in the second level, applying a single hidden-layer feed-forward network, where there is no need to adjust the weights. In this mechanism, the chosen web object for eviction in the first level will be placed in the second-level cache. This method features low training times.\u003c/p\u003e\u003cp\u003eIn [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e], Herotodos et al. designed a framework for moving data automatically through tiered storage in the distributed file system via a set of pluggable policies. For this purpose, they employ incremental learning to find which data should be downgraded or upgraded, allowing for adaptation to workload changes over time. For downgrading, this method uses different caching replacement strategies like LRU, LFU, Least Recently \u0026amp; Frequently Used (LRFU), LIFE, LFU-F, Exponential Decay (EXD), and XGBoost-based Modeling (XGB). Also, On Single Access (OSA), LRFU, EXD, and XGB are used for the upgrade policy. Experimental results show that XGB appears to be preferable because it requires minimal storage, has low training overhead, makes useful predictions, and can learn incrementally over time.\u003c/p\u003e\u003cp\u003ePACS-oriented SVM-LRU [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] was proposed for picture archiving and communication systems. This algorithm calculates the probability of future access to cached items. In this strategy, SVM-LRU has brought some benefits like low training time, low computation, high prediction accuracy, and high hit ratio.\u003c/p\u003e\u003cp\u003eIn [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e], Al-Qtiemat et al. introduced an intelligent web proxy caching algorithm that leverages K-means clustering to improve cache performance by categorizing objects into prioritized groups for efficient replacement. The algorithm operates in two phases: an offline phase, where proxy server logs are analyzed to group objects hierarchically based on individual features, and an online phase, where these priorities guide the replacement of low-priority objects to optimize cache content. The proposed multi-level clustering (MC) approach simplifies priority assignment by considering one feature at a time, enhancing decision-making efficiency. By factoring in object size and prioritization, the MC algorithm achieves significant improvements in hit rate and byte hit rate. This demonstrates its superior performance for diverse cache sizes and its effectiveness in managing large, low-priority objects.\u003c/p\u003e\u003cp\u003eIn [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e], a smart caching system for Content Delivery Networks (CDNs) was proposed that uses a Long-Short-Term Memory (LSTM) model to predict the future popularity of content and optimize cache replacement decisions. The methodology consists of four phases: admission control, popularity prediction, filtering probabilities, and applying a caching policy to the top-k list. By caching the most popular content, the system aims to enhance the cache hit ratio, reduce latency, and improve user experience. Experimental results on a synthetic dataset showed that the LSTM-based model outperformed conventional algorithms like FIFO, LRU, and LFU, as well as other machine learning-based approaches, including DeepCache, in terms of total hits across different cache sizes.\u003c/p\u003e\u003cp\u003eEven though using a caching mechanism has yielded some benefits in the Hadoop environment, some challenges remain. For instance, cache management imposes a heavy load on the NameNode, both in terms of required storage and computational load, potentially degrading performance. Moreover, existing cache replacement policies in Hadoop do not take into account cache pollution and effective use of cache space, and they do not apply intelligent caching mechanisms. In this paper, we design a cache replacement mechanism as an approach for overcoming these problems, using SVM to classify data, resulting in improved performance. We choose SVM because its generalization ability can be maximized when training data are scarce, and it can control the misclassification error.\u003c/p\u003e\u003c/div\u003e"},{"header":"3 Problem definition","content":"\u003cp\u003eReducing job execution time is an effective means for improving Hadoop performance. The execution time is composed of two components: I/O operation time and processing time. Since input data are stored in an HDD-based system, HDFS, access time for I/O operations can be high, adversely affecting execution time. One approach to tackle this problem is to use a caching mechanism to store hot data in the cache.\u003c/p\u003e\u003cp\u003eA major challenge in employing caching mechanisms is accurately identifying hot data based on historical access patterns. Conventional caching algorithms often struggle to adapt to dynamic changes in data access patterns, resulting in inefficient caching decisions. To optimize caching, an effective approach must differentiate between hot and cold data, ensuring that the cache primarily stores data that is frequently requested, thus accelerating data access. As the cache is populated with hot data, it reaches a warm state, characterized by an increased likelihood of cache hits. However, sustaining this warm state necessitates robust cache replacement strategies that prevent the eviction of hot data in favor of less frequently accessed cold data. The goal is to maintain a high cache hit ratio by minimizing cache pollution, which occurs when cold data occupies cache space, reducing the availability of space for hot data. Cache pollution can be quantified using the ratio of cold data blocks in the cache to the total cache size, reflecting the extent of cache space occupied by infrequently accessed data. Thus, an intelligent cache replacement policy is essential to mitigate cache pollution, prioritize the retention of hot data, and facilitate the transition of the cache from a cold to a warm state. This approach ensures efficient data retrieval, ultimately improving the performance of Hadoop.\u003c/p\u003e"},{"header":"4 H-SVM-LRU cache replacement","content":"\u003cp\u003eA Support Vector Machine (SVM) [\u003cspan class=\"CitationRef\"\u003e24\u003c/span\u003e] is a supervised machine learning technique that is used for binary classification, dividing data into two classes: positive and negative. In this section, we provide H-SVM-LRU implementation details and explain our replacement algorithm that combines LRU with an SVM to classify data into two classes: hot data or cold data based on the likelihood of future reuse. This algorithm aims to avoid cache pollution and to effectively use cache space by cache warming, leading to decreased execution times. In this section, we present a framework for intelligent cache replacement based on machine learning and explain the H-SVM-LRU algorithm with an example to illustrate its operation.\u003c/p\u003e\n\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\n \u003ch2\u003e4.1 The proposed H-SVM-LRU framework\u003c/h2\u003e\n \u003cp\u003eIn this section, we propose a framework for an intelligent LRU approach using an SVM and in-memory cache [\u003cspan class=\"CitationRef\"\u003e25\u003c/span\u003e], [\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e], [\u003cspan class=\"CitationRef\"\u003e27\u003c/span\u003e] for Hadoop which consists of two functional components: the classification component is responsible for training the data classification by using an SVM classifier and the Hadoop in-memory cache component uses this trained classifier to manage the cache space. Figure \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e gives the system structure and its components which we now explain in detail.\u003c/p\u003e\n \u003cp\u003eThe classification component consists of Hadoop job history, training dataset preparation, SVM model training, and SVM classifier deployment. The job history server allows the user to retrieve log information on completed applications. This information can be exploited as a source to extract training data. Next, preprocessing is employed to normalize data and eliminate outliers. After preparing training data, an SVM classifier is trained. Finally, the classifier is deployed, and SVM-LRU uses this data classification in its cache replacement decision.\u003c/p\u003e\n \u003cp\u003eThe Hadoop in-memory cache component is composed of NameNode, DataNodes, Application Master, and containers. The NameNode is responsible for coordinating all the DataNode caches in the cluster and stores two types of metadata: block metadata includes the location of data blocks on DataNodes, and cache metadata maps the locations of cached data. The NameNode periodically receives a cache report from each DataNode describing all the blocks cached on a given DataNode. The cache report is used to update cache metadata. For better utilization of the large distributed in-memory caches in Hadoop clusters, each Hadoop container always sends a request to cache the accessed block to the NameNode; the NameNode then controls which data blocks are added and evicted to and from in-memory caches. We use centralized cache management; as a result, H-SVM-LRU is located on the NameNode. In our system, a container is launched to run a task (either a Map task or a Reduce task) and always sends a request to find cached data blocks. Application Master manages the user job lifecycle and resource needs of individual applications. Each application has an associated unique, framework-specific Application Master. It coordinates an application\u0026rsquo;s execution in the cluster and also manages faults.\u003c/p\u003e\n \u003cp\u003eAssume that a MapReduce application requires two data blocks A and B, where data block A is located on DataNode X and data block B is cached on DataNode Y. The Application Master communicates with the NameNode (which contains the cache metadata) to query the locations of the input blocks and their availability in the cache. A cache miss occurs when looking for data block A and a cache hit occurs for data block B. In the cache miss state, the NameNode looks for block metadata to find the DataNode that contains data block A. Although there are multiple replicas of a given data block that can be accessed by the query, we choose the first one to reduce search time. We could cache all replicas of this data block in the DataNodes that contain them. In this case, cache replication is identical to data replication and can increase the cache hit rate ratio. In this case, excessive cache space is occupied, conflicting with the goal of our proposed method.\u003c/p\u003e\n \u003cp\u003eWe then use the H-SVM-LRU algorithm and the PutCache(A, X) method to place this data block in the cache. After caching, DataNode X piggybacks the cache report with a heartbeat message and sends it to the NameNode to update the cache metadata. The NameNode informs the Application Master of the location of cached data by using GetCache(A, X) and GetCache(B, Y). When a cache hit occurs, the GetCache method of the H-SVM-LRU algorithm is called to retrieve the cached data. The result is that not only do applications wait for the data block to be cached, but also the probability of accessing cached data has increased via effectively using cache space. It is important to note that applications do not necessarily access all their demand data from the cache; it is not necessary to wait for data to be cached.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003e4.2 The proposed H-SVM-LRU algorithm\u003c/h2\u003e\n \u003cp\u003eThe proposed H-SVM-LRU algorithm is designed to mitigate cache pollution by intelligently distinguishing between frequently accessed (hot) and infrequently accessed (cold) data. The cache is logically divided into two lists: the Cold list, which stores less frequently used blocks (cold data), and the Hot list, which stores frequently accessed blocks (hot data). Both lists are maintained as doubly linked structures with head and tail pointers. A boundary pointer separates these two lists and always points to the head of the Cold list, ensuring that the least valuable block (the oldest cold item) is the primary candidate for eviction.\u003c/p\u003e\n \u003cp\u003eThe algorithm is composed of three main procedures: GetCache, CachePlacement, and PutCache which collectively regulate data retrieval, placement, and insertion into the cache based on a block\u0026rsquo;s predicted class. For each requested data block, the system checks whether the block resides in the cache. If the block exists (cache hit), the GetCache procedure retrieves it and repositions it within the appropriate list according to its updated classification. If the block is absent (cache miss), its metadata is used to locate it on the corresponding DataNode, after which the PutCache procedure inserts it into the cache, potentially evicting a candidate block if the cache is full.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Taba\" border=\"1\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAlgorithm \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e: H-SVM-LRU on Hadoop\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInput: Sequence of requested data blocks R = {DB\u003csub\u003e1\u003c/sub\u003e, DB\u003csub\u003e2\u003c/sub\u003e, \u0026hellip;, DB\u003csub\u003en\u003c/sub\u003e}\u003c/p\u003e\n \u003cp\u003eOutput: Updated cache contents with reduced pollution\u003c/p\u003e\n \u003cp\u003eData Structures:\u003c/p\u003e\n \u003cp\u003eColdListHead \u0026rarr; pointer to first (oldest) cold item\u003c/p\u003e\n \u003cp\u003eColdListTail \u0026rarr; pointer to last (newest) cold item\u003c/p\u003e\n \u003cp\u003eHotListHead \u0026rarr; pointer to first (oldest) hot item\u003c/p\u003e\n \u003cp\u003eHotListTail \u0026rarr; pointer to last (newest) hot item\u003c/p\u003e\n \u003cp\u003eBoundaryPtr \u0026rarr; pointer marking the boundary between the Cold list and the Hot list\u003c/p\u003e\n \u003cp\u003e1- for each data block DB\u003csub\u003ex\u003c/sub\u003e requested by task T\u003csub\u003ei\u003c/sub\u003e do\u003c/p\u003e\n \u003cp\u003e2- DN\u003csub\u003ey\u003c/sub\u003e \u0026larr; lookup(DB\u003csub\u003ex\u003c/sub\u003e) in cache metadata\u003c/p\u003e\n \u003cp\u003e3- if DB\u003csub\u003ex\u003c/sub\u003e \u0026isin; Cache then\u003c/p\u003e\n \u003cp\u003e4- cache hit\u003c/p\u003e\n \u003cp\u003e5- call GetCache(DB\u003csub\u003ex\u003c/sub\u003e, DN\u003csub\u003ey\u003c/sub\u003e)\u003c/p\u003e\n \u003cp\u003e6- else\u003c/p\u003e\n \u003cp\u003e7- cache miss\u003c/p\u003e\n \u003cp\u003e8- DN\u003csub\u003ez\u003c/sub\u003e \u0026larr; lookup(DB\u003csub\u003ex\u003c/sub\u003e) in block metadata\u003c/p\u003e\n \u003cp\u003e9- call PutCache(DB\u003csub\u003ex\u003c/sub\u003e, DN\u003csub\u003ez\u003c/sub\u003e)\u003c/p\u003e\n \u003cp\u003e10- end if\u003c/p\u003e\n \u003cp\u003e11- end for\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eAlgorithm 1 outlines the general mechanism of H-SVM-LRU. The input is a sequence of requested data blocks R={DB\u003csub\u003e1\u003c/sub\u003e,DB\u003csub\u003e2\u003c/sub\u003e,\u0026hellip;,DB\u003csub\u003en\u003c/sub\u003e}, and the output is an updated cache state with reduced pollution. The procedure first attempts to find each requested block in the cache metadata. In the case of a cache hit, GetCache is invoked to detach and reclassify the block before re-insertion into the appropriate list based on its predicted class. In the event of a cache miss, PutCache is called to fetch and insert the block, while evicting the oldest cold block whenever necessary. If the Cold list is empty, eviction is performed from the Hot list. By maintaining explicit pointers for list management and eviction boundaries, H-SVM-LRU efficiently removes cold data at an early stage, reduces cache pollution, and improves space utilization. If all cached items belong to the same class, the algorithm reduces to conventional LRU, where eviction decisions rely solely on recency of access.\u003c/p\u003e\n \u003cp\u003eThe GetCache procedure (Algorithm 2) is responsible for retrieving and repositioning a cached block. The procedure begins by removing the block from its current location to preserve structural integrity. Pointers of neighboring nodes are updated accordingly, and the head or tail references of the respective lists are adjusted when the block being removed occupies these positions. Once removed, the block is classified using a SVM classifier, which predicts whether it should be placed in the hot or cold list. The CachePlacement procedure is then invoked to reinsert the block based on its updated classification.\u003c/p\u003e\n \u003cp\u003eThe PutCache procedure (Algorithm 3) manages the insertion of new blocks into the cache. When the cache is at full capacity, it first attempts to evict the head of the Cold list. If the Cold list is empty, eviction is performed from the head of the Hot list. After sufficient space is created, the incoming block is classified using the SVM model, and the CachePlacement procedure is called to insert the block into the corresponding list. This approach ensures that eviction prioritizes less frequently used items, while hot data is retained for longer periods.\u003c/p\u003e\n \u003cp\u003eFinally, the CachePlacement procedure (Algorithm 4) inserts the block into the cache based on its predicted class. If the block is classified as hot, it is placed at the tail of the Hot list (or as the first element if the list is empty). If the block is classified as cold, it is inserted into the Cold list, with the boundary pointer updated to reflect the new head of the Cold list. This classification-driven strategy dynamically adapts cache content to evolving workload characteristics, thereby reducing pollution and improving cache efficiency.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tabb\" border=\"1\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAlgorithm 2: Procedure GetCache(DB\u003csub\u003ex\u003c/sub\u003e, DN\u003csub\u003ey\u003c/sub\u003e)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e// retrieve DB\u003csub\u003ex\u003c/sub\u003e from current list\u003c/p\u003e\n \u003cp\u003e1- if DB\u003csub\u003ex\u003c/sub\u003e.prev\u0026thinsp;\u0026ne;\u0026thinsp;NULL then\u003c/p\u003e\n \u003cp\u003e2- DB\u003csub\u003ex\u003c/sub\u003e.prev.next \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e.next\u003c/p\u003e\n \u003cp\u003e3- else // DB\u003csub\u003ex\u003c/sub\u003e was a head node\u003c/p\u003e\n \u003cp\u003e4- if DB\u003csub\u003ex\u003c/sub\u003e is in the Hot list then\u003c/p\u003e\n \u003cp\u003e5- HeadHot \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e.next\u003c/p\u003e\n \u003cp\u003e6- else\u003c/p\u003e\n \u003cp\u003e7- HeadCold \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e.next\u003c/p\u003e\n \u003cp\u003e8- BoundaryPtr \u0026larr; HeadCold\u003c/p\u003e\n \u003cp\u003e9- end if\u003c/p\u003e\n \u003cp\u003e10- end if\u003c/p\u003e\n \u003cp\u003e11- if DB\u003csub\u003ex\u003c/sub\u003e.next\u0026thinsp;\u0026ne;\u0026thinsp;NULL then\u003c/p\u003e\n \u003cp\u003e12- DBx.next.prev \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e.prev\u003c/p\u003e\n \u003cp\u003e13- else // DB\u003csub\u003ex\u003c/sub\u003e was a tail node\u003c/p\u003e\n \u003cp\u003e14- if DB\u003csub\u003ex\u003c/sub\u003e is in the Hot list then\u003c/p\u003e\n \u003cp\u003e15- TailHot \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e.prev\u003c/p\u003e\n \u003cp\u003e16- else\u003c/p\u003e\n \u003cp\u003e17- TailCold \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e.prev\u003c/p\u003e\n \u003cp\u003e18- end if\u003c/p\u003e\n \u003cp\u003e19- end if\u003c/p\u003e\n \u003cp\u003e20- DB\u003csub\u003ex\u003c/sub\u003e.prev \u0026larr; NULL\u003c/p\u003e\n \u003cp\u003e21- DB\u003csub\u003ex\u003c/sub\u003e.next \u0026larr; NULL\u003c/p\u003e\n \u003cp\u003e22- Class-DB\u003csub\u003ex\u003c/sub\u003e \u0026larr; Apply-SVM(features)\u003c/p\u003e\n \u003cp\u003e23- Call CachePlacement(DB\u003csub\u003ex\u003c/sub\u003e, Class-DB\u003csub\u003ex\u003c/sub\u003e)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eAlgorithm 3: Procedure PutCache(DB\u003c/strong\u003e\u003csub\u003e\u003cstrong\u003ex\u003c/strong\u003e\u003c/sub\u003e, \u003cstrong\u003eDN\u003c/strong\u003e\u003csub\u003e\u003cstrong\u003ez\u003c/strong\u003e\u003c/sub\u003e\u003cstrong\u003e)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1- if cache_size\u0026thinsp;=\u0026thinsp;=\u0026thinsp;N then // Cache size is full\u003c/p\u003e\n \u003cp\u003e2- if HeadCold\u0026thinsp;\u0026ne;\u0026thinsp;NULL then // Evict from cold list\u003c/p\u003e\n \u003cp\u003e3- evict(HeadCold)\u003c/p\u003e\n \u003cp\u003e4- HeadCold \u0026larr; HeadCold.next\u003c/p\u003e\n \u003cp\u003e5- BoundaryPtr \u0026larr; HeadCold\u003c/p\u003e\n \u003cp\u003e6- else\u003c/p\u003e\n \u003cp\u003e7- evict(HeadHot) // Cold list is empty and evict from hot list\u003c/p\u003e\n \u003cp\u003e8- HeadHot \u0026larr; HeadHot.next\u003c/p\u003e\n \u003cp\u003e9- end if\u003c/p\u003e\n \u003cp\u003e10- end if\u003c/p\u003e\n \u003cp\u003e11- Class-DB\u003csub\u003ex\u003c/sub\u003e \u0026larr; Apply-SVM(features)\u003c/p\u003e\n \u003cp\u003e12- Call CachePlacement(DB\u003csub\u003ex\u003c/sub\u003e, Class-DB\u003csub\u003ex\u003c/sub\u003e)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003cdiv align=\"left\" class=\"colspec\"\u003e\u003cbr\u003e\u003c/div\u003e\n \u003ctable id=\"Tabc\" border=\"1\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAlgorithm 4: Procedure CachePlacement(DB\u003csub\u003ex\u003c/sub\u003e, Class-DB\u003csub\u003ex\u003c/sub\u003e)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e===== Case 1: Hot data =====\u003c/p\u003e\n \u003cp\u003e1-if class(DB\u003csub\u003ex\u003c/sub\u003e)\u0026thinsp;=\u0026thinsp;1 then // Insert at Hot list\u003c/p\u003e\n \u003cp\u003e2- if HeadHot\u0026thinsp;=\u0026thinsp;=\u0026thinsp;NULL then //Hot list is empty\u003c/p\u003e\n \u003cp\u003e3- HeadHot \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e4- TailHot \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e5- DB\u003csub\u003ex\u003c/sub\u003e.prev \u0026larr; NULL\u003c/p\u003e\n \u003cp\u003e6- DB\u003csub\u003ex\u003c/sub\u003e.next \u0026larr; BoundaryPtr\u003c/p\u003e\n \u003cp\u003e7- else // Insert at the tail of hot list\u003c/p\u003e\n \u003cp\u003e8- DB\u003csub\u003ex\u003c/sub\u003e.prev \u0026larr; TailHot\u003c/p\u003e\n \u003cp\u003e9- DB\u003csub\u003ex\u003c/sub\u003e.next \u0026larr; BoundaryPtr\u003c/p\u003e\n \u003cp\u003e10- TailHot.next \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e11- TailHot \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e12- end if\u003c/p\u003e\n \u003cp\u003e// ===== Case 2: Cold data =====\u003c/p\u003e\n \u003cp\u003e13- else\u003c/p\u003e\n \u003cp\u003e14- if HeadCold\u0026thinsp;=\u0026thinsp;=\u0026thinsp;NULL then //Cold list is empty\u003c/p\u003e\n \u003cp\u003e15- HeadCold \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e16- TailCold \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e17- DBx.prev \u0026larr; TailHot\u003c/p\u003e\n \u003cp\u003e18- DB\u003csub\u003ex\u003c/sub\u003e.next \u0026larr; NULL\u003c/p\u003e\n \u003cp\u003e19- if TailHot\u0026thinsp;\u0026ne;\u0026thinsp;NULL then\u003c/p\u003e\n \u003cp\u003e20- TailHot.next \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e21- end if\u003c/p\u003e\n \u003cp\u003e22- else\u003c/p\u003e\n \u003cp\u003e23- DB\u003csub\u003ex\u003c/sub\u003e.prev \u0026larr; TailCold\u003c/p\u003e\n \u003cp\u003e24- DB\u003csub\u003ex\u003c/sub\u003e.next \u0026larr; NULL\u003c/p\u003e\n \u003cp\u003e25- TailCold.next \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e26- TailCold \u0026larr; DB\u003csub\u003ex\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e27- end if\u003c/p\u003e\n \u003cp\u003e28- BoundaryPtr \u0026larr; HeadCold\u003c/p\u003e\n \u003cp\u003e29- end if\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec9\" class=\"Section3\"\u003e\n \u003ch2\u003e4.2.1 Illustrative Example of Algorithm Behavior\u003c/h2\u003e\n \u003cp\u003eTo further demonstrate the advantages of the proposed SVM-assisted cache management strategy, we present a comparative trace analysis against the baseline LRU policy. In this example, the cache capacity is fixed at N\u0026thinsp;=\u0026thinsp;3, and the request sequence is defined as:{A,B,C,A,D,B,C,C,A,E, C}. Initially, the classifier labels blocks A and B as hot, while blocks C, D, and E are identified as cold. To emulate the dynamic nature of real workloads, we introduce a reclassification event at step 8, where the SVM model updates its prediction for block C, changing its label from cold to hot. This reflects scenarios where previously infrequent blocks become popular due to shifting access patterns.\u003c/p\u003e\n \u003cp\u003eUnder the LRU policy, eviction decisions rely solely on recency information. As shown in Table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, LRU achieves only three cache hits over the eleven requests (steps 4, 8, and 11), yielding a hit ratio of approximately 27%. The policy prematurely evicts hot items when cold blocks occupy the cache, failing to adapt when block C becomes frequently accessed later in the trace.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eThe sample of LRU\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eStep\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eRequest\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCache state\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAction\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eHit/Miss\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInsert A\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[B, A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInsert B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C, B, A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInsert C\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, C, B]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove A\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eD\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[D, A, C]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict B, insert D\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[B, D, A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict C, insert B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C, B, D]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict A, insert C\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C, B, D]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove C\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, C, B]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict D, insert A\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eE\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[E, A, C]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict B, insert E\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C, A, B]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove C\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eIn contrast, the proposed H-SVM-LRU policy incorporates classification prior to reinsertion. As can be seen in Table \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e, during the early stages (steps 1\u0026ndash;7), block C is placed in the Cold list and thus becomes the primary target for eviction, while hot blocks A and B remain protected in the Hot list. At step 8, when the classifier re-labels C as hot, the algorithm immediately migrates the block into the Hot list. From that point onward, cold blocks (like D) are preferentially evicted, preserving the newly promoted hot block C in the cache. At step 10, the cold block E arrives while the cache is full, the algorithm evicts the hot data to insert E into the Cold list. In this case, when the cache is full and a newly classified cold block arrives, H-SVM-LRU first attempts eviction from the Cold list. If the Cold list is empty, the algorithm evicts from the Hot list to admit the new block. While this appears to contradict the principle of pollution reduction, it ensures adaptability by maintaining a dynamic balance between hot and cold regions of the cache. This design prevents cache monopolization by hot data and allows cold blocks the opportunity to be reclassified into the Hot list if their access frequency increases. Thus, the mechanism trades a small risk of pollution for long-term adaptability and robustness. This adaptive behavior results in five cache hits (steps 4, 6, 8, 9, and 11), a hit ratio of approximately 45%.\u003c/p\u003e\n \u003cp\u003eThis example highlights two critical properties of the proposed scheme. First, the classification-driven separation of hot and cold data ensures that eviction targets are selected based on predicted utility rather than recency alone, thereby reducing cache pollution. Second, the design inherently supports dynamic reclassification, enabling the system to adapt to evolving workloads without requiring additional mechanisms. These capabilities allow H-SVM-LRU to outperform classical LRU in scenarios where data access distributions are non-stationary.\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec10\" class=\"Section3\"\u003e\n \u003ch2\u003e4.2.2 Algorithm complexity\u003c/h2\u003e\n \u003cp\u003eSince the Hot and Cold lists are maintained as doubly linked lists, insertion, deletion, and pointer updates require O(1) operations. Cache lookup is performed via a metadata index, which also requires O(1) time. The space complexity remains O(n), where n is the cache capacity, with a negligible constant overhead for maintaining head, tail, and boundary pointers.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab3\" border=\"1\" class=\"fr-table-selection-hover\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eThe sample of H-SVM-LRU\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eStep\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eRequest\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eClass\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eHot\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCold\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAction\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eHit/Miss\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInsert into Hot list\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, B]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInsert into Hot list\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ecold\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, B]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInsert into Cold list\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[B, A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove A to Hot tail\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eD\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ecold\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[B, A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[D]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict C, insert D into Cold list\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, B]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[D]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove B to the tail\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ecold\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, B]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict D, insert C into Cold list\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, B, C]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove C to Hot tail\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[B, C, A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove A to Hot tail\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eE\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ecold\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[C, A]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[E]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEvict B, insert E into Cold list\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMiss\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ehot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[A, C]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e[E]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMove C to Hot tail\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e"},{"header":"5. H-SVM-LRU implementation","content":"\u003cp\u003eIn this section, we provide details for the three phases of H-SVM-LRU implementation: training data preparation, model training, and Integration of H-SVM-LRU with Hadoop\u0026rsquo;s Caching Mechanism.\u003c/p\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e5.1 Training data preparation phase\u003c/h2\u003e\u003cp\u003eThis phase consists of four steps: data collection, feature selection, providing target labels, and data preprocessing. We now explain each step in more detail:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eData collection\u003c/em\u003e: We consider two independent scenarios: request awareness and non-request awareness.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eIn the first scenario, the sequence of requested data is determined by the tasks, and we consider the following data features: size, recency, frequency, and type (input data of Map tasks, intermediate data, and output of Reduce tasks). Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e describes these features.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eFeatures for the request-awareness scenario\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFeature name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDescription\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eType\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe input of the Map task, intermediate data, and the output of the Reduce task\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSize\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSize of data blocks in MB\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRecency\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTime last used\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrequency\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of uses\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eIn the second scenario, we use the ALOJA [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] Hadoop dataset that gathers training data by executing various workloads from the Intel Hi-Bench benchmark suite [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e], [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], a comprehensive benchmark suite for Hadoop consisting of a set of Hadoop programs, including both synthetic micro-benchmarks and real-world Hadoop applications. We then extract some useful features from the Hadoop job history [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e], consisting of log information of MapReduce jobs.\u003c/p\u003e\u003cp\u003eIn the request awareness scenario, the demand data are predefined. In other words, training data have a label; therefore, target labels do not need to be generated. This allows us to consider fewer features than in the second scenario, where the requirement to generate target labels may require the consideration of more features.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eFeature selection\u003c/em\u003e: It is important to choose suitable features to ensure model performance while reducing overfitting and computational demand. There are different metrics to select features.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eSelecting features based on missing values\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eSelecting features based on variance\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eSelecting features based on correlation with other features\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eSelecting features based on model performance\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eFeatures for the non-request-awareness scenario to provide the target label\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFeature name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFeature type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDescription\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eJobName\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eName of job: WordCount, Sort, Grep, Sort, etc\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapsTotal\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe total number of Map tasks\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapsCompleted\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe number of completed Map Tasks\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eReducesTotal\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe total number of Reduce tasks\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eReducesCompleted\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe number of completed Reduce tasks\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eJob-Status\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eValid values of job state are: New, Initiated, Running, Succeeded, Failed, Killed, and Error.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCache Affinity\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCache affinity of application: Low, High, Medium\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStart time\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe time the job started (in ms)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFinish time\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe time the job finished (in ms)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTask type\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTask\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMap or Reduce task\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTask status\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTask\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe states of the task are: New, Scheduled, Running, Succeeded, Failed, and Killed\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAvgMapTime\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTask\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe average time of a Map task (in ms)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAvgReduceTime\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTask\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe average time of a Reduce task (in ms)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eProgress\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTask\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe progress of the task as a percentage\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eIn this step, we select the features given in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. For simplicity, we ignore the size of the data and recently used data features because input data are split into data blocks of the same size, and the recently used data feature is taken into account by the LRU policy.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eProviding target labels\u003c/em\u003e: Since the training dataset does not have target labels, we should provide them. For this purpose, we use a scenario that is based on job status and the status of its Map and Reduce tasks to provide a label for the requested task data. Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e describes different cases for this scenario.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eDataset preprocessing\u003c/em\u003e: The last step of preparation of the training dataset is data preprocessing, which includes the elimination of irrelevant data, unnecessary fields, and data normalization.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eGuidelines to provide target labels\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eJob\u003c/p\u003e\u003cp\u003estatus\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eMap task status\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eReduce task status\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eInput Map task label\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eInput Reduce task label\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eRationale\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNew\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNew\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNew\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe job is waiting to be processed.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eInitiated\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eScheduling\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWaiting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eHot\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe outputs of the Map tasks have not been generated yet.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWaiting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eHot\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe outputs of the Map tasks have not been generated yet.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSucceeded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eScheduling\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eHot\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eIf the input of Reduce is the output of the completed Map task.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSucceeded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eHot\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe map task has been completed.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFailed\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWaiting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe Map task failed and cannot generate intermediate data.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSucceeded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eFailed\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe map task has been completed, and the Reduce task has failed and cannot continue.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eKilled\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWaiting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eHot\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe killed task may execute on another node (speculative task)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRunning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSucceeded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eKilled\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eHot\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe failed task may execute on another node (speculative task)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSucceeded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSucceeded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSucceeded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eThe job has been completed, and we do not consider the relationship between jobs and\u0026nbsp;repetitive and recurring jobs.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFailed\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDon\u0026rsquo;t care\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDon\u0026rsquo;t care\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCold\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eJob status has a higher priority than task status\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e5.2 Model training phase\u003c/h2\u003e\u003cp\u003eIn this phase, we use the Scikit-Learn library in Python to implement an SVM for classifying data. The training process includes two steps: choosing the best kernel function and evaluating the trained model.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eChoosing the best kernel function\u003c/em\u003e: SVM has several available kernel functions that can be used for training, including polynomial, sigmoid, linear, and RBF. We evaluate the performance of kernel functions by using the confusion matrix to choose an appropriate kernel function for the training dataset. A confusion matrix is a table that is often used to describe the performance of a classification model. The metrics that are used in the confusion matrix method for investigating the correctness of classification are:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eRecall\u003c/em\u003e: The ability of a classification model to identify all relevant instances.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003ePrecision\u003c/em\u003e: The ability of a classification model to return only relevant instances.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eF1 score\u003c/em\u003e: This metric combines recall and precision using the harmonic mean.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe formulas for calculating these metrics are as follows:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:\\text{R}\\text{e}\\text{c}\\text{a}\\text{l}\\text{l}=\\frac{TP}{TP+FN}\\:\\text{P}\\text{r}\\text{e}\\text{c}\\text{i}\\text{s}\\text{i}\\text{o}\\text{n}=\\frac{TP}{TP+FP}\\:\\text{F}1\\:\\text{S}\\text{c}\\text{o}\\text{r}\\text{e}=2*\\frac{Precision*Recall}{Precision+Recall}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eWe chose the RBF function as the kernel function for our dataset because it demonstrated the best performance. The experimental results are reported in Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eEvaluation of different kernel functions\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"8\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eKernel function\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c5\" namest=\"c4\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e\u003cp\u003eF1-score\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eLinear\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.67\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e0.71\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.33\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.5\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eRBF\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.81\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e0.85\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eSigmoid\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.57\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.73\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e0.57\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eo\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eEvaluating the trained model\u003c/em\u003e: The dataset is divided randomly into training data (75%) and testing data (25%). In this phase, we use testing data and cross-validation methods to evaluate the training model and its prediction accuracy. The resulting prediction accuracy is 83%.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eImpact of misclassification in H-SVM-LRU\u003c/em\u003e: The effectiveness of the proposed H-SVM-LRU approach is highly dependent on the accuracy of the underlying SVM classifier, which determines whether a cached block is hot (frequently accessed) or cold (a candidate for eviction). When the SVM employs an RBF kernel, the parameters C (regularization) and γ (kernel width) play a critical role in shaping the decision boundary. A low value of C allows the model to tolerate more misclassifications, resulting in premature eviction of hot blocks and an increased cache miss ratio. Conversely, a very high C enforces strict error minimization but risks overfitting transient workload patterns, reducing adaptability in dynamic Hadoop environments. Similarly, a low γ produces a smoother boundary that may fail to capture short-term locality of reference, while an excessively high γ yields a highly complex boundary that memorizes recent access patterns without generalizing to future workloads. Thus, improper tuning of these parameters may increase misclassification rates and reduce the overall benefit of intelligent cache management.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eMisclassification in SVM-LRU manifests in two forms: false positives (classifying cold blocks as hot data) and false negatives (classifying hot blocks as cold data). False positives cause the cache to retain unnecessary blocks, thereby wasting limited memory resources and lowering cache efficiency. More critically, false negatives lead to premature eviction of hot blocks, forcing Hadoop to re-fetch them from disk or remote nodes, which increases I/O overhead and prolongs job execution times. This effect cascades into multiple components of the Hadoop ecosystem: the HDFS caching layer experiences higher block replacement frequency, MapReduce tasks suffer from increased shuffle and fetch delays, and YARN resource utilization is degraded due to extended waiting times. Therefore, reducing misclassification through balanced parameter tuning and representative training data is essential to maximizing the cache hit ratio and ensuring stable performance improvements in Hadoop clusters.\u003c/p\u003e\u003cp\u003eTo mitigate the impact of misclassification, hyperparameter tuning, including the regularization parameter C and the RBF kernel coefficient γ, was optimized using grid search combined with five-fold cross-validation to ensure an effective balance between model complexity and generalization. To address class imbalance, class weights were adjusted inversely proportional to class frequencies, and the Synthetic Minority Oversampling Technique (SMOTE) was applied to generate additional samples for underrepresented hot blocks. Experimental evaluation revealed that approximately 17% of data blocks were misclassified, with false negatives (hot blocks predicted as cold) accounting for most of the cost by causing a 5.2% increase in cache misses compared to the baseline LRU policy. In contrast, false positives (cold blocks predicted as hot) introduced only a minor overhead, corresponding to about a 1.3% increase in cache utilization. Importantly, the training process required only approximately 40 seconds and about 220 MB of memory for 500,000 access records, which is negligible relative to Hadoop job runtimes. Thus, the modest training overhead is easily amortized by the reduction in cache misses, making the H-SVM-LRU approach practical for real-world cluster environments.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e5.3 Integration of H-SVM-LRU with Hadoop\u0026rsquo;s Caching Mechanism\u003c/h2\u003e\u003cp\u003eThe proposed H-SVM-LRU framework was integrated into Hadoop by extending existing caching components to enhance block admission and eviction decisions, while maintaining compatibility with the default LRU policy. In Hadoop Distributed File System (HDFS), caching is coordinated by the CacheManager, which manages block placement and eviction across nodes. Although HDFS provides a centralized cache system that allows users to manually add files into memory for faster access, this mechanism is not intelligent; it must be manually adjusted, and thus cannot adapt to dynamic workload patterns. To address this limitation, H-SVM-LRU introduces a machine learning\u0026ndash;driven advisory layer on top of Hadoop\u0026rsquo;s default caching mechanism. Specifically, an SVM-based decision module was integrated at the boundary pointer where blocks transition between cold and hot lists. Instead of replacing the entire eviction policy, the SVM serves as a predictive filter: blocks with high predicted utility are promoted to the hot list, while those with low utility are marked as eviction candidates. If no reliable prediction is available, the system naturally falls back to the default LRU strategy, thereby ensuring robustness and backward compatibility. A feature extraction layer was developed to collect runtime metadata, including block access frequency, recency, size, and node locality. This layer operates within the Hadoop I/O pipeline (like DFSClient and data access logs) and is designed to gather features with minimal overhead. The extracted features are periodically fed into the SVM decision engine, which determines whether a block should be admitted into or evicted from the cache. To reduce inference latency, recent predictions are cached in memory.\u003c/p\u003e\u003cp\u003eTo support scalability and adaptability, a model update interface was incorporated, enabling the SVM model to be retrained either offline or online using YARN-based training jobs. Updated models are synchronized across caching nodes through Hadoop\u0026rsquo;s configuration service. Several engineering challenges were addressed during integration:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eFeature collection overhead\u003c/em\u003e was minimized by using sampling and batch updates.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eModel inference latency\u003c/em\u003e was mitigated through prediction caching.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eSystem compatibility\u003c/em\u003e was ensured by embedding the SVM module as a pluggable advisory component rather than a core replacement of LRU.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIn addition to cache decision logic, a daemon server architecture was designed to manage job submissions and automate cache operations. The entry process collects user-specified Hadoop jobs from multiple sources (e.g., shell scripts, APIs, web interfaces) and forwards them to the daemon server via TCP/IP. The daemon server continuously runs in the background, receiving job commands, parsing the required files, and applying the cache replacement policy before invoking the Hadoop cluster to execute the job. This layer ensures that file access patterns are tracked and optimized in real time, providing an unattended mechanism to admit working files into the cache and evict unused ones. The overall workflow of HDFS caching with H-SVM-LRU can be summarized as follows:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eJob submission\u003c/em\u003e: The entry process forwards user commands to the daemon server.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eFile parsing and caching\u003c/em\u003e: The daemon server identifies input files, applies the H-SVM-LRU policy, and issues cache commands.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eCache execution: Namenode translates the files into blocks and dispatches cache commands to Datanodes.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eBlock admission: Datanodes pin the blocks into the memory cache pool and return acknowledgment to the namenode.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eJob execution\u003c/em\u003e: Once caching is finalized, the Hadoop cluster executes the job with improved memory-based I/O performance.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eThrough this modular integration, the SVM-enhanced cache manager operates transparently within Hadoop\u0026rsquo;s existing framework, significantly reducing cache pollution and improving job execution time. By combining Hadoop\u0026rsquo;s centralized caching functionality with machine learning\u0026ndash;driven predictive intelligence, H-SVM-LRU provides a scalable, unattended, and robust solution that improves performance for data-intensive workloads, particularly those with repetitive file access patterns such as MapReduce-based WordCount, Sort, and Grep applications.\u003c/p\u003e\u003c/div\u003e"},{"header":"6. H-SVM-LRU evaluation","content":"\u003cp\u003eIn this section, we explain the experimental environment, including software and hardware configurations, and set some Hadoop configuration parameters. Our evaluation is divided into two sections: the H-SVM-LRU performance evaluation and the investigation of the impact of the proposed algorithm on Hadoop performance. We first evaluate the efficiency of the proposed algorithm by using the cache hit ratio as the performance metric. Finally, we perform experiments to present the impact of the H-SVM-LRU cache replacement policy on overall Hadoop performance.\u003c/p\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003e6.1 Experimental setup\u003c/h2\u003e\u003cp\u003eFor our experiments, we use a cluster consisting of a single NameNode and nine DataNodes located in the same rack.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eHardware configuration\u003c/em\u003e: Nodes are connected via a 10 Gigabit Ethernet switch. Each node is configured with an Intel Core i7 with an 8-core processor, 64 GB memory, and a 1 TB hard disk.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eSoftware configuration\u003c/em\u003e: We use Ubuntu 20.04 as the operating system and JDK 11, Hadoop version 3.4.1, and Intel HiBench version 7.1.1.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eHadoop configuration parameters\u003c/em\u003e: The block size of files in HDFS is chosen to be one of two values, 64 MB or 128 MB; the number of cache replicas is set to one, and data replication is set to 3. The memory sizes for the Map task, Reduce task, and node manager are 1GB, 2GB, and 8 GB, respectively. The maximum size of the cache is set to 1.5 GB, and we assume that each DataNode in the cluster has the same size cache. Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e presents the Hadoop configuration parameters with their values. The remaining Hadoop configuration parameters are set to the default values.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eMapReduce applications\u003c/em\u003e: As mentioned earlier, we use Intel HiBench as a Hadoop benchmark suite that contains the following applications: 1) WordCount is a CPU-intensive application that counts the number of occurrences of each word in a text file. 2) Sort is a typical I/O-bound application that sorts input data. 3) Grep is a mix of CPU-bound and I/O-bound operations that searches for a substring in a text file. These three applications are supported by Hadoop. 4) Join is a multiple-stage application where the results of the previous step are used as input for the next step. 5) Aggregation (supported by Hive) is used for the aggregation operation in a query.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eDataset\u003c/em\u003e: For carrying out experiments, we use the Gutenberg dataset [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] as input data for the WordCount application to evaluate its execution time based on different input data. As we mentioned earlier in the implementation section, the ALOJA dataset is used as a training dataset for the SVM model. The applications use input files generated by a random text generator.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab8\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eHadoop parameters\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eHadoop property name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eHadoop property value\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDfs.replication\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDfs.blocksize\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e64M or 128M\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapreduce.map.memory.mb\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e4096\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapreduce.reduce.memory.mb\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e8192\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapReduce.jobhistory.webapp.address\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eMaster:19888\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapReduce.reduce.speculative\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFalse\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapReduce.map.speculative\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFalse\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapred.map.tasks.speculative.execution\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFalse\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMapred.reduce.tasks.speculative.execution\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFalse\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e6.2 Metrics\u003c/h2\u003e\u003cp\u003eIn these experiments, we consider three key metrics to evaluate our proposed algorithm. The first, the cache hit ratio is used to evaluate the performance of the proposed H-SVM-LRU cache replacement policy. The other two metrics, job execution time and normalized run time, are used for determining the impact on Hadoop performance. In the following, we explain these three metrics:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eHit ratio and byte-hit ratio\u003c/em\u003e: These are two major factors to evaluate the performance of the cache replacement strategy. Hit ratio relates the number of cache hits to the total number of requests, and byte-hit ratio relates the number of bytes obtained from the cache to the total number of bytes requested. It is very difficult for a cache replacement strategy to simultaneously optimize these two metrics because improving the hit ratio usually favors small-sized items over large-sized items, leading to a reduced byte-hit ratio. In contrast, strategies that tend to increase the byte-hit ratio and prefer large-sized items typically decrease the hit ratio. In the experiments, we only consider the hit ratio because data blocks have the same size.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eJob execution time\u003c/em\u003e: This plays a vital role in Hadoop performance improvement, as it is related to data access time. The data access time decreases significantly if we can access data from the cache instead of the disk, reducing the job execution time. To calculate the average job execution time, we run each application five times.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eNormalized run time\u003c/em\u003e: For each application in a workload, its run time is normalized based on the Hadoop original (Hadoop no-cache). The average normalized time for applications is then calculated to evaluate overall Hadoop performance.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e6.3 H-SVM-LRU performance\u003c/h2\u003e\u003cp\u003eIn this section, we compare H-SVM-LRU with LRU by considering two metrics: cache hit ratio based on different cache sizes and job execution time in both cache cold and cache warm states.\u003c/p\u003e\u003cdiv id=\"Sec19\" class=\"Section3\"\u003e\u003ch2\u003e6.3.1 Cache hit ratio\u003c/h2\u003e\u003cp\u003eFor carrying out experiments to calculate the cache hit rates, we consider two data block sizes: 64MB and 128 MB. The input data size is 2GB with the same sequence of requested data for each mechanism, and the cache size is the same in all DataNodes (1.5 GB). We then calculate the cache size based on the maximum number of data blocks that can be cached, which was varied between 6\u0026ndash;12 for a 128 MB block size and 6\u0026ndash;24 for a 64 MB block size. Figure\u0026nbsp;3 presents cache hit ratios for block sizes of 64 MB and 128 MB.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eIn Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e4\u003c/span\u003e, we can observe that by increasing the cache size, the hit ratio is increased for all strategies as more requested data can be cached. Also, by increasing the data block size, the cache hit ratio increases, for instance, when the cache size is 6 and the data block size has increased from 64MB to 128MB, the cache hit ratio has approximately doubled because we can cache more data. Both diagrams demonstrate that the hit ratio of H-SVM-LRU is higher than for LRU, in particular when the cache size is small. H-SVM-LRU uses the SVM model to identify hot data more accurately by learning access patterns over time, prioritizing hot data over cold data in a cache placement policy, and avoiding unnecessary evictions of hot data. It is noteworthy that H-SVM-LRU and BN-LRU achieve comparable cache hit ratios when the cache size is sufficiently large, as both algorithms effectively address cache pollution by predicting the likelihood of data block reuse. Nevertheless, BN-LRU underperforms in scenarios with small cache sizes, exhibiting a lower cache hit ratio compared to H-SVM-LRU. This performance gap can be attributed to the Bayesian model\u0026rsquo;s reliance on historical data; when the available dataset is limited, the model lacks adequate information to generate accurate predictions, thereby reducing its effectiveness relative to H-SVM-LRU.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab9\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 9\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eImprovement ratio based on hit ratio\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCache size\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e\u003cp\u003eIR for Data block size\u003c/p\u003e\u003cp\u003e(64 MB)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c5\" namest=\"c4\"\u003e\u003cp\u003eIR for Data block size\u003c/p\u003e\u003cp\u003e(128 MB)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLRU\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eBN-LRU\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eLRU\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eBN-LRU\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e63.63%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e21.43%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e20.83%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e12.05%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e64.70%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e33.33%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e15.15%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e8.5%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e33.33%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20.75%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e10.25%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e2%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e12\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e33.33%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e9.09%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6.81%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e22.58%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e5.20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN/A\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eN/A\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e14.28%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2.5%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN/A\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eN/A\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e18\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e7.89%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN/A\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eN/A\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eIn order to investigate the performance improvement of H-SVM-LRU over LRU and BN-LRU, we calculate the improvement ratio of the performance (IR) based on the hit ratio for each cache size. Table\u0026nbsp;\u003cspan refid=\"Tab9\" class=\"InternalRef\"\u003e9\u003c/span\u003e presents the relative improvement of H-SVM-LRU over LRU and NB-LRU for different cache sizes for both 64 MB and 128 MB block sizes. We observe that H-SVM-LRU has the greatest improvement ratio for small cache size and small data blocks, suggesting that H-SVM-LRU is suitable for small cache size because it contains more hot data. Although both H-SVM-LRU and BN-LRU mitigate cache pollution by predicting frequently accessed data blocks, BN-LRU requires sufficient historical information to estimate the probability of data reuse, which is often limited for smaller block sizes. Consequently, BN-LRU achieves comparable performance to SVM-LRU only when the block size is large.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section3\"\u003e\u003ch2\u003e6.3.2 Job execution time in both cache states\u003c/h2\u003e\u003cp\u003eIn this experiment, we evaluate two cold cache and warm cache scenarios to compare the performance of LRU and H-SVM-LRU algorithms with respect to job execution time. To achieve this, we execute a variety of applications with low, medium, and high cache affinity, as well as different job types, including I/O-bound and CPU-bound jobs.\u003c/p\u003e\u003cp\u003eIn the cold cache scenario, LRU initially exhibits a low cache hit ratio, which gradually improves as frequently accessed data populates the cache over time. In contrast, H-SVM-LRU leverages predictive capabilities to identify potential hot data even during a cold start, resulting in a slightly higher cache hit ratio compared to LRU at early stages, though it remains low overall. Consequently, the job execution time for cold cache scenarios is lower with H-SVM-LRU than with LRU.\u003c/p\u003e\u003cp\u003eIn the warm cache scenario, LRU demonstrates susceptibility to evicting moderately accessed hot data, particularly during spikes in cold or hot data. This behavior can result in suboptimal cache hit ratios. On the other hand, H-SVM-LRU\u0026rsquo;s SVM classifier is able to predict the reuse potential of hot data, thereby minimizing unnecessary evictions. By adapting to workload changes using historical patterns, H-SVM-LRU enhances cache efficiency and reduces job execution time.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003e: Job execution time in cold cache and warm cache scenarios\u003c/p\u003e\u003cp\u003eAs illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003e, H-SVM-LRU consistently achieves shorter execution times compared to LRU by effectively predicting reusable data blocks. The degree of improvement, however, varies across applications. In the cold cache scenario, the most significant performance gains are observed in applications such as join, sort, and aggregation, which are particularly sensitive to disk I/O. In the warm cache scenario, both algorithms exhibit improved performance, but H-SVM-LRU consistently outperforms LRU due to its more effective eviction policies. While the performance gap is narrower for simpler applications, such as Grep and Wordcount, it becomes more pronounced in complex operations, such as join.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec21\" class=\"Section2\"\u003e\u003ch2\u003e6.4 Impact of H-SVM-LRU on Hadoop performance\u003c/h2\u003e\u003cp\u003eIn this section, we carry out two separate experiments to investigate the impact of H-SVM-LRU on Hadoop performance: 1) Job execution time based on different input data sizes. 2) Normalized run time of multiple applications in a workload. For this purpose, we compare Hadoop performance in the following scenarios to extract the impact of the proposed replacement policy on Hadoop performance:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eH-NoCache: The Hadoop original does not utilize HDFS in-memory caching; it is used as a baseline.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eH-LRU: Hadoop uses traditional LRU as a cache replacement policy.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eH-NB-LRU: Integrates a probabilistic inference mechanism, based on a Bayesian network, with the LRU strategy to predict the likelihood of reusing data block in the future.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eH-SVM-LRU: H-SVM-LRU is used as a cache replacement policy.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cdiv id=\"Sec22\" class=\"Section3\"\u003e\u003ch2\u003e6.4.1 Job execution time based on different input data sizes\u003c/h2\u003e\u003cp\u003eIn this experiment, we consider the job execution time of the WordCount MapReduce application based on different input data sizes for two different data block sizes (64 MB and 128MB). Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents job execution time based on input data size for our three scenarios.\u003c/p\u003e\u003cp\u003eAs illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003e, there is a significant reduction in job execution time when Hadoop is equipped with a caching mechanism compared to the original Hadoop. This improvement becomes more pronounced as the input data size increases, since the likelihood of accessing requested data directly from the cache grows accordingly. When comparing H-LRU and H-SVM-LRU, the results indicate that H-SVM-LRU consistently achieves lower execution times. This outcome is attributed to the higher cache hit ratio in H-SVM-LRU, which enables the cache to warm up with frequently accessed (\u0026ldquo;hot\u0026rdquo;) data that is more likely to be reused in subsequent operations. H-NB-LRU, while introducing moderate computational overhead for probabilistic prediction, can also reduce execution time significantly when prediction accuracy is high. However, for small cache sizes, H-NB-LRU sometimes underperforms relative to H-SVM-LRU due to the lack of sufficient historical data for accurate Bayesian estimation. In contrast, for larger input sizes and cache capacities, H-NB-LRU outperforms H-LRU by more effectively retaining frequently reused blocks and minimizing disk accesses.\u003c/p\u003e\u003cp\u003eIn the second experiment, with a data block size of 128 MB, the performance gap between the original Hadoop and Hadoop with caching widened substantially. Increasing the block size allows for more data to be cached, thereby improving the byte-hit ratio. Under this configuration, both H-SVM-LRU and H-NB-LRU achieved shorter execution times compared to H-LRU, as they are more effective at preventing cache pollution and increasing the likelihood of cache reuse.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec23\" class=\"Section3\"\u003e\u003ch2\u003e6.4.2 Normalized run time of multiple applications in a workload\u003c/h2\u003e\u003cp\u003eIn this experiment, we take into account various workloads, consisting of four concurrent MapReduce applications. We assume that all applications in one workload require an equal share of cluster resources. In addition, some applications use the same input data, and data is shared between them, for instance, Grep, WordCount, and Sort use the same input data that is generated by a random text generator, and the data is shared between aggregation and join.\u003c/p\u003e\u003cp\u003eThe cache affinity feature [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] determines how to utilize the benefit of cached data in each application such that it can be classified into three categories based on this feature: low cache affinity (Sort), medium cache affinity (WordCount, Join), and high cache affinity (Grep, Aggregation). Therefore, we provide various workloads composed of I/O-bound and CPU-bound applications by considering their cache affinity feature. Table\u0026nbsp;\u003cspan refid=\"Tab10\" class=\"InternalRef\"\u003e10\u003c/span\u003e presents the list of workloads with their applications that are used in this experiment.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab10\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 10\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eThe list of workloads with their applications\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eWorkload\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eApp1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eApp2\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eApp3\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eApp4\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eInput data size (GB)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eW1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAggregation\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eJoin\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eWordCount\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e257.3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eW2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAggregation\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eSort\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eWordCount\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e262.9\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eW3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAggregation\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWordCount\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e376.2\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eW4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAggregation\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSort\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e446.7\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eW5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eSort\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eWordCount\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e254.3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eW6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAggregation\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eGrep\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eJoin\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSort\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e377.1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eIn order to compare the performance for each workload, we calculate the normalized run time based on the Hadoop original (H-No-Cache). Figure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e illustrates the experimental results. If we compare H-LRU with Hadoop, we observe that Hadoop-LRU improves performance by 11.33% and the average improvement of H-SVM-LRU is 16.16%, 4.83%, and 2.45% against Hadoop-original, H-LRU, and H-NB-LRU, respectively. The number of cache hits in H-SVM-LRU is higher than in Hadoop-LRU as a result of using cache space efficiently, resulting in a lower average execution time. Also, by leveraging machine learning to predict reusable data, H-SVM-LRU achieves lower execution times during both cold and warm cache states. It prioritizes frequently accessed blocks and avoids unnecessary evictions, making it effective in sort operations that rely heavily on sequential I/O. H-NB-LRU predicts reused data blocks based on a probabilistic model that needs historical data. While it performs well in warm cache states, its reliance on historical data makes it less adaptable to rapidly changing workloads, as probability estimates may lag behind dynamic access patterns, resulting in higher execution times compared to H-SVM-LRU. All of the studied algorithms have the best improvement in workload W3 and W5 due to the fact that workload W3 is composed of high cache affinity applications, and workload W5 has the most shared data between applications.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e7\u003c/span\u003e provides the normalized run times of applications for each workload in the H-SVM-LRU scenario, in order to investigate the impact of H-SVM-LRU on the performance of each application in the workloads. To clarify, we use different colors to separate each application. We observe that some I/O-intensive applications like Grep and Sort show significant improvements in their performance because Sort can benefit from reusing cached data (the data also used by Grep and WordCount). I/O-bound jobs also spend most of their time on reading blocks, meaning they can have increased benefit from cached data. Therefore, the performance of I/O-intensive applications like Sort can be improved when they are combined with other applications with different resource usage patterns. Moreover, multiple-stage applications like Join have difficulty reusing the input files because the output of the previous stage is used as input for the next stage, and this usage is not well-suited for this caching mechanism.\u003c/p\u003e\u003cp\u003eWe conclude that H-SVM-LRU is appropriate for workloads with jobs that reuse a large amount of data.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"7. Limitations and Future Work","content":"\u003cp\u003eAlthough the proposed H-SVM-LRU approach demonstrates notable improvements in cache hit rates and overall Hadoop performance, several limitations must be acknowledged. First, despite achieving high prediction accuracy (83%), misclassifications can occur, particularly under workloads with highly dynamic or unpredictable access patterns. Such mispredictions may lead to premature eviction of hot blocks or retention of cold blocks, slightly degrading cache performance. Second, the effectiveness of the SVM model is heavily dependent on the representativeness of the training dataset; significant deviations in workload characteristics may reduce predictive accuracy. Third, the approach assumes that runtime features such as task type, job status, and progress are available and accurately captured, and incomplete or noisy data could compromise the model\u0026rsquo;s performance. Fourth, while the training phase of the SVM model is executed independently of runtime cache operations and thus does not directly affect Hadoop job execution, this work does not provide a detailed profiling of the computational resources required during training. Given that the training input consists only of cache metadata, the overhead is expected to be small relative to the overall workload. Nevertheless, a comprehensive analysis of CPU, memory, and energy consumption during training will be considered in future work to provide a more complete evaluation of system performance. Finally, integrating the SVM with Hadoop\u0026rsquo;s caching mechanism, while effective, adds engineering complexity for low-latency predictions in large-scale, production environments.\u003c/p\u003e\u003cp\u003eTo address these limitations, several directions for future work can be considered. Adaptive online learning methods could be employed to incrementally update the SVM model in response to evolving workload patterns without retraining from scratch, thereby reducing training overhead. Ensemble approaches, combining the SVM with lightweight predictors or heuristics, may reduce the impact of misclassifications and improve robustness. Automated feature selection techniques could dynamically identify or engineer features at runtime based on workload characteristics, enhancing predictive capability. Scalability may be further improved through distributed or parallelized training strategies, as well as GPU acceleration for large datasets. Finally, an extended evaluation for diverse workloads, including streaming data and highly volatile access patterns, would provide a more comprehensive understanding of H-SVM-LRU\u0026rsquo;s performance boundaries.\u003c/p\u003e"},{"header":"8. Conclusion","content":"\u003cp\u003eIn this paper, we demonstrated that H-SVM-LRU can efficiently remove cold items from the cache at an early stage to make space for new data blocks. By using this mechanism, cache pollution can be reduced, and the available cache space can be utilized more effectively. If all data blocks in the cache have the same class, the proposed algorithm is identical to LRU and only considers the recently used metric for data eviction. We propose H-SVM-LRU as an intelligent cache replacement strategy to warm up the cache and improve Hadoop performance. H-SVM-LRU combines SVM with LRU to use the limited cache capacity efficiently and avoids cache pollution by cold data. We classify cached data as either cold or warm by using an SVM classifier that predicts the likelihood of future reuse, and the evicted items are determined based on their class. Experimental results show that the cache hit ratio is improved for H-SVM-LRU by decreasing the frequency of eviction of hot data that is reused in the future, and we observe in our experiments that the average improvement of H-SVM-LRU is 16.16%, 4.83%, and 2.45% against Hadoop-original, H-LRU, and H-NB-LRU, respectively. Across all workloads, H-SVM-LRU delivers consistently lower execution times than NB-LRU, especially in I/O-intensive jobs and scenarios with limited cache sizes. NB-LRU, while lightweight, relies heavily on historical data and therefore shows performance improvements mainly in larger cache configurations. Overall, H-SVM-LRU provides more stable performance across diverse workloads, whereas NB-LRU is more sensitive to workload characteristics and cache availability. H-SVM-LRU is appropriate for small cache sizes to use its limited space efficiently, as well as being suitable for workloads composed of high cache affinity applications with varied resource usage and a large amount of shared data. The advantage of this policy includes increasing the number of cache hits which decreases data access time. This is turn reduces the job execution time, resulting in a positive impact on overall Hadoop performance. While the training time is a potential limitation of this approach, this is somewhat mitigated by the training time being independent of the execution time.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003estudy conception and design: R.Ghazali, A.Movaghar, S.Adabi, A.Rezaee, and D.Down; propose the subject and Supervisor: A.Movaghar, and D.Down;data collection: R.Ghazali; analysis and interpretation of results: R.Ghazali, D.Down;draft manuscript preparation: R.Ghazali, D.Down;All authors reviewed the results and approved the final version of the manuscript\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eHadoop, A.: http://Hadoop.Apache.org/, last accessed 2021/02/15\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMerceedi, K.J., Sabry, N.A.: A Comprehensive Survey for Hadoop Distributed File System. Asian J. Res. Comput. Sci. 46\u0026ndash;57 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKhezr, S.N., Navimipour, N.J.: MapReduce and Its Applications, Challenges, and Architecture: a Comprehensive Review and Directions for Future Research. J. Grid Comput. \u003cb\u003e15\u003c/b\u003e, 295\u0026ndash;321 (2017)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, S., Maddah-Ali, M.A., Avestimehr, A.S.: Coded MapReduce 1\u0026ndash;16 (2015)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhao, Y., Wu, J., Dache: A Data Aware Caching for Big-Data Applications Using MapReduce Framew. 35\u0026ndash;39 (2013)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBu, Y., Howe, B., Ernst, M.D.: HaLoop: Efficient Iterative Data Processing on Large Clusters. 3, (2010)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHuang, T.C., Chu, K.C., Zeng, X.Y., Chen, J.R., Shieh, C.K.: CURT MapReduce: Caching and utilizing results of tasks for MapReduce on cloud computing. Proceedings \u0026ndash;\u0026thinsp;2016 IEEE 2nd International Conference on Multimedia Big Data, BigMM 149\u0026ndash;154 (2016). (2016)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLuo, Y., Shi, J., Zhou, S., JeCache: Just-Enough Data Caching with Just-in-Time Prefetching for Big Data Applications. Proceedings - International Conference on Distributed Computing Systems 2405\u0026ndash;2410 (2017)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAnanthanarayanan, G., et al.: Pacman: Coordinated Memory Caching for Parallel Jobs\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSangavi, S., et al.: An Enhanced DACHE model for the MapReduce Environment. Procedia - Procedia Comput. Sci. \u003cb\u003e50\u003c/b\u003e, 579\u0026ndash;584 (2015)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShrivastava, M., Bischof, H.: Hadoop-Collaborative Caching in Real-Time HDFS\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFloratou, A., Megiddo, N., Potti, N., Ozcan, F., Kale, U.: and J. Schmitz-Hermes. Adaptive Caching in Big SQL using the HDFS Cache. In Proc. of the 7th ACM Symp. on Cloud Computing (SoCC). ACM, (2016)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCode, L.: Master\u0026rsquo;s Thesis Cache Affinity-aware In-memory Caching Management for Hadoop Jaewon Kwak\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKwak, J., Hwang, E., Yoo, T., Nam, B., Choi, Y.: In-memory Caching Orchestration for Hadoop. (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/CCGrid.2016.73\u003c/span\u003e\u003cspan address=\"10.1109/CCGrid.2016.73\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHerodotou, H., AutoCache: Employing machine learning to automate caching in distributed file systems. Proceedings \u0026ndash;\u0026thinsp;2019 IEEE 35th International Conference on Data Engineering Workshops, ICDEW 133\u0026ndash;139 (2019). (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFabbro, R., Zhong, C., Jiang, S.M.L.-L.I.R.S.: Leveraging Machine Learning to Improve the LIRS Replacement Algorithm. 2021 International Conference on High-Performance Big Data and Intelligent Systems, HPBD and IS 74\u0026ndash;78 (2021). (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAli, W., Shamsuddin, S.M., Ismail, A.S.: Intelligent Web proxy caching approaches based on machine learning techniques. (2012)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAli, W.: Intelligent Bayesian Network-Based Approaches for Web Proxy Caching\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eImtongkhum, P., So-In, C., Sanguanpong, S., Phoemphon, S.: A two-level intelligent web caching scheme with a hybrid extreme learning machine and least frequently used. J. Internet Technol. \u003cb\u003e19\u003c/b\u003e, 725\u0026ndash;740 (2018)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, Y., Yang, Y., Wang, Q.: An efficient Intelligent Cache Replacement Policy Suitable for PACS. Int. J. Mach. Learn. Comput. \u003cb\u003e11\u003c/b\u003e, 250\u0026ndash;255 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHerodotou, H.: Automating Distributed Tiered Storage Management in Cluster Computing. PVLDB. \u003cb\u003e13\u003c/b\u003e(1), 43\u0026ndash;56 (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification, (2009)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAl-Qtiemat, E., Jafar, I.: Intelligent Cache Replacement Algorithm for Web Proxy Caching based on Multi-level K-means Clustering. 2021 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology, JEEIT 2021 - Proceedings 278\u0026ndash;282 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAbdo, L., Ahmad, I., Abed, S.: A smart admission control and cache replacement approach in content delivery networks. Cluster Comput. \u003cb\u003e27\u003c/b\u003e, 2427\u0026ndash;2445 (2024)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, J., Wu, G., Hu, X., Wu, X.: A distributed cache for Hadoop Distributed File System in real-time cloud Services. Proceedings - IEEE/ACM International Workshop on Grid Computing 12\u0026ndash;21 (2012)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBok, K., Lim, J., Oh, H., Yoo, J.: An efficient cache management scheme for accessing small files in Distributed File Systems. 2017 IEEE International Conference on Big Data and Smart Computing, BigComp 2017 151\u0026ndash;155 (2017)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eApache Hadoop centralized cache management in HDFS: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://hadoop.apache.org/docs/r2.4.1/hadoop-project- dist/hadoophdfs/CentralizedCacheManagement.html\u003c/span\u003e\u003cspan address=\"http://hadoop.apache.org/docs/r2.4.1/hadoop-project- dist/hadoophdfs/CentralizedCacheManagement.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eALOJA: Big Data Benchmark Repository and Performance Analysis, hhtps://aloja.bsc.es\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHuang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. (2014). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICDEW.2010.5452747\u003c/span\u003e\u003cspan address=\"10.1109/ICDEW.2010.5452747\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHibench: http://GitHub.com/ Intel- bigdata/ HiBench\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMapReduce, H.: server. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html\u003c/span\u003e\u003cspan address=\"https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGutenberg: dataset. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.gutenberg.org/\u003c/span\u003e\u003cspan address=\"http://www.gutenberg.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"cluster-computing","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Cluster Computing](https://www.springer.com/journal/10586)","snPcode":"10586","submissionUrl":"https://submission.nature.com/new-submission/10586/3","title":"Cluster Computing","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Caching mechanism, Cache replacement algorithm, SVM-LRU, Hadoop performance","lastPublishedDoi":"10.21203/rs.3.rs-7801155/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7801155/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eModern applications can generate a large amount of data from different sources with high velocity, a combination that is difficult to store and process via traditional tools. Hadoop is one framework that is used for the parallel processing of a large amount of data in a distributed environment, however, various challenges can lead to poor performance. Two particular issues that can limit performance are the high access time for I/O operations and the recomputation of intermediate data. The combination of these two issues can result in resource wastage. In recent years, there have been attempts to overcome these problems by using caching mechanisms. Due to cache space limitations, it is crucial to use this space efficiently and avoid cache pollution (the cache contains data that is not used in the future). We propose HDFS-based SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that combines the well-known Least-Recently-Used (LRU) mechanism with a machine learning algorithm, Support Vector Machine (SVM), to classify cached data into two groups based on their future usage. Experimental results show a significant decrease in execution time as a result of an increased cache hit ratio, leading to a positive impact on Hadoop performance.\u003c/p\u003e","manuscriptTitle":"A Support Vector Machine-Based Cache Replacement Method for Efficient MapReduce Execution","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-13 12:17:45","doi":"10.21203/rs.3.rs-7801155/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-04-30T20:42:11+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-17T09:27:20+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-12T05:02:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"168694726627149003791817736389882630447","date":"2026-02-02T10:21:45+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"8325939867762675626521370830630558222","date":"2026-01-29T15:02:21+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-11-03T12:28:49+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-11-03T12:23:08+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-10-10T01:50:26+00:00","index":"","fulltext":""},{"type":"submitted","content":"Cluster Computing","date":"2025-10-07T15:47:10+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"cluster-computing","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Cluster Computing](https://www.springer.com/journal/10586)","snPcode":"10586","submissionUrl":"https://submission.nature.com/new-submission/10586/3","title":"Cluster Computing","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"d6dc4526-35de-4bf7-a73c-289089f5f8da","owner":[],"postedDate":"November 13th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[],"tags":[],"updatedAt":"2026-04-30T20:54:11+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-13 12:17:45","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7801155","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7801155","identity":"rs-7801155","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.