Deep neural network with local-global context-aware feature fusion for crack detection

doi:10.21203/rs.3.rs-8892244/v1

2026 · doi:10.21203/rs.3.rs-8892244/v1

preprint OA: closed

Research Article

HATİCE ÇATAL REİS, Veysel Turk, Kourosh Khoshelham

This is a preprint; it has not been peer reviewed by a journal.

This work is licensed under a CC BY 4.0 License.

Status: Under Revision (Version 1)

Abstract

Early and accurate detection of cracks in concrete structures is crucial for maintaining structural integrity and ensuring the safety of the structure. However, traditional visual inspection methods are limited in their application, especially with large datasets. In this area, deep learning-based approaches offer high potential for the automatic detection of micro- and macro-damage due to their large data processing capacity and ability to model complex structural patterns in the data.
In recent years, among deep learning-based approaches, the Convolutional Neural Network (CNN) has become prominent in crack detection. These models hold significant potential for identifying small cracks and micro-damage due to their ability to extract local features effectively. However, due to their limited ability to represent global context and long-range relationships, these models may be limited in detecting complex structural patterns where micro- and macro-cracks coexist. In this study, an advanced lightweight deep learning model called the Local-Global Context-Aware Feature Fusion Network (LG-CAFFNet) was developed to minimize the limitations of existing crack detection methods. The model focuses on comprehensively representing crack morphology at micro and macro scales with its multilayered structure that integrates local morphological details and global contextual relationships. In the model, local textural features are extracted through CNN-based layers. At the same time, the self-attention mechanism represents large-scale contextual relationships, and bidirectional recurrent neural network layers represent sequential structural dependencies. This multilayer contextual fusion-based approach, addressing the limitations observed in previous studies, contributes to a more comprehensive modeling of the morphological diversity of crack patterns, their multi-scale representation, and the contextual relationships between them. The proposed model was tested on four different concrete crack datasets, achieving accuracies of 97.61%, 99.44%, 99.23%, and 98.28%, respectively. Experimental results demonstrate that the proposed method offers competitive accuracy and computational efficiency in concrete crack detection, surpassing existing technologies and providing effective solutions for practical applications. 
Keywords: Context-aware crack detection; Light-weight deep neural network architecture; Local–global feature fusion; Multi-scale feature integration; Structural health monitoring; Surface crack detection in concrete and pavements

1. Introduction

With the population increasing rapidly, infrastructure has become more critical, and having safe structures has become even more vital (Gopalakrishnan, 2018). The health of public infrastructure and superstructure is vital for humanity, and environmental monitoring is required to protect sound structures and contribute to their sustainability. Defects in pavements, roads, buildings, and bridges are sometimes on the surface and sometimes so small that they cannot be seen with the naked eye. Early detection of these defects and cracks is crucial: it is a proactive measure that can prevent potential risks to public safety (Chen et al., 2020). Additionally, crack detection is essential in various industrial applications (Fang et al., 2020). Monitoring and managing the health of buildings, roads, pavements, and bridges requires data for crack detection. Cameras, smartphone images, and satellite or unmanned aerial vehicle data can provide this information. However, it is difficult to automatically detect cracks in buildings, roads, and pavements using non-standard data. Despite technological progress, visual inspections are still performed by humans in most cases and depend on the expert's knowledge and experience. What is ultimately sought from visual inspection, however, is reliability and the ability to generate repeatable data consistently. Manual inspection and interpretation are costly, time-consuming, and subject to human error (Mohan and Poobal, 2018, Hoang et al., 2018). Therefore, automatic crack/defect detection continues to attract researchers' interest.
Machine learning (Hoang et al., 2018) and deep learning algorithms (Faghih-Roohi et al., 2016) can be used together with image processing steps (Zhao et al., 2010, Mohan and Poobal, 2018, Li et al., 2018, Li and Sun, 2019) to detect cracks on surfaces. The development of artificial intelligence has increased interest in deep learning-based crack/fracture detection methods as a possible solution to the problems caused by manual inspection (Matarneh et al., 2024). Detecting cracks early reveals the first signs of road, asphalt, concrete, and pavement deterioration, and the early detection and classification of such deterioration are essential for maintenance strategy and decision-making (Cubero-Fernandez et al., 2017, Matarneh et al., 2024). In recent years, artificial intelligence-based Computer-Aided Diagnosis (CAD) systems have emerged with the potential to increase detection accuracy in concrete structures by quantitatively analyzing the morphology, distribution, and size of cracks. In the literature, deep learning models based on Convolutional Neural Networks (CNNs) structured within the framework of supervised learning are widely used in CAD systems (e.g., MultiScaleCrackNet (Russel and Selvaraj, 2024), lightweight CNN (Chang and Zheng, 2024), and CNN + U-Net (Nyathi et al., 2024)). However, while the inductive bias of convolution operations in these models allows for effective learning of local features, it limits the capacity to represent long-distance dependencies and broad contextual relationships, making it difficult to capture the global context. The CNN architecture processes the patterns in the image by analyzing them hierarchically from low to high levels.
The first convolutional layers extract the basic features of the image (e.g., edges and corners), while deeper layers make these features more abstract and semantic. This structure is highly successful in recognizing local patterns in image processing applications, such as crack detection. However, because CNNs learn locally, they may capture the relationships between these patterns in the global context of crack data only to a limited extent. Increasing the layer depth expands the receptive field and allows more complex features to be extracted, but it also leads to a polynomial growth in the number of parameters and computational cost. This creates significant limitations on the model's scalability, real-time performance, and computational efficiency, especially in fracture and crack detection applications performed with large datasets. Accordingly, developing computationally efficient and low-cost CNN architectures for fracture and crack detection is an important research priority in this field. In this context, the Depthwise Separable Convolution (DSC) technique (used, e.g., in YOLO v5s + DE(+ CA)+Slim-Neck+RFEM (Ma et al., 2024) and CrackScopeNet (Zhang et al., 2024b)) is among the alternative approaches used to increase computational efficiency in modern CNN-based deep learning methods. Although this method reduces processing costs by reducing the parameter density compared to classical convolutions, its limited inter-channel interaction may limit the model's representative power. In recent years, attention-based approaches such as the Vision Transformer (ViT) (Dosovitskiy et al., 2020) have emerged as an alternative to CNNs in fracture detection (Shamsabadi et al., 2022, Nasimov and Cho, 2025). These models can model long-range dependencies and global relationships more effectively than CNNs.
However, their high computational cost and lack of inductive bias limit the ability to learn local crack details, especially under data-limited conditions, resulting in performance losses on small-scale datasets (Zhang and Zhang, 2025). Finally, the studies on fracture and crack detection by Ahmed et al. (2019) and Gandhi et al. (2023) focus on the potential advantages of hybrid models based on CNNs and Recurrent Neural Networks (RNNs). An examination of the existing methods in the literature shows that RNNs are not widely used for image-based fracture and crack classification and that, when they are used, they are usually applied only in the last layers. This prevents the RNN from effectively processing the multi-layer spatial features that the CNN extracts in its early layers. In particular, accurately detecting complex geometric structures such as fractures and cracks requires evaluating local and global features in an integrated learning process. Therefore, methods that allow the spatial features obtained by the CNN to be processed more deeply by the RNN can significantly improve accuracy and generalization performance in this area. The limitations of crack detection studies in the literature include: (i) the inadequacy of CNN-based methods in capturing global context and long-range dependencies; (ii) limited research on, and insufficient discussion of, the integrated use of convolutional layers that extract local features with bidirectional recurrent networks that learn the sequential relationships of these features; and (iii) the difficulty that recently proposed ViT-based models, despite their strong global representation capabilities, have in capturing local crack details owing to their high computational cost and lack of inductive bias.
In this study, a lightweight and computationally efficient deep learning model, the Local-Global Context-Aware Feature Fusion Network (LG-CAFFNet), is proposed to minimize the limitations of existing approaches to fracture and crack detection. The proposed model integrates CNN blocks consisting of DSC and standard convolution layers, Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU) layers that handle bidirectional sequential dependencies, and the Multi-Head Attention (MHA) mechanism, which can model long-range spatial relationships. In the model, DSC blocks provide computational efficiency by extracting local features with a low number of parameters, while standard convolutions represent multi-scale spatial relationships with a wide receptive field. The BiLSTM and BiGRU layers process the local features obtained from the convolution blocks as a sequence and learn past and future context relationships bidirectionally. The MHA mechanism captures long-range contextual dependencies by calculating relationships between all locations in the input feature maps and enhances the model's representation capacity by integrating context information learned by different attention heads. Despite its 669-layer deep structure, the LG-CAFFNet model has only 1.48 million trainable parameters and a computational complexity of 0.75 GFLOPs. The proposed multi-component deep learning model addresses the limited global context and long-range dependency learning capacity of CNN-based methods in the literature, as well as the integration shortcomings of bidirectional RNNs in modeling sequential spatial relationships. Moreover, the model's lightweight structure and CNN-MHA fusion minimize the limitations of pure transformer-based models, which exhibit limited performance in capturing local crack details due to their high computational cost and lack of convolutional inductive bias.
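The idea of "calculating relationships between all locations" can be made concrete with a minimal sketch of single-head scaled dot-product self-attention. This is not the authors' implementation: it assumes queries, keys, and values are the raw inputs (no learned projections), and the three two-channel `tokens` are hypothetical. Multi-head attention runs several such heads in parallel on learned projections and concatenates their outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    weights = []
    for q in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output position is a weighted average of all input positions.
    out = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
           for row in weights]
    return out, weights


# ASSUMPTION: three hypothetical positions of a flattened feature map.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, attn = self_attention(tokens)
```

Every row of `attn` is a probability distribution over all positions, which is what lets a distant location influence the representation of the current one.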
Overall, the proposed design provides high computational efficiency in crack detection. The proposed model was developed using early fusion, multi-layer feature fusion (supported by residual/skip connections), and late fusion strategies to optimize information flow within the network. Together, these strategies, primarily multi-layer fusion supported by residual/skip connections, integrate feature representations at different levels, stabilize the information flow and gradient propagation, and increase the generalization capacity of the model by enriching its representations. LG-CAFFNet combines features at different levels, providing a comprehensive capacity for analysis from micro-level local details to macro-level global context. In this way, it can detect complex surface deteriorations and morphological structures of cracks with high precision. In addition, its optimized design reduces the high computational costs frequently encountered in traditional deep learning models. LG-CAFFNet thus aims to improve accuracy and generalizability in complex tasks such as fracture and crack detection. In the experimental study, the proposed method's performance was compared with CNN, Multilayer Perceptron (MLP), and ViT-based deep learning models, and their effectiveness in crack detection was analyzed comprehensively. During the evaluation process, each model's strengths and weaknesses were determined, and their generalization ability and performance in real-world conditions were examined in detail. In addition to the Cracks in Concrete Structures Dataset, the Concrete & Pavement Crack Dataset, and the Crack Dataset, which are widely used in the literature, the Concrete Cracks Image Dataset, which was collected by the authors and contains different surface textures, was used in this study.
The diversity offered by this new dataset provides the opportunity to examine the robustness and generalization capacity of the proposed model and modern algorithms in real-world conditions in more depth. Experimental findings show that the proposed deep learning model can produce successful results in real-world applications by exhibiting high accuracy and generalization ability in studies conducted on datasets with different scales and various characteristics. The main contributions can be summarized as follows:

LG-CAFFNet deep learning model: An innovative deep learning model has been developed that integrates CNN, BiLSTM/BiGRU, and MHA mechanisms. This model is capable of achieving high accuracy in crack detection by effectively learning features that include local, sequential, global, and long-range relationships.

Lightweight and computationally efficient design: Despite its 669-layer deep structure, the proposed model achieves high computational efficiency in crack detection, utilizing only 1.48 million trainable parameters and a computational complexity of 0.75 GFLOPs.

Optimizing information flow with fusion strategies: The proposed model was developed using early fusion, multi-layer feature fusion, and late fusion strategies. These strategies effectively combine feature representations at different levels, minimizing information loss during feature propagation and providing consistent and reliable generalization performance across various data samples and scenarios.

Comprehensive analysis at micro and macro levels for crack detection: LG-CAFFNet combines features at different levels, providing analysis capabilities from local micro-level details to the overall macro-level context, enabling high-precision analysis of complex surface deteriorations.

Generalization and adaptation capacity: The model demonstrates both generalization and adaptation capabilities by consistently performing well across different crack types and diverse data samples.
This paper is organized as follows: Section 2 reviews the methods that have been prominent in crack detection in recent years. Section 3 presents a detailed analysis of the deep learning model developed for detecting cracks in concrete structures. Section 4 presents the datasets and pre-processing phase used in the study, the implementation details of the deep learning models, and the metrics used to evaluate existing and proposed deep learning approaches. Section 5 presents the experimental results and discussion. Finally, Section 6 concludes the article.

2. Related work

Machine learning stands out as a powerful tool for early detection of road, concrete, and asphalt cracks. It offers higher accuracy compared to traditional methods, reducing maintenance costs and increasing infrastructure safety. This section summarizes machine learning approaches used in crack detection and the findings of related studies.

Kamaliardakani et al. (2016) developed a new algorithm for detecting sealed cracks on the road surface. The algorithm has three main components: preprocessing, segmentation, and postprocessing. In the preprocessing step, the effects of non-uniform background and pavement markings that may negatively affect the detection accuracy were reduced. In the segmentation step, cracks in the road surface were detected using four different thresholding methods (Otsu, maximum triangle distance, minimum error, and local minimum). In the postprocessing step, noise was removed with morphological opening and closing operations, and the accuracy of the detected cracks was increased by filling the gaps. The algorithm was tested with 110 sample images; 55 of these images (20 longitudinal sealed cracks, 20 transverse sealed cracks, 12 diagonal, and three alligator sealed cracks) contained cracks, while the other 55 (a maintenance hole cover, potholes, and discoloration spots) were crack-free regions.
Experimental results showed that the algorithm performed well and was consistent with recall (87%), precision (98%), and accuracy (93%) values. Liu et al. ( 2022a ) used DenseNet, ResNet, and EfficientNet models and infrared thermography methods to classify asphalt pavement crack severity. In the experimental process, different crack levels and image types (visible, infrared, fusion) were evaluated on the dataset consisting of 2316 images. Experimental analysis showed that the EfficientNet-B3 model achieved the highest accuracy in all scenarios. In particular, the fusion image achieved 94.14% accuracy, the visible image 93.28%, and the infrared image 86.55%. In the transfer learning process, the pre-trained EfficientNet-B3 model was the most successful, with an accuracy rate of 95.88%. In general, deep learning models classified low-severity cracks better, while misclassifications increased in medium and high-severity cracks. Guo et al. ( 2023 ) developed the Swin Transformer-based Crack Transformer (CT) model for detecting pavement surface cracks. The model aims to reduce environmental noise using a Swin Transformer-based encoder and MLP-based decoder. The proposed model has been extensively evaluated on CFD, Crack500, and CrackSC datasets. According to the experimental results, it has been observed that the CT model generally achieves more successful results compared to other models by producing 94.60% mF1, 92.94% mPrecision, 96.41% mRecall values with CFD dataset; 88.73% mF1, 87.45% mPrecision, 90.12% mRecall values with Crack500 dataset; and 90.01% mF1, 90.09% mPrecision, 89.93% mRecall values with CrackSC dataset, respectively. These findings show that the proposed method provides an effective solution for pavement crack detection. Matarneh et al. ( 2024 ) evaluated the performance of ten different pre-trained CNN architectures for detecting and classifying asphalt pavement cracks. 
In the study, various optimization techniques were compared, and an optimized CNN model for crack classification was developed, with DenseNet201 being determined as the most effective model. In addition, it was observed that the ShuffleNet and ResNet101 models also achieved successful results. In contrast, VGG16 and VGG19 models showed lower accuracy rates. DenseNet201 optimized with Grey Wolf optimization was tested on images containing different types and levels of noise, and its robustness and accuracy were proven. According to the experimental results, the optimized DenseNet201 model produced the most successful result with an accuracy rate of 98.73%. Yeung and Lam ( 2024 ) proposed the Contrastive Decoupling Network (CDNet) model for pavement crack detection. CDNet is developed with a contrastive learning framework that extracts global and local features separately to minimize challenges such as crack diversity, background complexity, and generalization ability. The Global Semantic Enhancement (GSE) module, Local Detail Refinement (LDR) module, and Dynamic Dependency-Aware Feature Aggregation (DDFA) method are added to improve the model's performance. In addition, three different contrastive loss functions are designed to optimize the global, local, and output features. CDNet was tested on Crack500, CrackTree200, CFD, and AEL datasets and obtained 0.683–0.912 ODS, 0.724–0.920 OIS, and 0.413–0.903 AP values, respectively. The test results show that CDNet is more successful than existing methods. Teng et al. ( 2024 ) performed image enhancement and augmentation operations using an Unsupervised Image-to-Image Translation (UNIT) network developed to solve the problems of low resolution and insufficient data volume in underwater concrete crack images. The UNIT network provides high-quality image transformation using an encoder-decoder structure. The network was developed using deep learning components such as Swin Transformer and ResNet-18. 
Swin Transformer provides effective extraction of local and global features. ResNet-18 has a lighter structure and provides faster and more efficient performance by reducing the computational requirements of the network. In addition, self-attention layers used in the network provide a more accurate capture of contextual information and long-distance dependencies, which enables the model to obtain more accurate results. In the experimental process, the clarity of low-resolution images in muddy water conditions was increased, and 45.2%, 40.4%, and 69.1% improvements were achieved in BRISQUE, NIQE, and PIQE metrics, respectively. In addition, converting the crack images from clean water and waterless environments to muddy water environments increased the number of images and improved the quality by at least 61.2%. The proposed method exhibited high performance despite difficulties such as low contrast and low illumination, and it was emphasized that it has the potential to provide a more comprehensive solution by combining it with sonar data in the future. Fu et al. ( 2024 ) proposed the YOLO-Crack model, an improved version of the YOLOv3 model optimized for real-time detection of concrete cracks. The proposed model reduced its size by 97.4% and increased its detection rate by 50.5% compared to the original model, thanks to the new block designs enhanced with attention mechanisms and DSC. YOLO-Crack achieved 72.22% mAP in crack detection at 48.11 FPS, 1.3% higher than YOLOv3. Experimental results show that the YOLOv5l model offers higher accuracy but has a 14.2 times larger model size and lower detection rate than YOLO-Crack. On the other hand, YOLO-Crack stands out with its compact structure and fast detection ability compared to other models, such as YOLOv4 and YOLOv5l. Zhang et al. ( 2024a ) proposed an improved Swin-Transformer-UNet (I-ST-UNet) model for detecting concrete cracks and calculating their widths. 
The proposed model improved the semantic segmentation performance by integrating Swin-Transformer blocks into the UNet architecture. The model's performance was tested with a dataset consisting of 2030 images. The model improved the semantic segmentation performance by providing 0.7% accuracy, 2.25% mean accuracy, 5.77% mean intersection-over-union, and 1% frequency weight intersection-over-union improvements. In the crack width calculations, the relative error remained below 5% for cracks between 0.1 and 0.2 mm and over 0.2 mm; 98.35% accuracy was achieved in safety warnings for cracks exceeding 0.2 mm. The experimental results demonstrated the effectiveness of the I-ST-UNet model in segmentation performance. They showed that the model can be used in various applications, from road maintenance to infrastructure monitoring. Shi et al. ( 2024 ) proposed a deep-learning model that combines infrared and visible light images to segment road cracks at night. First, a fusion technique was developed to integrate infrared and visible light data to enhance crack visibility in low-light conditions. Then, a network enhanced with a dynamic sparse attention mechanism was used to segment these enhanced images. Experimental results show that the proposed model provides higher accuracy (97.74%), mIoU (77.89%) and mPA (85.68%) than existing methods such as Unet, PSPNet and DeepLabv3+. Wang et al. ( 2024 ) proposed the Swin-Transformer-based SwinCrack model for automatic and accurate detection of asphalt cracks. The model is enhanced with convolution modules to solve traditional CNN methods' limited receptive field problem. Experiments with Crack500, CrackTree260, CrackLS315, Stone331, CRKWH100, and CFD datasets show that SwinCrack performs particularly well detecting long and thin cracks. The model achieved OIS values of 0.781–0.880 on different datasets and achieved a 4.4% improvement in AP score compared to its closest competitor. 
Furthermore, ablation studies showed that convolution modules improved performance by better modeling local contexts, reducing the number of parameters of the model by 22.1% and the computational load by 18%.

3. Proposed model

The LG-CAFFNet model is a deep learning framework that performs comprehensive contextual information extraction in image processing tasks. The model combines standard convolutional layers, DSC layers, MHA, BiLSTM, and BiGRU. In addition, three fusion strategies (early fusion, multi-layer feature fusion, and late fusion) are applied to optimize the model's overall performance. This section presents a detailed review of the proposed LG-CAFFNet model's basic components and network structure. The network structure of the LG-CAFFNet architecture is presented in Fig. 1. LG-CAFFNet's architectural design focuses on high accuracy and efficiency and combines multi-scale feature extraction with hierarchical data processing mechanisms. The model starts with an input block that processes input data of size 224×224×3 and extracts the basic features. At this stage, two Conv2D layers (with 5×5 filters), Batch Normalization (BN), and Rectified Linear Unit (ReLU) activations process the data. A MaxPooling2D (3×3) layer then downscales the extracted features. This block is designed to extract feature maps from the input data and pass them hierarchically to higher-level learning layers, which learn increasingly abstract (high-level) and specific features and thereby facilitate the detection of complex patterns. The feature matrix obtained at the initial stage is transferred to the Module A and Module B blocks, which form the model's basic processing units.
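To see how the input block shrinks the 224×224 input, the standard output-size formula floor((n + 2p − k)/s) + 1 can be applied layer by layer. The kernel sizes below come from the text; the strides and paddings are not stated in this section and are purely assumed values for illustration, so the resulting size is not a reported property of LG-CAFFNet.

```python
def out_size(n: int, k: int, stride: int = 1, pad: int = 0) -> int:
    """Output length of a convolution or pooling window along one axis."""
    return (n + 2 * pad - k) // stride + 1


size = 224                                   # input height/width from the text
# ASSUMPTION: strides and paddings are hypothetical, not given in the source.
size = out_size(size, k=5, stride=2, pad=2)  # first 5x5 Conv2D
size = out_size(size, k=5, stride=1, pad=2)  # second 5x5 Conv2D
size = out_size(size, k=3, stride=2, pad=1)  # 3x3 MaxPooling2D
print(size)
```

Changing the assumed strides changes the result, which is why the actual feature-map sizes should be read from the architecture figure rather than from this sketch.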
Module A and Module B process different feature groups in parallel, and the two groups complement each other to give more detailed and comprehensive feature extraction. The features obtained from Module A and Module B are combined via a Concatenate (Concat) operation, which implements the early fusion technique and integrates information at an early stage of the model. This ensures that the information obtained from the different modules is effectively integrated, increasing the model's learning capacity and overall performance. The integrated features are then transferred to the Convolution Adaptive Feature Fusion Blocks (CaffBlocks) in the later stages of the model to support advanced feature extraction and relational learning. A CaffBlock consists of three main components: Module A, Module B, and Transition blocks. The CaffBlocks are combined with Parallel Hybrid Convolutional Attentional Recurrent (PHCAR) blocks to increase the model's multi-scale learning capacity. The PHCAR blocks take the features produced by early fusion and by the transition blocks and analyze their long-distance dependencies, contextual information, and relationships across different feature levels to create a richer and more abstract representation. This integrates information from multiple contexts and allows long-term relationships to be learned effectively. In the final stages of the model, the features obtained from the CaffBlock and PHCAR blocks are combined with the late fusion technique. They are then forwarded to the global average pooling layer and converted into a compact vector, which increases parameter efficiency and reduces the risk of overfitting.
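The two operations named above, channel-wise concatenation and global average pooling, can be sketched in plain Python on tiny feature maps. The sketch assumes height × width × channel nested lists and uses hypothetical 2×2 branch outputs; it illustrates the operations themselves, not the authors' code.

```python
def concat_channels(a, b):
    """Concatenate two H x W x C feature maps along the channel axis."""
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def global_average_pool(fmap):
    """Average every channel over all spatial positions."""
    pixels = [px for row in fmap for px in row]
    channels = len(pixels[0])
    return [sum(px[c] for px in pixels) / len(pixels) for c in range(channels)]


# ASSUMPTION: two hypothetical 2 x 2 branch outputs with 1 and 2 channels.
branch_a = [[[1.0], [2.0]], [[3.0], [4.0]]]
branch_b = [[[0.0, 1.0], [0.0, 1.0]], [[2.0, 3.0], [2.0, 3.0]]]

fused = concat_channels(branch_a, branch_b)   # 2 x 2 x 3 after fusion
vector = global_average_pool(fused)           # one value per channel
print(vector)  # [2.5, 1.0, 2.0]
```

Concatenation keeps the branches' features side by side, and pooling collapses the spatial grid so that the vector length depends only on the channel count, which is why no large fully connected layer is needed before the classifier.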
Finally, the abstract features obtained from the previous layers are passed to the output dense layer with the Softmax activation function, which performs the classification. One of the proposed model's most striking features is the balance between its deep structure and parameter efficiency. Although the model consists of 669 layers, it contains only 1.48 million trainable parameters and requires 0.75 GFLOPs. This shows that a parameter-efficient design can reach high performance with a low number of parameters even in very deep networks, which is a significant advantage when computational resources are limited or real-time operation is required. The design of LG-CAFFNet, together with multi-scale feature fusion, increases the ability to learn meaningful and abstract representations from visual data. By integrating information at different resolution levels, the architecture improves the model's learning capacity on large datasets and complex tasks. LG-CAFFNet thus offers a performance advantage with a low number of parameters in applications requiring high precision and accuracy, such as fracture and crack detection: the model runs faster and more efficiently while maintaining high accuracy, which makes it more practical for real-world applications. The remainder of this section examines the primary structural components of the architecture in detail.

3.1. Convolution adaptive feature fusion block

The CaffBlock, the basic structural unit of the LG-CAFFNet model, has a sophisticated architecture consisting of various submodules.
This block consists of three Module A blocks, one Module B block, one Transition block, two convolutional layers, BN, and one addition (add) layer. The network structure of the CaffBlock architecture is presented in Fig. 2a. In the CaffBlock, the hierarchical feature extraction ability of the model is improved with the modules used. While the first three modules (Module A) enable the model to learn basic features effectively, the last module (Module B) improves the model's ability to capture more complex and abstract features. This structure allows the model to learn features at different levels in stages. In this way, a gradual transition from low-level features to high-level features is provided, making it possible to learn deeper and more complex information. This progressive learning approach optimizes the model's generalization ability and performance, especially in high-dimensional data sets and complex tasks, while also preserving computational efficiency. In addition, the multi-layer feature fusion technique was applied in the design of the CaffBlock. This approach integrates feature maps at different depths, combining information at different abstraction levels of the model and allowing richer and more comprehensive feature extraction. The Module A block is designed to extract features at different scales using convolution kernels of different sizes (1×1 and 3×3). This approach increases the model's ability to analyze complex data structures by enabling it to learn low- and high-level features effectively. The basis of the Module A block is the DSC technique. This technique increases the model's computational efficiency compared to standard convolutions while significantly reducing the number of model parameters. In a traditional CNN, convolutional filters process spatial and channel-level features of the input data together.
While spatial features define local patterns and structures, channel-level features model the interactions between channels, combining different filters to create more abstract and meaningful representations. This process allows obtaining higher-level data representations by learning relationships at both levels. However, the computational cost of this approach increases in proportion to the filter sizes and channel numbers at \(O\left(D_k^{2} \times C_{in} \times C_{out}\right)\) level (\(D_k\) represents the filter size, \(C_{in}\) represents the number of input channels, and \(C_{out}\) represents the number of output channels), which creates a significant computational burden, especially for large datasets and deep network structures. Therefore, standard convolutions are limited in terms of efficiency and scalability due to their high parameter density and computational requirements. More efficient alternative methods, such as DSC, can be preferred to minimize these limitations and optimize the CNN model's computational costs and parameter counts. This technique reduces the computational cost to \(O\left(D_k^{2} \times C_{in} + C_{in} \times C_{out}\right)\) by separating the classical convolution process into two stages: depthwise convolution and pointwise convolution. In this technique, the depthwise convolution stage independently extracts spatial features for each channel. In contrast, the pointwise convolution stage creates more comprehensive and rich feature representations by modeling channel relationships using 1×1 filters. This structure significantly reduces the number of model parameters and the processing volume, resulting in a more efficient and lightweight model. In addition, optimizing spatial and channel-level features can increase the efficiency of the model's learning process.
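The two cost expressions above can be compared numerically. The following is a minimal sketch, not the authors' implementation: it counts weights per layer (biases ignored) for a standard and a depthwise separable convolution, and the channel sizes 32 and 48 are illustrative assumptions rather than a specific layer of LG-CAFFNet.

```python
def standard_conv_weights(kernel_size, c_in, c_out):
    """Weights of a standard convolution: D_k^2 * C_in * C_out."""
    return kernel_size ** 2 * c_in * c_out


def separable_conv_weights(kernel_size, c_in, c_out):
    """Weights of a depthwise separable convolution.

    Depthwise stage: one D_k x D_k filter per input channel.
    Pointwise stage: a 1x1 convolution mixing C_in channels into C_out.
    """
    depthwise = kernel_size ** 2 * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 input channels, 48 output channels.
kernel_size, c_in, c_out = 3, 32, 48
standard = standard_conv_weights(kernel_size, c_in, c_out)
separable = separable_conv_weights(kernel_size, c_in, c_out)
ratio = separable / standard  # equals 1/C_out + 1/D_k^2

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {ratio:.4f}")
```

For this hypothetical layer the separable form needs roughly 13% of the weights of the standard form, which matches the closed-form ratio \(1/C_{out} + 1/D_k^{2}\).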
Although the DSC technique offers significant computational advantages over traditional convolutional methods, its limited modeling of interactions between channels may restrict the learning of more complex feature relationships. This may reduce the learning and generalization capacity of the model and, therefore, its overall performance. To avoid potential problems caused by DSC in the deep learning model proposed in this study, a standard convolution layer is also used in specific places (initial block, skip connection, Module B block (stage 2), Transition block). This hybrid approach increases the computational efficiency of the model while also providing deep feature extraction and rich feature representations. The basic structural components of Module A include SeparableConv2D, BN, the ReLU activation function, and the Concat layer. In this module, the SeparableConv2D layers use different filter counts (e.g., 48, 36, 32), different kernel sizes (1×1, 3×3), and “same” padding. The network structure of the Module A block architecture is presented in Fig. 2b. The Module B block consists of two stages. In the first stage, a network architecture similar to the Module A block but with a deeper structure is used. The network in this stage was developed using the DSC technique. In the second stage, there is a Residual Feed-Forward Network (RFFN) block consisting of a feed-forward neural network. The design of this block includes a skip connection structure inspired by the residual network architecture. The skip connection improves gradient propagation in the deep neural network. In this way, the network's overall learning capacity and training performance are increased. The first stage of Module B performs multi-scale feature extraction. The second stage focuses on extracting higher-level and abstract features.
This hybrid structure facilitates the detection of complex patterns by combining features at different levels and optimizing the gradient flow. Thus, gradient vanishing/exploding problems in the model are reduced. As a result, the integration of these two approaches can increase the model's adaptive capabilities and learning capacity, providing stronger generalization performance on different datasets and various task types. Module B's structural components include SeparableConv2D, Conv2D, BN, the ReLU activation function, the Concat layer, 2D upsampling, and 2D max pooling. In this module, the SeparableConv2D layers use different filter counts (e.g., 36, 32, 28), different kernel sizes (1×1, 3×3), and “same” padding. The Conv2D layers use different filter counts (e.g., 10, 8, 6), a 1×1 kernel size, and “same” padding. The network structure of the Module B block architecture is presented in Fig. 2c. The Transition block is the last structural component of the CaffBlock and is developed based on the basic principles of feedforward neural networks. This block also includes a skip connection structure. The Transition block is included in the network to increase the ability of deep neural networks to extract higher-level, more abstract, and more complex features. The basic structural components of the Transition block include Conv2D, BN, the ReLU activation function, and the add layer. In this block, the Conv2D layers use different filter counts (e.g., 64, 48, 24), different kernel sizes (1×1, 3×3), “same” padding, and the “he_normal” kernel initializer. The network structure of the Transition block architecture is presented in Fig. 2d.

3.2. Parallel hybrid convolution attentional recurrent block

The PHCAR block was created by integrating CNN, MHA, BiLSTM, and BiGRU technologies.
In this block, the feature matrices from the early fusion step and the Transition block (located in the CaffBlocks) are first passed through a convolutional layer to extract local features in the image. The resulting feature matrices pass through BN and the ReLU activation function and are then converted to a two-dimensional format with the reshape operation. After reshaping, these features are processed in parallel by the MHA, BiLSTM, and BiGRU mechanisms. The MHA mechanism supports parallel context learning in detecting fracture and crack regions in the LG-CAFFNet model, enabling more precise and in-depth analysis. This mechanism evaluates the importance levels of different regions in the image, making it possible to model the morphological features, fine details, and continuity of cracks more accurately. Thus, in addition to correctly analyzing local features, global contexts are also processed effectively. In particular, evaluating irregular fracture and crack structures together with their surrounding context enables the model to determine the starting and ending points of these structures precisely and to examine the relationships between regions in detail. The MHA mechanism begins by projecting the input into query (\(Q\)), key (\(K\)), and value (\(V\)) matrices with different weights for each head: \(Q_i = XW_i^{Q}\), \(K_i = XW_i^{K}\), \(V_i = XW_i^{V}\), where \(X\) represents the feature maps coming from the standard convolution (Conv2D) layer and \(W_i^{Q}\), \(W_i^{K}\), and \(W_i^{V}\) are the weight matrices learned for each head. In the next step, attention is computed for each head, and the head outputs are calculated as follows: \(head_i = Attention\left(Q_i, K_i, V_i\right) = Softmax\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right)V_i\), where \(d_k\) represents the dimension of the key vectors.
Finally, the head outputs are concatenated and projected: \(MultiHead\left(Q, K, V\right) = Concat\left(head_1, \dots, head_h\right)W^{O}\), where \(W^{O}\) is the weight matrix that projects the output. Thanks to their bidirectional architectures, BiLSTM and BiGRU provide an effective solution for contextual information extraction, allowing for forward and backward information flow. In applications where contextual information extraction is critical, such as fracture and crack detection, the structural differences between Long Short-Term Memory and Gated Recurrent Unit play a decisive role in selecting modeling strategies appropriate for the task type. With its bidirectional architecture, BiLSTM effectively models long-term dependencies, such as the start and end points of cracks, enabling detailed structural information to be extracted. In contrast, BiGRU's computational efficiency allows it to quickly learn the general features of fractured regions, thus offering advantages, especially when time and resource constraints are present. The combined use of these two structures allows for more in-depth and multifaceted modeling of contextual relationships, enabling local and global contexts to be represented holistically in feature maps. This integration increases the capacity of models such as LG-CAFFNet to analyze the surrounding context of fractured areas more comprehensively, improving analytical accuracy and context representation in critical tasks such as fracture and crack detection. The features obtained from the MHA and RNN branches pass through BN and ReLU, are summed in the add layer, and are then reshaped back into a three-dimensional format with the reshape operation. In the final stage, the output of the PHCAR block is generated by applying Layer Normalization.
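The multi-head attention equations used in the PHCAR block can be illustrated with a small, dependency-free sketch. This is not the authors' code: the weight matrices, the toy input, and all function names are hypothetical, and a practical model would use a library layer instead of nested lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    result = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def attention(q, k, v):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def multi_head(x, head_weights, w_o):
    """Concat(head_1, ..., head_h) W^O, with head_i built from X W_i^Q, X W_i^K, X W_i^V."""
    heads = []
    for w_q, w_k, w_v in head_weights:
        heads.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the per-head outputs token by token.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)


# Toy input: 3 tokens with 2 features each (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two heads; each projects the 2 features down to 1 dimension (hypothetical weights).
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])
w_o = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection

output = multi_head(x, [head_a, head_b], w_o)
for row in output:
    print([round(v, 4) for v in row])
```

Each output row is a weighted average of the value vectors, with weights given by the softmax of the scaled query-key scores, so every token can draw on information from every other token in the sequence.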
This hybrid architecture can effectively analyze complex data structures by combining the advantages of different feature extraction and processing techniques. The detailed diagram of the PHCAR block architecture is presented in Fig. 3.

4. Experimental setup

This section describes the datasets used in crack classification and the applied preprocessing techniques. In addition, specific details regarding the implementation of the methodologies are presented, and the evaluation metrics used to evaluate the classification performance are specified.

4.1. Datasets and preprocessing

In this study, the effectiveness of the proposed model and modern deep learning algorithms in crack classification was evaluated on four different datasets: Cracks in Concrete Structures Dataset, Concrete & Pavement Crack Dataset, Crack Dataset, and Concrete Cracks Image Dataset. The number of samples used in the training and testing processes of the deep learning algorithms and other details about the datasets are presented in Table 1. Example images of the datasets are shown in Fig. 4.

Table 1 Data distribution statistics.

| Dataset | Image type | Train | Test | No. of samples | Image size | Class label | Total instances |
|---|---|---|---|---|---|---|---|
| Cracks in Concrete Structures Dataset | Without Crack | 2839 | 1161 | 4000 | 224 × 224 | 0 | 12,000 |
| | Simple Cracks | 2794 | 1206 | 4000 | 224 × 224 | 1 | |
| | Multibranched Crack | 2767 | 1233 | 4000 | 224 × 224 | 2 | |
| Concrete & Pavement Crack Dataset | Negative | 5276 | 2224 | 7500 | 224 × 224 | 0 | 15,000 |
| | Positive | 5224 | 2276 | 7500 | 224 × 224 | 1 | |
| Crack Dataset | Clear | 892 | 408 | 1300 | 224 × 224 | 0 | 3900 |
| | Shallow | 937 | 363 | 1300 | 224 × 224 | 1 | |
| | Deep | 901 | 399 | 1300 | 224 × 224 | 2 | |
| Concrete Cracks Image Dataset | No cracks | 758 | 317 | 1075 | 224 × 224 | 0 | 2126 |
| | Cracks | 730 | 321 | 1051 | 224 × 224 | 1 | |

4.1.1. Cracks in the Concrete Structures Dataset

The Cracks in the Concrete Structures Dataset (Jabbari et al., 2023) was obtained from concrete structures on the Imam Khomeini International University campus and in Qazvin City, Iran.
A Phantom 4 Pro drone with 20 megapixel and Full HD resolution cameras was used to obtain the images. The data provided by the drone was recorded in video format and as color images. As a result, 900 color images with 4K resolution were obtained. These images were converted to grayscale and divided into 12,000 small images of 330×330 pixels. The images were divided into three categories, each with 4000 images: Without Crack, Simple Cracks, and Multibranched Crack. In this study, all images were resized to 224×224 pixels in the dataset's pre-processing stage to match the input layer of the deep learning models. The bicubic interpolation technique was used in the resizing process. This method uses a cubic polynomial function to calculate pixel values. It is generally preferred for enlarging low-resolution images or reducing high-resolution images. After resizing the images, data normalization was performed; at this stage, pixel values were scaled between 0 and 1. After the normalization process, the data was labeled 0 for the “Without Crack” category, 1 for the “Simple Cracks” category, and 2 for the “Multibranched Crack” category. In the experimental process, 70% of the dataset was used for training and 30% for testing. 30% of the training dataset was used in the validation phase of the model. This dataset can be accessed via the link: https://data.mendeley.com/datasets/9brnm3c39k/1 .

4.1.2. Concrete & Pavement Crack Dataset

The Concrete & Pavement Crack Dataset was collected by Oluwaseun (2023). This dataset contains concrete and pavement surface images collected at the Nigerian Army University Biu in Borno State, Nigeria. The images were collected using a DJI Mavic 2 Enterprise drone and a smartphone and saved as JPEG in RGB format. The dataset has two categories: Negative and Positive. The images have a resolution of 170×227 pixels.
In this study, 15,000 images were used, with 7500 images in each category (Negative and Positive). In the pre-processing stage of the dataset, resizing, data normalization, and labeling processes were performed. In the resizing process, all images were resized to 224×224 pixels using the bicubic interpolation technique. After this stage, the image pixel values were scaled to the range 0–1. In the labeling process, label 0 was defined for the “Negative” category, and label 1 was defined for the “Positive” category. In the experimental process, 70% of the dataset was used for training and 30% for testing. 30% of the training dataset was used in the model's validation phase. This dataset can be accessed via the link: https://www.kaggle.com/datasets/oluwaseunad/concrete-and-pavement-crack-images .

4.1.3. Crack Dataset

The Crack Dataset (Kassem, 2023) consists of (1) Clear, (2) Shallow, and (3) Deep categories. It contains 3900 images in total, 1300 in each category. In the pre-processing phase of the dataset, resizing, data normalization, and labeling processes were performed. In the resizing process, all images were resized to 224×224 pixels using the bicubic interpolation technique. After the resizing phase, the image pixel values were scaled to the range 0–1. In the labeling process, 0 was defined for the “Clear” category, 1 for the “Shallow” category, and 2 for the “Deep” category. In the experimental process, 70% of the dataset was used for training and 30% for testing. 30% of the training dataset was used in the model's validation phase. This dataset can be accessed via the link: https://www.kaggle.com/datasets/reemkassem/crack-dataset .

4.1.4. Concrete Cracks Image Dataset

The authors collected the Concrete Cracks Image Dataset (Reis and Turk, 2024) at Gümüşhane University Faculty of Engineering and Natural Sciences in Turkey. This dataset contains concrete crack images.
The images were collected using Samsung Galaxy M31 and Samsung Galaxy A50 smartphones with the Android operating system and saved as JPEG in RGB format. The dataset has two categories: “No Cracks” and “Cracks.” The original images have different pixel resolutions, such as 1504×3264 and 1860×4032. The dataset contains 1075 images in the “No Cracks” category and 1051 images in the “Cracks” category. In the pre-processing stage of the dataset, resizing, data normalization, and labeling processes were performed. In the resizing process, all images were resized to 224×224 pixels using the bicubic interpolation technique. After this stage, the image pixel values were scaled to the range 0–1. In the labeling process, 0 was defined for the “No Cracks” category, and 1 was defined for the “Cracks” category. In the experimental process, 70% of the dataset was used for training and 30% for testing. 30% of the training dataset was used in the validation phase of the model. This dataset can be accessed via the link: https://data.mendeley.com/datasets/fgjy2s3nk7/2 .

4.2. Implementation details

In this study, CNN, Transformer, and MLP-based deep learning algorithms trained from scratch were used to detect cracks. In addition to the proposed LG-CAFFNet deep learning algorithm, the compared models are MLP-Mixer (Tolstikhin et al., 2021), EfficientNetB2 (Tan and Le, 2019), MobileNet (Howard et al., 2017), FasterNet (Chen et al., 2023a), CMT (Guo et al., 2022), Swin Transformer V2 (Liu et al., 2022b), and FlexiViT (Beyer et al., 2023). The TensorFlow v2.15.0 framework 1 was used to implement the LG-CAFFNet, EfficientNetB2, and MobileNet deep learning models. The Keras CV Attention GitHub repository 2 was used to implement the FasterNet, CMT, Swin Transformer V2, FlexiViT, and MLP-Mixer deep learning models. The experimental process was carried out in the Google Colab Pro environment.
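The split protocol described for the four datasets in Section 4.1 (70% training and 30% test, with 30% of the training portion held out for validation) can be checked with a few lines of arithmetic. The sketch below is illustrative only: rounding the held-out share up to the next integer is an assumption, chosen because it reproduces the test totals in Table 1, and the validation counts it prints are derived values that the paper does not report. The function names are hypothetical.

```python
def ceil_percent(count, percent):
    """Smallest integer >= count * percent / 100, using integer arithmetic."""
    return (count * percent + 99) // 100


def split_sizes(total, test_percent=30, val_percent=30):
    """Return (fit, validation, test) sizes for a two-step hold-out split.

    Step 1: hold out test_percent of all samples for testing.
    Step 2: hold out val_percent of the remaining training samples for validation.
    Rounding up at both steps is an assumption, not stated in the paper.
    """
    test = ceil_percent(total, test_percent)
    train = total - test
    validation = ceil_percent(train, val_percent)
    return train - validation, validation, test


# Total instances per dataset, taken from Table 1.
dataset_totals = {
    "Cracks in Concrete Structures Dataset": 12000,
    "Concrete & Pavement Crack Dataset": 15000,
    "Crack Dataset": 3900,
    "Concrete Cracks Image Dataset": 2126,
}

for name, total in dataset_totals.items():
    fit, validation, test = split_sizes(total)
    print(f"{name}: fit={fit}, validation={validation}, test={test}")
```

Under these assumptions the test sizes come out as 3600, 4500, 1170, and 638, which agree with the per-class test counts summed from Table 1.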
The runtime used had an Intel(R) Xeon(R) CPU @ 2.20GHz, an NVIDIA L4 GPU with 22.5 GB of graphics memory (driver version 535.104.05, CUDA version 12.2), 53.0 GB of RAM, and 78.2 GB of hard disk space. The study was carried out using the Python programming language. A common methodology was applied for the training and evaluation of all deep learning models. Models were trained under the same conditions for 50 epochs within the framework of the training-validation-test paradigm. The backpropagation algorithm was used to compute the gradient of the loss function with respect to the model's weights and biases, and the parameters were updated accordingly. The optimization process was carried out using the Adam algorithm (Kingma and Ba, 2014). This algorithm provided fast convergence and balanced performance by using adaptive learning rates. Another important hyperparameter used in training the deep learning models is the batch size (32), which provides a balance between stochastic and deterministic approaches in gradient calculations, helping the training process be stable and fast. The models' training process aimed to minimize the training loss and validation loss values. For this purpose, the categorical cross-entropy loss function was used. Categorical cross-entropy is a loss function that measures the difference between the one-hot encoded actual labels and the class probabilities predicted by the model. In this study, when the performance of the deep learning models stagnated or decreased, the learning rate was dynamically adjusted from its initial value (1.0e-3). When no improvement was observed in the validation loss for two consecutive epochs (the patience value), the learning rate was reduced by a factor of 0.5; this process continued over the 50 training epochs, with a minimum learning rate of 1.0e-5.
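The learning-rate schedule described above can be simulated without any deep learning framework. The following is a simplified sketch of reduce-on-plateau logic using the stated settings (initial rate 1.0e-3, factor 0.5, patience 2, minimum rate 1.0e-5). It is not the TensorFlow implementation: it ignores options such as a cooldown period or a minimum improvement threshold, and the validation-loss sequence is invented for illustration.

```python
def simulate_reduce_on_plateau(val_losses, initial_lr=1e-3, factor=0.5,
                               patience=2, min_lr=1e-5):
    """Return the learning rate in effect during each epoch.

    The rate is multiplied by `factor` whenever the validation loss has
    not improved for `patience` consecutive epochs, and never drops
    below `min_lr`.
    """
    lr = initial_lr
    best = float("inf")
    epochs_without_improvement = 0
    history = []
    for loss in val_losses:
        history.append(lr)  # rate used while training this epoch
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                lr = max(lr * factor, min_lr)
                epochs_without_improvement = 0
    return history


# Hypothetical validation losses for eight epochs.
losses = [0.90, 0.50, 0.60, 0.55, 0.40, 0.45, 0.47, 0.46]
for epoch, rate in enumerate(simulate_reduce_on_plateau(losses), start=1):
    print(f"epoch {epoch}: lr = {rate:.2e}")

# A loss that never improves drives the rate down to the floor.
stalled = simulate_reduce_on_plateau([1.0] * 50)
print(f"final lr after 50 stalled epochs: {stalled[-1]:.2e}")
```

With a factor of 0.5, seven consecutive reductions are enough to reach the 1.0e-5 floor from 1.0e-3, so within 50 epochs the minimum rate acts as a real bound rather than a formality.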
This adaptive learning-rate strategy was intended to help the models converge stably and avoid getting stuck in poor local minima. During the implementation of the deep learning algorithms, the ModelCheckpoint and ReduceLROnPlateau functions of the TensorFlow library were used. ModelCheckpoint was used to save the best model, while ReduceLROnPlateau was used to adjust the learning rate dynamically. In all models, the input layer size was 224×224×3, and the output layer used the Softmax activation function. Probabilistic class predictions were obtained with the Softmax function. After the training of the deep learning models was completed, performance analysis was performed with the test dataset on the model with the lowest validation loss. This provided an objective measurement of the model's generalization ability. In this study, binary and multi-class (three-class) classification tasks were carried out, which tested how the deep learning models adapt to problems of different complexity.

4.3. Evaluation metrics

This research used various evaluation metrics to measure the success of the proposed method and modern deep learning models in binary and multi-class crack classification. Table 2 shows the evaluation metrics and their mathematical formulations. The evaluation metrics used include Accuracy (ACC), Sensitivity (SN), Positive Predictive Value (PPV), F-1 score (F-1), and Receiver Operating Characteristic Area Under the Curve (ROC AUC). In Table 2, TN denotes True Negative, TP True Positive, FN False Negative, and FP False Positive. In this study, the PPV, SN, and F-1 metrics were calculated with the macro average method in the multi-class classification process. The macro average weights the performance of each class equally and reflects the average per-class performance; therefore, it is less affected by class imbalance.

Table 2 Performance metrics for binary and multi-class classification.
| Performance metric | Binary-class expression | Multi-class expression |
|---|---|---|
| Accuracy (ACC) | \(\frac{TP + TN}{TP + TN + FP + FN}\) | \(\frac{\sum_{i=1}^{n} TP_i}{\text{Total number of test samples}}\) |
| Positive Predictive Value (PPV) | \(\frac{TP}{TP + FP}\) | (macro) \(\frac{1}{n}\sum_{i=1}^{n} PPV_i\) |
| Sensitivity (SN) | \(\frac{TP}{TP + FN}\) | (macro) \(\frac{1}{n}\sum_{i=1}^{n} SN_i\) |
| F1-score (F-1) | \(\frac{2 \times PPV \times SN}{PPV + SN}\) | (macro) \(\frac{1}{n}\sum_{i=1}^{n} F1_i\) |

5. Experimental results and discussion

In this study, the performance of the proposed model and state-of-the-art deep learning algorithms in crack image classification has been comprehensively evaluated on four different datasets (Cracks in Concrete Structures Dataset, Concrete & Pavement Crack Dataset, Crack Dataset, and Concrete Cracks Image Dataset). The experimental results obtained and their analysis are discussed in detail in this section. The evaluation process aims to measure the effectiveness of the proposed method on different datasets and to provide a comparative analysis with state-of-the-art methods. Thus, the generalizability of the proposed approach and its performance under different conditions have been examined, and its advantages and limitations have been discussed by comparing it with the existing state-of-the-art methods.

5.1. Performance comparison on different crack datasets

In this study, the performance of the LG-CAFFNet model has been comprehensively evaluated on four different datasets (Cracks in Concrete Structures Dataset, Concrete & Pavement Crack Dataset, Crack Dataset, and Concrete Cracks Image Dataset).
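The macro-averaged metrics defined in Table 2 can be computed directly from a confusion matrix. The sketch below is a minimal illustration under stated assumptions: rows are taken to be actual classes and columns predicted classes, the three-class matrix is hypothetical and does not come from the paper's experiments, and the function name is invented for this example.

```python
def macro_metrics(confusion):
    """Compute ACC and macro-averaged PPV, SN, and F-1.

    `confusion[i][j]` is the number of samples of actual class i
    predicted as class j (an assumed orientation).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    ppv_values, sn_values, f1_values = [], [], []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                       # missed samples of class i
        fp = sum(confusion[r][i] for r in range(n)) - tp  # other classes predicted as i
        ppv = tp / (tp + fp) if tp + fp else 0.0
        sn = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * ppv * sn / (ppv + sn) if ppv + sn else 0.0
        ppv_values.append(ppv)
        sn_values.append(sn)
        f1_values.append(f1)
    return {
        "ACC": sum(confusion[i][i] for i in range(n)) / total,
        "PPV": sum(ppv_values) / n,
        "SN": sum(sn_values) / n,
        "F-1": sum(f1_values) / n,
    }


# Hypothetical three-class confusion matrix (50 test samples per class).
example = [
    [50, 0, 0],
    [2, 45, 3],
    [0, 5, 45],
]
for name, value in macro_metrics(example).items():
    print(f"{name}: {value:.4f}")
```

Because each class contributes equally to the macro average, a class with few samples influences PPV, SN, and F-1 as much as a large one, which is the property the macro method is chosen for.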
To objectively measure the effectiveness of the proposed model, the experimental results of CNN, MLP, and Transformer-based modern deep learning algorithms on the same datasets have been analyzed comparatively. This comprehensive evaluation aims to determine the advantages and limitations of the LG-CAFFNet model over existing methods. The findings of the experimental evaluations are presented in Table 3. In experiments conducted on the Cracks in the Concrete Structures Dataset, the proposed LG-CAFFNet model demonstrated the highest performance with a test loss of 0.1021 and an ACC of 97.61%. The model also achieved the highest PPV (97.63%), SN (97.61%), and F-1 (97.61%) values. Comparative analyses reveal that the EfficientNetB2 (97.33% ACC, 0.1157 loss) and MobileNet (97.08% ACC, 0.1304 loss) models produced the closest accuracy values to the proposed model. Additionally, FasterNet (96.22% ACC, 0.1645 loss) and CMT (96.53% ACC, 0.2133 loss) exhibited moderate performance, while Swin Transformer V2 (93.81% ACC, 0.1894 loss) and MLP-Mixer (94.86% ACC, 0.2304 loss) exhibited lower performance. Finally, the FlexiViT model exhibited the lowest performance with 85.22% ACC and 0.3722 loss. In experiments conducted on the Concrete & Pavement Crack Dataset, the proposed LG-CAFFNet model demonstrated the highest performance with a loss of 0.0238 and an ACC of 99.44%. The model also achieved the highest scores in the SN (99.12%) and F-1 (99.45%) metrics. In the PPV metric, FasterNet (99.82%) produced the highest value. Comparative analyses revealed that the FasterNet (0.0422 loss, 99.31% ACC), MobileNet (0.0343 loss, 99.16% ACC), and EfficientNetB2 (0.0517 loss, 98.93% ACC) models produced the closest accuracy values to the proposed model.
In addition, CMT (0.1447 loss, 94.22% ACC) and FlexiViT (0.1841 loss, 93.78% ACC) showed moderate performance, while the MLP-Mixer (0.6307 loss, 61.51% ACC) and Swin Transformer V2 (0.6796 loss, 57.56% ACC) models showed significantly lower performance. In experiments conducted on the Crack Dataset, the LG-CAFFNet model demonstrated the highest accuracy with a loss of 0.0229 and an ACC of 99.23%. The model also achieved the highest values in the PPV (99.22%), SN (99.21%), and F-1 (99.21%) metrics. Comparative analyses revealed that the FasterNet (0.0470 loss, 98.46% ACC) and MobileNet (0.0559 loss, 98.46% ACC) models produced accuracies closest to LG-CAFFNet. In addition, Swin Transformer V2 (0.1490 loss, 94.87% ACC), FlexiViT (0.1307 loss, 94.79% ACC), and CMT (0.1567 loss, 93.33% ACC) showed moderate performance, while MLP-Mixer (0.2221 loss, 91.97% ACC) showed the lowest performance among the models. In experiments conducted on the Concrete Cracks Image Dataset, the LG-CAFFNet model demonstrated the highest overall accuracy among all models, with a loss of 0.0658 and an ACC of 98.28%. The model also achieved the best performance in the SN (96.88%) and F-1 (98.26%) metrics, while EfficientNetB2 (100%) produced the highest value in the PPV metric. Models with close performance include FasterNet (0.0907 loss, 97.65% ACC), CMT (0.1126 loss, 96.87% ACC), MobileNet (0.1217 loss, 96.71% ACC), and FlexiViT (0.1134 loss, 96.55% ACC). In contrast, the MLP-Mixer (0.5867 loss, 70.85% ACC) and Swin Transformer V2 (0.6922 loss, 44.20% ACC) models showed the poorest performance, with significantly lower accuracy and higher loss. Experimental results demonstrate that the LG-CAFFNet model consistently performs well in both positive and negative classes, with high ACC and low loss values.
In particular, the high PPV and SN values in the positive class, representing cracks, demonstrate that the model can detect fracture zones with high accuracy. These findings demonstrate that LG-CAFFNet can accurately detect fractures and complex structures within them. In conclusion, thanks to its ability to analyze high- and low-level features, the model produced successful crack detection results by effectively learning patterns at different scales. Figure 5 shows the change in validation loss of the deep learning algorithms during the training process on the different crack datasets. Figure 6 shows the ROC curves and the corresponding AUC values obtained during the testing phase of the deep learning algorithms applied to the different crack datasets. Figure 7 presents the confusion matrices showing the performance of the proposed LG-CAFFNet deep learning model on the test data of the crack datasets.

Table 3 Comparative analysis of deep learning models on the crack datasets.
| Dataset | Model | Loss | ACC | PPV | SN | F-1 |
|---|---|---|---|---|---|---|
| Cracks in the Concrete Structures Dataset | FasterNet | 0.1645 | 0.9622 | 0.9622 | 0.9623 | 0.9620 |
| | EfficientNetB2 | 0.1157 | 0.9733 | 0.9734 | 0.9732 | 0.9732 |
| | MobileNet | 0.1304 | 0.9708 | 0.9708 | 0.9710 | 0.9707 |
| | CMT | 0.2133 | 0.9653 | 0.9656 | 0.9652 | 0.9652 |
| | Swin Transformer V2 | 0.1894 | 0.9381 | 0.9388 | 0.9380 | 0.9378 |
| | FlexiViT | 0.3722 | 0.8522 | 0.8563 | 0.8528 | 0.8498 |
| | MLP-Mixer | 0.2304 | 0.9486 | 0.9483 | 0.9483 | 0.9482 |
| | LG-CAFFNet | 0.1021 | 0.9761 | 0.9763 | 0.9761 | 0.9761 |
| Concrete & Pavement Crack Dataset | FasterNet | 0.0422 | 0.9931 | 0.9982 | 0.9881 | 0.9932 |
| | EfficientNetB2 | 0.0517 | 0.9893 | 0.9942 | 0.9846 | 0.9894 |
| | MobileNet | 0.0343 | 0.9916 | 0.9978 | 0.9855 | 0.9916 |
| | CMT | 0.1447 | 0.9422 | 0.9688 | 0.9152 | 0.9413 |
| | Swin Transformer V2 | 0.6796 | 0.5756 | 0.5839 | 0.5598 | 0.5716 |
| | FlexiViT | 0.1841 | 0.9378 | 0.9309 | 0.9473 | 0.9390 |
| | MLP-Mixer | 0.6307 | 0.6151 | 0.6435 | 0.5360 | 0.5849 |
| | LG-CAFFNet | 0.0238 | 0.9944 | 0.9978 | 0.9912 | 0.9945 |
| Crack Dataset | FasterNet | 0.0470 | 0.9846 | 0.9842 | 0.9845 | 0.9843 |
| | EfficientNetB2 | 0.0677 | 0.9769 | 0.9770 | 0.9761 | 0.9764 |
| | MobileNet | 0.0559 | 0.9846 | 0.9844 | 0.9843 | 0.9843 |
| | CMT | 0.1567 | 0.9333 | 0.9334 | 0.9332 | 0.9329 |
| | Swin Transformer V2 | 0.1490 | 0.9487 | 0.9503 | 0.9459 | 0.9468 |
| | FlexiViT | 0.1307 | 0.9479 | 0.9469 | 0.9467 | 0.9468 |
| | MLP-Mixer | 0.2221 | 0.9197 | 0.9203 | 0.9171 | 0.9180 |
| | LG-CAFFNet | 0.0229 | 0.9923 | 0.9922 | 0.9921 | 0.9921 |
| Concrete Cracks Image Dataset | FasterNet | 0.0907 | 0.9765 | 0.9935 | 0.9595 | 0.9762 |
| | EfficientNetB2 | 0.2671 | 0.9389 | 1.0000 | 0.8785 | 0.9353 |
| | MobileNet | 0.1217 | 0.9671 | 0.9902 | 0.9439 | 0.9665 |
| | CMT | 0.1126 | 0.9687 | 0.9871 | 0.9502 | 0.9683 |
| | Swin Transformer V2 | 0.6922 | 0.4420 | 0.2989 | 0.0810 | 0.1275 |
| | FlexiViT | 0.1134 | 0.9655 | 0.9902 | 0.9408 | 0.9649 |
| | MLP-Mixer | 0.5867 | 0.7085 | 0.7626 | 0.6106 | 0.6782 |
| | LG-CAFFNet | 0.0658 | 0.9828 | 0.9968 | 0.9688 | 0.9826 |

Figure 5 presents the graphs showing the changes in the validation losses of different deep learning algorithms on four datasets during the validation phase of training.
According to the graphs, significant fluctuations were observed in all models at the beginning of the training process, but the losses stabilized as the iterations progressed. MLP-Mixer and Swin Transformer V2 were the notable exceptions to this stabilization. These models exhibited unstable learning dynamics, particularly on the Concrete & Pavement Crack Dataset, characterized by small-scale fluctuations in the validation loss. In the Concrete Cracks Image Dataset, a decrease in the validation loss was observed at the beginning, but the change in the losses remained minimal in the later stages of the training process, and these models showed lower performance compared to the other deep learning approaches. Experimental findings reveal that the proposed LG-CAFFNet architecture exhibits a consistent and efficient learning process on the four datasets. The model's low validation loss and stable optimization dynamics indicate high generalization performance and a stronger representation capacity compared to the existing deep learning models. Figure 6 compares the AUC performances of the different deep learning models on the four datasets. Since the Cracks in the Concrete Structures Dataset and the Crack Dataset have a multi-class structure, AUC values were calculated separately for each class. In the Cracks in the Concrete Structures Dataset, LG-CAFFNet (0.9821) produced the highest average AUC value, and FlexiViT (0.8893) had the lowest average AUC value. The average AUC value of all models was calculated as 0.9612. In the Concrete & Pavement Crack Dataset, LG-CAFFNet (0.9945) was the most successful model, while Swin Transformer V2 (0.5757) had the lowest performance. The average AUC value of all models was 0.8801. According to the average AUC values in the Crack Dataset, the LG-CAFFNet model (0.9941) produced the best result, and the MLP-Mixer (0.9385) produced the lowest average AUC value.
On this dataset, the average AUC across all models was 0.9703. On the Concrete Cracks Image Dataset, LG-CAFFNet (0.9828) was the most successful model, while Swin Transformer V2 (0.4443) produced the lowest average AUC value. The average AUC across all models was 0.8692. The experimental findings show that the proposed LG-CAFFNet model achieved the highest average AUC on all four datasets.

Figure 7 presents the test results of the LG-CAFFNet model on four different datasets. On the Cracks in the Concrete Structures Dataset, the model made 86 errors on 3,600 test samples. It achieved an accuracy of 98.71% in the Without Crack class, 94.86% in the Simple Cracks class, and 99.27% in the Multibranched Crack class. The model therefore classified the Without Crack and Multibranched Crack classes most reliably, while Simple Cracks was the hardest class. On the Concrete & Pavement Crack Dataset, the model made 25 errors on 4,500 test samples. It achieved an accuracy of 99.78% in the Negative class and 99.12% in the Positive class, so examples in the Negative class were classified slightly more accurately. On the Crack Dataset, the model made 9 errors on 1,170 test samples, with an accuracy of 100% in the Clear class, 98.62% in the Shallow class, and 99.00% in the Deep class. On the Concrete Cracks Image Dataset, the model made 11 errors on 638 test samples, reaching an accuracy of 99.68% in the No Cracks class and 96.88% in the Cracks class. The experimental results show that the LG-CAFFNet model can classify different crack types with high accuracy and strong generalization ability.

5.2. Ablation study

In this study, the effects of the fusion strategies and sequence-based components (MHA, BiGRU, BiLSTM) used in the LG-CAFFNet deep learning model on classification performance were analyzed with comprehensive experiments. The experiments were performed on the Cracks in Concrete Structures Dataset, and the results are presented in Table 4. The configuration with all components integrated showed the highest performance, with an accuracy of 97.61% and only 86 misclassified test examples. When the late fusion, MHA, and RNN components were removed, the accuracy decreased to 97.47%, and the number of misclassified examples increased to 91. When the multi-layer feature fusion component was removed as well, the accuracy decreased to 72.06%, a drop of 25.55 percentage points relative to the full model, and the number of misclassified examples reached 1006. When all fusion techniques were removed, the accuracy dropped to 69.72%, a decrease of 27.89 percentage points, and the number of misclassified examples increased to 1090. The experimental findings show that the fusion strategies together with the sequence-based components increased the generalization capacity of LG-CAFFNet, allowing complex data structures to be represented more effectively. However, although these components deepened the model's feature extraction, they significantly increased the training time. The full model required 3623 seconds of training (Table 4), whereas the variant with all of these components removed required 1480 seconds. Figure 8 visualizes the effects of the model components on the training process and on performance in the testing phase.

Table 4 Effects of different fusion techniques on the performance of the LG-CAFFNet deep learning model.
Model | Trainable parameters (million) | Total number of layers | ACC | F-1 | Training time (s)
LG-CAFFNet without (late fusion, multi-layer feature fusion, early fusion technique, MHA, BiGRU and BiLSTM) | 0.35 | 486 | 0.6972 | 0.6922 | 1480
LG-CAFFNet without (late fusion, multi-layer feature fusion technique, MHA, BiGRU, BiLSTM) | 0.38 | 556 | 0.7206 | 0.7148 | 1777
LG-CAFFNet without (late fusion technique, MHA, BiGRU, BiLSTM) | 1.09 | 576 | 0.9747 | 0.9747 | 1667
LG-CAFFNet | 1.48 | 669 | 0.9761 | 0.9761 | 3623

The graphs in Fig. 8 present the ACC and F-1 results obtained in testing together with the changes in validation loss. The findings show that removing the fusion strategies and sequence-based components used in the LG-CAFFNet model significantly decreased model performance. The validation loss graph shows that removing the three fusion strategies and the sequence-based components led to high validation loss values. In line with the evaluation metrics from testing, removing model components decreased the ACC and F-1 scores compared to the original model. These findings indicate that the components of the LG-CAFFNet model play a critical role in the classification process and that removing them negatively affects the model's generalization ability.

5.3. Comparative complexity analysis

In this section, the complexity of the deep learning models is compared, and the relevant findings are presented in Table 5. The models were evaluated using the total number of layers, the number of trainable parameters, giga floating-point operations (GFLOPs), and the training times on the Cracks in the Concrete Structures Dataset, Concrete & Pavement Crack Dataset, Crack Dataset, and Concrete Cracks Image Dataset.
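As a rough cross-check of this comparison, the short Python sketch below transcribes the values reported in Table 5 and looks up the lowest and highest entry for each complexity metric. The dictionary layout and the helper name `extremes` are illustrative assumptions and are not part of the authors' code; summing the four training times into one total is likewise only a convenience for this sketch.

```python
# Values transcribed from Table 5; layout and helper names are illustrative.
# Each entry: (total layers, trainable parameters in millions, GFLOPs,
#              training times in seconds on the four datasets)
TABLE_5 = {
    "FasterNet": (131, 6.32, 1.71, (581, 708, 218, 118)),
    "EfficientNetB2": (342, 7.70, 1.36, (2557, 3165, 884, 514)),
    "MobileNet": (91, 3.21, 1.15, (954, 1166, 326, 182)),
    "CMT": (598, 8.19, 2.64, (2757, 3432, 1040, 639)),
    "Swin Transformer V2": (569, 27.57, 9.39, (5540, 6718, 1938, 1041)),
    "FlexiViT": (272, 21.67, 9.26, (2575, 3121, 843, 511)),
    "MLP-Mixer": (150, 59.53, 6.51, (2197, 2716, 742, 447)),
    "LG-CAFFNet": (669, 1.48, 0.75, (3623, 4439, 1416, 911)),
}


def extremes(values):
    """Return (model with the smallest value, model with the largest value)."""
    return min(values, key=values.get), max(values, key=values.get)


layers = extremes({model: row[0] for model, row in TABLE_5.items()})
params = extremes({model: row[1] for model, row in TABLE_5.items()})
gflops = extremes({model: row[2] for model, row in TABLE_5.items()})
# Total training time over the four datasets (a summary chosen for this sketch).
total_time = extremes({model: sum(row[3]) for model, row in TABLE_5.items()})

for label, (lowest, highest) in [
    ("layers", layers),
    ("trainable parameters", params),
    ("GFLOPs", gflops),
    ("total training time", total_time),
]:
    print(f"{label}: lowest = {lowest}, highest = {highest}")
```

With these values, LG-CAFFNet has the fewest parameters and GFLOPs but the most layers, and its summed training time (10,389 s) is the second highest in the table after Swin Transformer V2, which is consistent with the training-time overhead reported in the ablation study.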
In terms of the number of layers, the number of trainable parameters, and GFLOPs, the LG-CAFFNet model stands out as a computationally efficient architecture, with only 1.48 million trainable parameters and 0.75 GFLOPs despite its 669-layer structure. Among the architectures examined, LG-CAFFNet had the deepest structure, while MobileNet had the fewest layers. In terms of the number of trainable parameters, LG-CAFFNet has the lowest value at 1.48 million, while MLP-Mixer has the highest at 59.53 million. In terms of computational complexity, LG-CAFFNet has the lowest value at 0.75 GFLOPs, while Swin Transformer V2 has the highest at 9.39 GFLOPs. In terms of training time, FasterNet had the shortest and Swin Transformer V2 the longest training times on all four datasets. The training times of the LG-CAFFNet model were 3623, 4439, 1416, and 911 seconds on the “Cracks in the Concrete Structures Dataset”, “Concrete & Pavement Crack Dataset”, “Crack Dataset”, and “Concrete Cracks Image Dataset”, respectively.

Table 5 Comparative analysis of deep learning model complexity.

Model | Total number of layers | Trainable parameters (million) | GFLOPs | Training time (s), Cracks in Concrete Structures Dataset | Training time (s), Concrete & Pavement Crack Dataset | Training time (s), Crack Dataset | Training time (s), Concrete Cracks Image Dataset
FasterNet | 131 | 6.32 | 1.71 | 581 | 708 | 218 | 118
EfficientNetB2 | 342 | 7.70 | 1.36 | 2557 | 3165 | 884 | 514
MobileNet | 91 | 3.21 | 1.15 | 954 | 1166 | 326 | 182
CMT | 598 | 8.19 | 2.64 | 2757 | 3432 | 1040 | 639
Swin Transformer V2 | 569 | 27.57 | 9.39 | 5540 | 6718 | 1938 | 1041
FlexiViT | 272 | 21.67 | 9.26 | 2575 | 3121 | 843 | 511
MLP-Mixer | 150 | 59.53 | 6.51 | 2197 | 2716 | 742 | 447
LG-CAFFNet | 669 | 1.48 | 0.75 | 3623 | 4439 | 1416 | 911

5.4. Explaining model predictions with XAI methods

Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) is an effective technique for increasing the transparency and interpretability of the predictions that CNN-based deep learning models make on visual data. The method uses gradients to visualize the image regions that play a decisive role in the model output, which makes the model's decision process easier to understand. Grad-CAM highlights class activations by determining the regions on which the network focuses for each class, and thus indicates which features drive the model's decisions. This allows a more objective interpretation and analysis of the model's predictions in complex tasks such as image classification. This study used Grad-CAM to increase the interpretability of the LG-CAFFNet model proposed for classifying fractures and cracks in structural elements such as concrete, pavement, and roads. The experimental findings are presented in Fig. 9. They show the class-based activation maps and regions of interest obtained by applying Grad-CAM to the comprehensive feature matrices produced by the late fusion technique in the LG-CAFFNet model.

Figure 9 shows the visualization results obtained by the LG-CAFFNet model for cracks in concrete and other surfaces in the four datasets. These results indicate that the model localizes cracks effectively. On the Cracks in Concrete Structures Dataset, the Grad-CAM activation maps localized both simple and multi-branched cracks accurately. In simple cracks, the areas on which the model focused were concentrated along the crack and accurately covered its fine details.
In complex multi-branched cracks in particular, the areas on which the model focused show that it captured the crack geometry correctly. The accurate identification of branching regions in such cracks reflects the model's high geometric sensitivity. In addition, the overlay images generally show minimal activation outside the crack region, which suggests that the model rarely attends to irrelevant areas.

On the Concrete & Pavement Crack Dataset, the model correctly detects both thin and wide crack lines, with dense activations spreading along the crack that improve localization accuracy. The Grad-CAM activation maps show that the model covers the entire crack length and focuses especially on the crack's starting and ending points. Wider cracks were detected successfully without being affected by environmental noise. In thin cracks, the model also correctly detected low-contrast crack sections running parallel to the surface. The overlay images show that the model detects cracks independently of environmental noise and that color and texture changes on concrete surfaces do not negatively affect its performance.

In the analyses performed on the Crack Dataset, the model distinguished shallow and deep cracks with notable success. In shallow cracks, the model followed thinner, superficial crack patterns in detail, although some activation leaked into the area around the crack. In deep cracks, by contrast, the model's focus became denser and it captured the crack's deep structure more successfully. Here the Grad-CAM maps focused on the inner lines of the crack and the stress points around it. This indicates that the model responds to depth cues as well as to surface cracks. In addition, the overlay results confirm that the model's focus area is consistent along the crack.
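The Grad-CAM computation behind the activation maps discussed in this section can be sketched in a few lines of NumPy. The sketch follows the standard formulation of Selvaraju et al. (2017): each channel weight is the spatial average of the gradient of the class score with respect to that feature map, and the heatmap is the ReLU of the weighted sum of feature maps. It assumes that the feature maps and gradients have already been extracted from a convolutional layer; the function name `grad_cam`, the array shapes, and the toy inputs are illustrative assumptions and do not reproduce the LG-CAFFNet pipeline.

```python
import numpy as np


def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from feature maps and class-score gradients.

    Both inputs are assumed to have shape (H, W, K): K feature maps of size
    H x W, and the gradient of the class score with respect to each of them.
    """
    if feature_maps.shape != gradients.shape:
        raise ValueError("feature_maps and gradients must have the same shape")
    # Channel weights: global average of the gradients over the spatial axes.
    weights = gradients.mean(axis=(0, 1))                        # shape (K,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))   # shape (H, W)
    # ReLU keeps only regions that push the class score up.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] so the map can be overlaid on the input image.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy input (hypothetical): channel 0 fires along a horizontal "crack" in
# row 1, channel 1 fires uniformly like background texture.
feature_maps = np.zeros((4, 4, 2))
feature_maps[1, :, 0] = 1.0
feature_maps[:, :, 1] = 0.5

gradients = np.zeros((4, 4, 2))
gradients[:, :, 0] = 2.0    # class score rises with channel 0
gradients[:, :, 1] = -1.0   # and falls with channel 1

heatmap = grad_cam(feature_maps, gradients)
print(heatmap)
```

In this toy case the heatmap is non-zero only along the crack row, which mirrors the behavior described for the overlay images: activation concentrated along the crack and suppressed elsewhere. In practice the map would be upsampled to the input resolution before being overlaid.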
Tests on the Concrete Cracks Image Dataset evaluated the model's ability to generalize to cracks on different surface types (wall, concrete, etc.). In addition to cracks on wall and concrete surfaces, the model performed effectively on images with heavy texture or irregular surface features. In particular, it correctly detected both large cracks on large surfaces and small wear-related surface cracks. In the overlay images, the model generally focused on crack areas and was little affected by surface roughness, color variations, or textural differences. Additionally, the Grad-CAM maps show that the regions on which the model focused are consistent with the crack geometry.

The experimental findings suggest that the LG-CAFFNet model handles differences in crack depth, geometry, and surface type well. Deep cracks have a wider and more complex structure than superficial cracks, and the model's accurate localization of such cracks supports its depth sensitivity and detection ability. The model also produced good results on more complex geometric structures, especially multi-branched cracks, and on different surface types (concrete, road, wall, etc.). The results indicate that LG-CAFFNet generalizes well and detects cracks accurately under the varied environmental conditions represented in these datasets. The model can effectively handle different surface types and offers an important approach to crack detection and visualization.

5.5. Comparison of state-of-the-art methods

This section compares the performance of the LG-CAFFNet deep learning model developed for crack detection with studies proposed in the literature in recent years. The analysis of the experimental results is presented in Table 6.
Table 6 A comparative analysis of state-of-the-art methods and the proposed deep learning model.

Literature | Method | Dataset | Types of cracks | ACC (%) | PPV (%) | SN (%) | F-1 (%)
Russel and Selvaraj (2024) | MultiScaleCrackNet | Asphalt Crack Database | Negative, Positive | 99.00 | 100 | 98.00 | 99.00
Shashidhar et al. (2024) | CrackSpot | Structure surface datasets | Noncrack, Crack | 97.11 | 97 | 97 | 97
Mohan et al. (2023) | ResNet50 | Cracks in Concrete Structures Dataset | Without Crack, Simple Cracks, Multibranched Crack | 96 | 92.1 | 90.8 | N/A
Omoebamije et al. (2023) | CNN model | Concrete & Pavement Crack Dataset | Negative, Positive | 99.04 | 98.81 | 99.28 | 99.04
Chen et al. (2023b) | ResNet101 | Building Surface Crack (in China) | Without Cracks, Cracks | 94 | N/A | N/A | N/A
Rashid et al. (2024) | CNN | Surface Crack Detection Dataset | Negative, Positive | 99.27 | 99.7 | 98.85 | 99.3
Jabbari and Bigdeli (2024) | CapsGAN | Cracks in Concrete Structures Dataset | Without Crack, Simple Cracks, Multibranched Crack | 94.1 | 98.7 | 94.2 | 96.3
Sun et al. (2023) | SVM | SDNET2018 | Without Cracks, With Cracks | 94.38 | N/A | N/A | N/A
Proposed Approach | LG-CAFFNet | Cracks in Concrete Structures Dataset | Without Crack, Simple Cracks, Multibranched Crack | 97.61 | 97.63 | 97.61 | 97.61
Proposed Approach | LG-CAFFNet | Concrete & Pavement Crack Dataset | Negative, Positive | 99.44 | 99.78 | 99.12 | 99.45
Proposed Approach | LG-CAFFNet | Crack Dataset | Clear, Shallow, Deep | 99.23 | 99.29 | 99.31 | 99.30
Proposed Approach | LG-CAFFNet | Concrete Cracks Image Dataset | No cracks, Cracks | 98.28 | 99.68 | 96.88 | 98.26

According to Table 6, the classification accuracies of the proposed deep learning model on the “Cracks in Concrete Structures Dataset”, “Concrete & Pavement Crack Dataset”, “Crack Dataset”, and “Concrete Cracks Image Dataset” were 97.61%, 99.44%, 99.23%, and 98.28%, respectively. According to the table, the proposed model produced the highest accuracy among the methods evaluated on the same datasets. Among the studies using the “Cracks in Concrete Structures Dataset”, Mohan et al.
(2023) achieved the second-best classification accuracy, with an ACC of 96%. Among the studies using the “Concrete & Pavement Crack Dataset”, Omoebamije et al. (2023) achieved the second-best accuracy, with an ACC of 99.04%.

6. Conclusion

This study proposes LG-CAFFNet, an advanced deep learning model for detecting cracks in concrete structures. The model is designed to perform comprehensive feature extraction by learning local correlations with CNN, global correlations with MHA, and sequential correlations with RNN (BiLSTM, BiGRU) techniques. In addition, feature fusion mechanisms at different levels are integrated into the model using early fusion, multi-layer feature fusion, and late fusion strategies. This increases the model's crack detection performance and provides a more balanced learning process. Although the proposed model has a deep structure of 669 layers, it contains only 1.48 million trainable parameters. This keeps the model computationally compact while maintaining its high learning capacity. The effectiveness of the proposed method was evaluated with extensive experiments on the Cracks in Concrete Structures Dataset, the Concrete & Pavement Crack Dataset, the Crack Dataset, and the Concrete Cracks Image Dataset collected by the authors. The experimental results show that the LG-CAFFNet model achieved accuracy rates of 97.61%, 99.44%, 99.23%, and 98.28% on these datasets, respectively. These findings show that the CNN-MHA-Bidirectional RNN-based deep learning model significantly increases the capacity to learn crack patterns, providing a practical approach in this context. This study identified several limitations for real-world applications, including computational complexity, processing times, and generalizability. Due to RNN and MHA integrations, the LG-CAFFNet model has high computational costs.
However, although these integrations increased the model's time complexity, they significantly improved the classification success. In real-world applications, the effectiveness of this model may vary depending on factors such as dataset size and diversity, which can make it challenging to optimize the balance between accuracy and processing time. In future studies, strategies such as knowledge distillation, quantization, and pruning can be applied to minimize the computational complexity of the proposed LG-CAFFNet model. Computational efficiency can be increased by using dilated convolutions and group convolutions in the model. Additionally, alternative methods, such as self-supervised learning techniques, graph convolutional networks, or capsule networks, can be employed to enhance the model's feature extraction capabilities. In the parameter determination process of the deep learning model, more effective parameter adjustments can be made using optimization algorithms such as Particle Swarm Optimization, Cuckoo Search, and the Firefly Algorithm. Moreover, hybrid learning methods can be developed using meta-classifiers. Data augmentation techniques can be applied to increase the model's classification performance. Finally, comprehensive tests can be performed on datasets of different sizes to evaluate the model's generalizability. Declarations Competing interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article. Ethics approval: Not Applicable. The article does not involve any human or animal participants. No ethical approval is required. Author contributions: All authors made a significant contribution to the work reported. H.C.R.: conception, study design, acquisition of data, software, analysis, writing, editing, and interpretation. V.T.: study design, acquisition of data, software, analysis, writing, and interpretation. 
K.K.: analysis, writing, editing, and interpretation. All authors have read and agreed to the published version of the manuscript. Funding: Not Applicable. Data availability: The data that support the findings of this study are available from the corresponding authors upon reasonable request. References [dataset] Jabbari, H., Bigdeli, N., Shojaei, M., 2023. Cracks in concrete structures (CICS) dataset. Mendeley Data, v1. https://doi.org/10.17632/9brnm3c39k.1. [dataset] Kassem, R., 2023. Crack Dataset. Kaggle. https://www.kaggle.com/datasets/reemkassem/crack-dataset. [dataset] Oluwaseun, O., 2023. Concrete & Pavement Crack Dataset. Kaggle. https://doi.org/10.34740/kaggle/dsv/5130126. [dataset] Reis, H.C., Turk, V., Bozkurt, M.F., Yigit, S.N., 2024. Concrete Cracks Image Dataset (CCID). Mendeley Data, v2. https://doi.org/10.17632/fgjy2s3nk7.2. Ahmed, T.U., Hossain, M.S., Alam, M.J., Andersson, K., 2019. An integrated CNN-RNN framework to assess road crack. In: 2019 22nd International Conference on Computer and Information Technology (ICCIT). pp. 1-6. https://doi.org/10.1109/ICCIT48885.2019.9038607. Beyer, L., Izmailov, P., Kolesnikov, A., Caron, M., Kornblith, S., Zhai, X., Minderer, M., Tschannen, M., Alabdulmohsin, I., Pavetic, F., 2023. Flexivit: One model for all patch sizes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR, pp. 14496-14506. https://doi.org/10.1109/CVPR52729.2023.01393. Chang, S., Zheng, B., 2024. A lightweight convolutional neural network for automated crack inspection. Construction and Building Materials 416, 135151. https://doi.org/10.1016/j.conbuildmat.2024.135151. Chen, J., Kao, S.H., He, H., Zhuo, W., Wen, S., Lee, C.H., Chan, S.H.G., 2023a. Run, don't walk: chasing higher FLOPS for faster neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. CVPR, pp. 12021-12031. https://doi.org/10.1109/CVPR52729.2023.01157. 
Chen, T., Cai, Z., Zhao, X., Chen, C., Liang, X., Zou, T., Wang, P., 2020. Pavement crack detection and recognition using the architecture of segNet. Journal of Industrial Information Integration 18, 100144. https://doi.org/10.1016/j.jii.2020.100144. Chen, Y., Zhu, Z., Lin, Z., Zhou, Y., 2023b. Building surface crack detection using deep learning technology. Buildings 13 (7), 1814. https://doi.org/10.3390/buildings13071814. Cubero-Fernandez, A., Rodriguez-Lozano, F.J., Villatoro, R., Olivares, J., Palomares, J.M., 2017. Efficient pavement crack detection and classification. EURASIP Journal on Image and Video Processing 2017 (1), 39. https://doi.org/10.1186/s13640-017-0187-0. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Faghih-Roohi, S., Hajizadeh, S., Núñez, A., Babuska, R., De Schutter, B., 2016. Deep convolutional neural networks for detection of rail surface defects. In: 2016 International joint conference on neural networks (IJCNN). pp. 2584-2589. https://doi.org/10.1109/IJCNN.2016.7727522. Fang, F., Li, L., Gu, Y., Zhu, H., Lim, J.H., 2020. A novel hybrid approach for crack detection. Pattern Recognition 107, 107474. https://doi.org/10.1016/j.patcog.2020.107474. Fu, R., Zhang, Y., Zhu, K., Strauss, A., Cao, M., 2024. Real-time detection of concrete cracks via enhanced You Only Look Once Network: Algorithm and software. Advances in Engineering Software 195, 103691. https://doi.org/10.1016/j.advengsoft.2024.103691. Gandhi, M.A., Swaminathen, A.N., Patil, D.T., Ravitheja, A., Kamali, R., Rajput, A., 2023. Quantitative Evaluation to Detect Crack Depth in Beams Based on CNN-RNN-LSTM Approach. In: 2023 International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS). pp. 74-79. 
https://doi.org/10.1109/ICSSAS57918.2023.10331901. Gopalakrishnan, K., 2018. Deep learning in data-driven pavement image analysis and automated distress detection: A review. Data 3 (3), 28. https://doi.org/10.3390/data3030028. Guo, F., Qian, Y., Liu, J., Yu, H., 2023. Pavement crack detection based on transformer network. Automation in Construction 145, 104646. https://doi.org/10.1016/j.autcon.2022.104646. Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C., 2022. Cmt: Convolutional neural networks meet vision transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. CVPR, pp. 12175-12185. https://doi.org/10.1109/CVPR52688.2022.01186. Hoang, N.D., Nguyen, Q.L., Tien Bui, D., 2018. Image processing–based classification of asphalt pavement cracks using support vector machine optimized by artificial bee colony. Journal of Computing in Civil Engineering 32 (5), 04018037. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000781. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Jabbari, H., Bigdeli, N., 2024. A new hierarchical algorithm based on CapsGAN for imbalanced image classification. IET Image Processing 18 (1), 194-210. https://doi.org/10.1049/ipr2.12942. Kamaliardakani, M., Sun, L., Ardakani, M.K., 2016. Sealed-crack detection algorithm using heuristic thresholding approach. Journal of Computing in Civil Engineering 30 (1), 04014110. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000447. Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Li, L., Sun, R., 2019. Bridge crack detection algorithm based on image processing under complex background. Laser & Optoelectronics Progress 56 (6), 061002. http://dx.doi.org/10.3788/LOP56.061002. Li, Y., Li, H., Wang, H., 2018.
Pixel-wise crack detection using deep local pattern predictor for robot application. Sensors 18 (9), 3042. https://doi.org/10.3390/s18093042. Liu, F., Liu, J., Wang, L., 2022a. Deep learning and infrared thermography for asphalt pavement crack severity classification. Automation in Construction 140, 104383. https://doi.org/10.1016/j.autcon.2022.104383. Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B., 2022b. Swin transformer v2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. CVPR, pp. 12009-12019. https://doi.org/10.1109/CVPR52688.2022.01170. Ma, X., Li, Y., Yang, Z., Li, S., Li, Y., 2024. Lightweight network for millimeter-level concrete crack detection with dense feature connection and dual attention. Journal of Building Engineering 94, 109821. https://doi.org/10.1016/j.jobe.2024.109821. Matarneh, S., Elghaish, F., Rahimian, F.P., Abdellatef, E., Abrishami, S., 2024. Evaluation and optimisation of pre-trained CNN models for asphalt pavement crack detection and classification. Automation in Construction 160, 105297. https://doi.org/10.1016/j.autcon.2024.105297. Mohan, A., Poobal, S., 2018. Crack detection using image processing: A critical review and analysis. Alexandria Engineering Journal 57 (2), 787-798. https://doi.org/10.1016/j.aej.2017.01.020. Mohan, G.B., Kumar, R.P., Yogiraj, B., 2023. Deep Learning-Powered Concrete Crack Classification for Improved Structural Integrity. In: 2023 Seventh International Conference on Image Information Processing (ICIIP). pp. 844-849. https://doi.org/10.1109/ICIIP61524.2023.10537741. Nasimov, R., Cho, Y.I., 2025. Smart City Infrastructure Monitoring with a Hybrid Vision Transformer for Micro-Crack Detection. Sensors 25 (16), 5079. https://doi.org/10.3390/s25165079. Nyathi, M.A., Bai, J., Wilson, I.D., 2024. Deep learning for concrete crack detection and measurement. Metrology 4 (1), 66-81.
https://doi.org/10.3390/metrology4010005. Omoebamije, O., Omoniyi, T.M., Musa, A., Duna, S., 2023. An improved deep learning convolutional neural network for crack detection based on UAV images. Innovative Infrastructure Solutions 8 (9), 236. https://doi.org/10.1007/s41062-023-01209-3. Rashid, T., Mokji, M.M., Rasheed, M., 2024. Cracked concrete surface classification in low-resolution images using a convolutional neural network. Journal of Optics 1-13. https://doi.org/10.1007/s12596-024-02080-w. Russel, N.S., Selvaraj, A., 2024. MultiScaleCrackNet: A parallel multiscale deep CNN architecture for concrete crack classification. Expert Systems with Applications 249, 123658. https://doi.org/10.1016/j.eswa.2024.123658. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618-626. https://doi.org/10.1109/ICCV.2017.74. Shamsabadi, E.A., Xu, C., Rao, A.S., Nguyen, T., Ngo, T., Dias-da-Costa, D., 2022. Vision transformer-based autonomous crack detection on asphalt and concrete surfaces. Automation in Construction 140, 104316. https://doi.org/10.1016/j.autcon.2022.104316. Shashidhar, R., Manjunath, D., Shanmukha, S.M., 2024. CrackSpot: Deep learning for automated detection of structural cracks in concrete infrastructure. Asian Journal of Civil Engineering 25 (1), 1079-1090. https://doi.org/10.1007/s42107-023-00754-7. Shi, M., Li, H., Yao, Q., Zeng, J., Wang, J., 2024. Vision based nighttime pavement cracks pixel level detection by integrating infrared visible fusion and deep learning. Construction and Building Materials 442, 137662. https://doi.org/10.1016/j.conbuildmat.2024.137662. Sun, Z., Caetano, E., Pereira, S., Moutinho, C., 2023. Employing histogram of oriented gradient to enhance concrete crack detection performance with classification algorithm and Bayesian optimization. 
Engineering Failure Analysis 150, 107351. https://doi.org/10.1016/j.engfailanal.2023.107351. Tan, M., Le, Q., 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning (PMLR), Vol. 97, pp. 6105-6114. Teng, S., Liu, A., Chen, B., Wang, J., Wu, Z., Fu, J., 2024. Unsupervised learning method for underwater concrete crack image enhancement and augmentation based on cross domain translation strategy. Engineering Applications of Artificial Intelligence 136, 108884. https://doi.org/10.1016/j.engappai.2024.108884. Tolstikhin, I.O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., Dosovitskiy, A., 2021. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34, 24261-24272. Wang, C., Liu, H., An, X., Gong, Z., Deng, F., 2024. SwinCrack: Pavement crack detection using convolutional swin-transformer network. Digital Signal Processing 145, 104297. https://doi.org/10.1016/j.dsp.2023.104297. Yeung, C.C., Lam, K.M., 2024. Contrastive decoupling global and local features for pavement crack detection. Engineering Applications of Artificial Intelligence 133, 108632. https://doi.org/10.1016/j.engappai.2024.108632. Zhang, B., Zhang, Y., 2025. MSCViT: A small-size ViT architecture with multi-scale self-attention mechanism for tiny datasets. Neural Networks 188, 107499. https://doi.org/10.1016/j.neunet.2025.107499. Zhang, H., Ma, L., Yuan, Z., Liu, H., 2024a. Enhanced concrete crack detection and proactive safety warning based on I-ST-UNet model. Automation in Construction 166, 105612. https://doi.org/10.1016/j.autcon.2024.105612. Zhang, T., Qin, L., Zou, Q., Zhang, L., Wang, R., Zhang, H., 2024b. Crackscopenet: a lightweight neural network for rapid crack detection on resource-constrained drone platforms. Drones 8 (9), 417. https://doi.org/10.3390/drones8090417. 
Zhao, H., Qin, G., Wang, X., 2010. Improvement of canny algorithm based on pavement edge detection. In: 2010 3rd international congress on image and signal processing. Vol. 2, pp. 964-967. https://doi.org/10.1109/CISP.2010.5646923.

Footnotes
https://www.tensorflow.org/versions/r2.15/api_docs/python/tf (accessed 11th Jun 2024)
https://github.com/leondgarse/keras_cv_attention_models (accessed 11th Jun 2024)

Editorial decision: Major Revision 07 May, 2026; Reviewers agreed at journal 14 Mar, 2026; Reviewers invited by journal 02 Mar, 2026; Editor invited by journal 28 Feb, 2026; First submitted to journal 17 Feb, 2026
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-8892244","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":599402193,"identity":"f04a333e-d3d5-48ae-8814-4cc0fa71e000","order_by":0,"name":"HATİCE ÇATAL REİS","email":"","orcid":"https://orcid.org/0000-0003-2696-2446","institution":"Gumushane University: Gumushane Universitesi","correspondingAuthor":true,"prefix":"","firstName":"HATİCE","middleName":"ÇATAL","lastName":"REİS","suffix":""},{"id":599402194,"identity":"52f16322-4958-4104-85e9-1e7cef8d73f7","order_by":1,"name":"Veysel Turk","email":"","orcid":"","institution":"Harran University: Harran Universitesi","correspondingAuthor":false,"prefix":"","firstName":"Veysel","middleName":"","lastName":"Turk","suffix":""},{"id":599402195,"identity":"2aed642e-f468-4ac1-9f67-3b7b98ad49a1","order_by":2,"name":"Kourosh Khoshelham","email":"","orcid":"","institution":"The University of Melbourne","correspondingAuthor":false,"prefix":"","firstName":"Kourosh","middleName":"","lastName":"Khoshelham","suffix":""}],"badges":[],"createdAt":"2026-02-16
10:32:02","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8892244/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8892244/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":104013073,"identity":"55d928f2-996d-439e-8922-a87aacad0f3d","added_by":"auto","created_at":"2026-03-05 16:15:44","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":391258,"visible":true,"origin":"","legend":"\u003cp\u003eArchitecture of the proposed LG-CAFFNet model.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/3a3fbe1d9dd5ebc62e908283.png"},{"id":104013074,"identity":"e1287547-036f-4d22-8c44-487c138ea012","added_by":"auto","created_at":"2026-03-05 16:15:44","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":798120,"visible":true,"origin":"","legend":"\u003cp\u003eThe structure of the (a) CaffBlock, (b) Module A, (c) Module B, (d) Transition blocks.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/04507a8cb8f9936a70fd205c.png"},{"id":104402059,"identity":"e68bca1e-9413-42a6-80b1-e4fbeef701fa","added_by":"auto","created_at":"2026-03-11 12:14:08","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":340120,"visible":true,"origin":"","legend":"\u003cp\u003eThe structure of the PHCAR block.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/ff759b636734de1572db9f0d.png"},{"id":104013080,"identity":"f5031b2b-f65f-4b3f-bbf1-f415b000c5be","added_by":"auto","created_at":"2026-03-05 16:15:45","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":308540,"visible":true,"origin":"","legend":"\u003cp\u003eSample images from the crack datasets 
used in the experiment a) Cracks in Concrete Structures Dataset b) Concrete \u0026amp; Pavement Crack Dataset c) Crack Dataset and d) Concrete Cracks Image Dataset.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/e52b77ed0395d4c27db4b0a8.png"},{"id":104013075,"identity":"04251e3f-aa41-4263-9d54-8a3fed908067","added_by":"auto","created_at":"2026-03-05 16:15:44","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":723974,"visible":true,"origin":"","legend":"\u003cp\u003eVariation graphs of the validation loss of the deep learning models in different datasets (a) Cracks in the Concrete Structures Dataset, (b) Concrete \u0026amp; Pavement Crack Dataset, (c) Crack Dataset, and (d) Concrete Cracks Image Dataset.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/b00aed097634771727b4504c.png"},{"id":104013079,"identity":"d29196a5-04b6-4d4a-b193-cf0ecfc8fa47","added_by":"auto","created_at":"2026-03-05 16:15:45","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":950467,"visible":true,"origin":"","legend":"\u003cp\u003eComparisons of the ROC AUC of the deep learning models in different datasets (a) Cracks in the Concrete Structures Dataset, (b) Concrete \u0026amp; Pavement Crack Dataset, (c) Crack Dataset, and (d) Concrete Cracks Image Dataset.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/a9b65c947e39db42aba881cb.png"},{"id":104013076,"identity":"3293d570-acc0-41e7-b5c4-b8b2b3c2b641","added_by":"auto","created_at":"2026-03-05 16:15:44","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":362990,"visible":true,"origin":"","legend":"\u003cp\u003eComparisons of the confusion matrices of the LG-CAFFNet deep learning model in 
different datasets (a) Cracks in the Concrete Structures Dataset, (b) Concrete \u0026amp; Pavement Crack Dataset, (c) Crack Dataset, and (d) Concrete Cracks Image Dataset.\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/8e983745c8d7b7a1c13d8118.png"},{"id":104402930,"identity":"dee1ffae-bd85-4e83-aaee-d478d286a819","added_by":"auto","created_at":"2026-03-11 12:16:57","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":224949,"visible":true,"origin":"","legend":"\u003cp\u003eAnalysis of the impact of model components on validation loss, ACC, and F-1.\u003c/p\u003e","description":"","filename":"8.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/e946b4e1d0df6ce4921e8d34.png"},{"id":104013077,"identity":"a5c6adbc-7c25-44e7-8602-2a1d8655a705","added_by":"auto","created_at":"2026-03-05 16:15:45","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":1442234,"visible":true,"origin":"","legend":"\u003cp\u003eVisualization of LG-CAFFNet deep learning model's anomaly detection using class activation map.\u003c/p\u003e","description":"","filename":"9.png","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/6fdb285bd6f0955d345cbf8e.png"},{"id":104408474,"identity":"014fb65c-e73f-442a-a138-6d873adc459f","added_by":"auto","created_at":"2026-03-11 12:42:34","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":7045779,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8892244/v1/f22b5ebf-a0e9-49a9-8201-c645fb830baa.pdf"}],"financialInterests":"","formattedTitle":"Deep neural network with local-global context-aware feature fusion for crack detection","fulltext":[{"header":"1. 
Introduction","content":"\u003cp\u003eWith the population rapidly increasing daily, infrastructure has become more critical. Having safe structures has now become even more vital (Gopalakrishnan, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). Infrastructure and superstructure (public or government) health is vital for humanity. Environmental monitoring is required to protect healthy structures and contribute to their sustainability. Pavement, road, building, and bridge defects are sometimes on the surface and sometimes so small that they cannot be seen with the naked eye. Early detection of these defects/cracks is crucial. It is a proactive measure that can prevent potential risks to public safety (Chen et al., \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Additionally, crack detection is essential in various industrial applications (Fang et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Data for building, road, pavement, and bridge crack detection must be provided for monitoring, health, and management. Cameras, smartphone images, and satellite or unmanned aerial vehicle data can provide important information. However, it is difficult to automatically detect cracks in buildings, roads, and pavements using non-standard data. Although technology has developed today, in most cases, humans perform visual inspections. This is based on the expert's knowledge and experience. Moreover, what we ultimately seek from visual inspections is reliability and the ability to generate repeatable data consistently. Manual inspection and interpretation are costly, time-consuming, and subject to human error (Mohan and Poobal, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2018\u003c/span\u003e, Hoang et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). 
Therefore, automatic crack/defect detection continues to attract researchers' interest. Machine learning (Hoang et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) and deep learning algorithms (Faghih-Roohi et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) can also be combined with image processing steps (Zhao et al., \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2010\u003c/span\u003e, Mohan and Poobal, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2018\u003c/span\u003e, Li et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2018\u003c/span\u003e, Li and Sun, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) to detect cracks on surfaces. Advances in artificial intelligence have led to increased interest in deep learning-based crack/fracture detection methods as a possible solution to the problems caused by manual inspection (Matarneh et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Detecting cracks early reveals the first signs of road, asphalt, concrete, and pavement deterioration. Such early detection and classification of deterioration are essential for maintenance strategy and decision-making (Cubero-Fernandez et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2017\u003c/span\u003e, Matarneh et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn recent years, artificial intelligence-based Computer-Aided Diagnosis (CAD) systems have emerged with the potential to increase detection accuracy in concrete structures by quantitatively analyzing the morphology, distribution, and size of cracks. 
In the literature, deep learning models based on Convolutional Neural Networks (CNNs) structured within the framework of supervised learning are widely used in CAD systems (e.g., MultiScaleCrackNet (Russel and Selvaraj, \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), lightweight CNN (Chang and Zheng, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), and CNN\u0026thinsp;+\u0026thinsp;U-Net (Nyathi et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)). However, while the inductive bias of convolution operations in these models allows for effective learning of local features, it limits their capacity to represent long-distance dependencies and broad contextual relationships, making it difficult to capture the global context. The CNN architecture processes the patterns in the image by analyzing them hierarchically from low to high levels. The first convolutional layers extract the basic features of the image (e.g., edges and corners), while deeper layers make these features more abstract and semantic. This structure is highly successful in recognizing local patterns in image processing applications, such as crack detection. However, this focus on local patterns may limit the network's understanding of how those patterns relate within the global context of crack data. Increasing the layer depth expands the receptive field and allows more complex features to be extracted, but it also leads to polynomial growth in the number of parameters and computational cost. This situation creates significant limitations on the model's scalability, real-time performance, and computational efficiency, especially in fracture and crack detection applications performed with large datasets. 
Accordingly, developing CNN architectures that are computationally efficient and low-cost for fracture and crack detection is an important research priority in this field. In this context, the Depthwise Separable Convolution (DSC) technique (e.g., YOLO v5s\u0026thinsp;+\u0026thinsp;DE(+\u0026thinsp;CA)+Slim-Neck+RFEM (Ma et al., \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) and CrackScopeNet (Zhang et al., \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2024b\u003c/span\u003e)) is among the alternative approaches used to increase computational efficiency in modern CNN-based deep learning methods. Although this method lowers processing costs by reducing the parameter count relative to classical convolutions, its limited inter-channel interaction may restrict the model's representational power. In recent years, attention-based Vision Transformer (ViT) approaches (Dosovitskiy et al., \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) have emerged as an alternative to CNNs in fracture detection (Shamsabadi et al., \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2022\u003c/span\u003e, Nasimov and Cho, \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). These models capture long-range dependencies and global relationships more effectively than CNNs. However, their high computational costs and lack of inductive bias limit their ability to learn local crack details, especially under data-limited conditions, resulting in performance losses on small-scale datasets (Zhang and Zhang, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). Finally, the studies on fracture and crack detection conducted by Ahmed et al. (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) and Gandhi et al. 
(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) focus on the potential advantages of hybrid models based on CNN and Recurrent Neural Networks (RNN). When the existing methods in the literature are examined, it is observed that RNNs are not widely used in classification processes, especially in image-based datasets, for fracture and crack detection; when they are used, they are usually applied only in the last layers. This limitation prevents the RNN from effectively processing the multi-layer spatial features that the CNN extracts in its early layers. In particular, the accurate detection of complex geometric structures such as fractures and cracks requires evaluating local and global features in an integrated learning process. Therefore, developing methods that will allow the spatial features obtained by CNN to be processed more deeply by RNN can significantly improve the accuracy and generalization performance in this area.\u003c/p\u003e \u003cp\u003eLimitations of studies on crack detection in the literature include: (i) the inadequacy of CNN-based methods in capturing global context and long-range dependencies, (ii) limited research on integrating convolutional layers that extract local features with bidirectional recurrent networks that learn the sequential relationships among those features, leaving the potential of this approach insufficiently explored, and (iii) the limited ability of recently proposed ViT-based models, despite their strong global representation capabilities, to capture local crack details due to high computational costs and a lack of inductive bias. In this study, a lightweight and computationally efficient deep learning model, the Local-Global Context-Aware Feature Fusion Network (LG-CAFFNet), is proposed to minimize the limitations of existing approaches to fracture and crack detection. 
The proposed model integrates CNN blocks consisting of DSC and standard convolution layers, Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU) layers that handle bidirectional sequential dependencies, and the Multi-Head Attention (MHA) mechanism, which can model long-range spatial relationships. In the model, DSC blocks provide computational efficiency by extracting local features with a low number of parameters. At the same time, standard convolutions represent multi-scale spatial relationships with a wide receptive field. Furthermore, BiLSTM and BiGRU layers process local features obtained from convolution blocks in a sequential context and learn past and future context relationships bidirectionally. The MHA mechanism captures long-range contextual dependencies by calculating relationships between all locations in the input feature maps and enhances the model's representation capacity by integrating context information learned by different attention heads. Despite its 669-layer deep structure, the LG-CAFFNet model has only 1.48\u0026nbsp;million trainable parameters and a computational complexity of 0.75 GFLOPs. The proposed multi-component deep learning model offers a solution to the limited global context and long-range dependency learning capacity of CNN-based methods in the literature, as well as the integration shortcomings of bidirectional RNNs in modeling sequential spatial relationships. In addition, the model's lightweight structure and CNN-MHA fusion mitigate the limitations of pure transformer-based models, which exhibit limited performance in capturing local crack details due to their high computational cost and lack of convolutional inductive bias. 
This approach provides high computational efficiency in crack detection.\u003c/p\u003e \u003cp\u003eThe proposed model was developed using early fusion, multi-layer feature fusion (supported by residual/skip connections), and late fusion strategies to optimize information flow within the network. Combining these strategies, primarily through multi-layer fusion supported by residual/skip connections, integrates feature representations at different levels, stabilizes information flow and gradient propagation, and increases the generalization capacity of the model by enriching its representations. LG-CAFFNet combines features at different levels, providing a comprehensive capacity for analysis from micro-level local details to macro-level global context. This way, it can detect complex surface deteriorations and morphological structures of cracks with high precision. In addition, its optimized design provides a more efficient solution by reducing the high computational costs frequently encountered in traditional deep-learning models. LG-CAFFNet introduces a new perspective to deep learning by enhancing accuracy and generalizability in complex tasks, such as fracture and crack detection.\u003c/p\u003e \u003cp\u003eIn the experimental study, the proposed method's performance was compared with CNN, Multilayer Perceptron (MLP), and ViT-based deep learning models, and their effectiveness in crack detection was analyzed comprehensively. During the evaluation process, each model's strengths and weaknesses were determined, and their generalization ability and performance in real-world conditions were examined in detail. In this study, in addition to the Cracks in Concrete Structures Dataset, Concrete \u0026amp; Pavement Crack Dataset, and Crack Dataset, which are widely used in the literature, the Concrete Cracks Image Dataset, collected by the authors and containing different surface textures, was used. 
The diversity offered by this new dataset provides the opportunity to examine the robustness and generalization capacity of the proposed model and modern algorithms in real-world conditions in more depth. Experimental findings show that the proposed deep learning model can produce successful results in real-world applications by exhibiting high accuracy and generalization ability in studies conducted on datasets with different scales and various characteristics.\u003c/p\u003e \u003cp\u003e \u003cb\u003eThe main contributions can be summarized as\u003c/b\u003e:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eLG-CAFFNet deep learning model\u003c/b\u003e: An innovative deep learning model has been developed that integrates CNN, BiLSTM/BiGRU, and MHA mechanisms. This model is capable of achieving high accuracy in crack detection by effectively learning features that include local, sequential, global, and long-range relationships.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eLightweight and computationally efficient design\u003c/b\u003e: Despite its 669-layer deep structure, the proposed model achieves high computational efficiency in crack detection, utilizing only 1.48\u0026nbsp;million trainable parameters and a computational complexity of 0.75 GFLOPs.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eOptimizing information flow with fusion strategies\u003c/b\u003e: The proposed model was developed using early fusion, multi-layer feature fusion, and late fusion strategies. 
These strategies effectively combine feature representations at different levels, minimizing information loss during feature propagation and providing consistent and reliable generalization performance across various data samples and scenarios.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eComprehensive analysis at micro and macro levels for crack detection\u003c/b\u003e: LG-CAFFNet combines features at different levels, providing analysis capabilities from local micro-level details to the overall macro-level context, enabling high-precision analysis of complex surface deteriorations.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eGeneralization and adaptation capacity\u003c/b\u003e: The model demonstrates both generalization and adaptation capabilities by consistently performing well across different crack types and diverse data samples.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eThis paper is organized as follows: In Section \u003cspan refid=\"Sec2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, the methods that have been prominent in crack detection in recent years have been comprehensively reviewed. In Section \u003cspan refid=\"Sec3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, a detailed analysis of a deep learning model developed for detecting cracks in concrete structures is presented. Section \u003cspan refid=\"Sec6\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents the datasets and pre-processing phase used in the study, the implementation details of the deep learning models, and the metrics used to evaluate existing and proposed deep learning approaches. Section \u003cspan refid=\"Sec14\" class=\"InternalRef\"\u003e5\u003c/span\u003e presents the experimental results and discussion. 
Finally, Section \u003cspan refid=\"Sec20\" class=\"InternalRef\"\u003e6\u003c/span\u003e concludes the article.\u003c/p\u003e"},{"header":"2. Related work","content":"\u003cp\u003eMachine learning stands out as a powerful tool for early detection of road, concrete, and asphalt cracks. It can offer higher accuracy than traditional methods, reducing maintenance costs and increasing infrastructure safety. In this section, machine learning approaches used in crack detection and findings in related studies are summarized.\u003c/p\u003e \u003cp\u003eKamaliardakani et al. (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) developed a new algorithm for detecting sealed cracks on the road surface. The algorithm has three main components: preprocessing, segmentation, and postprocessing. In the preprocessing step, the effects of non-uniform background and pavement markings that may negatively affect the detection accuracy were reduced. In the segmentation step, cracks in the road surface were detected using four different thresholding methods (Otsu, maximum triangle distance, minimum error, and local minimum). In the postprocessing step, noise was removed with morphological opening and closing operations, and the accuracy of the detected cracks was increased by filling the gaps. The algorithm was tested with 110 sample images; 55 of these images (20 longitudinal sealed cracks, 20 transverse sealed cracks, 12 diagonal, and three alligator sealed cracks) contained cracks, while the other 55 (a maintenance hole cover, potholes, and discoloration spots) were crack-free regions. Experimental results showed that the algorithm performed well and consistently, with a recall of 87%, a precision of 98%, and an accuracy of 93%.\u003c/p\u003e \u003cp\u003eLiu et al. 
(\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2022a\u003c/span\u003e) used DenseNet, ResNet, and EfficientNet models and infrared thermography methods to classify asphalt pavement crack severity. In the experimental process, different crack levels and image types (visible, infrared, fusion) were evaluated on the dataset consisting of 2316 images. Experimental analysis showed that the EfficientNet-B3 model achieved the highest accuracy in all scenarios. In particular, the fusion image achieved 94.14% accuracy, the visible image 93.28%, and the infrared image 86.55%. In the transfer learning process, the pre-trained EfficientNet-B3 model was the most successful, with an accuracy rate of 95.88%. In general, deep learning models classified low-severity cracks better, while misclassifications increased in medium and high-severity cracks.\u003c/p\u003e \u003cp\u003eGuo et al. (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) developed the Swin Transformer-based Crack Transformer (CT) model for detecting pavement surface cracks. The model aims to reduce environmental noise using a Swin Transformer-based encoder and MLP-based decoder. The proposed model has been extensively evaluated on CFD, Crack500, and CrackSC datasets. According to the experimental results, it has been observed that the CT model generally achieves more successful results compared to other models by producing 94.60% mF1, 92.94% mPrecision, 96.41% mRecall values with CFD dataset; 88.73% mF1, 87.45% mPrecision, 90.12% mRecall values with Crack500 dataset; and 90.01% mF1, 90.09% mPrecision, 89.93% mRecall values with CrackSC dataset, respectively. These findings show that the proposed method provides an effective solution for pavement crack detection.\u003c/p\u003e \u003cp\u003eMatarneh et al. 
(\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) evaluated the performance of ten different pre-trained CNN architectures for detecting and classifying asphalt pavement cracks. In the study, various optimization techniques were compared, and an optimized CNN model for crack classification was developed, with DenseNet201 being determined as the most effective model. In addition, it was observed that the ShuffleNet and ResNet101 models also achieved successful results. In contrast, VGG16 and VGG19 models showed lower accuracy rates. DenseNet201 optimized with Grey Wolf optimization was tested on images containing different types and levels of noise, and its robustness and accuracy were proven. According to the experimental results, the optimized DenseNet201 model produced the most successful result with an accuracy rate of 98.73%.\u003c/p\u003e \u003cp\u003eYeung and Lam (\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) proposed the Contrastive Decoupling Network (CDNet) model for pavement crack detection. CDNet is developed with a contrastive learning framework that extracts global and local features separately to minimize challenges such as crack diversity, background complexity, and generalization ability. The Global Semantic Enhancement (GSE) module, Local Detail Refinement (LDR) module, and Dynamic Dependency-Aware Feature Aggregation (DDFA) method are added to improve the model's performance. In addition, three different contrastive loss functions are designed to optimize the global, local, and output features. CDNet was tested on Crack500, CrackTree200, CFD, and AEL datasets and obtained 0.683\u0026ndash;0.912 ODS, 0.724\u0026ndash;0.920 OIS, and 0.413\u0026ndash;0.903 AP values, respectively. The test results show that CDNet is more successful than existing methods.\u003c/p\u003e \u003cp\u003eTeng et al. 
(\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) performed image enhancement and augmentation operations using an Unsupervised Image-to-Image Translation (UNIT) network developed to solve the problems of low resolution and insufficient data volume in underwater concrete crack images. The UNIT network provides high-quality image transformation using an encoder-decoder structure. The network was developed using deep learning components such as Swin Transformer and ResNet-18. Swin Transformer provides effective extraction of local and global features. ResNet-18 has a lighter structure and provides faster and more efficient performance by reducing the computational requirements of the network. In addition, self-attention layers used in the network provide a more accurate capture of contextual information and long-distance dependencies, which enables the model to obtain more accurate results. In the experimental process, the clarity of low-resolution images in muddy water conditions was increased, and 45.2%, 40.4%, and 69.1% improvements were achieved in BRISQUE, NIQE, and PIQE metrics, respectively. In addition, converting the crack images from clean water and waterless environments to muddy water environments increased the number of images and improved the quality by at least 61.2%. The proposed method exhibited high performance despite difficulties such as low contrast and low illumination, and it was emphasized that it has the potential to provide a more comprehensive solution by combining it with sonar data in the future.\u003c/p\u003e \u003cp\u003eFu et al. (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) proposed the YOLO-Crack model, an improved version of the YOLOv3 model optimized for real-time detection of concrete cracks. 
The proposed model reduced its size by 97.4% and increased its detection rate by 50.5% compared to the original model, thanks to the new block designs enhanced with attention mechanisms and DSC. YOLO-Crack achieved 72.22% mAP in crack detection at 48.11 FPS, 1.3% higher than YOLOv3. Experimental results show that the YOLOv5l model offers higher accuracy but has a 14.2 times larger model size and lower detection rate than YOLO-Crack. On the other hand, YOLO-Crack stands out with its compact structure and fast detection ability compared to other models, such as YOLOv4 and YOLOv5l.\u003c/p\u003e \u003cp\u003eZhang et al. (\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2024a\u003c/span\u003e) proposed an improved Swin-Transformer-UNet (I-ST-UNet) model for detecting concrete cracks and calculating their widths. The proposed model improved the semantic segmentation performance by integrating Swin-Transformer blocks into the UNet architecture. The model's performance was tested with a dataset consisting of 2030 images. The model improved the semantic segmentation performance by providing 0.7% accuracy, 2.25% mean accuracy, 5.77% mean intersection-over-union, and 1% frequency weight intersection-over-union improvements. In the crack width calculations, the relative error remained below 5% for cracks between 0.1 and 0.2 mm and over 0.2 mm; 98.35% accuracy was achieved in safety warnings for cracks exceeding 0.2 mm. The experimental results demonstrated the effectiveness of the I-ST-UNet model in segmentation performance. They showed that the model can be used in various applications, from road maintenance to infrastructure monitoring.\u003c/p\u003e \u003cp\u003eShi et al. (\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) proposed a deep-learning model that combines infrared and visible light images to segment road cracks at night. 
First, a fusion technique was developed to integrate infrared and visible light data to enhance crack visibility in low-light conditions. Then, a network enhanced with a dynamic sparse attention mechanism was used to segment these enhanced images. Experimental results show that the proposed model provides higher accuracy (97.74%), mIoU (77.89%) and mPA (85.68%) than existing methods such as Unet, PSPNet and DeepLabv3+.\u003c/p\u003e \u003cp\u003eWang et al. (\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) proposed the Swin-Transformer-based SwinCrack model for automatic and accurate detection of asphalt cracks. The model is enhanced with convolution modules to address the limited receptive field of traditional CNN methods. Experiments with the Crack500, CrackTree260, CrackLS315, Stone331, CRKWH100, and CFD datasets show that SwinCrack performs particularly well in detecting long and thin cracks. The model achieved OIS values of 0.781\u0026ndash;0.880 on different datasets and a 4.4% improvement in AP score compared to its closest competitor. Furthermore, ablation studies showed that the convolution modules improved performance by better modeling local contexts and reduced the number of model parameters by 22.1% and the computational load by 18%.\u003c/p\u003e"},{"header":"3. Proposed model","content":"\u003cp\u003eThe LG-CAFFNet model is a deep learning framework that extracts comprehensive contextual information in image processing tasks. The model combines standard convolutional layers, DSC layers, MHA, BiLSTM, and BiGRU. In addition, three data integration strategies (late fusion, multi-layer feature fusion, and early fusion) are applied to optimize the model's overall performance. This section presents a detailed review of the proposed LG-CAFFNet model's basic components and network structure. 
The network structure of the LG-CAFFNet architecture is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eLG-CAFFNet's architectural design is focused on high accuracy and efficiency, optimized by combining multi-scale feature extraction and hierarchical data processing mechanisms. The model's initialization phase starts with an input block that processes the input data of 224\u0026times;224\u0026times;3 and extracts the basic features. At this stage, the two Conv2D layers (with 5\u0026times;5 filters), Batch Normalization (BN), and Rectified Linear Unit (ReLU) activations ensure that the data is processed effectively. The MaxPooling2D (3\u0026times;3) layer supports scaling the features extracted from the input and transferring these features to higher-level learning stages. This module is designed to extract feature maps from the input data and pass these features to higher-level learning layers hierarchically. This iterative process facilitates the detection of complex patterns by learning the model's more abstract (high-level) and specific features. The feature matrix obtained at the initial stage is transferred to the Module A and Module B blocks, which form the model's basic processing units. These blocks process different feature groups over a parallel architecture, and each group supports the other to perform more detailed and comprehensive feature extraction. The features obtained from Module A and Module B are combined via the Concatenate (Concat) process, which is carried out within the scope of the early fusion technique. This technique provides information integration at the early stage of the model. This process ensures that the information obtained from different modules is effectively integrated, significantly increasing the model's learning capacity and overall performance. 
The integrated features are then transferred to the Convolution Adaptive Feature Fusion Blocks (CaffBlocks) in the later stages of the model to support advanced feature extraction and relational learning. Each CaffBlock consists of three main components: Module A, Module B, and Transition blocks. These blocks are complemented by Parallel Hybrid Convolutional Attentional Recurrent (PHCAR) blocks to increase the model's multi-scale learning capacity. The PHCAR blocks take the features produced by early fusion and by the Transition blocks and model their long-distance dependencies, contextual information, and relationships across feature levels to create a richer and more abstract representation. This process integrates information from multiple contexts and enables long-range relationships to be learned effectively. In the final stages of the model, the features obtained from the CaffBlock and PHCAR blocks are combined with the late fusion technique. They are then forwarded to the global average pooling layer and converted into a compact vector, which increases parameter efficiency and reduces the risk of overfitting. Finally, classification is performed by a dense output layer with the Softmax activation function. One of the proposed model's most notable features is the balance between its deep structure and parameter efficiency. Although the model consists of 669 layers, it contains only 1.48\u0026nbsp;million trainable parameters and requires 0.75 GFLOPs. This design shows that a deep network can achieve high performance with a low number of parameters. 
It provides a significant advantage, especially when computational resources are limited or real-time applications are required.\u003c/p\u003e \u003cp\u003eThe design of LG-CAFFNet, together with optimized multi-scale feature fusion, provides a structure that significantly increases the ability to learn meaningful and abstract representations from visual data. This architecture improves the model's learning capacity in large data sets and complex tasks by effectively integrating information at different resolution levels. LG-CAFFNet provides a significant performance advantage with a low number of parameters, especially in applications requiring high precision and accuracy, such as fracture and crack detection. This enables the model to work faster and more efficiently while achieving high-accuracy performance. Thus, while the model's computational efficiency is optimized, a more practical and effective solution is provided in real-world applications. In this section, the primary structural components of the architecture are examined in detail.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1. Convolution adaptive feature fusion block\u003c/h2\u003e \u003cp\u003eThe CaffBlock block, the basic structural unit of the LG-CAFFNet model, has a sophisticated architecture consisting of various submodules. This block consists of three Module A, one Module B, one transition block, two convolutional layers, BN and one addition (add) layer. The network structure of the CaffBlock block architecture is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea. In the CaffBlock block, the hierarchical feature extraction ability of the model is improved with the modules used. While the first three modules (Module A) enable the model to learn basic features effectively, the last module (Module B) improves the model's ability to capture more complex and abstract features. 
This structure allows the model to learn features at different levels in stages. In this way, a gradual transition from low-level features to high-level features is provided, making it possible to learn deeper and more complex information. This progressive learning approach optimizes the model's generalization ability and performance, especially in high-dimensional data sets and complex tasks, while also preserving computational efficiency. In addition, the multi-layer feature fusion technique was applied in the design of the CaffBlock block. This approach enables the integration of feature maps at different depths, effectively combining information at different model abstraction levels, allowing the model to perform richer and more comprehensive feature extraction.\u003c/p\u003e \u003cp\u003eModule A block is designed to extract features at different scales using convolution kernels of different sizes (1\u0026times;1 and 3\u0026times;3). This approach increases the model's ability to analyze complex data structures by enabling the model to learn low- and high-level features effectively. The basis of the Module A block is DSC technology. This technique increases the model's computational efficiency compared to standard convolutions while significantly reducing the number of model parameters. In traditional CNN, convolutional filters process spatial and channel-level features of the input data together. While spatial features define local patterns and structures, channel-level features model the interactions between channels, combining different filters to create more abstract and meaningful representations. This process allows obtaining higher-level data representations by learning relationships at both levels. 
However, the computational cost of this approach increases in proportion to the filter sizes and channel numbers at O \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\left({D}_{k}^{2}\\times\\:{C}_{in}\\times\\:{C}_{out}\\right)\\)\u003c/span\u003e\u003c/span\u003e level (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{D}_{k}\\)\u003c/span\u003e\u003c/span\u003e represents the filter size, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{in}\\)\u003c/span\u003e\u003c/span\u003e represents the number of input channels, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{out}\\)\u003c/span\u003e\u003c/span\u003e represents the number of output channels), which creates a significant computational burden, especially for large datasets and deep network structures. Therefore, standard convolutions are limited in terms of efficiency and scalability due to their high parameter density and computational requirements. More efficient alternative methods, such as DSC, can be preferred to minimize these limitations and optimize the CNN model's computational costs and parameter counts. This technique reduces the computational cost to O\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\left({D}_{k}^{2}\\times\\:{C}_{in}+{C}_{in}\\times\\:{C}_{out}\\right)\\)\u003c/span\u003e\u003c/span\u003e by separating the classical convolution process into two stages: depthwise convolution and pointwise convolution. In this technique, the depthwise convolution stage independently extracts spatial features for each channel. In contrast, the pointwise convolution stage creates more comprehensive and rich feature representations by modeling channel relationships using 1\u0026times;1 filters. 
This structure significantly reduces the number of model parameters and the processing volume, resulting in a more efficient and lightweight model. In addition, optimizing spatial and channel-level features can increase the efficiency of the model's learning process. Although the DSC technique offers significant computational advantages over traditional convolutional methods, its limited modeling of interactions between channels may restrict the learning of more complex feature relationships. This may in turn reduce the learning and generalization capacity of the model and, therefore, its overall performance. To mitigate these potential problems, the model proposed in this study also uses standard convolution layers in specific places (initial block, skip connection, Module B block (stage 2), Transition block). This hybrid approach increases the computational efficiency of the model while also providing deep feature extraction and rich feature representations. The basic structural components of Module A include SeparableConv2D, BN, the ReLU activation function, and the Concat layer. In this module, the SeparableConv2D layers use different numbers of filters (e.g., 48, 36, 32), different kernel sizes (1\u0026times;1, 3\u0026times;3), and \u0026ldquo;same\u0026rdquo; padding. The network structure of the Module A block architecture is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb.\u003c/p\u003e \u003cp\u003eThe Module B block consists of two stages. The first stage has a network architecture similar to the Module A block but with a deeper structure, and it is also built with DSC. The second stage is a Residual Feed-Forward Network (RFFN) block consisting of a feed-forward neural network. 
The design of this block includes a skip connection inspired by the residual network architecture. The skip connection improves gradient flow in the deep neural network. In this way, the network's overall learning capacity and training performance are increased. The first stage of Module B performs multi-scale feature extraction. The second stage focuses on extracting higher-level and abstract features. This hybrid structure facilitates the detection of complex patterns by combining features at different levels and improving the gradient flow. Thus, gradient vanishing/exploding problems in the model are reduced. As a result, the integration of these two approaches can increase the model's adaptive capabilities and learning capacity, providing stronger generalization performance on different datasets and various task types. Module B's structural components include SeparableConv2D, Conv2D, BN, the ReLU activation function, the Concat layer, 2D upsampling, and 2D max pooling. In this module, the SeparableConv2D layers use different numbers of filters (e.g., 36, 32, 28), different kernel sizes (1\u0026times;1, 3\u0026times;3), and \u0026ldquo;same\u0026rdquo; padding. The Conv2D layers use different numbers of filters (e.g., 10, 8, 6), a 1\u0026times;1 kernel size, and \u0026ldquo;same\u0026rdquo; padding. The network structure of the Module B block architecture is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec.\u003c/p\u003e \u003cp\u003eThe Transition block is the last structural component of the CaffBlock and is developed based on the basic principles of feedforward neural networks. This block also includes a skip connection structure. The Transition block is included in the layers of the network to increase the ability of deep neural networks to extract higher-level, more abstract, and more complex features. 
The basic structural components of the Transition block include Conv2D, BN, ReLU activation function, and add layer. In this module, the Conv2D layer consists of different filters (e.g., 64, 48, 24, etc.), different kernel sizes (1\u0026times;1, 3\u0026times;3), \u0026ldquo;same\u0026rdquo; padding, and kernel initializer (\u0026ldquo;he_normal\u0026rdquo;) parameters and values. The network structure of the Transition block architecture is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2. Parallel hybrid convolution attentional recurrent block\u003c/h2\u003e \u003cp\u003eThe PHCAR block was created by integrating CNN, MHA, BiLSTM, and BiGRU technologies. In this block, the feature matrices from the early fusion and Transition block (located in the CaffBlock blocks) are first passed through the convolutional layer to extract local features in the image. The feature matrices obtained from this process are converted to a two-dimensional format with the reshape operation after passing through the BN and ReLU activation functions. After the resizing process, these features are processed through the MHA, BiLSTM, and BiGRU mechanisms in parallel.\u003c/p\u003e \u003cp\u003eThe MHA mechanism optimizes the parallel context learning capacity in detecting fracture and crack regions in the LG-CAFFNet model, enabling more precise and in-depth analysis. This mechanism evaluates the importance levels of different regions in the image, making it possible to model the morphological features, fine details, and continuity of cracks more accurately. Thus, in addition to correctly analyzing local features, global contexts are also processed effectively. 
In particular, the evaluation of irregular fracture and crack structures, together with the surrounding context, enables the model to precisely determine the starting and ending points of these structures and examine the relationships between regions in detail. The MHA mechanism starts by computing the query (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:Q\\)\u003c/span\u003e\u003c/span\u003e), key (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:K\\)\u003c/span\u003e\u003c/span\u003e), and value (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:V\\)\u003c/span\u003e\u003c/span\u003e) matrices with different weights for each head: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{Q}_{i}=X{W}_{i}^{Q}\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{K}_{i}=X{W}_{i}^{K},\\:\\:{V}_{i}=X{W}_{i}^{V}\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:X\\)\u003c/span\u003e\u003c/span\u003e represents the feature maps coming from the standard convolution (Conv2D) layer and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\:{W}_{i}^{Q},\\:{W}_{i}^{K}\\:and\\:{W}_{i}^{V}\\)\u003c/span\u003e\u003c/span\u003e are the weight matrices learned for each head. 
In the next step, attention is computed for each head, and the head outputs are calculated as follows: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{head}_{i}=Attention\\:\\left({Q}_{i},\\:{K}_{i},\\:{V}_{i}\\right)=\\:Softmax\\:\\left(\\frac{{Q}_{i}{K}_{i}^{T}}{\\sqrt{{d}_{k}}}\\right){V}_{i}\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{d}_{k}\\)\u003c/span\u003e\u003c/span\u003e represents the dimension of the key vectors. Finally, the head outputs are concatenated and projected: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:MultiHead\\:\\left(Q,\\:K,\\:V\\right)=Concat\\:\\left({head}_{1},\\:\\dots\\:,{head}_{h}\\right){W}^{O}\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{W}^{O}\\)\u003c/span\u003e\u003c/span\u003e is the weight matrix that projects the output. Thanks to their bidirectional architectures, BiLSTM and BiGRU provide an effective solution for contextual information extraction, allowing for forward and backward information flow. In applications where contextual information extraction is critical, such as fracture and crack detection, the structural differences between Long Short-Term Memory and Gated Recurrent Unit play a decisive role in selecting modeling strategies appropriate for the task type. With its bidirectional architecture, BiLSTM effectively models long-term dependencies, such as the start and end points of cracks, enabling detailed structural information to be extracted. In contrast, BiGRU's computational efficiency allows it to quickly learn the general features of fractured regions, which is an advantage when time and resources are constrained. 
The combined use of these two structures allows for more in-depth and multifaceted modeling of contextual relationships, enabling local and global contexts to be represented holistically and effectively in feature maps. This integration increases the capacity of models such as LG-CAFFNet to analyze the surrounding context of fractured areas more comprehensively. It also improves the model's analytical accuracy and its representation of the surrounding context in critical tasks such as fracture and crack detection. The features obtained from the MHA, BiLSTM, and BiGRU branches are combined through the add layer after passing through BN and ReLU and are then reshaped into a three-dimensional format with the reshape operation. In the final stage, the output of the PHCAR block is generated by applying Layer Normalization. This hybrid architecture can effectively analyze complex data structures by combining the advantages of different feature extraction and processing techniques. The detailed diagram of the PHCAR block architecture is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Experimental setup","content":"\u003cp\u003eThis section describes in detail the datasets used in crack classification and the applied preprocessing techniques. In addition, specific details regarding the implementation of the methodologies are presented, and the metrics used to evaluate the classification performance are specified.\u003c/p\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e4.1. 
Datasets and preprocessing\u003c/h2\u003e \u003cp\u003eIn this study, the effectiveness of the proposed models and modern deep learning algorithms in the classification process of cracks was evaluated on four different datasets: Cracks in Concrete Structures Dataset, Concrete \u0026amp; Pavement Crack Dataset, Crack Dataset and Concrete Cracks Image Dataset. The number of data used in the training and testing processes of the deep learning algorithms and other details about the datasets are presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. Example images of the datasets are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eData distribution statistics.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026times;\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eImage type\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTrain\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTest\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo. of samples\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eImage size\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eClass label\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eTotal instances\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eCracks in Concrete Structures Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWithout Crack\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2839\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1161\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e12,000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimple Cracks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e2794\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1206\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMultibranched Crack\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2767\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1233\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eConcrete \u0026amp; Pavement Crack Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNegative\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e5276\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e7500\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e15,000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePositive\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e5224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2276\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e7500\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eCrack Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClear\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e892\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e408\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1300\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e3900\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eShallow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c3\"\u003e \u003cp\u003e937\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e363\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1300\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeep\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e901\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e399\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1300\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eConcrete Cracks Image Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNo cracks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e758\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e317\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1075\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e2126\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCracks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e730\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e321\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1051\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c6\"\u003e \u003cp\u003e224\u0026thinsp;\u0026times;\u0026thinsp;224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section3\"\u003e \u003ch2\u003e4.1.1. Cracks in the Concrete Structures Dataset\u003c/h2\u003e \u003cp\u003eThe Cracks in the Concrete Structures Dataset (Jabbari et al., 2023) was obtained from concrete structures on the Imam Khomeini International University campus and in Qazvin City, Iran. A Phantom 4 Pro drone equipped with 20-megapixel, Full HD resolution cameras was used to capture the images. The drone data were recorded both as video and as color images, yielding 900 color images with 4K resolution. These images were converted to grayscale and divided into 12,000 small images of 330\u0026times;330 pixels. The images were divided into three categories of 4000 images each: Without Crack, Simple Cracks, and Multibranched Crack. In this study, all images were resized to 224\u0026times;224 pixels in the pre-processing stage to match the input layer of the deep learning models. 
The images were resized using the bicubic interpolation technique. This method uses a cubic polynomial function to calculate pixel values. It is generally preferred for enlarging low-resolution images or reducing high-resolution images. After resizing, data normalization was performed; at this stage, pixel values were scaled to the range 0 to 1. After normalization, the images were labeled by category: 0 for the \u0026ldquo;Without Crack\u0026rdquo; category, 1 for the \u0026ldquo;Simple Cracks\u0026rdquo; category, and 2 for the \u0026ldquo;Multibranched Crack\u0026rdquo; category. In the experimental process, the dataset was split into 70% training and 30% test data. Of the training data, 30% was used in the validation phase of the model. This dataset can be accessed via the link: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://data.mendeley.com/datasets/9brnm3c39k/1\u003c/span\u003e\u003cspan address=\"https://data.mendeley.com/datasets/9brnm3c39k/1\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section3\"\u003e \u003ch2\u003e4.1.2. Concrete \u0026amp; Pavement Crack Dataset\u003c/h2\u003e \u003cp\u003eThe Concrete \u0026amp; Pavement Crack Dataset was collected by Oluwaseun (2023). This dataset contains concrete and pavement surface images collected at the Nigerian Army University Biu in Borno State, Nigeria. The images were collected using a DJI Mavic 2 Enterprise drone and a smartphone and saved as JPEG in RGB format. The dataset has two categories: Negative and Positive. The images have a resolution of 170\u0026times;227 pixels. In this study, 15,000 images were used, with 7500 images in each category (Negative and Positive). In the pre-processing stage of the dataset, resizing, data normalization, and labeling processes were performed. 
In the resizing process, all images were resized to 224\u0026times;224 pixels using the bicubic interpolation technique. After this stage, the image pixel values were scaled to the range 0\u0026ndash;1. In the labeling process, label 0 was defined for the \u0026ldquo;Negative\u0026rdquo; category, and label 1 was defined for the \u0026ldquo;Positive\u0026rdquo; category. In the experimental process, the dataset was split into 70% training and 30% test data. Of the training data, 30% was used in the model's validation phase. This dataset can be accessed via the link: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.kaggle.com/datasets/oluwaseunad/concrete-and-pavement-crack-images\u003c/span\u003e\u003cspan address=\"https://www.kaggle.com/datasets/oluwaseunad/concrete-and-pavement-crack-images\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section3\"\u003e \u003ch2\u003e4.1.3. Crack Dataset\u003c/h2\u003e \u003cp\u003eThe Crack Dataset (Kassem, 2023) consists of (1) Clear, (2) Shallow, and (3) Deep categories. It contains 3900 images in total, 1300 in each category. In the pre-processing phase of the dataset, resizing, data normalization, and labeling processes were performed. In the resizing process, all images were resized to 224\u0026times;224 pixels using the bicubic interpolation technique. After the resizing phase, the image pixel values were scaled to the range 0 to 1. In the labeling process, 0 was defined for the \u0026ldquo;Clear\u0026rdquo; category, 1 for the \u0026ldquo;Shallow\u0026rdquo; category, and 2 for the \u0026ldquo;Deep\u0026rdquo; category. In the experimental process, the dataset was split into 70% training and 30% test data. Of the training data, 30% was used in the model's validation phase. 
This dataset can be accessed via the link: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.kaggle.com/datasets/reemkassem/crack-dataset\u003c/span\u003e\u003cspan address=\"https://www.kaggle.com/datasets/reemkassem/crack-dataset\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section3\"\u003e \u003ch2\u003e4.1.4. Concrete Cracks Image Dataset\u003c/h2\u003e \u003cp\u003eThe authors collected the Concrete Cracks Image Dataset (Reis and Turk, 2024) at G\u0026uuml;m\u0026uuml;şhane University Faculty of Engineering and Natural Sciences in Turkey. This dataset contains concrete crack images. The images were collected using Samsung Galaxy M31 and Samsung Galaxy A50 smartphones running the Android operating system and saved as JPEG in RGB format. The dataset has two categories: \u0026ldquo;No Cracks\u0026rdquo; and \u0026ldquo;Cracks.\u0026rdquo; The original images have different pixel resolutions, such as 1504\u0026times;3264 and 1860\u0026times;4032. The dataset contains 1075 images in the \u0026ldquo;No Cracks\u0026rdquo; category and 1051 images in the \u0026ldquo;Cracks\u0026rdquo; category. In the pre-processing stage of the dataset, resizing, data normalization, and labeling processes were performed. In the resizing process, all images were resized to 224\u0026times;224 pixels using the bicubic interpolation technique. After this stage, the image pixel values were scaled to the range 0\u0026ndash;1. In the labeling process, 0 was defined for the \u0026ldquo;No Cracks\u0026rdquo; category, and 1 was defined for the \u0026ldquo;Cracks\u0026rdquo; category. In the experimental process, the dataset was split into 70% training and 30% test data. Of the training data, 30% was used in the validation phase of the model. 
This dataset can be accessed via the link: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://data.mendeley.com/datasets/fgjy2s3nk7/2\u003c/span\u003e\u003cspan address=\"https://data.mendeley.com/datasets/fgjy2s3nk7/2\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.2. Implementation details\u003c/h2\u003e \u003cp\u003eIn this study, CNN-, Transformer-, and MLP-based deep learning algorithms trained from scratch were used to detect cracks. In addition to the proposed LG-CAFFNet, the evaluated models were MLP-Mixer (Tolstikhin et al., \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2021\u003c/span\u003e), EfficientNetB2 (Tan and Le, \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), MobileNet (Howard et al., \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2017\u003c/span\u003e), FasterNet (Chen et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2023a\u003c/span\u003e), CMT (Guo et al., \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), Swin Transformer V2 (Liu et al., \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2022b\u003c/span\u003e), and FlexiViT (Beyer et al., \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). The TensorFlow v2.15.0 framework\u003csup\u003e1\u003c/sup\u003e was used to implement the LG-CAFFNet, EfficientNetB2, and MobileNet deep learning models. The Keras CV Attention GitHub repository\u003csup\u003e2\u003c/sup\u003e was used to implement the FasterNet, CMT, Swin Transformer V2, FlexiViT, and MLP-Mixer deep learning models. The experimental process was carried out in the Google Colab Pro environment. 
The system specifications were an Intel(R) Xeon(R) CPU @ 2.20GHz, an NVIDIA L4 GPU with 22.5 GB of graphics memory (driver version 535.104.05, CUDA version 12.2), 53.0 GB of RAM, and 78.2 GB of hard disk space. The study was carried out using the Python programming language.\u003c/p\u003e \u003cp\u003eIn this study, a comprehensive methodology was applied for the training and evaluation of deep learning models. Models were trained under the same conditions for 50 epochs within the framework of the training-validation-test paradigm. Backpropagation and optimization processes were used in the training process of deep learning models. The backpropagation algorithm was used to calculate the gradient of the loss function with respect to the model's weights and biases, and the parameters were updated accordingly. The optimization process was carried out using the Adam algorithm (Kingma and Ba, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2014\u003c/span\u003e). This algorithm provided fast convergence and balanced performance by using adaptive learning rates. Another important hyperparameter used in training the deep learning models is the batch size (32), which balances stochastic and deterministic behavior in gradient calculations and helps keep the training process stable and fast. The training process aimed to minimize the training loss and validation loss values. For this purpose, the categorical cross-entropy loss function was used. Categorical cross-entropy is a loss function that measures the difference between the one-hot encoded actual labels and the class probabilities predicted by the model. In this study, when the performance of the deep learning models stagnated or degraded, the learning rate (initially 1.0e-3) was dynamically adjusted. 
Specifically, when no improvement was observed in the validation loss for two consecutive epochs (the patience value), the learning rate was reduced by a factor of 0.5; this continued over the 50 training epochs, down to a minimum learning rate of 1.0e-5. This adaptive learning rate strategy helped the model converge stably and reduced the risk of stalling in poor local minima. During the implementation of deep learning algorithms, the ModelCheckpoint and ReduceLROnPlateau callbacks of the TensorFlow library were used. ModelCheckpoint was used to save the best model, while ReduceLROnPlateau was used to adjust the learning rate dynamically. In all models, the input layer size was 224\u0026times;224\u0026times;3, and the output layer used the Softmax activation function. Probabilistic class predictions were obtained with the Softmax function. After the training of the deep learning models was completed, performance analysis was performed on the test dataset using the model with the lowest validation loss. This provided an objective measurement of the model's generalization ability. In this study, binary and multi-class (three-class) classification tasks were carried out. In this way, the adaptation of deep learning models to problems of varying complexity was tested.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e4.3. Evaluation metrics\u003c/h2\u003e \u003cp\u003eThis research used various evaluation metrics to measure the success of the proposed method and modern deep learning models in binary and multi-class classification for crack detection. Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows the evaluation metrics and their mathematical formulations. The evaluation metrics used include Accuracy (ACC), Sensitivity (SN), Positive Predictive Value (PPV), F-1 score (F-1), and Receiver Operating Characteristic Area Under the Curve (ROC AUC). 
In Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, TN, TP, FN, and FP denote True Negative, True Positive, False Negative, and False Positive, respectively. In this study, the PPV, SN, and F-1 metrics were calculated with the macro-average method in the multi-class classification process. Macro averaging weights the performance of each class equally and reports the average; larger classes therefore do not dominate the score.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance metrics for binary and multi-class classification.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003ePerformance metrics\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eBinary-class classification metrics\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eMulti-class classification metrics\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePerformance metrics\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMathematical 
Expression\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePerformance metrics\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMathematical Expression\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAccuracy (ACC)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{TP\\:+\\:TN}{TP\\:+\\:TN\\:+\\:FP\\:+\\:FN}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAccuracy (ACC)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{\\sum\\:_{i=1}^{n}{TP}_{i}}{Total\\:Number\\:of\\:Test\\:Samples}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePositive Predictive Value (PPV)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{TP}{TP\\:+\\:FP}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePositive Predictive Value (PPV) (macro)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{1}{n}\\sum\\:_{i=1}^{n}{PPV}_{i}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSensitivity (SN)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{TP}{TP\\:+\\:FN}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSensitivity (SN) (macro)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{1}{n}\\sum\\:_{i=1}^{n}{SN}_{i}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eF1-score (F-1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{2\\:\\times\\:PPV\\:\\times\\:\\:SN\\:}{PPV\\:+\\:SN}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eF1-score (F-1) (macro)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{1}{n}\\sum\\:_{i=1}^{n}{F1}_{i}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5. Experimental results and discussion","content":"\u003cp\u003eIn this study, the performance of the proposed model and state-of-the-art deep learning algorithms in crack image classification has been comprehensively evaluated on four different datasets (Cracks in Concrete Structures Dataset, Concrete \u0026amp; Pavement Crack Dataset, Crack Dataset and Concrete Cracks Image Dataset). The experimental results obtained and their analysis are discussed in detail in this section. 
The evaluation process aims to measure the effectiveness of the proposed method on different datasets and to provide a comparative analysis with state-of-the-art methods. Thus, the generalizability of the proposed approach and its performance under different conditions have been examined, and its advantages and limitations have been discussed in comparison with the existing state-of-the-art methods.\u003c/p\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e5.1. Performance comparison on different crack datasets\u003c/h2\u003e \u003cp\u003eIn this study, the performance of the LG-CAFFNet model has been comprehensively evaluated on four different datasets (Cracks in the Concrete Structures Dataset, Concrete \u0026amp; Pavement Crack Dataset, Crack Dataset and Concrete Cracks Image Dataset). To objectively measure the effectiveness of the proposed model, the experimental results of CNN-, MLP-, and Transformer-based modern deep learning algorithms on the same datasets have been analyzed comparatively. This comprehensive evaluation aims to determine the advantages and limitations of the LG-CAFFNet model over existing methods. The findings of the experimental evaluations are presented in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eIn experiments conducted on the Cracks in the Concrete Structures Dataset, the proposed LG-CAFFNet model demonstrated the highest performance with a test loss of 0.1021 and an ACC of 97.61%. The model also achieved the highest PPV (97.63%), SN (97.61%), and F-1 (97.61%) values. Comparative analyses reveal that the EfficientNetB2 (97.33% ACC, 0.1157 loss) and MobileNet (97.08% ACC, 0.1304 loss) models produced the closest accuracy values to the proposed model. 
Additionally, FasterNet (96.22% ACC, 0.1645 loss) and CMT (96.53% ACC, 0.2133 loss) exhibited moderate performance, while Swin Transformer V2 (93.81% ACC, 0.1894 loss) and MLP-Mixer (94.86% ACC, 0.2304 loss) exhibited lower performance. Finally, the FlexiViT model exhibited the lowest performance with 85.22% ACC and 0.3722 loss values.\u003c/p\u003e \u003cp\u003eIn experiments conducted on the Concrete \u0026amp; Pavement Crack Dataset, the proposed LG-CAFFNet model demonstrated the highest performance with a loss of 0.0238 and an ACC of 99.44%. The model also achieved the highest scores in the SN (99.12%) and F-1 (99.45%) metrics. In the PPV metric, FasterNet (99.82%) produced the highest value. Comparative analyses revealed that FasterNet (0.0422 loss, 99.31% ACC), MobileNet (0.0343 loss, 99.16% ACC), and EfficientNetB2 (0.0517 loss, 98.93% ACC) models produced the closest accuracy values to the proposed model. Besides, CMT (0.1447 loss, 94.22% ACC) and FlexiViT (0.1841 loss, 93.78% ACC) showed moderate performance, while MLP-Mixer (0.6307 loss, 61.51% ACC) and Swin Transformer V2 (0.6796 loss, 57.56% ACC) models showed significantly lower performance.\u003c/p\u003e \u003cp\u003eIn experiments conducted on the Crack Dataset, the LG-CAFFNet model demonstrated the highest accuracy with a loss of 0.0229 and an ACC of 99.23%. The model also achieved the highest values in PPV (99.22%), SN (99.21%), and F-1 (99.21%) metrics. 
Comparative analyses revealed that the FasterNet (loss of 0.0470, ACC of 98.46%) and MobileNet (loss of 0.0559, ACC of 98.46%) models produced accuracies closest to LG-CAFFNet.\u003c/p\u003e \u003cp\u003eBesides, Swin Transformer V2 (0.1490 loss, 94.87% ACC), FlexiViT (0.1307 loss, 94.79% ACC), and CMT (0.1567 loss, 93.33% ACC) showed moderate performance, while MLP-Mixer (0.2221 loss, 91.97% ACC) showed lower performance than the other models.\u003c/p\u003e \u003cp\u003eIn experiments conducted on the Concrete Cracks Image Dataset, the LG-CAFFNet model demonstrated the highest overall accuracy among all models, with a loss of 0.0658 and an ACC of 98.28%. The model also achieved the best performance in the SN (96.88%) and F-1 (98.26%) metrics, while EfficientNetB2 (100%) produced the highest value in the PPV metric. Models with comparable performance include FasterNet (0.0907 loss, 97.65% ACC), CMT (0.1126 loss, 96.87% ACC), MobileNet (0.1217 loss, 96.71% ACC), and FlexiViT (0.1134 loss, 96.55% ACC). In contrast, the MLP-Mixer (0.5867 loss, 70.85% ACC) and Swin Transformer V2 (0.6922 loss, 44.20% ACC) models showed the poorest performance, with markedly lower accuracy and higher loss.\u003c/p\u003e \u003cp\u003eExperimental results demonstrate that the LG-CAFFNet model consistently performs well in both positive and negative classes, with high ACC and low loss values. In particular, the high PPV and SN values in the positive class, representing cracks, indicate that the model can detect fracture zones with high accuracy. These findings suggest that LG-CAFFNet can accurately detect fractures and complex structures within them, exhibiting strong performance in this area. 
In conclusion, thanks to its ability to analyze high- and low-level features, the model produced successful crack detection results by effectively learning patterns at different scales. Figure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows the change in the validation loss of the deep learning algorithms during the training process on different crack datasets. Figure\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e shows the ROC curves and the corresponding AUC values obtained during the testing phase of the deep learning algorithms applied to different crack datasets. Figure\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e presents the confusion matrix showing the performance of the proposed LG-CAFFNet deep learning model on the test data of the crack datasets.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparative analysis of deep learning models on the crack datasets.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" 
class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLoss\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eACC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePPV\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eSN\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eF-1\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"7\" rowspan=\"8\"\u003e \u003cp\u003eCracks in the Concrete Structures Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFasterNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1645\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9622\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9622\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9623\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9620\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEfficientNetB2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1157\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9733\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e \u003cp\u003e0.9734\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9732\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9732\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMobileNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1304\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9708\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9708\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9710\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9707\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCMT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2133\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9653\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9656\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9652\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9652\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSwin Transformer V2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1894\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9381\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e \u003cp\u003e0.9388\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9380\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9378\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFlexiViT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.3722\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8522\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.8563\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.8528\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.8498\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMLP-Mixer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2304\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9486\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9483\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9483\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9482\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLG-CAFFNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1021\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9761\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e 
\u003cp\u003e0.9763\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9761\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9761\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"7\" rowspan=\"8\"\u003e \u003cp\u003eConcrete \u0026amp; Pavement Crack Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFasterNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0422\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9931\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9982\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9881\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9932\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEfficientNetB2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0517\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9893\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9942\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9846\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9894\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMobileNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0343\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9916\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9978\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9855\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9916\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCMT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1447\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9422\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9688\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9152\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9413\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSwin Transformer V2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.6796\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.5756\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.5839\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.5598\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.5716\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFlexiViT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1841\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9378\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9309\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9473\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9390\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMLP-Mixer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.6307\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.6151\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.6435\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.5360\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.5849\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLG-CAFFNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0238\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9944\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9978\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9912\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9945\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"7\" rowspan=\"8\"\u003e \u003cp\u003eCrack Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFasterNet\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0470\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9846\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9842\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9845\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9843\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEfficientNetB2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0677\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9769\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9770\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9761\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9764\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMobileNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0559\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9846\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9844\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9843\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9843\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCMT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1567\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9333\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9334\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9332\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9329\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSwin Transformer V2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1490\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9487\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9503\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9459\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9468\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFlexiViT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1307\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9479\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9469\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9467\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9468\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMLP-Mixer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e0.2221\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9197\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9203\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9171\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9180\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLG-CAFFNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0229\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9923\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9922\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9921\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9921\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"7\" rowspan=\"8\"\u003e \u003cp\u003eConcrete Cracks Image Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFasterNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0907\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9765\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9935\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9595\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9762\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEfficientNetB2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2671\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9389\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.8785\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9353\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMobileNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1217\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9671\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9902\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9439\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9665\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCMT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1126\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9687\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9871\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9502\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9683\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eSwin Transformer V2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.6922\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4420\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.2989\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.0810\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.1275\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFlexiViT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1134\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9655\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9902\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9408\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9649\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMLP-Mixer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.5867\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7085\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.7626\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.6106\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.6782\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eLG-CAFFNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0658\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9828\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9968\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.9688\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.9826\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows how the validation loss of each deep learning model changed over the course of training on the four datasets. All models fluctuated noticeably at the beginning of training, and most stabilized as the iterations progressed. MLP-Mixer and Swin Transformer V2 were the notable exceptions. These two models exhibited unstable learning dynamics, particularly on the Concrete \u0026amp; Pavement Crack Dataset, where the validation loss continued to show small-scale fluctuations. On the Concrete Cracks Image Dataset, their validation loss decreased at the beginning but changed very little in the later stages of training, and both models performed worse than the other deep learning approaches. By contrast, the proposed LG-CAFFNet architecture exhibited a consistent and efficient learning process on all four datasets. 
The model's low validation loss and stable optimization dynamics suggest that it generalizes well and offers a stronger representation capacity than the existing deep learning models compared here.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e compares the AUC performance of the deep learning models on the four datasets. Because the Cracks in the Concrete Structures Dataset and the Crack Dataset are multi-class, AUC values were calculated separately for each class. On the Cracks in the Concrete Structures Dataset, LG-CAFFNet (0.9821) produced the highest average AUC and FlexiViT (0.8893) the lowest; the average across all models was 0.9612. On the Concrete \u0026amp; Pavement Crack Dataset, LG-CAFFNet (0.9945) was the most successful model, while Swin Transformer V2 (0.5757) had the lowest performance; the average across all models was 0.8801. On the Crack Dataset, LG-CAFFNet (0.9941) produced the best result and MLP-Mixer (0.9385) the lowest average AUC; the average across all models was 0.9703. On the Concrete Cracks Image Dataset, LG-CAFFNet (0.9828) was the most successful model, while Swin Transformer V2 (0.4443) produced the lowest average AUC; the average across all models was 0.8692. These findings show that the proposed LG-CAFFNet model achieved the highest average AUC on all four datasets.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e presents the test results of the LG-CAFFNet model on the four datasets. On the Cracks in the Concrete Structures Dataset, the model made 86 errors on 3,600 test samples. 
The model achieved accuracies of 98.71% in the Without Crack class, 94.86% in the Simple Cracks class, and 99.27% in the Multibranched Crack class. These findings show that the model classified the Without Crack and Multibranched Crack classes most reliably, with Simple Cracks showing the lowest per-class accuracy. On the Concrete \u0026amp; Pavement Crack Dataset, the model made 25 errors on 4,500 test samples, achieving 99.78% accuracy in the Negative class and 99.12% in the Positive class; the Negative class was therefore classified slightly more accurately. On the Crack Dataset, the model made 9 errors on 1,170 test samples, with 100% accuracy in the Clear class, 98.62% in the Shallow class, and 99.00% in the Deep class. On the Concrete Cracks Image Dataset, the model made 11 errors on 638 test samples, reaching 99.68% accuracy in the No Cracks class and 96.88% in the Cracks class. These results show that the LG-CAFFNet model classifies different crack types with high accuracy across all four datasets.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e5.2. Ablation study\u003c/h2\u003e \u003cp\u003eThis ablation study analyzes how the fusion strategies and sequence-based components (MHA, BiGRU, BiLSTM) of the LG-CAFFNet deep learning model affect classification performance. The experiments were performed on the Cracks in the Concrete Structures Dataset, and the results are presented in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e. The full model, with all components integrated, achieved the highest accuracy (97.61%) and misclassified only 86 test examples. 
When the late fusion, MHA, and recurrent (BiGRU and BiLSTM) components were removed, the accuracy decreased to 97.47% and the number of misclassified examples increased to 91. When the multi-layer feature fusion component was removed as well, the accuracy fell to 72.06%, a drop of 25.55 percentage points relative to the full model, and the number of misclassified examples reached 1,006. When all fusion techniques were removed, the accuracy dropped to 69.72%, a decrease of 27.89 percentage points, and the number of misclassified examples increased to 1,090. These findings show that the fusion strategies and sequence-based components increased the generalization capacity of LG-CAFFNet, allowing complex data structures to be represented more effectively. However, although these components deepened the model's feature extraction, they also substantially increased the training time: the full model required 3623 seconds to train, compared with 1480 seconds when all of these components were removed. 
Figure\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e visualizes the effects of model components on the training process and the performance in the testing phase.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eEffects of different fusion techniques on the performance of the LG-CAFFNet deep learning model.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTrainable Parameters (million)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTotal number of layers\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eACC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eF-1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTraining Time (s)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLG-CAFFNet wo (late fusion, multi-layer feature fusion, early fusion technique, MHA, BiGRU and BiLSTM)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.35\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e486\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.6972\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.6922\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e1480\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLG-CAFFNet wo (late fusion, multi-layer feature fusion technique, MHA, BiGRU, BiLSTM)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.38\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e556\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7206\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.7148\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e1777\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLG-CAFFNet wo (late fusion technique, MHA, BiGRU, BiLSTM)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1.09\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e576\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9747\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e 
\u003cp\u003e0.9747\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e1667\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLG-CAFFNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1.48\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e669\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9761\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9761\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3623\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe graphs in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e present the ACC and F-1 results obtained in the testing phase together with the changes in validation loss. The findings show that removing the fusion strategies and sequence-based components from the LG-CAFFNet model significantly decreased its performance. The validation loss graph shows that removing the three fusion strategies and the sequence-based components led to high validation loss values. Consistent with this, the ACC and F-1 scores in the testing phase were lower than those of the original model whenever components were removed. These findings indicate that the components of the LG-CAFFNet model play a critical role in the classification process and that removing them negatively affects the model's generalization ability.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e5.3. 
Comparative complexity analysis\u003c/h2\u003e \u003cp\u003eThis section compares the complexity of the deep learning models; the findings are presented in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. The models were evaluated in terms of the total number of layers, the number of trainable parameters, giga floating-point operations (GFLOPs), and training time on each of the four datasets (Cracks in the Concrete Structures Dataset, Concrete \u0026amp; Pavement Crack Dataset, Crack Dataset, and Concrete Cracks Image Dataset). Despite its 669-layer structure, LG-CAFFNet stands out for its computational efficiency, with only 1.48\u0026nbsp;million trainable parameters and 0.75 GFLOPs. Among the architectures examined, LG-CAFFNet had the deepest structure, while MobileNet had the fewest layers. In terms of trainable parameters, LG-CAFFNet has the lowest value at 1.48\u0026nbsp;million, while MLP-Mixer has the highest at 59.53\u0026nbsp;million. In terms of computational cost, LG-CAFFNet has the lowest at 0.75 GFLOPs, while Swin Transformer V2 has the highest at 9.39 GFLOPs. In terms of training time, FasterNet had the shortest training times, whereas Swin Transformer V2 had the longest on all four datasets. 
The training times of the LG-CAFFNet model were 3623, 4439, 1416, and 911 seconds on the \u0026ldquo;Cracks in the Concrete Structures Dataset\u0026rdquo;, \u0026ldquo;Concrete \u0026amp; Pavement Crack Dataset\u0026rdquo;, \u0026ldquo;Crack Dataset\u0026rdquo;, and \u0026ldquo;Concrete Cracks Image Dataset\u0026rdquo; datasets, respectively.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparative analysis of deep learning model complexity.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eTotal number of layers\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eTrainable parameters 
(million)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eGFLOPs\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"4\" nameend=\"c8\" namest=\"c5\"\u003e \u003cp\u003eTraining time (s)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCracks in Concrete Structures Dataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eConcrete \u0026amp; Pavement Crack Dataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eCrack Dataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eConcrete Cracks Image Dataset\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFasterNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e131\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e6.32\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.71\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e581\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e708\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e218\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e118\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEfficientNetB2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e342\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e7.70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e2557\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3165\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e884\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e514\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMobileNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e954\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e1166\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e326\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e182\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCMT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e598\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e8.19\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.64\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e2757\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e \u003cp\u003e3432\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1040\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e639\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSwin Transformer V2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e569\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e27.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e9.39\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5540\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e6718\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1938\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e1041\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFlexiViT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e272\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e21.67\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e9.26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e2575\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3121\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e843\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e511\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMLP-Mixer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e150\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e59.53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e6.51\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e2197\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e2716\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e742\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e447\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLG-CAFFNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e669\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.48\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.75\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3623\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e4439\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1416\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e911\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e5.4. 
Explaining model predictions with XAI methods\u003c/h2\u003e \u003cp\u003eGradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) is an effective technique for making the predictions of CNN-based deep learning models on image data more transparent and interpretable. It uses the gradients of a class score to visualize the image regions that play a decisive role in the model output. Grad-CAM highlights, for each class, the regions on which the network focuses its attention, and thereby indicates which features drive the model's decisions. This allows a more objective interpretation and analysis of the model's predictions in complex tasks such as image classification. This study used Grad-CAM to increase the interpretability of the proposed LG-CAFFNet model when classifying fractures and cracks in structural elements such as concrete, pavement, and roads. Experimental findings are presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e. The figure shows the class activation maps and regions of interest obtained by applying Grad-CAM to the comprehensive feature matrices produced by the late fusion stage of the LG-CAFFNet model.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e shows the visualization results obtained by the LG-CAFFNet model for cracks in concrete and other surfaces across the four datasets. These results indicate that the model localizes cracks effectively. 
In the Cracks in Concrete Structures Dataset, the Grad-CAM activation maps show that the model accurately localized both simple and multi-branched cracks. In simple cracks, the model's focus areas were concentrated along the crack and covered its fine details. In complex multi-branched cracks, the focus areas followed the crack geometry, and the accurate identification of branching regions reflects the model's high geometric sensitivity. In addition, the overlay images generally show minimal activation outside the crack region, which suggests that the model produces few false activations. On the Concrete \u0026amp; Pavement Crack Dataset, the model correctly detects both thin and wide crack lines, with dense activations spreading along the crack. The Grad-CAM activation maps show that the model covers the entire crack length, with particular focus on the crack's starting and ending points. Wider cracks were detected successfully without being affected by environmental noise. In thin cracks, the model also correctly detected low-contrast crack sections that run parallel to the surface. The overlay images show that the model detects cracks independently of environmental noise and that color and texture changes on concrete surfaces do not degrade its performance. In the analyses performed on the Crack Dataset, the model distinguished shallow cracks from deep cracks successfully. In shallow cracks, the model followed thin surface crack patterns in detail, although some activation leaked into the area around the crack. In deep cracks, by contrast, the model's focus density increased and it captured the crack's deep structure more successfully. 
In deep cracks, Grad-CAM maps focused on the inner lines of the crack and the stress points around it. This indicates that the model responds to depth cues as well as to surface cracks.\u003c/p\u003e \u003cp\u003eIn addition, the overlay results confirm that the model's focus area is consistent along the crack. Tests on the Concrete Cracks Image Dataset evaluated the model's ability to generalize to cracks on different surface types (wall, concrete, etc.). In addition to cracks on wall and concrete surfaces, the model performed effectively on images with heavy texture or irregular surface features. In particular, it correctly detected both large cracks on large surfaces and small wear-related surface cracks. In the overlay images, the model generally focused on crack areas and was little affected by surface roughness, color variations, or textural differences. The Grad-CAM maps also show that the regions on which the model focuses match the crack geometry. Overall, the experimental findings show that the LG-CAFFNet model handles differences in crack depth, geometry, and surface type well. Deep cracks have a wider and more complex structure than superficial cracks, and the model's accurate localization of such cracks supports its depth sensitivity. The model also performed well on more complex geometric structures, especially multi-branched cracks, and on different surface types (concrete, road, wall, etc.). The results suggest that LG-CAFFNet generalizes well and can detect cracks accurately under varying environmental conditions. 
The model can effectively distinguish different surface types and offers an important approach to crack detection and visualization.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003e5.5. Comparison of state-of-the-art methods\u003c/h2\u003e \u003cp\u003eThis section compares the performance of the LG-CAFFNet deep learning model developed for crack detection with that of recent studies in the literature. The results are presented in Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eA comparative analysis of state-of-the-art methods and the proposed deep learning model.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStudy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eMethods\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTypes of cracks\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eACC (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePPV (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eSN (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eF-1 (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRussel and Selvaraj (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMultiScaleCrackNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAsphalt Crack Database\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNegative, Positive\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e99.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e98.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e99.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eShashidhar et al. 
(\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCrackSpot\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eStructure surface datasets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNoncrack, Crack\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e97.11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e97\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMohan et al. (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eResNet50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCracks in Concrete Structures Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eWithout Crack, Simple Cracks, Multibranched Crack\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e92.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e90.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOmoebamije et al. 
(\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCNN model\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eConcrete \u0026amp; Pavement Crack Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNegative, Positive\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e99.04\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e98.81\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e99.28\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e99.04\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChen et al. (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2023b\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eResNet101\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBuilding Surface Crack (in China)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eWithout Cracks, Cracks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e94\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRashid et al. 
(\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSurface Crack Detection Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNegative, Positive\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e99.27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e99.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e98.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e99.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eJabbari and Bigdeli (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCapsGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCracks in Concrete Structures Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eWithout Crack, Simple Cracks, Multibranched Crack\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e94.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e98.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e94.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e96.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSun et al. 
(\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSDNET2018\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eWithout Cracks, With Cracks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e94.38\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eProposed Approach\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eLG-CAFFNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCracks in Concrete Structures Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eWithout Crack, Simple Cracks, Multibranched Crack\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e97.61\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e97.63\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e97.61\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e97.61\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eConcrete \u0026amp; Pavement Crack Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNegative, Positive\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c5\"\u003e \u003cp\u003e99.44\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e99.78\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e99.12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e99.45\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCrack Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eClear, Shallow, Deep\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e99.23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e99.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e99.31\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e99.30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eConcrete Cracks Image Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo cracks, Cracks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e98.28\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e99.68\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e96.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e98.26\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eAccording to Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e, the classification accuracies of the proposed deep learning model on the \u0026ldquo;Cracks in Concrete Structures Dataset\u0026rdquo;, 
\u0026ldquo;Concrete \u0026amp; Pavement Crack Dataset\u0026rdquo;, \u0026ldquo;Crack Dataset\u0026rdquo;, and \u0026ldquo;Concrete Cracks Image Dataset\u0026rdquo; were 97.61%, 99.44%, 99.23%, and 98.28%, respectively. On the two datasets that were also used in the prior studies listed in the table, the proposed model achieved the highest accuracy. On the \u0026ldquo;Cracks in Concrete Structures Dataset\u0026rdquo;, Mohan et al. (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) achieved the second-best classification accuracy with an ACC of 96%. On the \u0026ldquo;Concrete \u0026amp; Pavement Crack Dataset\u0026rdquo;, Omoebamije et al. (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) achieved the second-best classification accuracy with an ACC of 99.04%.\u003c/p\u003e \u003c/div\u003e"},{"header":"6. Conclusion","content":"\u003cp\u003eThis study proposes LG-CAFFNet, an advanced deep learning model for detecting cracks in concrete structures. The model is designed to perform comprehensive feature extraction by learning local correlations with CNN, global correlations with MHA, and sequential correlations with bidirectional RNN (BiLSTM, BiGRU) techniques. In addition, feature fusion at different levels is integrated into the model through early fusion, multi-layer feature fusion, and late fusion strategies. This improves the model's crack detection performance and provides a more balanced learning process. Although the proposed model has a deep structure of 669 layers, it contains only 1.48\u0026nbsp;million trainable parameters. This keeps the model compact while maintaining its high learning capacity. Thus, the model exhibits high performance with a low parameter count. 
The effectiveness of the proposed method was evaluated with extensive experiments on the Cracks in Concrete Structures Dataset, the Concrete \u0026amp; Pavement Crack Dataset, the Crack Dataset, and the Concrete Cracks Image Dataset collected by the authors. The experimental results show that the LG-CAFFNet model achieved accuracy rates of 97.61%, 99.44%, 99.23%, and 98.28% on these datasets, respectively. These findings show that the CNN-MHA-Bidirectional RNN-based deep learning model significantly increases the capacity to learn crack patterns and provides a practical approach to crack detection.\u003c/p\u003e \u003cp\u003eThis study has several limitations for real-world applications, including time complexity, processing times, and generalizability. Because of the RNN and MHA components, the LG-CAFFNet model has a high time cost: its training times were the second longest among the compared models on all four datasets (Table 5). However, although these components increased the model's time complexity, they significantly improved the classification success.\u003c/p\u003e \u003cp\u003eIn real-world applications, the effectiveness of the model may vary with factors such as dataset size and diversity, which can make it challenging to balance accuracy against processing time. In future studies, strategies such as knowledge distillation, quantization, and pruning can be applied to reduce the computational cost of the proposed LG-CAFFNet model. Computational efficiency can also be increased by using dilated convolutions and group convolutions in the model. Additionally, alternative methods, such as self-supervised learning techniques, graph convolutional networks, or capsule networks, can be employed to enhance the model's feature extraction capabilities. In determining the parameters of the deep learning model, optimization algorithms such as Particle Swarm Optimization, Cuckoo Search, and the Firefly Algorithm can be used to make more effective adjustments. 
Moreover, hybrid learning methods can be developed using meta-classifiers. Data augmentation techniques can be applied to increase the model's classification performance. Finally, comprehensive tests can be performed on datasets of different sizes to evaluate the model's generalizability.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCompeting interests:\u0026nbsp;\u003c/strong\u003eThe authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;Ethics approval:\u003c/strong\u003e Not Applicable. The article does not involve any human or animal participants. No ethical approval is required.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;Author contributions:\u003c/strong\u003e All authors made a significant contribution to the work reported. H.C.R.: conception, study design, acquisition of data, software, analysis, writing, editing, and interpretation.\u003c/p\u003e\n\u003cp\u003eV.T.: study design, acquisition of data, software, analysis, writing, and interpretation. K.K.: analysis, writing, editing, and interpretation. All authors have read and agreed to the published version of the manuscript.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;Funding:\u003c/strong\u003e Not Applicable.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability:\u003c/strong\u003e The data that support the findings of this study are available from the corresponding authors upon reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003cp\u003e[dataset] Jabbari, H., Bigdeli, N., Shojaei, M., 2023. Cracks in concrete structures (CICS) dataset. Mendeley Data, v1. https://doi.org/10.17632/9brnm3c39k.1.\u003c/p\u003e\n\u003cp\u003e[dataset] Kassem, R., 2023. Crack Dataset. Kaggle. 
https://www.kaggle.com/datasets/reemkassem/crack-dataset.\u003c/p\u003e\n\u003cp\u003e[dataset] Oluwaseun, O., 2023. Concrete \u0026amp; Pavement Crack Dataset. Kaggle. https://doi.org/10.34740/kaggle/dsv/5130126.\u003c/p\u003e\n\u003cp\u003e[dataset] Reis, H.C., Turk, V., Bozkurt, M.F., Yigit, S.N., 2024. Concrete Cracks Image Dataset (CCID). Mendeley Data, v2. https://doi.org/10.17632/fgjy2s3nk7.2.\u003c/p\u003e\n\u003cp\u003eAhmed, T.U., Hossain, M.S., Alam, M.J., Andersson, K., 2019. An integrated CNN-RNN framework to assess road crack. In: 2019 22nd International Conference on Computer and Information Technology (ICCIT). pp. 1-6. https://doi.org/10.1109/ICCIT48885.2019.9038607.\u003c/p\u003e\n\u003cp\u003eBeyer, L., Izmailov, P., Kolesnikov, A., Caron, M., Kornblith, S., Zhai, X., Minderer, M., Tschannen, M., Alabdulmohsin, I., Pavetic, F., 2023. Flexivit: One model for all patch sizes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR, pp. 14496-14506. https://doi.org/10.1109/CVPR52729.2023.01393.\u003c/p\u003e\n\u003cp\u003eChang, S., Zheng, B., 2024. A lightweight convolutional neural network for automated crack inspection. Construction and Building Materials 416, 135151. https://doi.org/10.1016/j.conbuildmat.2024.135151.\u003c/p\u003e\n\u003cp\u003eChen, J., Kao, S.H., He, H., Zhuo, W., Wen, S., Lee, C.H., Chan, S.H.G., 2023a. Run, don\u0026apos;t walk: chasing higher FLOPS for faster neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. CVPR, pp. 12021-12031. https://doi.org/10.1109/CVPR52729.2023.01157.\u003c/p\u003e\n\u003cp\u003eChen, T., Cai, Z., Zhao, X., Chen, C., Liang, X., Zou, T., Wang, P., 2020. Pavement crack detection and recognition using the architecture of segNet. Journal of Industrial Information Integration 18, 100144. https://doi.org/10.1016/j.jii.2020.100144.\u003c/p\u003e\n\u003cp\u003eChen, Y., Zhu, Z., Lin, Z., Zhou, Y., 2023b. 
Building surface crack detection using deep learning technology. Buildings 13 (7), 1814. https://doi.org/10.3390/buildings13071814.\u003c/p\u003e\n\u003cp\u003eCubero-Fernandez, A., Rodriguez-Lozano, F.J., Villatoro, R., Olivares, J., Palomares, J.M., 2017. Efficient pavement crack detection and classification. EURASIP Journal on Image and Video Processing 2017 (1), 39. https://doi.org/10.1186/s13640-017-0187-0.\u003c/p\u003e\n\u003cp\u003eDosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.\u003c/p\u003e\n\u003cp\u003eFaghih-Roohi, S., Hajizadeh, S., N\u0026uacute;\u0026ntilde;ez, A., Babuska, R., De Schutter, B., 2016. Deep convolutional neural networks for detection of rail surface defects. In: 2016 International joint conference on neural networks (IJCNN). pp. 2584-2589. https://doi.org/10.1109/IJCNN.2016.7727522.\u003c/p\u003e\n\u003cp\u003eFang, F., Li, L., Gu, Y., Zhu, H., Lim, J.H., 2020. A novel hybrid approach for crack detection. Pattern Recognition 107, 107474. https://doi.org/10.1016/j.patcog.2020.107474.\u003c/p\u003e\n\u003cp\u003eFu, R., Zhang, Y., Zhu, K., Strauss, A., Cao, M., 2024. Real-time detection of concrete cracks via enhanced You Only Look Once Network: Algorithm and software. Advances in Engineering Software 195, 103691. https://doi.org/10.1016/j.advengsoft.2024.103691.\u003c/p\u003e\n\u003cp\u003eGandhi, M.A., Swaminathen, A.N., Patil, D.T., Ravitheja, A., Kamali, R., Rajput, A., 2023. Quantitative Evaluation to Detect Crack Depth in Beams Based on CNN-RNN-LSTM Approach. In: 2023 International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS). pp. 74-79. https://doi.org/10.1109/ICSSAS57918.2023.10331901.\u003c/p\u003e\n\u003cp\u003eGopalakrishnan, K., 2018. 
Deep learning in data-driven pavement image analysis and automated distress detection: A review. Data 3 (3), 28. https://doi.org/10.3390/data3030028.\u003c/p\u003e\n\u003cp\u003eGuo, F., Qian, Y., Liu, J., Yu, H., 2023. Pavement crack detection based on transformer network. Automation in Construction 145, 104646. https://doi.org/10.1016/j.autcon.2022.104646.\u003c/p\u003e\n\u003cp\u003eGuo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C., 2022. Cmt: Convolutional neural networks meet vision transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. CVPR, pp. 12175-12185. https://doi.org/10.1109/CVPR52688.2022.01186.\u003c/p\u003e\n\u003cp\u003eHoang, N.D., Nguyen, Q.L., Tien Bui, D., 2018. Image processing\u0026ndash;based classification of asphalt pavement cracks using support vector machine optimized by artificial bee colony. Journal of Computing in Civil Engineering 32 (5), 04018037. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000781.\u003c/p\u003e\n\u003cp\u003eHoward, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.\u003c/p\u003e\n\u003cp\u003eJabbari, H., Bigdeli, N., 2024. A new hierarchical algorithm based on CapsGAN for imbalanced image classification. IET Image Processing 18 (1), 194-210. https://doi.org/10.1049/ipr2.12942.\u003c/p\u003e\n\u003cp\u003eKamaliardakani, M., Sun, L., Ardakani, M.K., 2016. Sealed-crack detection algorithm using heuristic thresholding approach. Journal of Computing in Civil Engineering 30 (1), 04014110. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000447.\u003c/p\u003e\n\u003cp\u003eKingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.\u003c/p\u003e\n\u003cp\u003eLi, L., Sun, R., 2019. Bridge crack detection algorithm based on image processing under complex background. 
Laser \u0026amp; Optoelectronics Progress 56 (6), 061002. http://dx.doi.org/10.3788/LOP56.061002.\u003c/p\u003e\n\u003cp\u003eLi, Y., Li, H., Wang, H., 2018. Pixel-wise crack detection using deep local pattern predictor for robot application. Sensors 18 (9), 3042. https://doi.org/10.3390/s18093042.\u003c/p\u003e\n\u003cp\u003eLiu, F., Liu, J., Wang, L., 2022a. Deep learning and infrared thermography for asphalt pavement crack severity classification. Automation in Construction 140, 104383. https://doi.org/10.1016/j.autcon.2022.104383.\u003c/p\u003e\n\u003cp\u003eLiu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B., 2022b. Swin transformer v2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. CVPR, pp. 12009-12019. https://doi.org/10.1109/CVPR52688.2022.01170.\u003c/p\u003e\n\u003cp\u003eMa, X., Li, Y., Yang, Z., Li, S., Li, Y., 2024. Lightweight network for millimeter-level concrete crack detection with dense feature connection and dual attention. Journal of Building Engineering 94, 109821. https://doi.org/10.1016/j.jobe.2024.109821.\u003c/p\u003e\n\u003cp\u003eMatarneh, S., Elghaish, F., Rahimian, F.P., Abdellatef, E., Abrishami, S., 2024. Evaluation and optimisation of pre-trained CNN models for asphalt pavement crack detection and classification. Automation in Construction 160, 105297. https://doi.org/10.1016/j.autcon.2024.105297.\u003c/p\u003e\n\u003cp\u003eMohan, A., Poobal, S., 2018. Crack detection using image processing: A critical review and analysis. Alexandria Engineering Journal 57 (2), 787-798. https://doi.org/10.1016/j.aej.2017.01.020.\u003c/p\u003e\n\u003cp\u003eMohan, G.B., Kumar, R.P., Yogiraj, B., 2023. Deep Learning-Powered Concrete Crack Classification for Improved Structural Integrity. In: 2023 Seventh International Conference on Image Information Processing (ICIIP). pp. 844-849. 
https://doi.org/10.1109/ICIIP61524.2023.10537741.\u003c/p\u003e\n\u003cp\u003eNasimov, R., Cho, Y.I., 2025. Smart City Infrastructure Monitoring with a Hybrid Vision Transformer for Micro-Crack Detection. Sensors 25 (16), 5079. https://doi.org/10.3390/s25165079.\u003c/p\u003e\n\u003cp\u003eNyathi, M.A., Bai, J., Wilson, I.D., 2024. Deep learning for concrete crack detection and measurement. Metrology 4 (1), 66-81. https://doi.org/10.3390/metrology4010005.\u003c/p\u003e\n\u003cp\u003eOmoebamije, O., Omoniyi, T.M., Musa, A., Duna, S., 2023. An improved deep learning convolutional neural network for crack detection based on UAV images. Innovative Infrastructure Solutions 8 (9), 236. https://doi.org/10.1007/s41062-023-01209-3.\u003c/p\u003e\n\u003cp\u003eRashid, T., Mokji, M.M., Rasheed, M., 2024. Cracked concrete surface classification in low-resolution images using a convolutional neural network. Journal of Optics 1-13. https://doi.org/10.1007/s12596-024-02080-w.\u003c/p\u003e\n\u003cp\u003eRussel, N.S., Selvaraj, A., 2024. MultiScaleCrackNet: A parallel multiscale deep CNN architecture for concrete crack classification. Expert Systems with Applications 249, 123658. https://doi.org/10.1016/j.eswa.2024.123658.\u003c/p\u003e\n\u003cp\u003eSelvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618-626. https://doi.org/10.1109/ICCV.2017.74.\u003c/p\u003e\n\u003cp\u003eShamsabadi, E.A., Xu, C., Rao, A.S., Nguyen, T., Ngo, T., Dias-da-Costa, D., 2022. Vision transformer-based autonomous crack detection on asphalt and concrete surfaces. Automation in Construction 140, 104316. https://doi.org/10.1016/j.autcon.2022.104316.\u003c/p\u003e\n\u003cp\u003eShashidhar, R., Manjunath, D., Shanmukha, S.M., 2024. 
CrackSpot: Deep learning for automated detection of structural cracks in concrete infrastructure. Asian Journal of Civil Engineering 25 (1), 1079-1090. https://doi.org/10.1007/s42107-023-00754-7.\u003c/p\u003e\n\u003cp\u003eShi, M., Li, H., Yao, Q., Zeng, J., Wang, J., 2024. Vision based nighttime pavement cracks pixel level detection by integrating infrared visible fusion and deep learning. Construction and Building Materials 442, 137662. https://doi.org/10.1016/j.conbuildmat.2024.137662.\u003c/p\u003e\n\u003cp\u003eSun, Z., Caetano, E., Pereira, S., Moutinho, C., 2023. Employing histogram of oriented gradient to enhance concrete crack detection performance with classification algorithm and Bayesian optimization. Engineering Failure Analysis 150, 107351. https://doi.org/10.1016/j.engfailanal.2023.107351.\u003c/p\u003e\n\u003cp\u003eTan, M., Le, Q., 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning (PMLR), Vol. 97, pp. 6105-6114.\u003c/p\u003e\n\u003cp\u003eTeng, S., Liu, A., Chen, B., Wang, J., Wu, Z., Fu, J., 2024. Unsupervised learning method for underwater concrete crack image enhancement and augmentation based on cross domain translation strategy. Engineering Applications of Artificial Intelligence 136, 108884. https://doi.org/10.1016/j.engappai.2024.108884.\u003c/p\u003e\n\u003cp\u003eTolstikhin, I.O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., Dosovitskiy, A., 2021. Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems 34, 24261-24272.\u003c/p\u003e\n\u003cp\u003eWang, C., Liu, H., An, X., Gong, Z., Deng, F., 2024. SwinCrack: Pavement crack detection using convolutional swin-transformer network. Digital Signal Processing 145, 104297. https://doi.org/10.1016/j.dsp.2023.104297.\u003c/p\u003e\n\u003cp\u003eYeung, C.C., Lam, K.M., 2024. 
Contrastive decoupling global and local features for pavement crack detection. Engineering Applications of Artificial Intelligence 133, 108632. https://doi.org/10.1016/j.engappai.2024.108632.\u003c/p\u003e\n\u003cp\u003eZhang, B., Zhang, Y., 2025. MSCViT: A small-size ViT architecture with multi-scale self-attention mechanism for tiny datasets. Neural Networks 188, 107499. https://doi.org/10.1016/j.neunet.2025.107499.\u003c/p\u003e\n\u003cp\u003eZhang, H., Ma, L., Yuan, Z., Liu, H., 2024a. Enhanced concrete crack detection and proactive safety warning based on I-ST-UNet model. Automation in Construction 166, 105612. https://doi.org/10.1016/j.autcon.2024.105612.\u003c/p\u003e\n\u003cp\u003eZhang, T., Qin, L., Zou, Q., Zhang, L., Wang, R., Zhang, H., 2024b. Crackscopenet: a lightweight neural network for rapid crack detection on resource-constrained drone platforms. Drones 8 (9), 417. https://doi.org/10.3390/drones8090417.\u003c/p\u003e\n\u003cp\u003eZhao, H., Qin, G., Wang, X., 2010. Improvement of canny algorithm based on pavement edge detection. In: 2010 3rd international congress on image and signal processing. Vol. 2, pp. 964-967. 
https://doi.org/10.1109/CISP.2010.5646923.\u003c/p\u003e"},{"header":"Footnotes","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.tensorflow.org/versions/r2.15/api_docs/python/tf\u003c/span\u003e\u003cspan address=\"https://www.tensorflow.org/versions/r2.15/api_docs/python/tf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (accessed 11th Jun 2024)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/leondgarse/keras_cv_attention_models\u003c/span\u003e\u003cspan address=\"https://github.com/leondgarse/keras_cv_attention_models\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (accessed 11th Jun 2024)\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"soft-computing","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"soco","sideBox":"Learn more about [Soft Computing](https://www.springer.com/journal/500)","snPcode":"500","submissionUrl":"https://submission.nature.com/new-submission/500/3","title":"Soft Computing","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Context-aware crack detection, Light-weight deep neural network architecture, Local–global 
feature fusion, Multi-scale feature integration, Structural health monitoring, Surface crack detection in concrete and pavements","lastPublishedDoi":"10.21203/rs.3.rs-8892244/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8892244/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eEarly and accurate detection of cracks in concrete structures is crucial for maintaining structural integrity and ensuring the safety of the structure. However, traditional visual inspection methods are limited in their application, especially with large datasets. In this area, deep learning-based approaches offer high potential for the automatic detection of micro- and macro-damage due to their large data processing capacity and ability to model complex structural patterns in this data. In recent years, among deep learning-based approaches, the Convolutional Neural Network (CNN) has become prominent in crack detection. These models hold significant potential for identifying small cracks and micro-damage due to their ability to extract local features effectively. However, due to their limited ability to represent global context and long-range relationships, these models may be limited in detecting complex structural patterns where micro- and macro-cracks coexist. In this study, an advanced lightweight deep learning model called the Local-Global Context-Aware Feature Fusion Network (LG-CAFFNet) was developed to minimize the limitations of existing crack detection methods. The model focuses on comprehensively representing crack morphology at micro and macro scales with its multilayered structure that integrates local morphological details and global contextual relationships. In the model, local textural features are extracted through CNN-based layers. 
At the same time, the self-attention mechanism represents large-scale contextual relationships, and bidirectional recurrent neural network layers represent sequential structural dependencies. This multilayer contextual fusion-based approach, addressing the limitations observed in previous studies, contributes to a more comprehensive modeling of the morphological diversity of crack patterns, their multi-scale representation, and the contextual relationships between them. The proposed model was tested on four different concrete crack datasets, achieving accuracies of 97.61%, 99.44%, 99.23%, and 98.28%, respectively. Experimental results demonstrate that the proposed method offers competitive accuracy and computational efficiency in concrete crack detection, surpassing existing technologies and providing effective solutions for practical applications.\u003c/p\u003e","manuscriptTitle":"Deep neural network with local-global context-aware feature fusion for crack detection","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-05 16:15:38","doi":"10.21203/rs.3.rs-8892244/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Major Revision","date":"2026-05-07T09:57:06+00:00","index":"","fulltext":""},{"type":"reviewerAgreed","content":"","date":"2026-03-14T06:33:15+00:00","index":0,"fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-03-02T14:44:53+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"Soft Computing","date":"2026-02-28T22:17:36+00:00","index":"","fulltext":""},{"type":"submitted","content":"Soft Computing","date":"2026-02-18T02:33:58+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"soft-computing","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"soco","sideBox":"Learn more about [Soft 
Computing](https://www.springer.com/journal/500)","snPcode":"500","submissionUrl":"https://submission.nature.com/new-submission/500/3","title":"Soft Computing","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"0ab339db-2aba-49e5-838a-8a39ac7fc153","owner":[],"postedDate":"March 5th, 2026","published":true,"recentEditorialEvents":[{"type":"decision","content":"Major Revision","date":"2026-05-07T09:57:06+00:00","index":"","fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[],"tags":[],"updatedAt":"2026-05-07T14:44:30+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-05 16:15:38","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8892244","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8892244","identity":"rs-8892244","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
