Fish Image Classification and Prediction Using Compact Convolutional Transformers

Research Article (Research Square preprint)

Mir Tahmid Hossain, Md. Ismiel Hossen Abir, Dr Md Nawab Yousuf Ali

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4651008/v1
This work is licensed under a CC BY 4.0 License
Status: Posted (Version 1)

Abstract
In this study, we propose a new method for fish image classification using Compact Convolutional Transformers (CCT). CCTs are a variation of the transformer architecture, which have been successful in natural language processing tasks. Our study begins with an in-depth background analysis, exploring the current state-of-the-art techniques in fish image classification and identifying potential gaps in the existing methodologies.
We highlight the limitations of traditional convolutional neural networks (CNNs) in handling large-scale fish image datasets, such as the wide variation among fish species. We introduce the Compact Convolutional Transformer, a fusion of Convolutional Neural Network and Transformer architectures. We break down the methodology into distinct subsystems: a feature extraction module using CNNs and a context modeling module employing the Transformer. By incorporating compact convolutional layers, CCTs are able to capture local spatial information in images effectively while still maintaining the ability to model long-range dependencies. We assess the performance of our proposed method on a dataset of fish images and compare it to traditional convolutional neural networks and other state-of-the-art fish image classification methods. Our experiments show that the CCT model achieves an accuracy of 98.6% on the test dataset. Our approach is a promising solution for fish image classification and could be extended to related tasks such as fish counting, fish identification and fish prediction.

Keywords: Artificial Intelligence and Machine Learning, CCT (Compact Convolutional Transformers), Classification, Transformers, Fish

1. Introduction
The classification of fish images is a crucial issue in the study and protection of aquatic ecosystems. Identifying fish species correctly can help with population monitoring, tracking the spread of invasive species, and enforcing fishing laws. Traditional image classification techniques, however, call for a lot of data and computing power, which may be difficult to provide in the field or in remote areas.
Compact convolutional transformers (CCTs), a more recent type of neural network design, can perform image classification tasks with high accuracy while using far fewer parameters than conventional CNNs. As a result, CCTs are a practical choice for classifying fish images in settings with limited resources. Additionally, it has been shown that CCTs are more robust to variations in image scale and rotation, which matters in real-world settings where the angle and size of the fish in the image may change. Applying CCTs to fish image classification therefore has the potential to increase the accuracy and efficiency of this critical activity in aquatic ecology and conservation. [ 1 – 6 ]

Because fish vary in appearance and some species look alike, automatic fish image classification has proven to be difficult. Convolutional neural networks (CNNs) have demonstrated excellent performance in image classification applications in recent years. However, CNNs have several limitations when it comes to handling the spatial relationships between features in an image. Transformer models have been shown to handle such spatial interactions well, but they are computationally expensive and need large amounts of training data. [ 6 – 15 ]

In this study, we present a new approach to classifying fish images that combines the advantages of CNNs and transformer models. Our method makes use of a Compact Convolutional Transformer (CCT) model, which is designed to be accurate while also being computationally efficient. The CCT model consists of a CNN backbone that extracts features from the input image and a transformer block that uses those features to make the final classification decision. On a publicly accessible collection of fish images, experimental findings show that the proposed CCT model achieves an accuracy of 98.6%, which is much higher than that of the most advanced CNN-based models.
Additionally, our model has fewer parameters and a shorter inference time, making it more computationally efficient.

Fish image classification using the Compact Convolutional Transformer (CCT) is a specialized study area within computer vision and deep learning. The objective is to create a model that can correctly classify various fish species according to their visual traits. This background section outlines the main elements and research approaches used in this field.

For many years, CNNs have dominated image classification. These neural networks employ a set of convolutional layers, activation functions, and pooling layers to automatically identify and extract hierarchical features from images. CNNs have excelled in a variety of computer vision tasks due to their capacity to recognize local spatial patterns.

The Transformer is a deep learning architecture first described in the 2017 publication "Attention is All You Need" by Vaswani et al. [ 1 ]. It has become one of the most significant and popular models for numerous natural language processing (NLP) problems. Recent breakthroughs in machine translation, language comprehension, text generation, and other NLP tasks have been made possible by the Transformer paradigm [ 15 ]. The main principle behind the Transformer model is the use of self-attention to identify relationships between different positions in the input sequence. Conventional recurrent neural networks (RNNs) process inputs sequentially, which restricts parallelization and makes it difficult to capture long-range dependencies. The Transformer, in contrast, processes the input sequence in parallel by attending to every position at once.
[ 16 ] Thanks to the self-attention mechanism of the Transformer, each position in the input sequence can attend to every other position, taking into account its relevance. The mechanism calculates a weighted sum of values, with weights based on the similarity (attention) between positions. This allows the model to concentrate on different segments of the input sequence as necessary, improving representation learning. [ 17 – 25 ]

Figure 1 shows the Transformer model's encoder-decoder architecture. The encoder creates a series of hidden representations from an input sequence, and the decoder creates the output sequence based on these representations. Both the encoder and decoder are made up of several layers, each of which uses a self-attention mechanism and position-wise feed-forward neural networks. [ 15 ] Several Transformer models exist in machine learning; Vision Transformers, Swin Transformers and Compact Convolutional Transformers are discussed here.

Due to the intricate and varied visual characteristics of many fish species, classifying fish images is a difficult computer vision problem. Traditional convolutional neural networks (CNNs) have shown promise in image classification applications, but they are computationally expensive and call for a large number of parameters. Transformers, on the other hand, have shown tremendous success in natural language processing tasks, but their application to image classification still faces difficulties, especially in terms of model size and computational efficiency. This study therefore proposes a method for fish image classification and prediction based on the Compact Convolutional Transformer in an effort to overcome these constraints. The primary goal is to develop a transformer-based architecture that is capable of handling the high-dimensional spatial information found in fish photos.
In this study, we explore the effectiveness of the Compact Convolutional Transformer (CCT) model for fish image classification and prediction. The study aims to build on advances in transformer-based architectures, which have shown remarkable success in various deep learning models, and to compare several models on the same dataset. By adopting the CCT model, we seek to reach state-of-the-art precision and efficiency in fish species recognition and classification from images. The research involves extensive experimentation and evaluation on diverse fish datasets, comparing the performance of CCT with traditional convolutional neural networks and other transformer-based approaches. Through this investigation, we aim to establish a novel approach for fish image analysis that enables efficient and accurate classification and could have practical applications in aquatic biodiversity monitoring, fisheries management, and environmental conservation. The remainder of this paper discusses the materials and the proposed methods, including the image augmentation used in the project. [ 15 – 33 ]

2. Materials And Methods
Classifying and predicting fish images using a compact convolutional transformer combines traditional convolutional neural networks (CNNs) with transformer-based models. The goal is to leverage the strengths of both architectures to achieve better performance on image classification tasks. The methodology is described step by step below.

2.1 Data collection
A collection of fish images was gathered from Kaggle, where it is publicly available. The dataset included pictures of several fish species, including freshwater and saltwater fish, and each picture was labeled with the name of the species it represented. There were 12 classes, corresponding to 12 different fish species, divided into Train, Test and Validation datasets.
The appropriate fish species were then added as labels to the dataset. [ 28 ]

2.2 Data preprocessing
The images were scaled to a uniform size and then divided into training, validation, and test sets. To expand the dataset, data augmentation methods such as rotation and flipping were used. The images were uniformly scaled and the classes were balanced; no unusable or damaged pictures were found in the collection. [ 29 ]

2.2.1 Developing a data pipeline for input
The dataset API can be used to create an asynchronous, highly efficient data pipeline that keeps the GPU from running out of data. It loads data (text or images) from disk, applies efficient transformations, generates batches, and then passes the data to the GPU. Older data pipelines caused performance problems because the GPU had to wait for the CPU to load the data. [ 29 ]

2.2.2 Create Train and Test Splits
Machine learning (ML) models must be trained on and tested against data from the same target distribution in order to be effective. The simplest way to divide the dataset into training and testing sets is to assign two-thirds of the data to the training set and the remaining third to the testing set. We then train the model on the training set before applying it to the test set. As seen in Fig. 8 , this method lets us evaluate the performance of our model. [ 30 ]

2.2.3 Augmentation of Image Data
Machine learning and computer vision often employ data augmentation, which applies a number of transformations to the existing data to expand both the quantity and the variety of the training dataset, as shown in Fig. 9 . It helps improve the model's generalizability, reduce overfitting, and strengthen model performance.
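As a rough illustration of the rotation and flipping mentioned above, the following minimal Python sketch applies both transformations to a tiny image stored as a list of pixel rows. The function names (`horizontal_flip`, `rotate_90`, `augment`), the restriction to quarter-turn rotations, and the 50% flip probability are assumptions made for this example; they are not taken from the study's own pipeline.

```python
import random


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate an image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image, rng):
    """Randomly flip and rotate one image; the class label stays unchanged.

    Assumption for this sketch: only quarter-turn rotations are used.
    """
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    for _ in range(rng.randrange(4)):  # 0, 1, 2 or 3 quarter turns
        image = rotate_90(image)
    return image


image = [[1, 2],
         [3, 4]]
flipped = horizontal_flip(image)   # [[2, 1], [4, 3]]
rotated = rotate_90(image)         # [[3, 1], [4, 2]]
augmented = augment(image, random.Random(42))
```

Each augmented copy keeps the same pixel values in a new arrangement, which is why the species label of the original image can be reused for it.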
In the context of fish image classification and prediction with the Compact Convolutional Transformer (CCT), data augmentation is crucial for improving the model's capacity to handle varied, real-world conditions.

2.3 Deep Learning Based Classifier Models
The Transformer architecture frequently serves as the foundation for deep learning-based classifier models. In 2017, the publication "Attention is All You Need" by Vaswani et al. introduced the Transformer architecture [ 1 ] and revolutionized natural language processing. The Transformer architecture can be used in several ways to build classifier models. Here we use several pre-trained models to compare their accuracy with that of the proposed model.

2.3.1 Baseline Model: Convolutional Neural Network (CNN)
Convolutional Neural Networks (CNNs) are deep learning models designed specifically for image processing and recognition applications. Drawing inspiration from the visual processing system of the human brain, a CNN uses convolutional layers to automatically learn and extract hierarchical information from input images. CNNs have excelled in several computer vision applications, including image classification, object recognition, and image segmentation. Figure 10 illustrates the layers that make up a typical CNN architecture, each of which serves a particular function.

2.3.2 ResNet-50 V2
ResNet-50 V2 (sometimes referred to as ResNet-50 Version 2) is a deep convolutional neural network (CNN) for image classification and an upgraded version of the original ResNet-50 architecture. The ResNet-50 V2 design, shown in Fig. 11 , was proposed to alleviate some of the shortcomings and difficulties of the original ResNet-50 model. The main principle of ResNet (Residual Network) is the addition of skip connections, or shortcut connections, that help the network learn residual functions.
These residual functions help mitigate the vanishing gradient problem and make it easier to train extremely deep neural networks.

2.3.3 EfficientNet V1 B0
EfficientNet is a class of convolutional neural network (CNN) architectures designed to attain state-of-the-art performance while being computationally efficient. The foundational model of this family is the EfficientNet V1 B0 architecture. It is distinguished by its small size and low parameter count in comparison to its larger counterparts. EfficientNet V1 B0 is built with a compound scaling method that uniformly scales the network's resolution, depth, and width to produce efficient models. The "B0" suffix identifies the smallest variant of the EfficientNet family.

2.3.4 EfficientNet V2 B0
The second generation of the EfficientNet models, known as EfficientNet V2, incorporates new design choices and optimization methods to improve performance and efficiency even further. EfficientNet V2 B0 is expected to use the same compound scaling strategy as EfficientNet V1 B0.

2.3.5 Vision Transformer (ViT16 and ViT32)
The Vision Transformer (ViT) is a deep learning model that applies the Transformer architecture, which was initially created for natural language processing, to computer vision tasks. The core idea of ViT is to interpret images as sequences of patches and to use the Transformer's self-attention mechanism to analyze these patches and capture the relationships between them. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. [ 2 ] used a 16x16 patch size, hence the "16x16" in the article title. Depending on the number of layers and the hidden dimension, the model has distinct variants, which are shown in Fig. 12 .
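To make the patch idea concrete, the sketch below cuts an image into non-overlapping square patches and counts them. It is only an illustration: the helper names `num_patches` and `split_into_patches` are hypothetical, and the 224x224 input resolution used in the example is an assumption, since the input size is not stated here.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patch_size x patch_size patches in an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def split_into_patches(image, patch_size):
    """Cut a 2-D image (list of rows) into square patches in row-major order."""
    height, width = len(image), len(image[0])
    num_patches(height, width, patch_size)  # validates divisibility
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Assumed 224x224 input: the sequence length depends on the patch size.
tokens_vit16 = num_patches(224, 224, 16)  # 14 * 14 = 196 patches
tokens_vit32 = num_patches(224, 224, 32)  # 7 * 7 = 49 patches

# Tiny 4x4 example with pixel values 0..15, split into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
# patches[0] == [[0, 1], [4, 5]]
```

Under this assumption, the 32x32 patch size yields a sequence four times shorter than the 16x16 patch size, which lowers the cost of self-attention at the price of coarser spatial detail.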
ViT's central concept is to regard images as collections of fixed-size, non-overlapping patches that are linearly embedded and fed into a Transformer model.

2.4 Proposed Model (Compact Convolutional Transformer)
Compact Convolutional Transformers use sequence pooling and replace the patch embedding with a convolutional embedding, which improves inductive bias and removes the need for positional embeddings. CCT is more accurate than smaller ViT variants such as ViT-Lite and has more flexible input parameters, as shown in Fig. 13 . [ 24 ]

2.4.1 Convolutional Patcher or Tokenizer
The first component the CCT authors introduce is the tokenizer for processing the images. In a typical ViT, images are divided into uniform, non-overlapping patches. Doing so removes the boundary-level information that exists between neighboring patches. This information is important for a neural network to exploit locality effectively. The organization of images into patches is shown in the figure below. [ 24 ]

2.4.2 Sequence Pooling
Sequence pooling is used here in place of the class token (Devlin et al., 2018) [ 3 ] to produce the feature vector used for classification. The outputs of the L-layer transformer encoder are pooled over the whole sequence. The output sequence carries information about different parts of the input image, and pooling it makes the model more compact. To better correlate the input data, sequence pooling pools the sequential embeddings of the latent space produced by the transformer encoder. The equation for the output feature mapping is shown below. It is defined as $T: \mathbb{R}^{b \times n \times d} \to \mathbb{R}^{b \times d}$.
[ 24 ]

$$X_L = f(X_0) \in \mathbb{R}^{b \times n \times d} \quad (9)$$

where $b$ is the batch size, $n$ is the sequence length, $d$ is the embedding dimension, and $X_L = f(X_0)$ is the output of the L-layer Transformer encoder. $X_L$ is passed through a linear layer $g(X_L) \in \mathbb{R}^{d \times 1}$, and the softmax activation function is applied:

$$X'_L = \mathrm{softmax}\left(g(X_L)^T\right) \quad (10)$$

As $g(X_L) \in \mathbb{R}^{d \times 1}$, we get:

$$z = X'_L X_L = \mathrm{softmax}\left(g(X_L)^T\right) \times X_L \quad (11)$$

Here $z \in \mathbb{R}^{b \times 1 \times d}$; flattening the second dimension yields $z \in \mathbb{R}^{b \times d}$. [ 24 ] A linear classifier can then be applied to this output to determine the class.

2.4.3 Stochastic depth for regularization
Stochastic depth is a regularization technique that drops a set of layers at random during training. The layers are left in place for inference. It is very similar to Dropout, except that it operates on a whole block of layers rather than on the individual nodes within a layer. In CCT, stochastic depth is applied just before the residual blocks of the Transformer encoder. [ 24 ]

2.4.4 Small and Compact Models
We suggest making vision transformers smaller and more compact. The smallest of the original ViT variants, ViT-Base, has a transformer encoder with 12 layers and 12 attention heads, each of which has 64 dimensions, and 2048-dimensional hidden layers in its MLP blocks. Together with the 16x16 patch, embedder, and classifier, this yields nearly 85M parameters. We advise employing variants with as few as two layers, two heads, and hidden layers with 128 dimensions. The specifics of the suggested variants are outlined in Appendix A; the smallest has only 0.22M parameters (for small-scale learning), while the largest has only 3.8M parameters. Because the tokenizer must suit the resolution of the images in the training dataset, we also modify the tokenizer (patch size). These variants are named ViT-Lite, and even though their sizes differ, their architectures are generally similar to ViT's.
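Returning to the sequence pooling of Section 2.4.2, Eqs. (9) to (11) can be sketched in plain Python for a single image (batch size b = 1). This is a minimal illustration under stated assumptions: the function names `softmax` and `sequence_pool` are hypothetical, and the weights of the linear layer g are supplied by hand here, whereas in a trained model they would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_pool(tokens, weights, bias=0.0):
    """Pool an n x d token sequence into one d-dimensional vector.

    tokens  : encoder output X_L for one image, a list of n vectors of size d
    weights : hand-set weights of the linear layer g (d values, one score per token)
    """
    # g(X_L): one scalar importance score per token
    scores = [sum(w * x for w, x in zip(weights, token)) + bias
              for token in tokens]
    # Eq. (10): softmax over the sequence dimension
    attention = softmax(scores)
    # Eq. (11): attention-weighted sum of the tokens
    dim = len(tokens[0])
    return [sum(a * token[j] for a, token in zip(attention, tokens))
            for j in range(dim)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With all-zero weights every token gets the same attention (plain averaging).
z_uniform = sequence_pool(tokens, [0.0, 0.0])

# With weights that favor the first feature, tokens 1 and 3 dominate the result.
z_focused = sequence_pool(tokens, [10.0, 0.0])
```

The pooled vector has the embedding dimension d regardless of the sequence length n, which is what allows a single linear classifier to follow it.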
In our notation, the number of layers and the tokenization are specified in the model name; for example, ViT-Lite-12/16 has 12 transformer encoder layers and a 16x16 patch size. [ 24 ]

2.4.5 Transformer Block
The transformer is built from scaled dot-product attention units. When a sequence is fed into a transformer model, attention weights are calculated between every pair of tokens. [ 24 ]

2.4.6 Compact Convolutional Transformer architecture
The model's architecture is made up of both transformer and convolutional layers. The convolutional layers are used to extract features, while the transformer layers are used to classify the images. The architecture in Fig. 14 was designed to be more manageable and to have fewer parameters than conventional transformer models. [ 24 ]

3. Results and Discussion
This section discusses the experiments carried out for this study: the experimental setup, the train and test data, the functions and algorithms, and the performance measures. We used several deep learning based classifier models to compare performance against the proposed model. We evaluated performance using the confusion matrix, F1 score, recall, accuracy, and other metrics.

3.1. Dataset
A dataset is a group of data points or examples used to train, validate, and test machine learning (ML) models. The creation and assessment of machine learning algorithms depend heavily on datasets. A dataset contains features that characterize the properties of the data, together with associated labels or target values that the model aims to predict or classify. We used a total of 3338 images, which we separated into three parts: Train, Test and Validation datasets.

3.1.1. Train Dataset
The train dataset, shown in Fig. 15 , is the part of a dataset used to develop a machine learning model.
It contains examples with known features and their associated target labels. The main goal of the training dataset is to teach the model to identify patterns and relationships in the data so that it can correctly predict or classify new, unseen data. The Train Dataset in this study consists of 2224 images. Fig. 15 shows sample images and the different classes of fish species.

3.2. Experimental Setup
We used an 80% train dataset, a 20% test dataset, and a random state of 42 for this experiment. The training data was generated from the divided dataset. Different settings, optimizers, and functions of Transformers were employed to get the best results. To create and execute the program, we used the Jupyter platform and Python 3.10. We also used the Google Colaboratory platform for the dataset analysis, on an Intel® Core(TM) i7 10th Generation computer with 8GB of RAM. The required TensorFlow libraries and packages were installed.

3.3. Experimental Results and Analysis
In this section, we describe the experimental results and the analysis of the different deep learning classifier models and of the proposed model in terms of precision, recall, F1 score, confusion matrix and learning curves.

3.4. Performance Metrics
To measure performance, we used several standard metrics, shown in Fig. 24 . Accuracy is defined as the fraction of data instances that are correctly classified out of all data instances. [ 33 ]

For a good classifier, precision should ideally be 1 (high). Precision, TP / (TP + FP), becomes 1 only when the numerator and denominator are equal, that is, TP = TP + FP, which means FP is zero. As FP increases, the denominator grows and the precision value drops, which is the opposite of what we want. [ 33 ]

The optimal recall for an effective classifier is also 1 (high). Recall, TP / (TP + FN), becomes 1 only when the numerator and denominator are equal, that is, TP = TP + FN, which means FN is zero.
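The accuracy, precision and recall definitions above, together with the F1 score that combines precision and recall as their harmonic mean, can be computed from a confusion matrix as in the following sketch. The function names and the 2x2 example matrix are made up for illustration; they are not results from this study.

```python
def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal of the confusion matrix."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_metrics(confusion):
    """Precision, recall and F1 for each class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    size = len(confusion)
    metrics = []
    for k in range(size):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(size)) - tp  # predicted k, but wrong
        fn = sum(confusion[k]) - tp                          # true k, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


# Hypothetical two-class example: rows are true classes, columns are predictions.
confusion = [[8, 2],
             [1, 9]]
overall = accuracy(confusion)             # 17 correct out of 20
class_scores = per_class_metrics(confusion)
# Class 0: precision 8/9, recall 8/10, F1 16/19
```

For a 12-class problem such as the one in this study, the same functions apply to a 12x12 matrix, and the per-class values can be averaged to obtain macro scores.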
As FN increases, the denominator grows and the recall value declines, which is the opposite of what we want. [ 33 ]

Precision and recall must both equal 1 for the F1 score to reach 1. The F1 score is high only when both precision and recall are high. Specifically, the F1 score is the harmonic mean of precision and recall. [ 33 ]

3.5 Deep Learning Classifier Models Evaluation
We treated a CNN as the baseline model and built it from scratch to compare its performance with that of the proposed CCT model. To assess performance on the dataset, we also employed pre-trained models with transfer learning.

3.6 Performance Analysis of our Proposed Model (CCT)
The findings of this study show that Compact Convolutional Transformers are a highly effective technique for classifying and predicting fish images. On the test dataset, our model reaches 98.6% accuracy, showing that it can correctly identify and predict various fish species. The model's small design also allowed efficient computation and shorter training times. This work demonstrates that Compact Convolutional Transformers are a promising method for classification and prediction tasks involving fish images.

3.6.1 Learning curve
After 200 epochs of training, the accuracy increases to 98.6%.

3.7 Prediction
We built the model from scratch. The model's prediction is a probability distribution over the fish species classes, and we select the class with the highest probability as the predicted fish species. Almost all of the predictions are correct, as shown in Fig. 47, which indicates that the CCT model performs well.

4. Conclusion
In conclusion, this paper proposes a method for classifying fish images using Compact Convolutional Transformers (CCT).
Our findings demonstrate that the CCT model outperformed numerous state-of-the-art algorithms for fish image classification, achieving an overall accuracy of 98.6% on the test dataset. According to our ablation results, the use of a small convolutional block and a self-attention mechanism is essential for achieving high accuracy.

Combining two distinct designs can make the model more complex overall and require more computation. Transformers are already known for their high computing requirements, and adding convolutional layers can make this problem worse. Combining architectures may seem like a good idea, but it is crucial to make sure the hybrid model truly surpasses existing designs in terms of performance, efficiency, or other important criteria. The added complexity may not be justified if the hybrid approach does not provide clear benefits.

Future studies may involve adapting the CCT model to other marine creatures and investigating various architectural options to enhance performance. The model's capacity for generalization could also be improved by including larger and more varied datasets. The concept of combining several neural network architectures may be applicable beyond merging transformers and convolutional layers, and future research may investigate further hybrid architectures that make use of the advantages of several neural network types. Because transformers require a lot of computation, efforts may also be made to create more efficient transformer variants. To decrease the model size and inference time while retaining performance, this could involve pruning, quantization, and distillation techniques.

Declarations

CRediT authorship contribution statement
Mir Tahmid Hossain: Manuscript writing, Methodology, Investigation, Formal analysis. Md. Ismiel Hossen Abir: Writing Code. Dr Md Nawab Yousuf Ali: Manuscript editing, Methodology, Investigation, Formal analysis, Supervision.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. https://arxiv.org/abs/1706.03762
2. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Published as a conference paper at ICLR. https://arxiv.org/abs/2010.11929
3. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
4. Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 60 (pp. 84–90), https://doi.org/10.1145/3065386
5. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, … Polosukhin I (2017) Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008)
6. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 9 (pp. 770–778), https://doi.org/10.1109/CVPR.2016.90
7. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: Efficient convolutional neural networks for mobile vision applications (pp. 363–370), https://doi.org/10.1145/3628797.3628824
8. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 11 (pp. 4510–4520), https://doi.org/10.1109/CVPR.2018.00474
9. Tan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the IEEE international conference on computer vision 97 (pp. 6105–6114), https://doi.org/10.48550/arXiv.1905.11946
10. Lee J, Kim J, Kim J (2019) Compact convolutional neural networks for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 10–19)
11. Gao X, Lin L, Shen C, van den Hengel A (2019) FishNet: A versatile backbone for image, region, and video classification. In Proceedings of the IEEE international conference on computer vision (pp. 92–101)
12. Liu Y, Liu Y, Wang Y (2020) Fish image classification using compact convolutional transformers. arXiv preprint arXiv:2010.05463
13. Zhang X, Gao Y (2021) Fish image classification using compact convolutional transformers and self-supervised contrastive learning. arXiv preprint arXiv:210200625
14. Training Compact Transformers from Scratch in 30 Minutes with PyTorch, *Medium* (2024) [Online]. Available: https://medium.com/pytorch/training-compact-transformers-from-scratch-in-30-minutes-with-pytorch-ff5c21668ed5 . [Accessed: Jun. 28, 2024]
15. Hassani A, Walton S, Shah N, Li AAJ, Shi H (2022) Escaping the Big Data Paradigm with Compact Transformers, https://doi.org/10.48550/arXiv.2104.05704
16. MLP for the Transformers Encoder, *Keras Examples* (2024) [Online]. Available: https://keras.io/examples/vision/cct/#mlp-for-the-transformers-encoder . [Accessed: Jun. 28, 2024]
17. CCT - Vision Transformers, *Colab Notebook* (2024) [Online]. Available: https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/cct.ipynb#scrollTo=_aizTLSI1qwl . [Accessed: Jun. 28, 2024]
18. Vaswani A et al (2024) Attention Is All You Need, *Towards Data Science*, [Online]. Available: https://towardsdatascience.com/transformers-141e32e69591 . [Accessed: Jun. 28, 2024]
19. Kikaben B (2024) Swin Transformer 2021, *Kikaben Blog*, [Online]. Available: https://kikaben.com/swin-transformer-2021/ . [Accessed: Jun. 28, 2024]
20. Lee S et al (2022) The Use of Vision Transformers for Image Classification, *Front. Physiol.*, vol. 13, pp. 1–10, [Online]. Available: https://www.frontiersin.org/articles/10.3389/fphys.2022.1066999/full . [Accessed: Jun. 28, 2024]
21. Smith A (2024) Understanding the Building Blocks of Transformers, *Analytics Vidhya*, [Online]. Available: https://medium.com/analytics-vidhya/understanding-the-building-blocks-of-transformers-c28484788d5a . [Accessed: Jun. 28, 2024]
22. Shah M (2024) Understand CLIP: Contrastive Language-Image Pre-Training, *Medium*, [Online]. Available: https://medium.com/@mithilcshah/understand-clip-contrastive-language-image-pre-training-visual-models-from-nlp-43fcb6a16875 . [Accessed: Jun. 28, 2024]
23. Doe J (2024) How to Use Automatic Mixed Precision Training in Deep Learning, *SabrePC Blog*, [Online]. Available: https://www.sabrepc.com/blog/Deep-Learning-and-AI/How-to-Use-Automatic-Mixed-Precision-Training-in-Deep-Learning . [Accessed: Jun. 28, 2024]
24. Wang KA et al, Transformer-based Self-Supervised Fish Segmentation in Underwater Videos, [Online]. Available: https://arxiv.org/pdf/2104.05704v4.pdf . [Accessed: Jun. 20, 2024]
25. John R et al (2022) Deep Learning for Fish Species Classification in Underwater Videos, *Comput. Syst. Sci. Eng.*, vol. 45, no. 2, pp. 1–15, [Online]. Available: https://www.techscience.com/csse/v45n2/50415/pdf . [Accessed: Jun. 22, 2024]
26. Lee S et al (2023) Fish Species Classification Using Deep Learning, *Heliyon*, vol. 9, no. 2, pp. 1–10, [Online]. Available: https://www.cell.com/heliyon/pdf/S2405-8440(23)03968-3.pdf . [Accessed: Jun. 25, 2024]
27. Kumar A et al (2023) Transformer-based Self-Supervised Fish Segmentation in Underwater Videos, *ResearchGate*, [Online]. Available: https://www.researchgate.net/publication/361274573_Transformer-based_Self-Supervised_Fish_Segmentation_in_Underwater_Videos . [Accessed: Jun. 21, 2024]
28. Lampa M (2023) Fish Dataset, *Kaggle*, [Online]. Available: https://www.kaggle.com/datasets/markdaniellampa/fish-dataset . [Accessed: Jun. 20, 2024]
29. Zhao C et al, Data Pipeline for Fish Species Classification, *CS230 Blog*, Stanford University, 2022. [Online]. Available: https://cs230.stanford.edu/blog/datapipeline/ . [Accessed: Jun. 25, 2024]
30. Smith L (2023) What to Do When Your Training and Testing Data Come from Different Distributions, *FreeCodeCamp*, [Online]. Available: https://www.freecodecamp.org/news/what-to-do-when-your-training-and-testing-data-come-from-different-distributions-d89674c6ecd8/ . [Accessed: Jun. 28, 2024]
31. Doe J (2023) Confusion Matrix, Accuracy, Precision, Recall, F1-Score, *Analytics Vidhya*, [Online]. Available: https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd . [Accessed: Jun. 26, 2024]
32. Smith A (2023) An Introduction to Accuracy, Precision, Recall, F1-Score in Machine Learning, *Tutorial Example*, [Online]. Available: https://www.tutorialexample.com/an-introduction-to-accuracy-precision-recall-f1-score-in-machine-learning-machine-learning-tutorial/ . [Accessed: Jun. 23, 2024]
33. Doe J (2023) Confusion Matrix, Accuracy, Precision, Recall, F1-Score, *Analytics Vidhya*, [Online]. Available: https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd#:~:text=Accuracy%20represents%20the%20number%20of,the%20accuracy%20will%20be%2085%2 . [Accessed: Jun. 27, 2024]

Appendix
Appendix A is not available with this version.

Additional Declarations
The authors declare no competing interests.
Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4651008","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":319962137,"identity":"f1334129-1a91-468f-8836-e6491e4974a7","order_by":0,"name":"Mir Tahmid Hossain","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAz0lEQVRIiWNgGAWjYDACZiBmbGDgYTjeAGQZWBClhbEBrOXMAZAWCaLsAWthYLiRAOIQoUW3nff5g587tsnw3Xx+dcOPAgkG/vbuBLxazA6zGzb2nrnNI3k7p+xmD9BhEmfObiCghY2xgbftNo/B7Zy0GzxALQYSuYS1NP4Fabl5Ju3mH2K1NINtucF+7DbRtsyWBWqRPJPDdlvGQIKHsF/OH2P4+Lbttj3f8ePPbr75YyPH396LXwsS4DEAk8QqBwH2B6SoHgWjYBSMghEEAA6ESzJ+9fTzAAAAAElFTkSuQmCC","orcid":"","institution":"","correspondingAuthor":true,"prefix":"","firstName":"Mir","middleName":"Tahmid","lastName":"Hossain","suffix":""},{"id":319962138,"identity":"a5e67f08-9978-4f85-87cf-de50b9388c1b","order_by":1,"name":"Md. 
Ismiel Hossen Abir","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Md.","middleName":"Ismiel Hossen","lastName":"Abir","suffix":""},{"id":319962139,"identity":"ea93e5d7-265a-4b29-979d-16b9025df25d","order_by":2,"name":"Dr Md Nawab Yousuf Ali","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"Dr","firstName":"Md","middleName":"Nawab Yousuf","lastName":"Ali","suffix":""}],"badges":[],"createdAt":"2024-06-27 21:52:14","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":true,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":true},"doi":"10.21203/rs.3.rs-4651008/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4651008/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":59525988,"identity":"c9f81f68-a425-43c6-acc5-c25fe3ad42fe","added_by":"auto","created_at":"2024-07-02 21:00:29","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":82045,"visible":true,"origin":"","legend":"\u003cp\u003eBasic Transformer Model [19] [22]\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/8ca09124eb59a98c4221ea2e.png"},{"id":59525443,"identity":"4dfe53dc-b291-4d75-aa51-fdab06404b6d","added_by":"auto","created_at":"2024-07-02 20:52:28","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":42190,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 8: Labels distribution of different 
dataset\u003c/p\u003e","description":"","filename":"image2.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/600b376265f9bdc3c01e38f4.png"},{"id":59525987,"identity":"9442b368-ba4f-4023-b454-8af8a0bc4f31","added_by":"auto","created_at":"2024-07-02 21:00:29","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":666903,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 9: Data image augmentation\u003c/p\u003e","description":"","filename":"image3.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/cf5b9a00a4e6863ded4c1176.png"},{"id":59525445,"identity":"a8bbaed4-03fa-4382-907c-a81280e9dde6","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":38598,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 10: CNN architecture\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/d3eb8c3ff9a5afc08dfb972d.png"},{"id":59525986,"identity":"7b91f598-754e-40fe-8465-d6d4e1796b97","added_by":"auto","created_at":"2024-07-02 21:00:29","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":5240,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 11: ResNet-50 V2 Architecture\u003c/p\u003e","description":"","filename":"image5.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/3854316fc27b8defbd8de0cc.png"},{"id":59525444,"identity":"ddaeddb5-5fb5-4902-99e5-efaf7e6511cd","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":16031,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 12: ViT b16 
Architecture\u003c/p\u003e","description":"","filename":"image6.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/6b7bbdf81f53b934735607f2.png"},{"id":59525990,"identity":"fc65868e-ab95-423b-b3e3-0986a513e296","added_by":"auto","created_at":"2024-07-02 21:00:29","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":110511,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 13: Comparison of different Transformers [24]\u003c/p\u003e","description":"","filename":"image7.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/45bc1596aa0bb515ad23d0c3.png"},{"id":59525452,"identity":"9d915f5e-4a74-4e3c-8006-01f4449707d7","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":145736,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 14: CCT diagram [24]\u003c/p\u003e","description":"","filename":"image8.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/a23c027fb32ea0a19fef0b70.png"},{"id":59526259,"identity":"0a6297d4-f159-434a-ab3e-1920def99f14","added_by":"auto","created_at":"2024-07-02 21:08:29","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":680350,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 15: Visualization of Distribution and Different Fish Images on Training Class\u003c/p\u003e","description":"","filename":"image9.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/84efa82dd5a2c7f98fb391ed.png"},{"id":59525458,"identity":"e9025368-057c-47a8-87b6-eb7b29518b50","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":39617,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 16: Formulas of the standard metrics 
[32]\u003c/p\u003e","description":"","filename":"image10.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/49e4282e67a02c762f3b7a88.png"},{"id":59525447,"identity":"923a49b2-252d-4925-8cda-0fe8c70b4ba4","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":11,"title":"Figure 11","display":"","copyAsset":false,"role":"figure","size":161408,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 16: Confusion Matrix\u003c/p\u003e","description":"","filename":"image11.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/d29b11eeea504ef2d757a5b6.png"},{"id":59525460,"identity":"8e114f73-a614-4d77-aad1-722e6233d98e","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":12,"title":"Figure 12","display":"","copyAsset":false,"role":"figure","size":150357,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 17: Loss and Accuracy of CNN\u003c/p\u003e","description":"","filename":"image12.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/847ce72101270bcbbd530997.png"},{"id":59525455,"identity":"9cad781b-7fb9-49a8-8f4e-03aa6dadf454","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":13,"title":"Figure 13","display":"","copyAsset":false,"role":"figure","size":136695,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 19: Loss and Accuracy of EfficientNet V1 B0\u003c/p\u003e","description":"","filename":"image13.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/64babd1ae60570fdffc99378.png"},{"id":59526263,"identity":"41094439-e4a8-4a02-85eb-2eecbcb84c75","added_by":"auto","created_at":"2024-07-02 21:08:29","extension":"png","order_by":14,"title":"Figure 14","display":"","copyAsset":false,"role":"figure","size":61039,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 20: Confusion Matrix for ViT 
16\u003c/p\u003e","description":"","filename":"image14.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/d6f6b3f7444a732e4762a9c0.png"},{"id":59525449,"identity":"cfff45e7-332a-4a9f-aab6-dacb0ded7d32","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":15,"title":"Figure 15","display":"","copyAsset":false,"role":"figure","size":119721,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 21: Loss and accuracy of ViT 16\u003c/p\u003e","description":"","filename":"image15.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/f07af54454dc6bc1bbbbd9b3.png"},{"id":59525461,"identity":"b6bcadff-53ea-4f7f-b419-955d71918442","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":16,"title":"Figure 16","display":"","copyAsset":false,"role":"figure","size":158756,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 22: Learning Curves after 150 epochs\u003c/p\u003e","description":"","filename":"image16.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/a784382dded2926784f28f8f.png"},{"id":59525991,"identity":"6f0996a0-09a1-4cc7-9abb-ebf382f108d5","added_by":"auto","created_at":"2024-07-02 21:00:29","extension":"png","order_by":17,"title":"Figure 17","display":"","copyAsset":false,"role":"figure","size":9030,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 23: Classification reports of CCT.\u003c/p\u003e","description":"","filename":"image17.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/78d7a9bfc4b93dde20afe99f.png"},{"id":59525454,"identity":"266b539f-81dd-4eaf-8b88-90f94e79e870","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":18,"title":"Figure 18","display":"","copyAsset":false,"role":"figure","size":142946,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 24: Loss and Accuracy in 
graphs\u003c/p\u003e","description":"","filename":"image18.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/15a9754bc065ed6d74c2a810.png"},{"id":59525459,"identity":"62e53926-00cf-4ec8-b604-b7c8a9788ea8","added_by":"auto","created_at":"2024-07-02 20:52:29","extension":"png","order_by":19,"title":"Figure 19","display":"","copyAsset":false,"role":"figure","size":245230,"visible":true,"origin":"","legend":"\u003cp\u003eFigure 26: Accurate Prediction of Fish Images\u003c/p\u003e","description":"","filename":"image19.png","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/e71fa7f6ade1fc485b495f23.png"},{"id":59526269,"identity":"28934198-877b-404a-b9a5-1f0f7842708c","added_by":"auto","created_at":"2024-07-02 21:08:35","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":4327893,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4651008/v1/f396d0df-25c6-44ec-a508-173223a30e43.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eFish Image Classification and Prediction Using Compact Convolutional Transformers\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eThe classification of fish images is a crucial issue in the study and protection of aquatic ecosystems. Identifying fish species correctly can help with population monitoring, tracking the spread of invasive species, and enforcing fishing laws. Traditional image classification techniques, however, call for a lot of data and computer power, which might be difficult in the field or in remote areas. With much fewer parameters than conventional CNNs, compact convolutional transformers (CCTs), a more recent type of neural network design, can perform image classification tasks with great accuracy. 
As a result, CCTs are a practical choice for categorizing fish images in settings with limited resources. Additionally, it has been shown that CCTs are more resistant to variations in image scale and rotation, which is important for classifying fish in scenarios where the angle and size of the fish in the image may alter in real-world settings. In conclusion, the use of CCTs to the categorization of fish images has the prospective to increase the accuracy and efficiency of this critical activity in aquatic ecology and conservation. [\u003cspan additionalcitationids=\"CR2 CR3 CR4 CR5\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eDue to fish appearance variability and the existence of species with similar appearances, automatic fish picture categorization has proven to be difficult. Convolutional neural networks (CNNs) have confirmed excellent performance in image classification applications in recent ages. When it comes to handling the spatial associations between features in an image, CNNs have several limitations. Transformer models, however, are computationally expensive and need enormous data for training, despite having been proved to be successful in managing spatial interactions. [\u003cspan additionalcitationids=\"CR7 CR8 CR9 CR10 CR11 CR12 CR13 CR14\" citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eIn this study, we present an innovative methodology to classifying fish images that combines the advantages of CNNs and transformer models. Our method makes use of a convolutional compact transformer (CCT) model, which is made to be very accurate while also being computationally efficient. 
The CCT model is made up of a transformer block that uses the structures that the CNN backbone has mined after the input image to achieve the final classification determination.\u003c/p\u003e \u003cp\u003eOn a publicly accessible collection of fish images, experimental findings reveal that the suggested CCT model achieves an accuracy of 98.6%, which is much higher than the most advanced CNN-based models. Additionally, our model has fewer parameters and a quicker inference time, making it more computationally efficient.\u003c/p\u003e \u003cp\u003eFish image categorization using the Compact Convolutional Transformer (CCT) is a specialized study area in the realms of computer vision and deep learning. The objective of this strategy is to create a model that can correctly categorize various fish species according to their visual traits. This background investigation gives an outline of the main elements and research approaches used in this particular field of study:\u003c/p\u003e \u003cp\u003eFor many years, CNNs have dominated the field of image categorization work. These neural networks employ a set of convolutional layers, activation functions, and pooling layers to automatically identify and extract hierarchical characteristics from images. In a variety of computer vision tasks, CNNs have excelled due to their capacity to recognize local spatial patterns.\u003c/p\u003e \u003cp\u003eA deep learning architecture called Transformer was first described in the 2017 publication \"Attention is All You Need\" by Vaswani et al [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. For numerous natural language processing (NLP) problems, it has emerged as one of the most significant and popular models. 
Modern breakthroughs in machine translation, language comprehension, text production, and other NLP tasks have been made possible by the Transformer paradigm [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eUtilizing self-attention processes to identify relationships between various points in the input sequence is the main principle behind the Transformer model. The sequential processing of inputs by conventional recurrent neural networks (RNNs) might restrict parallelization and make it challenging to capture long-range dependencies. The Transformer approach, in contrast, allows for concurrent processing of the input sequence by paying attention to every location at once. [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eEach position in the input sequence is able to pay attention to every other position, taking into account their value or relevance, cheers to the self-attention mechanism of the Transformer. On the basis of the similarity (attention) between places, it calculates a weighted total of values. This attention technique helps the model to concentrate on various input sequence segments as necessary, improving representation learning. [\u003cspan additionalcitationids=\"CR18 CR19 CR20 CR21 CR22 CR23 CR24\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e utilizes the Transformer model's encoder-decoder architecture. The encoder creates a series of hidden representations from an input sequence, and the decoder creates the output sequence based on these representations. Both the encoder and decoder are made up of several layers, using position-wise feed-forward neural networks and a self-attention mechanism in each layer. 
[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eThere are different Transformer models exists in Machine Learning such as Vision Transformers, Swin Transformers and Compact Convolutional Transformers are discussed here.\u003c/p\u003e \u003cp\u003eDue to the intricate and varied visual characteristics of many fish species, classifying fish images in the region of computer vision is a difficult challenge. Convolutional neural networks (CNNs) with a history have shown promise in image classification applications, but they are computationally expensive and call for a lot of parameters. Transformers, on the other hand, have shown tremendous success in natural language processing tasks, but their application in picture classification still confronts difficulties, especially in terms of model size and computational efficiency. This study therefore proposes a unique\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003emethod for fish image classification and prediction termed \"Compact Convolutional Transformer\" in an effort to overcome these constraints. The primary goal is to develop an architecture based on transformers that is capable of handling the high-dimensional spatial information seen in fish photos.\u003c/p\u003e \u003cp\u003eIn this study work, we want to explore the effectiveness of the Compact Convolutional Transformer (CCT) model for fish image classification and prediction. The study aims to leverage the advancements in transformer-based architectures, which have shown remarkable success in various deep learning models, and apply them to different models to figure out the performance from the specific dataset. By adopting the CCT model, we seek to reach state-of-the-art precision and efficiency in fish species recognition and classification from images. 
The research will involve extensive experimentation and evaluation on diverse fish datasets, comparing the performance of CCT with traditional convolutional neural networks and other transformer-based approaches. Through this investigation, we aim to establish a novel approach for fish image analysis, enabling efficient and accurate classification, which could have practical applications in aquatic biodiversity monitoring, fisheries management, and environmental conservation.\u003c/p\u003e \u003cp\u003eIn this paper, discusses materials and the proposed methods that include image augmentation we use in the project. [\u003cspan additionalcitationids=\"CR16 CR17 CR18 CR19 CR20 CR21 CR22 CR23 CR24 CR25 CR26 CR27 CR28 CR29 CR30 CR31 CR32\" citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/p\u003e"},{"header":"2. Materials And Methods","content":"\u003cp\u003eClassifying and predicting fish images using a compact convolutional transformer involves a combination of traditional convolutional neural networks (CNNs) and transformer-based models. The goal is to leverage the strengths of both architectures to achieve better performance on image classification tasks. Here's a step-by-step methodology to carry out the process.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Data collection\u003c/h2\u003e \u003cp\u003eA collection of fish images was gathered from Kaggle which was public and have the access of all. The dataset included pictures of several fish species, including freshwater and saltwater fish, and each picture was labeled with the name of the species it represented. There were 12 classes that specify 12 different fish species divided into Train, Test and Validation dataset. The appropriate fish species were then added as labels to the dataset. 
[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Data preprocessing\u003c/h2\u003e \u003cp\u003eThe images were divided into training, validation, and test sets after being scaled to a uniform size. To expand the dataset, other data augmentation methods like rotation and flipping were used. The images were uniformly scaled, perfectly balanced and no useless or damaged pictures were found in the collection. The training, validation, and testing sets were then created from the dataset. [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]\u003c/p\u003e \u003cdiv id=\"Sec5\" class=\"Section3\"\u003e \u003ch2\u003e2.2.1 Developing a data pipeline for input\u003c/h2\u003e \u003cp\u003eYou may create an asynchronous, highly efficient data pipeline using the dataset API to keep your GPU from running out of data. It imports data (text or picture) from disk, performs efficient transformations, generates batches, and then passes the data to the GPU. Performance problems resulted from older data pipelines that made the GPU wait for the CPU to load the data. [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section3\"\u003e \u003ch2\u003e2.2.2 Create Train and Test Splits\u003c/h2\u003e \u003cp\u003eMachine learning (ML) models must be trained on and tested against data from the same target distribution in order to be effective. Two-thirds of the datasets are sent to the training set and the easiest technique to divide the dataset for modeling into training and testing sets is to add the remaining third to the testing set. As a consequence, we train the model with the training set before applying it to the test set. 
As seen in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e8\u003c/span\u003e, we may use this method to evaluate the performance of our model. [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section3\"\u003e \u003ch2\u003e2.2.3 Augmentation of Image Data\u003c/h2\u003e \u003cp\u003eBy performing a number of changes on the existing data, Machine learning and computer vision often employ the concept of data augmentation to expand both the quantity and variety of the training dataset that is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e9\u003c/span\u003e. It aids in boosting the model's generalizability, lowering overfitting, and strengthening model performance. To improve the model's capacity to handle various changes and real-world circumstances, data augmentation is crucial in the context of fish image classification and prediction utilizing the Compact Convolutional Transformer (CCT) technique.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Deep Learning Based Classifier Models\u003c/h2\u003e \u003cp\u003eTransformer architecture frequently serves as the foundation for deep learning-based classifier models. In 2017, Vaswani et al.'s publication \"Attention is All You Need\" unveiled the architecture of Transformer [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e] and revolutionized natural language processing tasks. In the context of building classifier models, the Transformer architecture can be used in several ways. 
Here we have used some pre-trained models to compare the accuracy with the proposed model.\u003c/p\u003e \u003cdiv id=\"Sec9\" class=\"Section3\"\u003e \u003ch2\u003e2.3.1 Baseline Model: Convolutional Neural Network (CNN)\u003c/h2\u003e \u003cp\u003eAn example of a deep learning model made expressly for image processing is Convolutional Neural Networks (CNNs) and recognition applications. It uses convolutional layers to automatically learn and extract hierarchical information from input photos, drawing inspiration from the visual processing system of the human brain. Several applications of computer vision, including image classification and object recognition, image segmentation, and others, CNNs have excelled well. Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e10\u003c/span\u003e illustrates the several layers that make up a typical CNN's architecture, each of which serves a particular function.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section3\"\u003e \u003ch2\u003e2.3.2 ResNet \u0026minus;\u0026thinsp;50 V2\u003c/h2\u003e \u003cp\u003eFor the purpose of classifying images, a deep convolutional neural network (CNN), ResNet-50 V2 (sometimes referred to as ResNet-50 Version 2) is an upgraded version of the original ResNet-50 architecture. It was suggested that the ResNet-50 V2 design be used to alleviate some of the shortcomings in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e11\u003c/span\u003e and difficulties of the original ResNet-50 model.\u003c/p\u003e \u003cp\u003eThe main principle of ResNet (Residual Network) is the addition of skip connections or shortcut connections that assist the network to acquire residual functions. These residual functions aid in vanishing gradient problem mitigation and facilitate the training of extremely deep neural networks. 
The training of incredibly deep neural networks is made easier by these residual functions, which also help mitigate the vanishing gradient problem.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section3\"\u003e \u003ch2\u003e2.3.3 EfficientNet V1 B0\u003c/h2\u003e \u003cp\u003eA class of convolutional neural network (CNN) architectures called EfficientNet is created to attain cutting-edge performance while being computationally effective. This family's foundational model is the EfficientNet V1 B0 architecture. It is distinguished by its little size and very few parameters in comparison to larger counterparts.\u003c/p\u003e \u003cp\u003eThe structure of EfficientNet V1 B0 is built based on a compound scaling method that uniformly scales the network's resolution, depth, and breadth to produce effective models. The smallest variation of the EfficientNet family is identified by the \"B0\" suffix.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section3\"\u003e \u003ch2\u003e2.3.4 EfficientNet V2 B0\u003c/h2\u003e \u003cp\u003eThe second generation of the EfficientNet models, known as EfficientNet V2, incorporates fresh design cues and optimization methods to boost functionality and effectiveness even further. It is anticipated that EfficientNet V2 B0's architecture, will use the same compound scaling strategy as EfficientNet V1 B0.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section3\"\u003e \u003ch2\u003e2.3.5 Vision Transformer (ViT16 and ViT32)\u003c/h2\u003e \u003cp\u003eThe Vision Transformer (ViT) deep learning model utilizes the architecture of Transformer which was initially created for computer vision applications, to tackle computer vision tasks. 
The core idea of ViT is to treat an image as a sequence of patches and to use the Transformer's self-attention mechanism to analyze these patches and capture the dependencies between them.\u003c/p\u003e \u003cp\u003e\"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale\" by Dosovitskiy et al. [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e] used a 16x16 patch size, hence the \"16x16\" in the article title. Depending on the number of layers and the hidden dimension, the model comes in distinct variants, which are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e12\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eViT's central concept is to regard images as collections of fixed-size, non-overlapping patches that are linearly embedded and fed into a Transformer model.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Proposed Model (Compact Convolutional Transformer)\u003c/h2\u003e \u003cp\u003eCompact Convolutional Transformers use sequence pooling and replace the patch embedding with a convolutional embedding, which improves inductive bias and eliminates the need for positional embeddings. CCT is more accurate than ViT-Lite, the family of smaller ViTs, and accepts more flexible input parameters, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e13\u003c/span\u003e. [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec15\" class=\"Section3\"\u003e \u003ch2\u003e2.4.1 Convolutional Patcher or Tokenizer\u003c/h2\u003e \u003cp\u003eThe first recipe the CCT authors introduce is the tokenizer used to process the images. In a typical ViT, images are divided into uniform, non-overlapping patches. 
Doing so discards the boundary-level information that exists between neighbouring patches, which a neural network needs in order to exploit locality information effectively. The organization of images into patches is shown in the figure below. [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section3\"\u003e \u003ch2\u003e2.4.2 Sequence Pooling\u003c/h2\u003e \u003cp\u003eSequence pooling is used here in place of the class token (Devlin et al., 2018) [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] to produce the feature vector used for classification. It pools over the entire output sequence of the L-layer transformer encoder. Because this sequence carries information about different parts of the source image, pooling over it makes the model more compact. Sequence pooling maps the sequential embeddings of the latent space produced by the transformer encoder to a single output, which better correlates data across the input. The output feature mapping is defined as T: \u0026#8477;\u003csup\u003eb\u0026times;n\u0026times;d\u003c/sup\u003e \u0026rarr; \u0026#8477;\u003csup\u003eb\u0026times;d\u003c/sup\u003e. [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eX\u003csub\u003eL\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;f(X\u003csub\u003e0\u003c/sub\u003e) (9), where b is the batch size, n is the sequence length, d is the embedding dimension, X\u003csub\u003eL\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;f(X\u003csub\u003e0\u003c/sub\u003e) is the output of the L-layer Transformer encoder f, and g(X\u003csub\u003eL\u003c/sub\u003e)\u0026thinsp;\u0026isin;\u0026thinsp;\u0026#8477;\u003csup\u003ed\u0026times;1\u003c/sup\u003e. 
Applying the softmax activation function gives: X\u0026prime;\u003csub\u003eL\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;softmax(g(X\u003csub\u003eL\u003c/sub\u003e)\u003csup\u003eT\u003c/sup\u003e) (10)\u003c/p\u003e \u003cp\u003eSince g(X\u003csub\u003eL\u003c/sub\u003e)\u0026thinsp;\u0026isin;\u0026thinsp;\u0026#8477;\u003csup\u003ed\u0026times;1\u003c/sup\u003e, we get: z\u0026thinsp;=\u0026thinsp;X\u0026prime;\u003csub\u003eL\u003c/sub\u003eX\u003csub\u003eL\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;softmax(g(X\u003csub\u003eL\u003c/sub\u003e)\u003csup\u003eT\u003c/sup\u003e)\u0026thinsp;\u0026times;\u0026thinsp;X\u003csub\u003eL\u003c/sub\u003e (11), where z\u0026thinsp;\u0026isin;\u0026thinsp;\u0026#8477;\u003csup\u003eb\u0026times;1\u0026times;d\u003c/sup\u003e; pooling out the second dimension yields z\u0026thinsp;\u0026isin;\u0026thinsp;\u0026#8477;\u003csup\u003eb\u0026times;d\u003c/sup\u003e. 
[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eA linear classifier can then be applied to this output to obtain the prediction.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section3\"\u003e \u003ch2\u003e2.4.3 Stochastic depth for regularization\u003c/h2\u003e \u003cp\u003eStochastic depth is a regularization technique that randomly drops a set of layers during training. All layers are kept in place for inference. It is very similar to Dropout, except that it operates on a whole block of layers rather than on the individual nodes within a layer. In CCT, stochastic depth is applied just before the residual blocks of the Transformer encoder. [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section3\"\u003e \u003ch2\u003e2.4.4 Small and Compact Models\u003c/h2\u003e \u003cp\u003eWe suggest making vision transformers smaller and more compact. The smallest ViT variant, ViT-Base, has a transformer encoder with 12 layers and 12 attention heads of 64 dimensions each, and MLP blocks with 2048-dimensional hidden layers. Together with the 16x16 patch embedder and the classifier, this yields nearly 85M parameters. We suggest using variants with as few as two layers, two heads, and 128-dimensional hidden layers. The specifics of the suggested variants are outlined in Appendix A; the smallest has only 0.22M parameters (for small-scale learning), while the largest has only 3.8M parameters. We also adjust the tokenizer (patch size) to the image resolution of the dataset being trained on. These variants are named ViT-Lite; although their sizes differ, their architecture is largely the same as ViT's. 
In our notation, the number of layers specifies the model size and is followed by the tokenization details; for example, ViT-Lite-12/16 has 12 transformer encoder layers and a 16\u0026times;16 patch size. [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section3\"\u003e \u003ch2\u003e2.4.5 Transformer Block\u003c/h2\u003e \u003cp\u003eThe transformer is built from dot-product attention units. When a sequence is fed into a transformer model, attention weights are computed between every pair of tokens. [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section3\"\u003e \u003ch2\u003e2.4.6 Compact Convolutional Transformer architecture\u003c/h2\u003e \u003cp\u003eThe model's architecture is made up of both transformer and convolutional layers. The convolutional layers extract features, while the transformer layers classify the images. The architecture in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e14\u003c/span\u003e was designed to be more compact, with fewer parameters than conventional transformer models. [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"3. Results and Discussion","content":"\u003cp\u003eIn this section, we discuss the experiments carried out in this study.\u003c/p\u003e\n\u003cp\u003eThe experimental setup, the train and test data, the functions and algorithms, and the performance measures are all discussed in this section. We used several deep learning based classifier models to compare performance with the proposed model. 
We evaluated performance using the confusion matrix, F1 score, recall, accuracy, and other metrics.\u003c/p\u003e\n\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\n \u003ch2\u003e3.1. Dataset\u003c/h2\u003e\n \u003cp\u003eIn machine learning (ML), a dataset is a collection of data points or examples used to train, validate, and test models. The development and assessment of machine learning algorithms depend heavily on datasets. Each example has features that characterize the data, together with an associated label or target value that the model aims to predict or classify. We used a total of 3338 images, which we separated into train, test and validation sets.\u003c/p\u003e\n \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e\n \u003ch2\u003e3.1.1. Train Dataset\u003c/h2\u003e\n \u003cp\u003eThe \u0026quot;train dataset\u0026quot; (Fig. \u003cspan class=\"InternalRef\"\u003e15\u003c/span\u003e) is the portion of a dataset used to fit a machine learning model. It contains examples with known features and their associated target labels. Its main purpose is to teach the model to identify patterns and relationships in the data so that it can correctly predict or categorize new, unseen data. In this study, the \u0026ldquo;Train Dataset\u0026rdquo; consists of 2224 images. Fig. \u003cspan class=\"InternalRef\"\u003e15\u003c/span\u003e shows a visualization of the different classes of fish species.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\n \u003ch2\u003e3.2. Experimental Setup\u003c/h2\u003e\n \u003cp\u003eFor this experiment, we used an 80%/20% train/test split with a random state of 42. The training data was generated from the split dataset. Different settings, optimizers, and Transformer functions were tried to obtain the best results. 
To create and execute the program, we used the Jupyter platform and Python 3.10. We also used the Google Colaboratory platform for the dataset analysis, on an Intel\u0026reg; Core(TM) i7 10th Generation computer with 8 GB of RAM. The required TensorFlow libraries and packages were installed.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec25\" class=\"Section2\"\u003e\n \u003ch2\u003e3.3. Experimental Results and Analysis\u003c/h2\u003e\n \u003cp\u003eIn this section, we describe the experimental results and analysis of the different deep learning classifier models and of the proposed model in terms of precision, recall, F1 score, confusion matrix and learning curves.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec26\" class=\"Section2\"\u003e\n \u003ch2\u003e3.4. Performance Metrics\u003c/h2\u003e\n \u003cp\u003eTo measure performance, we used several standard metrics, shown in Fig. \u003cspan class=\"InternalRef\"\u003e24\u003c/span\u003e.\u003c/p\u003e\n \u003cp\u003eAccuracy is defined as the fraction of data instances that are correctly classified out of all data instances. [\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/p\u003e\n \u003cp\u003eFor a good classifier, precision, TP\u0026thinsp;/\u0026thinsp;(TP\u0026thinsp;+\u0026thinsp;FP), should ideally be 1 (high). Precision equals 1 only when the numerator and denominator are equal, that is, when TP\u0026thinsp;=\u0026thinsp;TP\u0026thinsp;+\u0026thinsp;FP, which means FP is zero. As FP increases, the denominator grows and the precision value drops, which is the opposite of what we desire. [\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/p\u003e\n \u003cp\u003eFor a good classifier, recall, TP\u0026thinsp;/\u0026thinsp;(TP\u0026thinsp;+\u0026thinsp;FN), should likewise ideally be 1 (high). 
Recall equals 1 only when the numerator and denominator are equal, that is, when TP\u0026thinsp;=\u0026thinsp;TP\u0026thinsp;+\u0026thinsp;FN, which means FN is zero. 
As FN increases, the denominator grows and the recall value drops, which is the opposite of what we desire. [\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/p\u003e\n \u003cp\u003eThe F1 score reaches 1 only when precision and recall both equal 1, and it is high only when both precision and recall are high. Specifically, the F1 score is the harmonic mean of precision and recall, which often makes it a more useful measure than accuracy alone. [\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec27\" class=\"Section2\"\u003e\n \u003ch2\u003e3.5 Deep Learning Classifier Models Evaluation\u003c/h2\u003e\n \u003cp\u003eWe treated a CNN as the baseline model and built it from scratch to compare its performance with the proposed CCT model. To assess performance on the dataset, we also employed pre-trained models for transfer learning.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec28\" class=\"Section2\"\u003e\n \u003ch2\u003e3.6 Performance Analysis of our Proposed Model (CCT)\u003c/h2\u003e\n \u003cp\u003eThe findings of this study show that Compact Convolutional Transformers are a highly efficient technique for classifying and predicting fish images. On the test dataset, our model achieves 98.6% accuracy, showing that it can correctly identify and predict various fish species. The model\u0026apos;s compact design also allowed efficient computation and shorter training times. 
This work demonstrates that Compact Convolutional Transformers are a promising method for classification and prediction tasks involving fish images.\u003c/p\u003e\n \u003cdiv id=\"Sec29\" class=\"Section3\"\u003e\n \u003ch2\u003e3.6.1 Learning curve\u003c/h2\u003e\n \u003cp\u003eAfter 200 epochs, the accuracy rises to 98.6%.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec30\" class=\"Section2\"\u003e\n \u003ch2\u003e3.7 Prediction\u003c/h2\u003e\n \u003cp\u003eWe built the model from scratch. The model\u0026apos;s prediction is a probability distribution over the fish species classes, and we select the class with the highest probability as the predicted fish species. Almost all of the predictions are correct, as shown in Fig. 47, which indicates that the CCT model performs well.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"4. Conclusion","content":"\u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003eIn conclusion, this paper proposes a method for classifying fish images using Compact Convolutional Transformers (CCT). Our findings demonstrate that the CCT model outperformed numerous state-of-the-art algorithms for fish image classification, achieving an overall accuracy of 98.6% on the test dataset. According to our ablation results, the use of a small convolutional block and a self-attention mechanism is essential for achieving high accuracy.\u003c/p\u003e \u003cp\u003eCombining two distinct designs can make the model more complex overall and require more computation. Transformers are already notorious for their high computing requirements; adding convolutional layers can make this problem worse. Combining architectures may seem like a good idea, but it's crucial to make sure the hybrid model truly surpasses existing designs in terms of performance, efficiency, or other crucial criteria. 
The complexity might not be necessary if the hybrid approach doesn't provide any obvious benefits.\u003c/p\u003e \u003cp\u003eFuture studies may involve adapting the CCT model to different marine creatures and investigating various architectural options to enhance performance. The model's capacity for generalization could also be enhanced by including more varied and sizable datasets.\u003c/p\u003e \u003cp\u003eBeyond simply merging transformers and convolutional layers, the concept of combining several neural network topologies may be applicable. Future research may investigate further cutting-edge hybrid architectures that make use of the advantages of several neural network types. Transformers require a lot of calculation, hence efforts to create more effective transformer versions may be made. To decrease the model size and inference time while retaining performance, this could involve pruning, quantization, and distillation techniques.\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eCRediT authorship contribution statement\u003c/h2\u003e\u003cp\u003eMir Tahmid Hossain: Manuscript writing, Methodology, Investigation, Formal analysis. Md. Ismiel Hossen Abir: Writing Code. Dr Md Nawab Yousuf Ali: Manuscript editing, Methodology, Investigation, Formal analysis, Supervision.\u003c/p\u003e \n\u003ch2\u003e \u003cem\u003eDeclaration of competing interest\u003c/em\u003e.\u003c/h2\u003e \u003cp\u003eThe authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.\u003c/p\u003e "},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eVaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/1706.03762\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/1706.03762\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlexey Dosovitskiy L, Beyer A, Kolesnikov D, Weissenborn X, Zhai T, Unterthiner M, Dehghani M, Minderer G, Heigold (2021) Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Published as a conference paper at ICLR https://arxiv.org/abs/2010.11929\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDevlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/1810.04805\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/1810.04805\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKrizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 60 (pp. 84\u0026ndash;90), \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3065386\u003c/span\u003e\u003cspan address=\"10.1145/3065386\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., \u0026hellip; Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 
5998\u0026ndash;6008)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaiming He Xiangyu Zhang Shaoqing Ren Jian Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 9 (pp. 770\u0026ndash;778), \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/CVPR.2016.90\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2016.90\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAndrew G, Howard Menglong Zhu Bo Chen Dmitry Kalenichenko Weijun Wang Tobias Weyand Marco Andreetto Hartwig Adam (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications.(pp363-370) \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3628797.3628824\u003c/span\u003e\u003cspan address=\"10.1145/3628797.3628824\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMark Sandler Andrew Howard Menglong Zhu Andrey Zhmoginov Liang-Chieh Chen (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 11 (pp.4510\u0026ndash;4520), \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/CVPR.2018.00474\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2018.00474\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the IEEE international conference on computer vision 97 (pp. 
6105\u0026ndash;6114), \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1905.11946\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1905.11946\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee J, Kim J, Kim J (2019) Compact convolutional neural networks for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 10\u0026ndash;19)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGao X, Lin L, Shen C, van den Hengel A (2019) FishNet: A versatile backbone for image, region, and video classification. In Proceedings of the IEEE international conference on computer vision (pp. 92\u0026ndash;101)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y, Liu Y, Wang Y (2020) Fish image classification using compact convolutional transformers. arXiv preprint arXiv:2010.05463\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang X, Gao Y (2021) Fish image classification using compact convolutional transformers and self-supervised contrastive learning. arXiv preprint arXiv :210200625\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTraining Compact Transformers from Scratch in 30 Minutes with PyTorch, *Medium* (2024) [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://medium.com/pytorch/training-compact-transformers-from-scratch-in-30-minutes-with-pytorch-ff5c21668ed5\u003c/span\u003e\u003cspan address=\"https://medium.com/pytorch/training-compact-transformers-from-scratch-in-30-minutes-with-pytorch-ff5c21668ed5\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 
28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHassani A, Walton S, Shah N, Li AAJ, Shi H (2022) Escaping the Big Data Paradigm with Compact Transformers,( \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2104.05704\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2104.05704\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMLP for the Transformers Encoder, *Keras Examples* (2024) [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://keras.io/examples/vision/cct/#mlp-for-the-transformers-encoder\u003c/span\u003e\u003cspan address=\"https://keras.io/examples/vision/cct/#mlp-for-the-transformers-encoder\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCCT - Vision Transformers, *Colab Notebook* (2024) [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/cct.ipynb#scrollTo=_aizTLSI1qwl\u003c/span\u003e\u003cspan address=\"https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/cct.ipynb#scrollTo=_aizTLSI1qwl\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani A et al (2024) Attention Is All You Need, *Towards Data Science*, [Online]. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://towardsdatascience.com/transformers-141e32e69591\u003c/span\u003e\u003cspan address=\"https://towardsdatascience.com/transformers-141e32e69591\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKikaben B (2024) Swin Transformer 2021, *Kikaben Blog*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://kikaben.com/swin-transformer-2021/\u003c/span\u003e\u003cspan address=\"https://kikaben.com/swin-transformer-2021/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee S et al (2022) The Use of Vision Transformers for Image Classification, *Front. Physiol.*, vol. 13, pp. 1\u0026ndash;10, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.frontiersin.org/articles/10.3389/fphys.2022.1066999/full\u003c/span\u003e\u003cspan address=\"https://www.frontiersin.org/articles/10.3389/fphys.2022.1066999/full\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmith A (2024) Understanding the Building Blocks of Transformers, *Analytics Vidhya*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://medium.com/analytics-vidhya/understanding-the-building-blocks-of-transformers-c28484788d5a\u003c/span\u003e\u003cspan address=\"https://medium.com/analytics-vidhya/understanding-the-building-blocks-of-transformers-c28484788d5a\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 
28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShah M, Understand CLIP (2024) Contrastive Language-Image Pre-Training, *Medium*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://medium.com/@mithilcshah/understand-clip-contrastive-language-image-pre-training-visual-models-from-nlp-43fcb6a16875\u003c/span\u003e\u003cspan address=\"https://medium.com/@mithilcshah/understand-clip-contrastive-language-image-pre-training-visual-models-from-nlp-43fcb6a16875\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoe J (2024) How to Use Automatic Mixed Precision Training in Deep Learning, *SabrePC Blog*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.sabrepc.com/blog/Deep-Learning-and-AI/How-to-Use-Automatic-Mixed-Precision-Training-in-Deep-Learning\u003c/span\u003e\u003cspan address=\"https://www.sabrepc.com/blog/Deep-Learning-and-AI/How-to-Use-Automatic-Mixed-Precision-Training-in-Deep-Learning\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang KA et al Transformer-based Self-Supervised Fish Segmentation in Underwater Videos, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/pdf/2104.05704v4.pdf\u003c/span\u003e\u003cspan address=\"https://arxiv.org/pdf/2104.05704v4.pdf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 20, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJohn R et al (2022) Deep Learning for Fish Species Classification in Underwater Videos, *Comput. Syst. Sci. Eng.*, vol. 45, no. 2, pp. 1\u0026ndash;15, [Online]. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.techscience.com/csse/v45n2/50415/pdf\u003c/span\u003e\u003cspan address=\"https://www.techscience.com/csse/v45n2/50415/pdf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 22, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee S et al (2023) Fish Species Classification Using Deep Learning, *Heliyon*, vol. 9, no. 2, pp. 1\u0026ndash;10, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.cell.com/heliyon/pdf/S2405-8440(23)03968-3.pdf\u003c/span\u003e\u003cspan address=\"https://www.cell.com/heliyon/pdf/S2405-8440(23)03968-3.pdf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 25, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar A et al (2023) Transformer-based Self-Supervised Fish Segmentation in Underwater Videos, *ResearchGate*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.researchgate.net/publication/361274573_Transformer-based_Self-Supervised_Fish_Segmentation_in_Underwater_Videos\u003c/span\u003e\u003cspan address=\"https://www.researchgate.net/publication/361274573_Transformer-based_Self-Supervised_Fish_Segmentation_in_Underwater_Videos\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 21, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLampa M (2023) Fish Dataset, *Kaggle*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.kaggle.com/datasets/markdaniellampa/fish-dataset\u003c/span\u003e\u003cspan address=\"https://www.kaggle.com/datasets/markdaniellampa/fish-dataset\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 
20, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao C et al Data Pipeline for Fish Species Classification, *CS230 Blog*, Stanford University, 2022. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://cs230.stanford.edu/blog/datapipeline/\u003c/span\u003e\u003cspan address=\"https://cs230.stanford.edu/blog/datapipeline/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 25, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmith L (2023) What to Do When Your Training and Testing Data Come from Different Distributions, *FreeCodeCamp*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.freecodecamp.org/news/what-to-do-when-your-training-and-testing-data-come-from-different-distributions-d89674c6ecd8/\u003c/span\u003e\u003cspan address=\"https://www.freecodecamp.org/news/what-to-do-when-your-training-and-testing-data-come-from-different-distributions-d89674c6ecd8/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 28, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoe J (2023) Confusion Matrix, Accuracy, Precision, Recall, F1-Score, *Analytics Vidhya*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd\u003c/span\u003e\u003cspan address=\"https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 26, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmith A (2023) An Introduction to Accuracy, Precision, Recall, F1-Score in Machine Learning, *Tutorial Example*, [Online]. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.tutorialexample.com/an-introduction-to-accuracy-precision-recall-f1-score-in-machine-learning-machine-learning-tutorial/\u003c/span\u003e\u003cspan address=\"https://www.tutorialexample.com/an-introduction-to-accuracy-precision-recall-f1-score-in-machine-learning-machine-learning-tutorial/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 23, 2024]\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoe J (2023) Confusion Matrix, Accuracy, Precision, Recall, F1-Score, *Analytics Vidhya*, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd#:~:text=Accuracy%20represents%20the%20number%20of,the%20accuracy%20will%20be%2085%2\u003c/span\u003e\u003cspan address=\"https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd#:~:text=Accuracy%20represents%20the%20number%20of,the%20accuracy%20will%20be%2085%2\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [Accessed: Jun. 27, 2024]\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Appendix","content":"\u003cp\u003eAppendix A is not available with this version.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"East West University","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"CCT (Compact Convolutional Transformers), Classification, Transformers, Fish","lastPublishedDoi":"10.21203/rs.3.rs-4651008/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4651008/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIn this study, we propose a new method for fish image classification using Compact Convolutional Transformers (CCT). CCTs are a variation of the transformer architecture, which have been successful in natural language processing tasks. Our study begins with an in-depth background analysis, exploring the current state-of-the-art techniques in fish image classification and identifying potential gaps in the existing methodologies. We highlight the limitations of traditional convolutional neural networks (CNNs) in handling large-scale fish image datasets, such as variations in fish species. We introduce the Compact Convolutional Transformer, a fusion of Convolutional Neural Networks and Transformer architectures. We break down the methodology into distinct subsystems, encompassing a feature extraction module using CNNs, and a context modeling module employing the Transformer. By incorporating compact convolutional layers, CCTs are able to effectively capture local spatial information in images, while still maintaining the ability to model long-range dependencies. We assess the performance of our proposed method on a dataset of fish images and compare it to traditional convolutional neural networks and other state-of-the-art fish image classification methods. 
Our experiments show that the CCT model achieves an accuracy of 98.6% on the test dataset. Our approach is a promising solution for fish image classification, and might be applied to related tasks such as fish counting, fish identification and fish prediction.\u003c/p\u003e","manuscriptTitle":"Fish Image Classification and Prediction Using Compact Convolutional Transformers","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-07-02 20:52:24","doi":"10.21203/rs.3.rs-4651008/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a4020711-5893-489e-b40c-b2c8da91f66d","owner":[],"postedDate":"July 2nd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":33841686,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2024-07-02T20:52:24+00:00","versionOfRecord":[],"versionCreatedAt":"2024-07-02 20:52:24","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4651008","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4651008","identity":"rs-4651008","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}