Improving the Quality of Skin Lesion Data for Training Vision-Language Models

Atufigwege Mwakatapanya, Tess Watt, Christos Chrysoulas

Research Square preprint. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8871094/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1).

Abstract

Skin cancer diagnosis using machine learning faces significant challenges, primarily due to the lack of well-labelled and balanced skin lesion datasets. Most available datasets are limited to only two lesion types; melanoma and nevi, or they exhibit severe class and skin tone imbalance leading to biases during model training. Furthermore, vision-language models (VLMs) face an additional challenge in training using these datasets as they also lack semantic labelling for effective training.
To address these challenges, several researchers have used generative adversarial networks (GANs) to generate realistic synthetic images. While this can improve diagnostic accuracy, it raises ethical and trust concerns, especially in clinical settings. Moreover, applying GANs to imbalanced datasets amplifies the existing biases. This paper proposes an alternative approach: curating and combining the existing public datasets HAM10000 and BCN20000 into a single well-labelled dataset called RHB, optimised for training Google’s Gemma 3 4B model.

Introduction

According to The Skin Cancer Foundation [1], skin cancer is the most common cancer in the United States (US) and globally. In the US alone, more than 9,500 people are diagnosed with skin cancer every day, and more than two people die of it every hour [1]. While basal cell carcinoma (BCC) is the most common skin cancer, melanoma is the most dangerous. Almost 90% of non-melanoma skin cancers are due to exposure to ultraviolet (UV) light. Although melanoma is the most dangerous skin cancer, its five-year survival rate is 99% if it is detected at an early stage [1]. With this in mind, the accurate and early detection of skin cancer is crucial.

Several machine learning (ML) studies have been conducted to accelerate the diagnosis of skin cancer cases [2–7]. The most researched ML models are convolutional neural networks (CNNs), while multimodal approaches, including vision-language models (VLMs), are gaining pace due to their ability to combine visual features with natural language processing. Since the inception of ML research aimed at skin disease diagnosis, the majority of the dermoscopic images collected have focused on the classification of melanoma and non-melanoma lesions. In addition, the datasets collected did not contain any metadata, as the primary focus was on image-only models for diagnosis, specifically CNNs.
When the HAM10000 dataset [8] was published, it started to fill this gap by increasing the number of lesion types from two to seven and by providing metadata on the patient’s age, gender and lesion location. The International Skin Imaging Collaboration (ISIC) challenge datasets released in 2019 and 2020 extended the focus to nine lesion classes, including out-of-distribution lesions [9]. A handful of dermatological datasets have been released so far; however, due to ethical and privacy concerns, most publicly available datasets lack diversity, consistent labelling and balanced class representation. These limitations pose a challenge to the reliability of the datasets in machine learning.

VLMs require high-resolution images and metadata to achieve high performance. However, suitable dermoscopic datasets are scarce: most publicly available datasets represent only two classes of lesions, melanoma and nevi, and they follow different labelling standards, which makes it difficult to use them in combination.

The use of generative adversarial networks (GANs) to generate synthetic images has been employed by several researchers [10] to address the lack of diverse, balanced and well-labelled data in skin disease diagnosis. Balancing datasets with synthetic images has achieved promising performance; however, it has certain limitations. Despite achieving strong results, in some cases above 97% [11], it may also amplify existing dataset biases [12]. Furthermore, GANs, like any other artificial intelligence (AI) technique, are prone to some degree of error during synthetic data generation, which cannot be tolerated in the medical field. Using models trained on synthetic images to diagnose patients in real-life clinical settings might raise ethical concerns and trust issues, as these models are used to make life-and-death decisions. In the long run, the use of GANs in dermatology will lead to more synthetic datasets which cannot reflect reality.
This might lead to failure in identifying out-of-distribution lesions since there will not be enough real data to synthesise from.

In this work, we introduce the RHB (Refined HAM10000 & BCN20000) skin lesion dataset, which can be used for the classification of 9 classes of skin lesions: basal cell carcinoma, benign keratosis, dermatofibroma, melanoma, nevus, scars, solar actinic keratosis, squamous cell carcinoma, and a class of unknown lesions.

The paper is structured as follows: we review related work, describe our experimental setup, provide a discussion of the results and comparative assessment, and finally, we conclude our work.

Related Work

Publicly Available Skin Lesion Datasets

There are a few skin lesion datasets that are publicly available. Some are available free of charge for research and academic purposes while others require payment. Table 1 describes some publicly available datasets.

Table 1: A list of publicly available datasets

Dataset | Total Images | Number of Classes
HAM10000 [8] | 10,000 | 7
BCN20000 [13] | 18,946 | 9
ISIC Archive [9] | Over 90,000 | 8
PH2 [14] | 200 | 2
Dermofit Image Library [15] | 1,300 | 10
Med-Node [16] | 170 | 2
Derm7pt [17] | 1,011 | 2

There are other publicly available datasets with a mix of clinical, dermoscopic and pathological skin lesions such as the Interactive Atlas of Dermoscopy, Asan, Hallym, SD-198 & 260, Dermnet NZ, and The Cancer Genome Atlas [18]. Most of these datasets lack the diversity, balance and standard labelling required to train VLMs and yield accurate and unbiased classification.

Challenges in Skin Lesion Datasets

Several studies have highlighted the limitations of open-source skin lesion datasets. Natha et al. [2] demonstrated the potential of deep learning for melanoma detection but emphasized the need for diverse and well-labelled datasets. Tschandl et al. [8] introduced the HAM10000 dataset but noted its imbalance in lesion types and underrepresentation of certain skin tones. Li et al.
[10] stated the challenges in their study as limited labelled data, imbalanced data, missing metadata, and noisy data. The ISIC Challenge datasets (2016-2023) [9] have made significant contributions to dermatological machine learning, but gaps remain in data standardisation, diversity, and label consistency. In their study analysing the source of bias in skin lesion classification, Bissoto et al. [19] discovered that the bias mainly comes from the datasets themselves rather than the algorithms. The image acquisition techniques and efforts to increase diversity, like flipping, zooming, and cropping, create bias towards certain lesions. This makes it difficult for models to overcome bias even after feeding them with additional clinical information like the patient’s age, sex, lesion localisation and lesion border characteristics. These techniques, however, seem to be necessary as some images have visible artifacts introduced during acquisition, like dark corners, marker ink, gel, water bubbles, colour charts, ruler marks and hair (see Fig. 1).

Machine Learning for Skin Lesion Classification

The use of ML has overcome most of the challenges faced in traditional methods of skin lesion classification by analysing many images at once, more accurately than human dermatologists. ML can identify patterns that the human eye cannot and can therefore be more accurate and objective. However, despite their promise in improving classification accuracy, existing ML models like random forests, support vector machines, and CNNs rely on specific datasets which are mostly designed to distinguish between melanoma and non-melanoma lesions [2]. In a study by Brinker et al. [3], a CNN model performed better than 87% of 157 dermatologists in terms of sensitivity and specificity, but the classification was between melanoma and atypical nevi only. To overcome the hindrance that single-modality ML models face due to the lack of diverse and well-labelled datasets, Remya et al.
[5] introduced a framework that combines visual data and patient metadata like age, sex, and lesion location to improve the prediction accuracy.

Vision-Language Models

A vision-language model (VLM) is an advanced artificial intelligence model that integrates visual (image and video) and textual data to perform tasks requiring both modalities. These models have gained popularity in recent years due to their ability to handle complex tasks such as image captioning, visual question answering (VQA), and text-to-image generation [21]. Models such as Contrastive Language-Image Pre-training (CLIP) have demonstrated robust performance in image classification tasks by leveraging large-scale paired image-text datasets [22]. However, despite the advancement of VLMs in image-text tasks, very few studies have been conducted on domain-specific medical VLMs [23].

For VLMs focusing on the dermatology domain, studies are still in the early stages, and they face some common challenges such as the lack of well-labelled and diverse skin lesion datasets. The available datasets lack textual descriptions that align with visual features, making it difficult to train VLMs effectively. Addressing these gaps by improving dataset quality and incorporating structured metadata will enhance the usability of VLMs in dermatology.

SkinGPT-4, introduced by Zhou et al. [24], leveraged MiniGPT-4 [25] by fine-tuning it on a collection of 22,742 publicly available and 30,187 proprietary skin disease images. The model was evaluated by dermatologists, who rated its results on a Likert scale. Around 75% of the ratings were “Strongly Agree” and “Agree”. During the ImageCLEF 2024 MEDIQA-MAGIC Challenge, Cieplicka et al. [26] presented a solution to classify skin diseases using small-scale multimodal models (moondream2 [27] and TinyLLaVA [28]).
The target was to fine-tune these models, which are designed to be used on resource-limited devices like Raspberry Pi, edge devices, or mobile phones, for VQA tasks. The results indicated that fine-tuning these models enhances their domain-specific knowledge in dermatology; however, since the images and text prompts were used repetitively during training, there is potential for overfitting. Cirone et al. [29] compared the performance of generic-domain GPT-4V against LLaVA [30] in differentiating melanoma from nevi. The results indicated that GPT-4V outperformed LLaVA in different settings, with an overall accuracy of 85% compared with 45%. However, both models failed to correctly classify darkened pigmented lesions.

Generative Adversarial Networks

Generative adversarial networks (GANs) consist of two neural networks, the generator and the discriminator, opposing each other. Simply put, the generator tries to produce realistic synthetic data from random noise, while the discriminator tries to identify whether the data it is shown is real or fake. The original GAN uses a multi-layer perceptron and does not require any other input apart from the data itself.

To address the lack of diverse, balanced and well-labelled data in skin lesion datasets, the use of GANs to generate synthetic images has been employed by several researchers [10]. Hasan et al. [31] and Pham et al. [4] suggest the use of augmentation techniques to address the lack of accurate, manually annotated dermatological datasets, on which deep learning models are heavily dependent. Udrea and Mitra [32] introduced the use of GANs on images captured using mobile devices to classify between melanoma and non-melanoma skin lesions. Their target was to test whether the use of clinical images (as opposed to dermoscopic images) in detecting pigmented skin lesions would yield the required results after employing a GAN on the dataset. The results were over 90% accurate. Wang et al.
[7] generated an additional 5,000 melanoma images using the StyleGAN2-ADA network [33] and combined them with the ISIC 2020 Challenge dataset [34] to tackle the imbalance in the original ISIC 2020 dataset.

Literature Conclusion

Although studies on the use of VLMs for classifying skin lesions are still in their early stages, these models hold significant potential by incorporating natural language processing (NLP) into their architecture. However, a major challenge lies in the limited availability of suitable skin lesion datasets for training VLMs. While GANs can create synthetic images to address this gap, their use raises ethical concerns, which makes real-world implementation challenging. It is therefore crucial to have a solution for the dataset limitations that ensures both performance and practical usability of VLMs in medical applications. Our work directly addresses this research gap.

Methods

This section presents the methodology employed in the collection, preprocessing, and integration of the skin lesion datasets used in this research. The data cleaning process focused on eliminating duplicate images generated through augmentation methods (e.g., rotation, zooming, and padding) that could introduce biases and lead to model overfitting. Priority was given to preserving image quality, reducing visual artifacts, and removing dark microscope borders using both automated and manual techniques.

Data Collection

To adhere to the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), only publicly available datasets with licences allowing research or academic usage, collected from the official websites, were used. For this work, two datasets were used: HAM10000 and BCN20000. The HAM10000 dataset contains data from human subjects; however, the data has been anonymised and approved by the institutional ethics board at the University of Queensland (Protocol-No.
2017001223) and the ethics committee at the Medical University of Vienna (Protocol-No. 1804/2017). The dataset was downloaded from the following link: https://api.isic-archive.com/collections/66/. The BCN20000 dataset also contains data collected from human subjects, which has also been anonymised. It obtained institutional ethics approval (HCB/2019/0413) from the Hospital Clínic in Barcelona. The dataset was downloaded from the following link: https://api.isic-archive.com/collections/249/. Both datasets are under the CC-BY-NC 4.0 licence, meaning they can be used, modified, and shared for academic purposes.

Data Cleaning

Removal of Duplicate Images and Quality Filtering

In both the HAM10000 and BCN20000 datasets, duplicates of the same images were created to balance and diversify lesion representation. This was done by removing artifacts, cropping, rotating and zooming in on the original images. To prevent overfitting and minimise bias towards larger lesion classes, duplicate images were removed to keep only a single image per set (see Fig. 2). This was completed using scikit-image, opencv-python, and other common Python libraries (numpy, pandas, shutil) to separate the images and process the datasets to match the remaining images. The image to keep was chosen in the following order of priority: images with fewer artifacts were given the highest priority, then images with higher quality, and lastly images with no dark edges. During this process, clean metadata files with no duplicate image records were produced for both datasets and stored for later use. A script for this task is available at: https://github.com/atu88/gemma-lesion-classifier.

Image Standardisation

After the removal of duplicate images, the next step was to standardise the images to the same format, resolution and lesion visibility. The primary focus was to remove dark edges as much as possible while keeping the lesions intact and centred. This task was performed using XnView MP.
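As a rough illustration of the selection priority described under Removal of Duplicate Images and Quality Filtering (fewer artifacts, then higher quality, then no dark edges), the sketch below keeps one image per duplicate group. It is not the authors' `duplicates_removal.py`: the record fields (`group_id`, `artifact_count`, `quality`, `has_dark_edges`) are hypothetical, and how each value would be measured from the images is left open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    # All fields are hypothetical; the paper does not publish its metadata schema.
    image_id: str
    group_id: str          # shared by every copy of the same original image
    artifact_count: int    # e.g. hairs, ruler marks, ink (lower is better)
    quality: float         # any image-quality score (higher is better)
    has_dark_edges: bool   # True if a dark microscope border is visible


def priority_key(rec):
    """Sort key: fewer artifacts first, then higher quality, then no dark edges."""
    return (rec.artifact_count, -rec.quality, rec.has_dark_edges)


def keep_one_per_group(records):
    """Return a dict mapping each duplicate group to its preferred record."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.group_id].append(rec)
    return {gid: min(recs, key=priority_key) for gid, recs in groups.items()}


# Illustrative records only; these are not taken from either dataset.
records = [
    ImageRecord("a1", "lesion_a", artifact_count=2, quality=0.9, has_dark_edges=False),
    ImageRecord("a2", "lesion_a", artifact_count=0, quality=0.7, has_dark_edges=True),
    ImageRecord("a3", "lesion_a", artifact_count=0, quality=0.8, has_dark_edges=True),
    ImageRecord("b1", "lesion_b", artifact_count=1, quality=0.5, has_dark_edges=True),
    ImageRecord("b2", "lesion_b", artifact_count=1, quality=0.5, has_dark_edges=False),
]
kept = keep_one_per_group(records)
print({gid: rec.image_id for gid, rec in kept.items()})
```

Encoding the priority as a tuple means later criteria only break ties left by earlier ones, which matches the ordering stated in the text.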
For the HAM10000 dataset, most images had small or no dark edges, which were removed by cropping the images from 600×450 pixels to 450×450 pixels while preserving their quality. For the BCN20000 dataset, most images had dark edges of different sizes. To deal with this, a visual analysis was conducted on a few samples, and images which could be cropped to 800×800 pixels while preserving their quality were batch-cropped. A manual review was then conducted, and in cases where the lesions were cropped out, the original images were manually cropped to replace the over-cropped images. The remaining images with dark edges that needed cropping but could not be batch-processed without affecting the lesion (approximately 3,200 images) were manually cropped and centred whenever possible. The dark edges in some images could not be completely removed, as doing so would have cropped out part of the lesions. The BCN20000 images were then combined and batch-resized to 450×450 pixels to have the same resolution as the HAM10000 images.

Data Merging and Labelling

The processed images from HAM10000 and BCN20000 were merged to form a single dataset called RHB (Refined HAM10000 & BCN20000). The analysis and processing of metadata files were conducted using OpenRefine [35]. During the deduplication process, two clean metadata files corresponding to the datasets were produced from the original HAM10000 and BCN20000 metadata files. These files had some differences in lesion naming. They were transformed to have uniform naming as follows:

- In the BCN20000 metadata file there is a “Melanoma metastasis” class, which is not in the HAM10000 metadata file. In both metadata files there is a “Melanoma NOS” class, where NOS means ‘Not Otherwise Specified’, that is, the lesion has not been further categorised into a specific subtype. These classes were all renamed to “Melanoma”.
- In the BCN20000 metadata file there are the classes “Seborrheic keratosis” and “Solar lentigo”, while in the HAM10000 metadata file there is the class “Pigmented benign keratosis”. All these lesions fall under the benign keratosis lesion type, so these classes were all renamed to “Benign keratosis”.
- In the BCN20000 metadata file there is a “Scar” class which is not in the HAM10000 metadata file. This was left as-is.
- In both metadata files there are lesion types that appear ‘blank’. These were renamed to “Unknown” for easy analysis and to avoid model errors during training.
- In both metadata files there is a “Squamous cell carcinoma, NOS” class. These were renamed to “Squamous cell carcinoma”.

The two metadata files were merged to form a single file. The lesion distribution after combining and refining the datasets is shown in Table 2.

Table 2: RHB dataset lesion distribution

Lesion Type | Count
Basal cell carcinoma (BCC) | 1,632
Benign keratosis (BKL) | 1,198
Dermatofibroma (DF) | 127
Melanoma (MEL) | 1,538
Nevus (NV) | 7,107
Scar (SCAR) | 96
Solar or actinic keratosis (AKIEC) | 402
Squamous cell carcinoma (SCC) | 301
Unknown (UNK) | 509
Total | 12,910

By combining and curating the HAM10000 and BCN20000 datasets into the RHB dataset, we address the lack of high-quality, open-source skin lesion datasets. The RHB dataset contains no duplicate or low-quality images, its labels are consistent and well structured, and it is publicly available. It contains more images and more classes than the HAM10000 dataset, making it suitable for a wider range of dermatological conditions. Its images are of higher quality than those of the BCN20000 dataset, making it more reliable for machine learning applications.

Experimental Setup

This section outlines the dataset preparation process for model fine-tuning and the fine-tuning environment (the model and hardware requirements).
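For reference, the renaming rules listed under Data Merging and Labelling can be sketched as a small lookup table. The authors performed this step in OpenRefine, so the Python below is only an equivalent illustration; the raw label strings are copied from the text, and their exact spelling and casing in the original metadata files is an assumption.

```python
# Raw labels as quoted in the text; exact spelling in the metadata files is assumed.
RENAME = {
    "Melanoma metastasis": "Melanoma",
    "Melanoma NOS": "Melanoma",
    "Seborrheic keratosis": "Benign keratosis",
    "Solar lentigo": "Benign keratosis",
    "Pigmented benign keratosis": "Benign keratosis",
    "Squamous cell carcinoma, NOS": "Squamous cell carcinoma",
}


def harmonise_label(raw):
    """Map a raw diagnosis string to its unified RHB class name."""
    if raw is None or not raw.strip():
        return "Unknown"              # blank entries become their own class
    label = raw.strip()
    return RENAME.get(label, label)   # labels without a rule (e.g. "Scar") pass through


for raw in ["Melanoma NOS", "Solar lentigo", "Scar", "", None]:
    print(repr(raw), "->", harmonise_label(raw))
```

Keeping the rules in one dictionary makes it easy to audit which source classes were folded into each unified class.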
Dataset Preparation

After the dataset pre-processing described in the previous section, four datasets were obtained: the original HAM10000 dataset, the original BCN20000 dataset, the RHB dataset which combines the two original datasets, and the BalancedHB dataset. The RHB dataset is available at: https://huggingface.co/datasets/atufigwege/refined-dataset/tree/main.

The RHB dataset was further processed to obtain the BalancedHB dataset, to eliminate bias. The balance threshold was set at a minimum of 300 and a maximum of 500 images for each class of lesion. Classes with under 300 images (after duplicates removal) were removed from the dataset to maintain balance. This threshold was set to keep as many images as possible while maintaining a close balance among the classes. The BalancedHB dataset is available at: https://huggingface.co/datasets/atufigwege/balanced-dataset/tree/main.

The class distribution for all four datasets is provided in Fig. 3. As can be seen in Fig. 3, the two most prevalent classes amongst the four datasets are nevi (NV) and melanoma (MEL). This is significant because, as discussed in previous sections, many skin lesion classification systems are designed to focus on the classification between melanoma and non-melanoma lesions. Our proposed datasets, RHB and BalancedHB, were designed with the aim of making skin lesion classification systems more nuanced, focusing on classes that are often overlooked.

Training

The four datasets were used to fine-tune Google’s Gemma 3 4B model [36]. The steps below explain how the training (fine-tuning) and testing (inference) process was conducted.

1. A Hugging Face (HF) account was created, along with four model repositories, one for each dataset. These repositories were used to save the fine-tuned models.
2. Using a Python script, each dataset was split into train and test sets at a ratio of 80:20, stratified to preserve the lesion distribution. Corresponding train and test CSV metadata files were also created during this split.
3. Since Gemma 3 works with JSONL files, the training metadata files were converted from CSV to JSONL.
4. The train and test directories were zipped and uploaded to Google Drive together with the training metadata file.
5. A Python fine-tuning script was created in Google Colab using the official Gemma 3 guide (footnote 1).
6. After fine-tuning was finished, the fine-tuned model was uploaded to its respective HF repository for inference.
7. An inference script was then created by following the same guide and was used to test the saved fine-tuned model on the test set, and the prediction results were saved to a CSV file.

Scripts for duplicates removal, train-test split, JSONL conversion, fine-tuning, and inference are available at: https://github.com/atu88/gemma-lesion-classifier.

Results

Table 3 shows a summary of the accuracy and the F1 scores across the four datasets.

Table 3: Summary of evaluation results across the datasets

Dataset | Accuracy (%) | Macro F1-score | Weighted F1-score
HAM10000 | 67 | 0.45 | 0.70
BCN20000 | 59 | 0.36 | 0.59
RHB | 68 | 0.35 | 0.68
BalancedHB | 38 | 0.35 | 0.35

The HAM10000 dataset had the highest F1 scores (both macro and weighted), indicating a fair balance between the dominant and minority classes; however, it has lower accuracy than the RHB dataset, meaning less overall generalisation. Among the three unbalanced datasets, the BCN20000 dataset had the lowest accuracy and weighted F1 score, plus almost the same macro F1 score as the RHB dataset, making it worse than both the HAM10000 dataset and the RHB dataset. The RHB dataset has the highest accuracy, implying better generalisation; however, its macro F1 score is the lowest of these three datasets, indicating the weakest per-lesion performance. Overall, HAM10000 is the best performing dataset; however, it has the advantage of additional augmented images, which improves overall performance.
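Steps 2 and 3 of the Training procedure (the stratified 80:20 split and the CSV-to-JSONL conversion) could look roughly like the standard-library sketch below. It is not the authors' `train_test_split.py` or `jsonl_convert.py`: the field names `image` and `diagnosis`, the seed, and the per-class rounding are assumptions, and the actual JSONL record layout expected by the fine-tuning script is not specified in the text.

```python
import json
import random
from collections import defaultdict


def stratified_split(rows, label_key="diagnosis", test_ratio=0.2, seed=42):
    """Split rows into (train, test), applying the ratio within every class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    rng = random.Random(seed)             # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        n_test = round(len(items) * test_ratio)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def to_jsonl(rows):
    """Serialise one JSON object per line, as a JSONL file requires."""
    return "\n".join(json.dumps(row) for row in rows)


# Toy metadata rows standing in for a CSV file; not real dataset records.
rows = [{"image": f"nv_{i}.jpg", "diagnosis": "NV"} for i in range(10)]
rows += [{"image": f"mel_{i}.jpg", "diagnosis": "MEL"} for i in range(5)]

train, test = stratified_split(rows)
jsonl_text = to_jsonl(train)
print(len(train), len(test))
print(jsonl_text.splitlines()[0])
```

Splitting inside each class keeps rare lesions in both sets, although the exact per-class counts may differ slightly from those produced by a library implementation.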
On the other hand, the BalancedHB dataset eliminates the bias by balancing all the classes; however, it has the lowest performance of all the datasets. This is due to the small number of images left after balancing the classes.

To address class imbalance in the lesion classification and improve dataset performance without augmenting the images, a class-weighted loss function was applied to the RHB dataset (see Table 4). This method assigns higher penalties to minority classes during training based on their inverse frequency [6]. Class-weighted loss improved the model’s accuracy to 72% and its weighted F1 score to 0.71, outperforming the HAM10000 dataset. However, the macro F1 score did not significantly improve (0.38), implying the bias towards the dominant classes was still high. The evaluation script is available at: https://github.com/atu88/gemma-lesion-classifier.

Table 4: Classification results on the RHB dataset after applying the class-weighted loss function.

Class | Precision | Recall | F1-Score | Support
BCC | 0.49 | 0.94 | 0.65 | 326
BKL | 0.50 | 0.53 | 0.51 | 240
DF | 0.00 | 0.00 | 0.00 | 25
MEL | 0.53 | 0.37 | 0.44 | 308
NV | 0.92 | 0.87 | 0.90 | 1421
SCAR | 0.00 | 0.00 | 0.00 | 19
AKIEC | 0.34 | 0.20 | 0.25 | 81
SCC | 0.44 | 0.18 | 0.26 | 60
UNK | 0.53 | 0.36 | 0.43 | 102
Accuracy | - | - | 0.72 | 2582
Macro avg | 0.42 | 0.38 | 0.38 | 2582
Weighted avg | 0.72 | 0.72 | 0.71 | 2582

Discussion

The aim of this research was to find, curate and combine two skin lesion datasets to produce one diverse, balanced, and well-labelled dataset suitable for VLM training. Two datasets, HAM10000 and BCN20000, were used in this research. The HAM10000 dataset in its original form performed better than the BCN20000 dataset. This was likely due to the high-quality images in the HAM10000 dataset and the fact that it has been well pre-processed by removing artifacts and dark edges. Furthermore, the BCN20000 dataset has many duplicate images: only 29% of its images are unique, compared with 75% for HAM10000. This heavy augmentation amplifies noise in the BCN20000 dataset.
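To make the class-weighted loss and the averaged scores in Table 4 concrete, the sketch below derives inverse-frequency weights from the Table 2 counts and recomputes the macro and weighted F1 averages from the per-class rows of Table 4. The weighting formula `N / (K * n_c)` is one common convention and is an assumption here; the text only states that weights follow inverse class frequency, and in practice they would normally be computed from the training split rather than the full dataset.

```python
# Per-class image counts from Table 2 (full RHB dataset).
COUNTS = {"BCC": 1632, "BKL": 1198, "DF": 127, "MEL": 1538, "NV": 7107,
          "SCAR": 96, "AKIEC": 402, "SCC": 301, "UNK": 509}

# Per-class (F1-score, support) pairs from Table 4.
TABLE4 = {"BCC": (0.65, 326), "BKL": (0.51, 240), "DF": (0.00, 25),
          "MEL": (0.44, 308), "NV": (0.90, 1421), "SCAR": (0.00, 19),
          "AKIEC": (0.25, 81), "SCC": (0.26, 60), "UNK": (0.43, 102)}


def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c); an assumed convention, not the paper's exact formula."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}


def macro_f1(table):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in table.values()) / len(table)


def weighted_f1(table):
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(support for _, support in table.values())
    return sum(f1 * support for f1, support in table.values()) / total


weights = inverse_frequency_weights(COUNTS)
print({c: round(w, 2) for c, w in weights.items()})
print(round(macro_f1(TABLE4), 2), round(weighted_f1(TABLE4), 2))  # 0.38 0.71
```

The gap between the two averages comes from the nevus class: its 1,421 test images and F1 of 0.90 dominate the weighted average, while the two classes with an F1 of 0.00 pull the macro average down.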
The RHB dataset, which combines the two datasets, has higher accuracy than HAM10000 but lower F1 scores. This is most likely due to the large number of artifact images from the BCN20000 dataset. The poor performance of the BalancedHB dataset indicates that sample size matters in VLM training; however, the quality of images also matters, as shown by the poor performance of the BCN20000 dataset despite it having the largest sample size of 18,946 images.

Limitations

At the time of this research, a suitable skin lesion dataset with all the required metadata and enough variation in skin tones to include in the experiments for diversity could not be found. This is because skin lesions and skin cancer affect light-skinned people (Caucasians) more than dark-skinned people [37], with 20-30% of all lesions in Caucasians being cancerous compared with just 1-2% in Black people.

Future Work

Although we expect this work to be of value to the community, it is only a first step. More effort should be put into collecting more skin lesion images, especially for the minority lesions, to improve the dataset size and balance without relying on augmentation. One dataset with a strong representation of minority lesions like dermatofibroma (1247) and squamous cell carcinoma (1231) is the Asan dataset [18]; however, the process of obtaining such a dataset is rather long and not assured, as the dataset is not public. This might take a while; however, with the intention of integrating these models into real-world clinical systems, it is a necessary step to increase trust in and assurance of these models as well as to eliminate ethical concerns.

The BCN20000 dataset has many artifact-heavy images. Instead of just removing dark borders, these images can be further processed to remove the artifacts using techniques such as DullRazor [38] or SharpRazor [39].
Furthermore, the metadata used during the experiments did not have relevant diagnostic information which could be used to train the model on open-ended prompts to produce detailed diagnostic reports, which are more useful than just the classification of lesions. Looking for ways to improve these datasets with diagnostic details could improve the reliability of these models.

Declarations

Data Availability

The public repository at https://huggingface.co/datasets/atufigwege/refined-dataset contains the RHB dataset used for this paper. The dataset contains the following classes:

- Basal Cell Carcinoma (BCC)
- Benign Keratosis (BKL)
- Dermatofibroma (DF)
- Melanoma (MEL)
- Nevus (NV)
- Scars (SCAR)
- Actinic Keratosis (AKIEC)
- Squamous Cell Carcinoma (SCC)
- Unknown (UNK)

Code Availability

The repository at https://github.com/atu88/gemma-lesion-classifier contains the following Python scripts:

- Image quality filtering and deduplication (duplicates_removal.py)
- Metadata CSV to JSONL conversion (jsonl_convert.py)
- Train/test dataset split with image copying (train_test_split.py)
- Model fine-tuning script (gemma_finetuning.py)
- Model inference script (gemma_inference.py)
- Performance evaluation script (results_evaluation.py)

Contributions

A.M. generated the RHB and BalancedHB datasets and ran the experiments, T.W. drafted the initial version of this manuscript, and C.C. supervised this research. All authors reviewed and edited the manuscript.

Competing Interests

The authors declare no competing interests.

Acknowledgements

The authors would like to thank Ada Grecner for reviewing draft versions of this article.

Funding

The authors received no specific funding for this study.

Ethics Statement

Not applicable.

References

1. Skin Cancer Facts & Statistics. https://www.skincancer.org/skin-cancer-information/skin-cancer-facts/.
2. Natha, P. et al. Boosting skin cancer diagnosis accuracy with ensemble approach. Sci. Rep. 15, 1–25 (2025).
3. Brinker, T. J. et al. Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. Eur. J. Cancer 113, 47–54 (2019).
4. Pham, T. C., Luong, C. M., Visani, M. & Hoang, V. D. Deep CNN and Data Augmentation for Skin Lesion Classification. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10752 LNAI, 573–582 (2018).
5. Remya, S., Anjali, T. & Sugumaran, V. A Novel Transfer Learning Framework for Multimodal Skin Lesion Analysis. IEEE Access 12, 50738–50754 (2024).
6. Nguyen, V. D., Bui, N. D. & Do, H. K. Skin Lesion Classification on Imbalanced Data Using Deep Learning with Soft Attention. Sensors 22, 7530 (2022).
7. Wang, R. et al. A novel approach for melanoma detection utilizing GAN synthesis and vision transformer. Comput. Biol. Med. 176, 108572 (2024).
8. Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5, 1–9 (2018).
9. ISIC Challenge. https://challenge.isic-archive.com/data/#2019.
10. Li, H., Pan, Y., Zhao, J. & Zhang, L. Skin disease diagnosis with deep learning: A review. Neurocomputing 464, 364–393 (2021).
11. Sarker, M. M. K. et al. SLSNet: Skin lesion segmentation using a lightweight generative adversarial network. Expert Syst. Appl. 183, 115433 (2021).
12. Gilani, S. Q., Umair, M., Naqvi, M., Marques, O. & Kim, H. C. Adversarial Training Based Domain Adaptation of Skin Cancer Images. Life 14, 1009 (2024).
13. Hernández-Pérez, C. et al. BCN20000: Dermoscopic Lesions in the Wild. Scientific Data 11, 1–9 (2024).
14. Mendonca, T., Ferreira, P. M., Marques, J. S., Marcal, A. R. S. & Rozeira, J. PH2 - A dermoscopic image database for research and benchmarking. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS 5437–5440 (2013) doi:10.1109/EMBC.2013.6610779.
15. DERMOFIT. https://homepages.inf.ed.ac.uk/rbf/DERMOFIT/.
16. Giotis, I. et al. MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Syst. Appl. 42, 6578–6585 (2015).
17. 7-point criteria evaluation Database. https://derm.cs.sfu.ca/Welcome.html.
18. Goyal, M., Knackstedt, T., Yan, S. & Hassanpour, S. Artificial intelligence-based image classification methods for diagnosis of skin cancer: Challenges and opportunities. Comput. Biol. Med. 127, 104065 (2020).
19. Bissoto, A., Fornaciali, M., Valle, E. & Avila, S. (De)Constructing Bias on Skin Lesion Datasets. 0–0 Preprint at http://www.cancer.net/cancer-types/melanoma/ (2019).
20. Singh, L., Janghel, R. R. & Sahu, S. P. An Empirical Review on Evaluating the Impact of Image Segmentation on the Classification Performance for Skin Lesion Detection. IETE Technical Review (Institution of Electronics and Telecommunication Engineers, India) 40, 190–201 (2023).
21. What are Vision-Language Models? | NVIDIA Glossary. https://www.nvidia.com/en-us/glossary/vision-language-models/.
22. Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. 8748–8763 Preprint at https://proceedings.mlr.press/v139/radford21a.html (2021).
23. Van, M. H., Verma, P. & Wu, X. On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study. Proceedings - 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies, CHASE 2024 172–176 (2024) doi:10.1109/CHASE60773.2024.00029.
24. Zhou, J. et al. SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model. https://arxiv.org/pdf/2304.10691 (2023).
25. Zhu, D., Chen, J., Shen, X., Li, X. & Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. 12th International Conference on Learning Representations, ICLR 2024 https://arxiv.org/pdf/2304.10592 (2023).
26. Cieplicka, P., Kłos, J. & Morawski, M. VisionQAries at MEDIQA-MAGIC 2024: Small Vision Language Models for Dermatological Diagnosis Notebook for the ImageCLEF Lab at CLEF 2024. https://github.com/julklos (2024).
27. vikhyatk/moondream2 · Hugging Face. https://huggingface.co/vikhyatk/moondream2.
28. Zhou, B. et al. TinyLLaVA: A Framework of Small-scale Large Multimodal Models. https://arxiv.org/pdf/2402.14289 (2024).
29. Cirone, K., Akrout, M., Abid, L. & Oakley, A. Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones. JMIR Dermatol. 7, e55508 (2024).
30. Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual Instruction Tuning. Adv. Neural Inf. Process. Syst. 36, (2023).
31. Hasan, M. K., Elahi, M. T. E., Alam, M. A., Jawad, M. T. & Martí, R. DermoExpert: Skin lesion classification using a hybrid convolutional neural network through segmentation, transfer learning, and augmentation. Inform. Med. Unlocked 28, 100819 (2022).
32. Udrea, A. & Mitra, G. D. Generative Adversarial Neural Networks for Pigmented and Non-Pigmented Skin Lesions Detection in Clinical Images. Proceedings - 2017 21st International Conference on Control Systems and Computer, CSCS 2017 364–368 (2017) doi:10.1109/CSCS.2017.56.
33. Karras, T. et al. Training Generative Adversarial Networks with Limited Data. Adv. Neural Inf. Process. Syst. 2020-December, (2020).
34. ISIC Challenge. https://challenge.isic-archive.com/data/.
35. OpenRefine. https://openrefine.org/.
36. Team, G. et al. Gemma 3 Technical Report. https://arxiv.org/pdf/2503.19786 (2025).
37. Gloster, H. M. & Neal, K. Skin cancer in skin of color. J. Am. Acad. Dermatol. 55, 741–760 (2006).
38. Lee, T., Ng, V., Gallagher, R., Coldman, A. & McLean, D. Dullrazor®: A software approach to hair removal from images. Comput. Biol. Med. 27, 533–543 (1997).
39. Kasmi, R. et al.
SharpRazor: Automatic removal of hair and ruler marks from dermoscopy images. Skin Research and Technology 29 , e13203 (2023). Footnotes https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora Additional Declarations No competing interests reported. Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8871094","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":596258547,"identity":"159ac884-4264-4571-9e38-f5768a0574ad","order_by":0,"name":"Atufigwege Mwakatapanya","email":"","orcid":"","institution":"Heriot-Watt University","correspondingAuthor":false,"prefix":"","firstName":"Atufigwege","middleName":"","lastName":"Mwakatapanya","suffix":""},{"id":596258548,"identity":"b8d7ea88-4861-45a7-be5a-347cb5982f11","order_by":1,"name":"Tess 
The RHB dataset contains no duplicate or low-quality images, the labels are consistent and well structured, and it is publicly available. It contains more images and more classes than the HAM10000 dataset, making it suitable for a wider range of dermatological conditions. Its images are of higher quality than those of the BCN20000 dataset, making it more reliable for machine learning applications.\u003c/p\u003e"},{"header":"Experimental Setup","content":"\u003cp\u003eThis section outlines the dataset preparation process for model fine-tuning, and the fine-tuning environment (the model and hardware requirements).\u003c/p\u003e\n\u003ch2\u003eDataset Preparation\u003c/h2\u003e\n\u003cp\u003eAfter the dataset pre-processing described in the previous section, four datasets were obtained: the original HAM10000 dataset, the original BCN20000 dataset, the BalancedHB dataset, and the RHB dataset, which combines the two original datasets. The RHB dataset is available at: https://huggingface.co/datasets/atufigwege/refined-dataset/tree/main.\u003c/p\u003e\n\u003cp\u003eThe RHB dataset was further processed to obtain the BalancedHB dataset, to eliminate bias. The balance thresholds were set at a minimum of 300 and a maximum of 500 images for each class of lesion. Classes with fewer than 300 images (after duplicate removal) were removed from the dataset to maintain balance. These thresholds were chosen to keep as many images as possible while maintaining a close balance among the classes. The BalancedHB dataset is available at: https://huggingface.co/datasets/atufigwege/balanced-dataset/tree/main. The class distribution for all four datasets is provided in Fig. 3.\u003c/p\u003e\n\u003cp\u003eAs can be seen in Fig. 3, the two most prevalent classes amongst the four datasets are nevi (NV) and melanoma (MEL). 
This is significant because, as discussed in previous sections, many skin lesion classification systems are designed to focus on the classification between melanoma and non-melanoma lesions. Our proposed datasets, RHB and BalancedHB, were designed with the aim of making skin lesion classification systems more nuanced, focusing on classes that are often overlooked.\u003c/p\u003e\n\u003ch2\u003eTraining\u003c/h2\u003e\n\u003cp\u003eThe four datasets were used to fine-tune Google\u0026rsquo;s Gemma 3 4B model\u003csup\u003e36\u003c/sup\u003e. The steps below explain how the training (fine-tuning) and testing (inference) process was conducted.\u003c/p\u003e\n\u003col\u003e\n \u003cli\u003eA Hugging Face (HF) account was created, along with four model repositories, one for each dataset. These repositories were used to save the fine-tuned models.\u003c/li\u003e\n \u003cli\u003eUsing a Python script, each dataset was split into train and test sets at a ratio of 80:20, stratified by lesion class so that both sets share the same lesion distribution. Corresponding train and test CSV metadata files were also created during this split. 
Since Gemma 3 works with JSONL files, the training metadata files were converted from CSV to JSONL.\u003c/li\u003e\n \u003cli\u003eThe train and test directories were zipped and uploaded to Google Drive together with the training metadata file.\u003c/li\u003e\n \u003cli\u003eA Python fine-tuning script was created in Google Colab using the official Gemma 3 guide\u003csup\u003e1\u003c/sup\u003e.\u003c/li\u003e\n \u003cli\u003eAfter fine-tuning was finished, the fine-tuned model was uploaded to its respective HF repository for inference.\u003c/li\u003e\n \u003cli\u003eAn inference script was then created by following the same guide and was used to test the saved fine-tuned model using the test set, and the prediction results were saved to a CSV file.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eScripts for duplicates removal, train-test split, JSONL conversion, fine-tuning, and inference are available at: https://github.com/atu88/gemma-lesion-classifier.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eTable 3 shows a summary of the accuracy and the F1 scores across the four datasets.\u003c/p\u003e\n\u003cp\u003eTable 3: Summary of evaluation results across the datasets\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDataset\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMacro F1-score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eWeighted F1-score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 
25%;\"\u003e\n \u003cp\u003eHAM10000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.45\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.70\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003eBCN20000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e0.36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e0.59\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003eRHB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e68\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e0.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e0.68\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003eBalancedHB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e0.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25%;\"\u003e\n \u003cp\u003e0.35\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eThe HAM10000 dataset had the highest F1 scores 
(both macro and weighted), indicating a fair balance between the dominant and minority classes; however, it had lower accuracy than the RHB dataset, suggesting less overall generalisation. The BCN20000 dataset had the lowest accuracy (59%) and weighted F1 score (0.59) of the three unbalanced datasets, and almost the same macro F1 score as the RHB dataset (0.36 versus 0.35), making it worse than both the HAM10000 dataset and the RHB dataset. The RHB dataset had the highest accuracy (68%), implying better generalisation; however, its macro F1 score of 0.35 was the joint lowest (shared with BalancedHB), indicating the weakest per-lesion performance. Overall, HAM10000 is the best performing dataset; however, it has the advantage of additional augmented images, which improves overall performance. On the other hand, the BalancedHB dataset eliminates the bias by balancing all the classes, yet it has the lowest overall performance of all the datasets (38% accuracy). This is due to the small number of images left after balancing the classes.\u003c/p\u003e\n\u003cp\u003eTo address class imbalance in the lesion classification and improve dataset performance without augmenting the images, a class-weighted loss function was applied to the RHB dataset (see Table 4). This method assigns higher penalties to errors on minority classes during training, with weights based on inverse class frequency\u003csup\u003e6\u003c/sup\u003e. Class-weighted loss improved the model\u0026rsquo;s accuracy from 68% to 72% and its weighted F1 score from 0.68 to 0.71, outperforming the HAM10000 dataset on both measures. However, the macro F1 score improved only slightly (from 0.35 to 0.38), implying the bias towards the dominant classes was still high. 
The evaluation script is available at: https://github.com/atu88/gemma-lesion-classifier.\u003c/p\u003e\n\u003cp\u003eTable 4: Classification results on the RHB dataset after applying the class-weighted loss function.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eClass\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePrecision\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eRecall\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eF1-Score\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSupport\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eBCC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.49\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e326\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eBKL\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.53\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e240\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eDF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e25\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eMEL\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.37\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e308\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eNV\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.92\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e1421\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n 
\u003cp\u003eSCAR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e19\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eAKIEC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e81\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eSCC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e60\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003eUNK\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
20%;\"\u003e\n \u003cp\u003e0.43\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e102\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e2582\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMacro avg\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e2582\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eWeighted avg\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e0.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 20%;\"\u003e\n \u003cp\u003e2582\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe aim of this research was to find, curate and combine two skin lesion datasets to produce one diverse, balanced, and well-labelled dataset suitable for VLM training. Two datasets, HAM10000 and BCN20000, were used in this research. The HAM10000 dataset in its original form performed better than the BCN20000 dataset. This was likely due to the high-quality images in the HAM10000 dataset and the fact that it has been well pre-processed by removing artifacts and dark edges. Furthermore, the BCN20000 dataset has many duplicate images: only 29% of BCN20000 images are unique, compared with 75% of HAM10000 images. This heavy augmentation amplifies noise in the BCN20000 dataset. The RHB dataset, which combines the two datasets, has higher accuracy than HAM10000 but lower F1 scores. This is most likely due to the large number of artifact images from the BCN20000 dataset. 
The poor performance of the BalancedHB dataset indicates that sample size matters in VLM training; however, the quality of images also matters, as shown by the poor performance of the BCN20000 dataset despite it having the largest sample size of 18,946 images.\u003c/p\u003e\n\u003ch2\u003eLimitations\u003c/h2\u003e\n\u003cp\u003eAt the time of this research, a suitable skin lesion dataset with all the required metadata and enough variation in skin tones to add diversity to the experiments could not be found. This is because skin cancer affects light-skinned people (Caucasians) more than dark-skinned people\u003csup\u003e37\u003c/sup\u003e, accounting for 20-30% of all neoplasms in Caucasians but just 1-2% in Black people.\u003c/p\u003e\n\u003ch2\u003eFuture Work\u003c/h2\u003e\n\u003cp\u003eAlthough we expect this work to be valuable to the community, it is only a first step. More effort should be put into collecting skin lesion images, especially for the minority lesions, to improve the dataset size and balance without relying on augmentation. One dataset with a strong representation of minority lesions such as dermatofibroma (1247) and squamous cell carcinoma (1231) is the Asan dataset\u003csup\u003e18\u003c/sup\u003e; however, the process of obtaining it is long and not assured, as the dataset has not been made public. This might take a while; however, given the intention of integrating these models into real-world clinical systems, it is a necessary step to increase trust in these models and to address ethical concerns.\u003c/p\u003e\n\u003cp\u003eThe BCN20000 dataset has many artifact-heavy images. Instead of just removing dark borders, these images can be further processed to remove the artifacts using techniques such as DullRazor\u003csup\u003e38\u003c/sup\u003e or SharpRazor\u003csup\u003e39\u003c/sup\u003e. 
Furthermore, the metadata used during the experiments did not have relevant diagnostic information which could be used to train the model on open-ended prompts to produce detailed diagnostic reports which are more useful than just the classification of lesions. Looking for ways to improve these datasets with diagnostic details could improve the reliability of these models.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eData Availability\u003c/h2\u003e\n\u003cp\u003eThe public repository at https://huggingface.co/datasets/atufigwege/refined-dataset contains the RHB dataset used for this paper. The dataset contains the following classes:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eBasal Cell Carcinoma (BCC)\u003c/li\u003e\n \u003cli\u003eBenign Keratosis (BKL)\u003c/li\u003e\n \u003cli\u003eDermatofibroma (DF)\u003c/li\u003e\n \u003cli\u003eMelanoma (MEL)\u003c/li\u003e\n \u003cli\u003eNevus (NV)\u003c/li\u003e\n \u003cli\u003eScars (SCAR)\u003c/li\u003e\n \u003cli\u003eActinic Keratosis (AKIEC)\u003c/li\u003e\n \u003cli\u003eSquamous Cell Carcinoma (SCC)\u003c/li\u003e\n \u003cli\u003eUnknown (UNK)\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2\u003eCode Availability\u003c/h2\u003e\n\u003cp\u003eThe repository at https://github.com/atu88/gemma-lesion-classifier contains the following Python scripts:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eImage quality filtering and deduplication (duplicates_removal.py)\u003c/li\u003e\n \u003cli\u003eMetadata CSV to JSONL conversion (jsonl_convert.py)\u003c/li\u003e\n \u003cli\u003eTrain/test dataset split with image copying (train_test_split.py)\u003c/li\u003e\n \u003cli\u003eModel fine-tuning script (gemma_finetuning.py)\u003c/li\u003e\n \u003cli\u003eModel inference script (gemma_inference.py)\u003c/li\u003e\n \u003cli\u003ePerformance evaluation script (results_evaluation.py)\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2\u003eContributions\u003c/h2\u003e\n\u003cp\u003eA.M. 
generated the RHB and BalancedHB datasets and ran the experiments, T.W. drafted the initial version of this manuscript, and C.C. supervised this research. All authors reviewed and edited the manuscript.\u003c/p\u003e\n\u003ch2\u003eCompeting Interests\u003c/h2\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003ch2\u003eAcknowledgements\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eThe authors would like to thank Ada Grecner for reviewing draft versions of this article.\u003c/p\u003e\n\u003ch2\u003eFunding\u003c/h2\u003e\n\u003cp\u003eThe authors received no specific funding for this study.\u003c/p\u003e\n\u003ch2\u003eEthics Statement\u003c/h2\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eSkin Cancer Facts \u0026amp; Statistics. https://www.skincancer.org/skin-cancer-information/skin-cancer-facts/.\u003c/li\u003e\n\u003cli\u003eNatha, P. \u003cem\u003eet al.\u003c/em\u003e Boosting skin cancer diagnosis accuracy with ensemble approach. \u003cem\u003eSci. Rep.\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 1\u0026ndash;25 (2025).\u003c/li\u003e\n\u003cli\u003eBrinker, T. J. \u003cem\u003eet al.\u003c/em\u003e Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. \u003cem\u003eEur. J. Cancer\u003c/em\u003e \u003cstrong\u003e113\u003c/strong\u003e, 47\u0026ndash;54 (2019).\u003c/li\u003e\n\u003cli\u003ePham, T. C., Luong, C. M., Visani, M. \u0026amp; Hoang, V. D. Deep CNN and Data Augmentation for Skin Lesion Classification. \u003cem\u003eLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)\u003c/em\u003e \u003cstrong\u003e10752 LNAI\u003c/strong\u003e, 573\u0026ndash;582 (2018).\u003c/li\u003e\n\u003cli\u003eRemya, S., Anjali, T. \u0026amp; Sugumaran, V. 
A Novel Transfer Learning Framework for Multimodal Skin Lesion Analysis. \u003cem\u003eIEEE Access\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 50738\u0026ndash;50754 (2024).\u003c/li\u003e\n\u003cli\u003eNguyen, V. D., Bui, N. D. \u0026amp; Do, H. K. Skin Lesion Classification on Imbalanced Data Using Deep Learning with Soft Attention. \u003cem\u003eSensors 2022, Vol. 22, Page 7530\u003c/em\u003e \u003cstrong\u003e22\u003c/strong\u003e, 7530 (2022).\u003c/li\u003e\n\u003cli\u003eWang, R. \u003cem\u003eet al.\u003c/em\u003e A novel approach for melanoma detection utilizing GAN synthesis and vision transformer. \u003cem\u003eComput. Biol. Med.\u003c/em\u003e \u003cstrong\u003e176\u003c/strong\u003e, 108572 (2024).\u003c/li\u003e\n\u003cli\u003eTschandl, P., Rosendahl, C. \u0026amp; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. \u003cem\u003eScientific Data 2018 5:1\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 1\u0026ndash;9 (2018).\u003c/li\u003e\n\u003cli\u003eISIC Challenge. https://challenge.isic-archive.com/data/#2019.\u003c/li\u003e\n\u003cli\u003eLi, H., Pan, Y., Zhao, J. \u0026amp; Zhang, L. Skin disease diagnosis with deep learning: A review. \u003cem\u003eNeurocomputing\u003c/em\u003e \u003cstrong\u003e464\u003c/strong\u003e, 364\u0026ndash;393 (2021).\u003c/li\u003e\n\u003cli\u003eSarker, M. M. K. \u003cem\u003eet al.\u003c/em\u003e SLSNet: Skin lesion segmentation using a lightweight generative adversarial network. \u003cem\u003eExpert Syst. Appl.\u003c/em\u003e \u003cstrong\u003e183\u003c/strong\u003e, 115433 (2021).\u003c/li\u003e\n\u003cli\u003eGilani, S. Q., Umair, M., Naqvi, M., Marques, O. \u0026amp; Kim, H. C. Adversarial Training Based Domain Adaptation of Skin Cancer Images. \u003cem\u003eLife 2024, Vol. 
14, Page 1009\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 1009 (2024).\u003c/li\u003e\n\u003cli\u003eHern\u0026aacute;ndez-P\u0026eacute;rez, C. \u003cem\u003eet al.\u003c/em\u003e BCN20000: Dermoscopic Lesions in the Wild. \u003cem\u003eScientific Data 2024 11:1\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 1\u0026ndash;9 (2024).\u003c/li\u003e\n\u003cli\u003eMendonca, T., Ferreira, P. M., Marques, J. S., Marcal, A. R. S. \u0026amp; Rozeira, J. PH2 - A dermoscopic image database for research and benchmarking. \u003cem\u003eProceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS\u003c/em\u003e 5437\u0026ndash;5440 (2013) doi:10.1109/EMBC.2013.6610779.\u003c/li\u003e\n\u003cli\u003eDERMOFIT. https://homepages.inf.ed.ac.uk/rbf/DERMOFIT/.\u003c/li\u003e\n\u003cli\u003eGiotis, I. \u003cem\u003eet al.\u003c/em\u003e MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images. \u003cem\u003eExpert Syst. Appl.\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, 6578\u0026ndash;6585 (2015).\u003c/li\u003e\n\u003cli\u003e7-point criteria evaluation Database. https://derm.cs.sfu.ca/Welcome.html.\u003c/li\u003e\n\u003cli\u003eGoyal, M., Knackstedt, T., Yan, S. \u0026amp; Hassanpour, S. Artificial intelligence-based image classification methods for diagnosis of skin cancer: Challenges and opportunities. \u003cem\u003eComput. Biol. Med.\u003c/em\u003e \u003cstrong\u003e127\u003c/strong\u003e, 104065 (2020).\u003c/li\u003e\n\u003cli\u003eBissoto, A., Fornaciali, M., Valle, E. \u0026amp; Avila, S. (De)Constructing Bias on Skin Lesion Datasets. 0\u0026ndash;0 Preprint at http://www.cancer.net/cancer-types/melanoma/ (2019).\u003c/li\u003e\n\u003cli\u003eSingh, L., Janghel, R. R. \u0026amp; Sahu, S. P. An Empirical Review on Evaluating the Impact of Image Segmentation on the Classification Performance for Skin Lesion Detection. 
\u003cem\u003eIETE Technical Review (Institution of Electronics and Telecommunication Engineers, India)\u003c/em\u003e \u003cstrong\u003e40\u003c/strong\u003e, 190\u0026ndash;201 (2023).\u003c/li\u003e\n\u003cli\u003eWhat are Vision-Language Models? | NVIDIA Glossary. https://www.nvidia.com/en-us/glossary/vision-language-models/.\u003c/li\u003e\n\u003cli\u003eRadford, A. \u003cem\u003eet al.\u003c/em\u003e Learning Transferable Visual Models From Natural Language Supervision. 8748\u0026ndash;8763 Preprint at https://proceedings.mlr.press/v139/radford21a.html (2021).\u003c/li\u003e\n\u003cli\u003eVan, M. H., Verma, P. \u0026amp; Wu, X. On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study. \u003cem\u003eProceedings - 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies, CHASE 2024\u003c/em\u003e 172\u0026ndash;176 (2024) doi:10.1109/CHASE60773.2024.00029.\u003c/li\u003e\n\u003cli\u003eZhou, J. \u003cem\u003eet al.\u003c/em\u003e SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model. https://arxiv.org/pdf/2304.10691 (2023).\u003c/li\u003e\n\u003cli\u003eZhu, D., Chen, J., Shen, X., Li, X. \u0026amp; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. \u003cem\u003e12th International Conference on Learning Representations, ICLR 2024\u003c/em\u003e https://arxiv.org/pdf/2304.10592 (2023).\u003c/li\u003e\n\u003cli\u003eCieplicka, P., Kłos, J. \u0026amp; Morawski, M. VisionQAries at MEDIQA-MAGIC 2024: Small Vision Language Models for Dermatological Diagnosis Notebook for the ImageCLEF Lab at CLEF 2024. https://github.com/julklos (2024).\u003c/li\u003e\n\u003cli\u003evikhyatk/moondream2 \u0026middot; Hugging Face. https://huggingface.co/vikhyatk/moondream2.\u003c/li\u003e\n\u003cli\u003eZhou, B. \u003cem\u003eet al.\u003c/em\u003e TinyLLaVA: A Framework of Small-scale Large Multimodal Models. 
https://arxiv.org/pdf/2402.14289 (2024).\u003c/li\u003e\n\u003cli\u003eCirone, K., Akrout, M., Abid, L. \u0026amp; Oakley, A. Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones. \u003cem\u003eJMIR Dermatol.\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, e55508 (2024).\u003c/li\u003e\n\u003cli\u003eLiu, H., Li, C., Wu, Q. \u0026amp; Lee, Y. J. Visual Instruction Tuning. \u003cem\u003eAdv. Neural Inf. Process. Syst.\u003c/em\u003e \u003cstrong\u003e36\u003c/strong\u003e, (2023).\u003c/li\u003e\n\u003cli\u003eHasan, M. K., Elahi, M. T. E., Alam, M. A., Jawad, M. T. \u0026amp; Mart\u0026iacute;, R. DermoExpert: Skin lesion classification using a hybrid convolutional neural network through segmentation, transfer learning, and augmentation. \u003cem\u003eInform. Med. Unlocked\u003c/em\u003e \u003cstrong\u003e28\u003c/strong\u003e, 100819 (2022).\u003c/li\u003e\n\u003cli\u003eUdrea, A. \u0026amp; Mitra, G. D. Generative Adversarial Neural Networks for Pigmented and Non-Pigmented Skin Lesions Detection in Clinical Images. \u003cem\u003eProceedings - 2017 21st International Conference on Control Systems and Computer, CSCS 2017\u003c/em\u003e 364\u0026ndash;368 (2017) doi:10.1109/CSCS.2017.56.\u003c/li\u003e\n\u003cli\u003eKarras, T. \u003cem\u003eet al.\u003c/em\u003e Training Generative Adversarial Networks with Limited Data. \u003cem\u003eAdv. Neural Inf. Process. Syst.\u003c/em\u003e \u003cstrong\u003e2020-December\u003c/strong\u003e, (2020).\u003c/li\u003e\n\u003cli\u003eISIC Challenge. https://challenge.isic-archive.com/data/.\u003c/li\u003e\n\u003cli\u003eOpenRefine. https://openrefine.org/.\u003c/li\u003e\n\u003cli\u003eTeam, G. \u003cem\u003eet al.\u003c/em\u003e Gemma 3 Technical Report. https://arxiv.org/pdf/2503.19786 (2025).\u003c/li\u003e\n\u003cli\u003eGloster, H. M. \u0026amp; Neal, K. Skin cancer in skin of color. \u003cem\u003eJ. Am. 
Acad. Dermatol.\u003c/em\u003e \u003cstrong\u003e55\u003c/strong\u003e, 741\u0026ndash;760 (2006).\u003c/li\u003e\n\u003cli\u003eLee, T., Ng, V., Gallagher, R., Coldman, A. \u0026amp; McLean, D. Dullrazor\u0026reg;: A software approach to hair removal from images. \u003cem\u003eComput. Biol. Med.\u003c/em\u003e \u003cstrong\u003e27\u003c/strong\u003e, 533\u0026ndash;543 (1997).\u003c/li\u003e\n\u003cli\u003eKasmi, R. \u003cem\u003eet al.\u003c/em\u003e SharpRazor: Automatic removal of hair and ruler marks from dermoscopy images. \u003cem\u003eSkin Research and Technology\u003c/em\u003e \u003cstrong\u003e29\u003c/strong\u003e, e13203 (2023).\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Footnotes","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora\u003c/span\u003e\u003cspan address=\"https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8871094/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8871094/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eSkin cancer diagnosis using machine learning faces significant challenges, primarily due to the lack of well-labelled and balanced skin lesion datasets. Most available datasets are limited to only two lesion types, melanoma and nevi, or they exhibit severe class and skin tone imbalance, leading to biases during model training. Furthermore, vision-language models (VLMs) face an additional challenge when trained on these datasets, which also lack the semantic labelling needed for effective training. To address these challenges, several researchers have used generative adversarial networks (GANs) to generate realistic synthetic images. While this can improve diagnostic accuracy, it raises ethical and trust concerns, especially in clinical settings. Moreover, applying GANs to imbalanced datasets amplifies the existing biases.
This paper proposes an alternative approach: curating and combining two existing public datasets, HAM10000 and BCN20000, into a single well-labelled dataset called RHB, optimised for training Google\u0026rsquo;s Gemma 3 4B model.\u003c/p\u003e","manuscriptTitle":"Improving the Quality of Skin Lesion Data for Training Vision-Language Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-27 12:01:47","doi":"10.21203/rs.3.rs-8871094/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"9c84b973-a3c7-4f25-8e45-87465fc372ff","owner":[],"postedDate":"February 27th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-02-27T12:01:48+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-27 12:01:47","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8871094","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8871094","identity":"rs-8871094","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}