RobustOVS: Open-Vocabulary Segmentation with Robustly Semantic-Assisted Calibration

doi:10.21203/rs.3.rs-6850046/v1

2025 · doi:10.21203/rs.3.rs-6850046/v1


Research Article

Ruihan Wang, Guodong Wang, Mingtao Liu

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-6850046/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 10 Mar, 2026 in Multimedia Systems. Version 1 posted 11

Abstract

Open-vocabulary semantic segmentation has emerged as a transformative approach in the field of image segmentation. Open-vocabulary segmentation (OVS) models leverage pre-trained vision-language models, such as CLIP, to classify mask regions. However, these models face performance limitations when aligning visual content with the unbounded semantics of text.
To address this challenge, we propose the Robust Open-Vocabulary Segmentation Model (RobustOVS), which not only preserves CLIP's generalization capabilities but also enhances computational efficiency. Training such models typically demands computational resources that are beyond the reach of most research labs. RobustOVS tackles this limitation by employing a streamlined and efficient network architecture, significantly reducing training requirements. The additional parameters of RobustOVS can be trained and fine-tuned on a single GPU within 50 hours, demonstrating its feasibility and practicality for standard research environments. In RobustOVS, we introduce a high-performance multi-scale feature pyramid network that effectively extracts semantically rich features through a combination of deformable convolutions and context-based self-modulation. This enables robust matching between masked image regions and nouns in image captions. Experiments reveal that mask prompt fine-tuning yields substantial improvements without modifying any weights of the CLIP model, while further boosting the performance of fully fine-tuned models. Notably, we benchmarked the RobustOVS architecture across several popular open-vocabulary semantic segmentation datasets. RobustOVS consistently delivered outstanding performance on all tasks and datasets, surpassing task-specific architectures while requiring even fewer computational resources.

Keywords: Open-vocabulary semantic segmentation, Vision-language models, Multi-scale feature pyramid network

1. Introduction

Semantic segmentation is a fundamental task in the field of computer vision, aiming to assign each pixel in an input image to a specific semantic category. This task requires models to not only identify the categories of objects in an image but also segment the boundaries of these objects with precision.
Despite significant advancements in this area in recent years [32], [33], [34], [35], [36], [37], [38], [39], [1], [2], [3], [4], [5], traditional semantic segmentation models are generally trained on predefined categories. When encountering new, unseen categories during inference, these models often struggle to adapt. To address this challenge, researchers have begun exploring open-vocabulary semantic segmentation (OVS) [40], [41], [42], [43], [44], [45], [46], [47]. Unlike conventional models, OVS systems often handle thousands of categories, using textual input as guidance to segment arbitrary objects.

The vision-language model CLIP [6] learns rich multimodal features from billions of image-text pairs. CLIP's key innovation lies in its robust zero-shot learning capability. While traditional computer vision models require task-specific supervised training, CLIP handles diverse downstream tasks without targeted training by learning from large-scale image-text pairs. For instance, CLIP can classify images directly based on natural language descriptions without the need for annotated data for each category. However, recognizing unseen categories accurately without external knowledge remains challenging.

Early research proposed leveraging pre-trained vision-language models for OVS [7], [8], [9], [10]. Applying CLIP to OVS poses challenges, as CLIP is trained via contrastive learning at the image level. Consequently, pre-trained CLIP struggles to achieve satisfactory classification results on masked images because it lacks the pixel-level recognition capabilities required for semantic segmentation. Two-stage approaches have shown promise [40], [41], [48], [49]: they first generate category-agnostic mask proposals and then perform open-vocabulary classification using pre-trained CLIP.
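The second stage of the two-stage approach described above can be illustrated with a minimal sketch: each masked region is embedded, compared against text embeddings of the candidate class names by cosine similarity, and the similarities are turned into class probabilities. The embeddings below are toy values standing in for encoder outputs, and the function names and the temperature of 0.01 are assumptions made for illustration, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_regions(region_embeds, text_embeds, temperature=0.01):
    """Return, for each masked region, a softmax distribution over class prompts."""
    all_probs = []
    for region in region_embeds:
        logits = [cosine(region, text) / temperature for text in text_embeds]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        all_probs.append([e / total for e in exps])
    return all_probs

# Toy embeddings standing in for encoder outputs (hypothetical values).
class_names = ["cat", "sky", "road"]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_embeds = [
    [0.1, 0.9, 0.2],  # proposal 0: closest to "sky"
    [0.8, 0.1, 0.3],  # proposal 1: closest to "cat"
]

probs = classify_regions(region_embeds, text_embeds)
labels = [class_names[max(range(len(p)), key=p.__getitem__)] for p in probs]
```

Because the class list is supplied as text at inference time, adding a new category only requires adding one more text embedding; no segmentation weights change.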
While this approach has achieved progress, aligning visual content with unrestricted textual input remains suboptimal and requires substantial computational resources during training. Our analysis identifies two key limitations:

(1) The regions recognized by CLIP for mask classification often do not overlap with the actual mask regions, indicating domain discrepancies in CLIP's pre-trained visual inputs.
(2) The proposal embeddings in segmentation models are tuned for the training semantic space, making the model insensitive to new vocabulary.

These domain biases in pre-trained CLIP not only hinder the alignment of segmentation results with textual descriptions but also waste computational resources, making it difficult for standard research labs to meet the computational demands of training such models. To address these challenges, we propose the Robust Open-Vocabulary Segmentation Model (RobustOVS), a network architecture that mitigates domain biases in pre-trained CLIP models and aligns model performance with unrestricted textual semantics beyond predefined vocabularies.

In summary, our contributions include:

(1) We propose RobustOVS, a model designed to reduce performance constraints when aligning visual content with unlimited textual semantics, thus expanding the semantic space.
(2) We introduce a semantic integration module, which embeds global semantic awareness from the original CLIP into the proposal embeddings of two distinct semantic segmentation modules, enhancing OVS performance.
(3) We develop a novel two-stage OVS framework that processes low-resolution and high-resolution images simultaneously using multiple advanced semantic segmentation models. This reduces computational overhead significantly without sacrificing performance.
(4) RobustOVS achieves new state-of-the-art results across popular OVS benchmarks, including ADE20K-847 [50] and Pascal Context-459 [51].
By integrating the global semantics of pre-trained CLIP into segmentation proposals and employing an efficient cross-attention mechanism combined with prototype selection strategies, RobustOVS effectively reduces computational demands while alleviating domain biases. This innovation makes OVS more accessible for standard research environments while achieving superior performance.

2. Related Works

Vision-language pretraining [63], [41], [46] has emerged as a key research focus at the intersection of computer vision and natural language processing. Pretrained vision-language models such as CLIP [6] leverage contrastive learning to associate images with textual descriptions, demonstrating exceptional cross-modal alignment capabilities. Beyond excelling in image classification tasks, CLIP has been widely applied in diverse domains, including image generation [11], object detection [12], [13], and image segmentation [7], [14], [8], [15], [9], [16], [17], [10]. RegionCLIP [13] further extends CLIP's application to object detection by fine-tuning on region proposals, significantly enhancing detection performance. The OVSeg approach introduces a mask prompt tuning strategy that allows CLIP's weights to be shared across multi-task environments without full fine-tuning.

Open-vocabulary segmentation (OVS) aims to understand images for arbitrary categories described by text. Early methods like ZS3Net [18] and SPNet [19] employed word embeddings to align visual and semantic features, while GroupViT [20] used text supervision to group image segmentation masks. With the advent of CLIP, methods such as LSeg [9] and OpenSeg [8] have harnessed CLIP's text encoder to further improve segmentation tasks. These approaches align textual embeddings with pixel-level or segment-level visual features, enhancing segmentation accuracy.
Recent two-stage open-vocabulary segmentation methods, such as ZSSeg [10] and ZegFormer [7], generate class-agnostic mask proposals and leverage CLIP's pretrained model for open-vocabulary classification, achieving notable progress. However, their performance on occluded images remains limited. Our proposed mask prompt tuning strategy utilizes blank regions in occluded images to improve segmentation performance without modifying CLIP's weights.

Prompt tuning, an emerging adaptation technique for large-scale pretrained models, was initially applied in natural language processing [21], [22], [23] and has since been extended to the vision domain. In vision tasks, CoOp [24] adapts CLIP by adding learnable vectors before class tokens, improving performance in visual recognition tasks. Our mask prompt tuning strategy extends this concept by focusing on occluded images and substituting occlusion markers, leading to enhanced segmentation accuracy.

Despite advancements in many areas, existing methods still face challenges in addressing domain shifts and overfitting. Our proposed RobustOVS framework leverages CLIP's global semantic priors to calibrate intra- and inter-class spatial relationships, significantly advancing the state of open-vocabulary segmentation.

3. RobustOVS Method

Figure 2 illustrates the workflow of RobustOVS. This framework introduces innovations to the two-stage paradigm. First, we employ various segmentation models to simultaneously generate a set of class-agnostic mask proposals and corresponding proposal embeddings for images at different resolutions. These proposal embeddings are aligned with linguistic features for model classification. In RobustOVS, we propose an advanced Semantic Integration Module (SIM) that transfers global semantic priors from CLIP [64] to the proposal embeddings' FN layer, calibrating the model's feature space for both in-vocabulary and out-of-vocabulary semantics.
Processed sub-images are subsequently sent to CLIP for mask-level classification, and the classification results from CLIP and the proposal embeddings are combined for output.

3.1. Semantic Integration Module (SIM)

In model classification, learnable proposal embeddings often face semantic overfitting to training data, limiting their adaptability to novel categories. To address this challenge, we introduce the Semantic Integration Module (SIM). At its core, SIM leverages CLIP's prior knowledge to refine the semantic responses of mask proposal embeddings. SIM extracts implicit semantics from input images using a frozen CLIP model and generates hierarchical features that integrate spatial tokens and a general CLS token, enhancing the proposal embeddings' effectiveness. To optimize feature integration for high-level semantic alignment, we design a low-frequency enhancement structure to reduce potential texture noise. This involves applying a Fourier transform to the features, followed by Gaussian filtering for low-frequency enhancement. The processed features are concatenated and injected into the proposal embeddings, then further aligned through a multi-head cross-attention mechanism. Finally, CLIP's visual embeddings are introduced to bridge the gap between visual and linguistic spaces, producing fully aligned proposal embeddings.

3.2. Efficient Network Structure

Completing open-vocabulary semantic segmentation requires dividing an image into regions with similar features, as seen in semantic segmentation tasks where pixels sharing similar semantic attributes are grouped into the same category. Beyond basic semantic segmentation, open-vocabulary semantic segmentation also involves distinguishing instances within the same category, akin to the demands of panoptic segmentation. This module aims to develop a model that segments an image into distinct regions with unique masks and assigns class probabilities to each region.
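As an illustration of how per-region masks and class probabilities turn into a segmentation map, the sketch below applies the inference rule commonly used by MaskFormer-style models: the score of class k at a pixel is the sum, over proposals, of the class probability times the soft mask value. Whether RobustOVS uses exactly this rule is an assumption; the function name and the toy inputs are hypothetical.

```python
def semantic_inference(class_probs, mask_probs):
    """Combine per-proposal class probabilities and soft masks into a label map.

    class_probs: N lists of K class probabilities (one list per mask proposal).
    mask_probs:  N soft masks, each an H x W grid of values in [0, 1].
    Returns an H x W grid of class indices.
    """
    num_classes = len(class_probs[0])
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    label_map = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class k at this pixel: sum over proposals of
            # p(class k | proposal) * p(pixel belongs to proposal).
            scores = [
                sum(cp[k] * mp[y][x] for cp, mp in zip(class_probs, mask_probs))
                for k in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        label_map.append(row)
    return label_map

# Two proposals, two classes, a 2 x 3 image (toy values).
class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [
    [[0.9, 0.8, 0.1], [0.7, 0.2, 0.1]],
    [[0.1, 0.2, 0.9], [0.3, 0.8, 0.9]],
]
label_map = semantic_inference(class_probs, mask_probs)
# label_map == [[0, 0, 1], [0, 1, 1]]
```

Here proposal 0 mostly votes for class 0 and covers the left of the image, while proposal 1 mostly votes for class 1 and covers the right, so the label map splits accordingly.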
To achieve this, we adopt the MaskFormer framework, whose core component is a transformer decoder. The decoder uses N learnable queries and high-resolution image features as input to refine these queries, which are subsequently used to generate predictions. Through multiple transformer blocks, the decoder attends to feature representations and models relationships among different objects. Each block calculates cross-attention between image features and object queries, relying on computationally intensive dot products. While MaskFormer demonstrates significant performance, its efficiency suffers when handling large input features in segmentation tasks.

To address this issue, we propose a more efficient network structure based on the Prototype-enhanced Mask Cross-Attention (PEM-CA) module. PEM-CA exploits the intrinsic redundancy of image features in segmentation tasks, significantly reducing the number of input tokens in the attention layers via a prototype selection mechanism. Inspired by recent advances in efficient attention modules, PEM-CA redesigns the cross-attention operation by modeling interactions using computationally lightweight element-wise operations. To further enhance efficiency, we employ a fully convolutional Feature Pyramid Network (FPN) and introduce a Context Self-Adjustment Module (CSM) and deformable convolutions to restore contextual information and dynamicity, improving performance while controlling computational overhead.

3.3. Two-Stage Open-Vocabulary Semantic Segmentation Model

To enhance segmentation efficiency, we introduce a two-stage open-vocabulary semantic segmentation model. During training, a clear image is artificially degraded to generate a low-quality version. The clear image is input into a network structure based on MPFormer to obtain mask features and token features, which contribute to the final segmentation results.
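The construction of the clear/degraded training pair can be sketched as follows. The paper does not state which degradations are applied, so the 3x3 box blur plus Gaussian noise used here is purely an assumption, as are the function name `degrade`, the noise level, and the toy grayscale image.

```python
import random

def degrade(image, noise_std=0.05, seed=0):
    """Return a blurred, noisy copy of a grayscale image (H x W, values in [0, 1])."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    degraded = []
    for y in range(height):
        row = []
        for x in range(width):
            # 3x3 box blur, with the window clipped at the image border.
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            value = sum(window) / len(window) + rng.gauss(0.0, noise_std)
            row.append(min(1.0, max(0.0, value)))  # keep intensities valid
        degraded.append(row)
    return degraded

clear = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]  # sharp vertical edge
low_quality = degrade(clear)
training_pair = (clear, low_quality)  # one input per branch
```

The degraded copy keeps the same spatial size as the clear image, so features from the two branches can be compared position by position.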
Additionally, RobustOVS extracts low-level features from the image encoder and incorporates them with the mask features and token features for consistency loss calculation. The degraded image is fed into the RobustSAM model, specifically its PEM network structure, to extract corresponding features. Since the input is of low quality, the output features are expected to include degradation information that hinders segmentation. To mitigate this degradation, we design an efficient fusion module to reduce the consistency loss between the features output by MPFormer and those from the PEM network structure. This approach improves feature alignment and segmentation performance.

4. Experiments

4.1 Experimental setup

Training Dataset

We trained the RobustOVS model on the COCO dataset [25]. Initially, the panoptic segmentation module of the RobustOVS model was trained using segmentation labels from the COCO-Stuff dataset [26]. Subsequently, we fine-tuned the CLIP model using the mask-category dataset derived from COCO Captions [27]. COCO-Stuff consists of 118k training images annotated with 171 valid categories, covering a wide range of content from objects (e.g., orange, car) to materials (e.g., sky, road). Unless specified otherwise, all 171 categories were utilized during training.

Test Datasets

To evaluate the effectiveness of our method, we conducted experiments on several popular image benchmarks, including ADE20K-150 [28], ADE20K-847 [28], Pascal VOC [29], Pascal Context-59 [30], and Pascal Context-459 [30].

ADE20K: A pixel-level densely annotated dataset for scene understanding, comprising 20k training images, 2k validation images, and 3k test images with diverse annotations of indoor and outdoor scenes. We evaluated two category versions: 150 common categories (A-150) and 847 more diverse categories (A-847).

Pascal VOC: A classic segmentation dataset with 11,185 training images and 1,449 validation images.
We evaluated on the 1.5k validation images annotated with 20 categories (PAS-20).

Pascal Context: An extended version of Pascal VOC 2010 that provides annotations for the entire scene, containing 4,998 training images and 5,005 validation images. We evaluated on the commonly used PC-59 and the more challenging PC-459 versions.

4.2 Implementation Details

The RobustOVS model consists of two primary components: a segmentation model and a CLIP model adapted for masks. The segmentation model is based on derivatives of MaskFormer, specifically MPFormer [68] and PEM [69], while the CLIP model is implemented using OpenCLIP. Final category predictions are made through an ensemble approach that integrates outputs from the segmentation model and CLIP.

For the segmentation model: The MPFormer component uses Swin Transformer-Base [70] as the backbone, initialized with weights pre-trained on ImageNet-21K. The model was trained with the AdamW [31] optimizer, employing a polynomial learning rate schedule with an initial learning rate of 6 × 10⁻⁵ and a weight decay of 0.01. Input images were resized to 640 pixels on the shorter side and cropped to 640 × 640. The batch size was 32, and training ran for a total of 120k iterations. Data augmentation included random flipping and color jittering, while other hyperparameters followed Mask2Former and MaskFormer settings. The loss function combined Dice loss and cross-entropy loss for segmentation tasks (weights of 5 and 2, respectively) and cross-entropy loss for classification tasks.

For the CLIP model: The architecture used ViT-L/14, implemented with OpenCLIP. We explored three adaptation strategies: Mask Prompt Tuning (MPT), Full Model Fine-Tuning (FT), and a combination of both (MPT + FT). MPT initialized learnable tokens randomly and applied deep prompting, with a default prompt depth of 3 unless specified. Training used the AdamW optimizer with an initial learning rate of 2 × 10⁻², no weight decay, and cosine annealing. Input size was 224 × 224, batch size 256, and training spanned 5 epochs. For FT, the training process was similar, but the initial learning rate was reduced to 5 × 10⁻⁶, and the weight decay was increased to 0.2. For the MPT + FT method, the model was initialized with a fully fine-tuned CLIP and further refined using mask prompt tuning to improve stability and performance. The text encoder of CLIP remained frozen in all experiments.

Finally, segmentation and classification predictions were combined to enhance overall performance, leveraging both pixel-level and semantic-level understanding.

Table 1. Comparison with State-of-the-Art Methods (mIoU). ADE, PC, and VOC denote the ADE20K [28], Pascal Context [30], and Pascal VOC [29] datasets, respectively.

| Method | VL-Model | Training Dataset | ADE-150 | ADE-847 | PC-59 | PC-459 | VOC |
|---|---|---|---|---|---|---|---|
| GroupViT [71] | rand. init. | CC12M + YFCC | - | - | 22.4 | - | 52.3 |
| LSeg+ [44] | ALIGN RN101 | COCO | 13 | 2.5 | 36 | 5.2 | 59 |
| OpenSeg [11] | ALIGN RN101 | COCO | 15.3 | 4 | 36.9 | 6.5 | 60 |
| LSeg+ [44] | ALIGN EN-B7 | COCO | 18 | 3.8 | 46.5 | 7.8 | - |
| OpenSeg [11] | ALIGN EN-B7 | COCO | 21.1 | 6.3 | 42.1 | 9 | - |
| OpenSeg [11] | ALIGN EN-B7 | COCO + Loc. Narr. | 28.6 | 8.8 | 48.2 | 12.2 | 72.2 |
| SimSeg [49] | CLIP ViT-B/16 | COCO | 20.5 | 7 | 47.7 | 8.7 | 88.4 |
| SimSeg [49] | CLIP ViT-B/16 | COCO | 21.1 | 6.9 | 51.9 | 9.7 | 91.8 |
| OVSeg [72] | CLIP ViT-B/16 | COCO | 24.8 | 7.1 | 53.3 | 11 | 92.6 |
| MAFT [73] | CLIP ViT-B/16 | COCO | 29.1 | 10.1 | 53.5 | 12.8 | 90 |
| SAN [74] | CLIP ViT-B/16 | COCO | 27.5 | 10.1 | 53.8 | 12.6 | 94 |
| MaskCLIP [75] | CLIP ViT-L/14 | COCO | 23.7 | 8.2 | 45.9 | 10 | - |
| SimSeg† [49] | CLIP ViT-L/14 | COCO | 21.7 | 7.1 | 52.2 | 10.2 | 92.3 |
| OVSeg [72] | CLIP ViT-L/14 | COCO | 29.6 | 9 | 55.7 | 12.4 | 94.5 |
| ODISE [48] | CLIP ViT-L/14 | COCO | 29.9 | 11.1 | 57.3 | 14.5 | - |
| SAN [74] | CLIP ViT-L/14 | COCO | 32.1 | 12.4 | 57.7 | 15.7 | 94.6 |
| RobustOVS (Ours) | CLIP ViT-L/14 | COCO | 33.59 | 14.17 | 58.63 | 16.85 | 96.32 |

4.3 Main Results

In comparison with current state-of-the-art methods, the RobustOVS model demonstrates superior performance across multiple datasets.
Notably, unlike other models, RobustOVS is entirely trained using a single GPU, eliminating the reliance on large-scale computational resources. On the ADE-150 dataset, RobustOVS achieves a mean Intersection over Union (mIoU) of 33.59, slightly outperforming the current state-of-the-art method's 33.5, a relative improvement of 0.27%. For the ADE-847 dataset, the model attains an mIoU of 14.17, a relative increase of 1.21% over the leading method's 14.0. On the PC-59 dataset, RobustOVS achieves an mIoU of 58.63, slightly lower than the state-of-the-art method's 59.3, a relative decline of 1.13%. This gap should be weighed against the model's computational efficiency and robustness under single-GPU training. On the PC-459 dataset, RobustOVS achieves an mIoU of 16.85, surpassing the state-of-the-art method's 16.7 with a relative improvement of 0.90%. Meanwhile, on the VOC dataset, RobustOVS attains an mIoU of 96.32, slightly below the leading method's 97.2, a relative decrease of 0.91%.

Although RobustOVS shows slightly lower performance on some larger datasets (e.g., PC-59 and VOC), its lightweight training strategy and outstanding performance underscore its strong generalization ability and robustness, especially on smaller datasets where its advantages are more pronounced. These characteristics demonstrate that RobustOVS not only reduces training costs but also provides an efficient solution for research scenarios with limited resources.

Table 2. Comparison of computational resources between ODISE [79] and RobustOVS.

| Method | GPU Type | Training Time | GPU Memory Usage | Parameters |
|---|---|---|---|---|
| ODISE | 8×A100 | 120 h | 320 GB | 1.2B |
| RobustOVS | 1×RTX3090 | 50 h | 24 GB | 0.4B |

To further highlight the computational efficiency of RobustOVS, we present a direct comparison of its resource requirements against a representative state-of-the-art model, ODISE.
As shown in Table 2, ODISE requires 8 A100 GPUs, 120 hours of training time, and 320 GB of GPU memory to reach its performance, along with a model size of 1.2 billion parameters. In contrast, RobustOVS is trained entirely on a single RTX3090 GPU within only 50 hours, using just 24 GB of GPU memory and containing only 0.4 billion parameters. This stark difference underscores the significant reduction in computational cost and hardware demands achieved by RobustOVS, making it a more accessible and scalable solution for practical applications with limited resources.

4.4 Ablation studies

To evaluate the impact of different core modules on semantic segmentation performance, we designed three model frameworks and conducted comparative experiments on multiple datasets. First, we proposed a model based on PEM (the efficient semantic segmentation module), aiming to achieve balanced performance at lower computational cost. Second, we developed a model based on MPFormer (the high-precision semantic segmentation module), which enhances overall performance by improving segmentation accuracy but comes with relatively higher computational overhead. Finally, we introduced a hybrid framework that combines the strengths of PEM and MPFormer, aiming to balance accuracy and efficiency by integrating the characteristics of both modules.

The experimental results reveal distinct trends. The PEM-based model showed lower mIoU than the state of the art (SOTA) on all datasets, particularly on ADE150 and VOC, where its mIoU scores of 32.76 and 94.02 fall short of SOTA by 2.21% and 3.27% in relative terms. These results indicate that while the PEM approach is computationally efficient, its accuracy limitations hinder optimal performance in more complex tasks.
In contrast, the MPFormer-based model exhibited improvements in accuracy, especially on the VOC dataset, where it achieved an mIoU of 95.01, about 1.05% higher in relative terms than the PEM-based model (94.02) but still 2.25% below SOTA (97.2). However, despite its higher computational cost, its gains were limited on ADE847 and PC59: on ADE847 it improved only from 13.11 to 13.21, still 5.64% below SOTA, and on PC59 it was on par with the PEM-based model (57.73 vs. 57.79).

The hybrid framework outperformed both single-module models. On ADE150 it achieved an mIoU of 33.59, a relative improvement of 0.27% over SOTA. On PC59 it reached 58.63, higher than either single-module model but 1.13% below SOTA. These results demonstrate that by leveraging the strengths of both PEM and MPFormer, the hybrid approach can deliver more accurate segmentation results while maintaining reasonable efficiency. On the VOC dataset, the hybrid model achieved an mIoU of 96.32, closely approaching the SOTA score of 97.2, with a minor relative gap of 0.91%, suggesting a well-balanced trade-off between accuracy and efficiency.

Overall, the hybrid model outperformed both single-module models on every dataset evaluated, particularly those requiring higher precision. This underscores the potential of integrating diverse modules to overcome the limitations of individual components, enhancing the overall performance of semantic segmentation models. Notably, in open-vocabulary semantic segmentation tasks, the hybrid approach offers improved robustness and adaptability.
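The percentages quoted alongside the mIoU scores in Sections 4.3 and 4.4 behave as relative differences with respect to the reference score, not as absolute mIoU points. A minimal sketch that recomputes the Section 4.3 values, assuming the best prior scores are exactly those quoted in the text (the helper name `relative_change` is introduced here for illustration):

```python
def relative_change(score, reference):
    """Relative difference of `score` with respect to `reference`, in percent."""
    return 100.0 * (score - reference) / reference

# (RobustOVS mIoU, best prior mIoU) as quoted in Section 4.3.
reported = {
    "ADE-150": (33.59, 33.5),
    "ADE-847": (14.17, 14.0),
    "PC-59": (58.63, 59.3),
    "PC-459": (16.85, 16.7),
    "VOC": (96.32, 97.2),
}

changes = {
    name: round(relative_change(ours, best), 2)
    for name, (ours, best) in reported.items()
}
for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

For ADE-150, for instance, the absolute gap is only 0.09 mIoU points, which corresponds to roughly 0.27% in relative terms.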
Table 3. Ablation Study on ADE-847 Dataset

| Configuration | ADE-847 mIoU | Training Time (hours) |
|---|---|---|
| Baseline | 12.1 | 45 |
| +SIM | 13.5 (+1.4) | 48 (+3) |
| +SIM + PEM-CA | 14.2 (+0.7) | 50 (+2) |

In addition to accuracy, we also conducted an ablation study on how our proposed modules, SIM (Semantic Integration Module) and PEM-CA (Prototype-enhanced Mask Cross-Attention), affect training time and mIoU on the ADE-847 dataset. As shown in Table 3, the baseline model achieves a mean Intersection over Union (mIoU) of 12.1 with a training time of 45 hours. Incorporating the SIM module improves the mIoU to 13.5, a gain of 1.4 points, with only a slight increase in training time to 48 hours. When both SIM and PEM-CA modules are combined, the mIoU further increases to 14.2, a total gain of 2.1 points over the baseline, while the training time rises modestly to 50 hours. These results highlight the complementary nature of SIM and PEM-CA in enhancing segmentation accuracy with relatively minimal training overhead.

Table 4. Comparison of model variants on open-vocabulary segmentation benchmarks (mIoU)

| Dataset | Pem-ovs | Mp-ovs | Robust-ovs |
|---|---|---|---|
| ADE150 | 32.76 | 32.57 | 33.59 |
| ADE847 | 13.11 | 13.21 | 14.17 |
| PC59 | 57.79 | 57.73 | 58.63 |
| VOC | 94.02 | 95.01 | 96.32 |

Furthermore, we compared three model variants (Pem-ovs, Mp-ovs, and Robust-ovs) on multiple datasets to assess robustness across varied domains. As shown in Table 4, the hybrid model (Robust-ovs) consistently achieves the highest mIoU on every dataset, particularly excelling on the Pascal VOC and PC59 datasets. On VOC, it achieves an mIoU of 96.32, underscoring its strong generalization capability. These results reaffirm the effectiveness of combining lightweight and high-precision modules in enhancing both accuracy and robustness across diverse segmentation challenges.

5. Conclusion

This paper investigates the problem of open-vocabulary semantic segmentation and proposes an enhanced network called the Robust Open-Vocabulary Segmentation Model (RobustOVS). By introducing a semantic integration module, RobustOVS effectively incorporates the global semantic awareness of the original CLIP model into two distinct semantic segmentation modules, thereby improving the performance of open-vocabulary semantic segmentation. Additionally, we present a novel two-stage architecture that leverages both low-resolution and high-resolution image processing techniques, reducing computational costs without compromising accuracy. Experimental results demonstrate that RobustOVS achieves state-of-the-art performance on popular benchmarks such as ADE20K-847 and Pascal Context-459, validating its effectiveness in expanding the semantic space and enhancing model robustness. Overall, RobustOVS not only achieves significant advancements in handling occluded images but also demonstrates that open-vocabulary general models can achieve performance comparable to specialized supervised models, highlighting its broad application prospects and potential.

Declarations

Declaration of Generative AI and AI-assisted Technologies in the Writing Process

During the preparation of this work, the author(s) used ChatGPT-4.0 to improve the language and readability of the manuscript. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

Author Contribution

R.W. conceived the study, designed the methodology, and conducted the experiments. G.W. contributed to the theoretical framework and data analysis. M.L. assisted with implementation and validation. All authors participated in writing and critically reviewing the manuscript.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (62172247) and the Qingdao Natural Science Foundation (No.
23- 2-1-163-zyyd-jch) and the Textile Plus Joint Research Program of Qingdao University (No. FZ2024101). References Chen, L.-C., Papandreou, G., Murphy, I.K., Alan, L., Yuille: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 1, 2, 4, 6 (2017) Ding, J., Xue, N., Xia, G.-S., Dai, D.: Decoupling zero-shot semantic segmentation. CVPR. 1 (3), 5, 6 (2022) Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I.: John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV. 111 (2), 98–136 (2015) Roozbeh Mottaghi, X., Chen, X., Liu, N.-G., Cho, S.-W., Lee, S., Fidler, R., Urtasun, Alan, L.: Yuille. The role of context for object detection and semantic segmentation in the wild. CVPR. 5 , 6 (2014) Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Han Hu, and, Bai, X.: A simple baseline for zeroshot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112 14757. 2 (1), 7 (2021) Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D.: Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In CVPR, pages 136–145, 3 (2017) Ding, J., Xue, N., Xia, G.-S., Dai, D.: Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 1, 3, 4, 6, 7, 11, 12 (2022) Golnaz Ghiasi, X., Gu, Y., Cui, Tsung-Yi, Lin: Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143, 2021. 1, 2, 3, 4, 6, 7 Boyi Li, K.Q., Weinberger, S., Belongie, V., Koltun, Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022. 1, 3, 6, 7 Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Han Hu, and, Bai, X.: A simple baseline for zeroshot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112 14757. 
(2021)
[11] Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., Raff, E.: VQGAN-CLIP: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583 (2022)
[12] Gu, X., Lin, T.-Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)
[13] Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: RegionCLIP: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16793–16803 (2022)
[14] Ding, Z., Wang, J., Tu, Z.: Open-vocabulary panoptic segmentation with MaskCLIP. arXiv preprint arXiv:2208.08984 (2022)
[15] Kim, K., Oh, Y., Ye, J.C.: ZegOT: Zero-shot segmentation through optimal transport of text prompts. arXiv preprint arXiv:2301.12171 (2023)
[16] Luo, H., Bao, J., Wu, Y., He, X., Li, T.: SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. arXiv preprint arXiv:2211.14813 (2022)
[17] Xu, M., Zhang, Z., Wei, F., Hu, H., Bai, X.: Side adapter network for open-vocabulary semantic segmentation. arXiv preprint arXiv:2302.12242 (2023)
[18] Bucher, M., Vu, T.-H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Adv. Neural Inf. Process. Syst. 32 (2019)
[19] Xian, Y., Choudhury, S., He, Y., Schiele, B., Akata, Z.: Semantic projection network for zero- and few-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8256–8265 (2019)
[20] Xu, J., De Mello, S., Liu, S., Byeon, W., Breuel, T., Kautz, J., Wang, X.: GroupViT: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18134–18144 (2022)
[21] Lester, B., Al-Rfou, R., Constant, N.:
The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)
[22] Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)
[23] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586 (2021)
[24] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vision 130(9), 2337–2348 (2022)
[25] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer (2014)
[26] Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218 (2018)
[27] Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
[28] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vision 127(3), 302–321 (2019)
[29] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010)
[30] Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898 (2014)
[31] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization.
arXiv preprint arXiv:1711.05101 (2017)
[32] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2018)
[33] Ding, H., Jiang, X., Shuai, B., Liu, A.Q., Wang, G.: Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR (2018)
[34] Fang, Y., Zhu, F., Cheng, B., Liu, L., Wei, Y., Zhao, Y.: Locating noise is halfway denoising for semi-supervised segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
[35] Guo, M.-H., Lu, C., Hou, Q., Liu, Z.-N., Cheng, M.-M., Hu, S.-M.: SegNeXt: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575 (2022)
[36] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In CVPR (2015)
[37] Qu, M., Wu, Y., Wei, Y., Liu, W., Liang, X., Zhao, Y.: Learning to segment every referring object point by point. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
[38] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In MICCAI (2015)
[39] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS (2021)
[40] Ding, J., Xue, N., Xia, G.-S., Dai, D.: Decoupling zero-shot semantic segmentation. In CVPR (2022)
[41] Ghiasi, G., Gu, X., Cui, Y., Lin, T.-Y.: Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143 (2021)
[42] Han, K., Liu, Y., Liew, J.H., Ding, H., Liu, J., Wang, Y., Tang, Y., Yang, Y., Feng, J., Zhao, Y., et al.: Global knowledge calibration for fast open-vocabulary segmentation.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 797–807 (2023)
[43] He, S., Ding, H., Jiang, W.: Primitive generation and semantic-related alignment for universal zero-shot segmentation. In CVPR (2023)
[44] Li, B., Weinberger, K.Q., Belongie, S.J., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In ICLR (2022)
[45] Liu, Y., Zhang, C., Wang, Y., Wang, J., Yang, Y., Tang, Y.: Universal segmentation at arbitrary granularity with language instruction. arXiv preprint arXiv:2312.01623 (2023)
[46] Xian, Y., Choudhury, S., He, Y., Schiele, B., Akata, Z.: Semantic projection network for zero- and few-label semantic segmentation. In CVPR (2019)
[47] Zhang, H., Ding, H.: Prototypical matching and open set rejection for zero-shot semantic segmentation. In ICCV (2021)
[48] Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, pp. 2955–2966 (2023)
[49] Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., Bai, X.: A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112.14757 (2021)
[50] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In CVPR (2017)
[51] Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes challenge: A retrospective. IJCV 111, 98–136 (2015)
[52] Chen, L., Yang, Q., Ding, K., et al.: Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation. arXiv preprint arXiv:2501.17642 (2025)
[53] Pang, L., Yao, J., Li, K., et al.: SPECIAL: Zero-shot Hyperspectral Image Classification With CLIP. arXiv preprint arXiv:2501.16222 (2025)
[54] Sun, H., Gong, R., Nejjar, I., et al.: DynAlign: Unsupervised Dynamic Taxonomy Alignment for Cross-Domain Segmentation (2025).
arXiv preprint arXiv:2501.16410
[55] Zhang, D., Feng, T., Xue, L., et al.: Parameter-Efficient Fine-Tuning for Foundation Models. arXiv preprint arXiv:2501.13787 (2025)
[56] Li, K., Cao, X., Deng, Y., et al.: DynamicEarth: How Far are We from Open-Vocabulary Change Detection? arXiv preprint arXiv:2501.12931 (2025)
[57] Zermatten, V., Castillo-Navarro, J., Marcos, D., et al.: Learning transferable land cover semantics for open vocabulary interactions with remote sensing images. ISPRS J. Photogrammetry Remote Sens. 220, 621–636 (2025)
[58] Choi, J., Lee, S., Lee, M., et al.: Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation. arXiv preprint arXiv:2501.09688 (2025)
[59] Bai, M., Yu, X., Wang, Y., et al.: Enhancing pixel-level analysis in medical imaging through visual instruction tuning: introducing PLAMi. Visual Comput., 1–17 (2024)
[60] Zhou, E., Su, Q., Chi, C., et al.: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection. arXiv preprint arXiv:2412.04455 (2024)
[61] Huang, C., Yan, S., Burgard, W.: BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding. arXiv preprint arXiv:2412.02449 (2024)
[62] Dao, S.D., Shi, H., Phung, D.Q., et al.: CA-Ovs: Cluster and Adapt Mask Proposals for Open-Vocabulary Semantic Segmentation. In Proceedings of the 6th ACM International Conference on Multimedia in Asia, pp. 1–8 (2024)
[63] Bucher, M., Vu, T.-H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. In NeurIPS (2019)
[64] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In ICML (2021)
[65] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In CVPR, pp. 770–778 (2016)
[66] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114 (2019)
[67] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In NeurIPS (2021)
[68] Zhang, H., Li, F., Xu, H., Huang, S., Liu, S., Ni, L.M., Zhang, L.: MP-Former: Mask-piloted transformer for image segmentation. arXiv preprint arXiv:2303.07336 (2023)
[69] Cavagnero, N., Rosi, G., Cuttano, C., et al.: PEM: Prototype-based efficient MaskFormer for image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15804–15813 (2024)
[70] Liu, Z., Lin, Y., Cao, Y., et al.: Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
[71] Xu, J., De Mello, S., Liu, S., Byeon, W., Breuel, T.M., Kautz, J., Wang, X.: GroupViT: Semantic segmentation emerges from text supervision. In CVPR (2022)
[72] Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted CLIP. arXiv preprint arXiv:2210.04150 (2022)
[73] Jiao, S., Wei, Y., Wang, Y., Zhao, Y., Shi, H.: Learning mask-aware CLIP representations for zero-shot segmentation. arXiv preprint arXiv:2310.00240 (2023)
[74] Xu, M., Zhang, Z., Wei, F., Hu, H., Bai, X.: Side adapter network for open-vocabulary semantic segmentation. In CVPR, pp. 2945–2954 (2023)
[75] Ding, Z., Wang, J., Tu, Z.: Open vocabulary panoptic segmentation with MaskCLIP.
arXiv preprint arXiv:2208.08984 (2022)
[76] Xu, Y.-H., Wang, Z.-H., Wang, Z.-R., Fan, R., Wang, X.A.: Recommendation Algorithm Based on a Self-supervised Learning Pretrain Transformer
[77] Xu, Y.H., Wang, Z.H., Wang, Z.R., Guo, Y.L., Fan, R., Tian, H.Y., Wang, X.: SimDCL: dropout-based simple graph contrastive learning for recommendation
[78] Chen, H., Zhang, F., Li, Q., Li, X., Ding, Y., Zhang, D., Cheng, J., Wang, X.: Triple confidence-aware encoder-decoder model for commonsense knowledge graph completion
[79] Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.04803 (2023)

Additional Declarations

No competing interests reported.

Supplementary Files

floatimage1.png

Version 1 posted; Editorial decision: Revision requested 25 Dec, 2025; Reviews received at journal 16 Dec, 2025; Reviewers agreed at journal 01 Dec, 2025; Reviews received at journal 01 Dec, 2025; Reviewers agreed at journal 24 Nov, 2025; Reviews received at journal 11 Aug, 2025; Reviewers agreed at journal 02 Aug, 2025; Reviewers invited by journal 02 Aug, 2025; Editor assigned by journal 25 Jun, 2025; Submission checks completed at journal 11 Jun, 2025; First submitted to journal 08 Jun, 2025
{"props":{"pageProps":{"initialData":{"identity":"rs-6850046","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":494799897,"identity":"1a077762-bc72-4d25-9ebb-cd90ed5a4e43","order_by":0,"name":"Ruihan Wang","email":"","orcid":"","institution":"Qingdao University","correspondingAuthor":false,"prefix":"","firstName":"Ruihan","middleName":"","lastName":"Wang","suffix":""},{"id":494799898,"identity":"715a3d71-b1ca-4242-afc8-fbff739ebebf","order_by":1,"name":"Guodong Wang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAz0lEQVRIiWNgGAWjYDCCAzDG8QYgYWBBlBbGBjDjDEizgQQpWm4kgEgitPAdb37+4OOew3Z9N59f3fCjQIKBv707Aa8WyTPHDBtnPDucPPN2TtnNHqDDJM6c3YBXi8GNHMZmngO3kw1u56Td4AFqMZDIJVbLzTNpN/+QosXO4Ab7sdtE2QLyy8wZB/4nSJ7JYbstYyDBQ9AvwBB78OHDgTR7vuPHn91888dGjr+9F78WGEhsYOAxADF4iFIOAvYMDOwPiFY9CkbBKBgFIwsAAEXbU/xEhLPIAAAAAElFTkSuQmCC","orcid":"","institution":"Qingdao University","correspondingAuthor":true,"prefix":"","firstName":"Guodong","middleName":"","lastName":"Wang","suffix":""},{"id":494799899,"identity":"15d28b18-90c0-4086-ad4c-585e87642635","order_by":2,"name":"Mingtao Liu","email":"","orcid":"","institution":"Linyi University","correspondingAuthor":false,"prefix":"","firstName":"Mingtao","middleName":"","lastName":"Liu","suffix":""}],"badges":[],"createdAt":"2025-06-09 
02:53:12","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6850046/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6850046/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s00530-026-02267-0","type":"published","date":"2026-03-10T16:00:10+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":88757513,"identity":"5b2621c4-ed57-4a25-9565-7fa2e65358a8","added_by":"auto","created_at":"2025-08-11 07:27:05","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":132850,"visible":true,"origin":"","legend":"\u003cp\u003e\u0026nbsp;Workflow of RobustOVS. First, two segmentation models are used simultaneously to generate class-agnostic masks and corresponding proposal embeddings from images of varying clarity to facilitate cross-modal alignment. To prevent collapse into known categories, the proposal embeddings are calibrated by integrating CLIP's global semantic priors within a semantic integration module. Additionally, cropped and masked images are fed into the context-shifted CLIP for domain-adaptive classification. 
Finally, the matched feature maps are fused through a fusion module to produce the final results.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-6850046/v1/01d43fe7ee06ab131cf97c9e.png"},{"id":104739956,"identity":"068e0bc3-eae4-46e8-a337-92f20bbd3d0d","added_by":"auto","created_at":"2026-03-16 16:14:00","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":918688,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6850046/v1/d65fe86e-10e0-4e04-88ab-29632a268c8f.pdf"},{"id":88755328,"identity":"0d727d8a-100c-421d-b6e5-1534e2724f76","added_by":"auto","created_at":"2025-08-11 07:11:05","extension":"png","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":544881,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-6850046/v1/1cd9b06e7a6c1b83f8d770d9.png"}],"financialInterests":"No competing interests reported.","formattedTitle":"RobustOVS: Open-Vocabulary Segmentation with Robustly Semantic-Assisted Calibration","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eSemantic segmentation is a fundamental task in the field of computer vision, aiming to assign each pixel in an input image to a specific semantic category. This task requires models to not only identify the categories of objects in an image but also segment the boundaries of these objects with precision. 
Despite significant advancements in this area in recent years [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e], [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e], [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e], [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e], [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e], [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e], [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e], [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e], [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e], [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], traditional semantic segmentation models are generally trained on predefined categories. 
When encountering new, unseen categories during inference, these models often struggle to adapt.\u003c/p\u003e\u003cp\u003eTo address this challenge, researchers have begun exploring open-vocabulary semantic segmentation (OVS) [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e], [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e], [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e], [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e], [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e], [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e], [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e], [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e]. Unlike conventional models, modern semantic segmentation systems often handle thousands of categories, leveraging textual input as guidance to segment arbitrary objects.\u003c/p\u003e\u003cp\u003eThe vision-language model CLIP [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] learns rich multimodal features from billions of image-text pairs. CLIP\u0026rsquo;s key innovation lies in its robust zero-shot learning capability. While traditional computer vision models require task-specific supervised training, CLIP processes diverse downstream tasks without targeted training by learning from large-scale image-text pairs. For instance, CLIP can classify images directly based on natural language descriptions without the need for annotated data for each category. However, recognizing unseen categories accurately without external knowledge remains challenging. 
Early research proposed leveraging pre-trained vision-language models for OVS [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eApplying CLIP to OVS poses challenges, as CLIP is trained via contrastive learning at the image level. Consequently, pre-trained CLIP struggles to achieve satisfactory classification results on masked images due to its lack of pixel-level recognition capabilities required for semantic segmentation. Two-stage approaches have shown promise [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e], [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e], [\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e], [\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]: first generating category-agnostic mask proposals, followed by open-vocabulary classification using pre-trained CLIP. While this approach has achieved progress, aligning visual content with unrestricted textual input remains suboptimal and requires substantial computational resources during training.\u003c/p\u003e\u003cp\u003eOur analysis identifies two key limitations: (1) The regions recognized by CLIP for mask classification often do not overlap with the actual mask regions, indicating domain discrepancies in CLIP\u0026rsquo;s pre-trained visual inputs. 
(2) The proposal embeddings in segmentation models are tuned to the training semantic space, making the model insensitive to new vocabulary.\u003c/p\u003e\u003cp\u003eThese domain biases in pre-trained CLIP not only hinder the alignment of segmentation results with textual descriptions but also waste computational resources, making it difficult for standard research labs to meet the computational demands of training such models.\u003c/p\u003e\u003cp\u003eTo address these challenges, we propose the Robust Open-Vocabulary Segmentation Model (RobustOVS), a network architecture that mitigates domain biases in pre-trained CLIP models and aligns visual content with unrestricted textual semantics beyond predefined vocabularies.\u003c/p\u003e\u003cp\u003eIn summary, our contributions are: (1) We propose RobustOVS, a model designed to ease the performance limitations that arise when aligning visual content with unrestricted textual semantics, thus expanding the semantic space. (2) We introduce a semantic integration module, which embeds global semantic awareness from the original CLIP into the proposal embeddings of two distinct semantic segmentation modules, enhancing OVS performance. (3) We develop a novel two-stage OVS framework that processes low-resolution and high-resolution images simultaneously using multiple advanced semantic segmentation models. This significantly reduces computational overhead without sacrificing performance. 
(4) RobustOVS achieves new state-of-the-art results across popular OVS benchmarks, including ADE20K-847 [\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e] and Pascal Context-459 [\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eBy integrating the global semantics of pre-trained CLIP into segmentation proposals and employing an efficient cross-attention mechanism combined with prototype selection strategies, RobustOVS effectively reduces computational demands while alleviating domain biases. This innovation makes OVS more accessible for standard research environments while achieving superior performance.\u003c/p\u003e"},{"header":"2. Related Works","content":"\u003cp\u003eThe vision-language pretraining model [\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e], [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e], [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e] has emerged as a key research focus in the interdisciplinary domain of computer vision and natural language processing. Pretrained vision-language models such as CLIP [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] leverage contrastive learning to associate images with textual descriptions, demonstrating exceptional cross-modal alignment capabilities. 
Beyond excelling in image classification tasks, CLIP has been widely applied in diverse domains, including image generation [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e], object detection [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e], [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], and image segmentation [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e], [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. RegionCLIP [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] further extends CLIP\u0026rsquo;s application to object detection by fine-tuning on region proposals, improving detection performance. The OVSeg approach introduces a mask prompt tuning strategy that allows CLIP's weights to be shared across multi-task environments without full fine-tuning.\u003c/p\u003e\u003cp\u003eOpen-vocabulary segmentation (OVS) aims to segment images into arbitrary categories described by text. Early methods like ZS3Net [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] and SPNet [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] employed word embeddings to align visual and semantic features, while GroupViT [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] used text supervision to group image regions into segmentation masks. 
With the advent of CLIP, methods such as LSeg [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] and OpenSeg [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] have harnessed CLIP\u0026rsquo;s text encoder to further improve segmentation tasks. These approaches align textual embeddings with pixel-level or segment-level visual features, enhancing segmentation accuracy.\u003c/p\u003e\u003cp\u003eRecent two-stage open-vocabulary segmentation methods, such as ZSSeg [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] and ZegFormer [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], generate class-agnostic mask proposals and leverage CLIP\u0026rsquo;s pretrained model for open-vocabulary classification, achieving notable progress. However, their performance on occluded images remains limited. Our proposed mask prompt tuning strategy utilizes blank regions in occluded images to improve segmentation performance without modifying CLIP\u0026rsquo;s weights.\u003c/p\u003e\u003cp\u003ePrompt tuning, an emerging adaptation technique for large-scale pretrained models, was initially applied in natural language processing [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e], [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] and has since been extended to the vision domain. In vision tasks, CoOp [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] adapts CLIP by adding learnable vectors before class tokens, improving performance in visual recognition tasks. Our mask prompt tuning strategy extends this concept by focusing on occluded images and substituting occlusion markers, leading to enhanced segmentation accuracy. 
Despite these advances, existing methods still struggle with domain shifts and overfitting. Our proposed RobustOVS framework leverages CLIP\u0026rsquo;s global semantic priors to calibrate intra- and inter-class spatial relationships, advancing the state of open-vocabulary segmentation.\u003c/p\u003e"},{"header":"3. RobustOVS","content":"\u003cp\u003e\u003cb\u003eMethod\u003c/b\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e illustrates the workflow of RobustOVS. This framework introduces innovations to the two-stage paradigm. First, we employ two segmentation models to simultaneously generate a set of class-agnostic mask proposals and corresponding proposal embeddings for images at different resolutions. These proposal embeddings are aligned with linguistic features for model classification. In RobustOVS, we propose an advanced Semantic Integration Module (SIM) that transfers global semantic priors from CLIP [\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e64\u003c/span\u003e] to the proposal embeddings' FN layer, calibrating the model's feature space for both in-vocabulary and out-of-vocabulary semantics. Processed sub-images are subsequently sent to CLIP for mask-level classification, and the classification results from CLIP and the proposal embeddings are combined for output.\u003c/p\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e3.1. Semantic Integration Module (SIM)\u003c/h2\u003e\u003cp\u003eIn model classification, learnable proposal embeddings often overfit to the semantics of the training data, limiting their adaptability to novel categories. To address this challenge, we introduce the Semantic Integration Module (SIM). At its core, SIM leverages CLIP's prior knowledge to refine the semantic responses of mask proposal embeddings. 
SIM extracts implicit semantics from input images using a frozen CLIP model and generates hierarchical features that integrate spatial tokens and a general CLS token, enhancing the proposal embeddings' effectiveness.\u003c/p\u003e\u003cp\u003eTo optimize feature integration for high-level semantic alignment, we design a low-frequency enhancement structure to reduce potential texture noise. This involves applying Fourier Transform to the features, followed by Gaussian filtering for low-frequency enhancement. The processed features are concatenated and injected into the proposal embeddings, further aligned through a multi-head cross-attention mechanism. Finally, CLIP's visual embeddings are introduced to bridge the gap between visual and linguistic spaces, producing fully aligned proposal embeddings.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e3.2. Efficient Network Structure\u003c/h2\u003e\u003cp\u003eCompleting open-vocabulary semantic segmentation requires dividing an image into regions with similar features, as seen in semantic segmentation tasks where pixels sharing similar semantic attributes are grouped into the same category. Beyond basic semantic segmentation, open-vocabulary semantic segmentation also involves distinguishing instances within the same category, akin to the demands of panoptic segmentation. This module aims to develop a model that segments an image into distinct regions with unique masks and assigns class probabilities to each region.\u003c/p\u003e\u003cp\u003eTo achieve this, we adopt the MaskFormer framework, whose core component is a transformer decoder. The decoder uses N learnable queries and high-resolution image features as input to refine these queries, which are subsequently used to generate predictions. Through multiple transformer blocks, the decoder attends to feature representations and models relationships among different objects. 
Each block calculates cross-attention between image features and object queries, relying on computationally intensive dot products. While MaskFormer achieves strong accuracy, its efficiency suffers when handling large input features in segmentation tasks.\u003c/p\u003e\u003cp\u003eTo address this issue, we propose a more efficient network structure based on the Prototype-enhanced Mask Cross-Attention (PEM-CA) module. PEM-CA exploits the intrinsic redundancy of image features in segmentation tasks, significantly reducing the number of input tokens in the attention layers via a prototype selection mechanism. Inspired by recent advances in efficient attention modules, PEM-CA redesigns the cross-attention operation by modeling interactions using computationally lightweight element-wise operations. To further enhance efficiency, we employ a fully convolutional Feature Pyramid Network (FPN) and introduce a Context Self-Adjustment Module (CSM) and deformable convolutions to restore contextual information and dynamicity, improving performance while controlling computational overhead.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e3.3. Two-Stage Open-Vocabulary Semantic Segmentation Model\u003c/h2\u003e\u003cp\u003eTo enhance segmentation efficiency, we introduce a two-stage open-vocabulary semantic segmentation model. During training, a clear image is artificially degraded to generate a low-quality version. The clear image is input into a network structure based on MPFormer to obtain mask features and token features, which contribute to the final segmentation results. Additionally, RobustOVS extracts low-level features from the image encoder and combines them with the mask features and token features for consistency loss calculation.\u003c/p\u003e\u003cp\u003eThe degraded image is fed into the RobustSAM model, specifically its PEM network structure, to extract corresponding features. 
Since the input is of low quality, the output features are expected to include degradation information that hinders segmentation. To mitigate this degradation, we design an efficient fusion module to reduce the consistency loss between the features output by MPFormer and those from the PEM network structure. This approach improves feature alignment and segmentation performance.\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Experiments","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e4.1 Experimental Setup\u003c/h2\u003e\u003cp\u003e\u003cb\u003eTraining Dataset\u003c/b\u003e We trained the RobustOVS model on the COCOovseg dataset [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. Initially, the panoptic segmentation module of the RobustOVS model was trained using segmentation labels from the COCO-Stuffovseg dataset [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Subsequently, we fine-tuned the CLIP model using the mask-category dataset derived from COCO Captions [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. This dataset consists of 118k training images annotated with 171 valid categories, covering a wide range of content from objects (e.g., orange, car) to materials (e.g., sky, road). 
Unless specified otherwise, all 171 categories were utilized during training.\u003c/p\u003e\u003cp\u003e\u003cb\u003eTest Dataset\u003c/b\u003e To evaluate the effectiveness of our method, we conducted experiments on several popular image benchmarks, including ADE20K150 [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], ADE20K847 [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], Pascal VOCovseg [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e], Pascal Context-59 [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], and Pascal Context-459 [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. ADE20K: A pixel-level densely annotated dataset for scene understanding, comprising 20k training images, 2k validation images, and 3k test images with diverse annotations of indoor and outdoor scenes. We evaluated two category versions: 150 common categories (A-150) and 847 more diverse categories (A-847). Pascal VOC: A classic segmentation dataset with 11,185 training images and 1,449 validation images. We evaluated on the 1,449 validation images annotated with 20 categories (PAS-20). Pascal Context: An extended version of Pascal VOC 2010 that provides annotations for the entire scene, containing 4,998 training images and 5,005 validation images. We evaluated on the commonly used PC-59 and the more challenging PC-459 versions.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Implementation Details\u003c/h2\u003e\u003cp\u003eThe RobustOVS model consists of two primary components: a segmentation model and a CLIP model adapted for masks. 
The segmentation model is based on derivatives of MaskFormer, specifically MPFormer [\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e] and PEM [\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e], while the CLIP model is implemented using OpenCLIP. Final category predictions are made through an ensemble approach that integrates outputs from the segmentation model and CLIP.\u003c/p\u003e\u003cp\u003eFor the segmentation model: The MPFormer component uses Swin Transformer-Base [\u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e70\u003c/span\u003e] as the backbone, initialized with weights pre-trained on ImageNet-21K. The model was trained with the AdamW [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] optimizer, employing a polynomial learning rate schedule with an initial learning rate of 6 \u0026times; 10⁻⁵ and a weight decay of 0.01. Input images were resized to 640 pixels on the shorter side and cropped to 640 \u0026times; 640. The batch size was 32, and training ran for a total of 120k iterations.\u003c/p\u003e\u003cp\u003eData augmentation included random flipping and color jittering, while other hyperparameters followed Mask2Former and MaskFormer settings. The loss function combined Dice loss and cross-entropy loss for segmentation tasks (weights of 5 and 2, respectively) and cross-entropy loss for classification tasks. For the CLIP model: The architecture used ViT-L/14, implemented with OpenCLIP. We explored three adaptation strategies: Mask Prompt Tuning (MPT), Full Model Fine-Tuning (FT), and a combination of both (MPT\u0026thinsp;+\u0026thinsp;FT). 
MPT initialized learnable tokens randomly and applied deep prompting, with a default prompt depth of 3 unless otherwise specified.\u003c/p\u003e\u003cp\u003eTraining used the AdamW optimizer with an initial learning rate of 2 \u0026times; 10⁻\u0026sup2;, no weight decay, and cosine annealing. Input size was 224 \u0026times; 224, batch size 256, and training spanned 5 epochs. For FT, the training process was similar, but the initial learning rate was reduced to 5 \u0026times; 10⁻⁶, and the weight decay was increased to 0.2. For the MPT\u0026thinsp;+\u0026thinsp;FT method, the model was initialized with a fully fine-tuned CLIP and further refined using mask prompt tuning to improve stability and performance. The text encoder of CLIP remained frozen in all experiments. Finally, segmentation and classification predictions were combined to enhance overall performance, leveraging both pixel-level and semantic-level understanding.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eComparison with State-of-the-Art Methods. 
ADE, PC, and VOC denote the ADE20K [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], Pascal Context [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], and Pascal VOC [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e] datasets, respectively.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"8\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMethod\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eVL-Model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eTraining Dataset\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eADE-150\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eADE-847\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003ePC-59\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003ePC-459\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003eVOC\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd 
align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGroup-VIT[\u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e71\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003erand. init.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCC12M\u0026thinsp;+\u0026thinsp;YFCC\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e-\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e-\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e22.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e-\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e52.3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLSeg+[\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eALIGN RN101\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e13\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e2.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e36\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e5.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e59\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOpenSeg[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eALIGN RN101\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e15.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e36.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e6.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e60\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLSeg+[\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eALIGN EN-B7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e18\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e3.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e46.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e7.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e-\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOpenSeg[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eALIGN EN-B7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e21.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e6.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c6\"\u003e\u003cp\u003e42.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e-\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOpenSeg[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eALIGN EN-B7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u0026thinsp;+\u0026thinsp;Loc. Narr.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e28.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e8.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e48.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e12.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e72.2\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSimSeg[\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-B/16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e20.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e47.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e8.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e88.4\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd 
align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSimSeg[\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-B/16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e21.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e6.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e51.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e9.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e91.8\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOVSeg[\u003cspan citationid=\"CR72\" class=\"CitationRef\"\u003e72\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-B/16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e24.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e7.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e53.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e11\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e92.6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMAFT[\u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e73\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-B/16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e29.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e10.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e53.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e12.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e90\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSAN[\u003cspan citationid=\"CR74\" class=\"CitationRef\"\u003e74\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-B/16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e27.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e10.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e53.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e12.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e94\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMaskCLIP[\u003cspan citationid=\"CR75\" class=\"CitationRef\"\u003e75\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-L/14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e23.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e8.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c6\"\u003e\u003cp\u003e45.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e-\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSimSeg\u0026dagger;[\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-L/14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e21.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e7.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e52.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e10.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e92.3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOVSeg[\u003cspan citationid=\"CR72\" class=\"CitationRef\"\u003e72\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-L/14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e29.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e55.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e12.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e94.5\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eODISE[\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-L/14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e29.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e11.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e57.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e14.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e-\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSAN[\u003cspan citationid=\"CR74\" class=\"CitationRef\"\u003e74\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-L/14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e32.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e12.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e57.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e15.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e94.6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRobustOVS(Ours)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eCLIP ViT-L/14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCOCO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e33.59\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e14.17\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e58.63\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e16.85\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e\u003cb\u003e96.32\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e4.3 Main Results\u003c/h2\u003e\u003cp\u003eCompared with current state-of-the-art methods, RobustOVS outperforms them on three of the five benchmarks and remains close on the other two. Notably, unlike other models, RobustOVS is trained entirely on a single GPU, eliminating the reliance on large-scale computational resources. On the ADE-150 dataset, RobustOVS achieves a mean Intersection over Union (mIoU) of 33.59, slightly above the current state-of-the-art method's 33.5, a relative improvement of 0.27%. For the ADE-847 dataset, the model attains an mIoU of 14.17, a 1.21% relative increase over the leading method's 14.0. On the PC-59 dataset, RobustOVS achieves an mIoU of 58.63, slightly lower than the state-of-the-art method's 59.3, a relative decline of 1.13%. This small gap should be weighed against the model's computational efficiency under single-GPU training. On the PC-459 dataset, RobustOVS achieves an mIoU of 16.85, surpassing the state-of-the-art method's 16.7, a 0.90% relative improvement. Meanwhile, on the VOC dataset, RobustOVS attains an mIoU of 96.32, slightly below the leading method's 97.2, a relative decrease of 0.91%. 
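\u003c/p\u003e\u003cp\u003eThe percentages quoted in this comparison can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming each percentage is a relative change computed as 100 * (ours - reference) / reference; the helper name relative_change is hypothetical, and the reference scores are the ones quoted in the text rather than values taken from Table 1.\u003c/p\u003e

```python
# Relative change (in percent) of a reported mIoU against a reference mIoU.
# Assumption: the percentages in Section 4.3 are relative changes, not
# absolute differences in mIoU points. The helper name is hypothetical.

def relative_change(ours, reference):
    '''Return 100 * (ours - reference) / reference.'''
    return 100.0 * (ours - reference) / reference

# (dataset, RobustOVS mIoU, best prior mIoU as quoted in the text)
REPORTED = [
    ('ADE-150', 33.59, 33.5),
    ('ADE-847', 14.17, 14.0),
    ('PC-59', 58.63, 59.3),
    ('PC-459', 16.85, 16.7),
    ('VOC', 96.32, 97.2),
]

# Round to two decimals, matching the precision used in the text.
changes = {
    name: round(relative_change(ours, ref), 2)
    for name, ours, ref in REPORTED
}

for name, value in changes.items():
    print(f'{name}: {value:+.2f}%')
```

\u003cp\u003eUnder this assumption the script gives positive values for ADE-150, ADE-847, and PC-459 and negative values for PC-59 and VOC, matching the direction of each comparison in the text.\u003c/p\u003e\u003cp\u003e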
Although RobustOVS shows slightly lower performance on some larger datasets (e.g., PC-59 and VOC), its lightweight training strategy and outstanding performance underscore its strong generalization ability and robustness, especially on smaller datasets where its advantages are more pronounced. These characteristics demonstrate that RobustOVS not only reduces training costs but also provides an efficient solution for research scenarios with limited resources.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eComparison of computational resources between ODISE [\u003cspan citationid=\"CR79\" class=\"CitationRef\"\u003e79\u003c/span\u003e] and RobustOVS.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMethod\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eGPU Type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eTraining Time\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eGPU Memory Usage\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c5\"\u003e\u003cp\u003eParameters\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eODISE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e8\u0026times;A100\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e120h\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e320GB\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1.2B\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRobustOVS\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1\u0026times;RTX3090\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e50h\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e24GB\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.4B\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTo further highlight the computational efficiency of RobustOVS, we present a direct comparison of its resource requirements against a representative state-of-the-art model, ODISE. As shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, ODISE requires 8 A100 GPUs, 120 hours of training time, and 320 GB of GPU memory, and has 1.2\u0026nbsp;billion parameters. In contrast, RobustOVS is trained entirely on a single RTX3090 GPU within only 50 hours, using just 24 GB of GPU memory and containing only 0.4\u0026nbsp;billion parameters. 
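\u003c/p\u003e\u003cp\u003eTo make the scale of this comparison concrete, the Table 2 figures can be turned into reduction factors. This is a minimal sketch: the dictionary layout and the function name reduction_factors are hypothetical, and the GPU-hours factor additionally assumes that every listed GPU was in use for the full reported training time, which the table does not state.\u003c/p\u003e

```python
# Resource figures copied from Table 2. The dictionary layout and the helper
# name are hypothetical and only serve this illustration.
RESOURCES = {
    'ODISE': {'gpus': 8, 'hours': 120, 'memory_gb': 320, 'params_b': 1.2},
    'RobustOVS': {'gpus': 1, 'hours': 50, 'memory_gb': 24, 'params_b': 0.4},
}

def reduction_factors(big, small):
    '''Return how many times larger each resource of big is than that of small.'''
    factors = {
        key: round(big[key] / small[key], 2)
        for key in ('hours', 'memory_gb', 'params_b')
    }
    # Assumption: all GPUs are busy for the whole reported training time.
    big_gpu_hours = big['gpus'] * big['hours']
    small_gpu_hours = small['gpus'] * small['hours']
    factors['gpu_hours'] = round(big_gpu_hours / small_gpu_hours, 2)
    return factors

factors = reduction_factors(RESOURCES['ODISE'], RESOURCES['RobustOVS'])
for key, value in factors.items():
    print(f'{key}: {value}x')
```

\u003cp\u003eThe wall-clock, memory, and parameter factors follow directly from the table; only the GPU-hours factor depends on the stated utilization assumption.\u003c/p\u003e\u003cp\u003e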
This stark difference underscores the significant reduction in computational cost and hardware demands achieved by RobustOVS, making it a more accessible and scalable solution for practical applications with limited resources.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e4.4 Ablation Studies\u003c/h2\u003e\u003cp\u003eTo evaluate the impact of different core modules on semantic segmentation performance, we designed three model frameworks and conducted comparative experiments on multiple datasets. First, we built a model based on PEM (Efficient Semantic Segmentation Module), aiming for balanced performance at lower computational cost. Second, we developed a model based on MPFormer (High-Precision Semantic Segmentation Module), which improves segmentation accuracy at the price of relatively higher computational overhead. Finally, we introduced a hybrid framework that combines PEM and MPFormer to balance accuracy and efficiency.\u003c/p\u003e\u003cp\u003eThe experimental results reveal distinct trends. The PEM-based model scored generally lower mIoU than the state of the art (SOTA) across all datasets, particularly on the ADE150 and VOC datasets, with mIoU scores of 32.76% and 94.02%, respectively, falling short of SOTA by 2.21% and 3.27% in relative terms. These results indicate that while the PEM approach is computationally efficient, its accuracy limitations hinder optimal performance in more complex tasks. In contrast, the MPFormer-based model was more accurate, especially on the VOC dataset, where it achieved an mIoU of 95.01%, about 1.05% (relative) above the PEM-based model's 94.02% but still 2.25% below SOTA. However, its higher computational cost was not matched by comparable gains on other datasets, such as ADE847 and PC59. 
Notably, on the ADE847 dataset, the MPFormer-based model achieved an mIoU of 13.21%, still 5.64% below SOTA, highlighting challenges in certain scenarios.\u003c/p\u003e\u003cp\u003eThe hybrid framework outperformed the individual module-based models, particularly on the ADE150 and PC459 datasets, achieving mIoU scores of 33.59% and 16.85%, relative improvements of 0.27% and 0.90% over SOTA, respectively. These results demonstrate that by leveraging the strengths of both PEM and MPFormer, the hybrid approach can deliver more accurate segmentation results while maintaining reasonable efficiency. On the VOC dataset, the hybrid model achieved an mIoU of 96.32%, closely approaching the SOTA score of 97.2%, with a minor relative gap of 0.91%, suggesting a well-balanced trade-off between accuracy and efficiency. Overall, the hybrid model outperformed both single-module variants across the datasets, particularly those requiring higher precision. This underscores the potential of integrating diverse modules to overcome the limitations of individual components, enhancing the overall performance of semantic segmentation models. 
Notably, in open-vocabulary semantic segmentation tasks, the hybrid approach offers improved robustness and adaptability.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAblation Study on ADE-847 Dataset\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eConfiguration\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eADE-847 mIoU\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eTraining Time (hours)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBaseline\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e12.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e45\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e+SIM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e13.5 (+\u0026thinsp;1.4)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e48 (+\u0026thinsp;3)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e+SIM\u0026thinsp;+\u0026thinsp;PEM-CA\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e\u003cp\u003e14.2 (+\u0026thinsp;0.7)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e50 (+\u0026thinsp;2)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eIn addition to accuracy, we also conducted an ablation study focusing on the effectiveness of our proposed modules SIM (Semantic Injection Module) and PEM-CA (Cross-Attention enhanced PEM) in the context of training efficiency and mIoU performance on the ADE-847 dataset. As shown in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, the baseline model achieves a mean Intersection over Union (mIoU) of 12.1% with a training time of 45 hours. Incorporating the SIM module improves the mIoU to 13.5%, representing a gain of 1.4%, with only a slight increase in training time to 48 hours. When both SIM and PEM-CA modules are combined, the mIoU further increases to 14.2%, demonstrating a total gain of 2.1% over the baseline, while the training time rises modestly to 50 hours. 
These results highlight the complementary nature of SIM and PEM-CA in enhancing segmentation accuracy with relatively minimal training overhead.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003emIoU (%) of the Three Model Variants on Open-Vocabulary Segmentation Benchmarks\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDataset\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePemOVS\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMpOVS\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eRobustOVS\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eADE150\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e32.76\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e32.57\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e33.59\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eADE847\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e\u003cp\u003e13.11\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e13.21\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e14.17\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePC59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e57.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e57.73\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e58.63\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eVOC\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e94.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e95.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e96.32\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eFurthermore, we compared the three model variants (PemOVS, MpOVS, and RobustOVS) on multiple datasets to assess robustness across varied domains. As shown in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, the hybrid RobustOVS variant achieves the highest mIoU on every dataset, particularly excelling on the Pascal VOC and PC59 datasets. On VOC, it reaches an mIoU of 96.32%, underscoring its strong generalization capability. These results reaffirm the effectiveness of combining lightweight and high-precision modules in enhancing both accuracy and robustness across diverse segmentation challenges.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. 
Conclusion","content":"\u003cp\u003eThis paper investigates the problem of open-vocabulary semantic segmentation and proposes an enhanced network called the Robust Open-Vocabulary Segmentation Model (RobustOVS). By introducing a semantic integration module, RobustOVS effectively incorporates the global semantic awareness of the original CLIP model into two distinct semantic segmentation modules, thereby improving the performance of open-vocabulary semantic segmentation. Additionally, we present a novel two-stage architecture that leverages both low-resolution and high-resolution image processing techniques, reducing computational costs without compromising accuracy.\u003c/p\u003e\u003cp\u003eExperimental results demonstrate that RobustOVS achieves state-of-the-art performance on popular benchmarks such as ADE20K-847 and Pascal Context-459, validating its effectiveness in expanding the semantic space and enhancing model robustness. Overall, RobustOVS not only achieves significant advancements in handling occluded images but also demonstrates that open-vocabulary general models can achieve performance comparable to specialized supervised models, highlighting its broad application prospects and potential.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eDeclaration of Generative AI and AI-assisted Technologies in the Writing Process\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDuring the preparation of this work, the author(s) used ChatGPT-4.0 to improve the language and readability of the manuscript. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eR.W. conceived the study, designed the methodology, and conducted the experiments. G.W. contributed to the theoretical framework and data analysis. M.L. assisted with implementation and validation. 
All authors participated in writing and critically reviewing the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e\u003cp\u003eThis work was supported by the National Natural Science Foundation of China(62172247) and the Qingdao Natural Science Foundation(No. 23- 2-1-163-zyyd-jch) and the Textile Plus Joint Research Program of Qingdao University (No. FZ2024101).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eChen, L.-C., Papandreou, G., Murphy, I.K., Alan, L., Yuille: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834\u0026ndash;848, 1, 2, 4, 6 (2017)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDing, J., Xue, N., Xia, G.-S., Dai, D.: Decoupling zero-shot semantic segmentation. CVPR. \u003cb\u003e1\u003c/b\u003e(3), 5, 6 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEveringham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I.: John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV. \u003cb\u003e111\u003c/b\u003e(2), 98\u0026ndash;136 (2015)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRoozbeh Mottaghi, X., Chen, X., Liu, N.-G., Cho, S.-W., Lee, S., Fidler, R., Urtasun, Alan, L.: Yuille. The role of context for object detection and semantic segmentation in the wild. CVPR. \u003cb\u003e5\u003c/b\u003e, 6 (2014)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Han Hu, and, Bai, X.: A simple baseline for zeroshot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112 14757. \u003cb\u003e2\u003c/b\u003e(1), 7 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, L., Lu, H., Wang, Y., Feng, M., Wang, D.: Baocai Yin, and Xiang Ruan. 
Learning to detect salient objects with image-level supervision. In CVPR, pages 136\u0026ndash;145, 3 (2017)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDing, J., Xue, N., Xia, G.-S., Dai, D.: Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583\u0026ndash;11592, 1, 3, 4, 6, 7, 11, 12 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGolnaz Ghiasi, X., Gu, Y., Cui, Tsung-Yi, Lin: Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143, 2021. 1, 2, 3, 4, 6, 7\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBoyi Li, K.Q., Weinberger, S., Belongie, V., Koltun, Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022. 1, 3, 6, 7\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Han Hu, and, Bai, X.: A simple baseline for zeroshot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112 14757. \u003cb\u003e3\u003c/b\u003e(1), 7 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKatherine Crowson, S., Biderman, D., Kornis, D., Stander, E., Hallahan, L., Castricato, Raff, E.: Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204 08583, 3 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGu, X., Lin, T.-Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021. 3, 11\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYiwu Zhong, J., Yang, P., Zhang, C., Li, N., Codella, L.H., Li, L., Zhou, X., Dai, L., Yuan, Y., Li, et al.: Regionclip: Regionbased language-image pretraining. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793\u0026ndash;16803, 3, 4, 5 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDing, Z., Wang, J., Tu, Z.: Openvocabulary panoptic segmentation with maskclip. arXiv preprint arXiv:2208.08984, 3 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKim, K., Oh, Y., and Jong Chul Ye:. Zegot: Zeroshot segmentation through optimal transport of text prompts. arXiv preprint arXiv:2301.12171, 3 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHuaishao Luo, J., Bao, Y., Wu, X., He, Li, T.: Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. arXiv preprint arXiv:2211.14813, 3 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, M., Zhang, Z., Wei, F., Han Hu, and, Bai, X.: Side adapter network for open-vocabulary semantic segmentation. arXiv preprint arXiv:2302 12242, 3 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMaxime Bucher, T.-H., Vu, M., Cord, Perez, P.: Zero-shot semantic segmentation. Adv. Neural. Inf. Process. Syst. \u003cb\u003e32\u003c/b\u003e(3), 6, 7 (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYongqin Xian, S., Choudhury, Y., He, B., Schiele, and Zeynep Akata:. Semantic projection network for zero-and few-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer VisionPattern Recognition, pages 8256\u0026ndash;8265, 3, 6, 7 (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, J., Mello, S.D., Liu, S., Byeon, W., Breuel, T., Kautz, J., Wang, X.: Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134\u0026ndash;18144, 3 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBrian, Lester: Rami Al-Rfou, and Noah Constant. 
The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104 08691, 3 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXiang Lisa Li and Percy Liang: Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101 00190, 3 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePengfei Liu, W., Yuan, J., Fu, Z., Jiang, H., Hayashi, Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107 13586, 3 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKaiyang Zhou, J., Yang, C.C., Loy, Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vision. \u003cb\u003e130\u003c/b\u003e(9), 2337\u0026ndash;2348 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P.: and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740\u0026ndash;755. Springer, 5 (2014)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHolger Caesar, J., Uijlings, Ferrari, V.: Cocostuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209\u0026ndash;1218, 2, 4, 5 (2018)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollar, P.: and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 5, 7\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBolei Zhou, H., Zhao, X., Puig, T., Xiao, S., Fidler, A., Barriuso, Torralba, A.: Semantic understanding of scenes through the ade20k dataset. Int. J. Comput. Vision. 
\u003cb\u003e127\u003c/b\u003e(3), 302\u0026ndash;321 (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEveringham, M., Van Gool, L., Williams, C.K.I., Winn, J., Andrew Zisserman: The pascal visual object classes (voc) challenge. Int. J. Comput. Vision. \u003cb\u003e88\u003c/b\u003e(2), 303\u0026ndash;338 (2010)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRoozbeh Mottaghi, X., Chen, X., Liu, N.-G., Cho, S.-W., Lee, S., Fidler: Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 891\u0026ndash;898, 2, 5 (2014)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eIlya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101: 6 (2017)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K.: and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 1 (2018)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDing, H., Jiang, X., Shuai, B., Liu, A.Q., Wang, G.: Context contrasted feature and gated multiscale aggregation for scene segmentation. In CVPR, 1 (2018)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFang, Y., Zhu, F., Cheng, B., Liu, L., Wei, Y., Zhao, Y.: Locating noise is halfway denoising for semi-supervised segmentatio. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGuo, M.-H., Lu, C., Hou, Q., Liu, Z.-N., Cheng, M.-M., Shi-Min, H.: Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575, 1 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJonathan, L.: Evan Shelhamer, and Trevor Darrell. 
Fully convolutional networks for semantic segmentation. In CVPR, 1 (2015)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMengxue Qu, Y., Wu, Y., Wei, W., Liu, X., Liang, Zhao, Y.: Learning to segment every referring object point by point. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOlaf Ronneberger, P., Fischer, Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 1 (2015)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In NIPS, 1 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDing, J., Xue, N., Xia, G.-S., Dai, D.: Decoupling zero-shot semantic segmentation. CVPR. \u003cb\u003e1\u003c/b\u003e(3), 5, 6 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGolnaz Ghiasi, X., Gu, Y., Cui, Tsung-Yi, Lin: Open-vocabulary image segmentation. arXiv preprint arXiv: 2112.12143, 2021. 1, 2, 6\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKunyang Han, Y., Liu, J.H., Liew, H., Ding, J., Liu, Y., Wang, Y., Tang, Y., Yang, J., Feng, Y., Zhao, et al.: Global knowledge calibration for fast open-vocabulary segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 797\u0026ndash;807, 1, 2 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHe, S., Ding, H., Jiang, W.: Primitive generation and semantic-related alignment for universal zero-shot segmentation. In CVPR, 1 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, B., Weinberger, K.Q., Belongie, S.J., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. ICLR. 
\u003cb\u003e3\u003c/b\u003e(1), 6 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu, Y., Zhang, C., Wang, Y., Wang, J., Yang, Y., and Yansong Tang:. Universal segmentation at arbitrary granularity with language instruction. arXiv preprint arXiv:2312 01623, 1 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYongqin Xian, S., Choudhury, Y., He, B., Schiele, Akata, Z.: Semantic projection network for zero- and few-label semantic segmentation. CVPR. \u003cb\u003e1\u003c/b\u003e, 2 (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHui Zhang and Henghui Ding: Prototypical matching and open set rejection for zero-shot semantic segmentation. In ICCV, 1 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, pages 2955\u0026ndash;2966, 1, 2, 6 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Han Hu, and, Bai, X.: A simple baseline for zeroshot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112 14757. \u003cb\u003e2\u003c/b\u003e(1), 7 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBolei Zhou, H., Zhao, X., Puig, S., Fidler: Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. CVPR. \u003cb\u003e2\u003c/b\u003e(6), 8 (2017)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEveringham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I.: John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV. \u003cb\u003e111\u003c/b\u003e(2), 98\u0026ndash;136 (2015)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen, L., Yang, Q., Ding, K., et al.: Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation[J]. (2025). 
arXiv preprint arXiv:2501.17642\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePang, L., Yao, J., Li, K., et al.: SPECIAL: Zero-shot Hyperspectral Image Classification With CLIP[J]. (2025). arXiv preprint arXiv:2501.16222\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSun, H., Gong, R., Nejjar, I., et al.: DynAlign: Unsupervised Dynamic Taxonomy Alignment for Cross-Domain Segmentation[J]. (2025). arXiv preprint arXiv:2501.16410\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, D., Feng, T., Xue, L., et al.: Parameter-Efficient Fine-Tuning for Foundation Models[J]. (2025). arXiv preprint arXiv:2501.13787\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, K., Cao, X., Deng, Y., et al.: DynamicEarth: How Far are We from Open-Vocabulary Change Detection?[J]. arXiv preprint arXiv:2501.12931, 2025.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZermatten, V., Castillo-Navarro, J., Marcos, D., et al.: Learning transferable land cover semantics for open vocabulary interactions with remote sensing images[J]. ISPRS J. Photogrammetry Remote Sens. \u003cb\u003e220\u003c/b\u003e, 621\u0026ndash;636 (2025)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChoi, J., Lee, S., Lee, M., et al.: Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation[J]. (2025). arXiv preprint arXiv:2501.09688\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBai, M., Yu, X., Wang, Y., et al.: Enhancing pixel-level analysis in medical imaging through visual instruction tuning: introducing PLAMi[J]. Visual Comput., : 1\u0026ndash;17. (2024)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhou, E., Su, Q., Chi, C., et al.: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection[J]. (2024). 
arXiv preprint arXiv:2412.04455\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHuang, C., Yan, S., Burgard, W.: BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding[J]. (2024). arXiv preprint arXiv:2412.02449\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDao, S.D., Shi, H., Phung, D.Q., et al.: CA-Ovs: Cluster and Adapt Mask Proposals for Open-Vocabulary Semantic Segmentation[C]//Proceedings of the 6th ACM International Conference on Multimedia in Asia. : 1\u0026ndash;8. (2024)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMaxime Bucher, T.-H., Vu: Matthieu Cord, and Patrick Perez. Zero-shot semantic segmentation. In NeurIPS, 2 (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAlec Radford, J.W., Kim, C., Hallacy, A., Ramesh, G., Goh, S., Agarwal, G., Sastry, A., Askell: Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. ICML. \u003cb\u003e2\u003c/b\u003e(1), 6 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHe, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In CVPR, pages 770\u0026ndash;778, 6 (2016)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMingxing Tan and Quoc Le: Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105\u0026ndash;6114, 6 (2019)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBowen Cheng, A.G., Schwing, Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2 (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, H., Li, F., Xu, H., Huang, S., Liu, S., Lionel, M., Ni, Zhang, L.: Mp-former: Mask-piloted transformer for image segmentation. arXiv preprint (2023). 
arXiv:2303.07336\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCavagnero, N., Rosi, G., Cuttano, C., et al.: Pem: Prototype-based efficient maskformer for image segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. : 15804\u0026ndash;15813. (2024)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu, Z., Lin, Y., Cao, Y., et al.: Swin transformer: Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF international conference on computer vision. : 10012\u0026ndash;10022. (2021)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, J., Mello, S.D., Liu, S., Byeon, W., Breuel, T.M., Kautz, J., Wang, X.: Groupvit: Semantic segmentation emerges from text supervision. In CVPR, 6 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFeng Liang, B., Wu, X., Dai, K., Li, Y., Zhao, H., Zhang, P., Zhang, P., Vajda, Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted CLIP. arXiv preprint arXiv:2210.04150, 2022. 2, 5, 6, 7\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSiyu, J., Wei, Y., Wang, Y., Zhao, Y., Humphrey, Shi: Learning mask-aware clip representations for zero-shot segmentation. arXiv preprint arXiv:2310 00240. \u003cb\u003e6\u003c/b\u003e(5), 7 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, M., Zhang, Z., Wei, F., Han Hu, and, Bai, X.: Side adapter network for open-vocabulary semantic segmentation. In CVPR, pages 2945\u0026ndash;2954, 2, 5, 6, 7 (2023)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZheng Ding, J., Wang, Tu, Z.: Open vocabulary panoptic segmentation with maskclip. 
arXiv preprint arXiv:2208.08984, 6 (2022)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, Y.-H., Wang, Z.-H., Wang, Z.-R., Fan, R., Wang, X.A.: Recommendation Algorithm Based on a Self-supervised Learning Pretrain Transformer\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, Y.H., Wang, Z.H., Wang, Z.R., Guo, Y.L., Fan, R., Tian, H.Y., Wang: Xing SimDCL: dropout-based simple graph contrastive learning for recommendation\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen, H., Zhang, F., Li, Q., Li, X., Ding, Y., Zhang, D., Cheng, J., Wang: Xing Triple confidence-aware encoder-decoder model for commonsense knowledge graph completion\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X.: Shalini De Mello. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models.arXiv: 2303. 04803, 3 (2023)\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"multimedia-systems","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"mmsj","sideBox":"Learn more about [Multimedia Systems](http://link.springer.com/journal/530)","snPcode":"530","submissionUrl":"https://submission.nature.com/new-submission/530/3","title":"Multimedia Systems","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Open-vocabulary semantic 
Preprint version 1 posted August 11th, 2025 (https://doi.org/10.21203/rs.3.rs-6850046/v1), licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). Submitted to Multimedia Systems on 2025-06-09; version of record published in Multimedia Systems on March 10th, 2026: https://doi.org/10.1007/s00530-026-02267-0


